Let's check your notebook now with the new... Yeah, mine is working, so... See, we solved the technical issues. So let's start with the opening session. First of all, welcome to Basel. This is EuroPython 2019, the 18th EuroPython conference, so we've come a long, long way. We sold even more tickets than last year: 1,200 tickets for the conference and 300 for the training days. Who's new to EuroPython? Wow, excellent. And who's a regular? Who's been to EuroPython before? Not that many. That's interesting. So for all the regulars, I'd like to ask you to help the newcomers feel at home at EuroPython, maybe explain a bit how things work, because every year we do more or less the same thing. If someone needs help, please do guide them, please help them, and let them know how we do things: for example, what lightning talks are, or what a poster session is, or what an open space is. That would be really nice of you, because this is our conference and we want to make it really nice for everyone. I'd also like to say a big thanks to our sponsors. The sponsor exhibit is just outside on the second floor. Something I'd like to mention here: you typically just go to a conference, you see all the sponsors, maybe you go to their booths, but what you typically don't realize is that the sponsors actually make the tickets affordable. We'd have to charge you a lot more if it weren't for these companies helping us. So please give them a big hand. And now I'd like to hand over to Martin Christen. He's going to tell you a bit about Basel. Welcome to Basel. Welcome to EuroPython 2019. I'm actually from Basel, so I'll tell you more about it. In Switzerland we actually have four languages: German, French, Italian, and Romansh. In Basel you can speak German, English, French, so it's no problem at all. You see here the picture of the Basilisk. That's a dragon, or something like a dragon.
It's a symbol of Basel, and the name Basel, Basilea, came from the Romans. Basel is a very old city. It was founded by Celts around 120 BC, so really, really old. They didn't program Python back then, but I think they would have if they'd known it. The most important thing in Basel is the river Rhine. It actually separates Basel into two parts, the lower and the upper part, you could say, and you will see that Basel also has two city centers, one on each side. I think you will enjoy that. Basel is a Python city. I'm actually from the University of Applied Sciences and Arts Northwestern Switzerland. That's where the tutorials yesterday and on Monday were happening; those of you who were there certainly remember it. I'm teaching Python, so many students here know Python and spread the word about how cool this language is. I'm also the organizer of PyBasel, and there is the big GeoPython event every year in Basel — Python and geodata, you could say the geospatial data science with Python conference. There are also new trends like PyPharma, because Basel is a pharma city, and the pharma industry discovered how important Python is in data science and how useful it is for that. As I said before, Python in education is very popular in Switzerland; more and more universities are switching to Python. It's a perfect language for many purposes. So I'm giving the word back to Marc. So yeah, please give a big hand to Marc, because he actually invested a lot of time getting all this working here in Basel. It was a pleasure, really a pleasure. I like Python; I'd do anything for Python. Which brings me to the next slide: who's making all this happen. It's actually just very few people. These are essentially the main organizers, plus of course we have lots and lots of on-site volunteers.
The people in the yellow T-shirts are helping us enormously; without the on-site volunteers we have, this wouldn't be possible. All of the people you see here do this for free. These are all volunteer jobs; no one gets paid for these things. So I think all of these people deserve a big hand from you as well. The EuroPython Society is running the show — it's running EuroPython. More recently we also started to support the wider European Python community. So if you, for example, run a conference somewhere and you need a grant, some money to get you started, or some other help, maybe organizational help, then please contact us. We are very open to helping you out. If you want to help us even more, then of course you can earn one of these yellow T-shirts. For that you only need to register on the website to become a volunteer, and then please come to the registration desk and we can find something for you to do, or you can register in the volunteer app that we have. This is the Hitchhiker's Guide to EuroPython. Especially important is the Wi-Fi information in there, because as you can see, the username is not exactly something you can easily remember. We've put these pieces of paper around the venue with all the details, so hopefully you can get the information from there. If not, the conference app has the information as well. We'd like to encourage you to install the Attendify app; again, you can go to the website to learn how to use it. We have a Telegram channel that's open to the public, so anyone can register. If you want to, for example, arrange dinner or meet some other people in the city, you can coordinate on that channel. You can also post images, videos and so on there. Just please be aware that all of this is subject to our code of conduct — I'll get to that later on — so you shouldn't post anything that could make people unhappy.
Everything else is available at the conference desk, so if you have a question, you can just go down there and ask. I tried to put the maps on the slide here; they're too small, you can't really see them. If you have the app installed, it has a section with the maps, so you can find out where things are. We've also put up signage which explains where the different rooms are, and there are monitors around the venue showing things. Essentially we're using three floors: we're not using the first floor, just the ground floor, the second floor and the third floor. Actually, I'm not sure — did you get your bags already? Yes? No? Okay. The people who went to the training days already received their bags, for example. You'll find a conference booklet in there which has all the information. The program in the booklet is not really up to date anymore, because we made a few changes, but most of the talks should still be fine. You should have a T-shirt, and you should have the official EuroPython PewPew device. Did you get that already? Yes? You'll probably also find some sponsor gadgets, coupons and other sponsor information in there. Coming to the PewPew device: this is it, this is what it looks like. It's a specially made device for EuroPython, designed by Radomir. I don't know whether he's here. Radomir? Okay. We had workshops for this device to teach you how to use it. It's programmable in Python, and you can do lots of nice stuff with it — you can play games on it. Batteries are included; we have our own Python batteries for these devices. If you want to read up on how it's used, you can either go to Radomir, and maybe he can help you a bit, or you can read the documentation to figure out how it works.
Essentially, you just put in the batteries, turn it on, and it starts by showing the EuroPython logo; then you press a few buttons to get to the games. Right. The official part of the conference: of course, we have lots of talks, poster sessions and keynotes. These are our keynote speakers. We're going to start with Lynn today. As you can see, we have four ladies and one gentleman, so diversity-wise this is, I think, a good selection. We also have some other events I'd like to mention. We already had the trainings on the training days, all great ones. We're going to have the poster session — I don't know exactly when it is, maybe today or tomorrow. Then we have a PyData track, so lots of talks in the PyData field. We have a recruitment session for those of you who are looking for a new job. We have an open space where you can arrange sessions yourself; we're going to put up a whiteboard or something to manage this. We're going to have helpdesks, where teams have registered to show you certain open source projects and help with any questions you might have. And of course, on the weekend we have the sprints and hackathons at the university again. Just a few words about the open space — I'm not going to go into detail. An open space is essentially something you arrange yourself. The room is called Delphi; it's on level zero, the ground floor. Similarly, we have the lightning talks. How many of you don't know what a lightning talk is? Excellent, so I don't have to explain. It's first come, first served. We're going to have them on Wednesday, Thursday and Friday. We'll set up a flip chart, probably near the registration desk, where you can mark the talks you want to give. This is all going to be organized by Alexander. Where's Alexander? There he is, down there. These are usually probably the most exciting part of the conference.
These are always fun and nice to go to. I already mentioned the conference app; I won't go into detail here. The only thing I want to mention is that you can actually rate the sessions — for each session, we have a rating bar in there. Whether it works out this year or not, I don't know; last year it didn't, so maybe, you know. But it's definitely good for finding the schedule. We're updating it regularly, and it works offline. Then we have a social event on Thursday. It's important: please get your tickets. We only have a fixed number of seats — I think we have 400, right? So if you want to go to the social event, please get your ticket in the ticket shop. We're going to have grilled sausages and baked potatoes. The ticket includes two drinks, but of course you can buy more. We're going to have a very special Python music session there — what's it called? Python de Travers. This is a group that's actually using Python to produce music. It's a bit experimental, so it's not the stuff you get in discos, but it's going to be very interesting. We have a quiet room for those of you who want to take a break; it's also on level zero, the ground floor. We have a code of conduct, like I mentioned. The very, very short version is: be nice to each other, because this is our conference and we want to make it enjoyable for everyone; be professional; and don't spam. That last one is more geared towards companies attending the conference who are not sponsors, or even the companies who are sponsors, because we don't want people to be spammed. If you want to read the full version of the code of conduct, go to our website. The people on the right here are the code of conduct workgroup members. If you have any issues, please contact one of them; it's probably best to contact them personally.
If you don't see them, please go to the registration desk; they can inform us and we will come to the registration desk to sort out any issues you find. This applies to all the channels we have — all the social channels, of course the conference itself, but also the social event and anything that happens around the conference. Right, so, last slide: speaker guidelines. If you are a speaker, please do test your notebooks before you come to the session. You know, we should have done that as well, but we kind of forgot. After your talk, please do upload your slides. And there was one more thing I wanted to say about the catering. We have a special food point when you exit on the second level, to the right — so it's on the left from your side — for vegetarian food, vegan food and any other diets; I suppose they can serve other diets as well. If you want food specifically for one of those diets, please go there. For everyone else, I'd recommend not going there, because otherwise we're going to run out of vegetarian food, and we don't want that. We have plenty of other food points available for you. So that's it. Enjoy the conference and have a good time. Now I'm going to pass over to Lynn. Ah, you want to introduce her. Okay, excellent. Yeah, that's too complicated. I just need it for a minute. So I'm really happy to introduce the first keynote of this year's EuroPython. I'm very happy to welcome Lynn. Lynn has a distinguished career in user research, data mining and user experience. She holds two PhDs from Stanford and Cambridge — no, I misread this. A Master's and a PhD, sorry. Excuse me. But what I really like about her work is that she has worked both in industry and in research, and currently she's at a business school in Lyon. And she's also a regular speaker at international conferences.
So I'm really happy she's able to join us today to tell you how to get your data joie de vivre — please excuse my French. So, please give Lynn a warm welcome. Yeah, I don't have two PhDs. I also wasn't registered for the conference, and my MacBook wasn't working with the projector — let's see if we get through this. So I'm very happy to be here. It's always interesting to me to give talks. I do a lot of NLP these days. I'm not going to talk about NLP as much today, but we're going to play with some word2vec models. Right now I'm working on hate speech detection — toxic and offensive speech — and it's horribly depressing. That's also why I'm not going to talk about NLP. So this talk, for me, is sort of a self-care piece about getting some fun back into my daily hacking life, so that I'm not working on things like that all the time. I was a teacher; I'm actually consulting now — that's where I'm doing the hate speech detection — but I'm still in Lyon, so I'm just down the road from you guys by train. When I do invited talks like this, I get to do what I want, right? Like I said, this is me doing self-care and hacking on the weekends. Probably I should develop some real-life hobbies other than hacking, but I'm a nerd, just like you guys. So this is the thing about hacking: it's really hard for those of us who program to separate the programming for fun from the programming for work. Everybody I talk to is like, oh yeah, I did this side project and I learned how to do blah, and now I can use it in my job. It's an especially American problem that we're always thinking about how to monetize our free time. Let's do less of that. Let's make more junk, and let's have a good time doing it. This is one of the things I wanted to say to you — it's also a message to me.
So I'm going to talk about some things I did for this talk that are just totally junk — but fun junk that, I admit, I learned a lot from, and that I probably wouldn't have done if I hadn't been inspired by wanting to make junk. You might be familiar with Hieronymus Bosch's The Garden of Earthly Delights. It's this massive, beautiful painting with three panels. The left side is a sort of simple Garden of Eden thing. In the middle we have whatever the hell is going on — stuff to do with people on earth, and maybe corruption and desire, and talking fish. Who knows? And then over on the right we have the Last Judgment, the end of the world, hell, whatever. There's a lot of analysis of this picture. It's a form of big data. I got really interested in it because of the Bosch bot on Twitter. The Bosch bot is this person who posts random crops of this big picture as tiny little thumbnails. It's an amazingly fun distraction during the day: American politics and British politics scroll past, and then you get this little picture of pure joy from the 1500s. The way it works is they post an image that's a very, very zoomed-in random segment, and of course you want to click on it, and you do, and you get the bigger context, which is also random. Sometimes it's a good segment and sometimes it's not; then you fave it and move on. So what I wondered, as a data analyst, is: what are the details in this big picture that people faved the most? What did they like the most? So I made an app, and these are the stages I went through doing it. Because we're a hacking audience, I want you to know the parts that were fun and the parts that were really hard — because I'm learning while I'm hacking, I'm pushing through the pain, right? There was some real pain on some of these projects.
So first of all, there's this cool library, Twint, where you don't need to use Twitter's API, and it has a really nice command-line interface to pull down tweets. You can even say: tweets from this user, with only images in them — which is what I did in this case. So I pulled down January 1st through May 5th of this year; I finished the project, at least its current state, in May. That was a super awesome library that I hadn't used before; it's really cool. Then I used pandas to load that data from JSON, sort it, filter it, do some work on it, and get the top 10 by likes. Pandas is always fun, totally awesome. The part that was interesting and hard, and that I did not succeed at — I'll talk about that in a second — is figuring out where in the giant picture each little segment came from. And then I used Leaflet.js in a web app to display this. So you're going to see a lot of web stuff, because I was getting back into some web programming for these. So this is the app; let me quickly demo it. With the top 10, I located them in the image and put a little marker where each was taken from the big picture, so you can see at a glance — the thing I was most curious about — which panel they come from. The panel on the right has an awful lot of action going on, which isn't too surprising when you look at the actual details of what's happening there. There's very little action over in the Garden of Eden section, and then we have a bit in the middle panel — and I think there will be a bit more when I redo and update this, because there's so much stuff there. I sized the markers by likes, so you can see there's a big one here; this is the biggest one. Then I wondered: is it just that the older posts have more likes? But no, it's not true — this one, at the time I did the scraping, was the most liked, but it was also the newest.
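The pandas step she describes (load the scraped JSON, keep tweets with images, take the top 10 by likes) can be sketched roughly like this. This is a hypothetical reconstruction, not her code: the field names `id`, `likes_count`, and `photos` follow Twint's JSON output, and the inline rows stand in for the real scraped file.

```python
import pandas as pd

# Stand-in rows; in practice this would come from Twint's output, e.g.
#   tweets = pd.read_json("boschbot.json", lines=True)
tweets = pd.DataFrame([
    {"id": 1, "likes_count": 52,  "photos": ["a.jpg"]},
    {"id": 2, "likes_count": 340, "photos": ["b.jpg"]},
    {"id": 3, "likes_count": 12,  "photos": []},
])

with_images = tweets[tweets["photos"].str.len() > 0]  # keep tweets that have images
top10 = with_images.nlargest(10, "likes_count")       # top 10 by like count
print(top10["id"].tolist())                           # -> [2, 1]
```

With only ten rows to keep, `nlargest` avoids a full sort and keeps the code readable.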
If we click on that, we zoom in, and that's the picture people liked. Uh-huh. So I learned an awful lot about Twitter from that. Now, if we look at some other things going on up here — this one is just burning in hell, that little image, which is pretty cool too. There's one down here that I like. One of the cool things about this picture is that really zooming in pays off a lot. This is big data that's totally fun, so we can move around. There's so much detail in this picture, it's ridiculous. This is the lowest zoom level; you can probably see the little paint flecks here. This is one of the reasons that localizing where the little segments came from in the big picture was a no-go for me. Okay, so this is big data. Big data is just data that's hard to deal with, right? It was slow to load the high-res image — which is 223 MB — in one go in the browser, which is what I was going to do originally. And if you want to move around, pan and zoom, it's basically ridiculous. Pillow-based Python tools won't load it: you actually get an error warning about decompression bombs, which made me go, what? You can disable that and get it to work anyway, but it shows you the size of this as a big image. But worse, the algorithms I tried for locating the posted snippet in the big image crashed my laptop several times. Obviously I could have gone to a remote machine with more memory, et cetera, but the way they were failing, and the kinds of matches I was getting when I changed resolutions, led me to believe they weren't going to work. If anyone is a really good expert at neural-net template matching and feature detection, I would love to talk to you afterwards to see if I can figure this out. However, there was an out. This was a hack, right? I'm like, okay, solving this problem isn't the point of my exercise.
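The decompression-bomb warning she mentions is Pillow's built-in safety cap on image size; disabling it looks roughly like this. The filename is a hypothetical placeholder — only do this for files you trust.

```python
from PIL import Image

# Pillow refuses images above ~178 million pixels (DecompressionBombError /
# DecompressionBombWarning) to guard against maliciously crafted files.
# For a trusted 223 MB museum scan, raise or disable the limit:
Image.MAX_IMAGE_PIXELS = None           # disable the check entirely
# Image.MAX_IMAGE_PIXELS = 500_000_000  # ...or just raise the ceiling

# img = Image.open("garden_of_earthly_delights.jpg")  # hypothetical path
# print(img.size)
```

Disabling the check is fine for a one-off hack on a known file; raising the ceiling to a known bound is the safer habit.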
The point was: what are the parts of the picture people like? So I just went into Photoshop, figured out where those segments were, and hand-coded a JSON file for the top 10. Obviously that doesn't scale to a live app. But I talked to the Bosch bot author, and they have now added the actual snapshot locations — top-left corner, et cetera — to all of the posts, just for me. So I have to make a live app with these after I'm done with this talk: a live app showing the most recent likes and faves and which section of the image they came from. So that's kind of cool; I made a friend. So I said I used Leaflet. The big-data part of this, for web design, was something I had been wondering about for a while — I sort of knew about it, but I hadn't made anything with it. You can think of a really big picture that you want to zoom and pan on as a map, and the tools for doing maps online are the tools for doing big-image display like that. What you do is tile your image: you have sections of the big image at different resolutions and sizes for all the different zoom levels, and it theoretically happens seamlessly. I think my laptop was running out of memory, and that's why it wasn't so seamless when I was moving around. But essentially you use something that's a map tool, gdal2tiles, to make these layered directory folders for the different zoom levels, and there's a lot online about how to do it. It's just interesting that we're essentially dealing with tools for maps. This is me this morning in my hotel room, trying to remember the bash one-liner to show the size of each directory. At the zeroth level, which is the top view, there's only one image, and as you zoom in — 7 being the deepest zoom level, where you can see the little paint flecks — we have the most images. And we only pull up the ones we're looking at at that point, which is how you get a smooth effect.
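The growth she describes (one tile at the top, most tiles at zoom 7) follows from the tile pyramid: each zoom level doubles the resolution in both dimensions, so a square image needs roughly four times as many 256-pixel tiles per level. A back-of-the-envelope sketch, assuming a gdal2tiles-style 2^z × 2^z grid for a square image:

```python
# Tile-pyramid arithmetic: at zoom level z there is a 2^z x 2^z grid
# of fixed-size (e.g. 256px) tiles, so counts quadruple per level.
def tiles_at_zoom(z: int) -> int:
    return (2 ** z) ** 2

for z in range(8):
    print(z, tiles_at_zoom(z))
# zoom 0 -> 1 tile; zoom 7 -> 16384 tiles -- which is why the deepest
# zoom directory is by far the largest on disk.
```

Real gdal2tiles output for a non-square image trims the grid to the image's aspect ratio, but the quadrupling-per-level shape is the same.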
So the other thing that's cool about treating this image like a map is that if you use a tool like Leaflet.js, which is for interactive map display online, it's an old, established tool with lots and lots of UI features and controls. It's actually really easy to add stuff like: when I zoom in to a certain level, replace the marker with a rectangle showing the edges of that snippet; when I zoom back out, replace it with a dot. And you get tooltips and things like that. So that was cool. And then there are the guys on a fish. It's only because I was talking to the Bosch bot author about how they cut the segments of the image randomly that I learned why this one was number one. At first I thought, well, obviously people just like that picture, right? Actually, it's the pinned image at the top of the account, so it's the first thing you see, so you go and hit like, right? And it's not just that: a celebrity retweeted it, and that's actually why it's number one. So when you're looking at faves on Twitter, you're always going to have these effects where someone very famous accelerates the liking. So it's a bit hard. She's a comic — actually a pretty funny follow. Just this past week, The Pudding — they do a lot of interesting interactive data visualizations — did something similar. This is a t-SNE layout using Mario Klingemann's RasterFairy to squarify the layout. They used a tool called OpenSeadragon instead of Leaflet, so you get the same kinds of zoom, pan, move around. I don't have a mouse, so it's less cool. But essentially these are very detailed tiled images; I could probably plug in my same tiles. All they've got is the zoom in and out button: because they didn't use Leaflet, they don't get access to the same number of UI controls and things that I wanted to have.
So even if I switched, I'd have to do a lot of coding by hand to get some of the effects I was building in the app. Okay, so that was a fun project that now has me promising to build a live, updating, real web app. Excellent. Word2vec toys. One of these is a project I did last year for a conference that I then didn't go to because they had weird funding problems, but it led to the project after it, so I'm going to talk about it. Most of you are probably at least glancingly familiar with word2vec, which has been all over the place for a few years. The concept in word2vec is that we analyze a lot of text and come up with, essentially, a matrix with a vector representation for each word in the document collection we've analyzed. We do that by looking at a window of context around the words. So we essentially encode in this matrix things about the relationship of each word to the other words in the collection, which means we can do math on that matrix — like find the most similar vectors using cosine similarity, for instance — and those are words that are related, in that they've occurred in the same contexts. My source for these projects is Project Gutenberg, which is a great source for public-domain text — lots and lots of text in lots of formats. You can even download books and read them on your Kindle, or get the plain text. There are libraries out there to deal with getting text out of it, because it's a huge collection. The one I settled on is actually an R package, gutenbergr, from David Robinson, because it allowed me to query by subject — I do a lot of things by subject, like ghost stories, or poetry in this case, or detective stories, or whatever. You can also query by author and by other types of metadata.
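The "math on that matrix" she mentions is mostly cosine similarity between word vectors. A tiny self-contained sketch with made-up 3-dimensional vectors (real word2vec vectors come from training and have hundreds of dimensions):

```python
import numpy as np

# Toy word vectors, invented for illustration only.
vecs = {
    "polite":    np.array([0.9, 0.1, 0.0]),
    "courteous": np.array([0.8, 0.2, 0.1]),
    "fire":      np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    # Cosine similarity: dot product of the vectors over the
    # product of their lengths; 1.0 means "same direction".
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "polite" should land far closer to "courteous" than to "fire":
print(cosine(vecs["polite"], vecs["courteous"]))
print(cosine(vecs["polite"], vecs["fire"]))
```

Because cosine similarity ignores vector length, it compares direction only, which is why frequent and rare words can still score as close neighbors.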
Now, Gutenberg updates its catalog fairly regularly, so you have to keep your local copy of the metadata up to date, because as far as I know they don't have an online API you can query to do all of what we've done here. All right, so anyway, I downloaded his code and had to update all of the metadata tables, because in January in particular there was a big change. One of the things his library does that some libraries don't is strip out the header and footer text — that sort of generic boilerplate on Gutenberg files. You don't want that in your language model. All right. Then, if you want to make a word2vec model, there's lots of info online about how to do it. The most popular way is Gensim. Gensim wants you to put in tokenized sentences: essentially, each sentence is a list of its tokens — just the words — and your entire document collection is a list of those, one per sentence, generally. I'm showing you this particular code example because it shows how to save the model and then load it. If you're going to do the kinds of things I did, where you use your model for other things, you need to be able to save it and then reload it somewhere like a Flask API. One of the attributes of word2vec that a lot of people have heard about is that you can make these 2D layouts that show you the relationships between words. Like I said, you can use cosine similarity to find words that are similar to each other in the vector space, so you can also make these explorable maps. Thanks to Peter Baumgartner for a gist that I modified to get this working quickly for this talk. So this is folk and fairy tales, and this is an interactive exploratory tool for looking at the space. It's in Plotly, which means I can filter and sort and move around the map.
So if I roll over these, you can see what the words are, and possibly why they're grouped together. This cluster is things that have to do with time and distance, probably: fifth, sixth, fourth, eighth; afterwards, week, before, short; afterward, years, months, miles. So frequently you get parts of speech that are grouped similarly. Boy, it would be way better if I had a mouse. Over here we have things like sayest, mayest, wouldst, wilt, thyself — apparently there was some really old-fashioned text in this Gutenberg set, and it grouped all of those old-fashioned helper verbs together. Over here: further, rather, sooner, higher, sweeter, et cetera. Up here we have words from other languages, which is strange, and I could go and look at those. In this particular layout, color means the word occurred more often in the text. The gist I used to make this, I updated so that you can make your own with the word counts as well as the model, and it makes this cool map. Okay. So, like I said, you can find words that are related to each other with these models. So for polite, what are the most similar words? Courteous, friendly, cordial, professional, attentive, gracious — that looks like a good model. So my question was: what if we went from polite to its closest word, and then from courteous we looked for its closest word? It's not necessarily friendly, because of the way these things work, right? So what if we kept chaining from word to closest neighbor, word to closest neighbor, in these models? And so I made an app. This is the word fire in a Gutenberg older-poetry word2vec model. The closest word is burning; the closest word to that — though it's a bit distant compared to the others — is folded; then lids, curls, silk, scarlet, glossy. And if we keep doing that — this swoop is supposed to indicate that we're reading left to right — we get these words.
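The chaining her app does — repeatedly hop to the nearest neighbor, skipping words already visited so the chain doesn't cycle — can be sketched in a few lines. This is not her code: it runs on tiny hand-made vectors so it is self-contained, and the vocabulary and values are invented for illustration.

```python
import numpy as np

# Toy vectors standing in for a trained word2vec model (invented values).
vecs = {
    "fire":    np.array([1.0, 0.0]),
    "burning": np.array([0.95, 0.1]),
    "flame":   np.array([0.9, 0.3]),
    "scarlet": np.array([0.5, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word, exclude):
    # Closest word by cosine similarity that hasn't been used yet,
    # which is what keeps the chain from looping.
    candidates = [w for w in vecs if w not in exclude]
    return max(candidates, key=lambda w: cosine(vecs[word], vecs[w]))

def chain(seed, length=3):
    out = [seed]
    while len(out) < length + 1:
        out.append(nearest(out[-1], exclude=set(out)))
    return out

print(chain("fire"))  # -> ['fire', 'burning', 'flame', 'scarlet']
```

With a real Gensim model you would replace `nearest` with a `model.wv.most_similar` call and skip over any hits already in the chain.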
Now, if I change to a slightly different model that includes the things released in January, we get different results. The closest word for fire is flame: then burning, crimson, purple, foam, gleaming. If I switch to folklore and fairy tales — the one I was just looking at in the big graph — we get hearth, and then, pretty far away, oven, pale, empty, dipped. So think about Hansel and Gretel, things like that: the witch is going to put them in the oven or whatever. These are nouns related to scary fire situations, probably. Okay, and I don't think Nerd2Vec works very well — I'll talk about that in a second — but it's quite different. So Nerd2Vec, which is built on Star Trek, Doctor Who and Star Wars: firing, shooting, throwing. Not so surprising. All right. So this is another example. Alison Parrish is the one who inspired me to do this originally, because she had a poetry Gutenberg corpus; I made the word2vec model from it and then built these apps with it. What we're seeing here is the vertical distance — the distance between one word and its next closest neighbor. And the size here essentially encodes the fact that there were loops. We don't want cycles in this list of words; it gets boring really fast because you see the same word repeated over and over. So I just increment the count for a repeated word and go on to the next closest, so that we get an actual chain. And the color is just pretty in this case. I shouldn't say that as a data person, but it is. All right. Like I said, there was new public domain stuff in January 2019, and some of the things that were added are really cool things to work with, like Robert Frost. Robert Frost is a well-known American poet; his stuff all entered the public domain, along with a bunch of other great fiction works that are essentially more modern in feel. So here's an interesting contrast: the word believe in the older, pre-January model.
We had a model for poetry that went learn, grieve, suffer, choose, understand, et cetera. And then after I updated the model with the things from January of this year that were just released, we get magazines in there somewhere, which is super weird. So now believe goes to suppose, believe (I didn't take out loops from the seed word), and then over there in the second column, magazines, satires, newspaper, et cetera. You can see it kind of goes off the rails sometimes over to the right, where we get into really specific words like parliament. And this is a poetry model, which is weird. Yeah, so this is the source on Nerd2Vec. Nerd2Vec is definitely very different in its properties. Now this brings me to my next project. So all of that was just sort of a precursor to the project I actually wanted to make for this talk, which is that I'm really interested in creativity-assist tools that help you build things using AI or machine learning or whatever, but with a human solidly in the loop and in control. So one of my inspirations with text games is things like blackout poetry: you have an entire text, but you only use certain words in it. That's Austin Kleon. This is cutout poetry, where different texts are combined in order to make a poem by a human author. So this goes all the way back to the surrealists and before. That's Timothy David Ray. This is an interesting case: Charles Reznikoff wrote poetry based on legal briefings. He took legal documents and then put together segments to make poems that are fairly moving, actually. So I built an app. What I wanted to do was mess up poetry. I wanted to take good poetry and then play with it at the word level. The inspiration here was: am I going to make anything better? Probably not. But am I going to learn why they did what they did in great detail? Probably. So it's a really interesting close reading exercise to take someone's work and then mess with it.
But I wanted to do it in a kind of principled way with lots of constraints. So what I wanted to do was, for each word there that's in color, I'm loading up the list of the 10 closest words in a poetry model. What other poets might have used instead? So essentially this is an editable demo. So that's Robert Frost. I have a bunch of different word2vec models in here. I have the older original poetry one, and I have up-to-1923 poetry. So if I pick woods, I can pick a different word. So I'm going through the woods; maybe I'm going through the fields instead. Whose fields these are, I think I know. His house is in the village. His table is in the village. See, this is not a good poem. But his table is in the castle, I kind of like. All right. So it turns out that, and I'll get to the color in a second, these longer poems are a bit harder to work with. Let me pick something shorter. Okay. So this is Amy Lowell. Your voice is like bells over roofs at dawn when a bird flies. Okay. You see how they aren't all the same part of speech. There are word2vec models that are encoded for part of speech, so I could have done that. But instead I'm building a model that's just like, what did other poets do in this kind of context, right? Your cry is like bells. So at the point where you've done the second word, you might want to go back and fix the other one. This is the way this thing works, right? Okay. So these words about time are usually the easiest. Okay. Now, Nerd2Vec is weird. So remember, Star Wars, Star Trek, Doctor Who: character, name, role, performance, vocal. Some of them are just like, what? Sorry, I don't know. So pillars, I don't know. So vampires have got to be at night. Notice there's sirens. In one of these you get Sith, the Sith Lord. So, all right. So they're fun. In fact, it turns out that the funnest ones are the haiku, for some reason. So with haiku, of course, you have a syllable count.
So you have to decide, am I going to stick with the syllable count or not? I usually try to stick with it. So, old pond. Okay. Now look, here's a point about these older embeddings: they are not good at capturing contexts where a word has very different meanings. If a word can appear with different meanings in different places, the modern transformer language models are much, much better at handling that. These old models aren't. This is an exact example of that. So, pond. I'm in the Nerd2Vec model. Amy Pond is a character in Doctor Who, if you're a Doctor Who fan, and so is Rory. A bunch of these are people. So essentially the word pond in the Nerd2Vec model is about a person more than it is about ponds in the woods or whatever. In poetry, it's about ponds in the woods. So here we essentially have a case where the word2vec model sort of breaks down, or this is a shitty model for poetry, which it definitely is. Okay. All right. So let's talk about this a little more. The color here represents a normalized distance between the original word and its next closest relative. So essentially, if you see something that's pinker, that means the next relative here is further away from it than the next relative is in these cases. So that can mean it's an interestingly different word that isn't as common in these poems. I did a lot of work to get this color because I'm a data-vis person. I don't know if it was worth it and paid off in terms of actual use. That's one of the points about doing data-vis and interactive work: you aren't always sure you're going to get a payoff from the feature that you just spent a lot of time working on. So essentially this is just me explaining how this worked. D3 is what I used for data visualization on the web.
The way this works, just so you understand, is that you essentially take a domain of numbers, which in my case are the distances between a word and its next closest neighbor, an array of all of those distances. And I get the minimum and I get the maximum, and those are my anchor points for the color scale. So the minimum is going to be the blue and the maximum is going to be the pink. I could have recoded this, but I just used somebody's picture last night at 11 o'clock. So that's how the encoding to color works. It also turned out that to get those scores, to get that color, I had to radically re-architect the app. This is another thing about data-vis. So this is the stack. Because I stupidly used React, the stack is incredibly deep here. We have the app at the top with the dropdowns for the model and the poem, then we have the poem renderer, then each line of the poem, then words that are clickable that load up a little dialog with the list that you pick from, and then you pass it back up to redraw the whole poem. So initially I called the API to get the closest words here, which is the obvious place: I click on a word, I get the list of closest words. Super simple. But to get the colors, I had to do it up here, and I was in promise hell. So truthfully, anyone here who's actually good at React, I wouldn't mind having a chat. Because React is essentially all passing state around. This is what my code looks like, too. I have like 18 functions called handleChange. It's insane. All right. So like I said, the Basho haiku are actually the coolest. This is the original haiku: the first cold shower, even the monkey seems to want a little coat of straw. With one of the poetry models I turned it into: the last wet shower, even the basil seems to try a little cap of dirt. Which I think is very haiku.
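The min/max-anchored color scale works like D3's linear scale. Here is the same idea sketched in Python, with invented distance values and two endpoint RGB colors; the real app does this in D3 on the front end.

```python
# A Python sketch of the d3.scaleLinear idea used for the word colors:
# map the minimum distance to blue and the maximum to pink by linear
# interpolation. The distances here are invented example values.
def make_color_scale(distances, low=(0, 0, 255), high=(255, 105, 180)):
    lo, hi = min(distances), max(distances)
    span = hi - lo or 1.0
    def scale(d):
        t = (d - lo) / span          # 0.0 at the min, 1.0 at the max
        return tuple(round(a + t * (b - a)) for a, b in zip(low, high))
    return scale

distances = [0.12, 0.31, 0.55, 0.90]
scale = make_color_scale(distances)
print(scale(0.12))  # minimum of the domain maps to pure blue
print(scale(0.90))  # maximum of the domain maps to pink
```

Anchoring on the observed min and max is what makes the scale "normalized": the pinkest word is always the one whose nearest neighbor is farthest away in that particular poem.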
But then the last hot meteor, even the lizard seems to need a little robe of guts, is the Nerd2Vec version, which made me laugh. All right. So, the parts of this project. Making the word2vec models is always easy with Gensim and always fun. Then I decided to code the poem editor in React, essentially because I've been doing data science in server-side Python for a while now and haven't been making data-vis apps; I was like, oh, let's just use the tool to draw. But in fact, this was a really complicated app to be my starter app in React. The learning curve was hard. The why isn't this updating, and why is this updating too much, was really hard. And then at the point where I decided I had to do the visualization by getting all of the distances up front and then passing them back down to the dialog list, that was really hard. The promise handling was hard. Putting the actual color on each button was super easy. D3 is fun, and it's still fun in React, I was glad to see. And then at the end, like last week, I was like, oh, I'm going to add a dropdown to change the poem and a dropdown to change the word2vec model so I can get Nerd2Vec in there. That was actually kind of easy. So at that point I'm like, maybe React is good for big projects where you need to make a little change. Okay. I think hooks would have helped, which is the latest way of handling all the state passing. So, some related AI creativity work. We're moving away from my projects to the bigger picture. I really like that human-in-the-loop working with a model or a tool or a learned representation. And for me, the interesting AI art is the art where I got to do an awful lot of the customization, of the authoring, of the curation, not just throw up a big GAN model and then tweak one parameter.
So obviously, if you're keeping up with the GPT-2 model, big language generation models, Hugging Face's Write With Transformer is an example where you can write with the GPT-2 model. I still don't love it. I mean, I don't know if you've tried it, but essentially it's a great app, but, all right. So essentially, we've got a starter text, and then we get options that the model generated based on what it was trained with. So: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry. But the heroes were not to be defeated. So that's it. And then you start typing something, hit tab, and then you get what it would complete with. A lot of times it's nonsensical. Most times it's nonsensical. Even GPT-2 isn't very good at big-picture coherence in text. So for me, this is still not a superb creativity tool. It's fun and funny, and I'll probably add a slide to my deck afterwards with one of my favorite stories I created with it, but it still has its issues. Okay. In the visual space, there's a lot more going on. NVIDIA's GauGAN demo is pretty cool. What you do is you draw with a pen that says what you are drawing: sky, tree, cloud, mountain or whatever. And then you tell it to turn that into a photo-realistic picture, and it does that based on what it's trained with. So this is my not-very-good trees turned into weird trees by an ocean with some hills. I like it because it's totally surreal, but it's obviously not photo-realism. It's surreal maybe because my input was bad or because of how it's putting it together. And then you can edit it again and get more stuff going on, but it's still not superb. This is one of my favorites. Sorry about my Pinterest thing. This is one of my favorites, though: Ganbreeder is a very cool app, and its creator is updating it to do more things. The way it works is this. Those were my pictures, by the way. I'm obsessed with generating castles.
I just want to build the coolest castle, fortress, tower, whatever; that's one of my obsessive goals. Okay, so you go to the Ganbreeder page and you can see sort of random things other people have created or that have come out of the app, and you can start from an image. Then what it does is generate children related to that image, and you can edit the genes and see what went into it: garden, spider, jellyfish, sea anemone, comic book, consomme (a soup), bell pepper. Okay, click on that, get more children. Those are looking pretty, actually. Okay, I'm going to save that. Anyway, there's a bot that's posting cool ones from this, and they're about to do major updates on the UI tools for this thing. It's going to be great. It's a super awesome tool. So this is a case of a human in the loop editing generated model pictures. I said I'm obsessed with castles, so as soon as he or she said they were updating it, I'm like, more castles, please. Because essentially there's one category for castles, and it isn't sufficient for me to really make the castle of my dreams. All right. So that brings us to some of my next points. We're doing this stuff: find, make, use cool data sets, but be clever about it, right? There's so much interesting data out there, and you can make your own data sets. Finding fun data sets isn't actually that hard, but there are two resources I'm going to point you to. Jeremy Singer-Vine's Data Is Plural mailing list is awesome. He's a data journalist, there's a big archive of all of his data sets out there, and it's a great newsletter with just a few short bullets in each description, and he loves weird data sets. So that's super fun. I also obsessively collect links to data set collections. So on my Pinboard, which I update daily with my many open tabs from Twitter, there's a data set tag. Some of it isn't going to be obvious. It's things like game archives, things like that. So you're going to look and go, why is that a data set?
It's a data set if you're a data scientist or a data goofer-off. You can obviously make your own data sets. This is Joshua Stevens, who's a great map maker. He did this now quite famous map of Bigfoot sightings in the U.S.: Bigfoot, the abominable snowman, the Yeti. So there's public data about this, and he turned it into this map that was all over lots and lots of articles at the time. This wasn't for work; it was goofing off. And then he has over here a little educational bivariate explainer about population versus sightings of Bigfoot, which is super interesting. Okay, Of Oz the Wizard. This is one of my favorite weird examples. So this is data you didn't know was data, just like me editing poetry. This is this guy, Matt Bucy, who alphabetized everything in the Wizard of Oz. All right, all right, all right, all right. Okay, yeah, so everything in here is alphabetized. So it's essentially alphabetical and then a temporal sequence. Decree, need the grade of degree. Illusion is not so loud. All right, so obviously he used code to do this. We could have done this, right? I'm kidding, I wouldn't have done this, but it's amazing. So one of the things about that exercise is that, as he said, his appreciation for the film increased. It's just like looking at the choice of a word in a poem and thinking, could I do better? Could I turn it into mine? When you look very closely at something using tools, you start to love how it was made. You start to love the data. Doing things with artistic works is even better, because you learn a lot about the art. Another interesting AI art project: Victor Dibia did this African mask curation project. He hand-curated zillions of pictures of African masks and then used them to generate new African masks with a GAN. So there's a good write-up and an interesting demo that he made. This is a case of a passion project where you curate your own data set to do something interesting with it.
I'm working on gargoyles and castles, obviously. Stay tuned. Anna Ridler. So probably Luba Elliott is going to talk about these projects tomorrow, so I'm not going into any detail, but Anna Ridler took pictures of tulips, and has a huge display of all the tulips she used, and then she used them, for the Netherlands, to generate GAN images of tulips that have never existed. This is another hand-curated, beautiful collection. Helena Sarin makes really interesting GAN art. She also curates and collects and does her own imagery that she feeds into the GAN to make her own artwork. I'm quite sure Luba will talk about this tomorrow. I love her style. It's very different from a lot of other GAN art. So, that said: don't be a jerk. When you're doing this kind of creative work and using data sets other people created, or inspiration from other people, or poems, or code, give credit to your sources and inspirations. And in particular, don't be this guy. Hopefully we all know about DeepNude. This dude is like, the algorithm only works with women because images of nude women are easier to find online, but he's hoping to create a male version too. Bullshit. He's not going to make a male version. And his argument was, if I don't do it, someone else will do it in a year. What? Be smarter and be more interesting. And don't be offensive, essentially. So the code was all over GitHub because he said it was too late, it was out there. I saw late last night that GitHub is taking it down. So good luck to them. But there are a lot of really offensive things you can do just because they're easy. That doesn't make them fun. So this is an example of a sort of tired metaphor, but it spoke to me. This is my patio in Lyon. I have lots of flower pots on it. There's a particular flower that I really like, that's essentially a weed, that I cannot grow well in these pots. I don't know what the deal is. One seed fell off the edge. You can see it sort of down here. This is one plant.
It gets almost no sun, no rain, and it fell on gravel, and it did that. That's one plant. I can't put it in a pot and have that success. But sometimes the stuff you just throw out there, that lands where you think it's shitty, does the best. And that's one of the cool things about being creative and putting your stuff out there: sometimes it will put a flower in a place full of weeds. I think you understand what I'm getting at. So Max Kreminski also has this reminder: if inexperienced creators are using your tool to churn out loads of half-baked garbage, your tool is a phenomenal success. So true. What we want is people making a lot of half-baked garbage and having a good time. So what I would say as a takeaway is: if you find yourself playing with something you made, and you're actually smiling or laughing at it, and you're playing with it to check out something new, you've definitely succeeded at a fun weekend hack. And you want to make yourself and other people smile. You don't want to be a jerk, and you want to make the world a better place for yourself and for other people. So I want to give some shout-out thanks. This is not all of the open-source projects I used, but these were particular helps for this. The Bosch Bot was really helpful. Peter Baumgartner's gist for making a UMAP display in Plotly was an amazing help. I used this dude's word2vec API code, which I could have written myself, but I saved a bunch of time (weekend hack). Peter Gassner helped me a little with React, and this person helped a ton of people in a comment on the UMAP repo for how to actually get UMAP recognized in your Python project. I want to see how many people are super happy. I'm the one here. So anyway, that's my talk. Thank you. Thank you, Lynn. Thanks. That was a great keynote. I guess you have many questions, maybe, but you're around until Thursday? Yeah, I think let's talk over coffee so we don't... Okay, thank you very much.
So we have a short 15-minute coffee break, and we're going to continue at 10:30. So hi, everyone. My name is Olivier Grisel. I work at Inria in Paris, and I have some funding from the Inria Foundation. Maybe it doesn't really work this... Okay, no. So basically, I work as a software engineer to support the development of the scikit-learn project with a couple of colleagues at Inria and, more generally, an international community of contributors around the world. So scikit-learn is a machine learning library in Python. How many of you do not know about scikit-learn and have not used scikit-learn in the past? Okay, a couple of new people. So for this presentation, what I would like to do is give a first introduction, then cover some things that were released in the past releases and some ongoing work. So first, scikit-learn and machine learning. Predictive modeling, which we also call statistical machine learning, is basically the process of using repeated events that are recorded in a database, keeping a historical record and extracting some statistical structure from those records, in order to turn them into some kind of executable model. The end goal is to generate a program that will be able to predict the outcome of repeated events. You can see that as an alternative to hard-coding rules written by experts who know the process being modeled well. Instead, we just use a record of data and hope that this executable model will be able to make good predictions on future events. So it's generally most useful for a large number of small, fast predictions, where an individual failure does not matter much, but you want your predictions, on average, to be better than a random guess or some baseline rule that you could design quickly. And then it can be used to optimize many business processes, scientific discoveries, and so on. So the general data flow is to start from the recorded data.
So you have some acquisition process that could be based on mobile phones, cameras, microphones, or interactions with users via a web app. All those individual events are recorded in a database, and the first step is to find a numerical representation of those records. So those are the blue lines in this database, and you have one specific column that you're interested in, the target variable that you want to predict; this is the green one. Sometimes it's naturally present in your database. For instance, if you want to forecast something, the past information is already there. Sometimes you have to collect human annotations. For instance, if you want to do image classification or translation, things like this, you need professional annotators to give you those labels. Once you have this, you can plug in any kind of machine learning algorithm. Those are mathematical models implemented in Python or C++ or whatever. And the outcome of that process is a statistical model. The statistical model is some kind of summary, a couple of megabytes of parameters, for instance, that summarizes the big training set that you used as input. So typically the statistical models are a couple of megabytes or even smaller, whereas the training set could possibly be gigabytes or terabytes in some cases. Once you have this, you can deploy it. And basically, you don't need to keep all the training data around. You can just make a copy of that statistical model and deploy it on another server or on a mobile phone, record new data with the same data-generating process, the same acquisition process, and then execute the prediction algorithm by feeding in the new test record and the model, to get the predicted outcome. So one typical example of this: if you're a real estate agency, you can record the historical transactions for different housing types.
And for each of those records, you collect numerical features or categorical features in different columns. So the number of rooms is an integer; the area in square meters is a continuous variable; the year is a specific kind of ordinal variable. And typically, you also have the target variable, in this case the price in euros, in a specific column. And you can also have new records in your database where you don't have this target variable filled in, because those are new customers who want to get an estimate for the house that they want to sell. And for those, by definition, you do not know the price of the future transaction. So here, the goal is to find the statistical relationship between the price and the descriptors in the historical data, and then use that statistical relationship to predict some estimate of the price. So the names that we use for these: the features are the columns that we use for the input; the target variable is the output of the model; and the records in that database are what we call in machine learning the samples. Those where we have labels, and which we can use for training the models, are called the training set. The new ones, where we want to make predictions, are the test set. And once we collect the true value, we can compare the prediction with the true value and compute the accuracy of the machine learning model on the test set, to estimate the generalization ability of the model, which is basically the quality of the model. So scikit-learn is a machine learning library that has hundreds of classical machine learning algorithms implemented in it. Those are the traditional families.
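In scikit-learn terms, that features/target/training/test vocabulary looks like this. The housing-style rows and prices below are made up for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical housing-style records: feature columns are
# (rooms, area in m^2, year); y is the target (price in euros).
X = np.array([[3,  70, 1990],
              [4,  95, 2005],
              [2,  50, 1978],
              [5, 120, 2012],
              [3,  80, 2000],
              [4, 100, 1995]])
y = np.array([250_000, 390_000, 180_000, 520_000, 300_000, 410_000])

# Labeled rows we fit on are the training set; held-out rows,
# whose labels we pretend not to know, play the role of the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
print(X_train.shape, X_test.shape)
```

Comparing predictions on `X_test` against the held-back `y_test` is what gives the estimate of generalization ability described above.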
And the goal is not to implement the latest state of the art, but more to provide data scientists with good baselines that they can apply to their data to build something useful, or to compare against: if they have new fancy ideas for better machine learning algorithms, they can check that their new stuff is better than the traditional stuff. It's an open source project using the BSD license, so pretty much anybody can use it; it's just that there is a disclaimer that, in case you hit a bug, you take the responsibility. It has a community of hundreds of contributors per release, and I think more than 1,000 contributors so far. And we have a team of core developers scattered around the globe: in Australia and China, in France (mostly around Inria), in Germany, and in the United States, at Columbia University, for instance. Most of the algorithms are meant to be used from a Python API, and they are also often implemented in the Python programming language with the help of the linear algebra routines made available by NumPy and SciPy. Sometimes the vector operations that are very efficient in NumPy are not sufficient to implement some algorithms, for instance decision trees, where the bottleneck is not really a matrix-matrix multiplication. In those cases, where we would have nested for loops in Python, which can be very inefficient, we use a compiled programming language called Cython, which is basically an extension of the Python syntax that can be used to generate C code: with type declarations, you can use a compiler to build a compiled extension for Python, with a syntax that is very high-level, and you still have NumPy operations and things like that readily available in Cython. The main selling point of the scikit-learn machine learning library is the fact that we provide very mathematically heterogeneous models under a very homogeneous API.
So data scientists can swap the models very easily, because they all follow the same API conventions: the fit method for training the model, and then the predict and transform methods to apply the model to new data. And because all the models follow this standard API, we have standard tools to evaluate the quality of a model: for classification, measuring the accuracy; for regression, measuring the mean absolute error; and so on. Cross-validation procedures, model selection, hyperparameter tuning, ways to ensemble several models into a big model, and also ways to build pipelines with preprocessing, and so on. So it's a very active project. Nowadays, I think we are at over 800,000 monthly active users browsing the documentation online, and basically it's doubling approximately every two years, so we estimate there are maybe around a million scikit-learn users or something like that. So what is new in scikit-learn 0.21? If you look at the changelog, this is just a snippet; if you were to scroll the page and take screenshots, it would cover the walls, basically. So I will just focus on a subset, and in particular on the new gradient boosting implementation in scikit-learn. Gradient boosting is a very, very useful supervised learning model. It's based on decision trees that you train sequentially, one after the other, such that the second decision tree tries to correct the prediction errors of the first one, and you build an ensemble by sequentially refining the predictions using different trees. In scikit-learn we already had an implementation, for approximately eight years or something. But the way it was implemented was the traditional exact method, and this traditional exact method was shown to be completely outperformed by a new approximate method implemented, for instance, in a C++ project called LightGBM by Microsoft, and in other libraries like XGBoost.
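The homogeneous fit/predict API means swapping models is a one-line change. A small sketch on made-up synthetic data, using two standard estimators:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Tiny synthetic binary classification problem (invented data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Because every estimator follows the same fit/predict convention,
# swapping a linear model for a tree is just a different constructor.
accs = {}
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X[:150], y[:150])                       # train on 150 samples
    preds = model.predict(X[150:])                    # predict the held-out 50
    accs[type(model).__name__] = accuracy_score(y[150:], preds)
print(accs)
```

The same uniformity is what lets cross-validation, grid search, and pipelines treat any estimator interchangeably.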
In particular, LightGBM showed that you can get really, really high performance by what we call binning the data into a small number of buckets and computing histograms: counting how often you see a data point with a given binned value. This makes it possible, when training the decision trees, to avoid the sorting operations that are very expensive in the traditional algorithm. Because of the binning, we can get rid of sorting and just do comparison and counting operations, and this is much, much more scalable. Furthermore, the binning step itself can also improve quality a bit by adding some regularization. So to implement this for scikit-learn, we first built a prototype using the Numba framework, which is basically a just-in-time compiler for scientific programming, because I wanted a good opportunity to learn how to use Numba. This was implemented in the pygbm project, which is also open source, but it's just this single algorithm implemented with Numba. Once we could prove that we could reach this kind of performance with Numba, we decided to translate the code into Cython, which is very similar to Numba; we just need additional type declarations, basically, so that it can be easily embedded in scikit-learn, which already has a dependency on Cython, and we didn't want to introduce a new dependency quickly. Maybe in the future scikit-learn will use Numba, but it was easier in the short term to just translate it. But we still plan to keep the pygbm project around, so if you're interested in Numba and gradient boosted trees, you can still use that directly. The two projects have basically the same performance, and when we measure on some benchmark datasets, for instance the Higgs boson dataset, it's quite competitive with LightGBM, which is a very optimized C++ implementation. Sometimes it's even a bit faster.
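The binning trick can be sketched with NumPy: pick bin edges at quantiles of the training data, then map each continuous value to a small integer bucket. This is a simplified illustration of the idea, not scikit-learn's actual implementation; the bin count and data are made up.

```python
import numpy as np

# Map continuous feature values into a small number of integer buckets
# using quantile-based bin edges, so later tree training can count and
# compare bucket histograms instead of sorting the raw values.
def bin_feature(values, n_bins=8):
    # Interior bin edges at evenly spaced quantiles of the data.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(values, quantiles)
    # searchsorted assigns every value a bucket index in [0, n_bins - 1].
    return np.searchsorted(edges, values), edges

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
binned, edges = bin_feature(x)
print(binned.min(), binned.max())
```

Once each feature is an array of small integers, finding the best split in a tree node only requires building a histogram of bucket counts, which is where the speedup over sort-based splitting comes from.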
The Cython translation was actually contributed by Nicolas Hug at Columbia University, and we are still working together to integrate the missing features. So this slide is slightly outdated: now we have more losses, and we can do multiclass classification. We still want to add new losses for quantile regression and so on, to support sparse data, and to support missing values, which is almost ready to merge. And there are other features like this from LightGBM that are not yet present in scikit-learn, so we're still working on that. Just to give you some intuition of how Numba makes it very easy to write high-performance algorithms in pure Python: this is actually a snippet of the code for the binning part, going from continuous values to integer values in buckets. It's a very naive algorithm, but it's the traditional way to write binning, and there is no obvious way to do better. If you do that in plain Python, you see that you have two nested loops; it's very inefficient. But if you just import Numba and the njit decorator, and you decorate your function like this, then Numba will use LLVM to generate a native version of this code, compiled specifically for your platform. And this will run as fast as C++, basically. Furthermore, if you use the prange function here, changing the range operation of the for loop to prange, then automatically you get a parallel execution of the for loop. So you can run the independent operations in different threads on a multi-core machine and get a good speedup. With Cython, it's very similar; it's just that we need to add the types, so it wouldn't fit on the slide anymore, but the code would be very, very similar anyway. So the Cython implementation is still flagged as experimental in scikit-learn.
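The pattern described, naive nested loops decorated with Numba's njit and prange, looks roughly like this. This is a sketch, not the actual pygbm or scikit-learn code, and it falls back to plain Python when Numba isn't installed so the logic is the same either way:

```python
import numpy as np

# Sketch of the Numba pattern from the talk: a naive double loop over
# samples and bin edges, decorated so it compiles to native parallel code.
try:
    from numba import njit, prange
except ImportError:
    # No Numba available: run the same logic as ordinary Python.
    prange = range
    def njit(*args, **kwargs):
        def wrap(f):
            return f
        return args[0] if args and callable(args[0]) else wrap

@njit(parallel=True)
def bin_values(values, edges):
    out = np.empty(values.shape[0], dtype=np.int64)
    for i in prange(values.shape[0]):   # prange parallelizes the outer loop
        b = 0
        for e in edges:                 # count how many edges the value passes
            if values[i] >= e:
                b += 1
        out[i] = b
    return out

edges = np.array([-1.0, 0.0, 1.0])
vals = np.array([-2.0, -0.5, 0.5, 2.0])
print(bin_values(vals, edges))
```

With Numba present, the first call triggers LLVM compilation and subsequent calls run at native speed, with the outer loop distributed across threads.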
We plan to implement more features in it, so we know we may need to change the behavior of the default hyperparameters. Because of that, and because we are very conservative in scikit-learn about not breaking our users' code, if you want to use the new HistGradientBoostingClassifier, you need to acknowledge in your source code that you are using an experimental feature, and that for this specific model we do not guarantee backward compatibility with a deprecation cycle. We will not break it for fun; we will probably just tweak the default hyperparameters a bit. We still need to implement sample weights, sparse data, missing values (almost ready), categorical variables and so on, but it's already usable for numerical values, or if you preprocess your data so that you only have numerical values. From a performance point of view: this benchmark is a synthetic classification problem with, I think, 20 features, and on the x-axis you have the number of samples, from hundreds of thousands to millions. I think the last point is 5 million; it could go to 10 million, but on this laptop it was using too much memory in the end. You see this kind of linear scalability. This is a log-log plot; on the y-axis, I missed the label, but it's the time in seconds. For these millions of data points it takes approximately 10 seconds to train a typical ensemble of trees with this algorithm. The blue line is scikit-learn and the orange line is LightGBM, so you see that for small datasets we have some overhead in scikit-learn, because we do additional input validation, I guess, but when you move to large datasets that take more than one second to train, we are very competitive with LightGBM and significantly better than XGBoost.
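The "acknowledge in your source code" step mentioned above is an explicit import from `sklearn.experimental`. A minimal sketch, assuming a recent scikit-learn (in versions where the estimator has since become stable, the enabling import is unnecessary, hence the try/except); the synthetic dataset is just to show the estimator running:

```python
# In scikit-learn 0.21/0.22 this explicit import is required to opt in to the
# experimental estimator; in later releases the estimator is stable.
try:
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
except ImportError:
    pass
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

# Small synthetic problem with 20 numerical features, as in the benchmark.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))
```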
I don't have exactly the same hyperparameters for XGBoost and LightGBM, so it's hard to compare exactly, but this is the kind of performance you can expect. Now I would like to switch to an interactive demo. Here I have a Jupyter notebook where I will show how to use scikit-learn in a typical use case, compare the different solvers, and in particular focus on gradient-boosted trees. Oops, sorry, I was not at the beginning. The dataset I'm going to use is the California housing dataset. It's a small real estate dataset with 20,000 samples, so it's not very big, and you have these features describing different housing types; the goal is to predict the price of the houses in different neighborhoods. You see it's all numeric data. First we split the dataset into a training set and a test set, so that we can measure quality on the test set and not cheat by just memorizing the prices observed on the training set. We start with a linear regression baseline: with scikit-learn it's possible to do linear regression like in Excel, and it's very fast, 11 milliseconds. We can compute the error, the mean absolute error; I don't remember the unit, but it must be several tens of thousands of dollars or something like that. You see that on the training set the error is slightly lower than on the test set, which is expected: it's easier to memorize than to generalize. This number is just a baseline. We can also make this kind of plot to compare the predictions on the x-axis to the true labels on the test set; with a perfect model all the points would be on the diagonal, the identity function basically. Here you see many off-diagonal elements, and especially for the large values our model tends to underpredict; it goes off the diagonal.
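The split-then-baseline workflow above can be sketched as follows. A synthetic regression problem stands in for the California housing data (the real dataset is available via `sklearn.datasets.fetch_california_housing`, which downloads it); the noise level and feature count are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the California housing data used in the demo.
X, y = make_regression(n_samples=20_000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fast linear regression baseline; compare train vs. test mean absolute error.
model = LinearRegression().fit(X_train, y_train)
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(train_mae, test_mae)
```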
There is this kind of bias, probably because of the distribution of the true labels. This is the distribution of the prices as a histogram, and you see a censoring effect at 5, which means that houses above this price were censored, were clipped, and the database just recorded 5 instead of more than 5. This is an artifact of the training set, and for linear models it can actually be a problem, because the loss function they use does not make this kind of assumption. So we can filter this out, and we can also take the logarithm so that the target is more Gaussian-distributed, and see if our linear regression performs better in this case. We also make a pipeline where we do some standard preprocessing, then linear regression on the data that has been preprocessed and on the labels that have been preprocessed. When we compute the scores again, we need to invert the preprocessing that we did on y, but you see that it's already significantly better: just by using a linear model more correctly, we have improved the predictions by quite a bit.
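The "transform y, fit, then invert the transform at prediction time" pattern can be expressed with scikit-learn's `TransformedTargetRegressor`, which applies `func` to the labels before fitting and `inverse_func` to the predictions. A minimal sketch on made-up, strictly positive, skewed targets (the data and pipeline details are illustrative, not the notebook's exact code):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Skewed, strictly positive targets (illustrative stand-in for house prices).
rng = np.random.RandomState(0)
X = rng.uniform(size=(1000, 3))
y = np.exp(X @ np.array([1.0, 2.0, 0.5]) + 0.1 * rng.randn(1000))

# Standard preprocessing on X; log/exp transform pair on y.  Predictions are
# automatically mapped back to the original scale through np.exp.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), LinearRegression()),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
```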
Let's see if we can use a non-linear model to gain some more expressive power. In scikit-learn you can build complex pipelines that transform the data; in this case we will do polynomial feature extraction, keeping the original numerical features and also adding pairwise interactions, and actually interactions up to degree 5 between the numerical values. After this feature expansion we fit a final linear model, but the full pipeline is no longer a linear model; it's more complex, so it's more expressive. This is significantly slower, 4 seconds instead of 11 milliseconds, but you see that the predictions are more and more diagonal; you no longer see this kind of extreme pessimistic deviation by the model. Instead of doing feature engineering to turn a linear model into a non-linear pipeline, we can directly use a non-linear model like a neural network, which is a kind of generalized linear regression with built-in non-linear capabilities. If we do that, it's also more expensive. In scikit-learn we have basic neural networks you can use; if you really want to use neural networks I would advise you to use Keras or TensorFlow or PyTorch instead, because you have more flexibility to design the architecture you want, but if you just want a traditional MLP architecture you can use scikit-learn on the CPU. Here you see the accuracy is even slightly better than the feature extraction we did manually with the pipeline, 0.21 on the test set, and the training time starts to be slow, 8 seconds now. We can do this plot again, and it's more and more diagonal; still not perfect, but slightly better than before. Finally, gradient boosting. Whenever you have this kind of tabular data, some kind of Excel spreadsheet with different columns with different physical units, like the number of rooms, the criminality in
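The polynomial feature expansion described above can be sketched as a pipeline. A toy target that is a pure interaction term makes the point: a plain linear model cannot see it, while `PolynomialFeatures` exposes it (the talk used interactions up to degree 5; degree 2 keeps this example small, and the Ridge estimator and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy target that is purely an interaction term, invisible to a linear model.
rng = np.random.RandomState(0)
X = rng.randn(2000, 2)
y = X[:, 0] * X[:, 1]

# Feature expansion (adds x1*x2, x1**2, x2**2), then a final linear model.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1e-3),
)
model.fit(X, y)
```

The full pipeline is non-linear in the original features even though the last stage is linear.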
the neighborhood, whether or not it's close to public transportation, whether or not it's close to the ocean, the GPS coordinates: those are different columns with very different, heterogeneous types and physical units. This is what we call tabular, structured data, and for this kind of data neural networks are typically not significantly better than traditional machine learning; in particular, decision trees and gradient boosting are very competitive, and most of the time significantly faster and less finicky to adjust. So if you have this kind of tabular data, this is the kind of machine learning algorithm we recommend trying very quickly: it's always good to start with a linear model, but then very quickly try this one to see if it's better. This is gradient boosting with the traditional exact algorithm that was previously implemented in scikit-learn. You see it takes a bit more time, six seconds, and this is with the original dataset, without filtering the censored values, and it's already much closer to the diagonal. The test error is even smaller than with the neural network, and it's easier to find good hyperparameters with this algorithm. Something else that is interesting: if you serialize the trained model, it's not very big, 1.5 megabytes, so it's easy to store on disk and to deploy, to load into memory on many servers to run predictions on a compute farm, or even on mobile phones. And if you time how long it takes to predict a batch of 100 samples, you see that predicting 100 houses takes just a couple of hundred microseconds, less than 1 millisecond, so these models are also very fast to predict, which is a good property for deploying this kind of model in production. You can contrast this with the random forest, which is another way to build ensembles of decision trees, more traditional and maybe more popular in the past. From a training
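The two deployment-friendly properties mentioned above, compact serialized size and fast batch prediction, can be checked directly. A sketch on synthetic data (the dataset, sizes and timings here are illustrative, not the notebook's actual numbers):

```python
import pickle
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Train the traditional (exact) gradient boosting on synthetic data.
X, y = make_regression(n_samples=5000, n_features=8, noise=5.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# The trained ensemble serializes to a compact byte string, easy to ship
# to many prediction servers.
payload = pickle.dumps(model)

# Predicting a batch of 100 samples is very fast (sub-millisecond territory).
preds = model.predict(X[:100])
print(len(payload), preds.shape)
```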
time point of view it's roughly equivalent, and from a test error point of view it's very similar: 0.19 here versus 0.187 there, so it's very close. Typically gradient boosting tends to be slightly better than a random forest, but the big difference is the size of the model you get. For a random forest you generally build much deeper trees, so they take a lot more space in memory, and furthermore they take a lot more time to predict: 100 milliseconds for the same prediction, so 100 times slower than the gradient-boosted trees. This is why in production people typically favor gradient-boosted trees over random forests; you can reduce the inference cost, basically. Now in scikit-learn you also have the new histogram-based gradient-boosted trees, so you can do the same, and what you will observe is that for this small dataset the training time is very similar, because on small datasets it doesn't make a big difference. The accuracy is very similar to the traditional gradient boosting algorithm, the model size is slightly bigger, maybe because this run built slightly more trees, I don't know exactly, but it's not very important, and it's also quite fast to predict. What is very interesting is that if you use it on a dataset with millions of data points, the traditional method I demonstrated before would not work at all: it would crash, use too much memory, and be much too slow, while the new one has no problem with tens of millions of data points; it would take between tens of seconds and a couple of minutes, depending on the hyperparameters. Something else that is very interesting with histogram-based gradient-boosted trees is the ability to do early stopping. Remember that I said we train the trees one after the other, each trying to fix the errors of the previous tree; as we do that, we can keep monitoring the training and validation error of the model.
On the x-axis is the number of trees in the ensemble, and on the y-axis is the score function, the higher the better, the negative error in this case. You see that when you add more trees, the training accuracy increases, and the validation accuracy on the held-out validation set also increases, but at some point it reaches a plateau and is no longer making significant improvements. What early stopping does is compute this accuracy on the validation set after each tree, so that whenever you detect that you are reaching this plateau, you stop. By doing this, here you see we stopped before reaching 100 trees, so the model is smaller than before: we can build smaller models that are faster to train, because we do not need to go all the way to the end, faster to predict, and smaller to store in memory or on disk. So it's very good to use early stopping in practice. Okay, I will stop here for the demo so that I have some time left to talk about other things in scikit-learn. In the previous release, scikit-learn 0.20, published in September or October last year, there were also a bunch of very cool features that some people in this room might not know about. One that I would like to highlight is the column transformer: a lot of effort was put into making it much easier to do feature engineering, typically on heterogeneously typed pandas data frames, with categorical columns, numerical columns, different distributions and so on. There were other improvements too, but these are the ones I would like to emphasize. For instance, here we use pandas to read a CSV file of the Titanic dataset, the list of all the passengers of the Titanic, with a specific column indicating whether or not each passenger survived. We can introspect the different columns of this data frame and look at the data types of
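The early stopping behavior described above is exposed through estimator parameters. A sketch with parameter names as in recent scikit-learn releases (in 0.21 the configuration was slightly different, so treat the exact names as version-dependent; the synthetic data is illustrative):

```python
# Enabling import needed only on older scikit-learn versions.
try:
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
except ImportError:
    pass
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=10_000, n_features=10, noise=10.0, random_state=0)

# Hold out 10% of the training data internally and stop adding trees once the
# validation score has not improved for 10 consecutive iterations.
model = HistGradientBoostingRegressor(
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)
print(model.n_iter_)  # number of trees actually built
```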
the data frame, whether or not they are integer or floating-point values. If they are, we say the column holds numerical data; if not, for instance if it's a string or an object, it's probably the label of a categorical variable, at least that's the case on this dataset. Based on that, we split the columns into two groups, the numerical ones and the categorical ones, and then we use the make_column_transformer function to define two different pipelines, one applied to the numerical values and one applied to the categorical values. For the numerical values we do missing-value imputation with the median: whenever there is a missing value in one of those numerical columns, we compute the median of the non-missing values in the same column and substitute it. We can also insert an indicator column to record that we did this imputation, and then we use the numerical preprocessing tool called KBinsDiscretizer, which I present on the next slide. That is the pipeline for numerical values. We can build a similar pipeline for categorical values, but in that case we cannot compute a median for missing values, because a category is just a name, a label; so we just fill in with a constant value, "missing", and that's enough, and then we use the OneHotEncoder, the dummy categorical value encoder. We can then combine the two pipelines using the column transformer and call the resulting operation the preprocessor. Once we have this preprocessor, we make a final pipeline that stages the preprocessor first and the classifier second, logistic regression in this case, and we can pass this full pipeline to a cross-validation procedure that evaluates the full set of modeling decisions, basically. If you do that, you see the accuracy you get, which is a very good
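The two-branch preprocessing described above can be sketched with `make_column_transformer` (argument order as in recent scikit-learn). The tiny data frame below is a made-up stand-in for the Titanic CSV, and I use a StandardScaler on the numerical branch where the talk used KBinsDiscretizer:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame standing in for the Titanic data: two numerical columns,
# two categorical columns, some missing values.
df = pd.DataFrame({
    "age": [22.0, None, 38.0, 26.0, 35.0, 28.0, None, 54.0],
    "fare": [7.25, 71.3, 8.05, 7.9, 53.1, 8.46, 51.9, 21.1],
    "sex": ["male", "female", "female", "female", "male", "male", None, "male"],
    "embarked": ["S", "C", "S", "S", "S", "Q", "S", "S"],
})
y = [0, 1, 1, 1, 0, 0, 1, 0]  # survived or not

preprocessor = make_column_transformer(
    # Numerical branch: median imputation (plus indicator column), then scaling.
    (make_pipeline(SimpleImputer(strategy="median", add_indicator=True),
                   StandardScaler()),
     ["age", "fare"]),
    # Categorical branch: constant fill for missing labels, then one-hot.
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                   OneHotEncoder(handle_unknown="ignore")),
     ["sex", "embarked"]),
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(df, y)
```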
baseline for this dataset, and you see that the code required for this level of flexible preprocessing is fairly limited and still very easy to understand. The numerical preprocessing I mentioned is the KBinsDiscretizer. This is interesting: here there are three different datasets. In the first dataset you have purple points and green points arranged into overlapping, folded half-moons. The goal is, based on the position in this 2D space, to predict the color of the dot: the position is the input for the model and the color is the output, so the model has to generalize the color to all possible locations in this space. In the first column we fit a linear classifier, a linear classification model, so it will try to find a linear boundary in that 2D space to separate the two classes, and when the two classes overlap like this, you see that a linear model cannot do it. For the last row, the two groups are approximately linearly separable, so in that case the linear model is optimal, but for the other datasets, the first two rows, a linear model cannot get good performance. What we can do instead is use the KBinsDiscretizer as a preprocessing stage before the linear model. In that case we group values into ranges of possible values and output more features, generating a higher-dimensional feature space, and a linear decision boundary in that higher-dimensional feature space corresponds to a non-linear decision boundary in the original feature space. This is what we observe: the second column is the combination of KBinsDiscretizer with logistic regression, and you see that the quality, even on the non-linear problems, is very good, and this is a very fast model. You can compare this to gradient boosting and support
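The binning-before-a-linear-model idea can be sketched on the half-moons dataset. The bin count, noise level, and use of training accuracy below are illustrative choices, not the exact setup of the scikit-learn example figure:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Two overlapping half-moons: not separable by a straight line.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Plain linear classifier as a baseline.
linear = LogisticRegression().fit(X, y)

# Bin each feature into one-hot encoded intervals: a linear boundary in the
# expanded space corresponds to a non-linear boundary in the original space.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot"),
    LogisticRegression(),
).fit(X, y)

print(linear.score(X, y), binned.score(X, y))
```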
vector machines, which are also non-linear by nature, so they can solve this classification problem quite easily, but the binned linear model is significantly faster and smaller to deploy; in this case it would make sense to just do this kind of preprocessing plus a linear model, and that would be enough. There are other features that were introduced; I will skip those. I will just highlight that while we worked on this, we also fixed a lot of things directly in the Python standard library, for the serialization of large objects with NumPy arrays. It's important to keep in mind that the Python ecosystem is a community, and when you work on your project, sometimes you want to quickly hack a fix into your own project, but sometimes it's better to make the investment to fix the problem upstream, so that other projects can benefit from the solution. So we improved the pickle module over a span of two years, and it's now going to be integrated in Python 3.8. I would like to go to the end and thank the partners of the Inria Foundation who supported this work, and thank more generally the scikit-learn developer and user communities. You see pictures of the last couple of international sprints, in Paris, in Austin, and in New York. Thank you for your attention; maybe we have a couple of minutes for one or two questions. There is a microphone here. Hello, thank you for the presentation. Are the slides and the IPython notebook available somewhere?
They are already online, but I need to tweet the URLs; I will tweet them. My Twitter handle is ogrisel, and the conference organizers will probably also collect the slides and put them on the conference website. And the second question: you introduced a few algorithms one after another in the notebook, and random forest was the last one, and you said it's used in production very often, but from experience the random forest often overfits. The random forest can overfit, but it's not necessarily the case. Typically, the more trees you put in the random forest, the less you will overfit, so by adding more trees you can decrease the overfitting a bit. You can also limit the depth of the trees using max_depth in scikit-learn, and you can do some kind of feature selection to remove the features that are not predictive, which also helps combat overfitting. Sometimes you have a lot of noise in your data and there is no perfect solution, but generally I would say it's possible to reduce overfitting with random forests. The main problem with random forests is that they are much bigger models and they are slower to predict compared to gradient boosting, as in the new scikit-learn implementation or in XGBoost or LightGBM; those models are smaller and faster to predict, which is why they tend to be favored in production. Hi, my name is Björn. I have a question about your column transformer: it seems you have now embraced pandas, so with sklearn you don't have to transform everything to a NumPy matrix before you input it; there was a tool called sklearn-pandas. Yes, sklearn-pandas was a big inspiration for the column transformer; we wanted to have something similar by default in scikit-learn. That's really great, because when I started to learn scikit-learn I was a bit put off by that. It makes it much easier to do feature engineering on original data that was loaded with pandas, and in the future
this could probably also be adapted to work with other kinds of tabular data structures. That's great. Hi, thank you very much for the talk. Just a question regarding the support of categorical variables: you said it's ongoing; is there a timeline? Okay, so there are two ways to support categorical variables. Either you do preprocessing, as I showed with the column transformer, which works for all the models in scikit-learn, and you have much more flexibility to add business logic, like filtering out rare categories and so on. We plan to improve this: so far we have one-hot encoding, good for linear models and this kind of neural network, and we also have ordinal encoding, which is better for decision trees, but in the future we want to have impact (target) encoding, which uses the target variable to find a good representation, and better support for weird distributions of rare values. Sometimes you also have informative category labels that have typos but are almost proper words, and for this there are third-party projects like dirty_cat, which we plan to make usable in scikit-learn pipelines to improve this kind of categorical variable preprocessing. And then for decision trees, it's possible to deal with categorical variables directly inside the decision tree; this is not yet implemented. Right now we are focusing on missing values, and there are a couple of other things that have more priority, like sample weights, but I think after that, in the coming months, we plan to work on this. Can this new algorithm work on GPUs?
Most of the scikit-learn solvers depend on NumPy and SciPy, and those libraries do not support working on GPUs right now. Furthermore, some models would not really benefit from GPUs; for instance, linear models are memory-bound and would not benefit, so there is no point in trying to use GPUs for them. But there will be a presentation by NVIDIA later today, and maybe another one on RAPIDS AI, one on Dask; I'm not sure, you will see on the program, or you can ask Peter here. That project basically provides similar estimators with a scikit-learn-compatible API; it's not necessarily an exact drop-in replacement, but it has basically similar features, and some of them can really benefit from running on the GPU. For decision trees and gradient boosting, it's not always the case that running on the GPU will be faster; you have to keep in mind that it's not like convolutional neural networks, which really benefit from GPUs. For decision trees and linear models, sometimes the CPU is good enough. Hi, thanks for the very informative talk. Maybe you can quickly comment on regularization: you mentioned L2 regularization; in the prototype, are you also planning to include others, like L1? So basically, when we do the binning preprocessing, we are simplifying the representation of the data, because we are decreasing the precision: we use 8-bit integers, 256 levels, to represent the original values that were encoded as 64-bit floats, for instance. By doing so we reduce the complexity of the model, and this is a kind of regularization. For decision trees it can help a bit, but it's not magical in any way; still, sometimes you observe that with this approximate method you get better performance than with the exact method, which might be a bit surprising, but it is a regularization effect. I think we should stop here, because we are running out of time. Okay, thank you very much
again. Hello everyone. Okay, this works. So today I'm going to talk about a new approach we've been developing for handling relatively uncomfortable datasets. I don't really want to use the term big data, because everybody has their own definition of big: sometimes it's volume, sometimes it's velocity, sometimes it's complex structures where you need to join maybe 100 tables together to make one cohesive unit. I'm going to talk about handling volume on a single machine, with data that is probably not that big if you work at, I don't know, Google or Facebook, where you maybe have a different idea of big data, but data that is making your data scientists uncomfortable when they want to work on a local laptop or a local workstation at your employer. I assume that most of the people here are data scientists or engineers, or interact with people like that, so consider the options I've listed here. I think everybody can use a dataset that has about a million or so samples; we can use the standard tools out of the box, and we can even be sloppy with our programming. How about if we move to 10 million samples? I think modern machines can handle this; we have plenty of RAM now, and RAM is getting cheaper every day. How about 100 million samples? This maybe starts to get a little uncomfortable: maybe we need to be really careful with how we handle categoricals, maybe we need to not load all the columns, maybe we need to use a binary format. And how about a billion samples? I'm talking about not going to the cloud, just using your normal laptop or your normal workstation, either at home or at work. This becomes a bit of a problem, especially if you have more than two columns, say 50, 60, maybe 100 columns, and you want to work and explore and build models, and maybe even deploy these models in production. And how about even larger datasets? So today I'm going to talk about a solution that we are developing and actively using,
actually, to enable us to work with these kinds of datasets without being uncomfortable. I'm sure, well, I know there are tools where you can go to the cloud and just rent a beast of a machine or spin up a cluster, but this incurs additional cost, time and management, and if you just want to explore an idea and see whether it works, you don't always want to go to the cloud. I came from academia, and you cannot always justify it to your employer: give me this much time and some Amazon server and maybe this will work, but I'm not sure. So, I'm Jovan, hi. I used to be an astrophysicist in Groningen, which is a small town in the north of the Netherlands; now I'm a data scientist at CVL Labs, where I work on DevOps pipelines. When I was an astronomer, I met Maarten Breddels; we were postdocs together in the same group, and we were developing this vaex package because we needed it for our work, as I'll show in a bit. Jonathan, who is a more experienced data scientist, thought that what we were doing was really cool and helped us transition from academia to industry: making our package more standardized, following the pandas and scikit-learn APIs and conventions, to basically enable all data scientists to use it, not just a bunch of scientists in some little corner of the world. And Mario is really using our tools to create cool dashboards without caring about scalability and dataset size. Maarten and I were really lucky when we started working together: we worked on the dataset that came from the Gaia satellite that the European Space Agency launched. It was 20 years in the making, and we were really, really fortunate to be the first ones to explore it. The thing is, it was kind of big for a bunch of academics: it had over 1 billion stars, and when you give stars to astronomers, we want to plot them and see how the sky looks. We tried to do that, and this is what happened. Our boss was not very happy, because she
spent 20 years helping to produce this dataset, and we couldn't even visualize it. So we thought: instead of just plotting points that will end up overlapping each other, how about making histograms? When we started making histograms, we started to see structure; the more data we added, the more structure we could see. And we figured out that you can actually build histograms really fast, and you don't really need that much memory. This is the foundation of the vaex library: really fast computations on aggregated data, which enables us to use data as big as we can fit on our hard drive. This is where my slides end, and I will switch to an actual live demo; I think the best way to showcase something is to show how it works in practice. Is this visible enough? Yes? No? Okay. We took great inspiration from pandas; pandas really showed us the way a dataframe library should look, and we really tried to follow their example in the API design. As we developed, we needed more things to do our jobs, so we started to add machine learning capabilities, and there we followed the scikit-learn conventions, so we're almost fully scikit-learn compatible. So let me try to demo vaex for you. Vaex is built on a few concepts: memory-mappable storage, an expression system, delayed computations, and efficient algorithms. As I go through this demo, I will try to explain these concepts, and later on we'll see how they work in practice. For this demo I will use a very large dataset from the New York yellow cab taxi company. It stores all the trips that all the yellow cabs made in New York City between 2009 and 2015. You can actually get the data until, I think, 2018, but something changed in their data system after 2015, so I decided to ignore that portion; this is big enough anyway. So let me show you how large this file is when I merge all this data into one file:
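The core trick described above, counting points on a grid instead of plotting them, can be sketched with plain NumPy: `np.histogram2d` stands in here for vaex's binned aggregations (the data and grid size are illustrative):

```python
import numpy as np

# A million random points stand in for a billion stars or taxi pickups.
rng = np.random.RandomState(0)
x, y = rng.randn(1_000_000), rng.randn(1_000_000)

# Count the points on a 256x256 grid: the memory cost depends only on the
# grid shape, not on how many points went in.
counts, xedges, yedges = np.histogram2d(
    x, y, bins=256, range=[[-4, 4], [-4, 4]]
)
# `counts` can now be rendered as an image by any plotting tool, e.g.
# matplotlib's imshow, regardless of the original number of points.
```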
it is about 100 gigabytes, and even though I have a very nice MacBook, it's kind of off the shelf, and I definitely don't have 100 gigabytes of RAM to play around with; but I will play around with it anyway. So let me open this dataset. Because HDF5 allows memory mapping, when I open the file it doesn't really do anything: it just points the operating system at it and says, look, the file is here and this is the structure, there are a bunch of columns and a bunch of rows. When I want to display it, all it does is give me a preview: by default the first 5 and the last 5 rows. Even though I have over a billion rows, I see them instantaneously, because I do not actually need to see all of them, I just want a preview. So you can open, in quotes, a 100-gigabyte file instantly. You can also do dataframe.info, which gives you a little bit more: you can put custom metadata information, and you get information about the column structure and types, and again a preview. Vaex follows the standard dataframe API, as I said, showcased by pandas. For example, you can look at a specific column in this file, say the trip distance, and again, even though I have a billion rows, sorry, I see this instantly, because thanks to the memory mapping I only materialize a preview, the first and the last 5 values, plus the data type. For example, this one is a datetime, the pickup datetime of each taxi trip in the dataframe. But what happens when I do certain operations? These are also instantaneous, and this leads me to expressions. When I write an expression, and by expression I mean a real mathematical expression (here I take the log of something and add the square root of something else), I do this on the fly: what vaex does is memorize the expression, the formula required to generate the result, and only evaluates it when needed. This is kind of like a computational graph, something that
neural network libraries have been doing for quite some time. So this is instant, because all I need to do is calculate 10 values, not the whole billion; I will only calculate the whole billion when I actually need it, for example when calculating the mean or some other aggregate statistic. This expression system is really the core of Vaex. So, for example, let's put ourselves in the position of a taxi driver, or the CEO of a taxi company, and we want to know the total amount divided by the trip distance. This is nearly instantaneous, because again I'm just calculating 10 values. I can save this just as I would with a normal dataframe in Pandas, and here "fee over distance" is what I call a virtual column, because it doesn't store the entire output; it only stores the mathematical expression needed to generate it. If I scroll to the far right, I see it here. So what happens if I want to calculate the mean? Well, then I actually do use the full billion-row dataset, and because we use fast algorithms and everything is fully parallelized, I can calculate the mean of the distance in, I'd say, 3 seconds. How about the mean of fee over distance, which is now a virtual expression? Now two things happen: first I need to compute this fee over distance, and then from that calculate the mean. But oh, there is a NaN, and because this is fee divided by distance, probably there was a 0 somewhere in the distance. So let's do a filtering, just as you would in Pandas: let's select all the trip distances that are bigger than 0 and compute this. And now everything is done for a billion rows, super fast, in 4 seconds; a billion rows, no memory used whatsoever. We can now do named selections. Named selections are really cool; this is what hooked me onto Vaex. You can specify a filter, let's say trip distance greater than some value, say 10, and give it a name.
Then, for every aggregation that Vaex supports, you can basically say what the selection is and get the result for that particular filter, so you can imagine the flexibility. Where Vaex really shines, as I showed you with the sky plot slide at the beginning, is binned statistics. We can count really efficiently, as we saw; that was just the number of rows, but we can also count on a grid. Counting on a one-dimensional grid is basically creating a histogram; we can say what the limits are and the shape of the grid, basically the number of bins, and then we can use standard plotting tools, your favorite tool, matplotlib or whatever, to plot it. We can do this in two dimensions, so a two-dimensional grid, and I'm doing this in real time for the whole billion points. It took about 7 seconds, and now I can visualize it, and well, you can kind of see New York, and this is kind of Manhattan. We offer wrappers around matplotlib, so you can do this nicely and get a histogram really quickly, with all the correct axes, labels and shapes, also in two dimensions, and now you can see New York much more nicely. You can see where people took their taxis from; these are all the pickup locations, and you can see all the streets; it's a really nice graphic. You can also pass the selections that I made earlier. And how long would it take to make two images with a billion points with different filters? Around three seconds. So this really enables interactive filtering and selections with basically a billion points. I'm going to stop there; this gives you the overall idea behind Vaex and how one would use it. But what I would like to do in the next part is an actual data science example, because maybe this only works really fast in a very curated setting, because I was at home rehearsing this presentation; I thought it would be more convincing and informative if we do a quick data science project together, on a big dataset, on my local machine, in real time.
So now the actual scary demo starts; it will be scary for me, fun for you, and this is going to be my disclaimer: in this demo I'm going to do a data science project with this taxi dataset in front of you, and it's going to take as long as it takes, hopefully within the time I have allocated. Some steps may take a minute or two, because I'm going to be using the full dataset in real time, and because I'm partly doing this to show you a nice data science project and partly to show Vaex, I'm not always going to make the best data science decisions in terms of cross-validation and maybe building the right features and so on. Okay, let's start. Oh, by the way, if you have any questions in the meantime, you can ask at the end, but you can also stop me in the middle. I'm going to load the same dataset again; there are 1.1 billion samples. I can print the head and tail and see what the dataframe contains; we'll come back to that later. The dataframe, as you can see here, is roughly ordered by pickup date, and I'm going to try to make a machine learning model out of it, so I'm going to split it by time. Here I've been really lazy: I basically know the index where 2015 starts, so I'm going to take that as the test set, and everything before 2015 I'm going to use as the train set. So I have this dataset, and describe, and this is the longest part of this presentation. What it does, for all these columns, and I think there are about 11 or 12 of them: we have the vendor ID, pickup and drop-off dates, number of passengers, how the passengers paid, trip distance, where they were picked up from, longitude and latitude. I'm actually not sure what this ratecode is, and this other one is, if they paid with a credit card, whether the card was charged right away or whether the taxi had to go to a place with wifi before the transaction occurred; then drop-off longitude and latitude, fare amount, additional charges, taxes, if there was any tip, if
there were any tolls to pay, and the total amount paid. So for all of these columns, describe basically computes the mean and median, whether there are any missing values, the min and max; it really gives you a nice overview of the dataframe. This takes a bit of time because I'm doing it for a billion points, but at the same time I can use it as an opportunity to show you how much RAM I'm actually using. So far I'm only using 2 gigabytes, even though my full file is over 100 gigabytes in size. From time to time I will switch here, and you can see that my CPU is running at full throttle; everything is fully parallelized. It's going to take a bit of time, so maybe we can use this time: does anybody have any questions so far, or is everything clear, or is everybody fully lost? Yes please, microphone please. "Right at the start you said HDF5 allows you to memory map it. You can memory map any file, so is HDF5 specially designed to work well with memory mapping?" Well, I don't know if that was the point initially, but it allows us to do this. I forgot to say, we also support Arrow; the Apache Arrow format also allows you to do this, and we're fully compatible. It's just that I had this file from a while ago, so I used HDF5, but you can equally use Apache Arrow, and actually the development of Apache Arrow allows us to do more interesting things, like storing lists and dicts, which are kind of tricky to store in HDF5 in a memory-mappable way. But HDF5 is not the only option, and we would like to move more and more to Arrow as it gets better and more stable. Okay, so this is done: we calculated all the statistics for all the columns in 2.5 minutes, which is kind of impressive given that we have a billion-point dataframe and everything was done out of core. So now we can do a little bit of data science by exploring what everything is, because we don't really know much about this dataset; let's go over it slowly. First, let's see if there are any missing
values, and by missing I mean what Pandas defines as a missing value, an NA, so either a NumPy NaN or a None. Here there is one pickup latitude missing out of all of this. The rate code I really don't know what it is, so I'm going to ignore that column. For people that paid with credit cards, I guess it is not important whether they paid right away or were charged a bit later. Some drop-off longitudes and latitudes and some of the tolls are missing. Okay, I'm going to drop the missing values from the drop-off longitude and latitude and the pickup latitude, and because every operation I do is a virtual column or an expression, nothing gets done right away. As you see, the drop appears to happen immediately, but I'm just storing the command, saying: drop the missing values when you actually need to do some calculation; for now, just save the command. Okay, so now I look at the number of passengers, and I see the minimum number of passengers is zero, so maybe somebody was shipping something and no people were traveling, but the maximum is 255, so that's kind of tricky. If you're a data scientist and you're using Pandas all the time, probably your favorite method is value counts, I know it's mine, so it's really important to have a fast value counts. Let's do that, and because Vaex is really aimed at big data, there is a progress bar, so you know more or less how long you're going to wait if you have ridiculously big data, on a single computer at least. So this is what we're doing now. Okay, 20 seconds, and we see, oh sorry, this is the trip distance, wrong thing, let's do it again. Alright, so mostly we had taxi trips with one passenger, then 2, 5, 3, 4, 6, and after 6 we have many with zero, so maybe taxis were also used to ship things; after that we have these spurious random numbers, so either there were mistakes or the taxi drivers were having fun taking notes. So I'm going to filter on the number of passengers.
I take the taxi trips that had some passengers, and fewer than 7. Then we want to look at the trip distances. We've already calculated this; it's just to show you the performance when we don't have a categorical variable, the trip distances are just floats, and it also works really fast, it took 20 seconds. We notice that there are lots of trip distances that are zero, so that's kind of weird. Let's make a plot; we want to see the distribution of distances on a log scale, and there are even some negative distances and some very large ones. In fact, if I just take the maximum, it is a ridiculously big number in terms of miles, this data comes from the US; it is 67 times the distance to the Moon, or half the distance to Mars. So kind of large, probably fake. Let's give ourselves a limit, say from 0 to 20, and this is how the histogram looks now. Here I decided to take, let's say, all the trips within 10 miles; it's kind of where the histogram flattens out, and I really want to focus on the core of the taxi dataset. So what is the extent of New York? These are the New York City cabs, but do they really cover the whole of New York? I want to see, so I'm going to plot the taxi trips, the pickup locations, but I really want to be able to play around, so we have this widget, with the help of IPython widgets. This should be New York, but we're really affected by some outliers, so I can interactively zoom; oops, I don't have a mouse, so I have to do this a bunch of times, and this is not pre-computed: as I'm zooming, the grid, effectively the 2D histogram, is computed on the fly, and you can see a nice progress bar. So I'm zooming, zooming, zooming, you can kind of start to see New York, and there it is. We can see, and now I'm familiar with New York although I personally have not been there, this is Manhattan and this is the rest of New York, so the city cabs covered mostly Manhattan rather than the full New York, but that's fine. Now we can zoom in even more if we want
to explore these hotspots, and if you look at Google Maps or some other map provider, you can see that these hotspots are actually either major hotels, or train stations, or intersections between buses and subways. So this really allows you to interactively look at plots, even for this big data, without using much memory. So I played around with this and decided these will be the edges of my bounding box of New York, and I'm going to make a filter just like I normally would. Now I'm going to create some features. I'm going to look at the mean speed, so the trip distance divided by the time it took, in hours, kind of a natural unit. Then I'm going to look at the trip duration in minutes, so just the difference between the drop-off and pickup times, in minutes. And I'm going to look at that metric we looked at before, fare divided by distance, because sometimes you can think, oh, more money is great; yeah, but maybe you need to travel somewhere very far, and then you have expenses, you lose time, you lose petrol, and so on. So fare divided by distance really seems like a good metric for profit. Okay, now let's look at the histogram of trip durations. I'm going to make another plot, so now I'm computing on the fly this expression that I've just defined above and binning it. Let's check; oh okay, it's done. So wow, there are some really, really long trips that make no sense. I'm going to make another plot, zooming in between 0 and 100 minutes, and this is how it looks. There is not really a clear cut-off here where I can say, okay, from here the spurious results start, but it's New York; I highly doubt somebody would spend a thousand minutes in a taxi. I honestly don't know how patient New Yorkers are, but I'll be generous and say between 0 and 2 hours. If somebody really wants to spend more than 2 hours in a taxi in New York, then yeah, my model will probably not take that into account. So I'll do that filtering as well. Now I
want to see, let's say as the CEO of a taxi company, what the most profitable times are. I'm going to execute this first, so we don't wait while I explain. I'm going to extract from the datetime the hour of pickup, the same as you would do in Pandas; I'm going to extract the day of week, and the month; here I subtract 1 so I keep the zero-based encoding, so January is not 1 but 0. Then I will make a simple feature just to check whether a taxi trip occurred on a weekend: if the pickup day was Saturday or Sunday, make it 1, otherwise 0. Then I'm using this categorize method; what it does is, even though I have integers, 0 to 6 for the days and 0 to 11 for the months, Vaex treats them as categories, so I can easily bin on them rather than trying to find splits and so on. Because Vaex knows these are categories, I can quickly plot a map of pickup time versus day of week. We can see that in the early hours on the working days there are not that many trips; as we go towards the evening there are more; and on Saturday and Sunday, well, this is actually Thursday and Friday night, because this is midnight, and Saturday night basically, we have the most taxi trips. We can do the same per month, and for some reason Saturdays in March and in October are really popular for taxis; I have not figured out why. In July and August there are not that many taxis, maybe people are on holiday. Anyway, you can make these kinds of grids really quickly; it took 30 seconds to bin this, in real time basically. Now I want to do some more drill-down work. I want to group by, a standard thing in Pandas, by hour. Let's say I am a part-time taxi driver and I want to see when it is most worth it for me to do this work: at which hours of the day can I get the most tip, and when is this fare-per-distance metric maximized?
Are the trips shorter or longer, when can I drive fastest or slowest, some ideas like this. This works the same way as a group-by in Pandas, and here this is just default plotting code. We are able to do this in just over a minute for a billion points. We can see the mean tip amount: people in the morning tend to tip a bit more, as do people that travel just after midnight. Fare by distance is probably affected by traffic jams around 3 or 4, when people leave work, which kind of makes sense. Mean trip duration is kind of the inverse of trip speed, which makes sense: if you drive fast, you get there faster. In the early hours of the day, 4 or 5, the taxis drive fastest because there is no traffic, and in the peak hours, the traffic jam hours of the day, they earn more money but they drive slower. We can do the same per day of week, and we see the speed is higher on Sunday, probably less traffic, and if you are a taxi driver, it is most worth it for you to work on Thursday; however, I have not computed the error or the standard deviation, so I don't know how significant this is, but keep in mind that it is computed for more than a billion points, so maybe it averages out. Density maps we already saw, so I am just going to plot another density map; now it takes a bit longer, because I have a lot of features and I have made a lot of filters. We can see how the pickup density of New York looks after all this filtering; it looks like this. I am going to skip this, but, oops, for us, maybe as taxi drivers, we want to know where it is most worth it to pick up people. What I am doing here is plotting the pickup longitude and latitude, but instead of simply displaying the counts, I am displaying the mean of fare by distance, because where that value is maximal, it is most worth it for me. So I now really see the streets where I get the maximum fare divided by distance.
If I try to pick up passengers along these routes, and this is one of the airports, this is another airport, and this is kind of the main vein of the city, I get the most profit. So that was the exploration phase; now let's do a little bit of machine learning, it's fun. We have a module in Vaex called vaex.ml, where we provide machine learning APIs that are fully out of core and that can help us pre-process data very quickly. I have taken some inspiration from Kaggle; there was a similar competition to predict taxi trip durations, or taxi trip tip amounts, I can't remember. So here are the ideas. I'm going to make a custom function that calculates the arc distance between two points, and to test it out we'll calculate the distance between Basel and Utrecht, where I work. This is in miles, because it's an American dataset, and it is about right; if you check Google, you will see. But you have to be careful: it's not the straight-line distance, it's the actual arc distance along the globe. Now I can add this as a virtual column in Vaex, and if I scroll far right, this is a complicated expression; there are cosines and a bunch of arithmetic, so it can take a while when we actually compute it. What Vaex also supports is jitting, which stands for just-in-time compiling, and we can use Numba: this expression is compiled down to fast native code, so when you execute it you get almost two times faster, because all the low-level computations run in compiled code. We support CUDA as well: if you use CUDA instead of Numba, and you have an NVIDIA GPU, you can use the GPU for these parallelized low-level computations too. I cannot demo that, because I have a Mac, unfortunately. So this is one virtual column that I'm adding; I'm going to add another, this one to calculate the angle, another complicated expression that I can also jit to speed up. And now I can just display the head of the dataframe to make sure that I have all these new columns that I've created.
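The exact arc-distance function from the demo isn't shown here, but a standard great-circle (haversine) formula in NumPy conveys the idea; in Vaex the same arithmetic could be written on expressions to get a virtual column, and optionally jitted:

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def arc_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula; vectorized."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return EARTH_RADIUS_MILES * 2 * np.arcsin(np.sqrt(a))

# Roughly Basel -> Utrecht; coordinates are approximate.
print(arc_distance(47.56, 7.59, 52.09, 5.12))  # a few hundred miles
```

Because it is built from NumPy ufuncs, the same expression applied to whole columns is exactly the kind of arithmetic the JIT step can accelerate.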
And I can check how much memory I'm using: 3 gigabytes so far, nothing that a standard laptop can't handle. Now I want to do a PCA, and the reason why, I'm going to execute this as well, is that if I look at the streets, you can see that New York, like most American cities, is kind of built on a grid. This is a tabular dataset, and we learned from the scikit-learn talk earlier that boosted trees work really well with tabular datasets, so we know that the first thing we're going to try is some sort of tree model. And this is a grid, so if the grid were aligned like a chessboard, it would really help the tree splitting algorithm decide the best splits faster. PCA does this: it transforms the data, trying to align it along the principal axes, and we're hoping that, because the map of New York is kind of like a grid, it will align the city on a more regular grid. Everything here is done out of core; it took about 30 seconds to compute and apply the PCA for the pickup location and the drop-off location. Now I'm just spending a little bit of time generating plots; I'm actually generating four plots, all with a billion points, in real time, so it might take a little while, but I think it will really illustrate the idea. Any questions at this point? Yes, please. "I have a question, maybe on the PCA: you would get the same result on a much smaller sub-sample; do you have an efficient way to extract, or run the algorithm on, a sub-sample of the data?" Yeah, you can just slice your dataframe as you normally would. "Instead of slicing, sampling uniformly at random, because maybe the data has some time dependency." You mean randomly sub-sampling? We don't support that at the moment, because randomly sub-sampling means generating in memory some index that you need to track, and we really try to avoid doing that. We don't have a built-in method, but I can add a description of how to do it.
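The PCA alignment idea can be illustrated in plain NumPy on a small in-memory array; vaex.ml's PCA computes the same statistics out of core:

```python
import numpy as np

# A tilted, elongated "street grid": points correlated in x and y.
rng = np.random.default_rng(2)
grid = rng.uniform(-1, 1, size=(10_000, 2)) * np.array([5.0, 0.5])
theta = np.radians(30)  # tilt the grid by 30 degrees
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
points = grid @ rot.T

# Principal axes are the eigenvectors of the covariance matrix;
# projecting onto them rotates the grid back into axis alignment.
centered = points - points.mean(axis=0)
cov = centered.T @ centered / len(centered)
eigvals, eigvecs = np.linalg.eigh(cov)
aligned = centered @ eigvecs

# After the rotation the two coordinates are uncorrelated again.
print(np.corrcoef(aligned.T)[0, 1])
```

Axis-aligned coordinates are exactly what helps a tree model find good axis-parallel splits.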
You would just need to pay the price of having one in-memory column. Okay, so this is how it looks, and you see the city, well, the grid, is now transformed; it is aligned more along the principal axes. If I were doing this to get optimal performance, I would probably isolate Manhattan and do the PCA just on that, so I could get the streets straightened out even more. Then I want to use the payment type as a feature, and I see that it is basically text, and sometimes it's all capital letters and sometimes a mix of capital and non-capital letters. So we're going to do value counts on this column, but first I'm going to do a string lower. Vaex also supports string operations in a very fast way, because we're using C bindings to do all these operations fully in parallel and extremely fast. So I'm trying to get all the possible values for the payment type, and on the documentation website of the taxi dataset I saw the meanings behind the categories they had already put in place: there is something cash, card, I guess this is cash as well, credit, unknown, and so on. Because there is a finite number of these values that I found in the dataset, and very few documented types, I can just make a map, like you would do in Pandas, and apply a map transformation really quickly. This uses a hash table in the background, so the mapping goes extremely fast, and it's also fully parallelized and fully out of core. Now, if I scroll to the right, I see the PCA columns and the transformed payment type. And once again, well, this is one of two notebooks: this one is from the previous notebook that I showed and this is the current one, just because the question fits here: how do you restrict memory usage, and can you still run out of memory?
I'm not really holding things in memory, because everything is done in chunks; most operations support chunking, so I don't have to keep everything in memory. I hold one chunk at a time, calculate the statistics that I need, select the next chunk, update the statistics, and so on. So actually, if I had 8 gigabytes of RAM, it would make no difference; most of these things are CPU limited, or SSD read speed limited. Now I'm going to use some estimator; I'm going to use LightGBM, so in Vaex we provide bindings to various libraries. This warning is just because I've not installed it properly or something; it is not a fork, it's just a binding to LightGBM, and if you install it using conda or pip or from source, it will just work. I'm using some default parameters and instantiating the wrapper; it supports all the default parameters that you would use normally, or, if you use the scikit-learn API, you don't even have to use the wrapper. Because LightGBM is still not out of core, I'm just going to take the first million samples, so I can train this in real time and show it to you. This is me training it; it will take a second or two, it's really fast for a million. And I'm going to do a prediction: the booster has the standard predict method, where you get an in-memory output of your predictions; here I'm trying to predict the duration of the taxi trips. But there is also transform, which scikit-learn usually uses for transformations of data; in Vaex, what it means is that it adds a virtual column to the end of the dataframe, so you don't actually have to hold the predictions in memory, you only have them when you need them. This is really cool, because if you want to add additional models, it's quite simple: everything here just stores the model that generates the prediction, and we can execute it anytime; so we can, let's say, evaluate it.
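The chunked, constant-memory execution described in that answer boils down to streaming aggregation; a sketch of the principle for a mean, in plain Python:

```python
import numpy as np

def chunked_mean(chunks):
    """Running mean over a stream of chunks: constant memory, one pass."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += chunk.sum()   # per-chunk partial statistic
        count += chunk.size    # update the running state, discard the chunk
    return total / count

# Simulate a dataset far larger than any single chunk held in memory.
stream = (np.full(1_000, float(i)) for i in range(10))  # chunks of 0.0 .. 9.0
print(chunked_mean(stream))   # 4.5, same as the mean over all values at once
```

Only the running totals live in RAM, which is why the dataset size, 100 GB or 8 GB, makes no difference to memory use.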
These numbers are more or less meaningless, because we've not done cross-validation, but we can easily add a second estimator. Let's say we want to try a linear model, just for fun. Trees work really well with label encoding and integer values, but for a linear model that's a problem, because those encodings imply an ordinality, so let's one-hot encode the payment type, the hour, the day and the month, which goes very quickly, and we can also standard scale, say, the distance, the direction, the trip duration and so on. So we support a number of transformers, not exactly the full scikit-learn list, because they have a big list, they've done a great job, but a subset of it that is fully out of core, so it doesn't matter how many samples you have. You can see my computer crunching numbers here, and this is the memory that I'm using: only 3 gigabytes for the entire dataset. Once this is done, I will print the head of the dataframe, and you can see all the columns that were added. Basically, when I do the fitting, all it does is memorize the command and the meta-parameters, like the mean, the median, the number of unique categories that you have, and only when you do the transform does the actual transformation happen, in chunks; that's how you get away from the memory problem. As I said, this is an honest demo, so sometimes we have to wait, which gives you a good insight into the performance when you work with a billion-row dataset: how long the typical operations you may do will take. And here it is, the computations are finished; as I scroll to the right, you see that, because I'm keeping everything in the same dataframe, I have nice labels for all the transformations I did, so you know at all times what is what, and here are the standard scaled columns. We don't support all the scikit-learn transformers, because some are quite complicated, but we do provide a wrapper for all scikit-learn estimators, so you can just install scikit-learn through conda or pip or your favorite way of installing it,
and you can use any scikit-learn estimator within Vaex. They will not be out of core yet, but you can use them once you have done your aggregations. Here I'm just importing the linear model, I'm going to select the features that I want, and again I'm training only on 1 million samples, but when I do the transformation here, I'm actually applying the trained model to my full billion-row dataframe. When I scroll very far to the right, I have the latest linear prediction, but I also have the other prediction, from LightGBM, without wasting any memory, well, lots of columns here. What this allows me to do is quick ensembles: because all of these are virtual columns, I can just do a simple mean of the predictions, like this. So this is the final part of the demo. If you were bored until now with all these dataframes, expressions and so on: normally, if I want to put this in production, I have to leave the notebook that we all know and love and rebuild everything with pipelines and transformers and all the rest. But because what we're basically doing is creating a computational graph, every operation that we saved so far is written down as an expression, a mathematical formula, and that's what we call a state. So we can simply save the state into a JSON file, like this, and now, this is only going to take a second, I'm going to open a new notebook, a new kernel, fully independent of the previous one. I'm going to open just the test set, and you can see here that there is nothing applied; it's just the raw dataset that we started with. All I need to do, this is LightGBM loading, is load the state, and all the operations we've done so far are transferred onto the test set. Now everything is computed for the first time; this is the test set, I'm printing 21 elements, and we see all the transformations, the PCA, the one-hot encoding, all the way to the final ensemble prediction.
Now we can also push this with our REST API. So this is a way of creating a pipeline without actually explicitly creating a pipeline; in principle, you don't even have to leave the Jupyter notebooks, where you can just experiment, get immediate feedback, and try out more models and parameters. I hope you enjoyed it, thank you very much. Now a few questions and we'll move on. Oh, by the way, this is all open source, for research, personal and commercial use; we also run a consultancy, but the software itself is fully open source and you can use it and experiment with it. "Great talk, thank you. I have a few questions about this last piece with the pipelines, because you said you've stored all the computational steps, and one of those steps was, for example, to train and transform with a PCA. So if I apply this same state to a new dataset, am I retraining a new PCA?" No, because you do a fit as you would do a scikit-learn fit, so you've memorized the eigenvalues and the eigenvectors; all those things are stored, and you're just applying the transformation to a new dataset, as if you were using scikit-learn. "So does the export then contain the scikit-learn pipeline?" Yes, the export is just a JSON file that contains the order of what you need to do, but also these meta-values, like the means, the eigenvectors, and, for example for the encoding, the unique categories; everything you need to transform the dataset. "That's great, thank you." "How does it compare to Dask?" Do you mean Dask or Dask DataFrame? Where are you? "Here, just in front. Both."
If you mean Dask: Dask is like NumPy but distributed, so maybe Vaex will use Dask one day to become distributed; they're not actually competing. So far we're using NumPy for most of the computations, but we have plans to maybe support Dask, so you can run this on a cluster if you really want to go to hundreds of terabytes of data. Dask DataFrame is really nice, but it's a pain to install and, well, I found it hard to manage. On benchmarks, from my tests on this dataset, and on another big dataset, with airline companies I think, we were a bit faster, without any cluster, just on a single machine. And you can't really compare: on a single machine you just wouldn't be able to run all the stuff that I ran, and on a cluster we're talking about different things, right? "Hey, nice talk, thanks. Do you do some sort of predicate pushdown on Vaex operations?" Sorry? "You have a lazy graph, so do you do some predicate pushdown or query optimization at some point?"
— Yes. All the operations — because now I'm interactively going and deciding I want to calculate this, I want to calculate that — there is an algorithm (I did not mention this because I only have so much time) that evaluates the computational graph in the background and tries to go over the data in as few passes as possible. So we're basically trying to optimize everything. Well, sometimes it's not possible, depending on the steps you want to take, but for simpler computational graphs we try to do it with one pass over the data, and if that's not possible, with as few passes over the data as possible. You can even visualize the computational graph — this is still experimental and I didn't show it — and see what depends on what, a bit like the computational graph of a neural network.

— Maybe I missed it, but do you also have joins? — Yes, we have basic joins, kind of SQL-like joins. The proper joins are currently in a PR, where we're trying to use hash tables to make it really fast and also out-of-core. It's coming over the next few months — it's planned — and you can check out the branch and play around with it if you want. — Cool, very nice, thanks.

— Excuse me, is the next speaker here? — Yes, you can catch me, I'm here today and tomorrow. — We are looking for the next speaker; if he's around, please come up and set up your computer. Okay, I guess we have time for one more question, if any.

— My question goes in the same direction as the previous one: do you support indexes on the data side, for example for filtering? — Indexes — it depends on the type of filtering. The explicit index, the way pandas does it, we don't support yet. We basically didn't want to add another row to memory; we want to make this as memory-efficient as possible. That comes with some drawbacks, because you cannot do indexed lookups, but really, if you have big datasets in the sense of what I just described, in our experience this tends to be more detrimental than helpful.
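A dictionary-based counter illustrates how value counts can work in a single pass over chunked data, without ever materializing a per-row index — a toy sketch of the idea, not Vaex's implementation:

```python
from collections import Counter

def value_counts(chunks):
    """Single-pass value counts over chunked data using a hash map,
    keeping no explicit index column in memory."""
    counts = Counter()
    for chunk in chunks:  # chunks could come from disk, one block at a time
        counts.update(chunk)
    return counts

# Simulate a column that arrives in blocks rather than as one big array.
chunks = [["a", "b", "a"], ["b", "b", "c"]]
print(value_counts(chunks))  # Counter({'b': 3, 'a': 2, 'c': 1})
```

Only the distinct values and their counts live in memory, which is the trade-off being described: no random-access index, but aggregations stay cheap.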
We really focus on aggregations and group-by and value counts and histograms and so on, but we are looking into very fast ways to find the data you're looking for without an explicit index. There are lots of methods that use hash maps in the background — pandas does this as well for value counts, and we've taken inspiration from that — precisely to avoid the explicit index: if you have a billion rows, an index adds 8 GB of RAM just to keep it in memory, and we really, really don't want to do that unless absolutely necessary. But you can add it yourself if you want — if you have the RAM money to pay, you can do it yourself.

— We have to go to the next speaker, yes. Thank you very much. Thank you very much for your talk.

— So... can you hear me? Okay, great. So hello, thank you for being here. The first thing I want to say, even if Marc already said it, is that we are giving a concert at the social event tomorrow, and in this concert we are going to use Python heavily, even if it's not very visible in the concert. So I proposed this talk to explain how we are using Python for making music. This is the theoretical part, and if you don't believe that it works, just come tomorrow to the social event and listen.

A very usual question when you meet someone at a conference like this one is: what are you using Python for? And my answer is a little bit less usual: well, mainly for real-time audio processing in a live music context. This is quite unexpected, and it sometimes triggers reactions like: what, are you crazy? Why Python for this kind of thing? And as it happens, the answers are very easy. Are you crazy? Yes, we are, definitely. And why Python? Because it's fun. So I could stop here — thank you for your attention, have a nice meal — but as I was lucky enough to be allotted a 35-minute slot for this talk, I think I can get into some more detail than that.

First, some elements of context. My name is Mathieu Amiguet, I'm a musician and a developer. I'm artistic
director at Les Chemins de Traverse, jointly with Barbara Minder. Les Chemins de Traverse is a collective of musicians working in a variety of styles, from Renaissance repertoire to algorithmic composition — by the way, this music was generated by Python, but that's not at all what I'm going to talk about today.

One thing we've been researching quite a lot for the last decade or so is augmented instruments. What do I mean by that? It's taking an acoustic instrument — a flute, violin, piano, or something like that — and trying to extend the sonic possibilities of this instrument using new technologies, and especially computers. So why the strange name, augmented instruments? It actually comes from augmented reality: there, we mix real-time views of the world with synthetic information added to the image, and augmented instruments do the same — they mix the real-time acoustic sound of the instrument with processed audio. So in a sense, augmented instruments are augmented reality applied to music.

As a side note — it's not that important for this talk, but it's very important for our research — we decided to use only free software, so we are making music with Linux and free software. That's not a very common choice in the music world, but I guess we would have ended up using Python even without this restriction.

Anyway, this definition of augmented instruments is a little bit theoretical, so I will show you a set of examples. The first one is very, very simple. You have a musician — I picture a flute, because that's the instrument I play, but it could be any instrument — who plays through a speaker: you have a set of microphones, wires, an amplifier and everything, and this goes to a speaker. In a very simple setup you could just add a delay module that will, as its name suggests, delay the sound in time — a time shifting of the sound. And the musician can have a foot controller — he needs a foot controller because his hands are already busy playing the instrument — a foot controller
to control the length of the delay. Even with a very simple setup like this, you can already do some interesting things. Not bad for a very simple setup and one flute playing — this really is one flute playing with itself; there's no pre-recorded sound or anything. I'm not sure Telemann had envisioned this way of playing his music, but actually it works pretty well.

For this kind of setup you really don't need a computer; you can do it with a hardware pedal, and it's cheaper and easier. But if you get a little bit crazy with delays and begin to have several delays wired in strange ways, with the delay times linked to one another, it's not so clear that hardware pedals are still the better option. In this example, with a set of four delays set up in the right manner — and if you play the right notes at the right time — you can get some interesting effects. By the way, this is an excerpt of a piece we are going to play tomorrow at the social event, so if you like it, just come to the concert.

So we are quickly reaching the point where it might be more reasonable to use a computer instead of multiple hardware pedals, but it's still relatively easy to do with stock software: just take existing software, wire it the right way, and you can play.

The next example is a little bit more complicated: a complex piece of music with a strong architecture — a beginning, a middle, an end, a real evolution — and many things happening on the technical side: many volumes changing, loops being recorded, loops being triggered. It becomes impractical for the musician to control all the details himself. So either we have a technician who does all the knob turning and button pressing while the musician plays — but that's not exactly what we want, because we want augmented instruments to be musical instruments that can be played by one person — or the other possibility is to have choices that are made in advance and encoded in the computer, one way or another.
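A delay module like the ones in these setups is conceptually tiny: the output is the input plus a time-shifted, attenuated copy fed back on itself. A NumPy sketch of a feedback delay working offline on a buffer of samples (the real thing runs sample-by-sample in real time):

```python
import numpy as np

def feedback_delay(signal, delay_samples, feedback=0.5):
    """Mix the signal with delayed, attenuated copies of itself."""
    out = np.zeros(len(signal))
    for n in range(len(signal)):
        out[n] = signal[n]
        if n >= delay_samples:
            # Feed the *output* back in, so echoes repeat and decay.
            out[n] += feedback * out[n - delay_samples]
    return out

# An impulse turns into a train of echoes, each quieter than the last:
impulse = np.zeros(10)
impulse[0] = 1.0
echoes = feedback_delay(impulse, delay_samples=3)
# echoes appear at samples 3, 6, 9 with amplitudes 0.5, 0.25, 0.125
```

Wiring several of these together, with the delay times linked, is exactly where hardware pedals stop scaling and a computer starts to make sense.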
So in this example we have a state machine, and when the musician presses the buttons of the foot controller, he triggers state changes — I'm going from this state to that state — and each transition triggers a set of actions, like changing volumes or recording loops. Many things happen, but the musician has only a few simple actions to make, and hopefully it frees up his head to make better music. Thank you.

As I said before, everything is played live; there are no pre-recorded sounds. I once played this piece at a wedding party, and afterwards someone — a professional musician — came to me and said, "that was nice, your karaoke-like piece." And I said, well, no, it's not really karaoke; you really have to understand that the idea is that everything is played live in the concert.

Also, we are slowly exiting the realm of existing, stock software: the state-machine box doesn't really exist with the right connections and everything, so we had to develop that part ourselves to play this piece.

Perhaps a last example. It's very similar, but there's an interesting thing: until now I showed you only loopers and delays — only time shifting, if you want. It's also possible to add effects of all kinds, or synthetic sounds, though synthetic sounds are not something we do much. In this one, something funny is happening: if you look at the bottom, a blue path — audio — goes through a looper and then through something we call an envelope follower, and what comes out is a red path. An audio path is transformed into a control path for another sound. That's a funny thing to do, and also something we had to develop ourselves. If you think of a solo flute piece, you probably don't picture this kind of sound — and that's exactly what we are trying to do: extend as much as possible the sonic possibilities of the instrument.
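An envelope follower — the block that turns an audio path into a control path — is essentially rectification plus smoothing: track how loud the signal is, ignoring how it wiggles. pyo has a built-in object for this; the sketch below is plain NumPy and offline, just to show the idea:

```python
import numpy as np

def envelope_follower(signal, smoothing=0.99):
    """Track the amplitude of a signal: rectify, then smooth.
    The output is a slowly varying control signal, not audio."""
    env = np.zeros(len(signal))
    level = 0.0
    for n, x in enumerate(np.abs(signal)):
        # Jump up to new peaks, decay smoothly between them.
        level = max(x, level * smoothing)
        env[n] = level
    return env

# A loud burst followed by silence: the envelope rises, then decays.
t = np.arange(2000)
burst = np.where(t < 500, np.sin(0.1 * t), 0.0)
env = envelope_follower(burst)
```

The resulting `env` can then drive any parameter of another sound — a volume, a filter, a delay time — which is exactly the blue-path-into-red-path trick described above.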
And actually, for a few years we had been doing this kind of thing and everything was going very smoothly, using partly existing software — free software, as I told you: SooperLooper, Guitarix, Rakarrack, this kind of thing — and partly custom fragments written in audio programming languages, specialized programming languages for audio. We mostly used ChucK, but we also had a few experiences with Pure Data, SuperCollider, Csound. And we would connect everything with JACK. I don't know if you're familiar with JACK — for once, it's one of the best recursive acronyms in the history of free software: the JACK Audio Connection Kit. It's an audio daemon that lets you connect different applications on the same computer, the same way you would connect rackable audio units with cables, but in software. It's very nice. And we would manage everything with bash scripts: simply launch the software we needed and connect everything. And everything was good — we thought.

But then we hit a wall. We had a big problem, and we realized we couldn't go on the same way; we had to change something very fundamental in the way we did things. What was the problem? We were able to play single songs, single tunes, very easily, but we couldn't go smoothly from one song to another. What we had to do was launch the right script, play the song, then go to the computer, quit everything, stop every sound, launch a new script, and then we could continue. And that's not nice: in a concert you sometimes want to crossfade from one song to another, and someone going to a computer, bending over and typing things in is not very nice to look at.

Of course, one possibility could have been some kind of mega-patch with every song encoded, every song ready to go, and you just move from one to another. But we had two problems with this. The first one is performance: if you have every possible song running in parallel, you are likely to have performance problems on your computer. And the other problem is that
we really wanted to have a modular approach: we compose songs, and then when we play gigs we say, well, I'm going to play that song and that song and that song — but maybe for another gig I will pick different songs. So we really had to have a modular way of implementing our songs and then reusing them in gigs, in set lists.

What we needed was some kind of gig framework — like a web framework, but for gigs; the Flask of the gig for a real musician, if you want. And what we realized is that this is something really, really difficult to do in audio programming languages. Audio programming languages are very good at programming audio — they'd better be — but they lack the high-level abstractions, the metaprogramming features, that make it easy to build something that looks even remotely like a framework.

So we did quite a lot of research, and finally we found pyo. Pyo is a dedicated Python module for digital signal processing. It's a very nice module, developed mainly by Olivier Bélanger at the Université de Montréal in Canada. I was already quite familiar with Python, and when I saw this I thought: well, it sounds nice — but if you know anything about real-time audio processing, you should be quite skeptical. Are you?
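The skepticism is grounded in simple arithmetic: a real-time audio callback must deliver each buffer before the hardware drains it, or you hear a click. A quick calculation of the deadlines involved, with typical settings assumed:

```python
def buffer_deadline_ms(buffer_size, sample_rate):
    """Time available to compute one buffer of audio before a dropout."""
    return 1000.0 * buffer_size / sample_rate

# Typical low-latency settings: miss the deadline once and you hear a click.
for buf in (64, 256, 1024):
    print(f"{buf:5d} samples @ 44100 Hz -> "
          f"{buffer_deadline_ms(buf, 44100):.1f} ms")
# 256 samples at 44100 Hz leaves roughly 5.8 ms per buffer
```

A few milliseconds is uncomfortably close to the pauses a garbage collector can introduce — which is why the answer to "how can this possibly work in Python?" comes next.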
You should be quite skeptical, because it's very likely that Python is too slow for real-time audio, and even if it's not too slow, things like memory management and garbage collection are very likely to introduce too much latency — and then you get clicks in your audio, and that's not nice.

However, pyo does work, because it works more or less like a marble run — this one. The idea is that you have blocks and you can build paths with these blocks, and if you drop a marble on a finished path, it will just follow the path at its own speed, even if you were slow to build the path. You can build a second path while marbles are running down the first one, and then just switch to the other. You only have to be a little careful at the moment of the switch, because if there's a marble in flight at that time, it will fall off. So you can build things relatively slowly and then have the path run at a higher speed. And that's exactly what pyo is doing: pyo has an audio engine implemented in C — very efficient, very lightweight, very nice — and there are bindings to Python that give you building blocks, and hooks to change things in all kinds of places. All the heavy work of dealing with audio samples, memory, and everything low-level is completely invisible; you just have the nice colored blocks, and you construct your path.

This is not a toy conference, this is a Python conference, so maybe I can get a little more precise about how it works. Take the first example I showed you, the Telemann canon played by one musician. How could we implement this in pyo? It's actually very easy. First you need some boilerplate code, but really not that much: an import, then create what's called a server — that's the audio engine — and later on you start the server and find a way to keep it alive, because the server runs on a different thread, and if you just say server start and stop there, the script will quit. So one way is launching a
GUI — there are other ways; we don't use a GUI on stage, so we don't launch one, but that's not that important.

Then we build the upper path in the drawing: just the sound of the musician going to the speaker. That's really easy. You create an Input object, which represents the audio stream coming from the input of the program, from the sound card; and on any audio stream in pyo, if you call the out method, it sends that stream to the output of the program. So this is a fully working program that just passes the sound through — not bad for what, one, two, three, four, five lines of code.

For the second path, the one that goes through the delay, it's not much more difficult. There are several delay objects in pyo; here I use the simple Delay. The first argument to a pyo object — to an audio stream — is its input, so here the delay takes its input from the Input object we created, and since we want the delay to go to the speaker too, we call the out method on it as well.

And we have a third path, the red one: I want to use a foot controller to tap-tempo the length of the delay. For this, the code uses Foococo — the foot controller controller — a small library I implemented to use the SoftStep foot controller with pyo. There's some boilerplate, but the interesting part is where I make an object that represents all the times I press button 1 on my foot controller. Then I make a timer object that computes the time between two successive presses: if I press, wait three seconds, and press again, it will contain the number 3 — and it's also a stream of data, which continuously carries this information. Then I just tell my delay object that the length of the delay will be the value of the timer. And this is the full implementation of what you see above. It's really usable in a concert — I mean, you have to do some work to set up your computer and deal with low-latency audio, that can be some work, but the code can work like this.
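The tap-tempo logic being described — a timer measuring the interval between two successive foot presses, whose value then drives the delay length — is tiny in plain Python. A sketch of just that logic, independent of pyo and of the Foococo/SoftStep hardware (the class and method names here are made up for illustration):

```python
class TapTempo:
    """Remember the time of the last press; each new press yields the interval."""
    def __init__(self):
        self.last = None
        self.interval = None  # seconds between the last two presses

    def press(self, now):
        if self.last is not None:
            self.interval = now - self.last  # this value drives the delay length
        self.last = now

tap = TapTempo()
tap.press(10.0)      # first press: nothing to measure yet
tap.press(13.0)      # second press, three seconds later
print(tap.interval)  # 3.0
```

In the real setup the same value is a continuous pyo data stream, so the delay follows the musician's foot with no explicit polling code.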
So we were very happy — but we still had the wall, because if I want to go to another song, I have to quit this script and launch another one, and I've gained nothing. But now I have Python, so I really could build some kind of framework. We thought: we have to model our gigs, our sets, in a simple way. So we said: our gigs will be modules, and we'll have some naming conventions. For instance, I write scenes equals a list, and that list holds the scenes — the tunes — I want to play in my gig. Scenes are also modules, which means I can take advantage of the dynamic importing capabilities of Python. Then some setup code: I'm saying, for this gig I will have two microphones and I will be able to crossfade from one to the other. And I have a kind of blackboard object that anyone can read or write — anyone meaning the gig and the scenes, the tunes. So for instance, in my gig I set up my microphone and then I say context.mic = mic, so I can access it from other parts of my code — taking advantage, of course, of the dynamic typing of Python.

And the scenes become very, very easy. A scene, as I said, is a module, and I can say: well, I need the expression pedal, and I want to have loops. Of course I can use all the features of Python: for instance, in this example several buttons of my foot controller had to behave in a similar fashion, so why not use a list comprehension to make all four of them in one go? You see that I use context.mic in the definition of my loops, and I also have decorators to hook into points in the lifecycle of the scene — when the scene is created, activated, deactivated, and so on.

And then it's very easy to have a master script — the core of our framework — that finds the gig. You call it on the command line with the name of the gig, so I launch it with the europython2019 gig and it will
find the right module, find the scenes in it, import every scene, and then I can register some events — for instance, when I press certain buttons of my foot controller, switch from one thing to another. With this I can really easily build the kind of gig framework I talked about, and it works pretty well. Of course, this is only the principle; the real code is longer, with error checking and things like that, but still, I think the whole framework is well under a thousand lines, which is really reasonable for this kind of thing.

This was possible thanks to very nice features of Python: dynamic typing, dynamic imports, decorators, code introspection, this kind of thing. To be completely honest, in the first version we also used some disreputable features — monkey patching, inspection of execution frames, all kinds of hacks — but we thought we needed them, and we had them, so we could have a prototype very quickly. After some months we thought, well, this is really, really ugly, we must do something about it, and we are getting rid of the ugliest hacks one by one. But still: all those features are there, and if you need to do something really unusual or strange, everything is there. That's something really nice about the Python language, I think.

So now we've found a way around the wall. There's still a long journey in front of us, but now we can go forward and explore new territories. We can go seamlessly from one scene to another, without sound interruption — and also, for those who know this kind of thing, with effect tails: if you have a long, long reverb and you switch to another scene or tune, you don't want the reverb to be cut, you want it to die away slowly. And everything works very well.

So my conclusion would be that the combination of Python and pyo really supports our creative process, in that it makes experimentation easy: when we have an idea — a musical idea — it's very easy to implement and test.
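The "gigs are modules, scenes are modules" convention described here rests on Python's dynamic import machinery. A minimal, self-contained sketch of the master-script idea — the module names and the scenes attribute are illustrative, not the real framework:

```python
import importlib
import sys
import types

def load_gig(name):
    """Import the gig module, then import every scene it declares."""
    gig = importlib.import_module(name)
    return [importlib.import_module(scene) for scene in gig.scenes]

# Fake two scene modules and a gig module in-process, to keep this runnable;
# in the real setup these would be .py files found on the import path.
for modname, title in [("scene_one", "Telemann canon"), ("scene_two", "Loops")]:
    mod = types.ModuleType(modname)
    mod.title = title
    sys.modules[modname] = mod

gig = types.ModuleType("europython_gig")
gig.scenes = ["scene_one", "scene_two"]   # the naming convention: a plain list
sys.modules["europython_gig"] = gig

scenes = load_gig("europython_gig")
print([s.title for s in scenes])  # ['Telemann canon', 'Loops']
```

Because a gig is just a module with a `scenes` list, reordering a set list or reusing a tune in another gig is a one-line edit — which is the modularity the talk was after.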
And this is very important, because we have many ideas, and to be honest, nine out of ten never reach the stage: we try them and say, well, no, that wasn't a good idea. So if we needed, I don't know, three, four, five days to implement an idea before we could test it, we simply wouldn't have the time to do it. With Python everything goes very fast; we have a very direct path from the initial idea to the prototype — and, well, most of the time the prototype is also the production code.

Another really great thing is that pyo is very actively developed. The main developer, Olivier Bélanger, is very, very dedicated to making pyo better and better. It happened many times that I was working on some code and suddenly I was blocked, and I would write to the mailing list saying, well, I'm trying to do this and this with pyo and I can't find how to do it. Usually I'd do this in the evening before going to sleep — and I live in Switzerland while Olivier Bélanger lives in Canada, which means he still had a long day in front of him. When I woke up the following morning, I would have an answer on the list: well, this was not possible, but it's now implemented — just check out the latest code. It really happened many times, and that's simply great. I know he couldn't be here today, but thank you, Olivier, for this great work.

So this combination of Python and pyo gives us the efficiency we really need — we do real-time audio — with all the flexibility of Python. It's also an unexpected use case for Python, and I think it really shows the versatility of the language and the ecosystem, and that's great.

Now, maybe an interesting question would be: we are very happy with this Python-plus-pyo solution, but what could possibly make us consider another one? I can see two places where I'm not completely satisfied and would consider changing. The first one is catching errors. If you
look at this code here, I have a callback that is called when I press a button on my foot controller, and as it happens, I made a typo in my callback code: I wanted to write loop-dot-set-something and I wrote something else. As I'm a very, very serious developer, I even documented my typo — though I don't always do that. And of course this typo is absolutely no problem when I launch my script; it's only when I press the foot controller that I get an error. Pyo is relatively resilient — it won't crash the whole thing, so even if this happens in a gig it's not the end of the world — but one thing is sure: it won't do what I intended, and that can be quite annoying. So I would appreciate tools that catch most of these errors before they are even executed.

Another thing is that, like many frameworks in imperative languages, pyo relies heavily on callbacks. Callbacks are fine — they work, we are used to them — but they are not always the best way of expressing ideas, and maybe it would be interesting to explore other ways of organizing things in time. Maybe I've read too much about Haskell: now I want errors caught at compile time and no callbacks, I don't know. Anyway, re-implementing our whole setups and gigs would be quite expensive, so we would really need very, very obvious advantages to move away from this solution. That was just to say what could be even better.

If you want to hear more music than the little excerpts you've heard, of course the best thing to do is come to the social event tomorrow — we are playing live. If you are the kind of old-fashioned person who still buys CDs, like me, you can buy a CD; I have a few with me, just come and see me. This is our latest album, with many augmented-instrument pieces, all backed by pyo. You can also listen to this album in dematerialized form — if that's the word — in MP3 format on Bandcamp. And if you really want to support the platforms instead of
supporting the musicians, you can also stream it from Spotify, Deezer, Google, and virtually any streaming platform. So that's it. Questions? I think we can take one or two, and of course I'm available after my talk to answer questions one on one. Thank you for your attention.

— Hello, and thank you for this insightful presentation. I'm just curious to know how you choose to annotate your music score, in order to know which foot button to press at what time. — Sorry, I didn't get it. — I'm curious how you annotate your music score — your partition — to know which foot button to press at what time. — That's a big problem: how to write this down. We do quite a lot of composing for augmented instruments, and the writing part is a real problem. Sometimes we just take music scores and annotate them with numbers or something like that; sometimes we use a completely different notation, because we have no use for the traditional five-line staff. Honestly, we don't really know, and sometimes it's even the code that becomes the score. We also do a lot of improvisation, on canvases, and sometimes we don't write anything at all — if we have a question, we go and look at the code and say, oh yes, we decided to have that and that and that. So that's a good question, but I don't really have an answer. — Another question? Okay. So, thank you very much.

...to process scientific images, rather than, for example, Instagram filters — it's really more for scientific needs. It's open source: all the content I'm going to cover today is open source, BSD or MIT licensed. And it's for Python — obviously, here at EuroPython — using NumPy arrays as images. Compared to other image-processing tools, one specificity is that scikit-image works well with 2D but also with 3D images, sometimes with nD images: in science you have MRI, CT, a lot of modalities where you get 3D images. And last but not least, scikit-image tries to have a consistent and simple API. So that's a good
question — I'm going to cover this a little bit in this slide. Here is a short overview of what you can do with scikit-image. It's image processing for science: basically, manipulating images in order to transform them for other purposes. When you want to filter them, you can — here you have a denoising example. When you want to extract information — feature extraction for further classification. When you want to extract objects — this is called segmentation. Or, after some processing, when you want to measure the size of objects, their shape: that is, transforming your images into numbers, out of which you can do science. That's what scikit-image does.

And this is what scikit-image is not: it's not a deep learning library, I'm afraid. There are really great deep learning libraries with image-processing capabilities — Keras, for example, has some nice image-processing utilities. The reason there is no deep learning in scikit-image is mostly architecture and maintenance choices: we chose to be a very maintainable library, well integrated into the NumPy/SciPy ecosystem. All the code is in Python or Cython; there is no GPU-specific code, for example. However, scikit-image interacts well with machine learning and deep learning, both for pre-processing and for post-processing: before deep learning you can do normalization or data augmentation, and after deep learning you can improve your segmentation, clean up instances, and so on and so forth.

Also, one thing we do not want in scikit-image is a lot of very bleeding-edge algorithms — like the one you just published during your PhD six months ago. It might be a really cool algorithm, but if we accepted them all, we would end up with a hundred denoising filters, and then how would our users find their way through the library? We want a short API, so that it's easy to find the functions, and therefore we let time do the Darwinian selection and choose the
algorithms which we include.

scikit-image is a full-fledged component of the scientific Python ecosystem, and as such it works with NumPy arrays, which are the images we process. So it interacts really well — this pointer does not work, it's very weak — it interacts really well with scikit-learn, because you can pass NumPy arrays from scikit-image to scikit-learn and vice versa, and it also interacts really well with the visualization libraries of this ecosystem, because once again it's this NumPy array object — kind of the lingua franca of the SciPy ecosystem — that we exchange and pass between all these modules.

Here is a very short glimpse of the kind of code you would write with scikit-image. I'm not going to do a big demo — you can find a lot of tutorials on YouTube, for example — but what you can see is that you first import submodules: the functions typically live inside submodules, for example io for input/output, reading an image from a file. This image will be a NumPy array — you see here I'm asking for its shape. And then the API is that you have functions, like this thresholding function, which take NumPy arrays as input and return either numbers or filtered images, which are again NumPy arrays. Like this function, which computes the connected components — from a binary image to connected components: the input is a NumPy array and the output is a NumPy array as well.

The NumPy array actually has all we need for image processing, because pixels are just array elements. So our API is really only functions working on images and returning images: the first argument is always a NumPy array, and then we have additional parameters — keyword arguments — if you want to tune the behavior of your function; we try to have sensible default values. Here I have an example with a 2D image and this block of code, but it would work with exactly the same syntax if you had a 3D array.
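The API shape being described — a function takes a NumPy array and returns a number or another array — is easy to see with thresholding. The real call with real images is `skimage.filters.threshold_otsu`; here is a pure-NumPy sketch of the same idea (Otsu's method: pick the threshold maximizing between-class variance) run on synthetic data, so it stays self-contained:

```python
import numpy as np

def threshold_otsu(image, nbins=256):
    """ndarray in, scalar out: the threshold maximizing between-class variance."""
    counts, bin_edges = np.histogram(image.ravel(), bins=nbins)
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    # Class weights and means for every candidate split point.
    weight1 = np.cumsum(counts)
    weight2 = np.cumsum(counts[::-1])[::-1]
    mean1 = np.cumsum(counts * centers) / weight1
    mean2 = (np.cumsum((counts * centers)[::-1]) / np.cumsum(counts[::-1]))[::-1]
    variance = weight1[:-1] * weight2[1:] * (mean1[:-1] - mean2[1:]) ** 2
    return centers[:-1][np.argmax(variance)]

# A synthetic bimodal "image": dark background near 0.2, bright objects near 0.8.
rng = np.random.default_rng(42)
image = np.concatenate([rng.normal(0.2, 0.05, 5000),
                        rng.normal(0.8, 0.05, 5000)]).reshape(100, 100)
t = threshold_otsu(image)
binary = image > t   # ndarray in, ndarray out -- the usual scikit-image pattern
```

Downstream functions like connected-component labeling then take `binary` — again just a NumPy array — which is why the whole library composes so cleanly.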
So pixels are array elements, and that allows us to use all the machinery of NumPy. Here it's just pixel indexing: changing the values of pixels, accessing a channel — an RGB image is a 3D array with three channels — but you can also do masking, fancy indexing, and so on. The API is simple: we have submodules, and the submodules have functions taking a NumPy array as input — 2D, sometimes 3D — and the output is a number or an array.

As for the API over time: we have converged to a quite consistent API. If you started using scikit-image a few years ago, it was maybe a bit more chaotic, but now, for example, all the denoising filters start with denoise, so you can discover new functions, new filters, just by browsing the API and reading the docstrings of these functions. I will also show the gallery, which is another way of exploring scikit-image. We try to be consistent with variable names too, inside the code — for example, how we name indices: something as mundane as whether you use x, y, z or plane, row, column. We have had heated discussions on GitHub to try to find consistency on this.

Here is a short example showing that scikit-image and scikit-learn interact really well. It's an image I acquired for my research, with my team: a grain of gypsum — the stuff plasterboard is made of. Part of it has been dehydrated — the part that is textured — and part of it is still intact. We wanted to segment this automatically, so we computed features, using the feature submodule of scikit-image, in these two regions, and then we fed these features to a random forest classifier from scikit-learn. It gave us a first segmentation, but it was not really good — it had a lot of mistakes — so we cleaned this segmentation using traditional image processing: Gaussian filtering, mathematical morphology. This is to show you, really quickly, the interplay between machine learning and image
A few facts about scikit-image: it's at release 0.15, we have more than 200 contributors but between 5 and 10 maintainers, so we really try to welcome new contributors, and we would be happy to talk with you if you could be interested in contributing to scikit-image or reviewing pull requests. We always need a lot of enthusiastic people. Our community is quite large: we have 20,000 unique visitors per month on the scikit-image website, scikit-image.org, that's how we estimate the number of users. And if you go to the scikit-image.org website you will find one of our most beloved features, which is the gallery of examples, which allows you to browse through thumbnails showcasing image processing applications, and you can select one and open an example. I would like to mention the underlying package of this gallery: it's called Sphinx-Gallery. If you're building your documentation with Sphinx, you can just import it as a Sphinx extension and get such a gallery just from Python scripts. And so here is one rendered example, with the code, the image generated by the code, and some explanations. The gallery of examples is really the part of the scikit-image website which is visited the most, because our users will come to the gallery and say, I want to measure the size of objects in an image, and they will do Ctrl-F on the gallery or something like that and open an example. We also have cross-references sometimes between examples. Sphinx-Gallery also gives you nice features like, at the end of the docstring in the API documentation, it will create mini-galleries like this one with all the examples using a specific function, and this comes for free when you just import Sphinx-Gallery. And also in the examples, like here, you have links to the API documentation, so there is a lot of redundancy, cross-linking between the different parts of the documentation, and it helps your users not to be lost in some dead end somewhere in the documentation. So I really recommend giving Sphinx-Gallery a try.
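The Sphinx side of this is roughly a conf.py fragment like the following. The directory names are just the conventional defaults, not taken from the talk:

```python
# Hypothetical minimal conf.py fragment: load Sphinx-Gallery as a
# Sphinx extension and point it at a directory of Python scripts,
# from which the gallery pages are generated.
extensions = [
    "sphinx_gallery.gen_gallery",
]
sphinx_gallery_conf = {
    "examples_dirs": "examples",       # where the example scripts live
    "gallery_dirs": "auto_examples",   # where the rendered gallery is written
}
```

Sphinx-Gallery then turns each script in the examples directory into a rendered page with the code, its output images, and the cross-linked mini-galleries mentioned above.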
That was the Sphinx-Gallery package. So let's say, for example, that you want to denoise an image. This is an image that I acquired during one of my experiments, and it was very noisy, so how can I denoise it? When I go to the gallery, there is an example showing how to denoise with several different filters. One shortcoming of our gallery is that at the moment it shows mostly pictures, cat pictures, cars, pictures of people, and we miss examples with real datasets, but we are working on this, and if you have good open data to contribute we might be interested. So there were explanations about all these different filters, and here you can see, this was on my image this time, that with just one line of code I can try one filter, tuning the parameters with keyword arguments. And you can see that from this noisy image, for example when you use a quite specific filter, which is this one, the total variation filter, the histogram gets really peaked. So you can start having good results with very generic filters, like the median filter here in green, but it gets much better when you try the more advanced ones, sometimes at the cost of longer execution time. Something which we want to improve in the future is the speed of execution and the parallelization, because some other packages use GPUs for example, but we use only NumPy code, once again for maintainability. So in the future we want to experiment with Numba and Pythran, for example, but at the moment what we do is chunking into blocks. I would like now to go to the interaction-with-images part, because why do we want fast execution? It's because sometimes when you do image processing you don't really know what the workflow will be, what the pipeline will be, and then you need to tinker a bit with your images, you need to try different parameters. And for this, for example, you can use widgets. Here I have used the ipywidgets package and its interact decorator: to choose the best Gaussian filter width, I can just drag a slider and pick the best parameter.
So with this slider I select my best parameter, and you gain a lot of time by having this kind of interactivity, and this you get with widgets. But sometimes you need another kind of interaction with your images, that is, you don't want to change a parameter generating your image, but you want to draw directly on images, for example to place markers for segmentation, to identify an object to be removed from the background, to delineate roads on a satellite image, or to draw bounding boxes for a training set for further classification. And for this we have developed this dash-canvas package. I will give you one example, which is this tool integrated into a web application thanks to the Dash web application framework. So you have the different components of the web application: here I can increase the size of the brush, I can change the color and so on, and I can perform a segmentation based on my annotations. And on this annotation tool I have other features like rectangles, lines, undo, and so on and so forth. So what is this tool? First of all, the web application framework here is called Dash. It's developed by Plotly, and the tagline of Dash is "no JavaScript": it's a web application framework in which you write only Python, and so all the components I showed you before are Python code, I will give a few examples, and it can be quite heavily customized, so that you can really tune the layout. Dash uses a Flask server to run the applications, and all the components are based on the React JavaScript framework, so there is JavaScript behind the scenes, but the principle is that you write only Python. And I have a few examples of Dash code. Here I'm using the JupyterLab extension for Dash, so you see that I write some Python code here, and when I execute it I have my reactive graph, I have these radio-items buttons here, and each of these elements is defined in the layout here.
When I want to add some interaction between these elements, I can do this using the callback decorator of the Dash app. And when I do this, when I change for example the value inside this text box, then this text paragraph is also changed, and this is defined here in this callback mechanism. And if I go back to my app, for example, then in the dev tools of Dash you can see the graph of callbacks, which is a bit more complicated because I have more elements, but it's exactly the same principle as in my little examples. Okay, so which components can you use in Dash apps? You have the normal HTML elements, which are provided by dash-html-components. The reactive components are found in dash-core-components, for example the sliders, the dropdowns, the radio items I just quickly showed, and also reactive charts, Plotly but not only Plotly charts. So if I go back here, for example, I have one graph, and just clicking will populate the hover data; I can select also, which will change this part. It's quite classical to have figures which can be changed by user interaction, but here it's a user interaction with the figures which modifies other components, which is a bit more tricky to do. You also have interactive data tables which you can include, and specialized libraries for specific components, like for engineering or biology. And basically, every time you have a React JavaScript library, you can wrap it with Dash, and this is what I did for this dash-canvas package: there was a very neat JavaScript package called react-sketch, and I just wrote a wrapper around it, adding these little buttons, and this is how it was quite easy to create the dash-canvas package. So dash-canvas provides you two things. One of them is the DashCanvas object, which is a modular tool for annotations and selections, you see here some samples of such annotations. And also you have functions to transform these annotations, that is, to make for example NumPy arrays, masks, which will then be processed by scikit-image; dash-canvas itself depends on scikit-image.
So, for example, this is how I could use these annotations to do the segmentation of these objects in the demo. If you're interested, there is a gallery of examples on dash-canvas.plotly.host, and I can show a few examples of this. So here is the gallery: you have one example with just bounding boxes and then populating a numerical table, like when you want to build a training set for machine learning. There is one example in which you want to remove the background from just one person, and then, since it's not perfect, you can just improve it, that's really the benefit of interactivity, and so on and so forth. I think I'm running out of time, so I will wrap up. This was a quick introduction to dash-canvas, which is quite a new project, it started at the beginning of this year. The roadmap which we have is to improve the interaction with images: having, for example, annotations which can be loaded from a given geometry, from a file, and not only from the user drawing the annotations; also annotations triggering callbacks directly, without having to press a button. And I would be very interested also in handling 3D images and time series, for example for segmentation of objects in 3D like what you have in the medical sciences; adding more examples to the gallery as well. And since this interactive component is based on JavaScript, it could also be useful for other packages, like some libraries using widgets and so on, so we can talk about it if you're interested. So thank you very much, feedback is very welcome on these two tools, scikit-image and dash-canvas, and please be in touch, thank you. Any question? Feel free to go to the microphone please. Hi, thanks, very interesting. When you added that little interaction with the input field, when you change the input field, does it actually go through the server and back to JavaScript, or does it all happen on the client?
It goes through the server, I think, let me... you don't have computations on your local machine otherwise. So I didn't speak about deployment: the app which I was running with the segmentation of the cells, actually the server was a local server on my machine. This you can do; you can also add the apps to an existing Flask application, and you can also deploy using gunicorn, for example on Heroku. So this is the only commercial part: Plotly also commercializes deployment solutions for Dash applications, that's the business model around Dash, which is otherwise completely open source. Can you tell us more about medical images, what are the challenges that scikit-image has regarding these types of images? The question is about medical images. Yeah, exactly. So for medical images we have identified several challenges. One of them is to add more examples using real life-science datasets, because sometimes I go to conferences and a biology person will tell me, I never thought that scikit-image was for me, because I never saw a cell image in the gallery, for example. So this is one thing which we want to do. Also, a lot of 3D images are quite large datasets, acquired automatically with, I don't know, light-sheet microscopy, CT and so on, and for this, improving the speed of execution through automatic parallelization is really something which we want to improve. The dash-canvas part, it's not scikit-image, there are people in common between the two teams, but it's something which I see on top of scikit-image, really adding some user interaction to play with images, to annotate them. So I also see this as something which can be useful to the life-science community, because you have a lot of people using ImageJ to do measurements or to do just manual segmentation, and this you could do with scikit-image and dash-canvas as well. Okay, thank you. Time is over, so now 5 minutes and we start again at 35. We are about to start now: the next talk is "Visual Debugger for Jupyter Notebooks: Myth or Reality?" by Elizaveta Shashkova.
So please give her a warm welcome. Hello everyone, my name is Elizaveta Shashkova, and today I want to tell you about a visual debugger for Jupyter notebooks. First, let me introduce myself: I'm a software developer at JetBrains, I'm working on the PyCharm IDE, and currently I'm focused on the debugger and data science tools. We always write code with bugs, but a productive developer is not a developer who writes code without bugs, but a developer who can quickly find and fix them, and a visual debugger is a tool which can help you do that really efficiently. Visual debuggers for Python files exist in almost every IDE nowadays, but they usually can't work with Jupyter notebooks, because a Jupyter notebook doesn't contain only Python source code: it's a sequence of cells with different types of content, including Python source code. And exactly like code in Python files, code in Jupyter notebooks may contain bugs. The most popular ways to find bugs in Jupyter notebooks nowadays are either print statements or the command-line debugger ipdb, to be honest, but these ways are not very convenient. Print statements require modifying the code inside your cell and rerunning the cell to get additional information, and the ipdb debugger, which is based on the built-in pdb debugger, produces a lot of output during the debug session and requires remembering all these commands to evaluate some variable or to put a breakpoint. Also there are some visual wrappers for ipdb, like the Pixie debugger for example, but they all have the same limitation as ipdb: for example, you can't add a breakpoint during program execution, so you have to wait for the program to suspend and ask you for the next command. So you can see the whole Jupyter ecosystem lacks a very important tool, a visual debugger, and the good news is that recently a visual debugger for Jupyter notebooks was implemented in PyCharm Professional, by me, and today I'll try to explain how it was done. So the answer to the question in the title is of course "reality", because
otherwise my talk wouldn't exist. As I've already said, usual Python files and Jupyter notebooks have at least one thing in common: both of them contain Python source code. Debuggers for Python already exist, so let's learn how they work and which parts we can reuse to build our Jupyter debugger. Most Python debuggers are based on the built-in tracing function mechanism, which allows you to monitor program execution: you can define your custom tracing function, pass it to the settrace function in the sys module, and it will report to you all the events happening in your program. As you can see, a tracing function takes three arguments: frame, event and arg. frame is an object which contains information about the current place in the program, event is the event which happened in this place, and arg is an argument of this event. We defined a simple tracing function which prints the line number and the event which happened on that line, and let's check how it works on a simple example: a simple function greet_neighbors which sends greetings to our neighbors. On the first line, when we call our function, the event "call" arrives, because Python called this function greet_neighbors. Then Python executes the second line, so the event "line" arrives on the second line. Then the interpreter executes lines 3 and 4, and we receive the following events, and the output "hi Mars" appears. After that, during the second loop iteration, Python executes lines 3 and 4 again, and "hi Venus" appears in the output. And after that we are returning from the function, so Python executes line 5, and the events "line" and "return" appear on line 5. Okay, how can we use this tracing function to implement breakpoints in our debugger? On each program event the tracing function receives a frame object, which contains not only line numbers, like we've seen in our example, but also the filename of the current code, which is stored in the code object. A breakpoint also has a filename and a line number where the user put it.
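A minimal version of such a tracing function can be sketched like this; it just records each event and line number while a small function runs (the function and names here are mine, not the slides'):

```python
# Minimal tracing-function sketch: sys.settrace installs a tracer
# that records every (event, line number) pair in traced frames.
import sys

events = []

def tracer(frame, event, arg):
    events.append((event, frame.f_lineno))
    return tracer  # keep receiving line events inside this frame

def greet(name):
    message = "hi " + name
    return message

sys.settrace(tracer)
greet("Mars")
sys.settrace(None)

# events starts with a 'call' for entering greet, then 'line'
# events for each executed line, and ends with a 'return'.
```

Note that the tracer must return a tracing function to keep tracing the current frame, or None to stop, which is exactly what comes up in the Q&A at the end of this talk.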
So on each line event we can compare the breakpoint's filename with the frame's filename, and the breakpoint's line number with the frame's line number, and if these values are equal, we can suspend our program in this place. Cool, so this is how Python debuggers work and how we can use a tracing function to implement breakpoints. But execution of Python code in Jupyter notebook files differs from usual Python files, and in the next part let's learn how Jupyter executes Python code and what we should change in the existing Python debugger to implement breakpoints in Jupyter files. You browse your Jupyter notebook in a frontend; for example, to support Jupyter notebooks in PyCharm we implemented our custom frontend, which works similarly to the default one. So when you run the first cell in your notebook, it starts an IPython kernel and establishes a connection to it. The kernel is a process which works similarly to a REPL: it's running in a loop and waits for the next command to be executed. So when you execute your cell, the frontend sends its source code to the IPython kernel; the IPython kernel compiles it to a code object, executes it, and sends the result back to the Jupyter notebook. The most interesting part for us here is how the kernel executes this code. For every cell execution, the kernel generates a unique name for the cell and passes this name as the filename for the generated code object. Usually the kernel hides this information from users, but it stores all these generated code objects in its internals. That's why, when you define some function in the first cell and execute it, you can call this function in another cell afterwards: the IPython kernel saved it for you. To implement breakpoints in usual Python files we can use the pair (filename, line number) to define a place in the source code, because this pair uniquely identifies the location of a breakpoint or some source code position. But in Jupyter notebooks it doesn't work, because each cell is a separate code snippet with its own line numbers inside, and all cells are located in the same file, so we can't reuse the same pair for Jupyter breakpoints.
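Put together, the matching can be sketched like this: the source is compiled under a known filename, the way the kernel compiles a cell under a generated name, and the tracer compares (filename, line) against a stored breakpoint. The filenames and code are invented:

```python
# Sketch of breakpoint matching: "suspend" (here: just record) when
# the frame's (filename, line number) equals the stored breakpoint.
import sys

source = "x = 1\ny = x + 1\nz = y * 2\n"
code = compile(source, "cell-1", "exec")  # "cell-1" plays the generated name

breakpoint_loc = ("cell-1", 2)            # breakpoint on line 2
hits = []

def tracer(frame, event, arg):
    loc = (frame.f_code.co_filename, frame.f_lineno)
    if event == "line" and loc == breakpoint_loc:
        hits.append(loc)                  # a real debugger would suspend here
    return tracer

sys.settrace(tracer)
exec(code, {})
sys.settrace(None)
```

With a regular script the filename is the path on disk; the Jupyter complication described next is that the filename is a generated per-cell name the IDE has never seen.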
For Jupyter breakpoints, though, we already know that the IPython kernel generates all the necessary information during cell execution: an executed cell has a generated filename and its internal line numbers, so we can use this pair to define a unique location in our code. The problem is that this generated information is available only in the IPython kernel, and not in our IDE. When the debugger sends some message to the IDE, for example about a suspension, "I'm suspended in some place", this message contains the generated filename, but the IDE doesn't know which cell is suspended, because it can't find its source code: it was generated on the IPython kernel side. In the IDE we can introduce some cell identifiers, for example, to find their locations in the editor, but we still need to find a source mapping between these two objects: the cell identifier in the editor and the generated filename. I spent a lot of time trying to understand how to implement it, and the solution appeared to be quite simple. There are two things which helped during implementation. Firstly, in the IDE, as I've already said, we have a custom Jupyter frontend. That means that we can control all cell execution inside our Jupyter notebook, so for example we can track all the cells which were executed during the session, or send some additional commands. The second thing which helped to implement it was silent cell execution, supported in Jupyter: you can execute some code in a silent mode, so it will be executed in the context of the IPython kernel, but it won't be added to the kernel history and it won't increase the execution counter. How did we use it to support source mapping? Now, before sending the real cell code, we can send several utility commands in silent mode: for example, we can send a command which hooks the function for name generation and saves the currently generated name to the debugger instance; also we can send information about the currently executed cell id and save this value to the debugger instance as well.
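The mapping trick can be sketched in plain Python like this. A made-up run_cell records the generated filename for each cell id, standing in for the silent utility commands, and the tracer then reports cell ids instead of generated names; the naming scheme is invented:

```python
# Sketch of the filename -> cell-id mapping described above. Before
# a cell runs, the frontend records which generated filename belongs
# to which cell; the tracer then reports cell ids, not filenames.
import sys

cell_for_filename = {}   # generated filename -> editor cell id
reported = []

def run_cell(cell_id, source):
    generated = f"<cell-{cell_id}>"         # hypothetical naming scheme
    cell_for_filename[generated] = cell_id  # saved "silently" beforehand
    exec(compile(source, generated, "exec"), {})

def tracer(frame, event, arg):
    name = frame.f_code.co_filename
    if event == "line" and name in cell_for_filename:
        # report the editor's cell id instead of the generated name
        reported.append((cell_for_filename[name], frame.f_lineno))
    return tracer

sys.settrace(tracer)
run_cell(3, "a = 40\nb = a + 2\n")
sys.settrace(None)
```

Because the mapping is stored before execution starts, every message the debugger sends can be translated into something the IDE can locate in the editor.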
That means that when a cell starts to execute, we already know all the necessary information about the mapping, because it's saved in our debugger instance, and inside our Jupyter tracing function we can do the following: when execution is suspended inside code with some generated name, we can map this generated name to the cell identifier which is stored in our debugger instance, and then send a message to the IDE side. Now this message doesn't contain a generated name, it contains a cell identifier in the editor, and after that the IDE can quickly find the cell and its source code in the editor and highlight the suspended line. That's how Jupyter breakpoints were implemented. Now we know how the IPython kernel works and how we can define the source mapping for Jupyter breakpoints, but we still have two separate entities, the IDE and the IPython kernel, and they should be able to communicate somehow. As I've already said, we have an IDE instance with its custom Jupyter frontend implemented there, and an IPython kernel which executes our commands. But for a debugger it isn't enough to send just source code for execution or some commands in silent mode: we also need to send a lot of communication, something like "the user added a breakpoint in cell number 3, line number 2", or, when the debugger is suspended, it should be able to send a message like "I'm suspended in some place in the cell". That means that the debugger needs some additional communication channel with the IPython kernel. When I started to think about it, I realized that there are two possible solutions: the first one is to establish an additional connection to the IPython kernel, and the second one is to reuse the existing Jupyter channels. The first one is the simplest one, and it's the first thing that came to my mind, but it has some limitations, and the reason for these limitations is in the Jupyter architecture. This is a more detailed scheme of the Jupyter communication model: when a frontend connects to the IPython kernel, it doesn't connect directly, it connects via a kernel proxy, which connects to the IPython kernel
via WebSockets. And if we want to bypass the Jupyter messaging architecture, we can only establish a direct socket connection to the IPython kernel, and of course that isn't always possible: for example, if your kernel is located far, far away in some cloud, you can't connect to it without the proxy. So the solution with an additional connection is currently implemented in PyCharm Professional, but it works only if you can establish a direct connection to the IPython kernel. But Jupyter already has a rich messaging architecture, so maybe we can try to reuse it? The IPython kernel has five sockets; here I've shown the three most important of them: one sends cells for execution, one sends output back to the frontend, and one requests user input. It would be possible to reuse some of them in our debugger, but there is another serious limitation in Jupyter: the IPython kernel runs a Tornado event loop in the main thread, which processes execution events, and there is a second event loop in a separate thread which processes output. Each of these event loops is single-threaded. That means that if some command has started to execute, the event loop is busy, the following commands will be executed only when that execution is finished, and any messages with debug information sent over the same channel would be blocked. But the problem is that debug information should be sent during execution; it's useless when execution is already finished. That means that in the current Jupyter architecture it's impossible to reuse the existing channels for sending debug information. But wait, everybody knows that ipdb works in both local and remote cases, and it doesn't require any additional connection. How does it do it? If you remember the workflow with the ipdb debugger, you can understand how it works. To call the ipdb debugger inside your Jupyter notebook, you need to add a call to the set_trace function inside your cell, and after that the debugger suspends and asks you for some command; you type some command, the debugger receives it, does some actions, and asks you for the
next command, the debugger receives it, and so on and so on. So an ipdb debug session is in fact a sequence of request-reply commands which the kernel sends to the frontend and back, and it works like this because it's based on the built-in input function, so it can reuse the existing input channel: it's the input function that carries the debug commands. It works, but it has some limitations. For example, if you started to execute some long-running cell under the debugger and realized that you forgot to put a breakpoint in some important place, you have no chance to do it with ipdb: you need to wait for the program to suspend and ask you for the next command, and only after that you can add your breakpoint or execute some stepping command. It's okay for a command-line debugger, but we can't reuse the same technique in our visual debugger, because in our visual debugger we want the ability to put a breakpoint even while the debugger is running, and to make the program suspend in the place where we added this breakpoint. That's why, in the current implementation in PyCharm Professional, I decided to establish an additional connection and send all the debug utility commands separately from the Jupyter channels. Well, in this part we've learned how the Jupyter debugger sends its utility commands and why it was implemented this way; also we've learned how ipdb works inside. So it looks like our visual debugger is now ready. Let me remind you how we built it today. First, we defined a tracing function which can work with code generated by the IPython kernel. Secondly, we created a mapping between the editor and the generated code for cells; we used silent cell execution and the features of our custom frontend to implement it. And after that we established a debugger connection for sending commands from the IDE side to the IPython kernel and back. So today we've learned how the Jupyter visual debugger is implemented, and that means that it's time for the most entertaining part for you and the most horrifying part for me: a live demo in PyCharm. Okay, can you see anything?
You can see? You can see everything? Here it is. Well, this is a Jupyter notebook in PyCharm: you can see cells are located on the left side, and the Jupyter notebook preview is located on the right side. You can work with cells as if they were located in one Python file, and it's important to notice that we don't convert the Jupyter notebook to a Python file. It is still a real Jupyter notebook, with the .ipynb extension, which is located in your project on disk; we just show our custom presentation, so you can work with it like it is one big Python file. And you can use both the features of the usual Python editor in PyCharm and the features of Jupyter notebooks. For example, when you type code you can use code completion, or, for example, here we have some function: you can quickly navigate to any variable declaration, and even if this declaration was in another cell, PyCharm will navigate you to the correct place. So you can use the features of PyCharm; for example, you can run a cell, and you can see it was executed and the output appeared here in the notebook, and it's stored in the Jupyter notebook, so it works exactly like your default frontend for Jupyter notebooks. Also there are many other actions; for example, in PyCharm 2019.2 it will be possible to run all cells in your notebook, or restart the kernel, or clear outputs, and do a lot of other things. But we came here to check that our visual debugger works, so let's put a breakpoint, we put it here on the second line, and run "debug cell". The debugger is suspended: as you remember, we defined a tracing function, it established the source mapping between editor and kernel, and then it sent a command to our editor, and we found the place where it should be suspended. You can see variable values here, you can expand them, check their values, and, for example, resume the program. Great, simple breakpoints work. Let's look at the next cell: this is the greet_neighbors function that we have already seen today, when we discussed tracing functions. Let's put a breakpoint here and
debug this cell as well. During the talk I had time to discuss only breakpoints, but a very important part of every debugger is stepping commands, and they were also implemented in PyCharm; let's check that they really work. I can press "step into" here, into the function declaration in this cell; here we can also execute stepping commands, you can see the values change, and after that we can step again and continue our stepping commands in the cell where the function was defined. Great, so stepping in the current cell works quite fine. Let's consider a more complicated code sample. There is a lot of code, but it's quite simple: we have a list of planets, and we iterate over these planets, print the name of each planet, search for its neighbors, the left one and the right one, if they exist, call the same greet_neighbors function after that, and sleep for two seconds, because we like sleeping. Well, let's execute our cell under the debugger. Execution has started, you can see the output appears, but I forgot to put a breakpoint here. Let's add a breakpoint. We added a breakpoint, and the debugger suspended exactly inside our cell. This is the thing that isn't currently possible in ipdb: you can't add a breakpoint during execution, but in PyCharm, with our visual debugger, we can do it. We can also look around here; for example, we can check where we stopped, we can inspect this planet variable, and we're stopped on the planet Jupiter. Also we can execute "step into" again, into the greet_neighbors function, and we are navigated to the correct place where the function was defined, even though it was defined in another cell. Okay, we can resume our execution and remove the breakpoint. Great, we checked the neighbors for Jupiter, but I would like to learn what the neighbors of Uranus are, and I don't want to do a lot of stepping commands and press resume many, many times. For that I can put a breakpoint and then set a condition for my breakpoint: I can say, suspend my program only if the name of the planet is Uranus. Okay, let's just hide this and start our debug session again.
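Under the hood, a conditional breakpoint can be sketched with the same tracing machinery: the condition is evaluated in the paused frame's own namespace, and the program only "suspends" when it is true, like suspending only for Uranus here. The code, filename, and condition are invented:

```python
# Sketch of a conditional breakpoint: the breakpoint is on line 2,
# but we only record a hit when the condition, evaluated with the
# frame's locals, is true (suspend only when the planet is Uranus).
import sys

source = 'for name in ["Saturn", "Uranus", "Neptune"]:\n    visited = name\n'
code = compile(source, "cell-9", "exec")

breakpoint_loc = ("cell-9", 2)
condition = 'name == "Uranus"'
hits = []

def tracer(frame, event, arg):
    loc = (frame.f_code.co_filename, frame.f_lineno)
    if event == "line" and loc == breakpoint_loc:
        if eval(condition, frame.f_globals, frame.f_locals):
            hits.append(frame.f_locals["name"])  # would suspend here
    return tracer

sys.settrace(tracer)
exec(code, {})
sys.settrace(None)
```

Line 2 is reached on every loop iteration, but the condition filters the hits down to the one iteration we care about, which is exactly what the demo shows.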
Our debug session starts. Let's check: output is starting to appear, but we are waiting for our condition to fire. Okay, we're suspended. Let's check that we're suspended in the correct place: the current planet name is Uranus, which means the condition for our breakpoint really worked, and we can check the names of the neighbors of Uranus. The neighbors are Saturn and Neptune, that's correct. Great: we can add breakpoints even during a debug session. If you do a lot of data science work, and you might, when you work with Jupyter notebooks, you might work a lot with NumPy arrays and pandas Series, and you know it sometimes is quite difficult to inspect values in these data structures. You can press here on your variable, and it will be opened in a beautiful window as a table, and you can inspect its values, type some slices, and do anything you want with your data. Okay, so we've checked that the visual debugger for Jupyter notebooks really works, I didn't cheat you during my talk, and that's excellent news. Then let's go back. During my talk today we learned how to build a visual debugger for Jupyter notebooks, and that means that now, after my talk, you know how to implement a visual debugger for Jupyter notebooks in your favorite IDE, if for some reason it's not PyCharm; and if it is PyCharm, I've already implemented it for you, so you can try it right now. Thank you very much for coming to my talk; now I'm ready to answer your questions. Thanks for the talk, I've got one question: when you register a settrace function, is the program running slower when you have the settrace function? Um, I didn't get it. Your debugger is based on the settrace function, which does a lot of things; when it is activated, does the program run slower? It's activated when you pass your function to settrace; it's activated in the next frame which is called. So, as you can see in this example, where was it, here it is, you can see we set it here, and it will be applied to
the next frame which will be executed. So here we are calling the next function, greet_neighbors, so we are entering the next frame, and the tracing is activated in this function. And a tracing function should return, okay, here it is, it should return a tracing function for the current frame, or it can return None and the tracing will be stopped. Okay, and is it possible to remove the trace function, the reverse effect, so you would call sys.settrace(None)? So that's what happens when you push the play button, it does something similar, yeah, if you don't have breakpoints. Okay, thanks. I have another question for you: to connect to the kernel, to add a new connection, did you have to modify the kernel, to build a custom kernel, or is it done with the settrace function? No, it's only the settrace function: we silently execute the command which connects to our debugger, and we store the debugger instance in some internals, so it doesn't modify the kernel; we're just setting this tracing function. Okay, and do you have an idea of the performance impact of this settrace function? Yeah, as usual for debuggers: of course it makes your program execution a bit slower, and sometimes much slower if you have a lot of computations, but I think it's anyway faster than adding print statements and rerunning your cell many, many times. So on average it's slower, but usually you don't even notice it. No more questions? Thank you very much for coming. You can always find me at the PyCharm booth during the whole conference, so feel free to come and ask me any questions about my talk, about PyCharm, or about anything you want. Thank you for coming. So today, as mentioned, we're going to be talking about how to build a pet detector in a notebooks environment. Just to give you a bit of context, we'll do a high-level overview of the deep learning mechanisms we'll be using and the machine learning workflow, and then we'll dive deep into, spend most of the time in, the demo notebook and talk about the
actual code.

So, to give you a bit of context: I'm Catherine Kampf, I work for Microsoft, and I'm a program manager on a product called Azure Notebooks, which is our free Azure-hosted Jupyter notebook service — we'll be seeing it shortly. Here's some of my contact information; I'll be showing it later, so don't worry too much about taking photos just yet. The most important thing to know about me is that I really love dogs. And even for me, knowing a bunch of dog breeds, it's still sometimes difficult to tell them apart — if we look at Alaskan Malamutes and Siberian Huskies, they can look super similar. And especially when you're then trying to train a machine to understand this, it gets difficult, and you need a ton of data to get an algorithm that can successfully distinguish between these different breeds.

So the way we're going to approach this today is using a technique called deep learning. This is often how deep learning is viewed: you have an input — that's the dog photograph — and a sort of black box where we don't really know exactly what's happening, but then we get these outputs of either dog, cat, or other. In this demo we're actually going to go a bit further and say, if it's a dog or a cat, which breed do we think it is.

This differs a bit from traditional machine learning, where you'd often be doing manual feature extraction. That requires more hands-on work to try to discern which features of the data are most important — whether it's size, or color, et cetera — and it also requires domain expertise. This is easier to think about in the topic-classification space, but if you try to apply the same thinking to a really specialized field, it can get a lot more difficult to do this manual feature extraction. And then you also need to run through different classification algorithms to figure out which will work best and get you the highest accuracy. This differs again from deep learning, where we're going to take a layered approach, trying
to understand different features, pixel by pixel, of our images — and it's going to be the machine doing the heavy lifting of that work, rather than manual data science work tuning the features. So, like I said, deep learning requires a ton of data, but once you have that data, it becomes a lot easier for the machine to understand the data set and generate predictions.

Specifically, within deep learning, today we're going to be using what's called a convolutional neural network. This is a really popular kind of network for image classification, because it works by preserving the RGB channels in the first layer: it takes each pixel's red, green, and blue values and preserves them in the first layer. Then — and this is at a very high level — we use a technique to filter over the images themselves and try to extract which information is most important, in the convolution layer. Next we move into pooling, where we aggregate that data and reduce the amount of information, because deep learning is super computationally expensive — if a bit less information goes into the fully connected layer, it can save you a lot of time and compute power. And in the end we look at whether the animal is a dog, a cat, or something else if the network is a bit confused.

If we take a step back and look at the general machine learning workflow we're going to walk through today: you often want to start off with data exploration. The data itself, in a machine learning workflow, is where all the power for your deep learning network comes from, and it's the most important part of the process. This involves finding a data set, possibly transforming it into a particular format you need, cleaning any data, and running some visualizations to understand the basic attributes of the data.
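The convolution and pooling stages described above can be sketched in miniature. This is a toy, pure-Python version — real frameworks do this over RGB channels with learned kernels, whereas the edge kernel here is my hand-picked example:

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (really cross-correlation, as in most deep
    learning frameworks): slide the kernel over the image, taking weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(fmap, size=2):
    """Max pooling: reduce each size x size patch to its maximum,
    shrinking the feature map before the fully connected layer."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A tiny 5x5 "image" with a bright vertical stripe in the middle
image = [[0, 0, 1, 0, 0]] * 5
edge_kernel = [[-1, 1], [-1, 1]]        # responds to left-to-right edges

features = conv2d(image, edge_kernel)   # 4x4 feature map
pooled = max_pool(features)             # 2x2 after pooling
print(pooled)                           # [[2, 0], [2, 0]]
```

The convolution turns raw pixels into a feature map that lights up where the edge is; pooling then halves each dimension, keeping only the strongest response per patch — less data flowing into the fully connected layer.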
And once you have a sense of that, you can move into training. Training is where we'll actually be developing our algorithm, and there are three main concepts we like to think about with training: scripts, compute, and tuning. We start off with our training script — this is where we try out different algorithmic approaches. Compute-wise, your local box can be pretty powerful, but depending on the size of your data you might need to scale out to a larger VM in the cloud or on-prem, or to a cluster computing environment. And once you have that algorithm, you might be fully satisfied with the accuracy you're seeing, or you might want to do some tuning — refine that algorithm to reach a bit higher accuracy.

Once you're happy with your model, you move into the inferencing stage, which is where you actually use your model in an application. This involves three different components. First, productization — this can often mean refactoring your code: you might have started in a Jupyter notebook environment but need to output a Python module, so you'll have to do some work there, and we'll talk a bit about that later. Then, deploying your model to a web service, so that you can use it in your applications, and other folks across your company or organization can as well. And once your model is deployed, you can write a test application to send it a photo of a dog and get back your breed prediction.

So that's the high-level overview of what we're going to walk through today. We'll spend most of our time in these first two stages, but I'll point you to a GitHub repository at the end that walks through the entire life cycle.

We're going to start off by talking about the data we're using. Today we're using the Oxford Pet data set. This is a pretty common data set which contains 37 categories of different pet breeds — cats and dogs — with around 200 images per breed, and here's a link to it. I
definitely encourage you, if you're starting out with machine learning or image recognition — this is a really great, well-labeled, and well-documented data set to use.

Then, once we get into wanting to explore and understand our data, there are a bunch of great tools you can use depending on the type and scale of data you're working with; today we're going to be using notebooks. If you're not familiar with Jupyter notebooks, they essentially let you combine Markdown text, images, visualizations, et cetera, alongside executable code. That's super useful in data exploration, and in data science in general, to tell a story around how you got to a specific graph or a specific model — so you have the context, either to present to others or to look back on your own work and understand each step you went through, and you get strong visualizations of your data.

Specifically, today we're going to be using Azure Notebooks, which I mentioned earlier, for the free setup and scale-out. My local box is fairly powerful, but for something like this I'd rather use a really big, beefy GPU machine in Azure — and Azure Notebooks lets you connect from its free compute to a remote VM in Azure, so it's super useful for scale-out scenarios. And I can make the project public for all of you to go and play with on your own as well.

So I'm going to switch over to the demo to show you exactly what it looks like. This is the GitHub repository I'm going to link — oh, this isn't showing up... the IT people... all right, let's see if we can fix this... okay, we're good. So this is the GitHub repository I'll link you to, just to give you an overview. If you scroll down here, you'll see this little launch badge; if you click it, it'll automatically clone the repository into your Azure Notebooks, and you'll see that you can use it as a new project — so it's super easy to get started. Once it's cloned in, which I've already done, you get an overview of all the project files here. And this is the compute picker I mentioned.
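The folder-structure check that comes next amounts to something like this — a sketch against a made-up miniature layout (three breeds, three images each) rather than the real 37-folder data set:

```python
import tempfile
from pathlib import Path

# Build a tiny stand-in for the real layout: one folder per breed,
# with that breed's images inside it
root = Path(tempfile.mkdtemp()) / "images"
for breed in ["abyssinian", "beagle", "chihuahua"]:
    d = root / breed
    d.mkdir(parents=True)
    for i in range(3):                     # pretend these are real photos
        (d / f"{breed}_{i}.jpg").touch()

# The equivalent of a quick `ls`: how many breed folders, and images in each?
breeds = sorted(p.name for p in root.iterdir() if p.is_dir())
counts = {b: len(list((root / b).glob("*.jpg"))) for b in breeds}
print(len(breeds), counts["beagle"])       # 3 3
```

Against the real data set the same two lines would report 37 folders with roughly 200 images apiece.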
So there's free compute offered, as well as — this is my own VM that I'm connected to, an NC6 GPU machine in Azure. I'm going to show you all of these. Let's first try to understand the data structure we're working with. We know we're using the Oxford Pets data set, but I'm just going to do a quick ls to see the folder structure. It looks like we have 37 folders, and presumably those contain the 200 images each that we're going to be analyzing today.

Just to get a quick visual sense of these, I created a little demo so we can see exactly the breeds we're going to be working with — and holler in the back if this font's not big enough; I tried to make it big. I think it's very useful to always get this visual sense of your data. For instance, here we can see all 37 breeds, and it would be easy, once we have our model deployed, to submit a breed that's not covered by this data set and then be confused — if we submitted a golden retriever, say, it's not reflected in this original data set.

As I mentioned, we're working with 200 images per breed, which may seem like a lot, but for something like deep learning it's really not enough — that's relatively small data — and we'd likely end up with an overfitted model that wouldn't scale out to the data we'd see in the wild. So instead, we're going to use a technique called transfer learning. With transfer learning, we take a pre-trained model — today that's the MobileNet model, which has been trained on thousands of general images — and then we retrain that last layer specifically on our 37 pet breeds. So it uses all the power of the people who trained this massive network, but specializes it down to our data set. When I run this training, we can see it takes a bit and has a ton of output — it took around 26 seconds, and we saw an accuracy of almost 80%.
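The idea — keep the pretrained base frozen and retrain only the final layer — can be shown with a toy stand-in. Nothing here is the actual MobileNet retraining code: the "base" is just a fixed transform and the "head" is a tiny two-class logistic classifier, but the mechanics (update only the head's weights) are the same.

```python
import math

def frozen_base(x):
    """Stand-in for the pretrained base network (think MobileNet minus
    its top layer): a fixed feature transform we never update."""
    return [x[0] + x[1], x[0] - x[1], 1.0]   # last entry acts as a bias

def retrain_head(data, lr=0.1, epochs=50):
    """The transfer-learning step: train only the final (logistic)
    layer on top of the frozen features."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, label in data:
            f = frozen_base(x)
            p = 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))
            for k in range(3):               # gradient step on the head only
                w[k] += lr * (label - p) * f[k]
    return w

def predict(w, x):
    f = frozen_base(x)
    return sum(wi * fi for wi, fi in zip(w, f)) > 0

# Toy two-"breed" data set in place of the 37 real ones
data = [((1.0, 1.0), 1), ((0.5, 1.5), 1), ((-1.0, -1.0), 0), ((-1.5, -0.5), 0)]
w = retrain_head(data)
print([predict(w, x) for x, _ in data])
```

In the real demo, the frozen part is MobileNet's convolutional stack and the retrained head is the classification layer over 37 breeds.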
And this is from doing that initial transfer learning without tuning any of our hyperparameters — which we'll get into later — just using a flat learning rate. We were still able to get to 80% in 26 seconds, which is pretty impressive: if you look back, this data set was first released in 2012, seven years ago, and even with a lot more compute power and a lot more time, data scientists were then only able to get to around 59% accuracy. So it's pretty impressive how far we've come in a short amount of time.

79–80% is pretty great, but I want to see if we can improve on it at all. So now we're going to work with what are called hyperparameters — attributes of your network you can set beforehand. Specifically, we'll look at the learning rate. The learning rate is essentially how much you let the weight on a node vary from iteration to iteration — how quickly you're letting the network learn. Oftentimes in data science you find yourself trying out a bunch of these values, just for-looping through them randomly, because it's often difficult to determine what might be best for your specific network. So instead of doing that by hand and taking hours, we're going to use something called Azure Machine Learning service, which lets you distribute this work across a cluster. I have a four-node cluster in my Azure subscription, and basically I'm going to send the training script to each of the worker nodes; it'll try out a bunch of values for the learning rate — you can see them here — and it'll tell me which gets the best accuracy, so I can treat that as my best model without doing as much work myself.

And we're going to do a couple of things to make this more efficient — I need to update my calls — we're going to use this early termination policy. Basically, if you see this 0.15, what that's saying is: if a run's accuracy falls more than 15% short of the best accuracy we've seen so far, we just cut that run short.
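A sketch of that tuning loop: search over learning rates with an early-termination policy. The training run is mocked — the accuracy curve is invented so the example is self-contained — and the 0.15 slack mirrors the policy described; this shows the idea, not the Azure ML SDK calls.

```python
def mock_train(lr, steps=10):
    """Hypothetical stand-in for one training run: yields a validation
    accuracy after each step, peaking for learning rates near 0.01."""
    for step in range(1, steps + 1):
        yield (step / steps) * max(0.0, 0.95 - abs(lr - 0.01) * 20)

def tune(candidates, slack=0.15):
    """Try each learning rate; terminate a run early once it falls more
    than `slack` below the best accuracy seen at the same step."""
    best_lr, best_acc = None, 0.0
    best_at_step = {}                        # best accuracy seen per step
    for lr in candidates:
        acc = 0.0
        for step, acc in enumerate(mock_train(lr), 1):
            if acc < best_at_step.get(step, 0.0) - slack:
                break                        # cut this run short
            best_at_step[step] = max(best_at_step.get(step, 0.0), acc)
        if acc > best_acc:                   # full runs compete on final accuracy
            best_lr, best_acc = lr, acc
    return best_lr, best_acc

best_lr, best_acc = tune([0.0001, 0.001, 0.005, 0.02, 0.1])
print(best_lr, round(best_acc, 2))           # 0.005 0.85
```

Terminated runs (here, the far-off 0.1) free up their worker for the next candidate, which is exactly the point of the policy on a shared cluster.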
Whenever a run falls that far below the current best accuracy we've seen, we cut it short and free up that compute resource to be used with a new value. As this runs we can see a bunch of output — it's not loaded yet, and sometimes it takes a bit depending on the wifi — but we can see all these jobs running, a bit of information about our cluster and how it's running, and it'll run through each job and tell us how long it took, what its run IDs are, et cetera. I'll come back in a bit so we can see the visualizations it starts giving you. But even though it's pretty efficient to distribute this across a cluster, it'll still take around 25 minutes — and this is a 30-minute talk, so I don't really have time for that. Some of the validation accuracy is starting to come in, as well as the different learning rates, but I'll just skip ahead to a run we've already done in the past.

This was a run I did yesterday. I feed it the specific run ID, and now I can see a bit of the information, another graph of the validation accuracy as it went through training, and that the final accuracy was around 93%. So with just an additional 25 minutes of training we were able to increase our accuracy by 13%, which is pretty exciting — and 93% is a really great accuracy, especially for something like an image recognition task. Now that I have that marked as my best run, I'm going to register it with the Azure Machine Learning service, and this basically lets me use the model from anywhere, or deploy it easily. If I just wanted to access this model from VS Code, or in a future notebook, it's registered and available to me, as well as to anyone working inside my workspace.

I know we just covered a lot, so I'm going to flip back to the slides to review some of the topics. Again, we just went through training, and we were looking specifically at trying to do deep learning with small data. By nature, deep learning requires a lot of training data, because it's doing that feature extraction and trying to
figure out the best network structure on its own, so you need as much data as possible for it to learn from. Our 200 images per breed weren't going to be enough, which is why we decided to use transfer learning — specifically transfer learning with MobileNet, taking the existing MobileNet model and retraining the last layers specifically on our data set. And since that gave pretty good accuracy, around 80%, we decided to see if we could do any better with some hyperparameter tuning using Azure Machine Learning service. Just to call out a couple more exciting things: if you're just getting started with machine learning, AML can be super useful — it's got experiences where you can do drag-and-drop, automated machine learning that tries out a bunch of different classification algorithms, hyperparameter tuning like we just saw, and automated compute scale-up and scale-down. I have a four-node cluster in my subscription right now, but I've set the minimum nodes to zero, so whenever I'm not running my jobs it scales down and doesn't cost me money, which is super great.

So once we have this model, like I mentioned, you might want to do some refactoring. I'm going to go ahead and open the notebook — we just had the .ipynb file, and when I open it in VS Code I see the JSON dump of what a raw notebook file looks like, and I also see this option to import it. I'm going to go ahead and click that, and VS Code turns it into a .py file with a bunch of cells. Here I can see the markdown has been turned into comments. Each of these little cells has a run-cell option that brings up our Python Interactive window and runs the cell side by side, as you would see in Jupyter — and this essentially works as a Python console, so you can type code in there, et cetera. When you get into refactoring, this can be super useful: if we just highlight a snippet of code, we have all the usual options.
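That import step is mechanical enough to sketch: a .ipynb file is just JSON, so converting it to a .py file means emitting each cell behind a `# %%` marker (the cell delimiter the VS Code Python extension recognizes) and commenting out the markdown. A minimal sketch of the idea, not VS Code's actual implementation:

```python
import json

def notebook_to_py(ipynb_text):
    """Turn raw notebook JSON into a .py script with '# %%' cell markers,
    converting markdown cells into comments."""
    nb = json.loads(ipynb_text)
    chunks = []
    for cell in nb["cells"]:
        src = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            body = "\n".join("# " + line for line in src.splitlines())
            chunks.append("# %% [markdown]\n" + body)
        else:
            chunks.append("# %%\n" + src)
    return "\n\n".join(chunks)

# A two-cell toy notebook, as the JSON you'd see in the raw .ipynb file
raw = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["## Load the data"]},
    {"cell_type": "code", "source": ["breeds = 37\n", "print(breeds)"]},
]})
print(notebook_to_py(raw))
```

The reverse direction — re-exporting the .py file back to a notebook — is just the same mapping run the other way.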
Refactoring, the extract-method command, et cetera — all from what started as a Jupyter notebook. So you can now refactor that into whatever form fits your workload best, and I have an example in the repository of what a refactored Python module might look like.

Once we've refactored to something we're happy with and deployed our model, we'll want to test it. This is a testing script I've written, and I'm going to go ahead and run this cell to see what it looks like. We have some code to access a random pet to try it out, or — in this example — I'm specifically trying it with this little chihuahua I found on the internet, and we can see it loads the chihuahua image as we would expect. And something new that we just introduced is the ability to debug cell by cell. We just ran this first cell successfully, and now I'm going to hit this debug-cell option — you can see I have a breakpoint here — and once I hit debug cell, it opens the different tools I'm used to in my debugger. I can step through, et cetera, or just continue on, and once I continue, I can see that it gets it right with a pretty high probability. This is super useful, being able to debug bit by bit, and we just introduced it this week — so if you want to learn more about it, come by our booth and we can talk more. It's going to be super powerful when you're working in a data science space and trying to debug cell by cell. And now we know our pet detector is working pretty well — we got a great probability, so we're good to go.

Just to rehash a bit about working with Python in Visual Studio Code: we just saw that we have debugging and refactoring capabilities; there's also IntelliSense, so auto-completion, as well as the ability to import and export Jupyter notebooks. As you saw, we imported a Jupyter notebook, it was transformed into a .py file with different cells, and you can continue to work on and refactor that into a Python
module, or you can re-export it back into a Jupyter notebook if you want to present the information, et cetera. There's a variable explorer, a data viewer — a bunch of full-fledged data science tools — and if there's anything you don't see that you would love to see, please come talk to us; we are heavily investing in making this a great experience.

So now that we've covered most of that workflow, what's next? Here's a link to the GitHub repository where you can build your own pet detector, as well as links to try out Azure Notebooks and the data science tooling in Visual Studio Code. I have some resources on the next slide as well, but I'll let people take photos of this as they wish. I think we'll have a few minutes left for questions, and then I'll hang around outside in the hallway and at the booth.

Thank you for the talk, I have a question. Usually in this kind of example we talk about classification into a small number of categories, like breeds of dogs. What can I do if I want to train something to classify, let's say, several thousand categories, or maybe 100,000 — I just want a classifier that tells me what is in the picture. How can I accomplish this task? Shall I try a lot of classifiers, one for each separate category, or are there approaches that do it out of the box?

Yeah, so it depends on the size of each category you have. If you have hundreds of thousands of categories and five images per category, then you'll have to use some pre-existing models and do transfer learning. But if you have 100,000 categories and plenty of images within them, then you can employ a lot of different techniques — whether you want to do traditional deep learning on your full data set, or try different classification algorithms on each set, as you mentioned.

Okay. And another question: is there some software-as-a-service in Azure that provides a classifier
as a service — just so I don't have to write it on my own, but can call some API and use it?

Yeah, we have something called Azure Cognitive Services, which is basically a suite of exactly that: APIs for speech-to-text, search, image recognition, et cetera. If you search for Azure Cognitive Services, that's exactly what it does. Hopefully that helps.

In the demo you were using an image folder locally — is that correctly understood? You have images there, and they are then uploaded to train your model, if the images are not coming from somewhere else?

Yeah, these images I actually put into Azure storage; they're in the GitHub repository as well if you want to download them.

So the notebook refers to Azure storage — you need to upload the images to Azure storage and then train from there?

I loaded them onto the specific VM I was working with, but there's also a limited amount of free storage in Azure Notebooks, so you could upload a subset of the data and work with it there, or upload the full data set to Azure.

And you also have a service for uploading the images on Azure, if I remember correctly?

Yeah, there are a couple of options: you can do it from the Azure portal, or use Azure Storage Explorer, which lets you upload data to Azure storage depending on what your end store in Azure is, whether it's Blob or Azure Data Lake.

Is there the possibility of marking up where in the image the object is?

So this data set is nice because it gives you the full image, and it also has a highlight — a box highlight — around the face of the pet, which makes it a good sample data set for cases like this. Does that help?
Yeah, it'll box the face for you to make it a bit easier, whether you're looking at a full dog or just the dog's face.

I just didn't see it in the GitHub repository, but it's somewhere?

Yeah, the link I provided to the Oxford Pet data set earlier in the slides — if you read there, it'll have the boxed images. And I'll tweet the link to the slides as well if you'd like them.

So if there are no more questions — is there a question left? If there are no more questions, let's thank the speaker again.

...learning algorithms with optimized software, and he's working for Intel. Let's welcome him with a round of applause, please.

Hey, thank you very much for being here. My name is Shailen and I'm an AI specialist at Intel. Internally my title is Technical Consulting Engineer, where I am the link between you guys, the end users, and the core developers who develop the software that you will use. Today I'm going to show you how we accelerate deep learning algorithms in software, and I will use a real-world case study. The case study that I'm very passionate about and involved in a lot this year is about scanning for brain cancer in humans — so that's the title of today: brain tumor segmentation using deep learning. I'm based in Germany and I have an education from Germany.

A brief agenda: I'm going to describe the problem we have and how we're trying to solve it with AI; then I will tell you what software tools and packages we used to solve this problem; and then we'll have a look at some performance numbers — what you can expect from such a real-world case study.

Let's start with some statistics as motivation for why I, my team, and Intel are so passionate and involved in this field. As per GLOBOCAN, the world cancer statistics research group, approximately 18 million new cancers were recorded, and if you look at that, close to half of that number involved
people dying. If someone in your family is in there, dying from cancer — it's not nice. So what can we do? How can AI help? And those numbers are just for 2018; this year, 2019, we can expect similar numbers. It's really important to diagnose cancer as early as possible and find solutions so we can avoid deaths, right?

An introduction to this brain tumor topic. There is a technical, or medical, term: gliomas. These are the most commonly occurring types of brain tumors, and they are very dangerous — if you have them they can grow really badly and you can die from them — and 90% of gliomas belong to a class of highly cancerous tumors. To date, multi-sequence MRI, or magnetic resonance imaging, is the de facto way to screen and diagnose for such gliomas. When you do an MRI, imagine your head goes into this MRI machine and it takes a volumetric 3D scan of your brain. For the doctors, the challenge is to find the cancer: they have to slice, or segment, that 3D volume and analyze it slice by slice. This is very time-consuming and an expensive process — but a very crucial one. So segmenting the brain, this 3D volume, is very important.

And how do you actually treat an affected area? You can do radiotherapy — which is essentially using radiation to destroy the bad cells — and another way is to do surgery: you open the skull and go in and remove the bad cells. Okay, I heard some reactions there, sorry about that. But in order to do these two things you have to analyze this 3D volume slice by slice, and that's why segmentation is important.

Now the problem is the following, and that's the medical challenge — the challenge is twofold. First, we have a lack of specialized doctors to do this; I have a link down there to an article about the lack of physicians to do these kinds of studies and analyses. And second, this whole process of segmenting is time-consuming and very expensive. But we believe
that computers can help. If we can automate this process, we give time back to the patients and the doctors, making the whole diagnosis process faster, and we can also improve segmentation quality.

The second area where computers can help: we are collecting so much data in this field. I have figures from 2013 — at that time, approximately 150 exabytes of data had been collected just in the healthcare sector, and that number was predicted to grow to over 2,000 exabytes by the year 2020. We're now in 2019; I need to check the current numbers. Now you may ask, what is an exabyte? For perspective, one exabyte is close to 250 million DVDs' worth of information — and now, in 2019, over 2,000 exabytes. That's a lot of data. So having high compute power to analyze all of this data is great; we are living in a great time, and this is where AI can hopefully help.

Now, the crux of this talk. Let's have a look at the data set we used to train our deep neural network. The data set comes from the Brain Tumor Segmentation, or BraTS, challenge of 2018 — an open-source data set provided by the University of Pennsylvania. The goal for our deep learning algorithm is to look at the 3D volumes and figure out whether a 3D pixel, or voxel, contains cancer or not: healthy tissue, or one of the three types of cancerous tissue — cancer or no cancer. A voxel, just to visualize it for you, is like a 3D pixel. Let's zoom in on one of the examples of that brain over there — so this is it: cancer or no cancer, and we can color-label the different channels by the type of cancer we're looking at.

I have a one-slide summary of the algorithm we implemented, and this is it. On the right we have the input image from the machine — one slice of the MRI scan. To train our deep neural network we needed labeled data: a combination of this MRI input and the middle one.
The middle image is what the radiologist has drawn by looking at the MRI input: he has marked the cancer areas. We got tons of these input images plus the labeled ones from the doctor; a combination of the two is what we call a set of labeled data. We use these to train our neural network so that it can do what we have here: the predictions, or labeled images. This is our goal — we want our deep neural network to start analyzing new patients coming in and telling us whether a person has a high degree of cancer or not, where the cancer areas are, and so on and so forth.

Let's have a look at the algorithm used. The model in this research is a U-Net model. U-Net is very, very popular, especially for medical imaging; the network looks like a U — that's why it's called U-Net — and it has lots of convolutions involved. A bunch of researchers from the University of Freiburg in Germany came out with this architecture, and it's really great — it works quite well. It works like an autoencoder: one side is encoding and the other side is decoding, and at each stage going through the network we're extracting features. That's how we can detect, at the end of the day, cancer or no cancer. So basically this neural network answers the question: to which class does a volumetric pixel, or voxel, belong — cancer or no cancer?

Now, you may think all this deep learning AI is complicated. Well, not really, if you look at the bird's-eye picture of the whole algorithm. It looks like this — very simple. Think of these as black boxes: an input data set comes in, so labeled data; it goes into the neural network; the next step is to train that neural network with the input data set; and once we have that, we have a trained model — done, easy peasy. Now that we have this trained model, a new patient comes in, we do inferencing, and we get the result. All of this looks great — but what did we use in order to make this happen?
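Before the tooling — the U shape described above can be traced at the level of feature-map sizes in a few lines. The sizes and channel counts here are illustrative defaults, not the exact configuration used for BraTS:

```python
def unet_shapes(size=128, depth=4, base_ch=16):
    """Trace feature-map shapes through a U-Net: the encoder halves the
    spatial size and doubles the channels at each stage; the decoder
    mirrors it, concatenating the matching encoder output (the skip
    connection) at each stage on the way back up."""
    encoder = []
    ch = base_ch
    for _ in range(depth):                  # contracting (encoding) path
        encoder.append((size, ch))
        size, ch = size // 2, ch * 2
    path = [f"enc {s}x{s}x{c}" for s, c in encoder]
    path.append(f"bottleneck {size}x{size}x{ch}")
    for s, c in reversed(encoder):          # expanding (decoding) path
        path.append(f"dec {s}x{s}x{c} (skip from enc {s}x{s}x{c})")
    return path

for step in unet_shapes():
    print(step)
```

The skip connections are what make it more than a plain autoencoder: they carry fine spatial detail from the encoder straight across to the decoder, which is exactly what a per-voxel segmentation needs.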
A bunch of software tools. Some of them: the Intel Distribution for Python, for best-in-class Python performance, of course. For the neural network framework we used TensorFlow, and since TensorFlow can be painful sometimes, we leveraged Keras as a very nice layer on top, making deep learning even easier. And then Horovod. Horovod is a technology by Uber — it's very interesting that the ride-hailing company came up with this piece of technology. What Horovod does is distribute work: it splits a job into multiple small subsets and distributes that work across multiple nodes, or machines, so that they work together. And if you look at the logo — Horovod is actually the Russian word for a Russian circle dance, one person holding the hand of the next person in a circle; that's what the logo shows, the dots being the heads of people holding hands. And that's the key message: distributed computing, one node talking to the other nodes and so on. With Horovod we split our training process across multiple machines so that we could train our deep neural network faster.

The second stage after training is inferencing. How do we make inferencing fast? Using a tool called OpenVINO — a very nice tool that makes inferencing easy and fast thanks to the optimizations it has in place.

Now let's have a look at some numbers I got going through this training process. You can imagine: we have this MRI input coming in from the MRI device; it contains high-quality images; it's a huge data set — large images with lots of detail — and it's very taxing, very intensive, for a computer to do all this processing, so obviously training takes a long time. Some performance results: when I used stock TensorFlow, the one from Google, on one machine, it took me 76 hours to do the whole training — 30 epochs through that neural network. I was not very happy with 76 hours.
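The distribution idea behind Horovod can be shown without MPI at all: shard the training data across workers, let each compute gradients on its shard, and average the results. Horovod does the averaging with an allreduce over real processes on real machines; this is just the arithmetic of that idea.

```python
def split_work(samples, num_workers):
    """Shard the data round-robin across workers, as data-parallel
    training does: each worker sees only its own subset."""
    return [samples[rank::num_workers] for rank in range(num_workers)]

def allreduce_mean(per_worker_grads):
    """The allreduce step: average each gradient component across
    all workers, so every worker applies the same update."""
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]

shards = split_work(list(range(8)), num_workers=4)
print(shards)                        # [[0, 4], [1, 5], [2, 6], [3, 7]]

# Pretend each of the 4 workers computed a 2-element gradient on its shard
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(grads))         # [4.0, 5.0]
```

Since each worker only processes 1/Nth of the data per step, wall-clock time per epoch drops roughly with the number of workers — which is what shows up in the timing numbers below.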
I thought I could do better, and of course the next step was to use a better TensorFlow — the one Intel optimized. That's the second line there: Intel-optimized TensorFlow. With just that change of TensorFlow package, I dropped almost 50% — a 2x performance boost, just from software — down to 43 hours. But I was still not very happy, so I started looking at distributed training: how can I use multiple machines working together to do the training faster? Look at the last two rows: with four nodes — meaning four machines — running Horovod with eight workers, I dropped from the 43–44-hour ballpark to 7.5 hours. That was great. And increasing the number of workers to 16: even better, 5 hours. With 5 hours I was more or less happy, given the huge data set that I had. So: from 76 hours to 5. The key message there is to optimize software and distribute your work — if you have a cluster, make use of the cluster. Why should you stick to one machine when you can use multiple machines? And use better software, of course: looking at just one node, going from 76 to 43 — for me that was mind-blowing; I really appreciated that, without much work from me, just from leveraging better software.

Now, plugging all of this into the big picture, this is how it looks: on top, what I wanted to do — solve a medical problem — and below, the software stack involved. Now you may ask, what is this Intel-optimized TensorFlow? It is the same TensorFlow code that Google releases; we take this code and plug our performance library into it. This performance library loves math. My U-Net does a lot of math-intensive operations, and that library — the Intel Math Kernel Library for Deep Neural Networks — boosts all those math-heavy computations so that I could do my training faster, as you can see from the numbers I collected. And of course I leveraged best-in-class Intel Xeon
processors in order to do that. I had four nodes, so four machines, four Xeon processors, so I could do training faster. Actually, when I said one node, one processor: no, it was actually two sockets, so two physical Xeon processors on one motherboard. So I had eight physical processors working together to give me close to five hours, dropping from 76. Now, you have seen that only one slice was being inferred. Imagine now, for that 3D volume that came in, every slice is being analyzed. If you had to have a doctor analyze them one by one, it's really expensive and time-consuming; that's too much, and that's just for one patient. An AI algorithm doing that for you is obviously much better, easier for everybody: for the patient, for the doctor. And if you're curious how all of this plugs into 3D, this is how I got there. I used this software called Mango to show this 3D volume, and that's the MRI brain originally: you can see all the slices stacked together, and you can see the volume of cancer there. For the doctor who has to do radiotherapy, or even surgery to cut the skull and go in there, he needs to know exactly where the bad cells are; otherwise he may destroy good cells and the person goes into a coma or something like that. So, breakthrough stuff. Now, if you're curious, this whole source code, even the AI software tools that I showed you on that slide earlier, that whole software stack, they are all open source. This is Intel's commitment to AI: we're going open source, free software, free tools. And if you want to have a look at my code as well, especially if you're in the medical field, I've published all my work on my GitHub, and that's the link there. You have instructions on how to get the whole dataset and how to get started; you can also reach out to me, from my GitHub you can get to me. And that's it, thank you very much for your attention. I'm open to questions. Could you use the microphone, because it's being recorded, it's easier. I'm sorry. What is the difference
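The per-slice analysis described above can be sketched as a simple loop over the 3D volume. This is a hedged, stdlib-only illustration: `segment_slice` is a hypothetical stand-in for the real U-Net inference call (which in the talk's pipeline would run through OpenVINO), here replaced by an arbitrary intensity threshold:

```python
# Hedged sketch: feed each 2D slice of a 3D MRI volume to a per-slice
# segmentation function, the way the 2D U-Net was applied in the talk.

def segment_slice(slice_2d):
    # Placeholder for real model inference: mark pixels above an
    # arbitrary intensity threshold of 0.5 as "tumor" (1) vs "healthy" (0).
    return [[1 if px > 0.5 else 0 for px in row] for row in slice_2d]

def segment_volume(volume):
    """Apply per-slice inference over the whole 3D volume."""
    return [segment_slice(s) for s in volume]

volume = [[[0.1, 0.9], [0.7, 0.2]],   # slice 0
          [[0.4, 0.6], [0.8, 0.3]]]   # slice 1
masks = segment_volume(volume)
print(masks[0])  # -> [[0, 1], [1, 0]]
```

Stacking the resulting per-slice masks back together is what gives the 3D tumor volume shown in Mango.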
between Horovod and Spark? Because you talked about the workers and nodes and all that kind of stuff. Okay, well, on the Spark side you leverage Hadoop and so on, whereas Horovod is a pure MPI-based package: what it's doing is splitting my work into MPI processes and sending them to the nodes, so there's no Spark involved in there. If you were using Spark, we have another solution; it's called BigDL, which is also a Spark application, and with BigDL you could do the same thing, splitting work into multiple chunks over several Hadoop nodes. That's the main difference. On a plain cluster, obviously, you cannot use Spark because you don't have that software stack there, and in this case you would use Horovod. One more: are there other U-Net architectures, or just one for this experiment? Very good question. So this example here is the 2D U-Net model. We have also tried the 3D U-Net, and if you go to my GitHub, which I will just bring up here, for you guys who are really curious, I totally recommend it, please do go there. You will also have my Horovod code in place. Where was my cursor... So this is the 2D version, leveraging the 2D U-Net, and I also have the 3D U-Net and training with Horovod. All the code is there, it's really nice. So I've tried both 2D and 3D U-Net. Thanks. Any other questions?
Curious about anything? Any Intel software tools or technologies that you would like to know about? Ask me any questions, I can answer them, hopefully. Who funds the cancer research? Is this just to demonstrate the capabilities, or... Okay, the question is who funds the cancer research. This stuff is done by us only, so it's coming from our own motivation to solve something. We have partners helping us, or even taking this code and using it, but there's no external funding, if that's what you were referring to. So if there are no more questions, let's thank the speaker one more time. And the next talk in this room will start at quarter to five. Whatever you say. So, how did you like the first day of EuroPython? Thank you. I'm really happy to announce the next keynote by Yeni. Yeni actually gave her first talk at the German Python conference last year; it was very well received, and her enthusiasm really was, I could say, intoxicating for everyone. That's why I was really happy to invite her to give this talk as a keynote at EuroPython, because I think Yeni is a very good example that stepping out of your comfort zone, starting conversations and talking to people is a very good idea. Please give a big hand to Jenny; she will tell you way more about all this, and you will be very intoxicated after her talk. Please welcome Jenny. Thank you for the great introduction. As Alexander mentioned, it was a very special moment: two years ago, one and a half years ago, I started out speaking at Python conferences, and it was the first time I got in touch with the Python community, and it feels really surreal to be able to give the keynote here with all of you today. So thank you for the opportunity, and I hope everybody had a good time today. Without further ado, let's get started. We're talking about public speaking today: why you should pursue public speaking, and how to get there. About me: my name is Yeni Chang, originally from Hong Kong, and right now I'm an engineering manager at Yelp in Hamburg. This is your menu for today: we're
going to be talking about why we want to talk about public speaking in the first place; and if it's that good, how can we overcome this fear, stage fright, it's pretty scary; how can we get better at public speaking; and, practically, where do we get started. I'm very interested in the topic of public speaking, and so I've interviewed a lot of my colleagues about their views on this. We have prepared a video for you, so please sit back and enjoy. Hey, hi, my name is Mario. Hi, my name is Ruchik. I'm Samoylem. My name is Tyriak. I'm Nicola. I'm Antoine. Hello, I'm Birgit. Your form is shitty. I've been working here at Yelp for about a year and a half now. I'm currently a working student at Yelp. I'm a product design intern at Yelp in Embrographis. I've been working at Yelp for kind of a long time and pretty much like what I'm doing: mostly working on backend, but also thinking about how to break stuff. I'm in the new core mobile API team. I work here at Yelp in Hamburg, Germany. I'm from Bangalore, which is a city in India known as the Silicon Valley of India. I'm a product designer at Yelp here in Hamburg, where we work on the business side of things. I'm working as a product manager here; I have been working in the office for a pretty long time. I work at Yelp as a software engineer and also study computer science at the Technical University in Berlin. I came here to Germany to actually explore different cultures, see how people live, work, eat, have fun, live life. So, I got into public speaking about three and a half years ago, when I started talking at small conferences back in India. It was mostly CocoaHeads; a bunch of us used to organize CocoaHeads in Bangalore. My experience with public speaking has so far been pretty limited, I mean, other than experiences of presenting something within the team. In public, usually I'm so concerned about the recording; and that's kind of one of the examples, it's more about also thinking of what I have to share with the community that is not already stated in the
world. Is that new, or is it just something old, and doesn't really make sense to invest energy to say? Is that, like, sort of a rhetorical question, or... I've been doing a lot of talks in the last three or four years, and I think it's great. As a product manager I do public speaking presentations once a quarter, or every now and then, basically in front of the office, or sometimes also for some bigger audience, including the San Francisco offices. It's great; I was doing theater as a kid, so there are a lot of similarities, like this play of being in public, articulating your thoughts, being a little bit funny sometimes, engaging with the audience and everything. The majority of my public speaking experience has been related to my profession, and I find that as I progress in my career, more and more opportunities have presented themselves for me to speak publicly, whether I want it or not: talking to more junior people in the company about, like, what we're doing, our tech stack, getting more comfortable with speaking to people publicly. I really like that feeling of people learning stuff and then attributing that to me. A few months ago I had somebody come up to me, and she told me: hey, I was there two years ago when you gave that intro talk to Python, and that was the first time I actually learned Python. And she was telling me: hey, that made me start programming more, I got an internship at GitHub. So this is really awesome to hear. Do you think you're funny?
I think, I hope to be, but I'm pretty sure I'm not. Every time before I start a talk there are a lot of butterflies in my tummy; that effectively means I'm fairly scared. I guess it's the case with the most seasoned public speakers as well; almost everyone has pretty much given me the same answer. Let's check the video. I do feel nervous every time when it comes to the actual event, but I think that's just part of the game. Most of the time, for me it's just, like, I either forget what I need to say, or I just, like, freeze, you know. I was working on tickets and stuff at work, and it was like, yeah, that's normal, that's easy; and then, talking to people, I was realizing that wasn't easy, that was only easy for me, or easier for me. So maybe that is also a thing that I need to learn: understanding that not everything that I see as simple, everything that I don't think is hard enough to deserve being discussed in front of people, actually, it could be a nice topic to talk about. Yeah, it's actually very hard. You would think, as a meetup organizer, that there are a lot of applications and I just filter them and choose the best ones, but most of what I'm doing is sourcing. I look at other series nearby and I look at who spoke there; I look at the news in, well, design, because it's a design meetup, in my city, and I try to find who is doing something interesting. So it's a lot of sourcing and trying to convince people to either give a talk they already have, or build a talk especially for us. So I think one of the things that I found really works for me is to stand in front of the mirror and give your talk.
I try to work on the way I breathe. I rehearse the same talk at least three times before I end up giving the talk. Anyway, it doesn't matter how large the audience is; even if it's a five or six member audience, it's still important that you do not, so to speak, break. Do dry runs, dry runs, dry runs. I think it helps when you prepare yourself a lot, like when you practice how you want to say certain things, because whenever I try to do a dry run, there's a lot of: okay, I don't really know how I want to describe this, or I don't know the English word for that. Knowing that the first time you're going to speak, it's going to suck: I think that's the best advice I got. My friend, he's been speaking for a long time, I think he's given hundreds of talks, and the advice he gave me is: your first talks are going to suck. While it sounds very cynical to give such a piece of advice, the important thing it did for me was to lower my expectations of the talk, and the talk actually was great. I actually want to start by going to more conferences, because that would be, I think, my first step: to just see other people public speaking, see the kind of things they're talking about. I would just encourage people to just go for it, rather than thinking how they are going to enter through the ladder and everything; like, jump straight in and you'll figure out how to swim.
All right, great, I hope you all enjoyed the video. Since I asked a lot of my colleagues about their experiences doing public speaking, I think it's also fair for me to give you a sense of why I like public speaking and how I got started. It starts a few years ago. I remember in university I was not half as keen about public speaking as I am right now. I remember the professor asking a question: you know, who wants to answer? I'd be sure to look away, look down, or send out any signal that I'm not ready to answer this question. I thought that whenever I talk in public, in front of people, something will go wrong; and plus, English is not my first language, so I always thought that I cannot be as eloquent as people who speak it as their first language. But that changed when I joined Yelp. That was my first job, and I remember my manager at that time gave me this feedback: hey Yeni, I think you have interesting things to share with the group, do you want to speak up more in meetings? That moment was when I realized that I need to get over this; I have missed out on so many chances already, just because I don't want to talk in front of people. But instead of trying to speak up a bit more in the next meeting, or the next and the next, because, you know, I tend to procrastinate, I decided to apply for PyCon Germany, and that was my first experience giving a public speaking talk. By the way, that video is still somewhere up there on YouTube, and I encourage you not to look it up, because the first talk is not always the perfect one, as Mario mentioned. But hey, even though I haven't grasped all of the public speaking tricks, I feel like I have already been reaping a lot of the benefits from it. As expected, it's easier for me to talk in meetings and present in front of the company: when you feel comfortable talking in front of 200 people, or maybe, right now, a thousand people, it just feels more natural to talk to your colleagues, and there's more trust as well. So that was easier;
I got that out of the way. And as I progress in my career, I also realize that being a software engineer is not just sitting at your desk in code all day; it's a lot about communication, and good communication too. For example, if you lead a project, it's a lot about getting your stakeholders, your product managers, your engineers on the same page, so that you know exactly what you're working on and what the statuses are. And even if you don't want to take on this role of leading a project or leading a team, even if you want to stay as a software engineer: when you have to design a system, you need to pitch to your colleagues that that's the way to go, right? So it's also about communication. And talking about communication, I want to bring up the next point, on crucial conversations. I read the book called Crucial Conversations; the definition for this is when stakes are high, that means it's pretty important, and when emotions run strong, so the atmosphere can be a bit charged when you have these kinds of conversations. What comes to mind is, for example, a salary negotiation. It's usually a little bit awkward, right? You think that you deserve a raise right now, and your manager thinks that you're not ready; or you think that it's the golden opportunity for you to lead this project, but the manager thinks that there is someone else who is a better candidate. These kinds of conversations are what I call crucial conversations, and actually, it resembles what we're doing here, public speaking. Because, first of all, stakes are high: I definitely don't want to mess up in front of all of you. And emotions run strong, because it's scary to be standing on stage; a lot of people have stage fright. And at the same time, since we are in the conference setting, I'll share this with you: it's a great trick for introverts to meet people. Instead of being awkward and, like, you know, hi, I'm Yeni, I'm from Yelp: now, after I give the talk, people come to me and do this awkward thing, so I save some work. Definitely recommended for introverts
in the audience. Now, I've given this talk a couple of times; I've also led a mentoring circle on why we're afraid of public speaking. So here are some of the results I've got; let's see if we resonate with some of these reasons. First of all, our heart racing: we all remember the time before we have to go on stage, and your heart is pounding very hard, you're sweating, your stomach is feeling queasy, a lot of nervousness building up in our system. And I would say that's only normal; this is just what your body does. For example, right now my body thinks that I'm in danger and I need to run away, so all the blood is being pumped to my legs right now, asking me to run away. But obviously my brain wants me to stay right here and talk to all of you. So one trick I'm trying is: I think about the last time I gave a talk. It was also pretty stressful on stage, but I managed, I survived; now I'm giving another talk, so it's totally okay. It's something that we can come to terms with: it's not your heart pounding or racing that scares you, it's how you react to this physical nervousness. Now, there's another thing we can do, which is finding our harbor: something that is close to our heart, something that we really enjoy talking about, because once you start talking about it, you forget where you are; you just really want to share what you want to share with the audience. For me, well, you have already heard a lot about myself, so you know I like talking about myself; that's my harbor. But for you, maybe it's different; for some people it's food. And we can try things out, because, I have to show you: usually when I give a talk I like having a wider stance, like this, with my two feet firmly planted on the ground. That's what makes me feel confident, that's what makes me feel powerful, and when you feel good on stage, that's when you give the best talks. I have seen speakers do this the whole time; I'm very impressed by how they can do it, but if that's what makes you comfortable, if that's
what gives you strength, go ahead. Another thing people mention is humor: how do you get your audience engaged, and hopefully nobody is falling asleep right now? It's by humor. But I want to point out that if you're not naturally a very funny person and you try to crack your first joke on stage, it usually doesn't work out very well. So rather than imitating speakers' styles, it's important to find what our own style is and be comfortable with it. Now, this one is a little bit scary. As Chidi mentioned, forgetting what to say on stage can be something that holds us back from trying out public speaking. But the good thing is, we have already learned a few strategies that make us feel less nervous when we're on stage, so we're off to a good start. But what if I told you we can actually prep for the moment we screw up? That's pretty cool, isn't it? So, we can rehearse for forgetting what to say. How do we go about that? First, when we do our dry runs, when we practice our talk, and we slip, it happens all the time: instead of restarting your timer and starting all over again, try to get back on your feet and think about a good comeback, so that when that happens on stage you don't freak out, because you have already practiced that moment when you slip. Another thing we can do is timing ourselves. For example, if you're giving a 30 minute talk, try to prepare 20 to 25 minutes of material. If you prepare too little, unless you really want to do a very long Q&A, it might get you stressed out because you want to fill up that time; or, if you overshoot and prepare a very long presentation for that time frame, you'll ramble and scramble to get to the end, and you'll see your facilitator raising signs, and that usually stresses people out. Now, talking about question and answer as well: a lot of people are nervous about having to deal with that, because it's something they weren't expecting. But the good news is, when we're preparing for this presentation, when we're doing all this research, we have read a lot of material on this subject,
so even if a question doesn't directly hit home, we can offer some alternate knowledge to resolve that situation. Or, and we'll talk a little bit more later about a feedback crew, people you can rehearse your talk with, we can source those questions from them. Another thing that can happen is potential problems on stage. One thing I try to do is rehearse the presentation without slides; that's something that can help. I've seen, for example, the projector not working, and if you have code snippets to show and your entire presentation is that, it gets you into a tough situation. Or, on speaker notes: there was this one time I tried presenting and I relied a lot on them, but unfortunately the stage setup didn't allow me to look at my speaker notes and slides at the same time, so I ended up having to do this, look here and look there. It was not a very pleasant experience for me, and I think for the audience as well. Bring local copies in case the Wi-Fi is bad, although the Wi-Fi has been awesome at EuroPython, thanks for that. And at the same time, bring your adapters, your dongles, because if you're using a Mac like me, you need 20 adapters. Now, what if, what if I actually blank out on stage? Just now I did a social experiment with you: did anyone actually think that I blanked out? I counted, it was around 8 seconds or so; I was drinking water, and it was only at the end that people started looking at me funny, like, what was she doing? So chances are, if you don't straight out tell your audience, I blanked out right now, they probably wouldn't even notice. A talk is 45 minutes long; probably half the people are tuning out right now, and it's okay, it happens. So that's a good thing for us. And at the same time, your audience is in general supportive. Who here in this audience just really wants to see me screw up this talk? Hands up, be brave. Whoa, that corner, dangerous. But as you can see, it's probably not the majority of the audience. Your audience is nice; they want to see
you succeed. And worst case scenario, if you couldn't get back on your feet: what happens is they'll clap until you get back on your feet and you can continue. So that's the worst that can happen, if you think about it. Now, you can also do things like: let's skip this slide for now, we'll come back to it later. Or, as I did just now, drinking water; it's also a good strategy to buy you 8 to 10 seconds of thinking clearly. Another thing that holds us back is being afraid to be exposed as a fraud. But I want to bring out this: there are a lot of ways to say I don't know. I've seen speakers doing this on stage: that's a very interesting question, I haven't thought a lot about it, let's talk about it after. Or something like: interesting question, does anyone in the audience want to answer that? There are a lot of things you can do, but still, actually, what I want to bring up is: it's okay to say I don't know, because you don't have to be the best in the topic to give a talk about it; it's just sharing your learnings. And in fact, I think it would be really boring for you if you're just sharing the same thing over and over again and not learning anything out of it. In fact, for the first talk I gave, on refactoring, the circumstance was that one of my colleagues gave me a really good review on some of the Python patterns, so I was thinking: that's really interesting, I want to dig deeper into it. So I signed up for a talk because I wanted to learn more about it. I couldn't claim to be an expert at that time, but every time I do this talk again, I get some really good questions from the audience, and they also point me to great resources, so I learn that way, and I hope you will too. Now, this one is pretty dangerous: "I'm not good enough". I want to share with you this idea of imposter syndrome. I think a lot of intelligent people have it, so I think I have it too. This is dangerous because, for people who have imposter syndrome, our perceived ability: we think we're that good, but we're actually this good. And it's dangerous because we're
walking away from a lot of opportunities that can stretch us, that can challenge us, just like public speaking. And I really like Mario's quote; he told us that your first talk will suck, so get over it. So yes, I really like that example as well. In fact, one of the former interns at Yelp, she didn't want to try out public speaking, so I asked her why that was. She told me she wants to be one of those speakers: those speakers who, whatever they say on stage, she'll believe it, and whatever they sell, she'll buy it. Cool, but you probably get there at your 30th talk, so you've got to start somewhere; even though your first talk might suck, that's just how it is. One phrase that comes to mind is: eyes on the stars, but feet on the ground. The feet on the ground part is important. And it probably went better than you think. One time I was in a public speaking class; the first assignment we had to do was an ad hoc speech, like, you go and give a speech about your favorite values or something. I was talking for one minute straight; I think I was straight-out rambling, didn't make a point, very illogical. But at the same time, they recorded the speech you did and we watched it together as a class, and I found out that actually, that one minute I was giving a speech, it was better than I expected; I was making some sense. So it probably went better than you think. Now, that leads us to the next question: how good is good enough for us to give a conference talk? Well, luckily we're all at a conference right now. Attend some of these conference talks and you can use them as a reference for how we need to give a speech. But going to more of these talks, I'm hoping you'll discover this: not all of these talks will be engaging to you. And that's just great news, because that one talk that you think wouldn't make it, that nobody would be interested in: well, here you are at a talk with a full house, and you're not engaged. So it might have a shot, it could actually be a hit; just try it out. So, after
talking about our fear of public speaking and how good is good enough, let's talk about how we can get better at it. Who here has read the book The Lean Startup? Great. The reason I like it is because I can apply the lean startup model to any talk I give; it's really nice. It talks about build, measure, learn and repeat, so let's apply public speaking to this lean startup framework. Let's start with build: how do we build a proposal and a talk that's engaging, one that the organizers will pick? I just want to debunk a common misconception: not all meetup or conference organizers are swimming in applications. As you remember, what Antoine mentioned in the video: he's the organizer of a design meetup, and a lot of his time is used to source candidates, to see who is willing to give a talk. So your goals might actually be aligned here: they're looking for a speaker, you're looking for a place to give a talk. But talking about how you build a proposal that can be more easily accepted: pick a topic that you're truly interested in, because if you picked a topic that you're only remotely interested in, maybe you can talk about it for 5-10 minutes; but between the application and your actual talk, you standing on stage, it might be months, and when you're standing on stage you'll fall asleep. So pick something that we're interested in, and that shows in your application. Another thing that might seem obvious: check the call for proposals form first to see what they're looking for. And one very smart idea that I found from Rafael on Twitter is that you can mention that you can modify your talk to a shorter slot or a longer slot, so now you're applying for different tracks. That's a very smart idea; I hadn't been doing that before. Another thing is to think about the who, what and how: the more vividly you can let your organizers imagine your talk and how it can bring value to the target audience, the easier it is for your talk to get in. Now, we also
want to leverage our own experience, because that's what makes your talk unique; and as human beings, we actually really want to hear about people's failures. So if you can share some stories, it'll engage the audience: you can share your failures using a library, your failures developing software; it tends to gather a lot of attention. That's a lot of build-up to applying for a conference talk, but don't get too discouraged if your talk gets rejected. My talks get rejected all the time; please don't take it too personally, because chances are there are multiple reasons out there. Maybe there are too many people talking about the same topic, or they might be looking for a different target audience; a lot of reasons. The good thing is, in Europe we have tens, twenties of conferences, so if it didn't work out for one, we can always try another one. Now, building the actual talk itself: this is my routine. First of all, rubber ducking. We know that from programming: it's talking through your code to a person, or even a rubber duck, and at some point you might find out why it's not working. I tend to do the same for some of my talks: I talk out the first ideas to make sure they're in a logical sequence. Then I go to my feedback crew. A feedback crew is a group of people you can trust to give you good critical feedback. Don't confuse that with a cheerleading squad: there are a lot of people who'll say, you know, this talk is great, very funny, very engaging; they're being polite. You want to find people who can actually give you good critical feedback, for example: well, your joke really doesn't work here, you might want to take it out; or, these two slides don't seem to make sense together, do you want to put that into a different section? Something like that. And last is to fine-tune our talks by knowing the audience. When your talk gets into a conference, then we'll know better what the target audience is. For example, is it targeting junior developers or more senior ones? Then we might want to adjust the content.
Or is it talking to a non-technical audience? Then we might want to change some of the vocabulary. I just want to quote Nikola on the dry run thing; he says: dry run, dry run, dry run. Good news for some people, bad news for the others. The good news is, if you think that you can't make it: well, good speakers are usually not born, they practice. The bad news is for people who think that they can wing it and do a very good job: well, no, it takes practice. So when you see speakers shine on stage for 5-10 minutes, it's a lot of hard work that they put in behind it. So, measure: we all want to know how good we are and how we can improve from there. We talked about our feedback crew and how we can gather feedback from them. Another source is your audience: after you give the talk, you can ask them how it went; or in some conference apps there are ways for you to evaluate speakers, and we can also get some feedback from there. You can watch your own presentation video. It might be a bit difficult, because not everybody likes their own voice, but if you want to cut out filler words, like "so", watching your own presentation video really helps, because it annoys you a lot to watch it. At the same time, take notes on the questions that you gathered from the audience, because they tend to be areas that people are interested in but you didn't include in your own talk. Learn: this one is actually very interesting. I have been giving this talk for a while already, but this little incident happened and I realized it's actually harder to take feedback than I imagined. The scenario was: after I gave the dry run with my feedback crew, my colleague gave me some suggestions to improve. He told me: your presentation style seemed a bit hectic, unlike usual. I was thinking: hectic? I spent so much time rehearsing for this talk, and hectic was what I got? But okay, this is something I'm still trying to get the hang of. There are two mindsets we're talking about. First of all is the fixed mindset: at that point, how am I doing? Then I would have thought,
you know, bummer, I totally messed it up; now my colleague must think that I wasn't prepared at all. That's the fixed mindset. But if I have more of a growth mindset, then I would be thinking: I'm grateful that my colleague gave me that piece of advice; now my real talk is going to go much better. And between the fixed mindset and the growth mindset, which do you think helps me get to my goal? Obviously the growth mindset, but yes, it's easier said than done. At the same time, when getting this feedback: what are the action items? It also depends a lot on the time frame. For example, if someone gave you this feedback the night before the talk, it might be better to just not take that advice, because it's a lot of stress if you need to diverge from what you have rehearsed. So if we have action items, try to take them one at a time, and also prioritize them, so as not to overwhelm ourselves with the feedback. Now, the most important part is to repeat. One thing we can do, since we have prepared so much, putting so much effort into making this talk good: next time we can give it again, so that we can get better from our experience. Now, jumping to the more practical part: where can we get started, how do we find these opportunities to practice public speaking? There are a lot of opportunities around us. It doesn't have to be you speaking on stage; it can be stand-ups, team meetings, presentations at your company, or even giving a speech at a party. As long as you have that spotlight on you, you're practicing public speaking. One thing I also think about is whether it's an internal opportunity or an external one: internal being your company, your school; external might be conferences, meetups. Personally, I enjoy the external speaking opportunities more, because I tend to know fewer of the attendees; no offense, but it's better that I talk to someone I don't know, as I feel less self-conscious. I want to get to know you all, but maybe after the talk. Another thing: I try to sign up for first-time
speaker-friendly conferences. A shameless plug right here: my friend Teresa and I are organizing a conference called Python Pizza. It's going to be in Hamburg in November, so if you want to try out your talk for the first time, it's definitely very first-time-speaker-friendly, and it'll be a very fun crowd. At the same time, there's going to local meetups, or, if you're from an underrepresented group in tech, there are a lot of meetups for us to go to as well, and the audience tends to be even friendlier.

Some people approach me asking about public speaking groups or classes: is that something we want to do? I would say anything helps, right? Essentially, giving a talk is testing your confidence, so if you go for improv or drama, it has the same theory on that aspect. But if you want to learn how to run, you don't go swimming. So if it doesn't work out very well with the improv group, it doesn't mean that you cannot give a prepared presentation; they are essentially two different things in that aspect.

Great, so we have covered some content today. We talked about the why of public speaking, how we can feel less afraid and overcome that fear, how we get better, and, practically, where we get started. These are the references I used to prepare for the talk, so if you're interested you can look up more of these resources. And before we close, I encourage you to apply for a public speaking opportunity within these two weeks and get started now, because if not now, then when? Thank you.

Thanks, Jenny. Thanks for the keynote. So, we have questions, we have microphones all over the place, so now you can queue up and make your very first speaker presentation by asking a question.

You mentioned you repeat your talks. Do you stay interested in the talks you're giving? My problem with public speaking is that I'm only interested in giving a talk when I don't yet know at the beginning what I'm going to talk about, and the moment I've given it for the first time, then it's boring.

I think that's a good point. That's why I tend to pick something that I'm really interested in before getting committed to it; preparing for the talk also takes a lot of time and effort. But at the same time, one thing I look into is how to perfect that talk. For example, I talk to the audience afterwards; from the Q&A I get a lot of interesting information about what I can do to improve, and I tend to incorporate that into the next talk to keep it interesting. So every time I have a little bit of new content here and there. That's how I keep myself engaged. More questions? Yes.

Congratulations, and thank you for the excellent speaking. I know that it should be a question, not really talking, but I'd like to make a comment on the comment of our colleague, because I have basically the opposite approach to him. I love to talk about something that I didn't know before, and the first time I give the talk, when I finish, I say: my god, how horrible, I have to beautify this, this and this, I should have had that reference. So if I had the opportunity to give the same talk for the 20th time, all of them would be different and hopefully a little bit better. We are human beings, we are all different, and I believe any one of us would have different approaches to the same questions. But, to avoid leaving without asking a question: are you feeling fear at this exact moment?

Good question. I think in the beginning, yes. In fact, I was about to make a joke about taking my pulse, but I forgot, probably because I was too nervous. So it's a good time for your joke, right? But at the same time, as I mentioned, getting into your harbor and talking about something you're familiar with, for example public speaking, a topic that I'm really interested in: at some point I forget that I'm talking in front of a big audience, and it gets better. So I guess right now, at this very moment...

Thank you very much, Jenny. We have some time left, so let's ask: who has given at least one talk in their life? That's
interesting. Okay, and now we have to check: who has never given a talk before? No? Come on. Well, we can do the math. Yeah, it's interesting, at least one third of the conference. So thank you again, Jenny, for inspiring everybody to give at least one good talk in their lifetime. Thanks for coming by, and glad to have you around. Thank you. So, we're setting up for the lightning talks now. You're up. EuroPython 2019! You can dab; we've got some dabbing. Did you only learn that today?

Hi, my name is Mark, and this is my friend Chuck. We are going to be introducing the lightning talks. This is the best part of EuroPython, except obviously for the last talk, because that speaker is probably still here. In a moment I'm going to hand it over to Chuck to explain the rules for the lightning talks, for all the newcomers who stuck their hands up this morning.

In fact, I will do that now. It's very important to know the rules, in case you violate the rules. Any talk that is promoting another conference gets only two minutes instead of five; other talks, ones not promoting a conference, get five minutes. Also, please remember we are a friendly community, so only positive noises, please; don't boo people. As usual, if a speaker has almost run out of time, we will do this (can we practice? yes): we make these annoying passive-aggressive noises to annoy the speakers, so they know that they have to finish. And then when they are finished, of course, a big clap; you can cheer if you want to. Thank you.

I'll hand it back to Mark. She stole my bit, so we are going to practice that again. There are three positive noises you will make at the end of every lightning talk this evening. The first noise is the clapping: can we have a big loud clap, please? The second noise is the whooping: can we have lots of whooping, please? Whoo! The third noise is the foot stamping: can we have lots of foot stamping, please, on this lovely wooden floor? And now, Radomir, if you would like to take your place.

We have inserted an extra slot in front of the lightning talks this evening, and we are going to do this every night this week. Radomir, who is the inventor of the PewPew system which we have all been given as part of our conference, is first going to give a quick talk, I believe about how the PewPew device came to be and how he developed it. (Actually, I have a talk about that on Friday, so I am just going to explain how to use it.) Awesome. And then we are going to have some quick demos of software that people have written in the last few days on their PewPew devices in the workshops. We would encourage people over the next couple of days: if you would like to come up on stage at the lightning talks and give a quick demo of something you have built on your PewPew, talk to Radomir, he will be around the conference, and we will get you up here to show everybody what you have done. So that is an extra kind of lightning talk within a lightning talk. It is going to be great. Can you make those three fantastic noises for Radomir, please?

Okay, so I only have a few minutes, so I will try to be very quick. You have probably already received these things, and if not, you can get them at the reception downstairs, and you are probably wondering what they are. They are a device that was designed to make CircuitPython, well, to make Python games at a workshop. The problem with workshops is that everybody brings their own laptop, and everybody has something different installed on it, everything except the thing that you need for the workshop, and you usually lose half of the workshop time just installing things and getting them to work, and so on. So once MicroPython was created, and after that CircuitPython, which is a more beginner-friendly version of MicroPython, I figured: why not create a device that has everything installed on it already? Then you just connect it to the computer and use any text editor you have on there to program it.

That's easy to do, because with CircuitPython, when you connect your device to the computer like this and switch it on, it comes up as a USB drive called CIRCUITPY, and the files you see on the USB drive are basically the Python code that is running on the device. There is also a readme file in there with a link to the full documentation (that link is also printed on the back of the device) and a basic summary of the API for the device. If we visit that link, we get the documentation on Read the Docs. Of course, there is the same library reference you have in the readme, but also information about the device itself. This page shows version 10.2; this device is actually 10.3, but electrically there are no differences. So everything is online. There are links in the community section to the mailing list, to the Discord channel for CircuitPython, and to the GitHub repository where you can find all the code and all the designs. That's not important right now; the important thing is that there is a bouncing-ball tutorial in there which you can follow to program your device yourself, a simple demo of a pixel going around the screen. It's a really step-by-step thing, so hopefully it will be possible to follow it in 20 minutes or so, and then you will be ready to write your own things. We already held workshops over the past two days, on Monday and Tuesday, helping people go through this tutorial and also develop some new things for this device. But it doesn't have to be games: you can display some awesome animation and wear it through the conference to show off your programming skills, or have a slogan scrolling there to show your political stance or whatever. But be nice. So, we are going to show some of the things that people have made. Do we have the camera?
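The bouncing-ball tutorial described above is, at its core, a pixel that moves across the PewPew's 8x8 LED matrix and reverses direction at the edges. Here is a hardware-free sketch of just that logic; the real tutorial draws on the display with the `pew` library, so all names and details here are illustrative, not the tutorial's actual code.

```python
# Hardware-free sketch of the PewPew bouncing-ball idea: a pixel moves
# across an 8x8 grid and reverses direction when it hits an edge.
# On the real device each frame would be drawn with the pew library;
# here we only track the coordinates.

WIDTH = HEIGHT = 8  # the PewPew display is an 8x8 LED matrix


def step(x, y, dx, dy):
    """Advance the ball one frame, bouncing off the walls."""
    x += dx
    y += dy
    if not 0 <= x < WIDTH:   # hit the left or right wall
        dx = -dx
        x += 2 * dx          # undo the overshoot, move back inside
    if not 0 <= y < HEIGHT:  # hit the top or bottom wall
        dy = -dy
        y += 2 * dy
    return x, y, dx, dy


# Simulate a few frames, starting in the corner and moving diagonally.
state = (0, 0, 1, 1)
frames = []
for _ in range(10):
    state = step(*state)
    frames.append(state[:2])
```

On the device itself, each iteration would also clear the old pixel, light the new one, and wait a frame; the bounce arithmetic stays the same.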
Yeah, so we will have a camera here, so that you can see it on the big screen, hopefully. Maybe I will just hold it, and we can use advanced programs; maybe VLC should work. Or I will hold it, and you can stand here, please, and demonstrate your program; I will hold the microphone, and you explain what it is.

Can someone guess what movie this is? Yes. Well, this one is actually non-interactive, so if I press the buttons nothing happens. Or does it? Well, you can try it yourself, because I will publish the code; it's very short. And yeah, that's that.

Okay, let's see if we can shadow this. Okay, now you can see this. So, with the PewPew we got a lot of nice games to play with, so I thought we needed some serious software as well, and that's why I wrote a text editor for the PewPew. In the top corner you can see the cursor blinking; in the bottom right corner you can see another cursor blinking, showing you that you would need to scroll down to see the rest of the text. What it understands, of course, is Morse code, because a Morse code character is usually at most 5 to 6 elements, so it fits perfectly on the PewPew. The Morse code for the letter E is just "dit". So if I now play back the message, you can read that it actually works. So hold on... oh, come on. I will just go down here. Does it work better?
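Since, as the editor's author notes above, every letter's Morse code is only a handful of dots and dashes, a single character's code fits on one 8-pixel row of the display. A minimal sketch of that encoding idea might look like this; the table and helper names are illustrative, not the editor's actual code.

```python
# Sketch of the Morse idea behind the PewPew text editor: each letter's
# code is short (at most 4 elements for the alphabet), so one character
# fits comfortably on an 8-pixel-wide display row.

MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".",
    "f": "..-.", "g": "--.", "h": "....", "i": "..", "j": ".---",
    "k": "-.-", "l": ".-..", "m": "--", "n": "-.", "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.", "s": "...", "t": "-",
    "u": "..-", "v": "...-", "w": ".--", "x": "-..-", "y": "-.--",
    "z": "--..",
}
DECODE = {code: letter for letter, code in MORSE.items()}


def encode(text):
    """Turn a word into Morse, one space-separated code per letter."""
    return " ".join(MORSE[c] for c in text.lower())


def decode(morse):
    """Turn space-separated Morse codes back into letters."""
    return "".join(DECODE[code] for code in morse.split())
```

The playback the speaker demonstrates would then just walk through the codes, lighting pixels for dots and dashes.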
No, it's worse. Okay, I will give out a link over Twitter where you can download this, because it means that at parties you can set the text of your PewPew on the fly.

Hey, it's me again. So this is my device. Oh, I found it... weird, it's inception. Okay, so it's a heart beating, and it can go very slow, or very fast when I'm excited; it can be small; I can stop it (my heart stopped beating); it can still be small but stopped beating. That's it.

Because he made a 3D engine for it, it deserves a lightning talk of its own. So thank you very much, and we will repeat that at every lightning talk session: if you have made anything with the PewPew that you would like to show here, please come to the next lightning talks, queue up there, and we will let you show it. Thank you. Thank you, Radomir.

So, you probably watched that and thought to yourself: yes, I need to get up on that stage. How do I get up on that stage to give a lightning talk? Our system, and I haven't encountered another conference with this system, but we've been using it for about 12 years now: there is a big sheet of paper downstairs, on a column by the reception desk where you got your badges. If you go along in the morning, as early as possible, you may find some space on it, and you can use that space to write the title of your talk, your name as clearly as possible, and ideally your Twitter handle so we can get in touch with you, or your Telegram handle if you're in the EuroPython group.

Are you ready to go? There is nothing else on the screen, though. Technical difficulties. So this is when I tell a joke about a tractor. There's a farmer who lives near me (I'm going to do it, they love tractor jokes), and he has this vehicle, and he uses it to plow perfect circles. I think it's a protractor. Oh, come on. It was a terrible joke. I'm not telling that other tractor joke, that's not my joke. How are we doing, are you getting any closer? Are we getting any feed into the matrix?

I have an ice cream sprint. Who went to the ice cream sprint today at lunchtime? I think I've got to do it tomorrow, because I found an ice cream place which is around a 10-minute walk from here, near the river, near the bridge, and the ice cream, I would say, is the best in this city, even though I'm only on my fifth day here, but I think it's the best ice cream in the world. You're welcome to join me tomorrow at lunchtime; I will post in the Telegram group. But it's cash only, so please prepare cash, and I won't pay for you.

More seriously, in terms of the lightning talks: there don't seem to be very many people here who haven't given a talk, but if you haven't given a talk and you would like to, the lightning talks are a great opportunity to do that. There's loads of energy in the room, you can talk for 5 minutes or less (it's really important), you can stand up for one minute if you want to, and you will still get a huge round of applause from all the wonderful people in this room. And that is my lightning talk.

See, normally I use these slots to encourage people to give talks, but it turns out that everybody in the room has already given a talk, and we've just had a keynote on giving talks. I think Alex asked before who hasn't given a talk. Who hasn't given a talk ever, ever, ever before? Oh, where are the hands? Yeah, I see some. So the lightning talk is the perfect place to start your first talk. I could talk about my experience at PyCon Namibia. I went to Namibia (I think somebody is going to present about PyCon Africa later), and my experience of going there was amazing. On the first day of the conference, people were like: oh, what's a lightning talk? For some of them it was really their first Python conference, while some of us had travelled all the way to Namibia and were already speakers. So, okay, we'll demonstrate what a lightning talk is, and we gave lightning talks about anything. And then, the next day... first of all, Namibians are lovely people, they queue up for lightning talks. On the second day there were some students; one guy just went up to the stage with no slides, grabbed the mic, and he was like: actually, today my brother told me to come to this conference; I came to this conference, but I have no idea what Python is. And then he talked about his first Python experience ever. Oh, I'll tell the story later, I just need to get my video playing. Hello, video. Here it is. Alright.

Hi, I'm Christian, and I wanted to show you something. (No, I prefer this one, because I need my hands.) I wanted to show you something I did on my PewPew, but it's not something I did yesterday, because I've actually worked with the PewPew before, so I wasn't in the group that showed their stuff earlier, and I did this before I even knew that there would be PewPews at this conference. So it was a nice surprise for me to see that every one of you will now get one of these devices too, and will be able to do the thing I did as well. I've been playing with MicroPython for a while, and in particular I love these PewPew devices; as you can see, I own a couple of them. So one day I asked myself whether I could make a 3D game on this device, and so I sat down one weekend and gave it a try, and to my own surprise it worked quite well. You can try it for yourself; here's the download link. It will work on your conference PewPew, although slowly; you can make it run a bit faster if you install the latest beta version of the CircuitPython firmware (there are instructions for that in the documentation, at the link you can find on the back of the device).

I want to give you a quick glimpse into how this was made. How do you do such a thing on such a small device, with not much computing power, in a language like Python that is not optimized for runtime speed but
for development speed? The usual approach for such games, in the days before 3D graphics hardware, was a 2D raycaster. 2D, because the map is actually two-dimensional: all you can do in the third dimension is walls of a constant height. You can texture the walls (it's a bit hard to see here, because my camera has trouble with the bright display); you cannot texture the floor and ceiling, those are just flat colors. Raycaster: I'm not going to go into much detail on what that is, but it's pretty simple and involves some geometrical computations, like finding intersections of line segments.

But there are two problems with this approach on such a small device. First, these geometrical computations are most easily done using floating-point numbers, and the microcontroller in here does not have a floating-point unit, so floating-point math is slow; we only want to do integer math on this device. Second, to do this efficiently, you need to put your geometry into some kind of data structure that makes it easy to look up what is where in space, and such data structures take memory; we don't have a lot of memory on here either.

So the approach I came up with is: let's pixelate the world. Pixels are naturally integer-based, and my map is now basically a pixel image instead of the vector image it was before. Pixel images take some memory as well, but maybe I can make the pixels coarse enough that it still fits into memory; after all, you can't see much detail anyway on this small, low-resolution display. And instead of doing exact math with these lines, I now draw pixelated lines. To draw pixelated lines there is an efficient algorithm called Bresenham's algorithm that uses only integer math, and in fact only additions, not even multiplications, so it's very fast. And that's it, basically, for the principle. Implementing all this took little more than 100 lines of Python code. If you want to know more about it, talk to me, I'm here at the conference, or contact me on Twitter, or email me, or anything else. And if you're a better game designer than I am and are able to make a nice, interesting game, that would be awesome, because I myself still haven't quite figured out what works and what doesn't on this low-resolution display. Thank you, and have fun.

That's amazing. That's like playing Doom or Minecraft on that thing. Very good.

I'll finish my story quickly. So, that guy, at his first Python conference, or maybe his first Python experience... oh yeah, we are very smooth, so I'll continue later. Are you ready? Audio? I think HDMI includes audio already. Yeah, should be fine. Okay, while you're sorting it out, I'll continue my story. So, that guy was having his first Python experience, and he was enjoying it, and then he gave his lightning talk, and in the talk he was like: everybody just told me that I could say something for five minutes, and so I did. And I was like: whoa, really? If he can give a lightning talk, I think everybody can give a lightning talk. So please sign up tomorrow, and be there early, because I'm sure everybody wants to give a lightning talk, so you have to fight for your own spot. (Oh, you have to change the settings.) So, who wants to give a lightning talk tomorrow? If you want to, let me know. Yes? No way. You are hiding it, because you will secretly wake up at 6am and come here to sign up, right?

Can we hear the music? Yeah. So, the social event on Thursday: who is going to the social event? Yay. If you were here for the training days: it will be back at the university, and then outside; it's lovely. I've seen there's a ping pong table there, so bring your ping pong balls. (Can we hear it? Okay.) I'm terrible at ping pong, right? Even though people think that, you know, since I grew up in Hong Kong, I'm Asian, ethnically Chinese... I suck at ping pong, I can't play ping pong. I'll be
chasing the ball around because it bounced away. So, there's a story. At my previous job we had a social event with my colleagues, and we went to a place that's very famous in London, called Bounce. They have all these ping pong tables, so people drink there and also play some ping pong, which is fun, with all these neon lights and stuff. People assumed that I would be very good at ping pong, but in the end they all just laughed at me, because I was drunk and playing ping pong: the ball was hitting my head, flying behind me, and I was looking around trying to find the balls. It was hilarious. So please don't get me drunk on Thursday and then ask me to play ping pong, because it would be a disaster. Enjoy Thursday, it will be great; I hope it's not raining. We will be having grilled sausages, I guess, if I paid attention this morning. That would be great. I love German sausage; it's very tasty, better than the British ones, I don't know why. Yeah, seriously, I'm talking about the food. Last year I went to Germany three times, right? The first time, I went to Hamburg and... oh yeah, okay.

My name is Moise, I'm from Brazil. I actually prepared this lightning talk for EuroPython last year, but I couldn't make it in time; there were too many lightning talks. This is something called FoxDot. Basically, it plays some sounds. I'll skip this one; zero was basically the same one you were listening to. You can change the note, you can go through a list of notes, or a chord, if you know which are the right ones to play. But if you try to do some math with it, it will break, as you can see down there in the console. But if you create a pattern, you can then... oh, I skipped some... oh, I didn't. Yeah, this one: if you play a tuple, it will play all of them at the same time, and you
can make some math with it. But it's not just this boring one; you can also have a bass or some drums. Okay, I'm raising my heart rate here, as the previous talk was saying. I set a scale and some stuff here, and I created some notes, so this is a bass line. It might make some sense, but we don't yet have the right amount of silence, because music is not just about sounds but also about the space of the silence. Not yet. I also have some chords, which I will stop, and I also have a beat, and now I can play everything together. What is this supposed to play? Huh? Yeah, I always make this mistake. You might know this one... you guys know this one, right? And this is my sock from PyCon CZ 2017; I have something special in it. That is what happens when you spend a year preparing for a lightning talk.

So in a moment Miroslav is going to give the next talk; he doesn't need any preparation or setup. But could we have the conference speakers who have their presentations on the Google Drive up at the side of the stage, please, ready to talk next? Thank you. Take it away, please.

Hi, my name is Miroslav Šedivý, and this is how my keyboard looks. Two years ago, at exactly this spot, well, 600 kilometers to the south, I stood like this and tried to tell you a little bit about the keyboard layout that I am using. Is anybody else using the same keyboard layout? No? Because I have looked over your shoulders today, and I have seen you have learned nothing. So, this is the US standard keyboard layout, which is perfect if you want to type the name of that famous Welsh town; it works perfectly well as long as you type in English or similar languages. The German keyboard looks a little different on the right and at the top, and if you want to type the sharp s (ß) in German, you can, because it sits where the US layout has its question mark; but then the question mark, and many of the symbols you need for programming, move somewhere else. The French keyboard rearranges the letters themselves, so typing and programming get harder still. The Spanish keyboard, if you need to type the opening signs of interrogation or exclamation (¿ and ¡), is even more different and more difficult to learn to type and program on. The keyboard in Poland is very similar to the American one, but if you want to type the Polish letters, with their dots and strokes, you need the Polish layout. The Italian keyboard is not easy to program on either, because some characters are simply not typeable on it. The same story in Swedish, in Arabic, in Esperanto (an international language with some special letters), in Turkish... and we have only scratched the surface. So imagine that every day you type in a few European languages, and look at how your keyboard, and your brain, would have to look, because you have to switch between all these keyboard layouts, know them, and remember them.

We can work around the problems of international keyboards by using character maps and copy-pasting characters, but that doesn't work very well if you want to type fluently. Fortunately, in the 70s and 80s, keyboards had a Compose key. A Compose key means that I just press Compose and then two other characters, and the combination produces a new character, covering all the special characters needed in other European languages. It works for German, French, all the languages we have seen until now. How does it work? It is actually already in your system: in that file you have over 6,000 lines with all the combinations, which you can have a look at. If that is not enough and you want to type some obscure languages, or emoji, you can always define your own XCompose file in the same way. And if your keyboard doesn't have this Compose key, you need one: you can repurpose some key that you don't use every day, like the right Windows key. All you have to do is run setxkbmap with the compose option, mapping it, for example, to the Menu key, and then you can use it as a Compose key.

While I'm already standing here, we are going to hack another key. Which one is it? Yes, it's Caps Lock. We can make a Control out of it, because it's just left of the A, right at your pinky, which is very convenient if you want to press Control, which is usually very far away. What we do: we just say our Caps Lock is a Control modifier. But that's not everything, because our Caps Lock can take on the functions of two keys: it can be Control and it can be Escape. Control you usually press and combine with another key; Escape you just hit, press and release immediately. So you install xcape, and then your Caps Lock will work as both keys at the same time. Then you put some nice stickers on, and you are done. Thank you very much.

How long did you prepare for your talk? I think that took more than a year. So, does anybody want to continue the sausage story? Sorry, you are forced to listen to my sausage story. Last year I went to Germany three times. The first time, I went to Hamburg, and I was like: yes, German sausages! So basically, when I woke up, I looked for a place where I could have sausage, and the whole day would be about finding places where I could have sausage, and at the end I would go near my hotel, where there is a little store that sells sausages. So I would have sausage from breakfast to, like, drunk food; it would all be sausage. Now I try different things, not just sausage: schnitzels and pretzels. I like food. So yes, some other things... what should I talk about, sausage? Maybe. Oh, I don't have to talk about anything? Good.

Hello. What did the farmer say when he lost his tractor? Where is my tractor? Okay, go ahead.

Hello, everyone. What do you think will happen if you
end up in Berlin, you are a lady, and you are a software developer, a Python developer? The thing is, you will probably start looking for friends if you are alone, and then you will probably go to the Meetup page looking for some communities. They are welcoming, they are open and nice, and I found some communities I was interested in, but I ended up being asked whether I was really a software developer, or surely an HR person instead, because I'm a lady. The problem is, I was confused: I am a software developer. And then another girl, because apparently we were just two in the whole meetup, and that's why we were recognized as not being software developers, because we are non-men people... so the other girl approached me and said: wow, hi, nice to meet you, I'm so happy we can talk about software now. That was embarrassing.

Then I found out about PyLadies, and it was a very nice community. Everyone was so open, and all ideas were appreciated; I could speak as a girl and a software developer, I could be myself. But I was lacking some kind of exchange of ideas with different people. We all think differently, it doesn't matter if you are a man or a woman, we have different ideas, and that's great. And to be honest, in reality you are expected to follow those rules: look like a girl, act like a lady, think like a man, and work like a boss. And here comes the truth: I want to be myself, a lady, a girl, a woman; I don't want to be a man in any of those stories. So the idea was to build a community with all the support, being friendly, open, inclusive and open-minded, a safe environment for everyone, so that ladies would be happy to come and share ideas, and they are not going to be asked whether they are really HR rather than software developers. And here is the answer, here is the community. I am one of the organizers, together with a few more people; we are open-minded, we have at least 30% ladies in our community, and they feel safe there. We are based in Berlin, and we know how to get to the top of the TV tower in Berlin with Python; that's where we held our events. We have different kinds of formats: lightning talks, 30-minute talks, and also panels with open questions, where you can ask whatever you would like to know. We started recently, in February this year, and every time we have new people, around 70 new people at every meetup. I was surprised how many developers there are in Berlin, especially in Python. Usually we organize events together with PyLadies, to be supportive and open. The next meetup is going to be on the 23rd of July in Berlin; if you are interested in participating, or willing to contribute to our Ask Me Anything panel about how to be a senior, please follow this link. And if you are interested in being a first-time speaker, or maybe you are an experienced speaker just looking for opportunities: we are not looking for professionals, we are open to helping everyone prepare their talks. Okay, I am finishing, and I will leave you with a mystery: one organizer left us for good, and this person is in this room; you can find him. Thank you.

Hi, I'm going to have to be quick. I wanted to remind you, because I'm sure you already know, about the very first PyCon Africa, which will take place next month in Accra, in Ghana. I'm part of the organizing team. There have been numerous African PyCons; I went to the 5th PyCon Namibia this February, I was at PyCon Ghana last year, and there are many others to visit, but this is the first pan-African PyCon. Here's the organizing team: there's Marlene Mhangami, director of the Python Software Foundation; Aaron, who helped put Ghana's first satellite into space; Aisha, who gave a keynote here at EuroPython with me in 2016; Michael; Abigail, part of the Django Girls team; Noah; and me. These are people who have organized multiple conferences between them in the past. We have
speakers from around the world. There's Anna Makarudze, a Django Software Foundation director, in fact vice president, and Moustapha Cisse from Google's AI lab, the head of the AI lab in Accra. I'll have to move on because we don't have much time. We've got some great sponsors, some of whom are here; Nexmo just confirmed today, so thanks very much to them. We need more sponsors because we need more funds, especially for financial assistance. Just to give you an idea of how people travel: folks from the Nigerian community will be taking a 10 or 11 hour bus journey together from Lagos to Accra and back again. Can you help us with sponsorship? We'd love some more money specifically for financial assistance. Even if you can't sponsor, we have a GoFundMe page, so please take a photo of that. If you can contribute a little bit of money, it will help someone: some of the costs are relatively low, others, such as travel, are higher, but every little bit will help. We need an extra 6,000 or so euros to cover all the financial assistance grants that we want to make. So take a note of that, please: africa.pycon.org. Perhaps you're even thinking of coming; come and talk to me about that. I'll be around until Saturday, and I'll be very happy to talk to you about PyCon Africa or any of the other PyCons that take place in Africa. Thank you very much.

Mark, it's time for a joke now. I do have another tractor joke. Oh my god, this is so messy. How did the farmer find his lost cow? He tracked her down. Just enough time.

Hello. So picture this: you're in Mexico, in an all-inclusive resort, 30 degree weather, on the beach. Conference or vacation? Why not both? I'm here on behalf of the PyCon LatAm organizing team to invite you to come to Mexico. We're going to have the first PyCon LatAm from August 29th to the 31st. The call for proposals is already closed, but sponsorship opportunities are still open, volunteering is still open, and financial aid... financial aid just closed two days ago. Let me think of more details... I think that was about it. I made this as a 22-second presentation at PyCon, so here I just have too much time. Big round of applause for being quick!

Please, no jokes. When is a tractor not a tractor? When it turns into a barn. That was really good timing.

Okay, here we are. I think many of you have heard about PyCon DE; there have been 8 PyCon DEs so far. And you have probably also heard about PyData Berlin; I think they started in 2014. And what happens if you bring two good things together? One even bigger great thing. So we talked, and we decided it's time not only to run specialized conferences, but also to get back together again, like here at EuroPython: the Python and the data science communities. So we have the conference in October at Kosmos, Berlin. I want to give you some facts about our proceedings so far. We already had a CFP; it's closed, sorry. We had around 450 submissions, which was, in a way, oh my god. With the help of Arthur we were happy to run a community voting; we got around 33,000 votes, which helped us create the program of the conference. Major sponsorship is already sold out, sorry, but diversity sponsorships are still available, and thanks to the sponsors we will offer free child care. Also, two thirds of the tickets are gone already. I think it's really nice to have these communities working together; we also decided this to save some resources on the community side, because running a big conference is a lot of work. So, see you there; well, not all of you, because we don't have enough tickets for everyone. Sorry.

Is there one more? No, there's not one more; there's now the raffle coming, isn't there? Yeah, well, you're plugged in, so go. Okay, very good, I'm already basically set up. We have some giveaways, and it
was really challenging to think about how we could distribute giveaways in a really fair manner. So we thought: let's have a little raffle game. We just made this up, so here's how this is going to work: we have PyCassandra. PyCassandra will ask you a question; if you can answer that question with a clear yes, please stand up. Then we will narrow it down until, hopefully, there's one person left standing. Okay, let's do it, it's really easy. Okay, let's ask PyCassandra. Stand up if you have ever contributed to an open source project. The next question: keep standing if the day of your birth is odd. Odd, the day of your birth is odd. Okay, keep standing if you traveled less than approximately, oh wow, 1100 kilometers to come here. Okay, keep standing if your first name contains the letter C, upper or lower case, it doesn't matter. How many people are standing? Okay, keep standing if your birthday, the full birthday, day, month and year, with zero-padded days and months, contains the digit nine. Everyone? Okay, so let's run it again; it's time for recursion, and recursion in Jupyter notebooks is really easy, you just go up the slides again. Keep... okay, we skip this one. More than... okay, more than 32. Okay, that's basically also, in a way, a country, let's skip that. Okay, now with the names... oh, we have a winner. Two? Yeah, but we have three. Yeah, but we have 14 books. And we have a book on... okay, we have more books later.

So, I've been walking around the conference, and there are obviously lots of groups of people. Anybody who was here this morning for the introduction will know that there are a lot of new people, new faces here at EuroPython, and this can be a slightly overwhelming environment. There are 1,200 people here, and if you come and you don't know anybody, that can be kind of intimidating. And there's a really simple thing everyone in this room can do to make life easier for the people who are trying to make new friends, and it is: do not stand in closed circles. There's a thing called the Pac-Man rule, which is recommended at quite a few conferences now, and the idea is that if you're standing with a group of people you know, leave a gap, just a gap for one person. Then if somebody comes in and joins you, somebody steps apart again to make another gap. This way you make your conversations welcoming for other people to join.

You don't have slides? I don't have slides. I'm going to offer more advice in a moment, but if you would like to start your talk... So, the topic of my lightning talk is: is that the Python t-shirt you are wearing? If you love your job, don't listen to me; spend these five minutes checking the events in Basel tonight, there is dancing, or check your phone, or sleep, whatever. But if you don't like your job... I went to a Python meetup in Germany a few months ago, and when I was coming back I was talking to a few women, socializing, and I asked them: hey, what do you do for work? Do you like your work, is it fun? And one was like: no, my work is not fun. It cannot be fun. What do you mean? It can be not well paid, it can be stressful, it can be okay, it can be whatever, but not fun; why do some people say that, it's weird. And the other one confirmed and said: yeah, true, it cannot be fun. And everything inside me was screaming: that's wrong, it's just so wrong. That's why I'm giving this talk. My job is fun, and it has always been fun. I do data analysis in Python, I really enjoy what I do, I'm really passionate about what I do, and obviously I would be doing it for free; just don't tell my boss, because you know it's not good if your boss knows that. And it's just so
crazy how many people never question whether they enjoy what they do. You don't have many lives, so question yourself, ask yourself: am I happy with my work, am I happy with my life? I think these questions are really important, because, as I always say: if tomorrow I'm 85 and I look back at my life, is data analysis in Python something I am happy to have done all my life? The answer is yes, definitely yes, and I don't need time to think about it. These are questions people should ask themselves, because if they don't, you may run away from it, you may avoid asking the question because you're afraid to confront yourself and answer: yes, I am unhappy where I am, I'm unhappy with my job. In informatics you have so many opportunities; there are so many fields, so many bosses and companies, everywhere, so you don't have to be stuck with something you hate. The purpose of my talk is this message: explore, talk to people, question whether you like what you're doing, because in the end it's better for everyone. It's better for you, you feel great, you feel happy; it's better for your boss, because you do a great job; it's better for the company; and it's better for your friends, because they don't have to listen to you complaining every day. I know people who complain about their wives, and then go to their wives and complain about their job; don't become that, and if it's too late, just try to change something as much as you can. So yeah, when my colleagues make fun of me, saying: oh Sofia, you're wearing a Python shirt at work, that's crazy, you're doing Python temporary henna tattoos, you have a Python bracelet, you have Python stickers, you're a nerd, I'm just thinking: you're just jealous that you don't like what you're doing. Fantastic advice there.

Let's see... give you a quick... whoa, it's outside, I hope. I have to go to Munich. I've run out of tractor jokes. Maybe you have another adapter... Windows 7 jokes? Yeah, I have a Windows 7 joke. If you met me last year, you know that my first lightning talk was about Windows 7: is it really ready now? It does work, let's go.

Life is either a daring adventure or nothing, and unfortunately we have just one life. We often spend it like this, in our office, but sometimes we go to IT conferences. Don't get me wrong, I totally love conferences, and I also organize them, but I just believe there is way more potential. As Neale Donald Walsch says, life begins at the end of your comfort zone. So I decided to kick myself out of there, and I went to the PyDays Vienna conference on a motorcycle. I decided to avoid the highway to see the real scenery, and it was beautiful scenery indeed, but it was still quite comfortable; I just had to wait a couple of hours until I met a thunderstorm. Yeah, those are the real pictures. I got so wet that my phone got wet and turned off; I had no GPS, I was running out of gas, and I started to panic. But then I thought: isn't that exactly what you've been looking for? Out of the comfort zone, it's an adventure. Next was PyCon San Sebastián. I did the same, except this time it was three and a half thousand kilometers. Wonderful scenery, wonderful people, and I also met Leslie. Leslie is a hurricane that hit France and Spain exactly a day after the conference; fortunately it stayed mostly on the coast, so I managed to escape thanks to some skills, and I avoided getting wet thanks to plastic bags and things like that. By then I was totally addicted to adventure. So next was PyCon Italia, and what could possibly go wrong? Such wonderful weather, mountains... haha: five days later, you will not believe it, snow, wind, what the hell. I really had to take this picture with the conference banner, because nobody believed in my skill of attracting trouble, and my skills at avoiding getting wet were even better this time. Haha. So next, here we go:
EuroPython. I made a poll on Twitter: what's the most elegant way to get here? And motorcycle was not winning; train was winning, and bicycle was winning. So why not: I took a train to Paris, and I decided to cycle from there to here, and then on to Munich. Don't get fooled by this picture; I look happy because it's just the beginning. It was not easy, not easy at all. Imagine: in the middle of nowhere, the weather so hot, without water, without signal on your phone, you have to fix your wheel again and again; I have five patches just on the front wheel so far. But the scenery, the scenery was totally worth it; you cannot get that without going through that challenge. It was like running a marathon, but every day. And when I crossed this sign yesterday evening, you can imagine my emotions, how cool it was, and how cool it is right now to be here. So thanks for having me. That's not it, not yet. I am not the only one doing this kind of thing: my friend Stefan Behnel, whom you perhaps know, took a wonderful train journey to last EuroPython. I'm trying to convince him to give a lightning talk tomorrow, so I hope that works. He has his own ways, and he likes cycling as well. And last but not least, speaking about all this fun of travel, we should not forget that not everyone can ride a motorcycle, not everyone can ride a bicycle, and actually not everyone can even walk, right? We should not exclude these people. My last story, a really quick one, is about a friend of mine from Ukraine, Evgeny, and his childhood friend Yuri. Yuri has cerebral palsy; he cannot move at all, yet his dream was to see the mountains, even from a train window. Time passed, Evgeny grew up, and he never forgot about his friend's dream. He met a paraglider manufacturer and asked them, since they make backpacks, whether they could make some sort of device that would help him carry a very difficult, very heavy object: his friend. And they made it. With a lot of practice, with a lot of training, time passed, and he came back to his friend in Ukraine and said: well, this is the day we make a dream come true. We are not just going to the mountains, we are climbing the mountains of Ukraine. And here they go. Isn't that beautiful? He made it; through a lot of pain and suffering, he made it. Imagine the emotions in that moment. Why am I showing this? Really, my bicycle trip is nothing compared to this, honestly. I'm not special at all, and there are so many people in this world who do daring, cool things, and so can you. I just want to encourage you to give it a try: go crazy, go daring, and just remember, we're all one community and we should help each other. Thank you very much.

Okay, so we've already run 13 minutes over time; we have time for one last lightning talk this evening. We'll be clearing the slate tomorrow morning, so if you were at the end of this list, come in early tomorrow and get to the top of tomorrow's list. And that was perfect timing; take it away.

Hi, I'm Alex, and I will be talking about fuzzing black. Who knows black? Who uses black? There should be more hands. One day I was showcasing black to a friend of mine, and I thought, okay, I will add some extra parentheses to a print statement and show how black removes them. Uh oh: I got an error message. So that got me thinking. Black has some internal safeguards: the formatting should be idempotent, so applying the formatting twice should give you the same result, and the code after formatting should give the same abstract syntax tree. So the idea was born: generate some source code, run black on it, rinse and repeat until black finds a bug in itself. There is a testing package called hypothesis which can do almost all of that for you, but you still need to write a hypothesis strategy to generate the source code. I'm a lazy programmer, so I thought about another tool called AFL, which implements fuzzing, and what
fuzzing is, is this: you give the fuzzing tool an input file or a set of input files, it runs them through your code with coverage enabled, then it changes a bit in the input and runs it again, again and again, until it finds a bug or expands its coverage; it can basically discover new paths through your code. But the problem with AFL is that it's generally made for binary file formats, like compressed files, image files, executable files, where changing even a small bit or byte changes the picture drastically. In Python source code, changing one byte just gives you a different variable name or a different function call; that's not fun. So what I did: I went to the Python source, I took the Python grammar, I extracted all the keywords and all the punctuation marks from it, generated a set of strings the fuzzing engine could inject into my source code, and fed that to the fuzzer. In the end, after almost two weeks of running fuzzing on my laptop, whenever I didn't forget to open the lid, it gave me four bugs in black, which I reported. And there's way more to fuzzing: you can minimize your test cases, and you can take your whole corpus of test cases, all the files, and select only a subset that gives you the same coverage. There's also a very cool free online book about fuzzing in Python at fuzzingbook.org; if you're interested in fuzzing, check it out. I've put the slides and the source code I used online, so you can try fuzzing yourself. That's it.

If you'd like to stay seated for one more moment, there's going to be an announcement, more like a call for volunteers. If anyone wants a cool yellow t-shirt from EuroPython, you can still sign up to be a session chair tomorrow; we have two more spots free. Basically we need people who introduce the speakers and show them how many minutes they have left. So come and talk to me if you want to volunteer, thanks. Do it, it's a great way to meet new people; this is all about becoming part of the community, meeting new people. Can I have a big round of applause for all our lightning talk speakers this evening, and a big round of applause for Chuck, who stops you from having to listen to tractor jokes all evening, and a big round of applause for me. See you tomorrow.
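The fuzzing-black talk above rests on two safeguards: formatting twice must give the same output (idempotence), and formatting must never change the program's abstract syntax tree. A minimal standard-library sketch of those two checks follows; `toy_format` is a hypothetical stand-in for black (it only strips one layer of redundant parentheses from a simple print call), not black's actual implementation:

```python
import ast


def ast_equal(src_a: str, src_b: str) -> bool:
    """Two sources are equivalent if their abstract syntax trees match.

    Redundant parentheses do not appear in the AST, so stripping them
    leaves ast.dump() output unchanged.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


def check_formatter(format_fn, src: str) -> str:
    """Apply the two safeguards from the talk to any formatter."""
    once = format_fn(src)
    # Safeguard 1: formatting is idempotent (a second pass changes nothing).
    assert format_fn(once) == once, "formatter is not idempotent"
    # Safeguard 2: formatting preserves the program's meaning (same AST).
    assert ast_equal(src, once), "formatter changed the AST"
    return once


def toy_format(src: str) -> str:
    """Hypothetical stand-in for black: drop one layer of redundant
    parentheses around a print call's argument."""
    if src.startswith("print((") and src.endswith("))"):
        return "print(" + src[len("print(("):-2] + ")"
    return src


print(check_formatter(toy_format, "print((1 + 2))"))  # → print(1 + 2)
```

In the talk, hypothesis or an AFL-driven grammar dictionary would generate the `src` inputs instead of a fixed string; the point is that the invariants themselves are cheap, mechanical assertions that a fuzzer can hammer on millions of generated programs.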