So, nine years ago this summer I had a freshly minted Ph.D. in finance from the University of Arizona, and I was about to start my first academic job at Utah State University. My first teaching assignment was the graduate-level sequence in econometrics. To tell you the truth, I had a little bit of PTSD from my own experience coming through that particular sequence in graduate school, and I was searching for a new way to teach econometrics and deliver it. As I was searching, I was very fortunate to come across what to me is a very important paper by Peter Kennedy. Kennedy contends that we should teach statistics and econometrics, especially the convergence concepts and asymptotics, with Monte Carlo simulation, and that this helps students bridge the gap and connect with those concepts in a much more meaningful way than the symbolic mathematical representation alone. He contends that most students, even after several courses in statistics and econometrics, don't fully internalize the main ideas, and my teaching experience is that that's very much the case. He points out that the crucial concept for students to learn, the one that defines statistical thinking as separate from the mathematical representation, is the sampling distribution, and simulation with the Monte Carlo method is a particularly powerful tool for teaching it. This paper sent me on a nine-year intellectual journey that I'm going to share a little bit of with you today.

But let me start with a demonstration. Imagine that you're students of mine, maybe in an introductory statistics or econometrics class, and my task for the day is to teach you about the law of large numbers. I could put up a definition from a textbook like this, with some words and a mathematical representation. I could go more in depth with probability convergence concepts, but my experience is that when I put up slides like this one and the next, my students' eyes glaze over, they don't connect with the material in any meaningful way, and I tend to lose them. Imagine instead that I just start with a simple example. Everyone's seen the example from a baby statistics class of rolling a fair die that has six sides and equal probability for each outcome. We can calculate with simple mathematics that the expected value for the population is 3.5. Of course we can verify this in Python with some simple calculations, so the students can check it for themselves. Even more meaningful, after only a little bit of instruction in Python I can have them run a simulation like this. What the simulation does is roll the die a certain number of times and take the average: the horizontal axis is the sample size, and the dot on the vertical axis is the average taken from a sample of that size. They say don't run code in a presentation, but I'm going to risk it. So imagine that we start with sample sizes one to ten, so each of those ten points is an average taken from rolling the die that many times. What I can do for my students is simply increase the number of sample sizes that I run, and you start to visually see the convergence. If we go to a thousand, the story begins to be pretty clear. I'll stop at ten thousand because I don't want it to run and run.
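A minimal sketch of the kind of die-rolling simulation being described, assuming NumPy and Matplotlib; the function name and plotting details are illustrative, not the actual classroom notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

def average_die_rolls(max_samples):
    """For each sample size n = 1..max_samples, roll a fair die n times
    and return the sample average."""
    sizes = np.arange(1, max_samples + 1)
    averages = [rng.integers(1, 7, size=n).mean() for n in sizes]
    return sizes, averages

# Increase max_samples (10, 100, 1000, ...) to watch the averages
# funnel in toward the population mean of 3.5.
sizes, averages = average_die_rolls(1000)
plt.scatter(sizes, averages, s=5)
plt.axhline(3.5, color="red", linestyle="--")
plt.xlabel("sample size")
plt.ylabel("sample average")
plt.show()
```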
In the funnel-like shape of the graph, the students can immediately play with, interact with, and connect to the concept of the law of large numbers: you see that as the sample size increases, the sample average converges to the true population mean. So my experience has confirmed Peter Kennedy's contention that teaching with simulation methods is a powerful pedagogical tool, and I'm going to talk about how I've incorporated that into teaching computational finance.

The year after I arrived, one of my next academic assignments was to create a new course that hadn't been taught before: computational finance for our master's program. This created a bit of a conundrum, because most of my students had never programmed before. They are trained in economics and finance and have never coded, and what I wanted to avoid was the situation where, at the end of the semester, they feel a little bit like Delmar here. I don't know if you've seen this particular clip from the film, but Delmar is a little left out of the storyline, and I don't want the kinds of abstractions they're going to see in programming to leave them behind. I want to use this as an opportunity to help them learn to think in a new way. So one of the questions becomes: what is the adequate level of abstraction for thinking about options and option pricing, one of the topics we're going to cover in the computational finance course? By and large the students come into my course with pretty strong math and statistics skills, a pretty deep background as far as that goes. But as I said, most of them have never done any kind of programming, so my job for roughly the first third of the course is to teach them to program, and my objective is to use that as an opportunity. It is certainly a challenge, but it can also be an opportunity to help them learn a new way of thinking that will augment their mathematical and statistical thinking. This has been referred to as computational and inferential thinking, and of course Python is an excellent tool for it: it's designed for learners, but they'll never outgrow it, so it's been a very powerful tool for getting them up to speed in basic programming in the first third of the course and then launching into computational finance concepts. Again, I'd rather they not just leave with a list of skills on their resume. Python is wonderful, as you know, but I want them to leave with a new way of thinking about the world. Whereas Kennedy is oriented towards econometrics students and wants them to think about the sampling distribution, which turns out to be perfect for teaching econometrics, at least of the classical variety, I need to adapt this a little. I want students to think about what we might call the predictive distribution, because in finance we're always predicting a random outcome forward, and I want them to be able to think about the payoff of an asset in terms of that predictive density. But I can make a strong connection to Peter Kennedy's sampling distribution, and the Monte Carlo method turns out to be a very powerful tool for that as well.
So I'll ask you, and it's a little bit hard for those who have many years of experience, to think back to when you were learning to program for the first time and came across concepts like variables, control flow, and functions, and how these new ways of thinking, these new abstractions, changed your brain a little and rewired your neural network. I get to see this every fall semester as the students come in and learn to think in a new way. I myself learned to program with these intro books, and they were wonderful, with little toy games. So I'll ask you to think about what it would be like to be a student sitting in my class. I also remember the summer I decided to dive into Mark Joshi's book on design patterns and object-oriented programming for option pricing, and I wanted to be able to deliver this to the students in a more accessible way. C++ would be a very powerful tool, and it's kind of the gold standard in computational finance, but it's a little inaccessible for students who have never programmed before. Still, I wanted to introduce some of these concepts, especially the idea of design patterns with object-oriented programming, and so I want to introduce you to the module that my students and I build from the ground up together each semester. It's for educational purposes: it helps the students understand a little about how to do some design in their programming, oriented towards option pricing. It's not really geared towards production or research, although on my last slide I'll mention a project that's moving Probo in that direction.

The first place to start is to help the students conceive of an option contract outside of their textbook, in some representation in code, and so I begin with the facade design pattern. Here we have the option facade class, and it composes three objects: the option, which represents the option contract; what we're calling the engine, which is a pricing engine and abstracts out the idea of a pricing model; and the data object, which represents market data. We end up not doing much with that, but I'll mention it a little later. Then there's a single method, the price method, which calls out to the engine's calculate method. Because Python is so simple and easy to present, this almost provides the students with a domain-specific language for pricing an option and helps them think about it in, as I say, a new mode of abstraction. For those not familiar with what an option is, here's the textbook definition: it's the right, but not the obligation, to buy or sell another asset, called the underlying asset, at a preset time and at a preset price. There are variations on that, but that's the simplest definition. A call option is the right to buy the underlying asset, a put the right to sell it; the strike price is the preset price of the underlying, and the expiry is the date of exercise, at least if it's a European option. So how can we represent this in code? We start by creating a simple interface for the option contract that we can specialize with derived classes, and so I get to explain to the students the concept of an interface and help them think at that level of abstraction: something that can represent different kinds of option contracts. I'll draw your attention to the payoff method; that's going to be the key to how they think about pricing the option, through its payoff function.
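A rough sketch of the structure just described, the facade plus the option interface; this is illustrative and not the actual Probo source, so the exact names and signatures are assumptions:

```python
import abc

class OptionFacade:
    """Facade that ties an option contract, a pricing engine, and market
    data together behind a single price() method."""

    def __init__(self, option, engine, data):
        self.option = option    # the option contract
        self.engine = engine    # the pricing model, abstracted as an engine
        self.data = data        # market data needed for pricing

    def price(self):
        # Delegate all the work to the composed pricing engine.
        return self.engine.calculate(self.option, self.data)


class Option(abc.ABC):
    """Minimal interface that different kinds of option contracts can specialize."""

    @abc.abstractmethod
    def payoff(self, spot):
        """Payoff of the contract for a given price of the underlying."""
        ...
```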
Here's a simple concrete class, which we call the vanilla option simply because it's the simplest kind. What I'll draw your attention to here is that the payoff method simply calls out to a composed payoff object. I'll give you a second to look at the code, but that is what allows us to use the strategy pattern, so that at runtime we can swap out the payoff function for different kinds. This turns out to be really simple in Python, because functions are first-class objects and I don't have to create a higher-level object to pass into the option contract. I can create two simple one-line functions to represent a call and a put payoff, and by this time students are pretty familiar with writing functions, so this turns out to be an easy exercise for them. Then a simple demonstration of instantiating these: we can create a vanilla call option, here with a spot price of $41, an expiry of 1.0, that's one year to expiration, and a strike price of $40, and then I simply pass in the call payoff function to get the call payoff behavior. I can do the same thing for the put option; here I've changed the spot price so that we have a more interesting payoff. Students can play with this in the notebook, get familiar with it, and really connect with the definition of what an option is. These tools provide a very powerful pedagogical environment for students to go beyond the austere representation in the textbook and get to know what options are all about.

The next big part of the course is spending some time developing different option pricing models, which we represent here as pricing engines, again just to help them understand and begin to use some of the lingo that's used on the street. Option pricing models are not unique, and I want them to think about that in an abstract way, so once again we use the strategy pattern. We create a simple interface for the pricing engine with a single method called calculate, and this gives us the flexibility to represent different kinds of pricing models: it could be an analytic model like the Black-Scholes model, it could be a simple numerical model like the binomial model, or even a PDE solver. The one in particular that I'll show you is a simple form of the Monte Carlo pricing model. So here is a particular pricing engine, the Monte Carlo engine, and again, if we look at the calculate method, we can see that it calls out to a composed pricer object, and that again means that, for the students, creating different option pricing algorithms is as easy as writing different functions.
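Pulling the pieces just described into one sketch: the one-line payoff functions, the vanilla option that takes its payoff as a strategy, and a pricing-engine interface whose Monte Carlo implementation delegates to a composed pricer function. This is a teaching-style reconstruction under assumptions about names and constructor arguments, not Probo's actual API:

```python
import abc
import numpy as np

def call_payoff(spot, strike):
    """Payoff of a European call: max(S - K, 0)."""
    return np.maximum(spot - strike, 0.0)

def put_payoff(spot, strike):
    """Payoff of a European put: max(K - S, 0)."""
    return np.maximum(strike - spot, 0.0)

class VanillaOption:
    """Simplest concrete option: the payoff function is injected at
    construction time (the strategy pattern), so calls and puts share
    one class."""

    def __init__(self, strike, expiry, payoff):
        self.strike = strike
        self.expiry = expiry
        self._payoff = payoff

    def payoff(self, spot):
        return self._payoff(spot, self.strike)

class PricingEngine(abc.ABC):
    """Interface for pricing models of any kind."""

    @abc.abstractmethod
    def calculate(self, option, data):
        ...

class MonteCarloEngine(PricingEngine):
    """Engine that delegates to a composed pricer function, so swapping
    pricing algorithms is just swapping functions."""

    def __init__(self, replications, pricer):
        self.replications = replications
        self.pricer = pricer

    def calculate(self, option, data):
        return self.pricer(self, option, data)

# Example: a call and a put with a $40 strike and one year to expiry.
the_call = VanillaOption(strike=40.0, expiry=1.0, payoff=call_payoff)
the_put = VanillaOption(strike=40.0, expiry=1.0, payoff=put_payoff)
print(the_call.payoff(41.0))   # 1.0
print(the_put.payoff(37.0))    # 3.0
```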
Let me give just a little bit of mathematical review, not too much, to explain what's going on with Monte Carlo option pricing. We can take Peter Kennedy's idea about teaching econometrics with the Monte Carlo method and apply it again, adapting it to think about the predictive density for the option. It turns out that an option price can be thought of as the present discounted value of the average payoff under many thousands of simulations: the term in the parentheses is the average payoff at the time of expiry, where M is some number big enough for the law of large numbers to work, and the term in front, e to the minus rT, is the discounting.

With that as background, I can now show the simple function the students write, and at this point in the course they're pretty adept at writing something like this: just a few lines of code to represent the Monte Carlo option pricing method. The middle line of the block represents the simulation. It's vectorized, so we're running many thousands of simulations at the same time: Z is an array that's replications long, and that drives many thousands of payoffs; I take the average, discount it, and I've got an option price. We can test this pretty simply for a European option using the naive Monte Carlo method, so we'll do some imports from Probo. Maybe I'll take a second now and tell you about market data. It doesn't do much right now except abstract away the other bits of market data that are needed to price the option, but at this point the students understand the idea of abstraction, and I can tell them that this might represent a historical database, or maybe streaming data from Bloomberg; the students have access to the Bloomberg terminal, and this last semester I actually did have a student put a SQL historical database behind the market data object, and we could use it to price historical options. Typically, though, it's just a small way to encapsulate the bits of data that are needed. So here I'll get the different pieces that are needed, the Monte Carlo engine and the naive Monte Carlo pricer, and set up the market data. This is an example that comes out of their textbook; they're very familiar with it, they've calculated it by hand, so they know the answer a priori from something like the Black-Scholes model or the binomial model and have an idea of what answer the pricing method should yield. Then we set up the options, both a put and a call, with one year to expiry and a $40 strike price. I instantiate the call and the put objects, and here I'll run 100,000 simulations using that naive Monte Carlo pricer, build the pricing engine, and then use the option facade to encapsulate the call option, the pricing engine, and the market data, and call the price method, and it yields the price of the option. The Black-Scholes benchmark here is $6.96, so at 100,000 repetitions we're coming pretty close to the true Black-Scholes price. Again, with Monte Carlo simulation I can help them think about the option as the discounted average payoff, and they've had to simulate through that process to build this code. We can do the same thing for the put option and compare it to the Black-Scholes price; I think this is a penny or two off, so it's pretty close.
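A sketch of a naive Monte Carlo pricer of the kind being described, assuming geometric Brownian motion for the underlying; the 30% volatility and 8% rate are assumptions chosen only to be consistent with the roughly $6.96 Black-Scholes benchmark mentioned, and the payoff functions are the ones from the earlier sketch:

```python
import numpy as np

def call_payoff(spot, strike):
    return np.maximum(spot - strike, 0.0)

def put_payoff(spot, strike):
    return np.maximum(strike - spot, 0.0)

def naive_monte_carlo_price(payoff, spot, strike, expiry, rate, vol, replications=100_000):
    """Price a European option as the discounted average simulated payoff.
    The single vectorized line below is the whole simulation: z is an
    array that is `replications` long."""
    z = np.random.normal(size=replications)
    spot_t = spot * np.exp((rate - 0.5 * vol**2) * expiry + vol * np.sqrt(expiry) * z)
    return np.exp(-rate * expiry) * np.mean(payoff(spot_t, strike))

# Textbook-style example: spot $41, strike $40, one year to expiry.
call_price = naive_monte_carlo_price(call_payoff, spot=41.0, strike=40.0,
                                     expiry=1.0, rate=0.08, vol=0.30)
put_price = naive_monte_carlo_price(put_payoff, spot=41.0, strike=40.0,
                                    expiry=1.0, rate=0.08, vol=0.30)
print(call_price, put_price)   # roughly 6.96 and 2.89
```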
I went a little bit fast; I thought I had less time than this. What's next for this project is to extend Probo in a couple of different directions. One is a project called Praise that we intend for academic research and production, hopefully built with some graduate students; we're building it on top of PyTorch, where we want performance and simple access to GPU compute. One of the next directions for both teaching and research is another layer of abstraction that might be referred to as agent-based simulation. One of the key concepts in the Black-Scholes option pricing model, and therefore in any of the subsequent models, is that of the delta-hedging market maker, if you're familiar with that vocabulary. This is a concept that is really challenging to help the students think about, the idea of a risk-neutral density. The argument in Black-Scholes is that there is a market maker who hedges an option that has been purchased or sold, and through Monte Carlo simulation we can actually simulate that behavior and show that the Black-Scholes price is a special case, if the assumptions of the Black-Scholes model hold. We can then take those assumptions and break them: for example, I can introduce frictions like transaction costs or discrete hedging, and we can see that that gives a more general answer than Black-Scholes. We're also using this with some of my graduate students in research, actually implementing it as a pricing method, which we refer to as hedging Monte Carlo. So I finished a little bit early, but I'll take some questions if anyone has any. Thank you for your time and attention.

Yes, we do do that. One of the things we'll do is pull up Bloomberg and look at what an option is trading at, and have that conversation. We also do other exercises, like building an implied volatility solver, so they can see, when we take an observed option price, what kind of volatility it implies. So they work through exercises using real-world data. Yeah, it's a good question: having this computational framework to build with helps them wrestle with that concept and make it real for them, whereas the simple mathematical, symbolic representation alone is a little too abstract. For about the first month I do rapid instruction in Python; most of them have never programmed before, so I have to get them from there to what you just saw, doing object-oriented programming with design patterns. It's a bit fast, and I'm sure my computer science colleagues would be a little horrified by how fast and how sloppily I go, but I get them there. The middle third of the class teaches the various pricing models: we start with the binomial model, then we introduce Black-Scholes and its concepts, and then we work through Monte Carlo for the rest of the semester. The students then have a project at the end of the semester, by which point we're 75 or 80 percent of the way through the course, and their project is to extend Probo and build a new pricing engine, or price a new kind of option, something like that. The course is just over three months, and it's pretty ambitious, but over the last couple of years we've been able to do it. It's a full-time course on campus in our master's program. Yep, we created this out of whole cloth; I was just asked to create a course in computational finance and had extreme freedom about what to do, so I devised this, and I'm excited to hear anyone's feedback if you have great ideas for me. Yes, maybe I shouldn't admit this, but the first year or two I was ambitious enough to try C++, because I had assistant-professor-itis and I was going to teach them everything I knew, and we were still fighting compiler errors by the time finals were rolling around. But by the time the students are working on their project to build a new pricing engine or price a new option, they know how to play around, break things, and discover, so they have a computational platform for educational discovery.
I've definitely discovered that it's a new way to think about and teach these concepts, rather than just the austere, textbook, chalk-and-talk kind of way, and it's been a very powerful learning tool. Thanks for that question. I do both of those: every lecture has a Jupyter notebook attached to it, and it's attached to something in their textbook once we get to the computational finance material. But I have to go very rapidly through Python for the first month, so I've tried different things and ended up creating, I think, ten or eleven of my own notebooks. I don't try to teach them all of Python; I teach them enough to get going on the kinds of things we want to do, and they continue to learn Python throughout the semester as they need extra bits for their exercises. So far it's worked pretty well. They have a GitHub repository, and they know that repository is updated each lecture. I can't imagine doing this without the notebook and these kinds of tools. Thank you for your time and attention.

Just a quick reminder to silence your phones so we don't disturb the presentation, and there's going to be a Q&A at the end of the session. Hi everyone, welcome to the second talk of EuroPython 2019 in Basel. The next presenter is Rogier van der Geer. He has a doctorate in particle physics and works for GoDataDriven, and he's going to give a presentation about how to train an image classifier with PyTorch, building an image classifier that can recognize cities. Give him a warm welcome.

Thank you. We're definitely not going to do any particle physics today; instead I'm going to help you take your first few steps in training your first image classifier using PyTorch. Just before we start, can I get a hand from everyone who has used PyTorch before? Only a few, and that's good, because we're really going to start at the beginning. First I'm going to tell you what an image classifier actually is, which is not so difficult; then we'll go over what a neural network is and how we actually build one in PyTorch; and then finally, what can we do with them? In the spirit of this morning's keynote, I'm going to show you a bit of how I played around with an image classifier and what cool things you can do with one.

Okay, so first let's have a look at what a classifier is. Suppose we have a labeled training set of points: two dimensions, x1 and x2, and a set of data points that come in two classes, either red or green. In this case we can see a pattern: the green ones are on the left side of the screen and the red ones are on the right side. So we could say, let's draw a line somewhere around here, and then say that everything left of this line is green and everything right of it is red. Now we have a model, a very simple model, it's just a line, that can tell us whether a point belongs to one class or the other. Then, when we have an unlabeled data set, points like these, in this case just white points, we can use the same line that we had before to color all the points green and red.
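As a toy illustration of that rule (purely illustrative, not the speaker's code), the "everything left of the line is green" classifier could be as small as this:

```python
import numpy as np

def classify(points, boundary_x1=0.0):
    """Toy classifier: points with x1 left of the vertical line are 'green',
    points to the right are 'red'."""
    return np.where(points[:, 0] < boundary_x1, "green", "red")

unlabeled = np.array([[-1.3, 0.4], [0.7, -0.2], [2.1, 1.0]])
print(classify(unlabeled))   # ['green' 'red' 'red']
```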
So far, that's a classifier, and I'm pretty sure most of you have seen something like this before. Now, sometimes you want to do something a little more complicated than using a simple line, and in this case we're going to use neural networks. Neural networks, which most of you have probably heard about, look something like this: you see a series of circles, which represent neurons, and they come in layers. We have the input layer on the left, a hidden layer in the middle (there might be multiple), and an output layer on the right, and they are connected by these lines. So how do these neurons actually work? A neuron looks something like this: you have a set of incoming signals on the left, and each of these signals is multiplied by a weight. These weights are the things we use to train the neural network; they are what we must learn when we actually build one. So we take the inputs, multiply them by the weights, sum them up, and then apply some kind of activation function, whose main property is that it's non-linear, which makes the neural network more capable of learning. Typically, and we'll see it more today, we use the rectified linear unit, ReLU for short, which is y equals the maximum of 0 and x: it's 0 if x is negative, and otherwise it's x. Okay, that's fairly simple, so how do we actually use a neural network like this? In the case we had before, with the 2D points, we had two dimensions x1 and x2 and points we wanted to classify as red or green. We could use a neural network like this: we have two input nodes, one for x1 and one for x2, we propagate these signals through the network with the weights and activation functions, and we have two output nodes, where one is the probability that the point belongs to the green class and the other the probability that it belongs to the red class. Of course, if you first make a neural network like this, it's not going to do what you want, because you have to set the weights to the correct values. So what you do is take all those points you labeled before, pass them through the network, see whether it makes an error, and then tune the weights in such a way that it performs better. You have to do a lot of tuning before you get a neural network that actually does what you want. But we're talking here about image classification, not 2D points, so instead of just two inputs we want to say: we have an image; in this case the left one is a dog and the right one is a cat. That's something we can easily see, but we want a neural network to recognize it for us, and for that you cannot use a simple neural network like the one I showed before; you need something a little more complicated, something called a deep convolutional network. That's quite a mouthful, but in the end it's not so extremely complicated. Let's start with the deep part: a deep neural network is just a neural network that has more than one hidden layer, preferably in this case ten or so, could be twenty or thirty, but just more than a few. That's all there is to deep neural networks.
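A single neuron as just described, in a minimal NumPy sketch; the weight and bias values are arbitrary and only there to show the weighted sum followed by the ReLU:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

def neuron(inputs, weights, bias):
    """One neuron: a weighted sum of the inputs followed by a
    non-linear activation."""
    return relu(np.dot(inputs, weights) + bias)

# Two inputs (x1, x2); the weights and bias are what training adjusts.
print(neuron(np.array([0.5, -1.2]), weights=np.array([0.8, 0.3]), bias=0.1))
```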
Then the convolutional part: we need convolutions in order to interpret those images. On the left side here, the input to a network like this is an image, and what we do with these convolutions is take a box, aggregate all the pixels from that area of the image, and apply some function over them. We could try to see whether there's a big difference from left to right, or from top to bottom, or whether all the pixels have approximately the same value; it's just some function applied using the weights of a neuron. Then we shift this box across the image, so that we get a matrix of the results of applying that function over the whole image. In this case we don't use just one convolution: we end up with four feature maps, so we use four convolutions, and of course you could use many more. Then, typically, after a convolution you subsample: for each two-by-two block of results you take the maximum, which is called max pooling, or you could take the average, something like that. We do that in order to reduce the size of the neural network, because otherwise it might become too big, and if we have too many weights to tune it becomes too difficult to train. Typically after that we do more convolutions, so you get more feature maps, in this case ten more convolutions, then more subsampling, and at the end we add a fully connected layer, which is just like the one I showed you in the simple neural network before. Then we have an output, in this case two outputs: one could represent the probability of the image being a dog and the other of it being a cat.

A typical example of a convolutional neural network is VGG16. VGG16 is a big neural network: it consists of 16 layers and has in total 144 million weights. I hope you recognize some of the ingredients here. We start with an image of 224 by 224 pixels and three layers, one for each color, red, green, and blue. Then, starting at the left, there are two convolutions, some max pooling to subsample, two more convolutions, more subsampling, then more convolutions, subsampling, more convolutions, and in the end we arrive at the blue areas: these are the fully connected layers at the end, a standard neural network but big, with something like four thousand nodes, and finally one layer with a thousand output nodes. Why a thousand? Because this network was made to be trained on ImageNet, and ImageNet is a collection of 14 million images annotated into a thousand classes, among which, for example, cat and dog. This network was trained using a lot of computers to get something like 90% accuracy on those thousand classes. So, cool, this thing already exists, but aren't we going to train our own image classifier? Of course we are, and to do that we use transfer learning. Remember that this network has already been trained: it has 144 million weights that all have reasonable values, such that it can accurately classify all those kinds of images. We're going to make use of that by taking off the last part with the 1000 classes, removing that whole last layer, and putting our own layer at the end: not necessarily a simple layer like that, but a new classifier that we put on top.
So we remove the end and add our own layer. This has the advantage that all the weights in the previous layers already have reasonable values: the network already knows how to recognize sharp edges, round edges, strange patterns, all the kinds of things you typically see in a photo. It already knows how to deal with those, and in the end it comes up with a set of features, and we build a classifier on top of that. Isn't that cheating? Of course it's cheating, but hey, you never get anywhere in life without a little bit of cheating. What we want to do is make use of what people have done before and build our own classifier for things they didn't train it on, because if you want to classify images into exactly those thousand ImageNet classes, of course you can use the pretrained version of VGG; if you don't, this is the way to go: you swap out the last layer and you're all set. So now you know exactly how to train your own image classifier, right? Well, let's have a look at the code.

Before that, you might ask me the question: why did you choose PyTorch and not Keras? You may have heard of Keras as well; Keras is also a library that lets you build neural networks. You could say Keras was there first, or that PyTorch is more flexible; you could say Keras is faster, which might sound very important. But the main thing that I think is important is that PyTorch lets you play with the internals. That means you get to tweak the neural networks, and not just import and use them, so you learn more from PyTorch. That was the main reason I chose PyTorch.

Okay, now let's have a look at it, and here it gets a little technical, so bear with me. First, if we want to use a neural network in PyTorch, we have to define it, and you do that by creating a class. In this case we create a class called Net, and it inherits from the nn.Module class from torch. When we initialize this class, we first initialize the superclass, the Module, which is not so interesting, and then we define the four layers of our class. First there's the convolutional layer, a 2D convolutional layer with some parameters we'll go over in a minute; then a pool layer, which is 2D max pooling, a subsampling step where in this case we use a kernel size of 2, a 2-by-2 window, and take the maximum value in order to reduce the size of our neural network a bit; and then two fully connected layers, like the normal layers shown before, that come after each other at the end, where the second fully connected layer ends with 10 nodes, so we have 10 output nodes. Secondly, we have to define the forward method. The forward method accepts a single argument called x, which is the input, and the input comes in batches; in this case these are 32-by-32-pixel images in three channels. We apply the convolutional layer to x, which converts it to 18 channels, so we get 18 different kinds of convolutions, still at 32 by 32 pixels; then we apply the ReLU function, just to make it non-linear, which helps the network learn more complicated things; and then we apply the pooling, which reduces the picture size from 32 by 32 to 16 by 16 pixels.
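A reconstruction of the small network being described, including the flattening and fully connected steps covered next; the hidden-layer width of 64 is an assumption, chosen only to be consistent with the shapes mentioned in the talk:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 18, kernel_size=3, padding=1)  # 3 -> 18 channels, keeps 32x32
        self.pool = nn.MaxPool2d(kernel_size=2)                   # 32x32 -> 16x16
        self.fc1 = nn.Linear(18 * 16 * 16, 64)                    # flattened size 4608 ("more than 4,000")
        self.fc2 = nn.Linear(64, 10)                               # 10 output nodes

    def forward(self, x):
        # x: a batch of 3-channel, 32x32-pixel images
        x = self.pool(F.relu(self.conv1(x)))   # convolution, ReLU, then max pooling
        x = x.view(x.size(0), -1)              # flatten to one long vector per image
        x = F.relu(self.fc1(x))                # first fully connected layer plus ReLU
        return self.fc2(x)                     # second fully connected layer: 10 outputs

print(Net()(torch.randn(4, 3, 32, 32)).shape)  # torch.Size([4, 10])
```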
Then, since we're done with the 2D stuff, we have to reshape the whole thing into a single very long vector of size more than 4,000, after which we apply the first fully connected layer, again the non-linear function, and lastly the final fully connected layer, after which our output has size 10. But hold on, we weren't going to train our own neural network from scratch, right? We were going to do transfer learning. That's right, so the first thing we have to do is import a pretrained network. In this case I've chosen SqueezeNet, and I did that because VGG is actually quite a bit bigger than SqueezeNet and takes longer to run, so I went for the easy option. Let's have a look at SqueezeNet: you can simply import it, instantiate it with pretrained=True, and it will download the weights for you, which is a big set of weights and takes a while, but then you get a pretrained network that's ready to use. But we weren't going to use that pretrained network with a thousand classes; we're going to modify it. Let's have a look at the internals first: if we just print the network, it shows us all the layers, and we find that it consists of two parts. The first part is called features, and it has a lot of layers; I couldn't fit them all on the slide, I think it's 20 layers or so, with lots of convolutions and pooling and ReLU functions, all in sequence after each other. Then at the end there's the classifier part, which consists of four pieces, of which you already recognize three: there's the 2D convolution, there's the ReLU, and then average pooling at the end. The first piece is dropout, and dropout is a technique to help your neural network learn a little quicker by dropping, while you're training, the outputs of half the neurons, in this case with a probability of 50%. That makes it impossible for the network to rely on a single neuron or a small subset of neurons, so it has to make more connections to learn the same information, which basically makes it more robust. So in the classifier we apply this dropout during training, then there's this 2D convolution from 512 to a thousand, and this is again where you see the thousand output classes, and then the ReLU and the average pooling, so in the end we again have a thousand outputs, one for each of the classes it was trained to classify. Now we're going to change this and make it our own classifier for our own classes. All we need to do is define the number of classes we have, for example four; download the model and set it up; it has a parameter num_classes, so we can update that to four, although internally it's not even used, but let's do it to be complete; and then take the classifier part, remembering that the 2D convolution layer was the one with index one, and simply replace it with a new 2D convolution layer that goes from 512, just like the original, but now to our number of classes, not a thousand. That's all you need, and now you have a new neural network that you can train to classify your own classes.
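Roughly, that modification might look like this; the choice of the squeezenet1_1 variant and of four classes is illustrative:

```python
import torch.nn as nn
from torchvision import models

num_classes = 4

# Download the pretrained weights and swap the 1000-class convolution in the
# classifier (index 1) for one that maps to our own number of classes.
model = models.squeezenet1_1(pretrained=True)
model.num_classes = num_classes                      # not used internally, but kept consistent
model.classifier[1] = nn.Conv2d(512, num_classes, kernel_size=1)
```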
Okay, now let's have a look at how you train a model like this. We start by setting the model to training mode; that's important, and I'll get to why in a little bit. Then we need to define our criterion, how we score whether the model is good or bad; in this case we use cross-entropy loss. We also need to define an optimizer, in this case stochastic gradient descent: we say these are the model parameters, and then there are some arguments that we'll look at at a later stage. Then we loop through what comes out of a loader object (again, we'll look at the loader later), and these are the inputs, the images, and the labels, the classes you labeled them with. For each of those sets of images and labels (we do this in batches, so you always process multiple images at the same time) we first reset the optimizer, because we don't want to use any information from the previous batch; then we pass the images through the network and get some outputs; we calculate how good the outputs are, whether they correspond with the labels we gave it; then we propagate this loss backwards through the network, so for each neuron we calculate how well it did at scoring the training images; and once we know that, we can update the weights. Every time we loop through all our training images like this, we call it one epoch, and you're going to do this quite a few times if you want a classifier that works reasonably well. Once you've done that, say you've trained 20 epochs, of course you want to know how well your model actually works. For that, we first set the model to evaluation mode. So what is the difference between training and evaluation mode? Most importantly, evaluation mode disables the dropout: while training it might work well to let your model use only half of the information in some stages, but when you're evaluating, when you're actually trying to classify an image, you want to make sure you use all the information you have, and disable dropout. That's the most important reason why we must always call these eval and train methods. Then we can say with torch.no_grad(), which stops PyTorch from doing internal calculations we don't need, and again we loop through the loader and pass the inputs through the model to get the outputs. These are vectors with a probability for each class, and we take the maximum of these, which is then the class the model would assign to the image, so we can get the predictions from that, and we can sum the loss to get some idea of how well the model is performing.

I promised you a closer look at the loader: where does our data actually come from? First you specify where your images are on disk, and you do that by defining image folders. You want separate train and test sets, so you define two image folders, one with a path to the train images and one with a path to the test images. For both of those you also need to define a transform, the operations applied to the images as they are loaded, and we define two different transforms, one for the training images and one for the test images.
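Pulling together the loaders and the training and evaluation loops just described: this is a rough sketch, where `model` is the modified SqueezeNet from the sketch above, and the directory names, batch size of 32, four workers, learning rate of 1e-3, and momentum of 0.9 are illustrative assumptions, not the speaker's exact values. The individual transform steps are described next.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Transforms: a random crop for training, a fixed central crop for testing.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Hypothetical layout: one sub-folder per class under each root directory.
train_set = datasets.ImageFolder("data/train", transform=train_transform)
test_set = datasets.ImageFolder("data/test", transform=test_transform)
train_loader = DataLoader(train_set, batch_size=32, num_workers=4, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, num_workers=4, shuffle=False)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # `model` from the sketch above

def train_one_epoch(model, loader):
    model.train()                           # enable dropout during training
    for images, labels in loader:
        optimizer.zero_grad()               # forget the previous batch's gradients
        outputs = model(images)             # forward pass
        loss = criterion(outputs, labels)   # how far off are we?
        loss.backward()                     # propagate the loss backwards
        optimizer.step()                    # update the weights

def evaluate(model, loader):
    model.eval()                            # disable dropout for evaluation
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():                   # skip gradient bookkeeping we don't need
        for images, labels in loader:
            outputs = model(images)
            total_loss += criterion(outputs, labels).item()
            _, predicted = outputs.max(1)   # most probable class per image
            correct += (predicted == labels).sum().item()
            seen += labels.size(0)
    return total_loss, correct / seen
```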
Let's first have a look at the test images. The transform we compose consists of multiple steps: first we resize the image to 256, then we crop out the 224 pixels in the center, and then we convert it to a tensor that PyTorch can work with. That's fairly simple, but for the training images we do something a little different: we take a randomly resized crop of the same size from the image, so we don't always look at the same part of it; it could be a little more zoomed in or zoomed out, or a little more to the left or the right. This means that every time we train an epoch, the model actually gets to see a different set of images. The source images are the same, but the actual image it looks at is a little bit shifted or zoomed, so it learns not from the individual pixels but from the information that's actually in the image, and that's really important. Once we've defined those train and test sets, we can define the train and test loaders, which are simply DataLoaders: we provide the dataset we want to use, we set the batch size, the number of images processed at the same time; the number of workers, the number of processes that load and preprocess the images; and whether we want to shuffle, which means that every time we train an epoch, or evaluate, we go through the data in a random order. For training this is really important; for testing it isn't.

Okay, we're almost there, but I skipped something fairly important. Remember that when we defined our optimizer, stochastic gradient descent, I said there were these arguments at the end; the most important one is the first, lr, the learning rate. This is the rate at which we change the weights while we're training, so we need to figure out what a good value actually is. Suppose we have only a single weight (I can only make a plot in a single dimension) and we want to optimize it: we want to find the place right there at the bottom of this graph. Suppose we start all the way at the right of the graph and want to find the bottom by taking little steps. We want to make sure we don't take steps that are too large: with steps that are too large you could jump all the way across the valley to the opposite side, and if you're unlucky you might even go so far that you step out of the valley and reduce the performance of your model. On the other hand, if your learning rate is too small, first of all it takes a very long time to get there, and in this case you'll find the local optimum but you won't find the global one. So balancing the learning rate is really important. How do we actually find the best learning rate for our problem? The best thing you can do is just try them out. Here we define a function that sets the learning rate of the optimizer to a certain value, and then, for a log-spaced range of values from some minimum learning rate to some maximum learning rate with a number of steps, we set the optimizer to each learning rate, train for a number of batches, and then evaluate for a number of batches.
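A simplified sketch of that learning-rate sweep, plus the plateau scheduler discussed a little further on. It records training loss rather than running a separate evaluation pass, the ranges and step counts are illustrative, and it assumes the model, loader, criterion, and optimizer from the earlier sketches:

```python
import numpy as np
import torch

def lr_sweep(model, loader, optimizer, criterion,
             min_lr=1e-6, max_lr=1.0, steps=20, batches=10):
    """Try a log-spaced range of learning rates; train a few batches at each
    rate and record the average loss, to see where learning is fastest."""
    losses = []
    for lr in np.logspace(np.log10(min_lr), np.log10(max_lr), steps):
        for group in optimizer.param_groups:
            group["lr"] = lr                     # set the optimizer to this learning rate
        model.train()
        running = 0.0
        for i, (images, labels) in enumerate(loader):
            if i >= batches:
                break
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        losses.append((lr, running / batches))
    return losses

# Once training proper is running, a plateau scheduler can lower the learning
# rate whenever the loss stops improving:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
# ... and after each epoch:  scheduler.step(epoch_loss)
```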
What you'll find is that, starting with a very low learning rate, the model improves very slowly; after a while the improvement gets quicker and quicker, until the learning rate is so big that it jumps all the way away from the local or global minimum you're in and the performance degrades enormously. If you do this, it looks something like this: typically you first have some value of the loss, and as you increase the learning rate the loss goes down, until at some point it goes up all the way until your model doesn't do anything anymore. What we found here is that something like 10 to the minus 3 is the optimal learning rate, so that's what we set it to. Of course, the optimal value also depends on the state of your model: if your model doesn't do anything yet, a fairly high learning rate is probably good, while if it's almost there and you just want to squeeze out that last percent of accuracy, a very low learning rate is probably the right way to go. For that we have learning rate schedulers. We can use, for example, ReduceLROnPlateau, which is a scheduler that reduces the learning rate whenever the performance of your model during training has reached a sort of plateau, when it's stable, and then tries again. After every epoch we call scheduler.step() with the loss that we found, and based on that it might reduce the learning rate. That looks something like this: while you're training, your accuracy goes up at the beginning, then after a while the scheduler figures, okay, maybe we're stable now, let's reduce the learning rate, and you see it take these steps over time until at some point you decide the accuracy is good enough.

Okay, we're all set; let's have a look at some actual data I've played with. Of course, if you want to train a model, you need data, so what I did is I took one of these Raspberry Pis and set it to work for a couple of months, and I gathered a data set of photos taken in the world's largest cities: 72 cities, half a million images from 10,000 photographers, all in all some 30 gigabytes of data. And I made sure all of these are licensed for reuse, so that I can show them to you right now. The first thing you do when you gather a data set is look at the images themselves. I live in Amsterdam, so I had a look at a subset of the images that were taken in Amsterdam. This is a nice one; typically we don't have weather like this so often, but it's nice, right? Something very typical that you find in Amsterdam is bikes; this is a very typical scene from Amsterdam, and for those of you who have been there, I'm sure you'll recognize it. Okay, this looks good, right? Let's have a look at another one. This is a really nice view, but hold on, this wasn't taken in Amsterdam: we don't have any cliffs within 200 kilometers of Amsterdam, probably even further. So what's going on? Let's have a look at the metadata. There's this tag that says Amsterdam, so probably someone thought it was taken in Amsterdam, even though it wasn't. On the other hand, it also has a tag that says Dublin. Interesting. There are actually quite a few tags on this image; I couldn't fit more than this, this is less than 5% of the tags, and I certainly don't see a teddy bear museum in this image, or any of these kinds of things. It turns out people don't always tag their images as they should. So I had a look at where in the world all the images supposedly taken in Amsterdam were actually taken.
Well, they were all around here, and the bench we were just looking at was taken right there at the edge, on some nice island in Korea. Definitely not Amsterdam. We only want the images from Amsterdam that were taken right there, at the red dot in the middle, which is where Amsterdam actually is. So what you can do is take all the images and take the median latitude and longitude. Now the mathematicians will cringe, because of course these are circular values and strictly you can't just take a median of latitudes or longitudes, but in the end you can just do it and it works. Then you remove all the images that were more than five kilometers away, and you repeat this for all cities, and then we have a clean data set, right? Okay, let's do it. After that, I had a look at all the other tags and thought of something cool we could do. These were the most common tags in the data set. Of course, if you're going to look for photos taken in cities, then the most common tag is "city", but other than city, I think this one here, "skyline", is the most interesting one. Let's try to make an image classifier that recognizes the skylines of cities. So I took the ten most common cities in my data set, those were these, and here are the image counts; I split them into a train and a test set like this, and then we train. Although first we wait, and the waiting actually is quite annoying, because training a model like this takes a while. In this case I took a fast GPU; I used my boss's credit card, he doesn't know yet, he'll be a little bit surprised; and I spent something like 20 hours of training time, and then I had a model. So then you feed in an image: who knows where this one was taken? This is London, and the model got it correct. Okay, that's nice. This one, where is this? This is Sydney, and it learned that. All right, that's cool. This one, anyone? This is Toronto; I heard it right there. Okay, cool. This one, where is this? This is LA, and the model actually got it right, which is pretty impressive. This is clearly Chicago, right? And then here we have Philadelphia, got it right again. This is Tokyo, cool. Even this one, which is not really so complicated, it doesn't have too many buildings, I would say I wouldn't know it, but the model got it right: it's Houston. Here we have Shanghai. And this is clearly Chicago, right? Wait, what? What just happened? It turns out there was one photographer who labeled all his photos with the tag Chicago, while all he did was take photos of sandals on pavement, and my model got it right, in a sense: it learned that a sandal on pavement must be in Chicago, so of course when you feed this test image to the model, it says: this must be Chicago. Okay, so we need to fix this; let's come up with a plan. First, instead of splitting the images randomly into train and test sets, we can split them by photographer; in that case at least all those sandals end up either in the train set or in the test set. And then we wait. It takes a while, and once in a while this is really annoying, because it's late at night, you want to do some hacking on your project, you think of a solution, you fix it, you start training, and then you must wait until tomorrow to see the results. Anyway, the results of this one were terrible, because of course, if you put all the sandals in the train set, you get a very high train accuracy but the test accuracy is terrible; in fact the model is just overtrained on those. In the end, we simply had too many mistagged photos.
So I needed to come up with another plan, and in this case what I did is I built another model. I took a model that classifies only two classes: either the photo has a skyline or it does not. I trained it on all the data that I have, half a million images, and I gave them labels: it is a skyline when it has the skyline tag, and it's not a skyline when it doesn't. Then I could make predictions for all the data and use, for my original model, only the data labeled with a positive prediction for skyline. Again we have to wait, it takes a while, and it gets really annoying after a while, but in the end the results were pretty nice. Out of the images that had the skyline tag, about six thousand were labeled by the model as actually having a skyline and about one thousand as not having one, so I could just get rid of those; on the other hand, I got a thousand images that did have a skyline according to my model but didn't have the skyline tag, so I still ended up with about the same number of images. So I recreated the train/test split, had to wait again (as I tell you, this gets really annoying, and my boss will not be happy with me), and in the end I got yet more results. As you can see, the accuracy was about 70% after 200 training epochs or so, which is I think 24 hours of training, and I think that's fairly reasonable. Okay, that's cool, let's have a look at some of the actual results. This one it got right: Chicago, which of course is something I would also recognize, but it's cool, because it means the model actually learned to recognize some of these cities. Also Los Angeles it got right. In this case the model said New York City while in reality the label was Philadelphia, but to be honest, looking at this picture, I probably would have gotten it wrong as well, so sometimes it's not that bad. Here we have an example where the model says it's London, probably because of the bad weather, but it was actually taken in Toronto. Sometimes, though, you cannot explain the errors the model makes: in this case, although the skyline is a bit difficult to see, you can see some high buildings in the background, and you can clearly see that this street is definitely not an American street but something in Asia; this was actually taken in Shanghai, and the model got it wrong.

So that was it, but before I end, some final remarks. Training your own image classifier really isn't that difficult; all you need to do is cheat a little and do transfer learning, otherwise you won't be waiting 24 hours but months on end. PyTorch is fun; Keras might be easier and faster, but PyTorch is a lot of fun. And in the end, having clean data is way more important than having a good model. Thank you. After this, if you want to have a look at my code, you can follow this GitLab link, where you'll find all the code I used to create this image classifier, and keep an eye on our blog, the GoDataDriven blog, where I'll post a sort of transcript of this talk. Thanks.

Are there questions? We have to use the mics, so please line up. Hi, you mentioned two very different kinds of hardware, the Raspberry Pi and the GPU; can you say a little bit more about whether this is really practical on a Raspberry Pi alone, and if it's not, how does one actually take the next step to use the GPU?
Yes, of course, thank you. I did not train any of these models on the Raspberry Pi; I merely used it to collect my data, and since that's just a bunch of web scraping, you don't need any big hardware for it. If you're going to train a model (I tried training it on my MacBook, and that takes way too long), I did have to get a machine with a GPU. On the other hand, if you have a very small training data set, you might give it a try on your laptop; it's still fun, you might get as far as 10 or 20 epochs and then get a reasonable accuracy, and it's still fun to play with. If you want something a little better, getting a GPU, or getting a cloud machine with a GPU, is the way to go.

On that one misclassification, where you had the Shanghai image with a very small section of skyline: why include that in the test set? I didn't actually make the choice myself to include it; I included all images that were classified by the previous model as being a skyline, and apparently it had learned the properties of a skyline and recognized, in the background, something that looks a bit like one. Well, I guess my question is: you mentioned that clean data is better than having a good model, and this to me doesn't look like clean data. I'd expect a super-genius classifier to figure it out, but I would have excluded it if I had just seen your talk and not this example, so I'm curious whether I'm just wrong and there's some value in examples like this. Well, in this case I guess it's just laziness: I didn't go through my entire test set before using it, because we're talking about thousands of images, and manually labeling them as good or bad is not my idea of fun. I think a proper data scientist must be lazy. Any more questions? Well, again, thanks.

I'd like to give a warm welcome to our next speaker, Mustafa Anil Tuncel. He's a software developer here in Switzerland, in Zurich, and his work is focused on building data analysis pipelines and statistical models for single-cell gene expression and single-cell copy number variation data. That sounds very complex, so he'll tell us about it. His talk is called "Bioinformatics pipeline for revealing tumour heterogeneity", so let's hear what it's about.

Thank you very much. So yes, this talk is about the bioinformatics pipeline we're developing at ETH Zurich, and here is just a short introduction, some research interests that I have. We are working on data analysis pipelines, bioinformatics, and machine learning; we are also developing some new methods to explain certain biological data sets, and previously I did some work with recommender systems, so that's also one of my research interests. Here is my GitHub, LinkedIn, and Twitter information. The outline of this talk: I will first introduce the problem and give you some basic biological background; later I will talk about so-called DNA mutation trees, our model for representing mutations on the DNA; and then I will talk about the pipeline and the bioinformatics work we are doing to address cancer research. So I'll start with the biology background. I'm not a biologist (both my master's and my bachelor's were in computer science), so don't worry, I can't go into much detail when it comes to biology. We are working on cancer research, and the data we have comes from hospitals.
patients go there get their tissue sequenced and then we are analyzing it and providing a report to the clinician to base his treatment decision and the previous so they let me explain what a cell is so cell is the smallest living thing in the human body and when cells come together they form tissues like a muscle tissue it is consisting of different muscle cells and then the tissues come together to represent the organ organs become systems and systems become the organism that's the high school biology information that is required for this talk so the previous technologies so when when we want to analyze the cancer tissue the previous technologies were able to retrieve the information at the tissue level like the the operation is called sequencing the biological sample arrives to the lab and then it gets sequenced and as a result we get the yeah we have the digital information about it so that we can run our analysis and the previous technologies that were sequencing based on tissues they were not able to detect the heterogeneity among different cells because the so in the tissue you just get the average of the cells therefore you get the most dominant mutation but you ignore all the other mutations and as a result in the treatments when you have the treatment for the most dominant mutations sometimes the other ones also pop up and this is why single cell sequencing has an importance now and the new technology allows us to go into single cell detail at the single cell resolution and yeah from the from the single cells okay we have we have this DNA inside the cell which has the which contains the genetic information and DNA sometimes have mutations and some of those mutations are known to be associated with certain diseases and here we are so in the talk I will talk about our efforts to model those mutations so the the mutation we are considering is from the family of mutations called structural mutations it is called copy number variation and so here on the on the figure there's this blue part on one genome and on the second figure it gets duplicated so this is a this is a this is a mutation so the variation in the copy number on the left there was just one copy of the blue but on the right there are two copies so those copy number variations can be their duplications or deletions and so we are going to be analyzing those copy number variations at the single cell resolution those mutations they have so they have this family sort of relationship because one one when a mutation happens sometimes other mutations just span from the parent one so their child mutations and then sibling mutations they have ancestors as well and therefore therefore we are modeling them in a tree fashion so not necessarily binary but it's a it's a tree to represent the mutation information so the genome is divided into different regions here those regions are just meaningful parts of DNA that are associated with certain functionality in the certain functionality of life let's say and here the root one so root one doesn't have any mutations the other ones have I'm not sure if it is how readable they are but so they are representing those copy number profiles so this one says are one plus one so in region one there is one extra copy and all of the all of the single cells that are represented by this pink circle are going to have one more extra copy of that region and and it goes on like this so it's a tree where we have a dictionary of regions and then we contain the extra or missing copy number information and this 
is what we are trying to learn. So we have a machine learning model to learn the best suited tree for a given cancer sample, and the way we are learning it is by using a Markov chain Monte Carlo (MCMC) scheme. For each tree we have a means of scoring it by using a Dirichlet-multinomial model; I won't bore you with the formulation, but we can discuss it if you're interested after the talk. So for a given tree and the sample data we have a way of scoring it: it tells how likely this tree is given this data. Then, by using an MCMC scheme, we are able to move from one tree to another; we score the next tree as well, see how good the second tree is, and if it is significantly better than the first one we discard the previous tree and update the model with the next tree, and we continue like this. It usually takes lots of iterations; on real data it is millions. These bullet points are the MCMC moves we define, so here I will talk about them. The first one is prune and reattach: we randomly pick one node from the tree, we pick the brown one, and it happens to have two children; we just prune it and then we reattach it somewhere in the tree randomly. We just prune something randomly and attach it somewhere else, and afterwards we score the new tree as well, and if it is significantly better than the previous one we keep this one, we discard the other, and we continue. Another move we have is called add and remove node: we pick a node, we randomly generate another node as a child of that node, and then we see if this tree is better than the other. Another one is condense and split: we pick one node alongside its parent and then we condense them into just one, so the second one got swallowed into the parent and they became just one, and then we test if this one represents the data better than the other. This move in fact could be reproduced by a sequence of add and remove node moves, but the reason we have it is to help with convergence, because by insertion and deletion we would have to use too many iterations; this is just simpler. And each of the moves we have in the Markov chain is reversible: from the tree on the right-hand side we are able to go back to the tree on the left, and we are explicitly making sure that it is equally probable to move from the left tree to the right tree and from the right tree to the left tree, otherwise the Markov chain wouldn't be in balance and it would cause certain biases. So this is a tree we learned from real data; this data was from a mouse brain. We again started with a random tree, and after millions of iterations this tree happens to be the one that explains the mouse brain tumor evolution the best. On the right-hand side, below is the original data matrix we are getting after the sequencing experiment, and the figure above is how we can reconstruct it from the evolutionary tree we learned. To continue: so this was how we were defining the model, or how we are modeling the heterogeneity in the tumor, but in real life we have many more things in addition to it. The first thing is reproducibility of the research: we want any other research institute to be able to just read our paper and reproduce the results.
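To make that scoring-and-move loop a bit more concrete before moving on to the engineering side, here is a minimal Metropolis-style sketch of it in Python. The helpers `score_tree` (the Dirichlet-multinomial log-likelihood) and `propose_move` (one of the moves above, applied at random) are hypothetical stand-ins for the real C++ implementation, not code from the talk.

```python
import math
import random

def mcmc_tree_search(initial_tree, data, score_tree, propose_move,
                     n_iterations=1_000_000):
    """Toy Metropolis-style search over mutation trees.

    score_tree(tree, data) -> log-likelihood of the data given the tree
    propose_move(tree)     -> a new tree produced by one random, reversible move
    """
    current = initial_tree
    current_score = score_tree(current, data)
    for _ in range(n_iterations):
        candidate = propose_move(current)
        candidate_score = score_tree(candidate, data)
        # Always accept a better tree; accept a worse one with a probability
        # that shrinks as the score gap grows, so the chain can escape local optima.
        if math.log(random.random()) < candidate_score - current_score:
            current, current_score = candidate, candidate_score
    return current
```

With symmetric move probabilities, as stressed above, this acceptance rule is what keeps the chain balanced rather than biased toward one direction of moves.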
The second requirement is scalability, because in genomics, in the past 10 years, the cost of sequencing has been decreasing faster than Moore's law; as a result more genomic data is produced every day, and the growth rate is exponential. There's too much genomic data being produced, therefore the bioinformatics methods have to be able to scale, and this is another requirement we have. We often use multiple programming languages: the tree model, the MCMC part, was built in C++, since it requires so many iterations and therefore performance, and like many of the machine learning frameworks out there, for this we also decided on C++. But many other parts are written in Python, and sometimes we even need to use R, because certain statistical methods are only implemented there. And multi-processing: I cannot run the experiments on my local machine, we are using computational clusters, and since we have multiple nodes we try to make use of multi-processing a lot. Cluster execution, for two reasons: one, there's not enough memory on my local machine; second, there's not enough time, because we need to run things in parallel. And resource management: for each bioinformatics task we are doing, we need to define the memory, time, and disk space requirements in order to better utilize the cluster, and we often need to look at statistics about the resource usage in order to better tune the cluster execution. To achieve this we are using, among many things, a workflow management system; the one we are using is called Snakemake. It is similar to GNU make, it follows its paradigm, but it has a Pythonic syntax. I like this figure, I took it from one of the previous Snakemake talks somebody gave, and it explains it in a small diagram. Snakemake is a workflow management system in Python: a workflow consists of different programs, and those programs have dependencies between each other; some of them provide output to the others, some of them run in parallel, not depending on each other, and then they get merged, connected in the end. Snakemake is provided as a Python package, you can just pip install snakemake, and it has exactly the same Python syntax with a few extensions over it, and it follows the GNU make paradigm, which is well established. Workflows are defined in rules, and those rules try to create the output given the input files, and the workflow management system automatically determines the dependencies between different rules. By using Snakemake we can make use of all the existing Python libraries; unlike other workflow management systems out there, where when I need to use some Python functionality inside the workflow I need to write a Python script and make it executable so that I can access it from the shell, in Snakemake you can just use all the functionality of Python as it is, you don't need to wrap it into separate scripts. Automated logging of the status: since workflow management systems consist of multiple programs, sometimes even implemented in different languages, when something crashes you need to know which one crashed and why it crashed, and if possible you may want to continue with the rest of the workflow, or you may want to stop there; logging here is very important.
Snakemake provides automated logging of all of the errors, warnings, and the status of each rule. Snakemake came out of the bioinformatics domain, but it is a general purpose workflow definition language, so it can be used in any domain, it's not domain specific. I will show you some example syntax here. A rule is basically a task that needs to be done; a rule can depend on another rule, and a rule may use shell or Python code itself, or I believe they support R scripts as well. So here this rule is going to take two inputs, one is called genome.fa and the other is a FASTQ file, and once these two inputs are provided the rule will automatically execute the shell command and then it will provide the output; if it fails to provide the output it will crash, otherwise the rule will be successful and the next rule may begin. Here in the shell command there are these curly brackets; they are a way of communicating between the shell command and the input files, because otherwise, if you want to invoke the same command from the shell, you need to do some extra work, and likewise the output there serves the same purpose. And here is one extra feature of Snakemake: you can have wildcards. The sample here between the curly brackets is known to be a wildcard, and this feature I like very much, because without using a workflow management system this is very hard to achieve. Let me explain what it is: in the second line of the input it looks at the data directory, so go to data/samples and then find all of the FASTQ files that match certain criteria (this can be a regular expression, this can be just anything, like A.fastq, B.fastq, C.fastq), and then for each of those input files create the output that contains the same wildcard, and for each input the shell command gets executed. So without changing any line of code we can basically make it scale just by using those wildcards. And those wildcards we can even use across different rules, like whatever was created from this Java tool, use the exact same thing in the Python rule and produce the output automatically and handle the dependencies automatically. By means of these dependencies, before Snakemake runs it creates this directed acyclic graph of the jobs; this one is from one of our simulation studies. At first the first four rules are executed in order, the first one finishes, the second one begins, and so on, but at some point there are multiple ones, because these ones do not have dependencies and they can run in parallel, and likewise the last row of rules there may also run in parallel, but each one depends on the previous one, and afterwards we have this aggregation at the end. It is similar to the MPI paradigm: basically you can run things distributed and then you can aggregate them. And this directed acyclic graph of jobs is created automatically; we don't need to tell it, look, first do this rule, then the second rule, then the third rule (there is a way of forcing the workflow management system to do that), but by using the inputs and outputs it automatically detects the directed acyclic graph of the job execution. Here I will show you one more realistic Snakefile; this is a complete Snakefile for a very basic example, and here I want to show you how similar it is to Python syntax, because it is basically Python, it's a Python library in fact.
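Before that more realistic example, here is roughly what the minimal rule just described might look like as a Snakefile. This is a sketch in the spirit of the Snakemake tutorial rather than the slide itself; the mapping tool (bwa piped into samtools) and the data/samples layout are illustrative assumptions.

```
# Snakefile sketch: one rule with a {sample} wildcard, driven by a target rule.
SAMPLES = ["A", "B", "C"]          # would match A.fastq, B.fastq, C.fastq

rule all:
    input:
        expand("mapped/{sample}.bam", sample=SAMPLES)

rule map_reads:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped/{sample}.bam"
    shell:
        # {input} and {output} are filled in by Snakemake for each sample
        "bwa mem {input} | samtools view -b - > {output}"
```

Running `snakemake --cores 4` in that directory would then build the dependency graph from the inputs and outputs and execute the shell command once per sample, in parallel where possible.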
In the first lines we are importing some Python modules; they can be built-in modules or custom modules, like the secondary analysis here, and this is just regular Python. Later we have this config, which is the configuration file: in a program there are often parameters, and a workflow is a set of programs, so there are even more parameters, and therefore it's common practice to keep the configuration file separate from the main workflow. In Snakemake there's this built-in dictionary called config, and when you're invoking Snakemake you can just specify a config file and it will be automatically parsed; this way you separate the workflow and the config. For the rest, there's just a Python function here; you can have any Python functions, list comprehensions, all the Python syntactic stuff, but the difference here is the rule. So there's this rule, and there's input and output like in the previous example, but here instead of shell there's this run keyword, which just accepts Python code, so we have some Python code here to go through the files in one directory, do some work there, and in the end create this file, which happens to be the output. This is how simple it is compared to other workflow languages: you stay within the scope of Python and you can make use of that. Yes, please. Yes, you can mix those two, exactly, good question. Personally I usually use the make syntax; I usually call everything from the shell, even if it is Python. I may have Python classes, but I will write an executable in Python and call it from the shell; that way I can better manage the outputs and the logs, standard error, and the warnings, because this way, if this crashes here, the Snakefile will terminate, and that makes it much easier to deal with. And so this is an example config file. For configs, Snakemake supports two formats: one is JSON, the other is YAML. YAML has the advantage of allowing comments, but JSON has the advantage of being easily serializable; I often create the JSONs from some Python dictionary, so I automate certain tasks, and therefore in this example I use JSON. But you can use any config file, and you can use any other Python parser for the config; this is one of the two supported formats in Snakemake. And the execution: Snakemake is automatically configurable with the LSF scheduler, you can just pass it in, and given that you define the resources for each job in the config file (like this much memory for this job, that much memory for the second job), Snakemake will automatically create sub-jobs on the cluster, and that way you can specify how much memory or how much runtime or how much disk space you want to give to each job. I'll be very quick. Another technology we are using is HDF5, the hierarchical data format. They are binary files, easy to manipulate, and this comes in very handy in genomics because we usually make use of metadata, and they store the data alongside the metadata. And since in the pipeline we are using C++, sometimes Python, and sometimes R, we need to have a common serializable format (we cannot use pickle, we cannot import pickle in C++ or vice versa), so this binary file is another use of HDF5: basically we can write it from one language and load the exact same thing from another.
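As a small illustration of that from the Python side, here is what writing and then partially reading such a file could look like with the h5py package (one of the two Python wrappers mentioned in a moment); the file name, dataset name, and metadata key are made up for the example.

```python
import numpy as np
import h5py

# Write: store a (cells x bins) count matrix plus metadata in one binary file.
with h5py.File("counts.h5", "w") as f:                       # hypothetical file name
    dset = f.create_dataset("copy_numbers",
                            data=np.random.poisson(2, size=(500, 1000)))
    dset.attrs["genome_build"] = "GRCh38"                     # metadata next to the data

# Read: only the requested slice is pulled from disk, not the whole matrix.
with h5py.File("counts.h5", "r") as f:
    first_cells = f["copy_numbers"][:100, :]
    print(f["copy_numbers"].attrs["genome_build"], first_cells.shape)
```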
HDF5 also allows us to just connect to the data and perform operations on a subset of it without having to load everything into memory, which is also quite handy. This is the last slide, or the one before the last: in Python there are two wrappers, one is PyTables, which is a high-level, very nice wrapper of HDF5 that interacts with pandas, and the other is h5py, which is very similar to the C++ API. So this was the outline: basically the problem at hand, our statistical model, the bioinformatics parts, the pipeline, and the tools we are using to deal with that. As for future work, we will publish the method first, the statistical method; we will compare it to other methods, we will have evaluations on real and simulated data and we will show it on real data; and later this pipeline is going to be wrapped up, we will do all the bindings to other languages, and we will provide it on GitHub, open source. And this concludes the talk, thank you very much for your attention. Q&A: if anyone has any questions, just go to one of the microphones. Yes please, perhaps go to the microphone there, the closest microphone to you. If I understand correctly, Snakemake is compatible with Singularity, so is your project using Singularity or not? Oh, I don't know what Singularity is. It's a kind of container technology. I don't know, but no problem, maybe it's a bit off topic, we can discuss it afterwards. Okay, if no one else, then, lunch will be served soon, I think everyone's hungry, so let's give a round of applause. Our next speaker has expertise in machine learning, simulation, optimization and visualization; Dom holds a PhD from King's College London and also a master's from Cambridge, and we are very happy to have him here, so without further ado. Thank you very much. So yes, hello, I'm here to talk about Dash. Dash is a framework that allows you to have interactive data visualization web applications with minimal or no JavaScript. Just before I go on: these slides are actually themselves a web page, so if you want to follow along or look at these slides, either during the talk or a little bit later on, please do take those links down and use them to explore the various bits of code that are in this presentation. An alternative title for this particular talk is what you can, can't, should, and also probably shouldn't do with Dash, from the Plotly project. Throughout this talk I'm going to give a brief introduction to Dash, how it works and what you can do with it, discuss what we've learned at Decision Lab from using Dash across our various different projects, and identify areas that we think might be good practice and lessons that we've learned. I will be looking for feedback on some of the thoughts and the ways that we're using Dash, so please do ask a question at the end or grab me if you see me at the conference and start talking about Dash, because it really is quite a young project and it's something that we're very interested in, and I'd be very interested in collaborating with people to see how they're using Dash and where they can use Dash in the future. I'm just putting those links up there again because I'm not sure that I've put them throughout the rest of the presentation. I should also say, if you're really keen, there's a Docker container that you can start up in the examples in the
dash github and and the links in the slides will link to the docker so you can follow the presentation and in the comfort of your own home uh okay uh final point about the slides as well if you are viewing them on the web and because i know that this is uh recorded as well and every time you see one of those down emojis push the down and the down button rather than the right hand side button and so should also prefix this saying that i'm not um a dash expert um i'm not an author on the project uh you know and if you are actually from the plotly dash project um and you know anything i say is out of date or wildly contentious please do you get in touch um i'm more than happy to correct anything you'll learn more about dash and so actually that does kind of pose a question why on earth am i here if i'm not a dash expert and and more to the point as well i'm a um my background is in full stack web development i'm a python and javascript developer uh so why am i using python to write javascript well um to answer that question um as was uh introduced before i work for decision lab we're a london-based mathematical modeling consultancy and before i go on as well i should say i'm very sorry about the state of my country at the moment and brexit please don't ask me about brexit and i have no idea what's going on either so um but i should also invite you to another very fun python conference which is picon london which is also picon uk which is in cardiff this year in september uh it's great fun please do you come along um don't be put off by everything you're reading the news about the uk at the moment so um why are we using dash at decision lab um well one thing you might have identified at various python conferences or at your company or um your organization and is there almost like two cultures of python uh i kind of call this like the the python data and the python software cultures and you're like you'll kind of spot it in the developers that you meet or in the the meetups that you attend there are some uh you know python courses which are very much focused on jupyton notebooks pandas sci-pi tensorflow that kind of thing and and then there's also a community of software engineers that use python and you know hate jupyton notebooks um i sometimes fall into that camp and but we'll use um tools more like sql alchemy and the various other kind of database technologies and so on and to actually kind of build production software and so collaboration between these two cultures is very important and and it's something that we we try and break down as much as possible at decision lab and but it can be very time-consuming and and so what we were looking for especially on a small project i should say um so if you're building a big production web app that's certainly something that you you know you're going to start on a very different approach but if you've got a small project a little bit of data analysis and you want to do some visualization with it you don't want to have to get a software engineer to start building all sorts of custom java scripts integrations between your your python code and your data science project and it really is quite quite a challenge and so we wanted to minimize the new technologies that are maybe fresh uh phd graduates or um maybe interns even would be able to kind of minimize the new technologies that they would need to learn and in order to be able to get a project off the ground and it's purely because if i'm building this kind of very basic proof of concept and i want a 
data scientist to be able to take the lead in the early stages looking and exploring data creating visualizations and so on if their background isn't in web development then they might you know they're not going to be naturally placed to create a web application for their data and and also as a software engineer i want to be able to facilitate members of my team to really go places and do exciting things with data without getting bogged down thinking about various different you know java script callbacks and the latest react library and that kind of thing and and so just to summarize exactly why this was a problem for us and react has a i hope you can see this react has a very um uh difficult reputation at times it is quite a complex um style of development to get your head into and you can see here that um this is uh the tomasaw mark almost preparing to put together his hello world react app it's it's a difficult language to learn especially if your data if your background is more in a data science focused rather than a software engineer focused background uh and so we found dash it's an open source project although there are paid options for consultancy with plotly uh and it describes itself still as experimental although recently i think that's about three weeks ago it hit version one so it's an established project but um like everything in java script it changes all the time uh you can see the website here it's uh kind of very flashy um and they you know describe themselves being able to build beautiful projects uh with minimal involvement from from java script and that's a good thing sometimes um not having to use java script is in and of itself helpful and it certainly speeds up development for some project teams but as i'll cover a little bit later on it's not always um it's not always desirable to only use and the used uh uh dash so for the rest of the talk i'll just give a quick introduction to dash and how it works some examples of the uh really cool things that you can do with dash um as i said before some of those sort of tips on larger dash projects that we've developed a decision lab uh and also discuss as i mentioned before when to stop using dash and to start hiring uh java script engineers and so let's start with a hello world um it's a very um kind of simple uh syntax that hopefully if you're familiar with python you'll get to grips with quite quickly and literally just install dash by a pip um and this is a simple hello world script so you can see here we've got you know we're doing our imports we create an app on that line we produce a layout which in this case is a div tag which is an html tag with a fundamental kind of division in html and the children of that tag are going to be this h1 or header one um so just a big header that says hello euro python and at the bottom we run uh we run our app uh and you just literally call it like that now if we this is the moment of truth as to whether or not all of this works together and but here you can see yeah okay so we've got a hello world um or hello euro python um example app in dash um i've cheated a little bit i've put a style sheet on here to give us our logo and so on um but that's just completely by the by um so that's uh the basic very basic kind of hello world of um of dash there are two bits uh of code there that are quite important that you should have a look at and so there are two modules or libraries that are being used as dash and dash html components so dash html components uh is a module which just wraps all 
the core react components so uh every single html tag has a corresponding react tag and now every single react html uh react component has a corresponding dash component uh and so you can plug and play these together and then uh dash the actual framework um manages the relationships between these particular tags which i'll cover in a minute and it serves the layout just using a really basic flask interface um and so you can actually provide your own custom flask app as well that it and add all sorts of modules and routes onto there as well um but as i said in the title this talk um i want an interactive web page not just a hello world that serves some static data with a kind of python on top of it and so how do you get about doing that uh now i'm going to detail very quickly to ask how would you do that in javascript um i'm not sure what how familiar people in the room are with with javascript there might be some very experienced javascript engineers and some uh people who've never never touched it in their life so i'm going to give it an overview of how this would work uh but fundamentally we'll have two things on our web page and so you can see on the top line there we've got this output you might call it so um this is a paragraph tag that says hello um i said there for the time being our aim is to be able to take the name of our user um or the person on our web page and say hello to that person so we've got our output at the top there and in particular that span tag which is just um another html tag which is where we're going to place our user's name and then there's this input tag as well which is going to take some text input from the user and that has an id of hello uh hello input and the place it's going to go is id hello name uh oh sorry i uh i fell for my own mistake there and pushed right rather than down and so we can see here that uh if we uh look at the tree of those html objects we've got uh i should say as well these are um you'll often hear they hear me referring to the DOM and this is the document object model um i'm not vain enough to refer to web pages as myself um i just happen to be a web developer called dom and so this is the document object model here and so there are three there are three um nodes here in our DOM we've got our paragraph node and then the child of that paragraph node is this spam node with that id hello name the hashtag indicates an id and then we've got our inputs as well which is hello input and so we need to write a javascript um that sets the value or the inner html as we call it um of the hello name span uh to whatever the value of the input is at that particular time and now it's not good enough to just do this once as well we have to monitor it so that every time that that value changes um we update our our user also uh so that that's what we need to do now javascript we can write a program directly that would um monitor it constantly say every second or so or respond to an event and that's really time consuming and really quite difficult to keep on top of especially in a larger app so what react does is it lets us do this declaratively um so rather than write our own scripts to do everything and kind of figure out everything that should go on the page we just declare how this um the page should work and then react figures everything out works out a graph of how things are dependent on one another and sorts everything out for us and the important point to remember here though is that we're going to have to define the behavior um of or define 
how a change in the input should affect the display that we have at the end of our output. The syntax isn't really important, but in React terminology we say that the value of our input would be a prop to this span component, and we can define, using this JavaScript syntax here, that all this does is say: I want to take the value that you give me from my input and put it inside of this span. So that's an example of how it would work in JavaScript, but if we want to see how it would work in Dash, we have a similar concept, though one that's also different. The key to interaction in Dash is a callback, and these define the relationships between the various different components; you can think of this as being a little bit like Excel: whenever the input to one particular component changes, the function runs and its output is displayed somewhere else. So if we take a look at our hello world and update it, you can see here that we're defining another layout: we've got that div tag like we had before, I've got this h1 that I'm now calling hello-name, and we've also got this input. This comes from another Dash module called dash_core_components, which contains the interactive Dash components; we're giving this an id, I'm setting its initial value to Basel, and its type is text as well. Next we define how this relationship works using our callback (and I'm sorry that the text here is a little bit small, it's just the way it's spaced out to be compliant with Black), but you can see here we've got a decorator at the top, and we say what our inputs are, in this case just the value of that text box, and our output is going to be the value, or the HTML, that's inside our header. We get the value by running this function, and here I've just used an f-string to interpolate whatever the value is with hello. Again we just run it as we did before, and now if I open this up (hopefully you can see it, sorry it's a little bit small) I can change something like this, so we have an interactive web page now that takes the value from the user and displays it, and we haven't had to touch JavaScript once. There is an important caveat to this, before we think about redesigning all of the web pages and software that you make: the code that defines this relationship now lives in Python, so it lives on our server, whereas in a React app all of this would have been managed in the browser using React's optimized algorithms. This involves a call to the server, and that has a big performance impact: every time a user presses a key or does something on our web page, it means that we have to go to the server, ask what the new web page should look like, and then display that in our app. So that's a big performance impact, but like I said at the start, that's okay, largely because I want to build it really, really quickly, I want to use familiar technologies for data scientists, and I want to be able to make just a proof-of-concept app or an output of some kind; I'm not looking to make production software using Dash. So, just to give another bit of a tour: what can you actually do with Dash?
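For reference, the hello-name example just described boils down to something like the sketch below. It follows the pre-1.x import style (dash_core_components / dash_html_components) that was current at the time of the talk, and the component ids are illustrative rather than copied from the slides.

```python
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

app = dash.Dash(__name__)

# Layout: a header we will fill in, plus a text input with an initial value.
app.layout = html.Div([
    html.H1(id="hello-name"),
    dcc.Input(id="hello-input", value="Basel", type="text"),
])

# Callback: whenever the input's value changes, rerun this function and put
# its return value into the header's children.
@app.callback(Output("hello-name", "children"),
              [Input("hello-input", "value")])
def say_hello(value):
    return f"Hello {value}"

if __name__ == "__main__":
    app.run_server(debug=True)
```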
Some of the things that you can actually make are really quite impressive. As the name suggests, it's mostly geared towards dashboards, but you can make some quite transactional and interesting pieces of example software using it as well. So, just to show how far you can go quite quickly: if you want to display data, for example, this is the Titanic data set, it's example three in this GitHub repository that you can take a look at, and all we're doing here (there's a lot of code here, but it's not very complex) is we've got our layout again, we're giving it a header, and I'm also giving the option here to filter all of the passengers, so this displays the entire manifest of the Titanic and gives an option to filter it based on sex. You can see here that I'm just using this component, it's called the Dash DataTable, and that will simply display all of the columns in my data frame. I've got this callback as well; this callback provides the data to that Dash table, and as its input it takes the value of the dropdown at that particular point, so whenever the dropdown changes, this function is run and it will give either all of the passengers on the Titanic or just the ones that match this particular criterion, and that's literally all there is to it; a rough code sketch of this pattern follows shortly. So now if we want to see our data set, here we've got a fully interactive data set: this is the list of all of the passengers that are on the Titanic and all of the columns available. If I want to select just the women, you can see that it's now just displaying all the females, and likewise with males; you can see here that we've got an interactive browser for our data set as well. Okay, now, as I said at the start, this is a project from Plotly, and Plotly is obviously known for its graphing technologies and its ability to produce data visualizations, so it's no surprise that Dash also has a really rich set of essentially Plotly graphs that you can use to produce your own graphical visualizations of data. Sticking with the Titanic data set (you can take a look at the code, but it's a very straightforward Dash component where I just provide it with a couple of arguments to define the Plotly graph), now I've taken the Titanic data set, and this is the number of passengers on the Titanic by the first letter of their first name. Again, that's just a couple of lines of code to make a web page visualization, no JavaScript involved, everything done in Python. You can also start to make some more interactive tools, like I mentioned before: rather than just displaying data or taking little bits of data, if you want to start making changes to your data, say, you can use this. This is an example here that gives you a simple to-do list: you can see I've got a list that displays all of my tasks, I've got an input that takes the text for a particular task, and I've got this button here, and I can use the fact that whenever the button is clicked, I'm able to get the value of the particular task and add that to a list. Again, you can take a look at the code in more detail on that repository if you'd like.
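Coming back to that Titanic table for a second, the dropdown-plus-DataTable pattern just described might be sketched like this. The CSV path and the "Sex" column name are assumptions about the data set rather than code from the talk, and the imports again follow the pre-1.x module layout.

```python
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_table
import pandas as pd
from dash.dependencies import Input, Output

df = pd.read_csv("titanic.csv")          # hypothetical path to the passenger manifest

app = dash.Dash(__name__)
app.layout = html.Div([
    html.H1("Titanic passengers"),
    dcc.Dropdown(
        id="sex-dropdown",
        options=[{"label": s, "value": s} for s in ["all", "male", "female"]],
        value="all",
    ),
    # The table gets its rows from the callback below.
    dash_table.DataTable(
        id="passenger-table",
        columns=[{"name": col, "id": col} for col in df.columns],
    ),
])

# Whenever the dropdown changes, hand the (possibly filtered) rows to the table.
@app.callback(Output("passenger-table", "data"),
              [Input("sex-dropdown", "value")])
def filter_passengers(sex):
    filtered = df if sex == "all" else df[df["Sex"] == sex]
    return filtered.to_dict("records")

if __name__ == "__main__":
    app.run_server(debug=True)
```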
In that to-do example, the Input says to only call my function whenever this button is pressed, and the State provides what the value of the text input was at that particular time. And now I can see it live here, so for example if we add "do this", you can see that we're able to literally start creating interactive software. Obviously that could be more useful: here I've just stored this in a global list, but obviously you could link that to a database of some description or anything else that you wanted, so you can actually start to produce very basic user interfaces really quite quickly, not just data visualization. Moving on, you can also build your own Dash components if you're familiar with JavaScript. I should say that the Dash API does rather limit what you can do, in particular access to something like the Redux store, the kind of internal database inside your browser, and so apps can be a little bit jittery at times without access to some of these kinds of technologies. But you can still go very far with Dash (I've listed a couple of the limitations here), and you can really make some quite impressive and useful bits of software without having to have any real knowledge of web development or how those things work. For example, recently we had a project at Decision Lab looking at detecting illegal gold mining in the Amazonian rainforest, and we were able to make a tool that's used by Colombian police and military and so on to interface with a machine learning model to see whether or not it's likely that a particular area in the rainforest is being illegally mined by gold miners. That was all done, with the exception of the map, which I had to produce as a custom Dash component, by people with no real familiarity with web development or anything like that, just data scientists who are happier PCA-ing things than they are developing JavaScript libraries. We are also looking to open source that interface to the Leaflet map over the next few months. So, moving on towards the end: how should you get the most out of Dash if you want to start building Dash applications at the moment, and what would my advice to you be? There are four tips I'm going to suggest and go through. The first one is to organize your app, be disciplined (I'll come to this in just a second), and the second one is to start, well, the second, third, and fourth are all about tooling up how you use Dash in your teams: to build your app using something called a factory function, which then allows you to do routing and navigation, which we'll come to briefly in a moment, and then at the end I'll talk about how we plan to actually make the most out of Dash and tool up, possibly with the community. So the very first one: organize your application. I think something that's become clear to me is that Dash is a very novel and experimental technology; people refer to the documentation all the time, which they should, but the docs will always display an app in a single file, and so the result can be at times, especially with people who might be more familiar with Jupyter
notebook coding is that we get kind of 2000 line single file dash apps which obviously are somewhat unwieldy um it might seem like a very basic piece of advice but uh that kind of two cultures that we've uh identified before it's uh it's something that we've had to kind of um sort of talk about within our team and think about how you split up your code into kind of logical um units and files uh that's kind of very basic advice and so on um we we also try to run um uh apps at decision lab using the kind of the main interface there so you can run something as a module rather than have to run a specific script um so a standard dash app or like a component inside a dash app as we kind of module sorry or um as we envisage it at the moment we'll generally have this kind of the main file to actually run the app um this sort of uh app file which will kind of manage how the app is uh app is kind of created and so on and then we separate our callbacks from our layouts and then often we'll have um other kind of associated utils files and that kind of thing but um but I'd say the big thing here is to make sure you separate your callbacks from your layouts from your app um which allows you to run just like we do here um you know as a module rather than uh rather than as a script the next one is to build your app using a factory function um which might be an unfamiliar term to some um but if you're coming from a flask world it's uh it's a very common thing to do and because this allows in my opinion you to better control and to kind of better facilitate um rooting and navigation now rooting and navigation will allow you to have more than one uh it certainly in dash have more than one kind of feature um inside your app because you're able to have different uh different pages uh so at decision lab and there's an example here in in the code I think it's app six and you can see uh that we've abstracted the dash interface and from the standard uh callbacks and layouts decorators and so we actually have our own class now which just records all of our um decorators that are coming into our app and all of our uh sorry all of our um callbacks and all of our layouts that are coming into the app and then we have a kind of a base route that allows us to control um which particular component is displayed at which particular path in in our inside our app uh and so a standard app now would look something more like this where you we're importing a couple of different sets of features which might have um so here I've got I've I've uh shamelessly stolen the code that I used earlier so here I've got two lists one as a shopping list and one as a to-do list um and so you can see now that we're just importing the the two different modules there and then we're able to uh run it uh decide which one is run and using our kind of base layout here and we've got a callback that uh that manages this and so please feel free to kind of look through this these uh examples at a bit more leisure but you can see that if we actually run this app now uh we're able now to have uh so we've got here we have three particular um options for our layout we've got a homepage that we're on at the moment we've also got our to-do list and our shopping lists which are completely kind of separate from one another so if I put something into my shopping list it'll stay in my shopping list but then I can also put it in my to-do list so now we have two separate lists uh that you can use and sort of manage through a little then almost like pseudo framework 
to uh to keep your at your app kind of disciplined and lean rather than having huge files with them uh rather having huge files that have lots of um unwieldy code in them and now finally as well I want to uh touch on what I call it tooling up dash and if you implement um the like I said the factory function approach before and you're able to abstract actually defining what should go into your um your web application from actually executing it and running it um so we have like I said before that that class to manage the application and then build it you can actually start to integrate lots of other useful tools and um facilities for your data and well in my position for my my data scientists to be able to uh and you know use and exploit without having to worry too much about um for example integrating with a MongoDB database or making an API request to another service that can all be abstracted out so that um you know data scientists can focus more on the important stuff which is um maybe a machine learning model or how that model is going to be used and so we've done this through implementing dependency injection into all of our callbacks and our layouts using a google um a google dependency injection framework and if this is something that kind of interests you or that you might want to look at doing for your own your own projects please do get in touch because um I'd love to have the time to open source that properly and kind of really make it available and so just coming to this final point now um there's a question that you have to ask yourself about when do you want to stop using dash and start building um what I've termed here like a proper web application so by that I mean something that you might ship to production rather than um something that's a proof of concept that you might kind of use to iterate with a client or maybe you know uh an internal project um so dash is great um I really must stress this enough that although I I criticize uh you know I'll say point out some of the the weaknesses of dash and things that you can't do on the whole it is actually really good and I never cease to be amazed by what people can produce and how quickly they can produce it using a very simple dash framework and it allows this kind of like rapid development by non non specialists and it's also informed web development to us in a sense that rather than a data scientist having to hand over lots of um you know their their code or their idea to a software engineer and and then see it either produce the way they want or not produce the way they want they can actually build it themselves and actually start to produce something that's really um you know really focused on to the particular mathematical tasks that we have in mind at Decision Lab and likewise with lots of other tasks that you might have but it's also very rapid UI development and that's a big problem um I mean you've got to go to ask yourself are you creating a technical debt and if your app's going to be built in JavaScript eventually and why are you doing a proof of concept in pure Python like is it not better to invest early in you know JavaScript within your team and it's also that point where yes it's UI development that's been informed by maybe your data scientist or the person leading your project but are they the right person to to do that is using the dash the right framework to to go about trying that surely maybe you should start with a user researcher or UX consultant to start building a production application um so these 
are all questions that you have to uh you know ask if you want to create your um uh you know your own dash app um there's also areas here that we should talk about like authentication testing something I haven't covered here uh another kind of complex interactions with third party libraries are things that haven't really been to date properly addressed and covered in in the dash community there's also that issue with heavy loads and uh you know the performance thing that I mentioned before and so final point dash is great for facilitating very rapid development of data-driven interfaces and dashboards uh investing a bit of time allows you to go very very far uh inside and creating a dash app but ultimately front-end developers will still have a job after dash kind of takes off so thank you very much um and uh please do get in touch ah well you can shout yeah thank you hello oh uh I'm wondering if uh dash is at all a minimal to transpiling uh into uh a web uh uh what's a web assembly or something like that uh not to my knowledge at present no rather than taking like a transpilation approach and dash is more focused on uh actually wrapping around um react components so uh there isn't direct transpilation at the moment there are projects which transpiled javascript into uh sorry which transpiled python into javascript I can't remember the names of those or what what they do I know that that does exist and as to web assembly and not to my knowledge at the moment but with typing annotations now the interaction and things like oxide putting um you know taking python putting it into rust and then putting it into web assembly could be a really interesting thing in the future okay thanks cool thanks um I have a comment and a question yeah first uh it is clear that uh the developer cannot escape from understanding html you're writing html in python yeah so you need to understand html and a typical data scientist is not familiar with it maybe and a question is uh how debugging works because some of the events happens on the client side the click listeners or and and the rest are in the server side so yeah how how it is working uh well you're right that a developer has to have um uh or you know at least some basic familiarity with with html um although that that generally hasn't hasn't really been a problem I think most people have been quite quite familiar with it um you're right there are certainly occasions where you'll get a complex bug that might be something to do with the the front end and something that's happening in javascript rather than the back end and they they tend to be bugs in the dash framework itself or in a custom component that you've made um and rather than the the actual established components which themselves are very very stable generally and will often have a quite a helpful debug message that allows you to kind of figure out what's what's going on thank you i have one question oh have you tried altair uh have i tried what sorry this altair there's another interactive library called altair uh no but uh if you send me details that'd be very good okay thank you thank you thank you what about recommendation with um adriana donales and she's a master in statistics and machine learning and she works in thought works so without further ado adriana donales hello everyone it's super nice to have you here today this is the first time uh you talk to this amount of people so i'm kind of nervous and anxious thank you for being there for me this is an amazing experience thank you very 
much so today i'm going to talk about recommendation engines and how to build one simple recommendation algorithm but first if you don't mind i would like to give you some context about me and also break the ice so i came from i come from brazil with the s actually i know that you know brazil with the z but we write it with the s we have more than 8 000 square kilometers of extension and it borders like 10 countries this is a this is huge uh but i come from the very south of brazil i will show you this is in the higurajudo so a little bit about brazil uh we're learning wedges the portuguese i don't know if i have another brazilians here oh nice nice to see you um portuguese have landed landed a bit more than 500 uh 500 years ago we have 30 years of a weak democracy and i don't know if you heard but we kind of are having cup de tats and cup de tats all the time we have unlimited natural wealth and how how are you is oi tudo bem well so from the very south of brazil it's higurajudo so we like shu hasco and chimarrão we don't have much of samba neither the warm weather actually we board argentina and uruguay we are full of playing fields so we use it very much for agriculture and cattle breeding our daily temperature can vary from 28 to 10 degrees in the same day and the most traditional thing we have in higurajudo is um erva matiti it's called chimarrão so this thing in the latest hand is the chimarrão you can see a very huge piece of meat that is being cooked uh by the fire that is under the ground so i don't know i don't know if you have heard about chimarrão but i don't know why maybe it's because it has caffeine some footballers have been drinking it like the england's footballers and messi he's argentina right you know that brazilians argentinians have some struggles and this is ronaldingo gaúcho also drinking the chimarrão he's from port alegría as well and the curious thing is it's a hot beverage and we drink it on the beach uh in the hot weather it doesn't doesn't matter we like it so this is me i'm a developer i'm an economist i'm doing a master degree in statistics um so i'm a dating enthusiast i'm a cat person and i am addicted to travel this was me my first travel uh talking about how to do deal with the frustration frustration of not of not having your hypothesis proven as a data scientist i don't know if i have data science here but sometimes we spend like days trying to prove a hypothesis and it doesn't happen like we want but the thing is that even if it's not a good hypothesis i mean it's not accepted we can gather information of this so yeah this was in the human data science okay let's go what about recommendation system so the key word for recommendation systems are revenue and customer engagement what's happened is that we as customers are overloaded with information so we have a lot of items out there a lot of movies to watch a lot of i don't know videos and we don't know what to watch sometimes or we don't know what to buy sometimes what is the best for us so we have a lot of information uh this is a very new personal personalized way of selling buying watching and getting to know things and uh well mcney said that this uh it helps groups of user or users to select items from a crowded item or information space well amazon youtube netflix i mean there is a lot of others like udemy google oh my god almost every website where every e-commerce uses it but just to bring some examples where amazon uh in a quarter they had almost almost 13 billion dollars of revenue and this 
was the first first quarter they implemented the recommendation engine and they had 30 billion and it was 30 more than the same quarter of the last year so for youtube users and for youtube more than 70 percent of the user consumption come from recommendations so when you were watching a video and there's another video coming or you go through the feed and there's some videos recommended for you almost 10 percent of the consumption of youtube is about this and for netflix 75 percent so of the consumption comes from recommendation so we can see it's a very big matter recommendations so so when we need to recommend something we might be looking for answer to problems prediction and rating there is an approach for recommendation that you use it that is being used that it has not artificial intelligence that is like oh the most sold the most click items that it appears and it can be easily made with a query in the database and showing the user what items has been having more sold and everything but what we want to have is things that the customers are likely to love are likely to buy are likely to watch so we just we don't need just to predict the rating for an item but either if they might like it or not okay a little bug there but we how do you get this data right so data is the key and we have two kinds of data i'm saying two kinds because i'm separated that way i don't know maybe we have more even more than we haven't talked about but we can put everything in these two categories so implicit data it's about tracing it's about the data that the customer didn't give to us like their name or address but where did they click what kind of movie they like and something like this we can collect this data right big data and we have the explicit data this is the more difficult to get because it depends on an action of the user so answering a survey or rating items i don't know i think i have never rated an item i don't know about you but it's it's it's not common i mean thank for the people who rate because i'm always going to the rating but yeah i have never so this is the basic models of record recommendation systems uh with a ai okay so we have the content based the color the collaborative and the hybrid solutions i mean these companies like amazon netflix youtube they are using more the hybrid solutions that is very based on the content but it's very based on the collaborative and has some secrets on the recipe that we don't know so content based the recommendations are based on the description of the item or in the synopsis or in the genre or even in the author there is um there is a an article on the internet showing that an author had uh launched a book like in 2010 and the book was not very good sold that's okay in 2015 another author launched another book but the the content was very similar to this one that was has been launched in 2010 and what happened was that the book that was launched in 2010 started to have a lot of people buying buying it and the author was like oh what's happening then tracing it back they have seen that because of this book that was very similar to the 2010 book uh it starts to sell more so it happened uh so content based recommender systems are born for date from the idea of using the content of each item for recommending purpose it avoids the cold start recommendation problem wait for it i'm going to talk about it and uh content representations are open up to options to be used with different approaches like piano or tech is text processing techniques semantic 
Another model is collaborative, memory-based recommendation. In this case recommendations are based on users' social interactions and on rankings provided by other users. This collaborative model is called collaborative filtering, and it is divided into user based and item based. So, picture this problem: it's Saturday night, I'm at home, I open my favorite streaming app and I don't know what to watch. But I know my friend Marta: she likes thriller and drama, and so do I, so maybe it's a good idea to ask her for a movie recommendation. And this is a true story, okay, Marta is my friend, I didn't know what to watch and I was like, oh my god, I'm always like this. I don't know if you believe in this kind of thing, but I'm a Gemini and I can't make decisions. So here is user-based filtering. On top is my friend Henry: he likes oranges, grapes, raspberries and bananas. Marta likes grapes, and Donnie likes grapes and bananas. Maybe Marta could also like oranges, bananas or raspberries and has simply never tasted them, so maybe I can recommend one of those to her. But looking at this picture I can see that Donnie is more similar to Henry, because they share more fruits, while Marta only shares the grapes; so, based on similarity, oranges and raspberries are a better recommendation for Donnie than for Marta. What about item-based filtering? The preferences are the same, Henry, Marta and Donnie like the same fruits as before, but now I go and check the items, and I see that more people like bananas; so maybe I can recommend Marta a banana, and maybe she likes it. It has been shown in the market that collaborative filtering is more accurate than content based, and it's easy to implement, but sometimes it's not a good idea to implement it, and we have to keep that in mind. It's not a good idea when we don't have enough knowledge about the item and the user: if we don't have enough ratings, I know nothing about my user and I can't do the math to see which user or item this user is most similar to, so how can I recommend anything? And it's not a good idea when the item has not been rated enough, when it's a new item and I have no information about it. That is the cold-start problem. Cold start, as I said before and will now explain, is the expression we use when we don't have much information about the user but would still like to recommend something. One thing that helps with cold start, if you have Netflix or other sites like Pinterest, is that when you go there for the first time and register, the website asks you to select items or genres that you like; this helps the algorithm recommend things to you and avoid the cold start. Then there is sparsity, the problem when we don't have a large number of ratings and we are left empty-handed because we don't have information. So collaborative filtering is not a good idea when we don't have information, because it is very much based on similarity calculations. So let's build our own recommendation algorithm. These are the basic steps to build a recommendation system: first we choose how to do the math to find the similarity coefficient between users; then we predict, finding the predicted score for the movies the user didn't watch; and then we recommend, letting the user know what we have predicted for them.
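To make the user-based versus item-based distinction concrete, here is a toy sketch of the fruit example with hypothetical preferences; the Jaccard similarity and the popularity count are simplifications for illustration, not necessarily what the speaker's slides used.

```python
from collections import Counter

likes = {
    "Henry":  {"orange", "grapes", "raspberry", "banana"},
    "Donnie": {"grapes", "banana"},
    "Marta":  {"grapes"},
}

def jaccard(a, b):
    """Shared items divided by all items either user mentions."""
    return len(a & b) / len(a | b)

target = "Marta"

# User-based: find the most similar user and suggest what they like
# and the target does not know yet.
similarities = {u: jaccard(likes[target], items)
                for u, items in likes.items() if u != target}
most_similar = max(similarities, key=similarities.get)
print("user-based:", most_similar, likes[most_similar] - likes[target])

# Item-based (very simplified): count how often each unknown item is liked
# overall and suggest the most popular one.
counts = Counter(fruit for items in likes.values()
                 for fruit in items if fruit not in likes[target])
print("item-based:", counts.most_common(1))
```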
So let's go back to my problem: it's Saturday night, I want to watch a movie and I don't know what to pick, so maybe I will ask Marta, but maybe I should try another friend. The thing is, I have this database; as I said, this is a true story, my friends gave me these ratings. That's good, because now we can work out how similar we are and how to get recommendations. So this is our database; let me see if I have a pointer here. Okay: Donnie hasn't watched The Wolf of Wall Street, I haven't watched Cool Runnings, nor have I watched Baby Driver, Donnie also didn't watch Cool Runnings, and Otavio didn't watch The Lord of the Rings. So let's do the calculation to see the similarity between us. What I have done here is plot a two-dimensional graph of The Wolf of Wall Street against The Devil Wears Prada, to see how the data is dispersed. Here I can see that Marta gave a three for The Devil Wears Prada and a three for The Wolf of Wall Street, I gave four and four, and Philip gave three for The Devil Wears Prada and four for The Wolf of Wall Street. When I look at this graph I'm trying to understand who is closest to me, because I'm trying to get the similarity; when we talk about data, similarity is very much about where we sit together, which data points we share, things like this. When I want to recommend something to someone, I really want it to come from a person who is very similar to me, because then I know this person is going to like it. So, thinking about it, I can try to measure the distance: from Philip I'm at a distance of 0.5, and for Otavio, well, this forms a triangle, so let's compute the hypotenuse using Pythagoras. Do you remember that from school? We take the length of the two legs of the triangle, square them, sum them up, and take the square root to get the hypotenuse. Doing this math I find I'm at a distance of 0.721 from Otavio, so on these two movies I'm more distant from Otavio than from Philip. This is the Euclidean distance; it's also the math used in k-nearest neighbors, a very widely used algorithm in machine learning. I could have just shown the formula, but I thought showing the triangle and Pythagoras would be easier, because I don't think people like symbols, characters, numbers and exponents all mixed together. The formula is just Pythagoras summed over all the dimensions we have: I showed you a two-dimensional graph, but if I have a lot of movies I will have a lot of dimensions. So this is how we make the similarity calculation: I wrote a function, get_similar, I'll show you the code, and considering all our movies it shows that I'm more similar to Otavio than to anyone else. Now let's predict. I have to predict what my rating would be for Cool Runnings and for Baby Driver, because maybe that's the movie I'm going to watch tonight, or rather on Saturday. Here we can see that we don't have many ratings: four of them, plus mine, are missing. And what if Otavio, who is the most similar to me, hadn't rated many movies? Suppose we had fifty movies instead of five, and a person loves a movie that everyone else rated low. A solution to this problem is to use the weighted average. So, back to Saturday night: which movie should I watch? Pro tip: it is up to us to choose the recommendation threshold.
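Before the threshold part, here is a compact sketch, with made-up names and ratings, of the two steps just described: a similarity derived from the Euclidean distance over the movies two users both rated, and a similarity-weighted average to predict the rating of an unseen movie. The numbers are illustrative, not the 3.6 and 3.93 from the slides.

```python
from math import sqrt

# Made-up ratings on a 1-5 scale; "me" stands for the person asking for a recommendation.
ratings = {
    "me":     {"The Wolf of Wall Street": 4, "The Devil Wears Prada": 4, "The Lord of the Rings": 4},
    "Marta":  {"The Wolf of Wall Street": 3, "The Devil Wears Prada": 3, "Cool Runnings": 4},
    "Philip": {"The Wolf of Wall Street": 4, "The Devil Wears Prada": 3, "Baby Driver": 5},
    "Otavio": {"The Wolf of Wall Street": 4, "The Devil Wears Prada": 4, "Baby Driver": 4},
}

def similarity(user_a, user_b):
    """1 / (1 + Euclidean distance) over the movies both users rated."""
    common = ratings[user_a].keys() & ratings[user_b].keys()
    if not common:
        return 0.0
    dist = sqrt(sum((ratings[user_a][m] - ratings[user_b][m]) ** 2 for m in common))
    return 1 / (1 + dist)

def predict(user, movie):
    """Similarity-weighted average of the other users' ratings for the movie."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = similarity(user, other)
        num += sim * their_ratings[movie]
        den += sim
    return num / den if den else None

# Recommend a movie only if its predicted rating beats my own average rating.
my_average = sum(ratings["me"].values()) / len(ratings["me"])
for movie in ["Cool Runnings", "Baby Driver"]:
    score = predict("me", movie)
    print(movie, round(score, 2), "recommend" if score > my_average else "skip")
```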
The heuristic method I use is to take my average rating and recommend only if the prediction is higher than that average. So, recommendations for me: here is the code I wrote for the recommendation; it measures the Euclidean distance and multiplies by the similarity to get the weighted average. What happens is that I put in the similarity, then the predictions; Donnie hasn't rated Cool Runnings, so that one is blank, and I multiplied through to get the weighted rank. In the end we get a prediction for Cool Runnings: for example, I would rate Cool Runnings 3.6, at least the algorithm says so, and the same calculation for Baby Driver says I would rate it 3.93. Here is the recommendation function and this is the output: as we have seen, Baby Driver would be recommended to me more strongly than Cool Runnings, which means the algorithm would probably recommend me Baby Driver, and that prediction is what I should show my users. Doing the prediction for everyone, I can see what I would recommend: Cool Runnings at this score, Baby Driver at this score, and these are the predictions for Donnie. And on Saturday night I'm going to watch Baby Driver, so: problem solved. As we saw, my average rating is 3, so both of these can be recommended; in my Netflix or my streaming app these two would show up as "watch it now", since they are likely to be among my good movies. The code is here if you want to check it out and see how to build this algorithm: tiny.cc/europython2019. Thank you very much, I would like to hear your feedback and I'm here if you have any questions. A question from the audience: thanks for the talk, I'm curious, at your company, what products or items are you recommending? Nice. So, I'm a data enthusiast; my company is ThoughtWorks and we don't have much work on recommendation. We have been doing some inference, for example gender prediction for a big media company: when a user enters the website, is it a female or a male, what age, things like this. But we haven't done recommendation in a project yet. Thanks.

Welcome to the Python and JupyterHub talk; please welcome the speaker. Thank you, Martin, thank you very much for the kind introduction. I'm talking about geospatial analysis and, the second thing you see in the title, JupyterHub. Has any one of you used JupyterHub before? Cool, so not too many, but more and more, and that's great. Actually I could have titled this talk JupyterLab instead of JupyterHub, but I want to show you how cool JupyterHub is: it's basically JupyterLab where you can log in. If you go to a website with JupyterHub installed you just get this page, you can sign in, and after signing in you see this. So it's a multi-user JupyterLab, and that's pretty much all it is. There is something else I'd like to show you, which you can also do with a regular Jupyter notebook or JupyterLab installation: you see I have three kernels here, a Python kernel, a Markdown kernel and an R kernel. But there is another feature: you can have kernels with different Python versions, and that's quite handy. You just create a virtual environment, both using conda in our case, give the environment a name,
whatever you like, and then you specify Python 3.5, 3.6, 3.7, whatever, just don't use 2, plus the IPython kernel. Then you activate this environment and install all the cool packages you want to use, and after that you can create a new kernel with the line above, just ipykernel install with --user and the name of the new kernel. Then you can list all the kernels using jupyter kernelspec list and you actually see all the kernels installed. So if you go through this procedure for, let's say, five different Python versions, you will see five different Python versions in your JupyterLab environment, and that's really quite handy. And now we come back to the original title, geospatial: if you install geospatial modules you usually have to install many C-based libraries, and for that it is really recommended to have multiple Python versions and environments. Of course, if you are on JupyterHub you also have your file system there and can access all your user files from JupyterLab or JupyterHub. So, what are we doing? We have an HPE Apollo 6500 server, and on this server we installed JupyterHub. We bought this machine with 48 cores and 192 gigabytes of RAM and attached it to our small storage system with 120 terabytes, which is actually quite fast storage with one gigabyte per second read and write speed. That's also an important point: if you have terabytes of geodata you want a really fast and reliable system. We also have four Tesla V100 GPUs in it; that's the high-tech part. (I think the microphone cable should be changed tomorrow.) What I wanted to say is that we have the Tesla V100 in the SXM2 model, that's one of them here; it uses a lot of power and has 900 gigabytes per second of memory bandwidth, so it's quite fast, and we use them to create our deep learning models; more about that maybe later. So, what is geodata? There are some ISO standards describing what geodata is, the ISO Technical Committee 211 series, but the most important thing is that most data you have has a geospatial component: most data actually has a location component, or you can create one out of it. Mostly people use GIS software to load and manage this data; however, that's something I personally do not want to do, I use Python for that. Everything I'm showing you now with geodata is done in a Jupyter notebook, and you can really uninstall all your GIS software if you work like that. Today I'm limiting myself to vector data and a little bit of raster data; there is also other geospatial data like point clouds and 3D objects, but that's not what I'm going to talk about. Everything I'm showing today is open source. The two most important libraries are C++ based: GDAL/OGR, and the second library is GEOS. They have bindings in Python, but they are really not Pythonic, so some people created new Python modules which are really Pythonic and use the same C++ libraries underneath, and those are much nicer to work with. I would not recommend using GDAL directly; I would use rasterio for raster data processing, Fiona for vector data processing, and Shapely for vector data operations, as I will show you in a moment. And if you know pandas, a really nice Python module, there is also GeoPandas, which extends pandas for geospatial data. I'll give you the links to the projects we are looking at today; the most important thing is that we use the Jupyter notebook. The first module I'm showing you
is Folium. Folium is basically Leaflet.js, a JavaScript library for creating maps, one of many JavaScript mapping libraries, and with three lines of code you have a map in your Jupyter notebook or JupyterLab: you import the folium module, create a map, and specify a location and a zoom level. The zoom level is how far you are above the ground; there are typically about 20 zoom levels, which you know from other mapping services like Google Maps, Bing Maps, OpenStreetMap, Yahoo Maps and all the other map services that exist today. Another thing: if we look at vector data, there are specifications like the OGC Simple Feature Access specification, where geodata, in this case vector data, is defined. This is used in many databases, like PostGIS on PostgreSQL and so on. One of several representations is plain text: I use text to specify a point, text to specify a polygon, and so on. The reason is that you can print it, and in a hundred years you can still read it; in the geo world that is a very important topic. There is also WKB, a binary format, but I'm not talking about that now. Here are some examples: if you specify a point in WKT, well-known text, it's just POINT with the coordinates 10 20 in brackets; a polygon would be POLYGON followed by the text of its coordinates; and there are things like multipolygons, for example a country with islands consists of multiple polygons, and there are also countries with holes, and then you have a hole in the polygon. This is all specified in WKT, so it's a nice thing, and we can use it directly. We can create something similar to the WKT just using Python lists and tuples for the coordinates: you import Polygon and Point and specify your polygon, and if you look at it you see that the first and the last point are the same; that is an important aspect of this standard, the first and last point are the same so that we have a closed polygon. We can also load it from text: we create a string with the WKT definition and load it using shapely.wkt and its loads function, the s being for string, and then we have our polygon. Another format which is quite popular in the JavaScript world is GeoJSON, where you also create your polygons and specify the coordinates; that's another approach to defining vector data. Of course there are many other formats too, I'm not going into the details now, but that's what you find if you go into the geo business. So let's add such a GeoJSON to Folium. You see it's a little bit more complicated, but basically you open the GeoJSON file, load it, and put it on the map with the same syntax: you use Folium's GeoJson class, it's just called GeoJson, and add it to your map. In this case I loaded a GeoJSON of Switzerland, and you see the shape of Switzerland. Now I do the same again, but I plot it directly using Shapely, and you see it's not the same: there is a distortion, and Switzerland is usually not that distorted. The reason is that we have different coordinate systems.
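Before the coordinate-system detour, here is a short sketch of the pieces mentioned so far: a Folium map in a notebook and a Shapely polygon built from coordinates or loaded from WKT. The coordinates are illustrative, not the exact ones from the slides.

```python
import folium
from shapely.geometry import Polygon
from shapely import wkt

# Folium: a Leaflet.js map centred on (latitude, longitude) with a zoom level.
m = folium.Map(location=[47.56, 7.59], zoom_start=13)

# A GeoJSON layer (e.g. a country outline already loaded as a dict or string)
# would be added with: folium.GeoJson(geojson_data).add_to(m)

# Shapely: the first and the last point of a polygon ring are the same.
poly = Polygon([(10, 20), (30, 20), (30, 40), (10, 40), (10, 20)])

# The same geometry can be loaded from its well-known-text representation.
same_poly = wkt.loads("POLYGON ((10 20, 30 20, 30 40, 10 40, 10 20))")
print(poly.equals(same_poly))  # True
```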
Let me show you the graticule on the sphere: there is longitude and latitude, longitude measured along the equator and latitude going up towards the poles, and you can project this onto a map. The easiest way is to take the sphere and map latitude and longitude straight into a Cartesian coordinate system; then you get this picture, which is a completely distorted image of the world. It's not what you see in Google Maps, and there are projections with even worse distortions. There are some definitions here: the Earth is an ellipsoid, and the World Geodetic System 1984 defines the data for how the Earth is best fit by a rotational ellipsoid, or spheroid, and out of that you can create different map projections. I took three here out of many; there are around ten thousand different ones, and you could even invent your own map projection if you wanted to. I printed three of them and you see they are all a little different. The Mercator projection is what you know from Google Maps and so on, and you see that Antarctica down here looks bigger than most other continents, which is completely wrong, but that is an effect of projections. So we can look at these so-called coordinate reference systems, or spatial reference systems, and there are two special cases: either we use a geocentric Cartesian system, which is just a Cartesian system with x, y, z, or we use projected coordinates, which are usually not 3D but flat. Practically every country has its own representation: Switzerland has its Swiss grid, and other countries have their own special coordinate systems too. I'm not going into details, but you can look up the system of your country at epsg.io; EPSG is the European Petroleum Survey Group, and they catalogue all these coordinate systems. For example, EPSG 4326 is the World Geodetic System 1984. Okay, that was a little bit off topic, let's look at a real example. We are located around here: we have a longitude of about 7.5, so Greenwich is zero and we are seven degrees to the east, and 47 is the latitude, so from the equator we go 47 degrees up and we are in Switzerland at the Congress Center Basel. That's how it works; you'll see the problem in a moment. With Shapely we can write some nice expressions: we can check whether a point is inside a polygon, for example, which is a very complex operation, but with Shapely it's just a few lines of code, actually one line. So I create the point 47, 7, the coordinates of the Congress Center Basel; I can look at its WKT representation, I see POINT and the coordinates, everything is perfect; and then I check whether this EuroPython point is within Switzerland, and I get the result False. So what did I do wrong? Lower case? Wrong projection? No, it's very simple. I'll show you how it is done correctly: the difference is that I flipped latitude and longitude, now I have the longitude first, and then it works. The problem is that Folium wants latitude first and then longitude, while Shapely wants longitude first and then latitude, and that is a common problem. Some people say lat-long is best, some say long-lat, and the confusion is perfect, so we always have to consider which module uses which representation. Personally I prefer the longitude-first approach, because it's like the x axis first and the y axis second, but in geographic coordinates you can't really say x axis and y axis, so that is a point many people find worth disputing.
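Here is a tiny sketch of the pitfall just described, with a made-up rectangle standing in for the Switzerland outline: Shapely expects (x, y), that is (longitude, latitude), so flipping the order changes the result of the within test.

```python
from shapely.geometry import Point, Polygon

# A rough rectangle around Switzerland in (longitude, latitude) order.
switzerland = Polygon([(6.0, 46.0), (10.5, 46.0), (10.5, 47.8), (6.0, 47.8)])

wrong = Point(47.56, 7.59)   # latitude first: ends up far outside the polygon
right = Point(7.59, 47.56)   # longitude first: inside

print(wrong.within(switzerland))  # False
print(right.within(switzerland))  # True
```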
As I said before, there are all the vector formats; I'm not going into the details, I just recommend the Fiona module if you want to read vector data. But as time is running on, I'm quickly showing GeoPandas, which is pandas with the ability to make geographic, or geospatial, queries. So I can load something; let me load a dataset with all the cities of the world with a population greater than five thousand, which you can download at geonames.org. The text is very small, so you can't see it well, because it has a lot of data in it; so I reduced it to the most important columns, taking the name, latitude, longitude and population, and you'll notice I take latitude first and then longitude. That's the dataset, and you can create a GeoDataFrame out of it. The trick is that you make a column named geometry, and in this geometry column you have a Shapely representation of the geographic information; this could be a point, like in this case, or a polygon, a multipolygon, whatever. You create your geometry column right there. GeoPandas can also plot, as we know from pandas: you just take your GeoDataFrame and plot it, and if you plot all the cities of the world you more or less recognize the shape of the continents; Europe is quite green in this case, because there are many cities. Then I can do some queries, basically like in pandas: if I query the name Basel, I get the Basel row. But the more interesting ones are spatial queries, so let me get the distance from the congress centre here to all the other cities in this dataset. I just create our point again, calculate the distance, put it into a new column called distance, and sort by this column so the result is simple to understand: you have the name, the geometry and the distance. You see Birsfelden, which is right next to Basel, and then Basel itself; it looks a little strange, but the distance is measured to each city's centre coordinate, so we are actually closer to Birsfelden's coordinate than to Basel's. Then Binningen, Weil am Rhein in Germany, Saint-Louis in France and so on; those are the nearest places with their distances. I can also query within a polygon: I can use my Switzerland polygon again and ask for all the cities within it, and combine it with something else, for example all the cities with a population bigger than 20,000, and what comes back, not sorted, but it doesn't matter, is all the cities in this dataset within Switzerland with a population greater than 20,000. Let's do one more thing and display the cities on a Folium map. That's quite easy, you can combine these modules: with apply, for example, you can specify a function which creates a marker for every city, and then you have them in Folium. Let me do one last example before the session chair throws me out. There is a nice dataset of live earthquakes, the earthquakes of the last two weeks, which you can download directly from this link; I do that with the requests module and store it as a file, earthquakes.geojson; I did that about half an hour ago, and that was the result. So I can use GeoPandas to open my GeoJSON directly and display the first five incidents, and again I simplify the dataset, reducing it to four columns: time, magnitude, place and geometry. We see the first five entries; they are not sorted, but anyway.
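For reference, here is a compact sketch of the GeoPandas operations from the cities example, on a tiny made-up table instead of the geonames.org download: a geometry column of Shapely points, a distance query, and a within-polygon query combined with a population filter. The place names and numbers are illustrative.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon

df = pd.DataFrame({
    "name":       ["Basel", "Riehen", "Zurich", "Freiburg"],
    "population": [171000, 21000, 415000, 230000],
    "longitude":  [7.59, 7.65, 8.54, 7.85],
    "latitude":   [47.56, 47.58, 47.37, 47.99],
})
cities = gpd.GeoDataFrame(
    df, geometry=[Point(xy) for xy in zip(df.longitude, df.latitude)]
)

# Distance (here simply in degrees) from a reference point to every city.
congress_center = Point(7.60, 47.56)
cities["distance"] = cities.distance(congress_center)
print(cities.sort_values("distance")[["name", "distance"]])

# Cities inside a polygon (a rough stand-in for Switzerland) with a
# population above a threshold.
switzerland = Polygon([(6.0, 46.0), (10.5, 46.0), (10.5, 47.8), (6.0, 47.8)])
print(cities[cities.within(switzerland) & (cities.population > 20000)])
```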
We can see a trend: there is a hot spot in California at the moment. And we can create a histogram out of the data, a nice view with 16 bins in this case; luckily most earthquakes are around magnitude three, and unfortunately there are some higher ones in there. In the first column you have a timestamp, and to change this timestamp into a more readable representation you can use the datetime and timezone modules of Python and create a new, more readable column; so this is the tenth of July, in the UTC time zone. Maybe we'll hear something about time zones in a lightning talk; oh, tomorrow, okay, Miroslav's talk about time zones, very nice and very important. We can plot this, and we can also plot multiple geodata sets: I read another GeoDataFrame and combine the plots by reusing the same axes, so you can have multiple layers, and I can display the continents with the earthquakes on top. You could do more, for example change the size of the dots depending on the magnitude, but I think it's time for questions, so thank you very much for your attention. There is a microphone on the table, I think. A question: can you say something about what you use this very expensive computer for? That's a good question; unfortunately I wanted to say more about that, but after 35 slides I was running out of time. We do projects, for example, to detect solar panels on roofs: we have a dataset of orthophotos of the whole of Switzerland, about two terabytes of data, and we try to detect different kinds of solar panels; for that we create deep learning models and train them, and for the training we use the four GPUs. (It's a bit confusing with the microphone, okay.) And of course we do many other deep learning projects at the moment. No, I didn't skip it; I actually didn't even put it in this presentation, and I don't have it ready, actually. Another question: are there any solutions for geodata, especially queries in databases in Django applications, that you would recommend? Because we've seen Python now, but if I have to trim it down to SQL it becomes a bit more complex, especially when I have to do it from a Django direction. This is something I don't really like to answer at a Python conference, but since you asked: there is PostgreSQL with PostGIS, and PostGIS supports spatial queries too, so you can do the same things I showed here; and, unfortunately, PostGIS does it much faster than the JupyterLab solution I showed you, so what I showed is actually slower than PostgreSQL. But you can do the same things; the disadvantage, of course, is that you don't have a nice Python environment, you can't program it nicely like this, you can only run queries. Yes, I'm aware of that, but it's a feature of a specific database, and if I want to do it from Django and the Django query should also work with SQLite, then I can't just use the PostgreSQL features. Are you aware of the project GeoDjango? There is GeoDjango, which takes care of these details, so you can access the features of PostGIS directly from GeoDjango.
Are there possibilities to use these libraries for planets other than Earth, Mars for example? Yes, it's actually no problem, you can do any planet; the only problem is that you don't have high-resolution data for other planets, but it's basically the same, you just need the model. There are models for Mars, for example, and for most of the near planets; on Earth we have the WGS 84 representation, but Mars is basically also an ellipsoid, so you can do exactly the same calculations. You could even do distance calculations from one point to another with GeoPandas and a Mars dataset, no problem. Yes, I will make the slides available; I think all EuroPython slides will be on the programme on the website, all speakers will upload them and you can download them from the schedule page, just click on the talk and you will get the link to the slides. That's a very good question: don't use GeoPandas for very large datasets; it's the same as with pandas, you can't use pandas for very large datasets at the moment. The developers are working on that, they are trying to do some memory voodoo, sorry for that, but it will not work unless you use other modules. I didn't show this because it's already too much detail, but if you use Fiona, for example, you can take one row of the dataset at a time and have only that in memory; you have to do the memory management yourself, so for distance calculations, for example, you would just work on a row-by-row basis, and then you could take a multi-terabyte dataset and do your calculations with that. For larger datasets there are also PySpark and GeoPySpark; you see there is a trend of putting "geo" in front of classic Python modules; and with GeoPySpark or PySpark you can do much bigger calculations. There is almost no limit, it's a hardware issue: if you have enough money for the hardware, you can handle unlimited amounts of data. Okay, thank you very much again.

Hello everyone; in this talk I'm going to show you how to design functions that can be correctly graph-converted using two of the most exciting features of the new TensorFlow release 2.0: AutoGraph and tf.function. But first let me introduce myself. I am Paolo Galeone, I'm a computer engineer, I do computer vision and machine learning for a living, and I'm literally obsessed with TensorFlow. I started using TensorFlow as soon as Google released it publicly, around November 2015, when I was a research fellow at the University of Bologna in the computer vision laboratory, and I never stopped since then. In fact I blog about TensorFlow, you can see the address of my blog there; I answer questions about TensorFlow on Stack Overflow almost daily; I write open-source software using TensorFlow; and I use TensorFlow every day at work. For this reason Google noticed this strong passion and awarded me the title of Google Developer Expert in machine learning. As I mentioned, I have a blog and I invite you to go and read it, mainly because this talk was born from a three-part article I wrote about tf.function and AutoGraph. So, after this brief introduction, we are ready to start. In TensorFlow 2.0 the concepts of graph definition and session execution, the core of the declarative style of programming used in TensorFlow 1, have disappeared, or better, they have been hidden in favour of eager execution.
Eager execution, as almost everyone should know, is the execution of the computation line by line, exactly as in plain Python. This new design choice was made with the goal of lowering the entry barrier, making TensorFlow more Pythonic and easier to use. Of course, describing the computation with data-flow graphs, as in TensorFlow 1, has many advantages that TensorFlow 2 must keep: graphs have a faster execution speed and are easy to replicate and to distribute; moreover, graphs are a language-agnostic representation. In fact a graph is not a Python program but a description of a computation, agnostic to the language, so it can be created using Python and then exported and used in any other programming language. Moreover, automatic differentiation comes almost for free when the computation is described using graphs. So, to merge the graph advantages of TensorFlow 1 with the ease of use of eager execution, TensorFlow introduced tf.function and AutoGraph. This is the signature of the function: tf.function lets you transform a subset of Python syntax into a portable, high-performance graph representation with a simple function decoration; as you can see from the signature, tf.function is a decorator, and it uses AutoGraph by default. AutoGraph lets you write graph code using natural Python-like syntax; in particular it allows you to use Python control-flow statements like if, else, while and for inside a tf.function-decorated function, and it automatically converts them into the appropriate TensorFlow graph nodes: a Python if statement becomes a tf.cond, a loop becomes a tf.while_loop, and so on. But in practice, what happens when a function decorated with tf.function is called? This is a schematic representation: it is a two-phase execution. The most important thing to note is that when a function decorated with tf.function is invoked, eager execution is disabled in that context. On the first call the function is executed and traced; with eager execution disabled, every tf.* method just defines a tf.Operation that produces a tf.Tensor object as output, exactly as in TensorFlow 1, the same exact behaviour. At the same time AutoGraph kicks in and is used to detect the Python constructs that can be converted to their graph equivalents, so a while becomes a tf.while_loop and so on. Once all these pieces of information are gathered, the graph can be built: we have the function trace and the AutoGraph representation, and since we have to replicate the eager execution order line by line, the execution order of every statement is forced using the TensorFlow 1 tf.control_dependencies mechanism. At the end of this process the graph is built; then, based on the function name and on the input parameters, a unique ID is created and associated with the graph, and the graph is placed and cached in a map, so we effectively have a map of ID to graph. Any later call will reuse the cached graph only if the key matches. And since tf.function is a decorator, it forces us to organize the code using functions; functions are, in fact, the new way of executing what used to go through a session. Now that we have a basic understanding of how tf.function works, we can start using it to solve a simple problem and see whether everything goes as we have described.
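As a side illustration of the mechanism just described (this is not the talk's example), here is a minimal sketch of a tf.function-decorated function whose Python if is converted by AutoGraph into graph control flow, plus the call that prints the machine-generated code; relu_like is a made-up name.

```python
import tensorflow as tf

@tf.function
def relu_like(x):
    # AutoGraph turns this Python `if` on a tensor into graph control flow (tf.cond).
    if x > 0:
        y = x
    else:
        y = tf.zeros_like(x)
    return y

print(relu_like(tf.constant(3.0)))   # first call: traced and cached; later calls reuse the graph
print(tf.autograph.to_code(relu_like.python_function))  # the generated graph code
```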
This is the problem, and it is really easy: just a multiplication of two constant matrices followed by the addition of a scalar variable b. This is the TensorFlow 1 solution. In TensorFlow 1 we first have to describe the computation as a graph, inside a graph scope; there is always a default graph present, but in this case we use one explicitly. Then we create a special node whose only goal is to initialize the variables; everyone familiar with TensorFlow 1 has seen this line a thousand times. Finally we create the session object, which receives the description of the computation, the graph, and places it on the hardware; then we can use the session object to run the computation and get the result. This is the standard implementation in TensorFlow 1. In TensorFlow 2, thanks to eager execution, the solution becomes much easier: we only have to declare the constants and the variable, and the computation is executed directly, without the need to create a session. To replicate the behaviour of the session execution we rewrite the code as a function, and executing the function gives the same behaviour as the previous session.run of the output node. The only peculiarity is that every tf operation, like tf.constant, tf.matmul and so on, produces a tf.Tensor object and not a Python native type or a NumPy array; for this reason, as you can see in the last line, we have to extract the NumPy representation from the tf.Tensor by calling its .numpy() method. We can call the function as many times as we want and it works like any other Python function. So right now we have only a pure eager function; but what happens if we try to decorate this function and convert it to its graph representation using tf.function? Adding the decorator is pretty straightforward, and of course we might expect that, since this function works correctly in eager mode, we can convert it to its graph representation just by adding the decorator. Let's try and see what happens. I added two print statements before the return statement: the first one is a regular print, executed only by Python, and the second one is a tf.print statement, which is a node in the graph; this will help us understand what is going on. This is the first output we see on the console. When the function is called, the graph creation process starts: at this stage only the Python code is executed, and the execution is traced in order to collect the data required to build the graph. As you can see, this is the only output we get; the tf.print call is not evaluated, since, as with any other tf method, TensorFlow already knows everything about that particular node and there is no need to trace its execution. Moving forward, we see the second output: we got an exception, "tf.function-decorated function tried to create variables on non-first call". But in eager execution this function worked correctly, so what is going on? The exception is a little misleading, since we called this function only once while the message talks about a non-first call; but of course tf.function in practice calls the function more than once while tracing its execution to create the graph. In short, tf.function is complaining about the tf.Variable object, and this first exception brings us to the first lesson of this talk.
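Here is a minimal reconstruction, with made-up values, of the failing pattern just described: the body works eagerly, but the decorated version raises, because a tf.Variable would be created on every call while the traced graph is persistent.

```python
import tensorflow as tf

@tf.function
def f():
    a = tf.constant([[10.0, 10.0], [11.0, 1.0]])
    x = tf.constant([[1.0, 0.0], [0.0, 1.0]])
    b = tf.Variable(12.0)        # fine in eager mode, but not inside a persistent graph
    return tf.matmul(a, x) + b

try:
    f()
except ValueError as err:
    # "... tried to create variables on non-first call ..."
    print(err)
```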
The lesson is this: a tf.Variable object in eager mode is just a Python object that gets destroyed as soon as it goes out of scope, and that is why the function works correctly in eager mode; but a tf.Variable inside a tf.function-decorated function is the definition of a node in a persistent graph, since eager execution is disabled in that context. And since the graph is persistent, we cannot define a new variable every time we call the function. This brings us to the solution of the problem: just think about the graph definition while defining the function. Since we cannot declare a variable on every call, we have to take care of this manually: by declaring the variable as a private attribute of a class and creating it only during the first call, we can correctly define a computational graph that works as we expect. In short, this brings us to the second lesson: eager functions are not graph-convertible as they are; there is no guarantee that a function that works in eager mode is graph-convertible, so always define the function structure thinking about the graph that is being built. Okay, that was the first topic in the analysis of tf.function; now we can move on and analyse what happens when the input type of a tf.function-decorated function changes. This part of the talk is by far perhaps the most important, since tf.function has to bridge two completely different worlds: Python is a dynamically typed language, where a function can accept any input type, while TensorFlow, being a C++ library under the hood, is strictly statically typed, and every node in the graph must have a well-defined type and a well-defined shape. So we are going to define a function to test what happens when we change the input type. This is the function, the identity: on line one, the function accepts a Python variable x that can be literally anything; on line two we have a print, executed only once, during the function tracing; on line three we have a tf.print, executed every time the graph is evaluated; and in the end, since this is the identity, we return the input parameter. In the first test the input is a tf.Tensor. We expect that a graph is built for every different tf.Tensor dtype, and that this happens only once; every time we call the same function with the same type, the graph created on the first call should be reused, so we do not expect to see the Python print line again, only the output of the graph execution. Let's look at the output: as you can see, when the input is a tf.Tensor everything works as we expect. Since everything is going smoothly, we can dive a little deeper into AutoGraph and check whether the graph built after the AutoGraph conversion and the function tracing is what we think it is; in short, we think it should contain only the tf.print statement and the return of the input parameter. Using the tf.autograph module it is possible to see how AutoGraph converts a Python function to its graph representation. The code, of course, is a mess, because it is machine generated, but we can notice something unexpected: there is a reference to the Python print execution inside the graph definition, which is strange and not what we expected when we just want to create a graph.
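Here is a small sketch reproducing the identity-function experiment described above, assuming TensorFlow 2.x: print runs only while tracing, tf.print runs on every graph execution, and tf.autograph.to_code shows the machine-generated conversion.

```python
import tensorflow as tf

@tf.function
def identity(x):
    print("traced with", x)          # Python side effect: runs only while tracing
    tf.print("executing with", x)    # graph node: runs on every call
    return x

identity(tf.constant(1.0))   # traces once, then executes the graph
identity(tf.constant(2.0))   # same dtype/shape: cached graph reused, no new trace

# Inspect the machine-generated conversion produced by AutoGraph.
print(tf.autograph.to_code(identity.python_function))
```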
Looking at just this part of the generated code, and without digging too much into the constructor, we can see that there is the name of the function that is Python-executed, print, and its arguments, wrapped inside a control dependency. The second parameter of the AutoGraph converted_call is the owner, and as you can see it is None: this means there is no package known to AutoGraph or TensorFlow that contains the print function definition. In short, this line is a statement that gets converted to a tf.no_op, and its only side effect is to force the execution order: in practice we are just forcing the subsequent statements to execute after this no-op node. Okay. After this short analysis of how our function gets graph-converted, let's see what happens when the input is not a tf.Tensor but a Python native type. The code is similar to the previous one; we just define a helper function, print_info, to be sure we are feeding the correct data type to the function. Since the function is trivial, we expect the same behaviour we got before. Now let's see what happens when a Python integer is fed as input, and something weird is going on: the Python print is displayed not once, as we might expect for a single data type, an integer, but twice; the graph is therefore recreated at every function invocation, which is really weird. And trust me, things get even worse, because after the first executions we have defined two graphs, one for the value one and one for the value two; but what happens if we now feed the same values with a different data type, floats? As you can see, the graph is now not recreated at every invocation, but given a float input we get an integer output. So this is no longer the identity function, it is somehow broken: the return type is wrong, and the graph that was built for the integers one and two is being reused for the float values one and two. This was my face when I discovered this. I spent some time figuring out what was going on, and I summarized it in the next lesson, lesson number three: tf.function does not automatically convert a Python integer to a tf.Tensor with the dtype we would expect; since integers in Python are 64-bit, we would expect a tf.int64, and so on. The graph ID, when the input is not a tf.Tensor object, is built using the variable's value, not its type. This is a design choice of the tf.function authors that I don't like much, since it makes the graph conversion less transparent and you have to worry about this behaviour. Moreover, since a new graph is created for every different Python value, we risk designing terribly slow functions. We can see this in a simple performance measurement: g is the identity function here; in the first loop g is fed with the tf.Tensor objects produced by a tf.range call, while the second loop invokes g with a thousand different Python integers, which means we are building a thousand different graphs. tf.function is actually optimized and works well when the input is a tf.Tensor object, as you can see from the time measurement, while it creates a new graph for every different Python input value, with a huge drop in performance. And this brings us to the next lesson: use tf.Tensor everywhere, seriously.
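Here is a small timing sketch in the spirit of the measurement just described (exact numbers depend on the machine, and TensorFlow may also warn about retracing): tf.Tensor inputs reuse one cached graph, while distinct Python integers trigger a new trace each time.

```python
import time
import tensorflow as tf

@tf.function
def g(x):
    return x

start = time.time()
for x in tf.range(1000):     # tf.Tensor inputs: one traced graph, reused every time
    g(x)
print("tf.Tensor inputs:", time.time() - start)

start = time.time()
for x in range(1000):        # Python ints: a new trace for every distinct value
    g(x)
print("Python int inputs:", time.time() - start)
```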
This is the mantra to repeat. And tf.Tensor is not the only TensorFlow object we have to use when working with tf.function: tf.function has this weird behaviour with Python types, but it also has other weird behaviours with other Python native constructs, which brings us to the last, really brief, part of the presentation. What happens when we just use Python operators inside a tf.function-decorated function? The function on the slide works correctly in eager mode: given a tf.Tensor x that holds the constant value one, we expect the comparisons on x to behave as they do eagerly and the final else branch to never be reached. But in practice, if we execute the decorated function, this is the output: the wrong branch. Keeping it really short, there are several problems with that function. The biggest one is that the Python equality operator is not overloaded as tf.equal; the second huge problem is that AutoGraph handles the conversion of the if, elif and else statements, but not the conversion of the boolean expressions defined using the Python built-in operators. So, in short, the correct way of writing the function is to use the TensorFlow boolean operators everywhere instead of the Python native operators, and this brings us to the last lesson, the operators lesson: use the TensorFlow operators everywhere, seriously, otherwise you get weird behaviours that make no sense and are really hard to debug. We are reaching the end, and this is a recap of the five points: variables need special treatment; you have to think about the graph while designing the function; the conversion from eager to graph is not straightforward; there is no auto-boxing of Python native types to tf.Tensor, so use tf.Tensor everywhere; and use the TensorFlow operators explicitly everywhere. This is the end; I hope you enjoyed the talk, and I just want to share with you that I'm writing a book about TensorFlow, tf.function and neural networks. If you want to stay in touch and get informed when the book is out, or when a new article about TensorFlow and the whole TensorFlow ecosystem is out, just leave your email on the subscribe page. Thank you. Okay, we have all our time this evening to ask questions, because it's the last talk in this room, so if no one minds I would start with one. First, please publish your slides after the talk, I really want to reproduce the examples you gave us, because that's what I like about TensorFlow, you sometimes get this crazy stuff and crazy errors and you have no idea what they mean. And second, what do you think: did the developers of TensorFlow do this on purpose, all this about the less, greater and equal operators, did they decide on purpose not to replace them with tf.greater and so on? So, I'm one hundred percent sure that the Python __eq__ operator has not been overloaded as tf.equal, because internally, in the TensorFlow code base, tf.Tensor objects are used as keys in maps, so they have to be hashable, and they can't use tf.equal for that, because tf.equal generates a new tf.Operation and is therefore not something hashable; that is the reason for the equality operator.
As for the other operators, greater, less and so on, they should be converted, and perhaps they will be, because in the RFC they said that in the future they will handle these comparisons; but since there is this problem with the equality operator, perhaps they can't, and for this reason they force us to use the TensorFlow boolean operators. Thank you. If you have any questions, please come to the microphone, because they are not detachable; we still have a bit of time for one or two questions. Okay, then thank you very much for the talk, and thanks everyone for being here.