All right, good afternoon, everyone. My name is Stéfan van der Walt. I'm a postdoc here at Berkeley. I'm from a small town called Stellenbosch down in South Africa, at the other southern tip of Africa. I came by way of electronics engineering, ended up in applied math, and my main interest is currently image processing. I'm also lead author on scikit-image, which I'll be telling you a bit about today.

Just to give you an idea of the layout of image processing packages in Python: you are all very familiar with NumPy by now. NumPy is based on C source code, it calls into some linear algebra libraries, and there's also some Cython involved. On top of NumPy you then have SciPy, the extended scientific routines, and inside of SciPy there's a package called ndimage, which is used for dealing with n-dimensional image data sets. Matplotlib is often used to visualize those images. Images are just matrices: for gray-level images we typically have an m-by-n array, and when you start working with color images you typically have several layers, so m-by-n-by-3 for RGB, or 4 if you have an alpha layer as well.

ndimage comes from the astronomy community. It was originally written by Peter Verveer as part of Numarray. NumPy originated as two packages, Numarray and Numeric, which were blended together in the end, and when they came together there wasn't really room for ndimage, so ndimage ended up inside of SciPy. Unfortunately, ndimage is written in C, so it's fairly hard to maintain, and we didn't have a lot of contributors there. So we decided to start a new project called scikit-image. The toolboxes for SciPy are called scikits, and there are several out there: I think you'll learn today about scikit-learn; there's scikit-image; there's statsmodels for fitting linear models; and so forth. A whole bunch of them, basically like toolboxes inside of MATLAB. In scikit-image we mostly do not deal with n-dimensional data sets; we just deal with image processing algorithms on single-layer images. But we try to write the kind of code that's very easy to maintain, and we go for slightly more complex algorithms that you can't easily implement in C.

These slides should be available afterwards. This gives you an idea of the types of things implemented inside SciPy's ndimage. You've got things like convolution, for when you want to blur or sharpen an image; correlation, for when you want to fit a mask, say to locate an object inside an image; Gaussian filters for blurring and sharpening again; a whole bunch of Fourier filters; and interpolation — that module is for when you want to do geometric transformations of your images, like rotation, scaling, etc. There are methods for measuring objects inside an image: if you want to count the number of stars in a frame, for example, you'd identify all the different objects and use some of these methods to compute statistics on them. And there's morphology, both gray-level and black-and-white. Morphology is typically used for manipulating the structure inside an image. For example, suppose you have a piece of text and you want to do text recognition on it. If you convert that text to black and white before you do the recognition, you may find that a lot of the text has little lines sticking out of it
and isn't very smooth. Then you typically use morphology operations. These are operations on groups of pixels: they'll do something like say, let's look at this group of pixels, and if there's a little straggler sticking out, cut it off. They manipulate the structure of the image in that way, so in that scenario you'd get much smoother text out.

(I'm using the new version of the notebook today.) The examples for today's breakout session you can all do using only ndimage; you don't even need scikit-image, so you should be fine.

Right, so I thought I'd start you off by taking a typical image processing problem that you might run into in industry and show you how we'd use these tools to solve it. The example here is one where you've got an image obtained by scanning electron microscopy, and it shows a sample taken on glass with some grains of sand on it. You'll see different patches: the gray background is the glass itself, there are some bubbles in black, and the molten sand grains are the dark gray blobs. We'd like to know which fraction of the sample is covered by each of these three phases: the bubbles, the sand grains, and the background. So we're going to see how we can count that automatically using ndimage.

The first thing we need to do is get the image into Python so we can manipulate it. We're going to start by importing matplotlib and NumPy, and then we're just going to read the image using matplotlib. There's a small bug in matplotlib at the moment which flips some PNG or JPEG images upside down, so you do imread to get hold of the image, and then, until they fix the bug, you just have to flip it back. And here I show it. You'll notice that I specify the color map as gray, otherwise the image appears in all sorts of funky colors, and I use nearest-neighbor interpolation. When you display an image using matplotlib you can tell it which kind of interpolation to use, and when you do image processing it's often handy to use nearest interpolation, because that won't hide any artifacts: you see each pixel that is actually inside that matrix, whereas something like cubic or bilinear interpolation might hide some of the detail going on.

The next step is to get rid of the black banner; you'll see the microscope added something there at the bottom. So I take the first 880 rows of the image and drop the rest — just a bit of indexing to get rid of those. There we've got the image without the banner.

Now, if you look at this image, you'll see there are a lot of speckles. I don't know if you can see them from way back there, but there's a whole bunch of speckles all around, and we presume that those are noise. We're not really interested in counting those, so let's just get rid of them.
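Putting those first steps together, here's roughly what the loading and cropping look like; the filename is a stand-in, and the flip is only needed while the matplotlib bug is around:

```python
import matplotlib.pyplot as plt
import numpy as np

image = plt.imread('sample.png')   # hypothetical filename for the microscopy image
image = np.flipud(image)           # undo matplotlib's upside-down read
image = image[:880]                # keep the first 880 rows; drops the banner

plt.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
plt.show()
```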
Now, I just briefly described morphology to you. How many of you are familiar with any morphological operations? Do you know convolution? A little bit — okay. (And yes, a version of these slides will be available; thank you.)

So the concept is this: you start off by defining what's called a structuring element. In this case we may choose to use a little star, or a little cross. You move this little cross over your image, and you say: unless this star is completely covered by pixels, remove the center pixel from the image. You can see that in most parts of the image that would be fine, except when you end up in a place like this, where it's basically black with just a single pixel on the inside. It's going to throw that pixel away, because the whole of the cross isn't covered.

So there's the code for that. I use ndimage's median filter here; there are plenty of other ways to filter those speckles out, and the median filter actually works slightly differently from the morphological operations, but there are numerous ways to get rid of noise like this. When you apply the median filter, you see that all the speckles are gone. The idea is to preprocess the image so it is more suitable for what we want to do next.

The next step is the interesting part: how are we going to isolate the three different phases of this image? Essentially, how are we going to split up the light background of the glass versus the black blobs versus the sand grains? To see how, I plotted the histogram of the image, just using matplotlib's histogram command. And what do we see? The intensities are very clearly grouped in three. We've got a spike here around intensity zero, which is black; those are the big black blobs. Then here, around a hundred — that's the dark gray, the sand. And around a hundred and forty, that's the light gray background.

So where do you think we can cut this image? What would be good thresholds to use? Anyone like to volunteer a threshold? Say I want to separate this first group out: we just have to say, ignore all the values higher than about 50 in intensity, and look only at these. For this middle group we're going to have to set two thresholds, so we say between 50 and 120, and then we can look at 120 and up. Let's see what happens when we do that.

That's exactly what I did here. I said the bubbles — that part of the image — are simply where the median-filtered image has intensity less than or equal to 50; sand is between 50 and 120; and glass is greater than 120. Now, just based on what you know of NumPy already, do you know what data type I'm going to get out of this first line over here? If I do image_median less-than-or-equal-to 50, what comes out of that operation? An array of booleans, absolutely. This is basically a truth test: is the image intensity less than or equal to 50? So I'm going to get an array out with the same dimensions as image_median, but it's only going to have true and false values in it. It's a mask. So I've got my three masks for the different phases, and then I'm just going to plot each of those in a different subplot. There's a little utility function that takes the original, the bubbles, the sand, and the glass, and plots them. There we go. Here you see the original image: the big black blobs have been isolated, the sand grains have been isolated, and the rest is just the light background.
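In code, the filtering and thresholding steps look roughly like this, continuing from the cropped image above (the 7×7 median window is my guess; the talk doesn't give the size):

```python
import scipy.ndimage as ndi

# Median filter to remove the speckle noise before thresholding.
image_median = ndi.median_filter(image, size=7)

# Three boolean masks, with thresholds read off the histogram.
bubbles = image_median <= 50
sand = (image_median > 50) & (image_median <= 120)
glass = image_median > 120
```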
Next I want to visualize all of these inside one image; I basically want to colorize the whole thing. How can I do that? Well, using np.zeros I can construct a new image. It has to have the same dimensions as my input image, except it now needs three layers, because I want red, green, and blue. So I use np.zeros with the same shape as my input image but with three layers, and then I proceed to assign different values using the masks I just created for the bubbles, the sand, and the glass. I assign (1, 0, 0) to the bubbles — red, green, blue: one for red, zero for green, zero for blue — so the bubbles become red. Sand gets full red and full green, which gives yellow, and the glass is (0, 0, 1), so that's just blue. When we've constructed that array and we plot it, this is what we see: a very nice representation of those three phases together.

Except, do you see what goes on around the fringes here? On the red blobs we've still got a little bit of yellow surrounding them. We want to get rid of that as well, so we do one more step of processing, using again a little bit of morphology. In this case we use binary opening and closing, which does exactly what I described earlier, basically just to clean those up, and when you call that operation it cleans up the boundaries very nicely, so you don't have those little speckles on the sides.

All right, so now we're in a good position to start counting the different objects that we found. We're going to label the different components and count their sizes. First we convert bubbles, sand, and glass — those masks of ours — into integers, because we don't want them to be true and false any longer. Now, for the different grains of sand, we basically want to give each grain a number, and to do that we use ndimage.label (there's an equivalent function in scikit-image). It runs across the image and finds connected components, so if you've got loose grains of sand, each grain will be labeled 1, 2, 3, etc. We call ndimage.label on each of our masks, and then we use a single line of list comprehension to count the total size of each of those objects: wherever the labeled image is equal to i, count the number of occurrences, and do that for all the different labels we found.

If you run that, you get output like this. In the sand it found a hundred and fifteen separate regions in total, and the mean object area was 1769 pixels; 70 regions for the bubbles, 27 for the glass. And if we plot those labeled images, you can very clearly see what happened. I plotted them with the spectral color map, which is useful here because it changes very rapidly, so you can see how the label numbers increase. The bubbles example is very nice: it starts off purple, then blue, green, red. That's how ndimage went through and labeled each of these individual components separately.

So this basically solves the problem, and that's a very typical image processing pipeline of the kind you'd see in industry.
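A minimal sketch of that labeling and counting step, continuing from the masks above:

```python
import numpy as np
import scipy.ndimage as ndi

# Clean up the fringes with binary opening, then label connected components.
sand = ndi.binary_opening(sand)

labels, n = ndi.label(sand)                       # each grain gets a label 1..n
sizes = [np.sum(labels == i) for i in range(1, n + 1)]
print(n, 'regions, mean area:', np.mean(sizes), 'pixels')
```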
There was a question: if you have an image with different bands, how would you approach this? Most algorithms in image processing are geared to just one layer, so people typically convert the image down, or they move into a different color space. Are you familiar with the HSV color space at all?

Normally we describe an image in terms of red, green, and blue components, like on a TV screen. I don't know how the modern TV screens work, but the old ones at least had three elements — red, green, and blue — and depending on which ones you fire, you get a different color back. But the human brain has a very different way of seeing the world. Our sensors are red, green, and blue, but if we talk about perception — if you had to describe the color of his jersey, you might say, well, it's orange, or red, and it's kind of not very bright. You'd use terms like that, and that is pretty much what the HSV color space tries to embody. HSV stands for hue, saturation, and value. Hue is the color, so that would be red or blue or green. Saturation is how full that color is: is it grayish, or is it a bright, bright red? And the value is just the intensity: is it dark or is it very bright? So often we convert images into that color space, and then we can manipulate them in a way that's much more natural for human beings. But typically, unless you've got a vector approach, you have to convert the image to a single layer and process that. Are there any other questions about the example?

All right, so that was a brief tour of ndimage, and now I'll proceed to tell you a little bit about scikit-image, the project we're currently building. I'm not sure which version you have installed; 0.4 was the last release, but a whole lot has happened since then, and we're currently almost at the point where we'll release 0.5, so some of these examples may not yet work on 0.4.

There's the scikit-image homepage. It's a library that's free of charge and free of restriction; we really try to write high-quality code, and it's very much a community effort. If you end up using this, we very much appreciate feedback, bug reports, and of course code contributions. On the front page you'll see links to the documentation, how to download it, and so on, but one of the interesting places to go, if you scroll down, is the gallery. In the gallery we give examples of some of the main functionality, very much like the matplotlib gallery you've seen before, except with a whole bunch of image processing examples, and I'll run through those just now.

The reason we started this project was that a lot of things that could be really easy were kind of hard using pure SciPy, matplotlib, and so forth. You could build the whole pipeline, but you often had to go to more effort than you'd want, so we wanted things to be really simple. Say you want to display or process an image: you'd start with from skimage import io — that's for displaying images — and data, if you want to get hold of some test data sets, and in this example I'll show the filter module, just to show you how these things string together. You typically start off by grabbing a data set — image = data.coins(), an example data set we provide — do some filtering operation, like applying the Sobel filter to the image, and then display the result. Really simple.
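That example, roughly as typed; note that the module was called skimage.filter in the 0.4/0.5-era API shown in the talk, while current releases spell it skimage.filters:

```python
from skimage import data, io, filters  # 'filter' in the era of the talk

image = data.coins()           # one of the bundled example data sets
edges = filters.sobel(image)   # Sobel edge filter
io.imshow(edges)
io.show()
```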
Yes, I saw it highlighted — the name shadows the built-in, so you could do from skimage import filter as filt if you prefer.

The image returned here is just a normal array, and there's no reason why you have to use our input/output module. In fact, our input/output just uses matplotlib at the back end. We've got several plugins, so you can use either matplotlib or Qt to display an image; there are numerous output back ends, and matplotlib is the default. You could just as well have used the plain matplotlib command: io.imshow is exactly equivalent to plt.imshow with the color map set to gray and interpolation set to nearest. Same thing.

To go into a little more detail about the input/output module: you saw earlier that matplotlib has that little bug where it flips some images upside down, so you might not want to use matplotlib to read an image. You can tell the input/output module to use the plugin 'pil' instead of matplotlib for image reads, and then when you read the image you'll see it was not flipped: we just chose to use the Python Imaging Library instead of matplotlib to read it. The reason we did this is that it becomes very easy to write your code in a modular way: you can use matplotlib today, and tomorrow you might change your mind and decide you want a different module to load images, and you can easily swap between them.

One that's a little more fun is our Qt image viewer. When you use the plugin 'qt' for image showing and provide the fancy=True argument, the scikit-image viewer pops up. This is a nice way of inspecting images: as you move the cursor over the image, you see the red, green, blue, and HSV values and the position of your cursor changing, and you can play around with different color adjustments — you can manipulate the brightness and contrast, almost an image-editing kind of approach, just playing with parameters and seeing what works. Say you wanted to adjust the gamma of this image, so push it up — that looks horrible. Let's change the channels instead; let's just make this image really green. There we go. Oh, it's additive — that explains it. So now we've got a very nice green cameraman; let's reduce the red. And then you can either save the image to file, or you can commit the changes you made, so you can build up a number of changes, put them on top of one another, and revert them if you're unhappy. This one is fine, so we'll save it to the stack. When we click on "save to stack," the image gets pushed onto a stack, and the viewer tells us that to get hold of that image we should call io.pop. So we close this, go back here, and if I call pop, I get that image back and can continue manipulating it.
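A sketch of the plugin switching just described; the plugin names and the fancy argument are as used in the talk's era and depend on what you have installed (the Qt viewer has been reorganized in later releases):

```python
from skimage import io

io.use_plugin('pil')             # read images with the Python Imaging Library
image = io.imread('photo.png')   # hypothetical filename; no upside-down flip here

io.use_plugin('qt')              # requires a Qt installation
io.imshow(image, fancy=True)     # pops up the interactive scikit-image viewer
image2 = io.pop()                # retrieve an image saved to the viewer's stack
```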
You may from time to time have the need to load FITS images, so one of the back ends we provide is for reading FITS files. If you do io.use_plugin('fits'), it makes use of PyFITS in the background — so unless you have PyFITS installed, that won't work — and then it reads FITS files from disk. Some of these images are required in the breakout session, but I converted them to PNG for you, so if you don't have PyFITS you can access them easily.

Right, so when manipulating images, the data type of the image is rather important. There are common ways of representing an image. If you use a floating point format, the values typically lie between zero and one, so black would be zero and white would be one. Another common convention is to use unsigned 8-bit integers, which run from 0 to 255. And more recently, since we now have ample storage space, people store images as 16-bit integers, which run from 0 to 65535, and so on.

Here's an example of the kind of problem you may run into. That image I just loaded using PyFITS has data type float32, but if you look at it, its minimum and maximum values run from a hundred to sixty-five thousand something. scikit-image wouldn't know what to do with that image, because the range of the values is all wrong for floating point: we expect them to lie between zero and one. So if you try to operate on that image, for example with that Sobel filter again, you're going to get an error telling you that images of type float must be between zero and one.

We have numerous utility functions for dealing with that situation. I import the exposure module from skimage and say: please rescale the intensity of this image. That basically takes the image's minimum value and moves it to zero, takes the maximum value and moves it to one, so your image covers the whole span, and when you operate on it everything will work fine. You can also, when you do that rescaling, tell it what the input range should be. I took that floating point image and said: let's assume that 0 is my black value, but 0.25 is my white value. That discards all the values above 0.25 and stretches the image so that 0.25 is now my new white value. That's a good trick when plotting images: for example, if you take the Fourier transform of a signal, you very often find that the whole spectrum is very low in amplitude except right at the origin, where the mean is super high. Typically you want to cut off that one value and stretch the rest, and there you'd use something like the input range to limit it.
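A minimal sketch of those two calls; the stand-in image is mine, in place of the FITS image from the talk:

```python
import numpy as np
from skimage import exposure

image = np.random.rand(64, 64) * 0.5   # stand-in for the awkwardly-ranged image

# Stretch so the image's own min/max map to 0 and 1.
rescaled = exposure.rescale_intensity(image)

# Or declare the input range: values above 0.25 are clipped to white.
stretched = exposure.rescale_intensity(image, in_range=(0, 0.25))
```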
To further deal with the different kinds of images, we've got these utility functions: you can say give me my image as a float, as an int, or as an unsigned byte, and that just converts between the different representations. img_as_float gives you an image between zero and one; as int it will be between 0 and 32767 — that's a sixteen-bit integer; and as unsigned byte, between 0 and 255. scikit-image will also warn you if you're doing something that might lead to trouble, so you'll see there's a warning about possible precision loss when converting from a floating point representation to an integer one. You can see why: if your image values lie between zero and one in floating point, you can have a whole wide variety of values, whereas unsigned bytes give you only 256 distinct values, so you run the risk of throwing away some of your data.

We've got numerous test data sets, because often you just quickly want to try out a little algorithm and you don't want to load data from disk. If you import the data module, you can use data.camera(), for example, to get hold of this guy; data.coins() gives you these. And if you put a question mark behind a data set, you get a little description of what it is. Oh — what did I do? Oh, I'm still using the FITS plugin; that won't work, the FITS plugin can't load PNGs. Okay, there you see: those are Greek coins from Pompeii. The docstring gives a little background about what you see in the image, what the image is typically useful for illustrating, where you can get hold of the original data, and what the copyright restrictions would be if you decide to use it in a paper, etc. Yes, they're shipped with the distribution.

Right, so the idea is that any single function inside scikit-image can take any image as input. Whether you use a floating point or integer representation, you just pass it into the function; the function does what it needs to do and passes out an image of whichever type it finds convenient. So you can easily string functions together. I took the cameraman image and applied the Canny filter to it, and here's the output. But you could just as well string a couple of things together: from the filter module I did total variation denoising and then applied the Canny filter to the result, and you see you get a much cleaner image out. You can chain arbitrary functions like that and basically build yourself a nice long pipeline. It's as simple as that. (I have an example here about color spaces; maybe I'll skip that one for now.)

Here's an example where we take an image and scale it up or down. Often, especially in the field you're in, you start off with extremely large images, so while you're developing an algorithm you typically want to scale the image down first, just to test your ideas on. ndimage has a function called zoom: I take the cameraman image and use ndimage.zoom on it with a zoom factor of 0.1, scaling it down by 10 essentially, and this is the output I get — a much lower resolution version of my input.
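That chaining looks something like this. In current releases these functions live in skimage.restoration and skimage.feature; the 0.5-era API of the talk had them under skimage.filter, and the weight value here is just for illustration:

```python
from skimage import data, feature, restoration

image = data.camera()

# Each function accepts whatever image comes in and returns a convenient type,
# so calls chain naturally into a pipeline:
denoised = restoration.denoise_tv_chambolle(image, weight=0.1)  # TV denoising
edges = feature.canny(denoised)                                 # boolean edge map
```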
scikit-image allows you to do something a little fancier if you want. It has this idea of a homography, which is basically just a coordinate transformation: you specify a three-by-three matrix, each coordinate in the image is represented as a vector (x, y, 1), and that vector is multiplied with the matrix to find its new output position. You'll recognize some things in here if you did a linear algebra course: I've got a cosine term, a minus sine, a sine, and a cosine — that's a rotation matrix. Then here I've got a translation in the x and y directions, and the two little coefficients down here just adjust skew. If I just want to zoom in and out like I did earlier, I just manipulate this s parameter.

You can see how it works if we write it out. If you take your image coordinates (x, y, 1) and multiply with a matrix with 0.1 in the first two diagonal entries, then your new x value is just 0.1 times what x used to be, your new y value is 0.1 of what y used to be, and the third coordinate remains the same. So that's just another way of scaling the image: if I set s equal to 0.1 and apply this matrix, I get exactly the same result as the zoom command in ndimage. But we now have a bit more flexibility, because we can adjust the matrix to include, for example, a rotation angle. Let's set that angle to 10 degrees — and there you see the image rotated by 10 degrees. If you want to translate the image as well, you set those parameters there, translate x and y, and there you see the image rotated and shifted. With the skew parameters one has to be a bit careful; they're very sensitive — well, not that sensitive. Let's not do the scaling, because it looks a bit funny. There you see a little bit of rotation, translation, and skew.

This kind of thing comes in handy in all sorts of places, but let me show you an example from computer vision. (No — I just used the first 50-by-50 pixels there. This is actually the output, but you only see the part we're interested in because I scaled it down so much; if I change the scaling parameter to 1, then it's nice and high resolution. Yes, these two parameters here tweak those last-row values.)

So here's an example of where I use this. I included the Python file for you as well if you want to play with it — unwarp.py, there we go. You start off with a warped version of the cameraman, and you can specify where you want to warp that image to, anywhere on this canvas. You click, for example, there, there, there, and there, and then you also click the corresponding points in this image over here — I'll say those corner points lie there, there, there, and there — and then it maps it out for you.
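Here's a sketch of such a homography using the current scikit-image API (transform.ProjectiveTransform and transform.warp; the function names in the 0.5-era release differed), with parameter values chosen just for illustration:

```python
import numpy as np
from skimage import data, transform

image = data.camera()

theta = np.deg2rad(10)   # rotation angle
s = 1.0                  # scale
tx, ty = 50, 20          # translation

H = np.array([[s * np.cos(theta), -np.sin(theta), tx],
              [np.sin(theta), s * np.cos(theta), ty],
              [0, 0, 1]])  # nonzero entries in the last row would add skew

tform = transform.ProjectiveTransform(matrix=H)
warped = transform.warp(image, tform.inverse)  # warp expects the inverse mapping
```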
Let me do another example. Let's map this version out to a rectangular shape, so I'll click in the center of each corner: there, there, and on the corners there. There we go. Sometimes you have an image — say two star fields that are almost aligned, but with a little bit of linear distortion between the two. If you can match four stars on each of the images, then you can use this to transform the images so they lie right on top of one another.

Now, this is all linear transformation: all straight lines were preserved. No matter what we did here, a straight line remained a straight line, even in this example — straight lines here remain straight lines over there. But ndimage allows you to do any kind of crazy transformation. Here we've got an example with extreme radial distortion: some camera lenses warp straight lines quite a lot, and that's not really what we want to see. What this example lets me do is mark, with a little cursor, three points on a straight line — there, there, there — and then take this line over here: this one, this one, and this one. There we go. So there's the original, and then you can do a non-linear warp, and you see all the lines come out nice and straight. ndimage lets you do all sorts of crazy transformations like that.

(No, this isn't in the package — I put the script in there, but it's just something I wrote. It makes use of one main function, ndimage.map_coordinates. It's got a bit of a crazy syntax, because you basically have to compute all the coordinates beforehand, mapping them to where you want them to go, and then send them in. But it's pretty powerful.)

Let me show you one more example of that. Here's a short little paper we submitted a while ago, where we made use of something called the log-polar transform. Have any of you ever run across the log-polar transform? It's a very interesting image warp. What you do is basically rewrite the image in terms of polar coordinates. Here you've got the x-axis and the y-axis, but on the transformed image you've got distance from the center versus angle. So you'd take a pixel like that, for example, measure the angle — 45 degrees — find 45 degrees on this axis, then find the distance from the center, and plot it there. Except we do something else as well: we don't take the distance, but the log of the distance. Why would you do that? It seems like a crazy thing to do. Well, the reason is this: what happens to multiplication when you take a log? It becomes addition, essentially. So if you zoom this image in or out — if you scale it up or down — that just becomes a translation in this domain. And if you rotate the image, so an angle changes, that becomes a translation up or down. Zoom becomes translation in the x-direction; rotation becomes translation in the y-direction. This is super handy when you do image registration, for example: instead of having to estimate zooms and rotations, you just do a correlation in this domain, and those things come out for free.
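Just to make map_coordinates concrete, here's a toy log-polar resampler under my own parameter choices — not the code from the paper:

```python
import numpy as np
import scipy.ndimage as ndi

def logpolar(image, angles=360, radii=200):
    """Resample a 2-D image onto a (theta, log r) grid about the image center."""
    rows, cols = image.shape
    cy, cx = rows / 2.0, cols / 2.0
    theta, r = np.mgrid[0:angles, 0:radii]
    theta = theta * (2 * np.pi / angles)          # angle of each output row
    r = np.exp(r * np.log(min(cy, cx)) / radii)   # log-spaced radii, 1 .. image edge
    # For every output pixel, the (row, col) position to sample in the input:
    coords = np.array([cy + r * np.sin(theta), cx + r * np.cos(theta)])
    return ndi.map_coordinates(image, coords, order=1)
```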
Part of the homework today requires determining features in an image, so here's a short example from scikit-image that uses the histogram of oriented gradients (HOG). It splits the image up into a whole bunch of blocks, and then for each block it counts the different gradient orientations that it sees. From skimage.feature you import hog and just call it: you tell it how many orientations you want to count and how many pixels per cell, because it's going to chop up the image into little blocks of the size you specify.

(It's not just the Southern Hemisphere, it's everywhere the Brits went. Wasn't there a war or something? Well, I feel a little bit threatened now.)

Again, I used that rescale_intensity function I showed you earlier, because the histogram values are typically low — if you have, say, five different orientations, the peak values may be around 0.3 or so — so I scaled that image so it appears nice and white on the display. So there's the cameraman and the histogram of gradients, which basically plots a little line at each position, where the intensity of the line varies depending on the directions that were found. You'll see, around the camera leg, those angles were quite prevalent. (Those should be perpendicular to the leg — I'm not sure why they lie like that; it might be the discretization.) And over here, where things are fairly vertical, you see the horizontal gradient coming out quite strongly. So you may be interested in looking at these for the classification task.

(How do they weight the intensity? I'll have to look at the algorithm; I can't remember.) This one is for gray-level images, but you can split an image into the R, the G, and the B, run it on each of those, and add them together or whatever you wish. Most often we just convert, and there's a whole color space module for that. If you do from skimage import color, you get access to color.rgb2gray, which takes a color image and transforms it into a gray-level image. There's also gray2rgb if you want to take a gray image to the color domain, but of course all that does is duplicate the gray level across the channels, because we can't infer the colors from nothing. (Well, you can if you train a classifier first, something like that.)
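A sketch of the HOG call just described; the parameter values are only for illustration, and newer scikit-image releases spell the keyword visualize (the older API used visualise):

```python
from skimage import data, exposure
from skimage.feature import hog

image = data.camera()   # already gray; for RGB input, convert with skimage.color.rgb2gray

features, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16),
                          cells_per_block=(1, 1), visualize=True)
hog_image = exposure.rescale_intensity(hog_image)  # brighten the low values for display
```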
All right, I'll just run through a couple of final examples. Have you heard of seam carving? Often you have an image and you want to rescale it, but it's kind of annoying if you have, say, a photo of the beach with a bunch of people running on it, and you want to fit that photo into a column this wide. If you squash the image that way, all the people become super thin, and that's not really ideal. Here you see the same sort of thing: if you scale the image down, everything's too small; if you crop it, you can't get all the objects of interest into the frame. What you typically really want is something like this. I mean, it's cheating a little bit — isn't that lying? Yeah, absolutely. That's what seam carving is all about: it's trying to find a way of representing all the information in an image without actually preserving everything. Sometimes you need to put the picture in and it won't go, and it doesn't look good when you squish it. You should never trust what you see in digital images.

So this is what we do: we take the image and find all the edges in it. You can do that with something like skimage.filter.sobel or canny; any of those will give you the edges. Now we want to remove pixels from the image — but which ones are good to remove? Do your eyes see the edges or the smooth areas better? What would you notice if I took it away? I think if you take the edges away, your eyes are going to tell you there's something weird going on; but if I just run through the smooth areas and surreptitiously remove those pixels, no one needs to be any the wiser. So we've got this algorithm, shortest path — it's the same as dynamic programming. You take that edge image and try to find the cheapest path through it, the path with the least change along it. There we go — you just missed it, but those paths, the ones you saw around the side and around there, have very little variation in pixel intensity as you go along, so removing them is not a big deal. We find those, remove them one by one, and shift the pixels in: find the paths and take them out.

(To enlarge an image, you just duplicate seams instead of taking them out. You have to be a little careful, because if you duplicate the same path all the time it looks weird — you insert bands into the image — but it works if you alternate between different paths. I actually do have an implementation of this for scikit-image. You compute the gradient, then you find the path that traverses the image without crossing any strong gradients. People have gone further, too: you can mark up areas of the image that you'd prefer not to be carved out — yes, like the people in the beach scene — or, yes, take all of them out.)

You can implement this yourself using the shortest path algorithm inside scikit-image; it's in the graph submodule. If you want to find things inside scikit-image, just go to the documentation. You can always see on the side which version of the documentation you're looking at, because if you have an older version of scikit-image you may not have all the functions. If you click on API reference, there's a list of all the functions in scikit-image. There's the graph module, and it's got route_through_array, where you specify the start and end points and it finds the shortest path through the array that satisfies that constraint.
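A minimal sketch of that call on a tiny cost array of my own making:

```python
import numpy as np
from skimage import graph

# Low costs mark where we expect the cheapest path to run.
costs = np.array([[1, 9, 9],
                  [9, 1, 9],
                  [9, 9, 1]], dtype=float)

# Cheapest route from the top-left to the bottom-right corner.
path, cost = graph.route_through_array(costs, (0, 0), (2, 2))
print(path, cost)
```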
Let's look at some of the other examples. Contour finding is quite pretty: these lines are superimposed on the original image. I just took some function — what was it? — sin(x)³ plus cos(y)², plotted that, and found the contours on it. It uses something called marching squares.

For the assignment you may want to look at gray-level co-occurrence matrices. Here we've got a couple of samples from the sky and a couple from the grass, and when you plot their gray-level co-occurrence matrix properties, you see that they cluster very nicely.

Histogram equalization some of you may be familiar with. You start off with a low-contrast image, and you can either use the stretching method I described earlier to get something like this, or use histogram equalization, which tries to make the cumulative distribution into a straight line — which can give you weird-looking images, so it's sometimes better just to stretch the intensities.

Radon transforms: this is basically the data you get when you scan a brain or an object like this. You get this out of your scanner, and based on it you can do an inverse transform and recover your data; this is what happens inside a CT scanner.

And some segmentation, which is super handy when you want to manipulate objects: you provide seed points and it grows the different regions. There's a random walker segmentation as well, so even in very noisy images you can specify some seed points here and here, and it will grow them out for you and separate the different regions.

So have a look at the gallery, see if there's anything useful, and you're very welcome on the mailing list if you've got any questions. We typically respond quite quickly, and we'd love to help you use these tools. Right, thanks for your attention.

Oh, right — yes, that was the idea of the scikits originally, but many of these projects have gotten a life of their own. It's very much a close collaboration, though; there's good communication between the teams, so these things will always work together. I'm not sure if the goal is actually to move it over into SciPy eventually, but the tools are definitely meant to be used together.

So, apparently you guys see a lot of images of stars. I don't, often, so I love working with star pictures. I provided you with — let me see if I can open this — so I gave you this; it's the output after a bunch of image processing that someone did on numerous layers he took with his telescope. He applied the red, green, and blue filters, plus some other filters as well, so he provides four different layers, and if you combine those layers you can get out a nice picture like this.
So that is essentially the assignment: take these four gray-level images that I provided and try to combine them so that you have a very nice color visualization of what you see there. Use the small images, because the large ones are just too big. If you want to construct a color image, you want to set up a new array of size m by n by 3; the three layers are red, green, and blue, and matplotlib knows that, so if you try to visualize a matrix like that, it will automatically assign the colors for you. So try, for example, to take the red, green, and blue layers, put them inside the three layers of an array, and visualize that; see if that works. Then play around with how you can add the fourth layer as well. The link I give to the website basically describes exactly the pipeline that person used, but feel free to play around with your own ideas.
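As a starting point, here's a sketch of that composite; the filenames are placeholders for the layers provided with the breakout materials:

```python
import matplotlib.pyplot as plt
import numpy as np

# Load three of the four gray-level layers (placeholder filenames).
r = plt.imread('red.png')
g = plt.imread('green.png')
b = plt.imread('blue.png')

rgb = np.zeros(r.shape + (3,))   # m x n x 3 color image
rgb[..., 0] = r
rgb[..., 1] = g
rgb[..., 2] = b

plt.imshow(rgb)                  # matplotlib reads the three layers as R, G, B
plt.show()
```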
Okay, so we're going to change gears a little bit, to machine learning. scikit-learn is kind of your off-the-shelf machine learning package in Python. Make sure you have the most current version installed: version 0.9 doesn't have nearly the amount of functionality of 0.10, and my Enthought distribution for some reason had 0.9, so I would make sure you have 0.10. (You're having problems with the version? You have 0.9? Yeah, that's what I mean.) To see the version, you can do import sklearn and then sklearn.__version__. Okay, I'll start lecturing; anyone who wants me to stop, just raise your hand.

scikit-learn provides easy-to-use, really general-purpose machine learning in Python. You might be asking: what is machine learning? How many people here are familiar with machine learning? Okay, good. My short answer is that machine learning is kind of the offspring of statistics and computer science — statistics algorithms for computer science applications, or computer science delving into statistics — and the types of problems are typically a little different from classical statistics. The better answer is that it's a set of models which aim to learn something about data in order to then apply that to future data. The statistical models used in machine learning usually concentrate on prediction problems: you give me some new data, and I want to know what class it is, or whether this user would prefer this thing or that thing, or I want to tell you some feature of that new data that would be interesting to the users.

Within scikit-learn there's functionality for lots of different machine learning tasks. There's classification: using labels from training data to learn a statistical model that tries to predict the label for new data. There's regression — learning the relationship between explanatory and response variables — where you try to predict some continuous-valued variable for new data. There's discovering the natural clustering structure in data: if you give me a bunch of data that has no labels, can I tell you how many natural groups form within it? We saw a little of this last week when we did some hierarchical clustering and some k-means, but there's a lot more clustering functionality within scikit-learn. There's detecting low-dimensional structure in data: I could have data with 50 attributes, and in this 50-dimensional space, is there a lower-dimensional representation that preserves all the information that's important in the data set? And there's finding outliers in large data sets — it's not always easy, say in 50-dimensional space, to see which point doesn't fit with the rest. scikit-learn has really nice functionality for performing all of these tasks.

To make this a little more concrete, here's a classic data set in statistics, the iris data set, first studied by R. A. Fisher. We have three different varieties of iris — sorry, iris, not lily; just by eye he could tell it was an iris and not a lily — and four attributes measured for each flower, having to do with the sizes of the petals and the sepals. The three classes are represented by three different colors here. A typical thing one might want to do is learn some model that separates out these three classes in this four-dimensional space, so that when we get measurements for a new flower, we can classify it into its subtype. The neat thing about this data set is that it's not linearly separable: only one of the classes, the red one here, is linearly separable, meaning we could draw a hyperplane in this four-dimensional space to separate it out. The green and the blue subtypes are not separable by any hyperplane, so we really need flexible — say nonparametric or non-linear — models to find the boundary between these subtypes. That's why this is a real classic data set; and with things like finding the correlation structure between the covariates, there are lots of really interesting problems you can tackle with this type of data.

The first set of models I'll talk about is what we call supervised learning, meaning we have a training set of objects with some covariates and some known response. Here the covariates are X and the response is y, and we want to build a model that can ingest a new X and predict the value of y. In regression, the outcome variable is continuous: we want to predict someone's age based on a bunch of different metrics about them, or predict the price of a house based on a bunch of attributes of that house. scikit-learn has a lot of regression tools, starting with linear regression — we saw that last week in SciPy, but there's also linear regression and generalized linear models in scikit-learn. Lasso and ridge regression are forms of linear regression with regularization: you might have a high-dimensional parameter space where only a handful of parameters are actually important in predicting y, so it's a way of culling that huge space down into something more tractable. And then there's a suite of nonparametric tools: Gaussian process regression; nearest neighbors, where you take kind of local averages of data points within small regions; support vector regression; regression trees. There's a lot more than this, and I highly recommend you look at the scikit-learn website: they have really nice documentation and references for each of these methods, and dozens and dozens of examples. It's a very nicely documented piece of code, so take a look at that.
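For instance, here's a minimal lasso sketch illustrating that regularization idea; the data set and alpha value are chosen just for illustration:

```python
from sklearn import datasets, linear_model

# The L1 penalty drives most coefficients to exactly zero -- the "culling"
# of unimportant features just described.
diabetes = datasets.load_diabetes()
lasso = linear_model.Lasso(alpha=0.5)      # alpha sets the regularization strength
lasso.fit(diabetes.data, diabetes.target)
print(lasso.coef_)                          # many entries come out exactly 0
```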
Of course, there's also classification, where we want to predict something's class based on its covariates. There are a lot of different methods within scikit-learn for this, beginning with logistic regression; k-nearest-neighbors classification; both linear and quadratic discriminant analysis; naive Bayes; support vector machines; classification trees; random forests. Some of these methods are only applicable to two-class problems — yes/no, zero/one problems — but they're extendable within scikit-learn using the multiclass submodule, where you can do one-versus-rest: if you have a K-class problem where K is more than two, you can build K different one-versus-the-rest classifiers and then vote those classifiers to solve the multiclass problem. Or one-versus-one: you build K-choose-2 classifiers, vote those, and end up with a multiclass labeling scheme. Some classifiers naturally handle multiple classes — trees are famously easy to extend to multiple classes — but any classifier can be extended using the multiclass module, which is nice.

You can also do feature selection, using the feature selection module. If you have really high-dimensional feature spaces, you can try to figure out the subset of features that are most predictive for what you're trying to do. For instance, in image classification you could build up sets of millions of features, but perhaps only ten of them are actually predictive of the class of the image, so it's a nice set of tools for finding structure in a high-dimensional feature space.

Any questions so far? Moving too fast, too slow, good? Okay, so let's just go through an example of this model fitting in scikit-learn. A nice thing about scikit-learn is that there are a lot of built-in data sets too: if you just import datasets, there are a couple dozen classical data sets from the literature, including the iris data. Here I've loaded the Boston house prices data, which has 13 features. There's a bunch of neighborhoods, we have information about each neighborhood, and the response variable — the thing we want to predict — is the median house price in that neighborhood. This is a regression problem: the median house price is a continuous variable, and we want to build a predictive model that takes as input a new value of x and produces a prediction for the median house price.

So we can do a linear regression. We just import linear_model; this line instantiates the linear regression object; let's split the data in two and fit the linear regression on half the data to predict for the other half. This is just the first half, the data points from 1 to n/2. Fitting the model is simple — it's just the object's fit(X, y) method — and prediction is also easy, just the object's predict method on the x values we want to predict for. So here we fit on the first half and predicted for the second half of the matrix. We can plot the residuals as a function of the true median house prices, and this is what it looks like. The mean squared error is something like 300.
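A sketch of that train/test split; note that load_boston shipped with that era's scikit-learn but was removed in recent releases (1.2+), so treat this as a sketch of the workflow rather than of the exact data set:

```python
from sklearn import datasets, linear_model

boston = datasets.load_boston()   # removed in recent scikit-learn releases
X, y = boston.data, boston.target
half = len(y) // 2

model = linear_model.LinearRegression()
model.fit(X[:half], y[:half])                   # train on the first half
predictions = model.predict(X[half:])           # predict for the second half
print(((predictions - y[half:]) ** 2).mean())   # mean squared error
```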
Linear regression doesn't do so well here — not too surprising; this is a really difficult data set. So let's try something else: we can do k-nearest-neighbor regression, just by importing neighbors. There's also a set of preprocessing tools in scikit-learn which is highly useful. Here I've simply scaled the X matrix: for each of the 13 covariates in this big X matrix, I have subtracted the mean and divided by the standard deviation. That's often important for methods that use distances between points. For instance, k-nearest-neighbors will compute the distance between each object and every other object, and if things aren't scaled, certain predictors may have more or less influence on that distance. (It doesn't really have to do with Gaussianity — if you just write down what Euclidean distance is, then when things aren't scaled, each coordinate gets a weight that is more or less proportional to the standard deviation of that measurement. And I'm pretty sure there are different commands for different types of scaling. The reason I bring it up is that I tried k-NN regression without scaling first and got really weird answers, so I had to scale; but I didn't play with different scalings.)

Okay, so then I fit a k-nearest-neighbor regression using five nearest neighbors, chosen arbitrarily. Again, we can fit and predict really easily. The residuals still don't look so good — you can see we're over-estimating for small house values and under-estimating for large values — but the mean squared error is much better than linear regression, by a factor of 10 or so.
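A sketch of that scaling plus k-NN fit, continuing with X, y, and half from the sketch above:

```python
from sklearn import neighbors, preprocessing

# Zero-mean, unit-variance scaling of each covariate; distance-based methods
# like k-NN are sensitive to the scale of each feature.
X_scaled = preprocessing.scale(X)

knn = neighbors.KNeighborsRegressor(n_neighbors=5)
knn.fit(X_scaled[:half], y[:half])
predictions = knn.predict(X_scaled[half:])
print(((predictions - y[half:]) ** 2).mean())
```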
Now, in doing machine learning — talking more in a supervised context — we're trying to model how some set of variables relates to another set of variables, and the number one question we want to ask ourselves is: how is our model going to perform on future data? We want a predictive model, so that if we feed new data into it, we get accurate predictions of the variable of interest. In the previous example I simply split the data, fit on the first half (what I call the training data), and evaluated the performance of the model on the second half (the testing data). This train/test strategy avoids overfitting to the training data, because we evaluate on a set of data that was not explicitly trained on. But it's not necessarily the best thing to do, because it kind of wastes data: we have this extra data that we could have fit to, but we didn't actually use it in the fitting.

A better procedure is cross-validation. For those who aren't familiar with it, the easiest version is leave-one-out cross-validation: take out one data point, fit the model to the rest, and then predict what the value of that left-out data point should have been. Iterate that over all of the data points, and the average error is simply those leave-one-out prediction errors compared with the true values, which we have for all the data. The hope is then that the cross-validation error is more or less the expected error for a new data point. The reason this is important is that we're typically trying to choose a model that will work well on new data, so we want a handle on that future performance even though we haven't seen the new data yet; cross-validation is a way of estimating it. Leave-one-out is nice, but computationally expensive, because you have to fit whichever model you're using n different times, where n is your total training set size. So people typically use k-fold cross-validation, where you chunk your training data up into k sets, hold one out, fit on the rest, predict for the held-out set, and iterate over all the folds. Luckily for us, scikit-learn has cross-validation built in, if that was a little too complicated. These are important concepts, but it's hard to cover all of this in half an hour, and again, the documentation on scikit-learn is really nice, so take a look at the manual and all the references within.

Right, so like I said, say we're faced with a classification task: we want to pick the model that we expect to do best on future data. Which of these models should I use? k-nearest-neighbors — and if so, with how many neighbors? A support vector machine — there are lots of tuning parameters within support vector machines. Random forests — how many trees, plus other tuning parameters. Gaussian processes — the same thing. There are all these decisions we have to make in trying to choose the best model. Within scikit-learn there's this grid search module that allows us to more or less automatically pick the best model for our data. Here's what it does. An estimator is a class of model, so say we've instantiated a random forest estimator. We feed in that object; some parameter grid, which is just a dictionary where the keys are the names of the tuning parameters of the random forest, plus whatever values we want to search over; and a loss function we want to minimize, or a score function we want to maximize — for instance, we might want to minimize the number of errors we make in a classification problem. We can run that on n_jobs cores, with cv being the number of cross-validation folds. So we simply run this one line of code, and it picks for us the best model for the data.

Inherent in this is some loss or score function, and there are lots of built-in evaluation metrics. For instance, the zero-one loss is just the classification loss: in a classification problem you get a penalty of one if you get the classification wrong and zero if you get it right, and you want to minimize that. Mean squared error is what we'd minimize in a regression case. There are lots of other metrics you might want to maximize instead: the zero-one score is just the percentage of the time that you're correct in a classification problem, and there are other things related to precision, recall, and ROC curves. There are lots of nice built-in ones, and of course we can write our own. For evaluation plots you can look at confusion matrices and ROC curves; I'll show some examples of these in a second. It's really nice that they've thought hard about this: there are lots of ways to evaluate the performance of your machine learning algorithm built right into scikit-learn.
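A sketch tying the grid search and score function together; the parameter grid and data set are chosen just for illustration, and in 0.10-era releases GridSearchCV lived in sklearn.grid_search rather than sklearn.model_selection:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Try every parameter combination with 5-fold cross-validation on two cores,
# scoring each fit by classification accuracy (the zero-one score).
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      scoring='accuracy', cv=5, n_jobs=2)
search.fit(iris.data, iris.target)
print(search.best_params_, search.best_score_)
```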
I'll show some examples of this in a second, but it's really nice that they've thought hard about this: there are lots of ways to evaluate the performance of your machine learning algorithm built right into scikit-learn.

Okay, let's go to the notebook. Built into learn is the famous handwritten digits classification data set. The idea is that there's a database of images of handwritten digits from zero to nine, and based on those images we want to classify whether each is a one, two, three, four, and so on. Here's an example: these are just the first ten images in the data set; there's a zero, one, two, three, four, etc. In the last hour we saw a lot of ways to extract features and do processing on images, but what I'm going to do is just use the raw pixel values to classify the images. These are all eight-by-eight images, so there are 64 features per image, and I basically want a model that can ingest a new vector of size 64 and produce a best-guess digit.

So I've taken the first 500 data points as the training set and the rest as the testing set, and I'm going to do cross-validation within those 500 objects; I'm treating the testing set as kind of the future data, the data I don't have yet but will see in the future. It's just a way of evaluating my performance. I'm going to create a classifier; I'll just use a support vector machine. A support vector machine basically looks for a separating hyperplane between your data, so SVMs work on two-class problems, but we're going to do a one-versus-one version of that; by default it does one-versus-one. It'll first take all the zeros and all the ones and find the hyperplane that separates those in feature space, then move on to zero versus two, etc., through all the combinations, and at the end it votes the individual classifiers to come up with a prediction. So when I get a new digit, it basically runs it through all those pairwise classifiers and votes their predictions to get a final prediction. But we don't need to know all that; all we need to do is run these couple of lines of code.

What I've done is instantiate the classifier; this gamma here is a tuning parameter. I'm using a radial basis function kernel, so it's actually non-linear: the kernelized form of the support vector machine. I fit it on the training set and then predict for the left-out objects. So here I've trained a classifier for this choice of tuning parameter and predicted for the first ten objects in the testing set, the left-out objects. Here's the true class and here's the predicted class; I made one mistake, but it looks like the rest are correct. We can also compute all those different metrics: the zero-one loss, the zero-one score, and the confusion matrix. These take as input the true values (this yte is the true digits) and the predicted digits. The zero-one loss is 253, and the testing set is 1297 objects, so this is just on the testing set.
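A hedged sketch of that digits experiment, assuming a recent scikit-learn; the gamma value is deliberately arbitrary, as in the talk:

    from sklearn import datasets, svm
    from sklearn.metrics import accuracy_score, confusion_matrix, zero_one_loss

    digits = datasets.load_digits()          # 8x8 images, flattened to 64 features
    X, y = digits.data, digits.target

    # first 500 points as training set, the remaining 1297 as testing set
    Xtr, ytr = X[:500], y[:500]
    Xte, yte = X[500:], y[500:]

    clf = svm.SVC(kernel="rbf", gamma=0.1)   # gamma chosen arbitrarily here
    clf.fit(Xtr, ytr)
    pred = clf.predict(Xte)

    print(zero_one_loss(yte, pred, normalize=False))  # number of mistakes
    print(accuracy_score(yte, pred))                  # the zero-one score
    print(confusion_matrix(yte, pred))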
So I got 253 wrong out of the roughly 1300, meaning I got about 80% of them right: that's the zero-one score. As for the confusion matrix, the (i, j)-th entry is the number of objects truly in group i but predicted to be in group j. Here's the (0, 0) entry: there are 108 zeros that I correctly called zeros. The problem column looks like the eights (it's indexed from zero here). The eights are a problem source: there are a lot of things that aren't actually eights that the classifier is calling eights, so it's making a lot of errors by calling everything an eight; presumably eights look like a lot of other digits. So let's plot a few that the classifier gets wrong; I want to show how ugly they look, given that I got them wrong. Actually, this is a pretty easy problem; I purposely picked a bad tuning parameter so that we could improve our performance from there. (I actually first picked a really good one out of the air; it turned out to be near the optimum.) Anyway: this was a one, and we called it an eight; this is a two, we called it a three; this one we called a nine. We don't do perfectly. I'll skip precision and recall.

The default is one-versus-one; we can instead do one-versus-rest, so zero versus everything else, one versus everything else, and then vote those classifiers. I think that does a little better here; it does about 10% better. Typically, though, one-versus-one does better: the decision boundaries tend to be simpler when you just consider pairs of classes, but it's not always the case.

SVMs do not give class probabilities, meaning they don't give you the probability of being in a class given the data, but there is something called Platt scaling which gives you an approximation of the probability, essentially by looking at how far the data point falls from the decision boundary. If we pass probability equals true to the support vector machine, we can actually get class probabilities for each of these guys, and here I've shown, for the first six objects, the probability that the digit equals two. For the first one, which was actually a two, we have a 99% probability; for the second one it's actually quite low, and that's one of the ones we got wrong. This just shows how we can make an ROC curve: an ROC curve plots the false positive rate versus the true positive rate (for those who don't know what those are, I suggest looking them up). A perfect classifier would be up here, and you see there's a trade-off: as we change the probability threshold, a higher true positive rate comes at the cost of a higher false positive rate. (A sketch of the probability and ROC steps appears below.)

Okay, I just wanted to show tuning the classifier. Like I said, I arbitrarily chose this value of the tuning parameter with a radial basis function kernel, but what we really want to do is optimize over a grid of parameters, which is done quite easily with grid search. Here I've used two cores with five-fold cross-validation, and my metric is the zero-one score, so I'm maximizing the zero-one score with respect to all these parameters. My best cross-validated zero-one score is 0.986, so on the training set the cross-validation error rate is 1.4 percent, and this is the optimal model.
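Continuing the digits sketch from the earlier block, a rough sketch of the Platt-scaled probabilities and an ROC curve for the binary question "is this digit a two?"; the gamma value is again illustrative:

    import matplotlib.pyplot as plt
    from sklearn import svm
    from sklearn.metrics import roc_curve

    # probability=True turns on Platt scaling inside the SVM
    clf = svm.SVC(kernel="rbf", gamma=0.001, probability=True)
    clf.fit(Xtr, ytr)
    proba = clf.predict_proba(Xte)          # one column per class

    # probability that each test digit is a "2"
    p2 = proba[:, list(clf.classes_).index(2)]
    print(p2[:6])                           # first six objects, as in the talk

    # ROC curve: false positive rate versus true positive rate
    fpr, tpr, thresholds = roc_curve(yte == 2, p2)
    plt.plot(fpr, tpr)
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    plt.show()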
Gamma was 0.01, with the radial basis function kernel. I could do the same with random forests, but that actually takes a while; it's very similar. What does gamma do? Let's see what the docstring says: "kernel coefficient". I suggest you look it up. The kernel SVM is basically doing inner products through the kernel, which takes a specific form, and you can think of it as kind of a distance operator. In a regular SVM you just look for a separating hyperplane; the kernelized version essentially projects your data into a higher-dimensional space and looks for linear boundaries in that space, and the space it projects into is governed by this tuning parameter, so it's kind of like how complicated that higher-dimensional space is. As gamma changes, the boundary gets bumpier or smoother (in scikit-learn's convention the RBF kernel is exp(-gamma times the squared distance), so large gamma gives a very wiggly boundary and small gamma over-smooths); at one extreme it just goes crazy, at the other it over-smooths a little too much. There's a bias-variance trade-off going on. Anyone else can chime in if they know support vector machines better.

Okay, just a couple of other things. There's lots of other functionality within scikit-learn. There are lots of unsupervised learning methods: unsupervised learning is when you have a bunch of data but no response, no response variable that you're interested in, like a class; there might be an underlying one, but you don't have that information. We saw clustering last week with k-means and hierarchical clustering, where we try to detect natural clusters in our data. Learn also has mixture models, something called affinity propagation, and spectral clustering, a related idea where you have high-dimensional data with low-dimensional structure and you try to simultaneously learn the lower-dimensional structure and the clustering structure within it. Evaluating clustering is often difficult; you can look at the result and say, oh, that looks reasonable, but there are quite a few evaluation metrics in the literature, and they've adopted a lot of those, so you can compare the output of k-means using five versus six clusters and see rigorously whether one is better than the other. There's a lot of manifold learning, where we have a high-dimensional data set and we're trying to find lower-dimensional non-linear structure; there are quite a few methods out there, and maybe five or six are implemented in learn. Matrix factorization: PCA, which is a linear dimensionality reduction technique; kernel PCA, a non-linear version; sparse PCA; independent component analysis; non-negative matrix factorization; dictionary learning. These are all useful for finding correlation structure in high dimensions. You can also do outlier detection. There's something called the elliptic envelope, which assumes you have Gaussian data and looks for deviations from that, and the one-class support vector machine is a pretty popular technique where you assume that everything comes from one class.
So you try to draw a separating hyperplane around the things that you know are not outliers, and anything that falls away from that is presumably an outlier. And then there's covariance estimation, usually in high dimensions. Learn is missing a few things; there are things I think are pretty important that aren't in it. It can do a lot, but it can't yet do everything. Still, it seems like, at least from version 0.9 to version 0.10, they've added a lot, so I would be very surprised if they weren't considering adding a lot of the things that are not currently there; for the number of things learn does implement, it doesn't seem like too much of a stretch. There's a lot of good stuff here.

Okay, so I guess we have time to either do a breakout or I can show you some more manifold learning. The notebook is posted online already, so let's do the breakout, which is the iris data. I want you to choose your favorite classification model. Like I said, it's in the datasets module, one of the pre-loaded data sets, so you can just fire it up, split it into a training and testing set using these commands, and then choose your favorite classification model. Find the parameters that maximize a three-fold cross-validation zero-one score over the training set, and then apply this to predict the class of each object in the held-out set. I want to see your best: what's your maximal three-fold cross-validation score, and what's your zero-one score on the testing set using that model? The best I was able to do was one mistake on the testing set. (The package name? It's scikits dot learn.) A sketch of one way through the breakout follows below.
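A hedged sketch of the iris breakout, assuming a recent scikit-learn and picking an SVM as the "favorite" model (any classifier would do; the grid values are illustrative):

    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV, train_test_split

    iris = datasets.load_iris()
    Xtr, Xte, ytr, yte = train_test_split(iris.data, iris.target,
                                          test_size=0.5, random_state=0)

    # maximize the 3-fold cross-validated zero-one score over a parameter grid
    grid = GridSearchCV(svm.SVC(),
                        {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                        scoring="accuracy", cv=3)
    grid.fit(Xtr, ytr)
    print("best 3-fold CV score:", grid.best_score_)
    print("zero-one score on held-out set:", grid.score(Xte, yte))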
Okay, so we're going to finish up with a very brief overview of the statsmodels package, which is also within scikits. Josh sent around an email earlier with installation instructions, so I presume most of you have installed the package by now; if not, you can pull it from GitHub in your own time, or install it using easy_install if you have that. The import command is shown there, and it has one difference from other packages: statsmodels explicitly has an API that you call, a consolidated interface for all the different modules in statsmodels. So instead of importing scikits.statsmodels directly, you should import scikits.statsmodels.api as sm.

These slides are a brief introduction to what statsmodels is and how it can be used, and then I have a notebook as well, which many of you will have downloaded, that contains examples. Statsmodels is aimed at statistical modeling and computation with NumPy and SciPy, as many of the other scikits packages are, but it has a focus on econometric modeling, and on linear models particularly. As a result it has, to my eye, quite a frequentist flavor rather than a Bayesian flavor. There is another package, which I won't talk about today, called PyMC, that I think we might discuss a little later on, which does much more Bayesian likelihood analysis; but for many kinds of tasks in which one wants to use linear models, statsmodels is a very good package. As with SciPy and most scikits packages, statsmodels is built on NumPy arrays, so virtually everything you use will be a NumPy array, or data types from within pandas. Pandas has only been mentioned once or twice so far; it's another package you can download that provides very versatile, full-featured data structures. It's still being developed and is becoming much more common. Within the time series analysis part of statsmodels, the pandas time series data structures are available for use at the moment. Statsmodels is available through all the usual sources; importantly, and not trivially, it's already compatible with Python 3, and it's almost entirely pure Python, with a couple of Cython wrappers.

Okay, resources: the scikits home page; the documentation (the second point), which is pretty good and very useful for learning how to use statsmodels; the repositories; and then two presentations from the SciPy conferences in 2010 and 2011, which are very useful if you want to see what you can do with statsmodels. Skipper Seabold, the author of the 2010 presentation, is one of the principal authors of statsmodels, and Wes McKinney, the 2011 presenter, is the author of pandas, the data structure package I mentioned. Both are active developers of this software and of many important data analysis tools in Python, so those slides are well worth reading.

Okay, as with scikits.learn, statsmodels has some built-in data sets. If you've imported the statsmodels API as sm, then sm.datasets followed by tab completion will show you the list of data sets available. Here is one: the data set from the 1997 vote in Scotland to devolve tax powers from Westminster to the Scottish Parliament; you can explore it as you see fit (a loading sketch appears after this section). I read that the data files it loads contain the names of the Scottish councils, but when you load it this way you don't get the names, unfortunately; so if you want to see some exotic Scottish names, you should investigate that data set further. More seriously, in these data sets, and I think this is true of statsmodels generally, the Y and X, the dependent and independent variables, are cast as endogenous and exogenous respectively. These terms reflect the background of the developers of statsmodels, which is in econometrics and macroeconomics. They're not words I commonly encounter in my research; I hear dependent and independent variables, and that's more or less what they mean: you have the thing the model is for, and then you have the things that bear on the variations you observe in your data.

As I mentioned, within the time series analysis part of statsmodels, pandas's time series structures are available for use, and ultimately statsmodels is targeted, in the words of its creators, at "statistical, financial econometric, and econometric models". I copied and pasted that from someone else's slide, so that's not a typo on my part. That's the flavor statsmodels has, but nevertheless the tools it provides are very full-featured and applicable in a variety of contexts.
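A small sketch of loading that built-in data set, assuming a modern statsmodels (where the import is statsmodels.api rather than scikits.statsmodels.api as in the talk):

    import statsmodels.api as sm

    # the 1997 Scottish devolution vote data set ships with statsmodels
    data = sm.datasets.scotland.load()
    print(sm.datasets.scotland.NOTE)  # should print the data set's notes
    print(data.endog[:5])             # dependent (endogenous) variable
    print(data.exog[:5])              # independent (exogenous) variables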
So let's get on to actually talking about those tools. The most fundamental thing statsmodels does is provide a number of regression routines that let you fit to data under a number of circumstances. The basic regression idea statsmodels implements is least squares, and it provides ordinary least squares, weighted least squares, and generalized least squares. Ordinary least squares is where the errors on your data are independent and their variance structure is not changing across the data set. Weighted least squares allows the diagonal of the covariance matrix to vary, while keeping it diagonal. And generalized least squares allows you to provide a full covariance matrix between your data points.

I'll switch to the notebook now to actually give an example of what I mean. Oh, wrong way; here we go. Is this loaded? There we go. Okay, so we import NumPy, matplotlib, and statsmodels. This is a very simple example: I'm going to generate data that obeys this function, a quadratic with noise added on, and then we're going to regress the values of the coefficients using ordinary least squares and then weighted least squares. We generate a hundred data points for the x-axis, and then this builds our model. It's worth appreciating exactly what the input to statsmodels should look like, so let's unpack this line. Let's print x: x is just the numbers 0 to 10, one hundred of them, regularly spaced. column_stack turns this vector x into a larger matrix: the first column is x squared and the second is x, so that's a NumPy command, column_stack, that you may have encountered before. The second piece is statsmodels's add_constant, which creates an extra column of constant values here on the end; note it's not prepended, it's appended. So why are we doing this? Here's the structure of X: x squared in one column, x in another, and a constant in a third. You might wonder why exactly one does this. It's because we want to build a data set based on linear operations on this matrix. We take a set of coefficients (here the function we'll regress from our input data is 0.1 x squared + x + 10), and we generate the y data by matrix multiplication of this matrix with that coefficient vector. That creates a vector the same size as little x, length one hundred, and then we add random noise.
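A minimal sketch of that setup; prepend=False is passed explicitly because newer statsmodels versions prepend the constant column by default, whereas the talk's version appended it:

    import numpy as np
    import statsmodels.api as sm

    x = np.linspace(0, 10, 100)                 # 100 regularly spaced points
    X = np.column_stack((x**2, x))              # columns: x squared and x
    X = sm.add_constant(X, prepend=False)       # append a constant column

    beta = np.array([0.1, 1.0, 10.0])           # the function 0.1*x^2 + x + 10
    y = X @ beta + np.random.normal(size=100)   # matrix multiply, plus noise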
Let's have a look. Here's the x-axis, and here's the y, which is the function we specified with those coefficients, with some noise applied. (No, this isn't for efficiency; it's because, when it comes to regressing the coefficients, this matrix, big X, is the thing that's input into the regression function. This isn't a case of specifying "here is a functional form I would like you to fit", which in my mind I associate with a more Bayesian approach; PyMC provides functionality for that. Statsmodels provides linear models, which doesn't mean it has to be a straight line: linear models means operations that revolve around matrices.)

Okay, now that we've set it up, we do the regression in the following manner: sm.OLS is the ordinary least squares regression. Note that I input the y values and also the matrix X; X has three columns, and the task is: for each of those columns, find the coefficient (three coefficients in all) such that you fit the data well. This stores the fit. OLS itself is just the model class; it has many methods, which you can tab-complete to see (oh my goodness, I don't know what that's about; I won't fuss with it). So that performs the fit and saves the output in an object called results. Let's explore this results object. results has a summary method that prints a very nicely formatted table of statistics and information about the regression you've just performed. It provides really a lot of things: here are the actual values it regressed, here are estimates of the diagonal of the covariance matrix, and you also have a confidence interval. It gives an F statistic for the model itself, a log likelihood, and information criteria; there are many goodness-of-fit quantities, and I'm not even sure what some of them are. But if you work in a field where this sort of analysis is the common thing you'd do in the course of your work (econometrics is an example), it's extremely nice to have the regression put out in that format.

Anyway, let's actually see if it's any good. Yes, it is: you can see the input data, and you can see the fitted values, which are stored in results.fittedvalues. results has quite a lot in it: here are the best-fit parameters, and here's the actual covariance matrix for those parameters, three by three. This gives a full list of everything results contains, so you can recall results dot whatever; aic is not a misspelling, it's the Akaike information criterion, and bic is the Bayes information criterion. Everything you saw summarized in that table, and more, is available here: the parameters, the fitted values, the residuals. Often you might fit ordinary least squares as a starting point and then examine the residuals to see whether you need a more complicated model. Statsmodels also provides detrending against a polynomial whose order you specify; here we're detrending a linear relationship from the quadratic data. It's easiest if I show a plot: here's the data with the noise, but with the overall linear trend removed, leaving the quadratic term.
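Continuing the sketch from above, the fit itself and a few of the results attributes the talk points at; the detrend import path is for a modern statsmodels:

    from statsmodels.tsa.tsatools import detrend

    # ordinary least squares: endog (y) first, then the design matrix X
    results = sm.OLS(y, X).fit()

    print(results.summary())          # the big formatted table
    print(results.params)             # best-fit coefficients
    print(results.bse)                # standard errors of the coefficients
    print(results.aic, results.bic)   # information criteria
    resid = results.resid             # residuals, useful for model checking

    # remove a polynomial trend of the given order (here linear)
    detrended = detrend(y, order=1)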
That's right: in this case, obviously, I generated the data using that model, but in real life data is not found that way, so yes, you have to specify a model. The F statistic that was output (it may be in the docs; it's just called the fitted value of F) is a quantity that carries some information relevant to model selection: if you have two different models, you can compare their F statistics. In a Bayesian context you'd compute the Bayesian evidence or the Bayes factor to compare two models; these are both measures along those lines. But I'm not going to do any model selection here.

Okay, so this is a small extension of ordinary least squares to weighted least squares: I'm going to apply weights. I'm going to change how the noise varies; same input data, same function and coefficients, but I've modified the noise so that it starts out noisy at the beginning, and then as time goes on (maybe postdocs are taking over the measurement) the measurement gets better and better. You can do the same thing again with OLS, which in this case is not the optimal fitting routine, because the noise properties are varying and OLS assumes they aren't. However, if you have a model for the weights, which in this case I've specified in advance, it's very easy: you can perform weighted least squares using WLS, which has the same call as before except that you additionally specify the weights (a sketch appears at the end of this segment). So we perform the same fit; those are the best-fit parameters, and, more to the point, let's actually look. You can see there's a difference between the two, and the argument is that the red line, the weighted least squares result, is a better representation of the data. I don't know, you be the judge; I think it's probably right. It's because these high points right at the end, which have very small variances, dominate the fit: the weighted least squares spares no expense to make sure it goes through the final few data points, whereas the ordinary least squares weights all points equally.

Okay, I was going to switch back to the slides now; does anyone have any questions at this point about the regression? Any comments? That was a good comment from before, so more like that is certainly welcome. No? Okay. So let's talk a little bit about time series analysis and regression. Time series analysis in statsmodels is very full-featured. The sm.tsa module provides a number of fundamental, rudimentary time series methods, such as autocorrelation, autocovariance, cross-correlation, and cross-covariance (those four functions), and it generates a periodogram for regularly spaced data. Many of the basic routines, like autocorrelation and autocovariance, are already available through NumPy and SciPy, so the power of statsmodels is not in providing those; rather, it lies in the estimation methods you can use on time series data, and the example I'm going to talk about now is one of those. What's available are univariate and vector autoregressive processes, and also autoregressive moving average processes; we'll switch to the notebook now to discuss those. But first, as an example of the rudimentary time series analysis: I found some EEG data on the internet and downloaded it; you can do the same thing.
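Backing up to the weighted least squares step, a short sketch continuing the earlier example; the shrinking-noise model here is illustrative:

    # noise standard deviation shrinks along the series (illustrative model)
    sigma = np.linspace(2.0, 0.2, 100)
    y_h = X @ beta + np.random.normal(scale=sigma)

    # OLS ignores the changing noise; WLS takes weights ~ 1/variance
    ols_fit = sm.OLS(y_h, X).fit()
    wls_fit = sm.WLS(y_h, X, weights=1.0 / sigma**2).fit()
    print(ols_fit.params)
    print(wls_fit.params)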
Here we go; this shows what it is. This is a bunch of EEG data: 32 channels, of which I've selected one, with 30,000 points, so it's very dense. This is neuroimaging data, a very dense time series. You can take its autocorrelation and plot it; there we go, and you can see the autocorrelation has some structure in it. Similarly, one can plot the autocovariance. The routine down here was simply to point out that you have the same functionality in NumPy; it plots the same thing as up here. So that isn't where the action is in statsmodels, and neither is the periodogram. Let's get on to where the action is, which is vector autoregressive processes.

The example I'm going to give is a vector autoregressive process on macroeconomic data. I wanted to do something with macroeconomic data, and it turns out the example on the statsmodels website uses macroeconomic data, so I've taken that and modified it only a little, and it's something you can work through very easily yourself. I am not an expert in vector autoregressive processes, and I'm going to assume you aren't either, so I'll give a brief overview of what one is. The idea is that I have T observations of k variables. k could be bands you're observing; we're astronomers, so it's very common for us to encounter time series where we have the brightness of objects in a number of different filters: red, blue, infrared, and so on. A vector autoregressive process says: because the x-axis is time, what has come before should determine what comes after, and I want to describe fully the relationship between all the different bands rather than treating each one individually. Can everyone see the ambition of what we're trying to do? The way we do it in a linear framework is to say that the vector y at time t, containing all the variables we're modeling, is a sum over previous times: for each previous time, a matrix (which describes the relationships between the different bands at that lag) multiplies the vector of observations at that time, summed over however many previous times you'd like, plus a noise term. If you thought your data had a very long baseline, you could include all the bands over a long time interval, and these matrices are the coefficients that relate all those bands at each lag; the whole idea is that you want to regress these coefficients (the equation is written out below). Maybe that wasn't terribly clear; perhaps if I write on the board it will be clearer what we're trying to do. I might do it over here; do you mind if I rub these out, even though they're not entirely out of place?
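Written out in standard VAR notation (not the speaker's own), the model being described is:

    % vector autoregressive process of order p: each observation vector y_t
    % is a linear mix of the p previous observation vectors plus noise u_t
    y_t = A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + u_t

Here y_t and u_t are k-by-1 vectors, and each A_i is a k-by-k coefficient matrix; those matrices are what the regression estimates.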
So the idea is: I have a vector of data at time t; let's say just three observations at that time. I want to describe how these data can be inferred from observations at previous times. Imagine I take the three values at the previous interval; then I have a three-by-three matrix of coefficients here which tells me how I mix those values to get these values, and you could fit for this matrix by using a matrix solver to evaluate those coefficients. But of course, in practice, there's more than just the immediately previous step: if you have a Markov process, the single-step version is a very good description, but for physical processes in general you have to go further back. So instead of taking just the previous step, I have another matrix for the step before that: call the first matrix A for t minus one, and now this one is A for t minus two, applied to the vector of data at t minus two. Then I sum up the contributions from however many previous lags. That's all the equation is saying: you want your final data to be described in terms of all the previous data you have and some coefficients which you don't know, and the whole point of a vector autoregressive process is to regress those coefficients. Okay, I think we got there.

So let's actually do this. We can load some macroeconomic data; let's have a look at it. The top panel is GDP, the gross domestic product for the entire United States, quarterly from about 1951 to 2005, expressed in billions of 2005 US dollars. As you can see it goes up, and a positive trend is good; well, in the capitalist model of society, up is a good thing. The other two variables here are... I forgot what the other two are; this is why I had this tab open. One is GDP, another is consumption, and the third is private investment. They fluctuate. Anyway, for the purpose of demonstrating autoregressive processes, we have three variables. (Are these plotting routines within statsmodels? Yes, but they're not documented. If you look through the Python files there are a number of nice routines; you can see here I used a call that took in three time series and knew it should plot them in this manner. What happened is the author thought, oh, that's a useful thing, I'll write a script to do that, and saved the script, and there it is. I'm afraid there's no time; so far as I know, anything beyond that would have to be added by hand.)

All right. When we perform the regression, we're not actually going to regress the data itself, but rather the differences: we're going to regress the values of the differences between successive data points. The reason is that vector autoregressive processes work best on data sets that have the property of stationarity, and when you aren't guaranteed stationarity in advance, taking the differences often brings you closer to that state. So here is the change in GDP, here's the change in consumption, and here's the change in investment (a sketch of the differencing and fit follows below).
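A hedged sketch of the differencing and VAR fit, modeled on the statsmodels documentation example (which takes log differences; the column names come from the built-in macrodata set):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.tsa.api import VAR

    # the built-in US macro data: GDP, consumption, private investment
    mdata = sm.datasets.macrodata.load_pandas().data
    df = np.log(mdata[["realgdp", "realcons", "realinv"]]).diff().dropna()

    model = VAR(df)           # vector autoregressive model on the differences
    results = model.fit(2)    # two lags, i.e. two coefficient matrices
    print(results.summary())

    # forecast four quarters (a year) ahead from the last observed lags
    fc = results.forecast(df.values[-2:], steps=4)
    results.plot_forecast(4)  # plotted with plus-and-minus error bands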
Okay, and so these are the things we're going to regress. Now that one has understood the data, statsmodels makes it very straightforward to do the fitting and construct the model, so we do that. Then here, as an example of how it works, I've done forecasting with the autoregressive fit. This is a slightly nicer routine that shows how, based on all the previous data, it has performed the autoregression and then plotted, four quarters in advance (a year in advance), the regressed values for the time series with plus-and-minus two-sigma error bars. That was very quick, but we are running out of time, so I won't talk in more detail about it. I did want to mention, though: if you look these gentlemen up, last year's Nobel Prize in economics was awarded for the use of vector autoregressive processes on macroeconomic data. Obviously there's more to it than what we just did, and Sargent and Sims, who were awarded the prize, did great work. So these sorts of things are very relevant in econometric research; and not just in the 1970s, I joke, it's relevant now.

All right, that was all I wanted to go through in statsmodels, which is good because I'm out of time. My last slide is to remind me to editorialize about the differences between these two packages; I put "versus" in quotes because I don't think they're in opposition. When I approach these two packages, I see statsmodels as being quite frequentist in flavor and PyMC as much more Bayesian. I hope that in a few weeks' time, maybe about a month, we'll talk about PyMC in more detail and can go into those sorts of things then.

Okay, the last thing I have to do is put up the homework and tell you what it is, unless there are questions about anything; or rather, I'll go through the homework first, and then if you have questions about statsmodels or other things, please ask. This week's homework is in four steps. Step one: download this file. If you have problems with that, ask this gentleman; that step is not the challenging one. Step two: let me tell you what is in this file. It contains a bunch of images, and the images are grouped in classes, as you'll see, and you are going to build a random forest classifier that operates on these images. After you have done that, we are going to give you a validation set (sorry, my nomenclature), and you're going to see how your random forest classifier performs. And yes, this is a very important point: don't train on all of the data you download, because then you'll overfit. Select a portion of the data you download, train on that, test on the remainder, and then there's the validation set that we will provide. Okay, so the way this should be done is: first, for the images, design a set of features.
This is using what Stefan taught us earlier today. You have to think of these features yourself; many of them could be quite rudimentary, like the average of the red channel, but you should also have some more interesting ones in mind, and Stefan gave good examples of things you could use. Once you have assigned features to each of the images, build a random forest classifier within scikit-learn and produce metrics of the estimated error rate using cross-validation. You should be able to quantify how much better your random forest is than just guessing, and you should also be able to say something about the features you've chosen: for instance, you might find that some of the features you picked are useless and others are very important, and you can iterate between the feature selection and the random forest building. Once you've done that, make sure your classifier can run on new images in the following format, because then we have the validation images. And here is the most important thing: we have a validation set, and the best classifier that is submitted will get a perfect score, and that person can have one previous homework score bumped up by half the distance between what they got and a perfect score. So that is an incentive; in a system based on meritocracy and competition, that is an incentive. (If your scores have been perfect so far and you get this one perfect too, then maybe...) Oh, here we go. Good luck. Yes, that is, I think, one of the images; I'm not sure. Is this a picture of the US economy or something like that? Okay. (A sketch of the random forest portion follows below.)
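To make the homework recipe concrete, a hedged sketch of the random-forest-with-cross-validation part; the stand-in arrays replace the feature matrix and labels you would build from the downloaded images:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    # stand-ins; in the homework these come from your own image features
    rng = np.random.default_rng(0)
    features = rng.normal(size=(200, 10))
    labels = rng.integers(0, 3, size=200)

    # hold out part of the downloaded data so you don't overfit
    Xtr, Xte, ytr, yte = train_test_split(features, labels,
                                          test_size=0.3, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(clf, Xtr, ytr, cv=5))  # estimated error via CV

    clf.fit(Xtr, ytr)
    print(clf.score(Xte, yte))             # accuracy on your held-out split
    print(clf.feature_importances_)        # which features actually matter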