And we have Chris here, who's a senior software engineer at Silicon Labs. He's also an instructor, so make sure you take notes, because if you want lunch, you've got to pass the quiz at the end of the talk.

That's right. Thank you, Paul. So as Paul said, my name is Chris Morrow. I'm a software developer at Silicon Labs, and I'm here to talk to you today about a Python package I developed called sci-analysis. There's how you can get hold of me if you have questions afterwards. Just a quick show of hands: who here has worked with data in any capacity, even if it's not in Python? Okay, quite a few people. Usually when you're working with data, at some point you need to analyze that data, right? And if that's the case, then I think sci-analysis might be able to help you out.

So what is sci-analysis? It's a Python package I developed for quickly and easily performing data analysis, and it's a high-level wrapper around parts of the PyData stack, namely pandas, SciPy, and matplotlib. Now, you might be thinking: these tools are great, why do we need one more? What really differentiates sci-analysis from those tools is that they are very general purpose, whereas sci-analysis is very focused in terms of the analysis types it performs. It exposes a single function called analyze that you use to do all the analysis. Part of my motivation for creating it was also to tackle three problems that I ran into, and that other people have probably run into, when working with the PyData stack.

The analysis types that sci-analysis currently handles are: distribution analysis of continuous numeric data; bar chart and frequency analysis of categorical data; bivariate, or correlation, analysis between two numeric vectors; and lastly, location testing and distribution comparison between different groups of numeric data.

Okay, so those problems I mentioned. Problem one: when I learned Python, it was initially to do data analysis, on the recommendation of some friends. The first thing I did was go out and buy Wes McKinney's Python for Data Analysis book. I was coming from working in Scala, so Python was actually a breath of fresh air in that sense, but working with pandas and matplotlib was tricky at first, and I found myself referring back to the book quite a bit. So one thing I wanted to accomplish was to provide the analyze function as a single entry point into performing data analysis. It abstracts away a lot of the underlying pandas, SciPy, and matplotlib functions and takes care of them for you, so you don't have to remember them or constantly go back and refer to them.

So let's take a look at what that looks like. I mentioned four different analysis types. For just looking at the distribution of an array of numbers, you pass the array into the analyze function. Pretty easy. If you pass in an array of strings, it will perform a frequency analysis for you. Now, with location testing, it gets a little trickier depending on how your data is shaped. If, for instance, you have individual columns in a pandas data frame, you can pass them in as a list, or as a dictionary, to perform an analysis between those groups. Or, if you're working with stacked data, meaning you have a separate column that identifies the groups, then you just pass in that column of the data frame and specify the groups column. And lastly, if you want to perform a bivariate, or correlation, analysis, you just pass in both arrays.
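To make that concrete, here is roughly what those call shapes look like. This is a minimal sketch: the top-level import and the groups keyword follow the description above, and the data is made up purely to exercise each path.

```python
import numpy as np
import pandas as pd
from sci_analysis import analyze  # assumes the package exposes analyze at the top level

# Distribution analysis of continuous numeric data
analyze(np.random.randn(500))

# Frequency analysis of categorical data
analyze(['red', 'blue', 'blue', 'green', 'red'])

# Location testing: groups passed as a list, or as a dict so each group is labeled
analyze([np.random.randn(100), np.random.randn(100) + 1])
analyze({'control': np.random.randn(100), 'treatment': np.random.randn(100) + 1})

# Location testing with stacked data: pass the value column and the
# column that identifies which group each row belongs to
df = pd.DataFrame({'value': np.random.randn(200),
                   'group': np.repeat(['a', 'b'], 100)})
analyze(df['value'], groups=df['group'])

# Bivariate (correlation) analysis of two numeric vectors
x = np.random.randn(200)
analyze(x, 2 * x + np.random.randn(200))
```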
I know this is all pretty high level right now; I'm going to go into more detail and show you some examples in a bit.

Here's one simple example. I create a NumPy array of normally distributed random values and pass it into the analyze function. What sci-analysis gives me is a histogram, a box plot, and, because I optionally asked for it, a cumulative distribution function as well. Alongside each graph I also get summary statistics, and it tells me whether my data is normally distributed or not. There's a little bit of input up there, but everything else is what you get whenever you use the analyze function.

Okay, let's move on to problem two. Unless you have a background in statistics, it's often difficult to know which hypothesis test you should use in a given situation. Even for statisticians this sometimes gets tricky, and we like to argue over it quite often. sci-analysis tries to take care of that by choosing the most appropriate test given the data you've supplied. As an example, this is the decision tree for location testing. First it checks whether you provided more than two groups of data. If so, it performs a one-way ANOVA if the data is normally distributed; otherwise it performs the non-parametric Kruskal-Wallis test. If you have two groups and both are normally distributed, it performs a t-test; otherwise it falls through to the Mann-Whitney U test; and if you have fewer than 20 samples, you're down to the least-sensitive test, the Kolmogorov-Smirnov two-sample test.

All right, problem three: working with missing data is tricky. In this example I'm using matplotlib to graph two lists. Each list is just the values one through five, except that one of them has the value three missing. When matplotlib encounters that missing value, you just get a missing segment in the line. Now let's say we want to fit a best-fit line to that data. If I pass it to the NumPy polyfit function, it just gives me an error, right? sci-analysis can take care of this pretty seamlessly. I pass the exact same lists to the analyze function, and it knows there's a missing value in one of the lists, so it drops the value at the same index in the corresponding array. You can see where the linear regression output reports n equals 4, because it did drop the corresponding value.
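Here is a minimal sketch of that missing-data call, using the same two lists; the exact error and output are whatever NumPy and the package actually produce, so treat this as illustrative.

```python
import numpy as np
from sci_analysis import analyze

x = [1, 2, 3, 4, 5]
y = [1, 2, np.nan, 4, 5]  # the value 3 is missing

# np.polyfit(x, y, 1) errors out on the NaN, but analyze is described
# as dropping the value at the same index in the corresponding array,
# so the linear regression runs with n = 4.
analyze(x, y)
```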
All right, I'm going to stop talking about motivation and purpose, and show you some actual examples. What I want to do is answer the question: which city has the best weather out of Austin, Denver, Las Vegas, New York, and Seattle? I've put together a notebook that you can follow along with on my GitHub, under pytexas_2019. You can view it there, or clone the repo and work with it in a Jupyter notebook if you prefer. Alternatively, you can open the notebook in Google Colab or Binder. Anybody here use Google Colab? A few hands? Yeah, I think it's good stuff. What about Binder? Okay, one. Definitely, if you haven't heard of Binder, go check it out; it's good stuff.

For this notebook I pulled a NOAA GSOD dataset from Google Cloud. It's stored as a BigQuery table, a public dataset under samples.gsod. There are a couple of GSOD datasets out there, so I wanted to specify which one I'm using. Also, this table is 16 gigs, so I've limited the analysis to just the years 2005 through 2009. And then, "best weather", I mean, come on, that's a super subjective thing, right? So I want to quantify it a little by adding the constraint that the best weather belongs to the city with the maximum number of good weather days minus bad weather days. Still, good weather: how do we define that? I'm going to take a stab at it and say it's the average number of days per year where the temperature is between 60 and 80 degrees Fahrenheit, plus the average number of days per year where the dew point is between 40 and 60 degrees Fahrenheit, divided by two. All right? We're going to codify that into a function in our notebook.

So, first thing: there are no cities in this dataset, which I'll show in a minute. Instead, it lists all the data by the WBAN weather station that collected it. So the first thing I did was write a function that sets the city from the weather station number in the dataset. Next, it's always a good idea when doing exploratory data analysis to run some sanity checks up front. Looking at the data frame, I can see that I have 9,081 rows. Then I list the column names and look at the data types of the columns to see what I'm working with. I can see I have integer columns for year, month, and day. I have a mean_temp column, which is going to be useful, and I have mean_dew_point, which I can use for determining good weather. A little further down I see there's a max_temperature column too, so I have two different temperature columns and might want to see which one to use. And then I have several Boolean columns for fog, rain, snow, hail, thunder, and tornado, which I'm going to use to build up my bad weather criteria.

First things first, I use the analyze function on the city column to make sure I have roughly equal amounts of data per city. What I can see up there is that each city makes up about 20% of the data, which is great. Now let's repeat this process for year and month to make sure those are close to equal as well. Again, the years 2005 through 2009 have roughly the same amount of data, so far so good. Looking at the months, there's some variation, by number of days per month, which we would expect, but again it's looking good so far.

So now, I mentioned there were two different temperature columns, right? Let's see which one we might want to use. One of them is called mean_temp, the other max_temperature, and "max temperature" would make me think those values are higher.
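One way to check, though I'm not claiming this is the exact cell from the notebook, is to hand both columns to analyze as labeled groups; here is a sketch with stand-in data, since the real data frame comes from BigQuery.

```python
import numpy as np
import pandas as pd
from sci_analysis import analyze

# Stand-in for the notebook's data frame; the real columns come from samples.gsod
rng = np.random.default_rng(0)
df = pd.DataFrame({'mean_temp': rng.normal(60.0, 15.0, 500),
                   'max_temperature': rng.normal(70.0, 15.0, 500)})

# Hand both temperature columns to analyze as labeled groups so the
# summary stats and distributions land side by side
analyze({'mean_temp': df['mean_temp'], 'max_temperature': df['max_temperature']})
```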
Well, it's a good thing I didn't make that assumption, because looking at it here, I can actually see that max_temperature has a lower mean, by about 10 degrees. This is something I might have missed had I not dropped the data into sci-analysis and seen it right away. Now, since max_temperature looks like a liar, and I have a mean_dew_point column to pair with it, I'm going to stick with mean_temp for the rest of the analysis.

All right, so next we look at the general distribution of mean_temp, and then we can group mean_temp by city and look at it that way as well. We can see that Austin and Las Vegas have higher average temperatures, which we would expect. And now we can repeat the process for dew point, looking at the overall distribution and then again by city.

Okay, so there is actually a relationship between temperature and dew point, or at least so I'm told, according to Wikipedia. So let's look at that correlation. What we see is that there appears to be a pretty good correlation, but our best-fit line doesn't look great, and we have all these low flyers down there. We can drill into this a little more to see what's going on. I repeat the exact same analysis, but this time I pass in groups equal to our city column, and, because the colors are a little tough to see there, it's Las Vegas that has a lot of those low-flying points. So, since Las Vegas and Denver are both a little lower, I can compare each of them individually to New York, which actually has the highest correlation coefficient here. Okay, so here the red is New York, the blue is Las Vegas, and we can see there's actually a pretty big difference. I think this is attributable to differences in relative humidity: Las Vegas is in the desert, after all, and does have lower relative humidity throughout most of the year. It's a similar thing for Denver as well. Denver is definitely not a desert, but it is at higher altitude, and I think that's what explains the data there.

So now we're getting close to wrapping up. I create a function to define bad weather: if any of the columns fog, rain, snow, hail, thunder, or tornado is true, then I mark that day as bad weather. I also have functions for determining a temperate climate, between 60 and 80 degrees, and a comfortable humidity level, a dew point between 40 and 60 degrees, and I apply those to the data frame. Then I finally summarize all of this by grouping by city and year: since I'm looking at five years' worth of data, I get the total count per city per year, then average that across the five years by city, and then apply our best-weather formula.

So, anybody want to guess which city actually has the best weather? Just go ahead and yell it out. Okay, Vegas. Any other guesses? Seattle. Austin. Okay, San Diego. Honestly, San Diego probably is the true answer, but it's not in here; that was kind of on purpose. Okay, so the answer according to our data is Seattle. I was shocked too, actually. It turns out that if I had used max_temperature, the answer would be Austin. But what's really doing it for Seattle is the dew point. Now, Seattle does have the most bad weather days, but it has so many days where the dew point is between 40 and 60 degrees Fahrenheit that it edges out Austin by a little bit.
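Pulled together, that scoring pipeline looks something like the sketch below. The column names match the samples.gsod table described earlier; the data frame here is a synthetic stand-in, and the notebook's actual helper names may differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's GSOD data frame (the real one is
# pulled from the BigQuery samples.gsod table and mapped to cities by WBAN)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    'city': rng.choice(['Austin', 'Denver', 'Las Vegas', 'New York', 'Seattle'], n),
    'year': rng.choice(np.arange(2005, 2010), n),
    'mean_temp': rng.uniform(20, 100, n),
    'mean_dew_point': rng.uniform(10, 70, n),
    **{col: rng.random(n) < 0.1
       for col in ['fog', 'rain', 'snow', 'hail', 'thunder', 'tornado']},
})

BAD_COLUMNS = ['fog', 'rain', 'snow', 'hail', 'thunder', 'tornado']

def bad_weather(row):
    # A day is bad weather if any of the Boolean event columns is true
    return any(row[col] for col in BAD_COLUMNS)

def temperate(temp_f):
    # Temperate climate: mean temperature between 60 and 80 degrees Fahrenheit
    return 60 <= temp_f <= 80

def comfortable(dew_point_f):
    # Comfortable humidity: mean dew point between 40 and 60 degrees Fahrenheit
    return 40 <= dew_point_f <= 60

df['bad'] = df.apply(bad_weather, axis=1)
df['temperate'] = df['mean_temp'].apply(temperate)
df['comfortable'] = df['mean_dew_point'].apply(comfortable)

# Total counts per city per year, then the average across the five years
per_year = df.groupby(['city', 'year'])[['temperate', 'comfortable', 'bad']].sum()
per_city = per_year.groupby(level='city').mean()

# Good weather days = (temperate days + comfortable days) / 2,
# and the score is good weather days minus bad weather days
score = (per_city['temperate'] + per_city['comfortable']) / 2 - per_city['bad']
print(score.sort_values(ascending=False))
```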
So that's actually all I have. This is just the highlights of the notebook. Again, if you'd like to play around with the data in the notebook, I encourage you to do that to learn more about how to use sci-analysis. And thank you.

So we have a little bit of time for questions. Does anyone have questions? Just raise your hand and I'll run out to you.

Just out of curiosity, is this your first package? Yes, it is, actually. Thank you.

Along those lines, how long have you been working on this? Okay, so the question is how long I've been working on this. A while, actually. I think I started in 2015. Back then it was just a single module of several functions, and it's gone through a few refactors already, so now it's actually object oriented. The code is still not as clean as I'd like it to be, so definitely, if you want to check out my GitHub and look at it, you can see my trash there, to quote the keynote. I've been working on it over the years, here and there, between work and video games and teaching. So yeah, it's still a work in progress; there's more stuff that I want to add.

So I guess my main question is, with that single function, is there a way to choose the specific analysis you want if you already know what you're looking for? That's a great question, and it's something I've been thinking about a lot. Behind the scenes there is essentially an API for performing the different analysis types, and the analyze function is basically just the logic for choosing which one to use. I've been thinking about writing docs for that layer so you can go in and use the one you specifically want. But what I'm toying with now is actually preferences. You could optionally serialize a preference file to disk that it reads from, or alternatively pass in some arguments to set your preferences for the rest of the session. That way, if, say, you don't want to use the Kruskal-Wallis test, you have something against it, you can set a preference for the test you want, and it will always use that one.

Oh great. Any other questions? So, are you still working on this? What's coming next, as far as the future goes? Great question. I'm currently squashing bugs, but also, as I showed, it can do some analysis of categorical data, and that is the next big expansion. One feature I've been working on for a while now is grouping, being able to do grouped categorical analysis, along with some support for sampling, which might enable analyzing polling data, for example. That's something I've really been interested in. The other two big features are going to be creating a heat map or matrix plot of multiple numeric pieces of data, and time series analysis. That's a big one that I'm pretty excited to get started on, but I think it's going to take me a while to get all the way through it.

Any other questions? Okay, let's give a round of applause.