 Okay, everyone. So we'll start now our final two talks before the keynote. So Give a big welcome to Daniel Kohlfraub. He's a software engineer, senior software engineer at Israel for Bloomberg and he figured out how to make money for with pandas, not just do it for fun. So Thank you. Thank you Okay, so I hope you're all having a good time at the pylon dinium Welcome to pandas for fun and profit So let's start out with a question. How many of you show of hands? Let's say in the past year. How many of you have had to make a decision of what to do with your money? Okay, pretty much everyone maybe there's there's always a couple of people who have someone else making their decisions for them Now the the real question though. How many of you in making that decision? Used math and science and data to make your decision a little bit less So this is basically just Beginners introduction to pandas by way of an example that I personally found interesting and I hope you also will find it interesting So what is pandas pandas in short? It's a powerful python data analysis toolkit. It's open sourced BSD license It provides high performance data structures and data analysis tools. It's built on top of NumPy if you're familiar with that so it benefits from some of the structures in NumPy and Under the hood it uses map plot lab for visualization So the story is this I'm new to python. I've only been coding python for a little less than a year I heard about this thing called pandas, which I thought sounded interesting So I took a three-hour live online training with Matt Harrison I said to myself well, this is really cool stuff Maybe I can use this to test some of the ideas I have for trading in the marketplace and at the same time It'll be a great way for me to force myself to learn pandas So a little bit about me. I'm an engineer at Bloomberg. I spent most of my career doing C&C plus-plus on the server side. I Was a member of the Charter Financial Analyst Institute for 10 years and I started my career in molecular biophysics Little bit about Bloomberg, which I hope you've all heard already we provide all of these services Data analytics private communications network some electronic trading to over 325,000 financial market professionals and to the public we provide financial news in the form of television radio And online We have about 20,000 employees in 167 locations including London Over 5500 software engineers and among them at least 2,000 probably more than 2,000 are involved in python in one way or another So I decided to try something new This time I'm going to get to my conclusions right at the beginning So you'll have what to think about throughout the talk So in conclusion pandas makes it easy to explore your data to shape analyze and visualize your data However, as powerful as pandas is it's important for you to have some Domain knowledge of the data that you're looking at you don't necessarily have to be a complete expert, but you need some Domain knowledge and you should plan to spend 50% or more of your time Reshaping your data getting it into the data structures that you need to do the analysis that you want to do and as a result of that Probably 50% or more of this talk will be about shaping the data So let's talk about stock ownership When you buy a stock you actually own part of the company and what's a company a company is nothing more than a group of people Working together to earn money earning money has value and in a sense That's what you're buying when you're buying a stock You're buying the the power to earn money and for that reason what a lot of people look at like to look at with a stock Is they look at the price earnings ratio the price of the stock? Divided by the earnings of the company and there's a concept that if the price earnings ratio is low Then you're getting a bargain right you're paying less for the same amount of earnings So let's talk about pandas if you understand these three primary data structures and pandas you understand the core of pandas They are the index or multi-index series and a data frame So what's in a pandas index? It's just a list of keys. It's essentially a hash table. You can think of it as a Python dictionary a Series is a list of data that has an index so you can access that data by key So here are the parts of a series you have your values You have your index in this case the index are date objects You have a name for your index a name for your series and The series data type those are all the parts of a series object So what can you do with the series? Well, I can take the length of a series and see how much data I have in there I can access the data by key or by position Here's an example accessing the data by key. I put it put in the date and I get the data back for that particular date or Accessing it by position. I put in the position in the series and I get back the data So a data frame Excuse me Data frame is you can think of as the two-dimensional spreadsheet has rows and columns each column is a panda series each row is a panda series So here the parts of the data frame again, you have your data values Your index index name and now you have columns because it's two-dimensional Notice if I take a look at the index attribute on a data frame I get back an index object which contains a list of the keys in that index Notice also if I look at the columns attribute I also get back an index object with a list of the column headings So both the column headings and the index are indexes index objects, but we just call one of them index and one of them columns So what can we do with the data frame? We can take the shape see how many rows and columns there are We can access it by key as we did before but now since it's two-dimensional We have two keys so we can specify the row and the column to get the data back And we'll see later you could specify more than two keys if you have multi-dimensional or greater than two-dimension in your data I Can also access just a column by providing just the car I'm sorry just a row by providing just the row key and notice that what I get back as I said before each row is a series Each column is a series so when I access the key just for a row I get back a series notice that the index on the series is the same as the column headings Whereas the name on the series is the key that I provided for the row So now we can talk about a multi-index a multi-index is just where you have Multiple levels of keys in your index So we have a level zero key that points to a level one key in this case We have two levels on the columns one level on the rows So we have three-dimensional data I could have three on the columns and four on the rows I'd have 12-dimensional data represented in two dimensions There used to be in pandas a three-dimensional data structure called a panel Which is actually where pandas gets its name from it stands for panel data However, the through that the panel data has been deprecated or it's in the process of being deprecated in favor of multi-level keys Because with multi-level keys you can specify any number of dimensions and represent it in two so you can see clearly what's going on So now we understand the three primary data structures of pandas So back to our hypothesis the idea is that low price earnings ratio represents a bargain if so Then buying stocks with low PE should result in higher returns So we want to compare the PE ratio to the returns and we hope to find a negative correlation That is when the PE is low we expect the returns to go up So this is here's the fine print. This is information not advice yada yada yada So how are we going to compare the PE to the returns we need a data set So I chose to use the 30 stocks from the Dow Jones industrial average I would have much preferred to use a much larger data set the S&P 500 or the Russell 5000 However, this was my first time using pandas So I wanted a smaller data set so I could kind of scroll around and see all the data explore it Make sure I knew what I was doing We're going to need price earnings ratios for all 30 stocks and we're going to need to calculate a return for all 30 stocks Which means we need Which means we need the price and the dividends So I got myself a CSV file containing price earnings ratios This is kind of ugly fortunately pandas has a read CSV method that I can call Pass in the file name and it gives me back a data frame It's also a little messy, but it's a lot easier to read than the CSV file notice when I Read in a CSV file plane It simply numbers the column headings and numbers the index over here But now I can kind of look at it and I can see why I have a start date and end date for my entire data set In the upper left-hand corner I can see that row three are the ticker ticker symbols for the stocks Rows four and five are just telling me this is the field I asked for the price earnings ratio and then the data begins below that I have Dates in column zero and some dates some data and some missing data So now I can call read CSV again and I can tell it That the header is in row three and the index column is in column zero I tell it parse dates. He was true which means if you find anything in the header or in the in the Index that looks and smells like a date convert it to a date object And I tell it to skip rows four and five because I don't need those and now I have a better-looking data frame I have my ticker symbols over here. My dates are part of the index. I can see because they're involved However, if I want to access a column by key I don't want to have to type in this whole long thing here and what this is This is the ticker symbol followed by the primary exchange And then the word equity just tells me that it's a stock. So I don't need all that So if I look at the columns attribute, I see that these are just strings in the index and an index is The column an index object is iterable so I can use a list comprehension To simply go through that and split out the first token of each string and now I have a better-looking data frame I just have the ticker symbol that I need to access a column. I have the dates I have some missing points. We'll deal with that in a minute. I also notice I have 31 columns And I thought there were 30 stocks in the Dow Jones industrials I went back and I looked if you can see here on the index the very last object is the index itself It doesn't say equity so fortunately Data frame has a method called drop where I can tell it to drop any columns or any rows that I don't want So I drop that column and now I have third the 30 stocks So just exploring my data some more you can take a look at the tail I see that I have much more complete data towards the end of the data set So I decided you know what I don't need 30 years of data. This is monthly data. So 360 rows 30 years of data I decided let me just take 12 years of data. So I took the last hundred and 44 data points and Reassigned it to the data frame and now my data is starting in 2007 and I it appears to me that there aren't missing points anymore So I looked at the say a shape again exploring your data Take a look at the shape make sure it looks the right way Check the type make sure that read CSV did the type conversion properly If there are any straight characters, it might leave your fields as strings in which case there are methods to go back and clean it up afterwards So let's describe this data frame There's a method called describe which returns another data frame with the same columns as your data frame plus various attributes to count number of rows for each column the mean standard deviation min max and various quartiles And each of these attributes can be called as a method on its own to return return a series With the index of the column headings and the number For example in this case count shows me the number of rows in each column And I noticed that one of the columns has 20 only 25 data points. So there's some missing data there So you can take a series and you can compare it compare it to a scalar and get back a Boolean series and Then use that Boolean series to filter out the specific items that you want from another series So here I'm filtering out that Boolean series of the number of columns that have less than a hundred and forty four points And I see there were actually three columns that have that So I decide you know what I'm going to drop those I don't have so I don't have to worry about it It's fewer stocks, but at least I don't have to worry about missing data I can always go back and see how it might affect the analysis So now I'm down to 27 stocks. I have my data Fully complete Take a look at the mean now Panda series in data frame has they all have plot methods So I can call plot on the mean and take a look at what it see what it is. I have Excuse me. I Have the stocks along the x-axis here and the mean everything looks kind of good There's there's you know one or two that are a little high but nothing's standing out really all that much But if I look at the standard deviation, there's one stock that stands out way above the others so I decided to plot the PE ratio for that one stock and I see back in 2017 it shot through the roof and what can happen a When a company reports their or an earnings typically quarterly Occasionally they have a loss and they get negative earnings. That's not what's happening here If you get negative earnings, we don't calculate a PE ratio But sometimes they lose some money, but they end up having very small positive earnings. Maybe they only earned a few pennies So that's actually what's happening here and it makes the PE ratio go through the roof Somewhat artificially and again, I decided well, I don't want to include that in my analysis So I called drop got rid of that column now. I'm down to 26 stocks, but the standard deviation looks good I don't see anything standing out Among the stocks. I have what seems to me a reasonable data set So how am I going to define low PE? Well, it makes sense that what really matters is in relation to the other stocks So I define low PE as a PE that's significantly lower than the average for the entire market And one way to measure that is the number of standard deviations above or below the mean and in statistics We call this a z score value minus mean over standard deviation So I pulled out the mean in the standard deviation put them in the data frame on their own And I plotted it just to make sure again exploring my data make sure it looks good Standard deviation is less than the mean I like that and it's moving around and nothing's standing out particularly strange And now in a single line of code I can take the data frame and I could subtract subtract the mean and divide by the standard deviation for every point And now I have a new data frame which has the z score of the PE ratio We can see over here in 2007 Apple stock Had a PE ratio more than two and a half standard deviations above the mean for the market Caterpillar here was a little bit more than one standard deviation below the market So just for fun. I took a look at the max and min of the standard of the z scores P ratio z scores and I plotted it again. It looks kind of normal to me. I like exploring my data see what it looks like Now we need to calculate returns So we can relate the returns to the z score where return is final value divided by starting value minus one and If I want to do returns for different periods of times I might want to annualize it so I can use this formula the important thing to note here Is that the final value is the final price plus any dividends that were earned during the period over which I'm calculating return? So I read in my CSV file for the prices look similar to the PE ratio one I had the same issues I cleaned it up and Now I'd really like it to be the same shape as the other data frame So I have the same number of points the same 12 years the same 26 stocks And I could try to remember which stocks did I drop and so forth But what I can actually do to make it simple is I can pull the dates out of the PE data frame and use them to Slice the pricing data frame and also take the columns from the PE data frame use that to select the columns from the Pricing data frame and now the two data frames are the same size and shape What about the dividends? What shape should they be? It's very good to know what shape you want your data to be in before you read it in So the issue with dividends is that they're typically paid quarterly, but it sparse data So it's not necessarily going to line up with the same 144 by 26 data frame So I decided Based on the fact that the way I need to use my dividends is while I'm calculating return I need to look up any dividends that may have been paid during that return period And I'm going to calculate returns for one month three months six months and so forth So I decided to store my dividends as a dictionary where the key is the stock ticker symbol and the value is a series of the dividends So I got a csv file for the dividends and I noticed that each stock had nine columns in the file And I looked through the columns and figured out that the ticker symbol was in the first column Nothing else in that column and I have a date column Which is when the dividend was paid and an amount column So I wrote myself a little loop going through nine columns at a time going through the data frame Pulling out the ticker symbol pulling out the index and then creating a series and placing it into my dictionary based on keyed by ticker symbol So now I have dictionary keys ticker symbol and I have a series of dividends being paid This is just me testing my idea that I can then key into it by stock ticker symbol and date range And then sum up the result and get the dividends that were paid during that period of time So now I'm ready to do the return calculation for the various periods that I mentioned before What I'm going to do is I'm going to go to each date Find the date that corresponds to that period of time three months later six months later Get the price from that period and then grab the dividends and add it in To the return calculation and do that for every date that I have So when I do that I can't there is no method to call to calculate a return on a data frame like there's a mean I can calculate the mean and standard deviation However, there is a method for a data frame called apply where I can write any function that I want And it will apply that function to either every column or every row depending what I ask it to do So in this case, I'm going column by column calculating the returns for each stock And when you do that the important thing to note here Is that the apply method will pass into my function as the first argument either the column or the row That it's supplying it to so I get the prices for a given stock and I can go through and grab the dividends and calculate the return So I call apply I get a data frame of returns That corresponds to my data frame of pe ratios But how do I know if my return is higher or low? Well, if your friend comes to you and says I bought this stock and six months ago and it earned 15 How to If the entire market or 25 percent your friend didn't do so good But if your friend comes to you and says I bought this stock Six months ago and it earned only 1 percent but the whole market tanked and went negative 10 percent then your friend did pretty good So again, we want the z score. We want to compare it to the mean for the market on every date And then calculate the correlations So I did that and here's one example of one stock where I have Down here you can see the Pez square and you can see when it's going below zero the returns are indeed going above Zero so starting to look pretty good for me They say a picture is worth a thousand words, but there's strength in numbers So let's look at the actual numbers There's a method called core on the series where I can pass in another series and calculate the correlation I went through I wrote a method where I could define the length of time I was correlating in this example i'm showing i'm doing the entire 144 months And I took all five return periods that I want to do calculate times the 26 stocks. That's 130 correlations And on average the mean correlation was negative which is good, but it's only slightly negative So I started worrying maybe my low p investment strategy won't work So I started poking around at the data. I looked at some of the positive here I am filtering for positive correlations Which are the ones that are going against my hypothesis here I'm filtering for negative correlations and I noticed that with the negative correlations They tended to correlate with the longer Return periods so so much for getting rich quick, but it still might work if I decide that i'm going to take low p investments in stocks that have a history of negative correlation Between the pe and the return So here's the test plan by all the stocks by the stocks with low pe That was my original hypothesis or by stocks with low pe and a negative correlation between pe and return That's my new hypothesis So I need a period of time that I calculate the return it could be five years I'm sorry that I calculate the correlation five years four years and so forth I need a gap to calculate the return before I actually make a trade So for example two year returns the latest correlation that I can calculate has to be two years before the trade did because Otherwise i'm using a price in the future and if I could predict the future there'd be no point for this whole analysis But once I set that up I can then go through a whole bunch of dates and do that So that's what I did. I took the three trading strategies held each investment for two years from january 2014 to 2014 to january 2017 In the day they're not in real life. I didn't actually spend my money on this yet So this is just six of those 36 months that I tested and I noticed that in some cases the negative correlation low pe Method is much worse than the market in some cases much better But what about all 36 overall if I could if I did have the money that I can invest every single month for two years For 36 months what would happen? Well, if I look at the overall average I noticed that if I use two or three year correlations I can beat the market between two and almost four percent However, if I don't have that much money and I just want to pick one randomly I could look at those 36 months as 36 different scenarios In which case the low pe method is about half and half half the time it beats the market half the time It doesn't the negative correlation low pe method was beating the market almost two-thirds of the time So in conclusion pandas makes it easy to explore your data to shape your data analyze it visualize your data You need to have some understanding of your domain that you're studying your data for and plan to spend 50 percent or more of your time shaping your data Thank you Okay, any questions Thank you. That was really fascinating. I learned much more about pandas. I think in the last 25 minutes than I've Have in the last three and a half years Are you planning to Put this put your money where your pandas are? Uh, it's a good question Uh, I haven't decided yet. There's a lot more things that I would do here Personally to feel more confident for example the the period of time that I looked at The overall market was up. Anyway, I would like to see if the method also works During a period of time when the overall market is down If it still beats the market because it could this could be just exaggerating the market Right in which case when the market is down. I could do a lot worse than the market I don't know until I try that test. So I haven't done it yet Have a follow-up question. Yeah You started with a hypothesis And then you refined it On the basis of how it matched the data and so you might continue refining it But then Would you be in danger of simply making your Hypothesis Fit the data that you happen to have and not actually fit The mechanics or or what's really underlying what's making those numbers come out as they are You're you're just making your data Fit the pattern so you think that you've got a perfect match, but that's well Yes, and no that's why I say you have to understand your Your domain a little bit Yes in the sense that you're fitting it, but if it works it works, right? On the other hand if you understand your domain like I said before You would want to see if it works also when the market's going down You would want to maybe try it with a much larger data set like the russell 5000 and see if it still works So you'd probably want to do more of that type of stuff to become more confident But if it works it works On the other hand if it doesn't work, it's a very important result as well You know using math and science to make a financial decision if you see it didn't work in the past There's probably a good a pretty good chance. It's not going to work in the future Yes, um, I'm just thinking about what you were saying about rising and falling markets Um, would you look at using um, uh, so you've talked about having buy signals? So are you buying hope that the value goes up on the basis of your model? Would you also look at having sell signal so you could have Sell selling short so you could do that. That's true I did not look at when to sell other than I kind of more or less randomly picked a few different periods and just say I'm either going to hold it for a year or I'm going to hold it for two years But that's correct. I that could be another thing that you could throw into your model. When do you when do you sell? Right and ultimately once you have a bunch of these Parameters you could possibly do some machine learning on it and try to refine the parameters. You know, there's just a basic Analysis to kind of demonstrate what pandas can do Okay, that's all the question we have for now, but daniel is going to be around so make sure to Follow up on him directly. So yeah, thank you very much. Thank you