Good morning, good afternoon everybody, depending on where you are in the world. Welcome to our webinar, Tidy Finance and Accessing Financial Data, brought to you by the R Consortium. The R Consortium works with and provides support to key organizations developing, maintaining, and distributing R software, through grants and sponsorships worldwide. To find out all the details, and how your organization can become a member, please visit our website. I'm Melania Quintero, today's host, and I have a few housekeeping items before we begin. This webinar will have an interactive Q&A between you and our presenter: just type a question into the window at the bottom at any point during the presentation and click the submit button. Near the end of the webinar, we will answer as many questions as we can. So now we can get started. This webinar focuses on Tidy Finance and accessing financial data. Christoph Scheuch is the head of Artificial Intelligence at the social trading platform wikifolio Financial Technologies AG. He is responsible for researching, designing, and prototyping cutting-edge AI-driven products using R and Python. Christoph, thank you so much, and you can go ahead and bring up your presentation. All right. Thanks for joining us today for this Tidy Finance webinar, and thanks also to the R Consortium for having us here. My name, as you said, is Christoph Scheuch; for English speakers it's a horrible name to pronounce. I'm currently the head of AI at wikifolio, but I'm also a co-author and maintainer of the Tidy Finance project. In today's webinar, I will first introduce the Tidy Finance project and talk about why R is a great language to access and manage various financial data sources. At the end of the webinar, I'm also pleased to announce that just yesterday we released a new R package, available on CRAN, that deals with a lot of the issues I'm talking about today. 
So let's start with the question: what is Tidy Finance? Tidy Finance is actually the result of our struggles as finance PhD students at the beginning of our research careers, because we faced considerable challenges working on our first research projects. In particular, there was a lot of data cleaning involved, and nobody had prepared us for that. The second big challenge was that there was no code available at the time to reproduce seminal results. So it felt like: why is this so opaque? Why is nobody sharing the code? At the time, prominent voices in financial economics claimed that there is actually some sort of reproducibility crisis, because people are not sharing their data and not sharing their code. In recent years, this has shifted dramatically: the top finance journals now have code-sharing policies, and from our perspective things are moving in the right direction. What we want to achieve with our Tidy Finance project is a transparent, open-source approach to research in financial economics, and recently we also introduced support for multiple programming languages. Our website tidyfinance.org at the moment provides the following set of tools. First, if you want to learn about empirical applications using tidy principles as a student or self-learner, we have content on our website for you to look into. Second, if you want to learn how to work with financial data in a tidy manner, we provide resources and also the package I just mentioned; this is the focus of today's webinar. If you are an instructor, or in any other position where you want to teach students the importance of reproducible research in economics and finance, we also provide resources for that. And last but not least, our platform features a blog that is open for external contributions in the area of reproducible research in finance. 
So let's dive a bit more into why we use "tidy" all the time. Why do we emphasize this? Why not just call it transparent finance? We decided to follow a couple of principles when working with data and writing code. In particular, we think that code should not just be correct; it should first of all be designed so that it is easy for humans to read. Whenever you write a script, you're not just coding in the moment: you're also writing for your future self to understand, and maybe for your co-authors. One of the nicest compliments I've received came from Stefan, who was also part of the Tidy Finance project: "I can read your code like I can read a novel." That was very nice, because it means you understand what's happening, and I was really happy about that. The second important point: when you have a complex problem, try to break it down into smaller, simpler problems, and then write functions that solve these particular problems in order to solve the bigger one. This is very much tied to the third point: you should embrace functional programming. This is something R excels at, because it's very easy to write functions that solve particular problems, and if you chain these functions together, it really increases the chances that your results are reproducible in the end. The fourth point of our tidy approach is that we want to reuse data structures across applications as much as possible. Everybody who has experience with data has learned that whenever you get your hands on a new data source, it very likely looks completely different from what you've worked with before; but you often need to combine different data sources, so it really makes sense to think about how to harmonize the data across sources. This is also the focus of today's talk: tidying financial data. 
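The "small functions, chained together" idea just described can be sketched in a few lines of R. This is only an illustration: the function names, column names, and price panel below are made up for this example and are not taken from the Tidy Finance codebase.

```r
library(dplyr)

# One small function per sub-problem; these names are illustrative.
compute_returns <- function(prices) {
  prices |>
    group_by(symbol) |>
    arrange(date, .by_group = TRUE) |>
    mutate(ret = price / lag(price) - 1) |>
    ungroup()
}

drop_incomplete <- function(returns) {
  returns |>
    select(symbol, date, ret) |>
    filter(!is.na(ret))
}

# A made-up price panel to demonstrate the chain.
prices <- tibble(
  symbol = rep(c("AAA", "BBB"), each = 3),
  date   = rep(as.Date("2024-01-01") + 0:2, times = 2),
  price  = c(100, 101, 99, 50, 52, 51)
)

# Chaining the small functions solves the bigger problem,
# and each step can be tested and reused on its own.
returns <- prices |>
  compute_returns() |>
  drop_incomplete()
```

Each function does one thing, so the pipeline reads top to bottom like the description of the analysis itself.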
But before we dive into the financial data, I think it makes sense to do a quick recap: what is tidy data? I took this very nice illustration from Julia Lowndes and Allison Horst, and I really like how they visualized the quote put forth by Hadley Wickham a couple of years ago. He has written a paper on tidy data, on what he means by it and why it matters, and there is this nice line that tidy data sets are all alike, but every messy data set is messy in its own way; I think there's a lot of truth to that. So what do we mean by tidy data? At the top of the figure you see the happy tables. The characteristic they share is that the columns are all variables and the rows are all observations; this is something we will see in the applications further on in the webinar. If you look at the unhappy, messy tables below, they have problems most of us have already encountered in our work: for instance, the columns are values rather than variables, or the roles of rows and columns are switched, and really anything goes with messy data. I particularly like the last one, where I don't even know what the deal is; it can get really complicated. 
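The tidy-data idea above, every column a variable, every row an observation, can be shown with a tiny sketch. The table and numbers here are invented for illustration: in the messy version, the year columns are really values of a "year" variable, and `pivot_longer()` fixes that.

```r
library(tidyr)

# A messy table: the columns `2023` and `2024` are *values*
# of a year variable, not variables themselves. Data are made up.
messy <- tibble::tibble(
  symbol = c("AAPL", "MSFT"),
  `2023` = c(0.48, 0.57),
  `2024` = c(0.31, 0.12)
)

# pivot_longer() gives each variable its own column and each
# symbol-year observation its own row.
tidy <- messy |>
  pivot_longer(
    cols = c(`2023`, `2024`),
    names_to = "year",
    values_to = "return",
    names_transform = list(year = as.integer)
  )
```

The tidy version has columns `symbol`, `year`, and `return`, which is exactly the shape that joins and group-wise computations expect.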
So, with this in mind: tidy data is a concept where the columns are variables and the rows are observations, and I think for economists that notion is very intuitive. But before we dive into the data part, let me quickly show what we mean by tidy code. This example code chunk shows R code that solves an introductory problem. By now, I guess you're not surprised that we use the tidyverse for most of our content, but let me assure you we are not dogmatic: on our platform there are also contributions that use data.table or other approaches. It's just our personal preference that we like the tidyverse, and we're not claiming that the only way to write tidy code is with the tidyverse; you can also write tidy code with other approaches. What's happening here: in the first part, we want to download the symbols of the Dow index, so we can leverage the tidyquant package to do that, which is very nice. What is returned also contains a row for the US dollar, so we kick that row out; we don't need it. In the next step, if we want to download the prices of the Dow constituents, we can also use tidyquant, and this is actually all you need: you just plug in the symbols and download the prices. Then, as a next step, we want to calculate the returns. This is a step-by-step instruction: for each symbol you calculate a return, then you throw away all the columns you do not need, and you also kick out rows that have missing values. Just for completeness, and this will be very brief, we recently also released a Python version of our content, where we use pandas and numpy, and you can of course do exactly the same thing that I just showed you in R using Python. If you feel overwhelmed by pandas, that's okay; it has a very different syntax, but we still try to use the same principles of composing and chaining operations.

All right, a few slides of advertisement. I mentioned that we have the blog; feel free to check it out. We're still in the initial phase, but we have a couple of external contributions. For instance, Björn Hagströmer and Niklas Landsberg recently published a post on conducting tidy market microstructure research, where we provide a lot of detail on how to compute measures in this area. Ian Gao has written a post about working with WRDS data, and I have written a short post about Fama-French three factors inside sectors and some differences that arise there. So if you want to contribute something, feel free to reach out, check out our website, and get in touch. Second piece of advertisement: it's not just me who maintains this website. As I said, external contributions are important to us, but we review the input and make sure the content is up to date. The team is me, Stefan Voigt, and Patrick Weiss, whom I got to know during my PhD studies at the Vienna Graduate School of Finance, and recently Christoph Frey, who was in Hamburg, joined us as our Python expert. Last piece of advertisement: we have written the Tidy Finance with R book, published by CRC Press last year, and the Tidy Finance with Python book is forthcoming in July of this year. Unfortunately we don't have the nice cover yet, just this pixelated placeholder; we have of course created a new cover image, but the production team has not implemented it yet. We are very happy to have written these books, but we are also transparent that the website is more up to date than the books: whenever we come across an innovation, a bug, or an error, we immediately update the website, so the Tidy Finance with R book might on occasion be a bit outdated, but the content is more or less the same.

All right, with this idea of Tidy Finance in mind, let's go to the main body of today's webinar, which is the topic of accessing and managing financial data. There are 
many data sources out there in the financial domain, and I'm definitely not going to cover all of them in this webinar; neither do we cover all of them in our book. We have selected those we think are most important from an academic point of view: what you typically use during your studies, or as a teacher or researcher in financial economics. But let's start with why we are doing this, or what the challenge is. We want to highlight that it's really important to organize your data efficiently, in particular when you work on multiple projects over time, or at the same time, or with co-authors. If you pull multiple data sources together, you might come across the challenge that it's hard to ensure consistency across those sources. One of those consistency issues might just be the column names; there are widely different conventions out there. The solution we put forward is, first, that R is a great tool to import, prepare, and store data from basically any source you can imagine, and second, that SQLite is a great database format to organize that data, because it's very easy to install; we'll get back to that later. In today's presentation, I will mostly use the R packages on this list: for manipulation, the tidyverse, mostly dplyr and tidyr; for data import, I will show examples from tidyquant and the frenchdata package, which we really like, plus readxl to read Excel files; and as storage technology, as I said before, we use SQLite. So let's start with the first data source, and bear with me: I know it will be very code- and data-heavy, but this is, as we announced it, accessing and managing financial data; towards the end of the presentation I will include a few more graphs to slowly ease out.

The first data source, which I think everybody with experience in the financial domain is familiar with, is the Fama-French factors and portfolios. This is the most popular data for asset pricing tests since Fama and French published their paper in 1993. The idea is that they constructed portfolios, and the question is: can these portfolios explain the cross-section of stock returns? Whenever you have a new idea, a new factor that might explain the cross-section of stock returns, you typically take the Fama-French data and evaluate whether you provide any new explanatory power. As you see in this code chunk, we use the frenchdata package. We think it's a great resource; it's a bit faster than copy-pasting all the URLs yourself, and the cool thing is that it really downloads the raw data without any manipulation, so you can do everything yourself. But if you use this data all the time, it makes sense to clean it properly so that you have a nice data structure, and recall that the data should be tidy. What we do in this example, if you focus on the mutate block: first, and this is a very common thing, dates come in horrible forms, anything you can imagine, so we create a month column that basically takes the first day of the month for each row. Second, as you see in the second mutate part, the Fama-French data is in percentage points, which might be confusing if you calculate returns yourself with other data, so we divide by 100 to get plain numerical values. Then, as a personal preference, we like having column names in lower case with words connected by underscores. You might have your own preference, and that's fine, but we decided to use this convention consistently: we rename all the columns to lower case, and we also rename the market excess return manually, because "market excess" is more expressive than "MKT minus RF" as a column name. What you see at the 
bottom is our understanding of what tidy Fama-French factor data should look like. The columns are variables: we have the month, the market excess return, small-minus-big, high-minus-low, and the risk-free rate, and each row is a month. If you want to combine this data with another source, one that is growing in importance is the q-factors, a recent alternative to the Fama-French data, and you see that this comes in a very different format. Here we directly read a CSV file from a URL, and I think this is a great feature: you don't have to download things manually in R, you can just use the URL and read the data directly into your R session. Again, the first step is to convert year and month into proper dates, because we think having proper dates is important, and then we also do a couple of naming transformations to get lower-case column names and decimal instead of percentage values. So, just as an inspiration: if you're doing asset pricing tests, we encourage you to also look at the q-factors. Again we end up with a nice tidy data format where the column names are a bit different from the Fama-French data, but we think this also makes sense, because on a theoretical level those columns are different as well. The next data source I want to show you, because it has yet another approach to getting data into your R session, is the set of macroeconomic predictors by Goyal and Welch. They released this data for the first time, I think, in 2008; it's a collection of different macroeconomic variables that you can use for equity premium prediction. We also use it in our book for a machine learning application to select factors that price assets. I wanted to show this example because in this case you cannot directly read from a URL: it's an Excel spreadsheet stored on Google Drive, which is a bit confusing for R. What you have to do here is first download the file with the download.file() function and store it; here I simply store it as a macroeconomic predictors xlsx file. Then you can start reading it in and doing several cleaning steps and variable transformations that I'm not showing here today. Just as an inspiration: if you have data sources where you cannot directly use read_csv() or some other read function, you can always call download.file() in your script, store the file, and then read it, and we think this is also a very nice feature. And again, the result is similar to the other data: we have clean column names, one row is one month, but there are many more variables; I'm not going into the details here. I also want to highlight that there are many other macroeconomic variables out there that you can use for your research. Most notably, the Federal Reserve Economic Data, the FRED database, is a great resource: there are more than 10,000 data sets available, and what's nice is that the tidyquant package that I showed on one of the initial slides has built-in support for downloading data from FRED. Again, you can use tq_get(), the tidyquant get function, input your desired series, here the consumer price index that we also use in our book, and by specifying the economic data source, tidyquant knows it has to get the data from FRED. Here we actually don't have to do much computation or column cleaning; following our usual convention, we just convert the end-of-month date to the first of the month, because that is what we use consistently in our applications, and we calculate the consumer price index relative to the last month of the time series, which again is just our convention. So we think tidyquant is also a great resource to download data. Now let's move on to the point of data storage. As you probably know from your own experience, 
there are many ways to store data. I guess everybody has written CSV files; they are somewhat good, better than Excel workbooks at least, but we think that playing around with databases also pays off in the long term, in particular because they are so easy to set up. What you see in this code chunk is that we use the RSQLite package, which is very lightweight, to create the database, and the DBI package to interact with it. You can create the database from within your R session, and I think this is a big advantage of SQLite, at least relative to non-serverless database engines. What we do here is connect to the tidy_finance SQLite database, and if it doesn't exist, this code creates it. We also use the extended_types option, because with it RSQLite is able to preserve dates when writing to the database and reading them back out. For dates, I think we could do a separate webinar just on how dates behave across programming languages and storage technologies, but at least for our use case this helps us stay consistent across applications. If you want to write to the database, there is the dbWriteTable() function. For instance, if we want to write the Fama-French three factors, the monthly ones we downloaded before, to the database, we can just give the table the same name there, and I think it's a very lean operation. If you want to load this data again from the database, there are several approaches: you can use the dbReadTable() function, but here we use the dbplyr backend and the tbl() function to access the table we just wrote, and if you then call collect(), you have it in memory again. This is the typical workflow we use in the book: we download the data, we make sure the column names are aligned, we make sure the format is tidy, again, each column a variable, each row an observation, and then we store it in the database, because in later applications we do not want to deal with all this data cleaning again and again. So if you're working on multiple projects with the same core data, it makes sense to do it once, properly, clean everything, and then reuse that storage for other projects. You might wonder: why SQLite? We think the big pros are two-fold. First, it's a lightweight, self-contained, serverless database engine. This means you can just install the R package without any other dependencies, and as you've seen in the code chunk before, it's very easy to create the database: the RSQLite package just initializes it, and you don't need your own server; database management can otherwise get very complicated very quickly. The second big plus is that it is great for education purposes and prototyping. In my company we also frequently use SQLite databases just to quickly set something up that doesn't have a heavy load, and it makes sense to force yourself into SQL thinking if you also have production technologies that use SQL. But I also want to mention the big cons of SQLite. In particular, with very large data it might become slow; it is definitely not the most efficient solution for very large data. Concurrency is also an issue: if you want to use SQLite in jobs or applications with a high degree of parallelization, you will have to tinker a little to make it work, and ideally you can switch to other technologies. The second con I want to highlight is that, with the approach I've just shown you, it might be a bit cumbersome to transfer the data over to other languages. For 
instance, using the same database in R and Python can be a bit of a pain, in particular when it comes to dates. I've written a blog post about how different data storage technologies handle dates in R and Python; if you are interested, I'm happy to share that with you. So, with this in mind: SQLite is very simple to install and very easy to manage locally on your computer. There are many open-source data sets out there; I've just shown you a few examples. Unfortunately, one of the most important data sources in the academic research community and in teaching is closed source. It's an important part of our book, so I want to highlight it in today's webinar as well: WRDS. You might wonder what WRDS is. It stands for Wharton Research Data Services, and it is, I think, the most widely used financial economics data service in the academic domain. It has a focus on academic audiences and research applications, and they have a lot of data, a lot of tables; if you go into the details of WRDS, it's just crazy how many things they have there. They obviously don't use an SQLite database; they use another database engine, namely PostgreSQL, and fortunately it's very easy to access that using the RPostgres package. I'm not going to show you any more code here; it's very similar to what we've seen with SQLite, we basically just change the driver and the target, and you can find more details on our Tidy Finance website. Pulling data from there works very similarly. Let me just list our data sources, because it would take way too much time to go into details for each of them; these are the ones we use in our book and that we think are the most important ones in financial research. First, the CRSP data set: basically historical monthly and daily returns for US stocks. I'm going to show you some insights into that in a bit. The second very important data source is Compustat: historical accounting data for US companies, which is naturally very useful information if you look into asset pricing problems. The third very important data source is characteristics of US corporate bonds, which we use in our book because we also combine it with the fourth data source: Enhanced TRACE, detailed US corporate bond transactions. You can actually map those two things together, which is really exciting. A brief glimpse into this CRSP data: what you can see here is the historical number of securities by listing exchange, which I think is already an interesting insight. Since the 1960s we have the number of stocks for the New York Stock Exchange, NYSE, for NASDAQ, and for AMEX, the American Stock Exchange. For instance, you see that there was a peak in the late 90s in stocks listed on exchanges, in particular on NASDAQ: we had more than 5,000 stocks on NASDAQ, and at the moment we are below 3,000. You can also see in this graph that AMEX has steadily declined over time. There is also the "other" category, but to be honest I'm not exactly sure what's in there; it's just a handful of stocks, actually. So this is a great data resource: if you can use it for your research or your studies, you can do a lot of historical analysis, which is really great, and our Tidy Finance approach provides guidance on how to prepare that data and how to work with it. I also want to briefly mention the historical bond data, just a quick glimpse. This is the snapshot we used in our book: bonds between 2014 and the end of 2016. In this data there are more than 13,000 or 14,000 bonds outstanding towards the end of the sample, and if you look at the trading data, around 8,000 bonds traded in that period. We should actually extend this figure at some point, because I think it would be informative to have more recent numbers, but as you can imagine, at least the transaction data, the trading data, becomes huge if you look at 
a big sample period. But of course, with R you can do that too if you want to. I've briefly mentioned that there are many other data sources out there; in fact, we think there is a large ecosystem of data providers. When we wrote the book, we tried to compile a list of R packages, and since then we have been continuously expanding it. Just to list a few examples that I did not talk about today: you can directly access the FRED database that I mentioned using the fredr package; if you want European macroeconomic data, there is the ecb package; Quandl is also a very important data source, and I think the tidyquant package uses some of Quandl's endpoints; and many more. There are packages for most data sources, also for Bloomberg, and of course we also list sources for crypto data, although we don't really talk about that data in our book yet. If you take a look at our list and think we are missing an important package, maybe a package you wrote yourself, then please reach out to us via the contact at tidyfinance.org; you will also find on the website how to reach us, and we are happy to extend this list. And with this, we actually arrive at the end of today's webinar. I hope I've given you an overview of what the Tidy Finance project is about, and some inspiration for how financial data can be cleaned, and why it is important to clean it and ensure consistency. I want to close today by mentioning that just yesterday we released a package to CRAN. The idea is that it provides a couple of helper functions for people who are familiar with Tidy Finance with R; in particular, all these download and cleaning steps are packaged into functions, so you don't have to copy the code from our website every time you use this data in different projects. For instance, in the example I'm giving you here, you can just install the package, now hosted on CRAN, using install.packages(), and if you want to download the CRSP monthly data from WRDS, you can just call the tidyfinance download_data() function. It internally also establishes a connection, of course you need your own credentials for that, performs all the cleaning steps we have in the book, and just spits out the final data for you. And yes, we are very happy to receive feedback about this package. It's just the first release, so it has very limited functionality; we decided to start with limited features because we're not sure in which direction we should extend the package, so we really rely on your feedback on what we should do with it. Three main takeaways for the end. First, we provide detailed open-source material on tidyfinance.org, whether you want to learn about financial applications yourself or you are a teacher; if you need teaching materials such as slides or solution manuals, please just get in touch with us. We are also very happy to learn who is using which material where, and to collect feedback on that level. Second, if you have an idea that would be a good fit for our blog, feel free to reach out at an early stage, not when you have already written the blog post, but before you write it. For instance, if you are a master's student and want to showcase your work in a different format than your thesis, we can also use our blog for that. Last but not least, if you want updates on the Tidy Finance front, feel free to follow me on LinkedIn under my name, Christoph Scheuch. Thank you very much, and now I am looking forward to the Q&A. OK, I see there are already some questions coming in: why does the "other" category correspond to null on the first graph? 
OK, null. Yes, the question is about this graph: it's not zero, it's just very, very small, which is why it looks like a flat line around zero. And it's a good point that you raise; for the future we could even think about excluding that category, particularly in presentations, because I think it raises more questions than it answers. Any more questions? No open questions at the moment, so maybe I can fill the time with some anecdotes. I mentioned initially that the Tidy Finance project emerged out of our own issues with the data, but actually I wrote a couple of blog posts in the final year of my graduate studies, because for me it was already clear that I was not going to stay in academia; I wanted to go to industry, and I figured that I had no idea how asset pricing works in practice. So I wrote blog posts on how to replicate the famous Bali, Engle, and Murray book on the cross-section of stock returns, I'm not exactly sure about the exact title, and I just put them online, because I figured I would have liked to have this information out there. Over time I received emails from people thanking me for putting these things online, and from there on, I think, Stefan, Patrick, and I figured: let's make more out of these blog posts, because there are people who would like that.

OK, we have a new open question: what kind of background in R and finance is recommended for getting the most out of your book and other materials? To be honest, our book is aimed neither at an R beginner nor at a finance beginner audience. Ideally you have some previous exposure to programming with R, and ideally you have also had a basic introductory finance course. For instance, the capital asset pricing model, the CAPM, is part of our book, but we do not go through the details of the CAPM; this is not what we want to do in our book. We really want to take people who already have some knowledge and bring them to the next level, in particular master's students in business or finance and graduate students. But then, on the other hand, my colleague Stefan teaches the book to, I think, master's students in Copenhagen who do not have previous R experience, but at least they have the finance background. We think this is also a power of the tidyverse and this human-centered coding approach: you can write code that is accessible and understandable to people even if they don't have a programming background. So, long story short, to answer your question: having some finance background is valuable to really get the most out of our book, while the R knowledge can easily be learned on the job, so to say.

Maybe another anecdote: how did we come up with the Python approach? Initially we were really convinced that R is the best way to do finance, but it turns out that in industry many people prefer to use Python, and people were still interested in our approach to financial economics, so people started translating our R code to Python. We figured, okay, before other people do it, and before people use ChatGPT to translate our stuff without any quality checks, we decided we should do it ourselves and really properly make sure that the results align.

OK, we have a new question: have you looked at adding a chapter on building Shiny apps with financial data? We will most certainly not do that, because, first, there are many great resources on Shiny apps if you want to get into that; Hadley Wickham has a book on Mastering Shiny, for instance. And there is a lot of overhead involved with Shiny apps that would just go beyond the scope of our book. Having said that, if you are doing research with financial data, I think it really pays off to dig into Shiny apps as well, to provide an interface to your results and give people a new way to interact with them. So, to wrap up: we are not planning on doing that. We decided not to include chapters on Shiny apps or other publishing formats, even though those are also valid ways to share your results, simply because we don't have the expertise and the book would explode. Nonetheless, we think these are great things to learn. And if you are into tidy approaches in general, not just the tidyverse, I think you can also approach tidy coding in Python and Julia. I once again want to invite you to follow me on LinkedIn, because I am also sharing content beyond Tidy Finance; I have my own blog called Tidy Intelligence, where I write more about cross-language comparisons. But since we are approaching the end of today's webinar, I think we can close for today. I just want to say thank you, everybody, for participating and attending; it was a great experience for me to present this material for the first time, and thanks again to the R Consortium. Hopefully people reach out to me, or to us.

Thank you, Christoph. Bye, everybody.