Hello, everyone, and thanks for coming. I'm Tomasa Rodrigo, and today I'm going to present one of our latest projects, developed this year, in which we monitor the financial and economic performance of the US economy using the narrative of corporate reports, applying natural language processing and machine learning. I have structured the presentation in three parts. First, I will briefly describe what we do and why we are doing it. Then we will look at the data and the methodology we follow. Finally, I will focus on the results and the applications of what we are doing.

Let me start by asking you a question. Did you know that more than 80% of all web pages on the internet are composed just of text? Think about that for a moment. Imagine all the data we have: the comments, opinions, and statements that we find in the media, in social networks, in blogs, or even in economic and financial reports. This huge amount of information gives us a lot of data for analysis that has not been exploited before, and it could help us understand society, the economy, and the world better. So why are we not taking advantage of it? Nowadays this is possible thanks to natural language processing techniques, also known as text mining or computational linguistics. What we are going to do is take all this unstructured data, that is, text, images, videos, and convert it into numbers: we are going to quantify words and extract meaning from them. As you can imagine, this kind of analysis has huge potential in many fields. What are we going to see today? We are going to focus on business.

So another question for you: did you know that two-thirds of the information that companies have to report is just text? When we think of a company, we think of a lot of numerical data, but there is also a lot of information that is not numbers and still has a lot of value. What do we mean by corporate reports? In most countries, companies are required to submit information about their financial health. In the case of the US, which is the case we are going to see in this presentation, companies have to present this information every quarter and every year to the SEC, the Securities and Exchange Commission. The information they present gives a comprehensive overview of the financial activity of the firm and the business it develops, with all the financial data, operating results, new product plans, and even research and development activities for future plans. So imagine that all this data is available for knowing the health of the economy better, in this case the health of the US economy.

Before going on, let me show you the working process we used to develop this project. I am going to show you (I say "we" because I'm thinking all the time of my team) the whole process: the data we extract, how we preprocess and clean it, and finally how we analyze it. Throughout the presentation we will come back to these parts to make what we are doing easier to understand.
Well, starting with the data: all the information that corporations have to present to public institutions, in this case the Securities and Exchange Commission, is available in electronic format in EDGAR, the electronic database that collects all the filings from US companies. So what we do is go to this database and filter by two main reports. One is the 10-Q, the quarterly report that firms submit to the commission, where you find all the information about how the firm did during the quarter. The other is the annual report, the 10-K, a more detailed document where the firm explains all of its performance during the year. From these two reports we focus on a particular section called Management's Discussion and Analysis. Why? Because in that section the firms talk about their activity and their performance, analyzing the economy, analyzing risks and uncertainties, and also giving forward-looking statements; that is, they describe their expectations about the future. They analyze the current situation, but they also give information about what lies ahead. So for us this section is really, really interesting for understanding how firms behaved during the quarter or the year, what they see ahead, and which are the main risks and uncertainties we should take into account.

Once we know which information to extract, let's see which companies we analyze. This database holds more than 21 million corporate reports, from 1984 to today, so it is a huge database of firms from which to get the text and do the analysis. At the beginning of the project we were really ambitious and said we wanted to analyze everything. So we downloaded everything, put it in the cloud, and started to analyze all of it. As you can imagine, it was extremely time consuming, and we found a lot of noise in the data that we had to filter and clean before the analysis. Because, as you probably know, in all projects like this the important thing is to make your data clean: you need enough data quality to get good results, and otherwise you can forget about it. So we spent a lot of time on the data, and what we realized is that the most important companies were already giving the hints and the main takeaways of what is happening and what is going to happen. So we decided to focus on the Standard & Poor's 500 companies, the 500 largest companies in the US, which cover 80% of the American equity market by capitalization.

What you can see in this graph is the evolution over time of all the firms we monitor. The graph grows over time because the electronic format and the digitalization of these documents increased over time. So, to make our analysis stable over time, we start the analysis in 2000, even though we have all the data preprocessed for the whole period. We are going to focus on this sample of firms. In the other graph you can see the distribution of these firms by sector of activity: the biggest groups belong to manufacturing and services, which together comprise more than 50% of the total sample, followed by retail trade and transportation.
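As a rough illustration of this extraction step, here is a minimal Python sketch that pulls one of EDGAR's public quarterly form indexes and keeps only 10-K and 10-Q entries. The index URL pattern follows EDGAR's public full-index layout, but the parsing shortcuts and the User-Agent contact string are my own assumptions, not the team's actual pipeline.

```python
import requests

# EDGAR publishes a plain-text index of all filings per quarter
# (URL pattern assumed from EDGAR's public full-index layout).
INDEX_URL = "https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/form.idx"

# The SEC asks automated clients to identify themselves (hypothetical contact).
HEADERS = {"User-Agent": "research-project contact@example.com"}

def fetch_filings(index_url: str, forms=("10-K", "10-Q")):
    """Return (form_type, path) pairs for the given form types."""
    text = requests.get(index_url, headers=HEADERS, timeout=30).text
    filings = []
    for line in text.splitlines():
        # form.idx lines start with the form type; a prefix filter is enough here.
        if line.startswith(forms):
            parts = line.split()
            form_type, path = parts[0], parts[-1]
            filings.append((form_type, path))
    return filings

if __name__ == "__main__":
    for form_type, path in fetch_filings(INDEX_URL)[:10]:
        print(form_type, "https://www.sec.gov/Archives/" + path)
```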
Well, now that we have the data, let's see the process we follow for cleaning and transforming it. As I told you before, we get the 10-K and 10-Q reports of the S&P 500 companies from EDGAR. Then we clean and organize this data: we lowercase the text, we remove stop words, and so on. When you work with text, you need to prepare your data before you can convert it into numbers, which is the following step. After that, to prepare it for the analysis, we construct the document-term matrix. This is a huge matrix where every term is represented like a vector and every document-term pair has an ID. Once you are working with the data, this matrix helps you get the relationships between terms and see how they evolve over time.

For the methodology, what do we do in this project? We decided on two different approaches. First, we apply Latent Dirichlet Allocation (LDA) for topic modeling. What is that? It is an unsupervised learning algorithm: a probabilistic topic model that finds groups of words with a high probability of appearing together, and that is what we call a topic. If you look at any text, you will see that there are particular words that tend to appear together. Imagine that we have almost 21 million documents; across all those documents there are particular relationships between words that the algorithm takes into account in order to say: these are the words that appear together, and they belong to a topic. The other methodology we use is the word2vec neural network. This methodology analyzes the non-linear relationships between words in order to group them by context. What do I mean by that? Imagine that you have a particular term of interest. This methodology will tell you all the words that appear in the same context as your term. So if you want to monitor a strategic set of indicators, you can use this methodology and say, "I'm interested in uncertainty," for example, and it will give you all the words that appear in the same context as uncertainty.

Just to be clear about the methodology: with LDA, this topic modeling, we let the data talk. We give it all the documents, and the model groups the data, in this case words, by the probability that they appear together in a set of topics. A document is then a mixture of topics with different proportions, and a topic is a mixture of words, each with some probability of belonging to the topic or not. With word2vec, as I told you, the model takes the context into account. Imagine a simple example like a sentence: the model maps every word and, for every word, takes into account the words that appear around it. It constructs a huge network for each word based on the other words that appear together with it, so that once you have analyzed this huge amount of data, if a particular topic interests you, you can go to the algorithm and get the data, as in the sketch below.
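To make these steps concrete, here is a minimal Python sketch of the pipeline just described, using gensim (one common library for LDA and word2vec; the talk does not say which tools the team actually used). The tiny in-memory corpus and all parameter values are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec
from gensim.parsing.preprocessing import STOPWORDS

# Toy stand-in for the MD&A sections (the real corpus is millions of filings).
raw_docs = [
    "Revenue increased while operating costs declined this quarter.",
    "Economic uncertainty and recession risk weigh on our outlook.",
    "Loan portfolio growth offset higher deposit and mortgage costs.",
]

# Cleaning: lowercase, strip punctuation, drop stop words, as in the talk.
docs = []
for doc in raw_docs:
    tokens = [w.strip(".,;:").lower() for w in doc.split()]
    docs.append([w for w in tokens if w and w not in STOPWORDS])

# Document-term representation: every term gets an integer ID.
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per document

# LDA: unsupervised topic model; num_topics here is arbitrary for the demo.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_words=5):
    print(topic_id, words)

# word2vec: groups words by context; with a real corpus you could then ask
# for the neighbours of a seed term such as "uncertainty".
w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1)
# w2v.wv.most_similar("uncertainty")  # meaningful only on a large corpus
```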
And once we know what they are talking about (we have all these documents, we apply these models, and the models tell us what US companies are talking about in their filings), the other thing we want to analyze is how they are talking about it. If they are talking about activity, I want to know whether they are talking positively about it or not. So we analyze sentiment for all the identified topics, OK? How do we do that? We use the lexicon approach. This approach relies on a dictionary, in this case the Loughran-McDonald dictionary, which was built from financial filings, so it is very accurate for analyzing financial text. This dictionary has one set of words with positive connotations and another set with negative connotations. What we do is take all our text and match it against the dictionary. At the end of the day, for every period of time, for every firm, or for every sector of activity, depending on what we are interested in, it gives us the average tone of the documents, calculated as positive words minus negative words over total words in the document.

OK, so here you can see our results. What is this? Think about it. This is the evolution of US GDP compared with a sentiment indicator built from the narrative of US corporate reports. Remember that we started from unstructured data. We applied natural language processing techniques, we applied sentiment analysis, and at the end of the day we have a quantified indicator over time, and you can see here the relationship it has with GDP: the correlation is above 60%. So from unstructured data, where we had nothing, we constructed an indicator that replicates activity. And what is the advantage of this data? That it is really granular. You don't just have the national evolution of sentiment; you have all the company reports analyzed from the micro level (a particular firm over time) up to the sector level, analyzing investment, profits, uncertainty, risk, whatever you want. So you can see how, using these techniques, you can replicate official data, but with the huge advantage of really granular data to extract value from.

Well, from this aggregate view, let's go down to the micro level. From all the reports for each period (in this case we group them by year), we extract the main emerging topics that appear in the documents. So here you can see (yes, it is automated, so you can look at it whenever you want) the information that appears in each of these documents in a simple way, giving a comprehensive view of what is happening. Looking at the data here, we show the variation of topics and terms over time, year over year, in order to see which topics emerge. You see that, for example, in 2000 the documents were focused on higher revenues and growth, and then if you go to 2008, the financial crisis, you see that other terms emerge in the documents, like liabilities, restructuring, and credit claims. So you see that the narrative of firms was different; there are points here where we should focus our attention. But it is not just about independent words: we can also take into account the relationships between words.
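As a concrete illustration of the lexicon approach, here is a small Python sketch of the tone formula above. The two word lists stand in for the Loughran-McDonald positive and negative categories (the real dictionary has thousands of entries); the list contents and the sample text are placeholders.

```python
# Placeholder lexicons; the real Loughran-McDonald lists are much larger.
POSITIVE = {"growth", "gain", "improved", "strong", "profitable"}
NEGATIVE = {"loss", "decline", "impairment", "litigation", "weak"}

def document_tone(text: str) -> float:
    """Tone = (positive words - negative words) / total words."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    words = [w for w in words if w]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words) if words else 0.0

mdna = "Strong growth in revenue offset a small decline in margins."
print(f"tone = {document_tone(mdna):+.3f}")
```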
In this case, analyzing all these texts, we constructed this network, where you can see a core that is highly interrelated, and then some terms that are related among themselves but not so much to the rest; those we can set aside. Well, focusing on the most interrelated part of the network, we see some combinations of words that deserve to be treated together, for example foreign currency, stock markets, risk factors. So the question is: when we do the analysis, do we want to consider these words independently, or do we want to take this relationship into account? To make the algorithm more accurate for the data, it is better to keep these combinations of words together, as in the sketch below.

Well, once we prepared all the data, we ran the LDA model and got topics. And since this is unsupervised learning, they emerge from the data: groups of words that the model says appear together and belong to the same topic. And what do you think when you see the word clouds? They make sense, right? So the analyst's job is to go through these LDA results and check whether the topics that emerge have a clear meaning. For example, for the one with loan, portfolio, deposits, mortgages, we immediately think of the financial sector, right? Then you see that we also have topics related to retail sales (store, retail), to the automobile sector, to real estate. All these topics that the data gives us tell us what the firms are talking about.

Now we have more than 30 classified topics and we know what the companies are talking about in this data. What do we want to see next? How the topics evolve over time. Because imagine that we are a bank and we have a client that is a big corporation in some sector of activity. With this tool you are able to tell your client, or at least to know yourself, how things are evolving: which issues emerge every quarter for that corporation, giving you knowledge that can complement your traditional tools for analyzing risks and uncertainties.

Well, for all these topics, we also computed the relationships between them, based on the probability that the topics appear together, in order to build a matrix for identifying the transmission of a shock. In this case most of the topics refer to sectors of activity. Imagine that a shock emerges in a particular sector, an idiosyncratic shock; given that sector's interrelationships with other sectors, it could rapidly propagate to the whole economy, or at least to an important part of it. We want to know how these mechanisms work in order to anticipate what is going to happen, or at least to minimize the risk. So you can see in this network that three different clusters were identified: for example, manufacturing is more related to markets and oil than retail is, retail is related to communications, and finance is strongly related to the topic of global factors and global issues.

Well, all of this is with LDA, where, remember, we let the data talk: we extract particular topics, we can track their evolution over time, and we can analyze their sentiment. Now I also want to analyze this huge, massive amount of data with another approach, and say: OK, it is really good to know how the financial sector or the automobile sector evolves over time, but I'm quite interested in knowing about uncertainty and the risk of a recession in the economy.
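One common way to keep such word combinations together is collocation detection, which merges frequent pairs like "foreign currency" into single tokens before topic modeling. The talk does not name a specific tool; this sketch uses gensim's Phrases model, with toy sentences and deliberately low thresholds as assumptions.

```python
from gensim.models.phrases import Phrases, Phraser

# Toy tokenized corpus; real input would be the cleaned MD&A sentences.
sentences = [
    ["foreign", "currency", "exposure", "increased"],
    ["foreign", "currency", "losses", "widened"],
    ["risk", "factors", "include", "foreign", "currency", "moves"],
    ["risk", "factors", "remain", "unchanged"],
]

# Learn which adjacent pairs co-occur often enough to be one unit.
# min_count/threshold are set very low only because the corpus is tiny.
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1.0))

for sent in sentences:
    print(bigram[sent])  # e.g. ['foreign_currency', 'exposure', 'increased']
```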
And I want to see, when companies report information about their financial performance during the quarter or the year, what they say about the risk of a possible recession. So we do this with word2vec. We say: give me all the words that appear in the same context as recession. Because if I just go to the data, search for "recession", and count the times it appears in the documents, that is not complete, right? There is a lot of information you are not capturing, because the companies are probably talking about a slowdown or about concerns, and you are not taking that into account. So this model tells you: whenever you look for uncertainty, take into account that in the same context appear sluggish, slowdown, downturn, turmoil, crisis. With the dictionaries we create using this methodology, we can run a more accurate search to get the data we want to analyze.

In the two-dimensional graph, each point is a word, OK? You can see the words that appear in the same context as uncertainty and the words that appear in the same context as recession, and from the different colors you can see that the two dictionaries are more or less well separated; they are not mixed. It means that when I look for uncertainty, in some cases the firms are talking about a recession, but not in all of them, so you can run two different queries and get the data for each.

Good. What happens if we apply these two dictionaries to our data? Here you see the evolution over time of the recession dictionary and the uncertainty dictionary, where lower values mean more words related to uncertainty and recession; in order to compare them with GDP, we simply inverted the series. What you can see here is that, again, they replicate activity, but they give us more information. For example, in these particular graphs there is something I find quite interesting. If you look at the uncertainty graph, you see that it runs below activity. Recession seems to stay in line with activity, but uncertainty from 2015 onward falls below it. Why is that? Because we are living in a world where uncertainty is increasing exponentially: think of trade wars, think of the rise of protectionism, all the geopolitical tensions we have. This is reflected not in activity, but in the concerns of firms and companies. So you can see how it is having an impact on how companies view their performance.

But this is not just at the aggregate level; it also replicates activity at the sector level, as you can see here. You have some example graphs where the indicator mimics activity, and also a heat map indicating where you should pay attention. So you see the global financial crisis and how all the sectors were affected when they spoke about recession. And you see, for example, that in the financial sector it was not just during the crisis: it lasted two more years, in which they kept speaking about the effects of the recession. You also see, in 2001, the dot-com crisis in the US. So imagine that you have all this data and you can build your dashboard with all the metrics you want to monitor, automated, giving you all the insights that are relevant for your business.
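A minimal sketch of this dictionary-expansion step, assuming you already have a word2vec model trained on the filings (w2v below) and aggregated token lists per quarter (tokens_by_quarter is hypothetical): expand a seed term via its nearest neighbours, then count dictionary hits over time. Names, the topn value, and the aggregation scheme are illustrative.

```python
from collections import Counter

def build_dictionary(w2v, seed: str, topn: int = 20) -> set[str]:
    """Seed term plus its nearest neighbours in embedding space."""
    neighbours = [word for word, _ in w2v.wv.most_similar(seed, topn=topn)]
    return {seed, *neighbours}

def dictionary_share(tokens: list[str], dictionary: set[str]) -> float:
    """Fraction of a document's words that hit the dictionary."""
    counts = Counter(tokens)
    hits = sum(counts[w] for w in dictionary)
    return hits / max(sum(counts.values()), 1)

# Hypothetical usage: one aggregated token list per quarter.
# recession_dict = build_dictionary(w2v, "recession")
# series = {q: dictionary_share(tokens, recession_dict)
#           for q, tokens in tokens_by_quarter.items()}
```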
Finally, another thing we studied is risk. This is another important issue that we want to monitor. Here in the timeline, you can see the evolution of the risk dictionary in the documents compared with the VIX, OK, the volatility index. And in the word cloud you can see the idiosyncratic dictionaries that we created for each sector of activity. The question is whether, when US companies in the financial sector speak about risk, they mean the same thing as companies in the retail sector. Probably there is some part of risk that is common and global and can be identified in all of them, but probably there are other parts that are more idiosyncratic, and we would like to capture that information. So in the word cloud you can see, in the center, the words that are more common to all sectors, and around them, smaller, the more idiosyncratic words that appear together with risk in each sector of activity. To make it clearer, in this graph you see those idiosyncratic words for each sector of activity. And what are the main highlights? In the case of the financial sector, finance and insurance, banking and insurance, we found that risk is really related to global issues: all its terms are related to global conditions. But in the case of the rest of the sectors, like manufacturing and retail trade, what we found is that the risk terms are very much in line with the economic cycle and economic conditions. So you can understand that the risks that appear are different depending on the sector of activity you are monitoring.

And finally, after analyzing all of this, we have really granular data from which to extract different kinds of insights according to business needs. And we said: the only thing I don't like about this data is the frequency, because it is quarterly. When we speak about data, about big data, we want it at a higher frequency, right? So we said: OK, we have all this information, which is real and should be reliable, because companies are required to submit it to the Securities and Exchange Commission. But let's see whether there are other big data sources that could complement it on the frequency side, because if I want to have an early warning indicator, or to see whether the probability of a recession is rising, I prefer higher-frequency data. So what we did is this: there is an indicator by Baker et al., well known in economics, that measures uncertainty in the media. They track, I think, around 70 different newspapers, look for where the word uncertainty appears, and construct an indicator over time. And we said: OK, from the US companies we also have an indicator of uncertainty; let's see how they relate. You can see here that, again, the correlation is really high. And that made me think: we have a database that we have been exploiting for a long time, where we monitor media information, news articles in more than 65 different languages. So we said: OK, let's replicate this same exercise, getting the data for uncertainty but at high frequency, on a daily basis or even in real time. The only caveat of this database is that it is only available since 2016. So we compared it with the Baker et al. indicator, which has a long history.
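One plausible way to separate common from idiosyncratic risk vocabulary, sketched below under the assumption that you have one word2vec model per sector (models_by_sector is hypothetical, not the team's stated method): take each sector's neighbours of "risk", call the words shared by every sector common, and treat the rest as idiosyncratic.

```python
def risk_neighbours(w2v, topn: int = 30) -> set[str]:
    """Words appearing in the same context as 'risk' in one sector's model."""
    return {word for word, _ in w2v.wv.most_similar("risk", topn=topn)}

def split_risk_vocab(models_by_sector: dict) -> tuple[set, dict]:
    """Return (common terms, idiosyncratic terms per sector)."""
    per_sector = {s: risk_neighbours(m) for s, m in models_by_sector.items()}
    common = set.intersection(*per_sector.values())
    idiosyncratic = {s: terms - common for s, terms in per_sector.items()}
    return common, idiosyncratic

# Hypothetical usage with pre-trained sector models:
# common, idio = split_risk_vocab({"finance": w2v_fin, "retail": w2v_ret})
# print(sorted(common))          # words shared across sectors
# print(sorted(idio["finance"])) # finance-specific risk vocabulary
```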
And then we said: given that they match really well, we are going to construct another indicator using this high-frequency data. So imagine that you can get the whole picture of how things are evolving in the US economy: if you want to monitor the real information from companies, you get the indicators I explained before, and if you want to complement them with high frequency, we go to the media and analyze that.

And with that, given that I have three minutes left and I want to hear your questions, I'll finish here. Please feel free to ask me anything you need. Thank you so much. Any questions, or was it too much information? Yes? Good question. You can try to replicate this in Spanish, but the problem is the algorithms. For natural language processing and text mining there are not that many packages for other languages, not even sentiment dictionaries. For example, when you take the text and have to do the stemming, that is, reducing each word to its root in order to analyze it and get rid of noise, there are not many packages available. So my advice here is to translate everything first, because translation algorithms are better than most of the other text-mining tools you can apply to other languages. Yes, of course. Yes, there is data from European companies that they submit to public commissions, and you can analyze it. In fact, I think that in most developed countries, and also in many emerging ones, well over half have this kind of data. So if it is in English, it is easily replicated: just change your database and it should work. No more questions? OK, so thank you so much, and I hope you enjoyed the presentation.