Good morning, everyone. Welcome to lecture four of our course, Collective Dynamics of Firms. I see four students here, so I'm not sure what kind of message I should deduce from this. Maybe the message is that no one is really interested in this course; then I stop teaching it, right? That's the only consequence. The loss is then mostly on the side of those who want to get the credit points. You should think about this. Maybe talk a bit to your fellow students about whether this is a message you want to convey or not. If it is not, then you may think of a better strategy. If no one is willing to come to the course, then I certainly will not continue to teach here. What did we do last week? We were talking about distributions, and we addressed the issue of how to specify the parameter dependencies. Our assumption was that a distribution should be described by a mean value and by a variance, and the question was how we should calculate these. There was the maximum likelihood estimation; you should understand the meaning of this. Then we talked about skew distributions and the two best candidates for this, the log-normal and the power law. There are other forms of skew distribution; you should not worry about them. It is simply that these two play a major role in this course, but there are other candidates. We also talked about a specific way of finding out whether some values belong to a distribution or not. That was the Zipf plot, where we had to stop at the very end. You understood that it was a way of projecting out the deviations of the large values from the distribution. We are going to improve our slides on this. I thought of a little figure to illustrate why the rank is linked to the cumulative distribution; that was basically the meaning of it. Then, most importantly, we also focused on the stylized facts about firm size and firm growth. We started with firm size. We found that firm size follows a skew distribution.
This was not very impressive, but then we discussed why we cannot be more specific about it. I hope you recall this: it depends on the data set that we have, or it depends a bit on the way it is analyzed. I did not finish with the second stylized fact, so I will continue with this in a moment, and then we will also continue to talk about these growth rates. Let me switch back to the stylized facts about firm growth; that is where we stopped last week. We first have to define what a growth rate is in order to formulate the stylized fact. Growth is basically the change of the size over the years. That means, from my perspective, I would define the growth rate like this: the change of the size between consecutive years, t minus delta t, with delta t as one year. That is the size of the previous year; that is the size of this year. We divide by the unit of time, or by the time interval, which for firm data is in most cases one year. This growth rate has a unit, something per time. If we measure growth in number of employees, then this is change of employees over time. There is a time unit involved. If you come from physics or chemistry, you know a rate is the number of events per time unit. This is not very convenient in most cases. Therefore, instead of defining this absolute growth rate, people define a relative growth rate, which is the ratio between the sizes of consecutive years. You see there is a formal relation, if I define it like this, to the absolute growth rate. That means, if you are given the relative growth rate, this capital R, then you are able to recover the absolute growth rate. The most common way of using it is the log growth rate. That means I take the logarithm of this, and then I end up with a small r, which is the logarithm of that ratio. You see how it is related to the size of the firm in different years and to the absolute growth rate.
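The lecture works with R, but the three definitions can be sketched in a few lines of Python; the firm sizes here, 100 and 120 employees, are made-up numbers purely for illustration:

```python
# Sketch of the three growth-rate definitions from the lecture
# (the numerical example, 100 -> 120 employees, is hypothetical).
import math

def absolute_growth(x_prev, x_curr, dt=1.0):
    """Absolute growth rate: change of size per time unit (carries a unit)."""
    return (x_curr - x_prev) / dt

def relative_growth(x_prev, x_curr):
    """Relative growth rate R: ratio of sizes in consecutive years (dimensionless)."""
    return x_curr / x_prev

def log_growth(x_prev, x_curr):
    """Log growth rate r = ln(R) = ln(x_t) - ln(x_{t - dt})."""
    return math.log(x_curr / x_prev)

print(absolute_growth(100, 120))  # 20.0 employees per year
print(relative_growth(100, 120))  # 1.2
print(log_growth(100, 120))       # ln(1.2), roughly 0.18
```

You can see the formal relation directly: r is just the logarithm of R, and R can be recovered from the absolute growth rate and the previous size.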
I want you to think about this in a moment, and also when you recapitulate the lecture at home, because in most cases the small r is used, and in some cases the capital R. We have to make sure that we always talk about the same definition of the growth rate if we want to compare with the data. The natural choice is this one. Now, what do we find if we analyze growth rates like this? We find that growth rates follow a Laplacian distribution. That is stylized fact number two, and it is a well-confirmed fact. What is a Laplacian distribution? It is a symmetric distribution, this double exponential distribution that I have plotted here. You see there is an exponent, and in the exponent there is an absolute difference. Mu is the mean value, x is the growth rate, and b is a scale parameter related to the variance. Because of this absolute value, it should be symmetric, at least in theory. We will find out later whether this is true or not. This stylized fact is different from stylized fact number one, because we talk about a different distribution, but also because we are very specific about the distribution. In stylized fact number one, we could only agree that there is a skew distribution, but we did not agree on what kind of skew distribution. Here it is different: we are precise about the kind of distribution. That is a noticeable difference, and no one disputes that this is the right distribution. Here are a few examples where it holds on the aggregated, but also on disaggregated sectors. If we plot the Laplacian distribution with a logarithmic density axis — that is the same distribution as before — then it looks like a tent. You see this? This is not a double logarithmic plot; it is a semi-logarithmic plot. Anyway, it looks like a tent. This is very characteristic. You see that this shape of the growth distribution is the same for different sectors of the industry. Printing and furniture are the examples here.
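A minimal sketch of the Laplacian density from the slide, p(x) = exp(-|x - mu| / b) / (2b), where b is the scale parameter (the variance works out to 2b²). It is symmetric around mu, and on a semi-logarithmic plot the log-density is linear in |x - mu|, which is exactly the tent shape:

```python
import math

def laplace_pdf(x, mu=0.0, b=1.0):
    """Double exponential (Laplace) density: exp(-|x - mu| / b) / (2 b)."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

# Symmetry around mu: the density at mu + d equals the density at mu - d.
print(laplace_pdf(0.5), laplace_pdf(-0.5))  # identical values

# On a semi-log plot the log-density falls off linearly with slope -1/b:
print(math.log(laplace_pdf(1.0)) - math.log(laplace_pdf(0.0)))  # -1.0 for b = 1
```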
This can be a bit of a surprise to you, because why should the growth rates of such very different industries be the same? The first issue: in order to project this out from the data, one has to do a little bit of rescaling, and we will talk about this in more detail today. You see, instead of these absolute growth rates, they have plotted here the difference of the small s, and the small s is the logarithm of the capital S. S is a measure of the size; in this particular case, size was measured by sales. We start from sales numbers, then we take the logarithm of these sales numbers to get the small s, and then the difference of the logarithms is calculated. If you go back to the slides where I define the relative growth rates, you see it is the same, except that instead of the x we use the small s here. But this, at the same time, is also normalized against the mean value. This is very important, because otherwise you would not have this symmetric distribution that you can compare. That is the first surprising fact: it holds, obviously, for these different sectors. But what does it mean? What is the difference between a Laplacian distribution as we have it here, a double exponential distribution, and a normal distribution? Who wants to comment on this? What would a normal distribution look like in this graph? Parabolic, yes, and it is also symmetric, that is correct. But what would be the most noticeable difference? Who has a clue? The normal distribution would be here, very much centered around zero, with a very small variance, in this area here only. If you just look at the small growth rates, you would probably not be able to distinguish between a normal distribution and a Laplacian distribution. But if you look into the tails of the distribution, this tent-like shape, then you notice there is a huge difference between the normal distribution and these tent-like tails.
And what is the message of this? A normal distribution with a large standard deviation would also look symmetric, but it would not have these straight lines; that is number one. And number two: of course, we can define a normal distribution where the variance is 100 times larger than the average value, but what would be the meaning of that distribution? That is another question. What we see from this tent-like shape is that large negative and large positive growth rates are much more likely than in the normal, or random, case. That is the message. Having large negative or large positive growth rates is much more likely than drawing a random number. That is an interesting finding from an economic perspective, because then you have to find reasons for explaining these large growth rates. We will come to this in a moment, but please understand and recapitulate: it is symmetric, there are these huge deviations from the normal distribution, and it takes a bit of data rescaling to get this nice plot. It is not simply like this, yeah? Okay. With this, we go back to the lecture of today. Before we start further analyzing the growth rates — that is what we do today — we have to look into tests for the distribution. It is extremely nice that we now know there is a Laplacian distribution underlying the growth rates. But how can we verify this? You have the data in your self-study exercise, and you have to do this yourself. Can you find out whether this is a Laplacian distribution or not? Of course, you can plot the data, but there should be some test to tell you whether this is a Laplacian distribution or something else, right? The test we use here is the Kolmogorov-Smirnov test. Usually, as I wrote in the first bullet points, you test distributions by creating a null hypothesis. That is the hypothesis that you want to reject.
That means, if you want to go for the growth rates, your null hypothesis should not be the Laplacian distribution itself. The null hypothesis could be the normal distribution, which should then be rejected. If your null hypothesis is the Laplacian distribution, then you probably cannot reject it, right? Because, as we now know, there is a Laplacian distribution behind it. But you can also not conclude that this is the distribution. What you do instead is argue the other way around: you keep all the other candidates in mind, you take them to the Kolmogorov-Smirnov test, and round by round you reject the other candidates. This is what you do. Please understand this very carefully: you do not prove that some data follows a distribution. This is not what you do. The Kolmogorov-Smirnov test tells you that some data does not follow some other distribution. This is something you really have to understand; this is important. The null hypothesis is the hypothesis that you have to reject. As was already mentioned, the standard null hypothesis is of course random growth, which would be according to the normal distribution. I mention on this slide that there are other useful tests, which we do not discuss in this lecture. We are not going to apply these. We are not even going to give you all the mathematical details of the Kolmogorov-Smirnov test. Instead, we want you to use it, without going fully through the statistical derivation. I had an argument with Pavlin last night — some of you may have seen that we had other slides prepared for the Kolmogorov-Smirnov test — but I thought this is not really needed. It is not needed that you go through the statistical details of the test. It is much more important that you understand: this test is available, I can apply it to my data with the following R command, and then I get an output that I am able to interpret. That is what we concentrate on.
The basic idea of the Kolmogorov-Smirnov test is to compare an empirical distribution with a theoretical distribution, which would then be your null hypothesis. You compare these two distributions not on the level of the density, but on the level of the cumulative distribution. Then there is a difference between the one-sample and the two-sample Kolmogorov-Smirnov test. You should also understand this. The one-sample Kolmogorov-Smirnov test compares empirical data to some theory. The two-sample Kolmogorov-Smirnov test compares samples with each other. That is the difference. If you have two samples and you would like to know whether they come from the same underlying distribution, then you use the two-sample Kolmogorov-Smirnov test. For what we discuss today, the one-sample Kolmogorov-Smirnov test is the important one. I mentioned that if we just talk about normal distributions, there are more specific tests you should use; I mention two on this slide, just for your reference. For specific classes of distributions there are more specific tests, which may give you better statistics, but the Kolmogorov-Smirnov test is quite general. For the Kolmogorov-Smirnov test, we have to calculate a statistic, which is based on this F. F is the cumulative distribution function. That is something you have to understand: it is not the density, it is the cumulative distribution. There are two. F_n is the empirical one, where you have n observations, and F, without the subscript, is the theoretical distribution function. You already know how to calculate the empirical cumulative distribution function, simply by looking into the number of elements up to a given value x: you count the total number of elements up to the value x, and you divide it by the sample size. That is the way you calculate this.
This F is the distribution you want to test, for example the normal distribution. Then you see, dependent on the value of x, you get a number of differences here, absolute differences. This gives you a set, a bunch of values, and you look into the supremum of this set — strictly it is not the maximum of the differences, but the lowest upper bound, something that comes from above and bounds these differences. If the two distributions are the same, then this D converges to zero. But in most cases, it looks like this. This is the example that we want to use. We have a blue line here; that is the true theoretical cumulative distribution function, the one that we use as the hypothesis. This black line, with the steps, is the empirical distribution function. How is the empirical distribution function calculated? As I said, you count up to a specific value of x: every time you find a value that is below it, you add one. Look here — I hope you can see this. This is your first x that you got from the sample. Then you get a second x. That means the step height is always one: if you find a new x_j that is lower than x, then you make a jump here and add one. There are always steps of one. You see this? If there are two equal values, then of course you have a larger step. What you see here, of course — it is not the same; there is a minor difference, but it looks like a larger jump. Then the question is: what is the maximum difference between these two curves? This is the set of the D values. The Kolmogorov-Smirnov test calculates for you this set of differences and returns its lowest upper bound. If you want a figurative idea, it is basically like the maximum of the difference.
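To make the construction concrete, here is a small Python sketch (not from the lecture slides) of the empirical CDF and the D statistic. Because the empirical CDF is a step function, the supremum is attained at the jump points, so it is enough to check both sides of every jump:

```python
def ecdf(sample, x):
    """Empirical CDF F_n(x): fraction of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(sample, cdf):
    """D = sup_x |F_n(x) - F(x)|.  The empirical CDF is a step function,
    so the supremum is attained at a jump point; check both sides of each."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d,
                abs((i + 1) / n - f),  # just after the jump at x
                abs(i / n - f))        # just before the jump at x
    return d

# Example: a uniform sample against the uniform CDF on [0, 1].
sample = [0.1, 0.3, 0.5, 0.7, 0.9]
print(ks_statistic(sample, lambda x: x))  # 0.1 for this evenly spaced sample
```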
It also has to tell you — and that is what the Kolmogorov-Smirnov test actually does for you — whether this is a significant number or not. Let us assume the maximum difference is 0.01. What does it tell you? You do not know. This is what the test is about: it tells you whether this, compared to the distribution we are talking about, is a significant difference or not. Here is how it is implemented, as an example. This is something you should basically do at home; you can do it by typing exactly what I show you on the slide. We need some empirical data, and to make sure that the data is not according to our null hypothesis, we have chosen a completely different distribution, which in our case is a t-distribution. Now we generate "empirical" — in quotation marks — data, artificial data that obviously follows a different distribution: Student's t-distribution. We do it in a very simple way; this is the only command that we use to generate the data. Very convenient. Then we create a null hypothesis. Remember, the null hypothesis is something that needs to be rejected. Our null hypothesis is the most common one: the data follows a normal distribution. You already know that this is not true, because I told you where the data comes from. In other cases you do not know; you test this distribution. Here you see how the Kolmogorov-Smirnov test works. You have here the sample that comes from the t-distribution, and here your null hypothesis, which is a normal distribution. The output looks like this. You get the maximum D value — which, formally, is not the maximum but the lowest upper bound of the set. Then you get this value here. What is the meaning of this p-value? We explain it on the next slide. The p-value, as you have seen, is a very small number; it is of the order of 10 to the minus 16. We assume for each of these tests that there is a significance level, which in many cases is 0.05.
In statistics, you learn the meaning of the different significance levels; you choose different significance levels for refining your results. Let us assume for our test that the alpha is a small number. If p is less than alpha, it means that there is no significant support for the hypothesis that the sample comes from the normal distribution. That is the meaning. That means you have to reject the null hypothesis. That is the important message: there is no significance for the data following the distribution of the null hypothesis. In our little example, this is not a surprise, because we have chosen a different distribution to generate the data. That is the idea. You see from this that you cannot tell what the real distribution of the data is. The next step is to check other distributions, and so on and so on. You see, if you look at the Kolmogorov-Smirnov test again, this pnorm defines our null hypothesis, but we need two other values here to characterize it, namely the mean and the variance. The mean and the variance have been calculated, in this particular case, from the data. You hopefully remember that if I go for other distributions, then of course these are the two parameters that also determine how the other distribution should look. What you have to do now is go through different distributions and find out: is there a distribution where I cannot reject the null hypothesis? Hopefully you come up with the idea of testing the t-distribution. In order to get there, you first have to plot the data and then look into your math book for what looks like this. Is it symmetric? Is it asymmetric? Does it have only positive values, or also negative values? If it has negative values, then it is obvious that it cannot be the log-normal distribution, and so on. Step by step you reject certain ideas, and then you test the significance of this.
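On the slide this is done with R commands (rt to draw t-distributed data, ks.test with pnorm as the null hypothesis). As a sketch of the same logic without R and without a statistics library, here is a pure-Python version: it draws Student-t data, computes the D statistic against a normal CDF with mean and standard deviation fitted from the data, and compares D with the asymptotic 5% critical value 1.36/sqrt(n). This is an approximation, not the exact p-value R reports, and strictly speaking estimating the parameters from the data calls for the Lilliefors correction, which this sketch ignores:

```python
import math, random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_d(sample, cdf):
    """KS statistic D: supremum distance between empirical and theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
               for i, x in enumerate(xs))

random.seed(42)
df = 3  # degrees of freedom of the "unknown" true distribution
# Student-t sample via T = Z / sqrt(V / df), Z ~ N(0,1), V ~ chi-squared(df):
data = [random.gauss(0.0, 1.0) / math.sqrt(random.gammavariate(df / 2, 2.0) / df)
        for _ in range(5000)]

# Null hypothesis: normal with mean and sd estimated from the data.
mu = sum(data) / len(data)
sd = math.sqrt(sum((v - mu) ** 2 for v in data) / (len(data) - 1))
d = ks_d(data, lambda x: normal_cdf(x, mu, sd))

# Asymptotic 5% critical value; D above it means: reject normality.
critical = 1.36 / math.sqrt(len(data))
print(d, critical, d > critical)
```

The heavy tails of the t-distribution push D well above the critical value, so the null hypothesis of normality is rejected, exactly as in the R example.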
Now, with this knowledge, we go back to stylized fact number two, namely the distribution of the growth rates. There was this finding that we already discussed: there is a tent-like shape in a semi-logarithmic plot — at least one axis should be logarithmic, otherwise you do not see the straight lines. The interesting point is that we are very specific about the distribution; that means it should occur independent of the data sets that we are using. If we go to other sectors of the industry, we should see the same distribution. If we look into small firms and big firms, we should see the same distribution, and so on. Remember what the consequence of this is: a stylized fact means it is a statistical truth, because it is observed in very many different data sets. That is the important message of a stylized fact. Let us now talk about two other dependencies, of which I cover the first today as stylized fact number three, and the second next week. The first one is that the growth rate itself is a function of the size. We are not saying what function it is — we will investigate this by looking into the data — but the growth rate is not independent of the size. You could already have seen this when you look into data that comes from different size classes; I did not show it to you, but there is a plot on the next slide. The second dependency is that the growth rate also depends on the age of the firm. If you are a firm that has been in the market for 200 years already, then you have a different growth rate than a firm that was established two years ago. This will be specified next week in more detail. First, we have to find out the dependency on the size. How do we do this? I go with you step by step through the steps that you have to do in the self-study exercise. I explain it here, and you do it afterwards yourself. Here we have put up the stylized fact.
The volatility of the growth rate decreases with size. That is stylized fact number three. What does it mean? It does not mean that the growth rate decreases with size — you cannot say this. It means that the variance, the deviation in the growth rates, decreases with the firm size. If you are a startup company of 20 people, then your growth rate can fluctuate heavily. Understand: if you get rid of 10 people, then your growth rate is already minus 50 percent. If you instead hire another 10 people, then your growth rate is plus 50 percent. Why is it such a huge number? Because you only have 20 people. That means, if you change the size of the firm a bit on this low level — the number of employees, for example — then you have heavy changes in the growth rate. If you are a company like Nestle and you hire 10 new people, then there is probably no visible effect on the growth rate, because 10 is way too small compared to the whole company. That means even if you have larger changes in the company, because the company is that big, you will not see big fluctuations, negative or positive, in the growth rate. That is the message. Or, as was said correctly, you can better predict the growth of larger companies, and you have problems predicting it for smaller companies. We have to find out what this dependence looks like; that is the important thing. The stylized fact does not give us a precise equation for how it depends; we find that out with the data. I hope you understood that we do not talk about the absolute value of the growth rate; we talk about the deviations, the annual deviations, dependent on size. How do we get this? We do the following — and this is actually what you have to do at home in the self-study exercise. You are given data that looks like this matrix here. In the column here, you have the N firms; I think in our case, N is equal to 10,000. We have the size over different years.
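The arithmetic of the startup example in a few lines (the firm sizes are of course illustrative numbers, not data):

```python
def relative_change(x_prev, x_curr):
    """Relative change in size: (x_t - x_{t-1}) / x_{t-1}."""
    return (x_curr - x_prev) / x_prev

# A 20-person startup losing or hiring 10 people:
print(relative_change(20, 10))          # -0.5   -> minus 50 percent
print(relative_change(20, 30))          #  0.5   -> plus 50 percent

# A 100,000-person firm hiring the same 10 people:
print(relative_change(100000, 100010))  #  0.0001 -> plus 0.01 percent
```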
That means you are given the size of each firm in year one — whatever year one is — so 10,000 different sizes. Then you are given the size of the firm in year two; we are talking about the same firm. From this, you can calculate the growth rate. Of course, there is a change in the number of employees that you can calculate, but as we already discussed on the previous slide, the growth rate in this case is not just defined as the change in the number of employees; there is a ratio between these sizes, and there is the logarithm of this ratio. That is the small r: r is basically the logarithm of the relative growth. Did you get it? That means you then have two values assigned to each firm: the size of the firm, which appears twice, at year one and year two, and the growth rate, which is of course only one value, because you have taken the difference. What you have to do then is pool the data. What do we mean by pool? We neglect the year the data comes from; we treat everything as coming from the same year. We pool it together here, as you see. By this, we ignore the time dependency, and we have a larger statistic. In the next step, you do the following. You take the pooled data, where the year is of course gone as information, and you sort it according to the size. There is a very easy command for this, named sort, which does it for you in R. You get a sorted, pooled vector. You understood why we pooled the data: because we are not interested in the actual year of the size, but in the distribution of all the sizes. That is the first thing. Then we sort it. Because each firm has a particular growth rate, we can of course keep the information of the growth rate with the firm. We sort along the size of the firm, not along the growth rate. That is the important information. Did you get it? Yes. In the third step, we divide this sample of sorted values into size classes.
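The pooling and sorting steps can be sketched in Python (the four firm sizes below are hypothetical toy values; the lecture's sort command in R does the same job):

```python
import math

# Hypothetical toy data: sizes of four firms in two consecutive years.
sizes_y1 = [120.0, 15.0, 3400.0, 80.0]
sizes_y2 = [130.0, 10.0, 3500.0, 95.0]

# One log growth rate per firm, kept paired with the size it refers to:
pairs = [(x1, math.log(x2 / x1)) for x1, x2 in zip(sizes_y1, sizes_y2)]

# Pooling over many year pairs would just extend this list -- the year is
# deliberately forgotten.  Then sort by size, keeping each growth rate
# attached to its firm (the analogue of sort in R):
pairs.sort(key=lambda p: p[0])
print([s for s, _ in pairs])  # [15.0, 80.0, 120.0, 3400.0]
```

The important point is visible in the code: we sort along the size, and the growth rate travels with it.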
What would be a good number of size classes — 3, 10, 50, 10,000? You can think about it. It is the same problem as we already discussed when we talked about binning of data. If you choose a bin too large and everything falls into one size class, or into one bin, then you get a completely flat distribution, because the bin was too large. If you choose a bin too small, then almost no observation falls into each bin, and you get a bunch of delta functions, these spikes, but no real information. The same is true for the size classes. If you have two size classes, then you will see afterwards that you can fit any sort of curve to these two values. If you have 10,000 size classes, then you do not have good statistics. Let us say 20 is a good value. Then you have 10 or 20 of these classes, and you calculate the average size in each size class. For the same firms, you also calculate the standard deviation of the growth rates. You have all these numbers, and for each of the size classes, you calculate the standard deviation of the growth rate. Why do we talk about the standard deviation of the growth rate? Are you still with me? Because that was stylized fact number three, right? Stylized fact number three was talking about the volatility, the standard deviation, and how it depends on size. Therefore, we look into this. This is what we get; that is the result, copied from a paper here. You see people took a log-log plot — we will have a theoretical argument later for why we take a log-log plot — but that is the result: the standard deviation of the growth rate versus the size. You see it is a straight line; there is a linear fit that nicely follows this data. What does it mean? It means that there is a power-law relationship between the size and the standard deviation. The slope here, this beta, can be determined empirically. That is what we do on the next slide. This is the important finding. It is, by the way, a robust finding.
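The size-class step can be sketched as follows. This is a Python illustration with synthetic data, not the lecture's data set: the sizes are drawn log-uniformly and the growth noise is made to shrink with size as x to the minus 0.2, which is the kind of dependence the stylized fact claims, so the per-class standard deviations should fall with the average class size:

```python
import math, random

def size_class_volatility(pairs, n_classes):
    """Split (size, growth rate) pairs into equal-count size classes and
    return (average size, standard deviation of growth rates) per class."""
    pairs = sorted(pairs)          # sort by size; growth rate stays attached
    k = len(pairs) // n_classes    # observations per class
    out = []
    for j in range(n_classes):
        chunk = pairs[j * k:(j + 1) * k]
        sizes = [s for s, _ in chunk]
        rates = [r for _, r in chunk]
        mean_r = sum(rates) / len(rates)
        sigma = math.sqrt(sum((r - mean_r) ** 2 for r in rates)
                          / (len(rates) - 1))
        out.append((sum(sizes) / len(sizes), sigma))
    return out

# Synthetic pooled data (purely illustrative): log-uniform sizes, growth
# noise shrinking with size as x ** -0.2.
random.seed(1)
pairs = [(x, random.gauss(0.0, x ** -0.2))
         for x in (10 ** random.uniform(1, 5) for _ in range(2000))]
vol = size_class_volatility(pairs, 10)
print(vol[0], vol[-1])  # smallest size class is far more volatile than largest
```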
We will see this afterwards. Let us interpret this now. This is a log-log plot: we plot the logarithm of the volatility against the logarithm of the size class — this is the average size of the size class, remember, as we defined on the previous slide. Because we see a straight line, we write up our hypothesis as a straight line on the log-log plot: the logarithm of the volatility equals a constant plus a linear term in the logarithm of the average size of the size class. It is a linear relationship on the logarithmic axes. If we do not take the logarithm, then it looks like this. We rewrite it, and then we have this dependency. The sigma is the volatility, and the j refers to the size class. It depends on the average size of the size class in the way we write here. From here we get to this: that is the power-law dependence. That means the straight line on this plot is exactly equivalent to having a volatility proportional to x_j to the minus beta, where beta is the slope of that graph. Did you get this? Yes. It was a log-log plot and you have seen a straight line in the log-log plot; that means there is a linear relationship on the log-log axes. I just rewrote this linear relationship in this way, and then I end up with this relation between the sigma and the x. This tells me exactly what I was interested in: the sigma is the measure of the volatility, and I told you the volatility depends on the size class — the larger the firm, the smaller the volatility. That is exactly what we found out. That is a particular instance of stylized fact number three. Let us have a break of 10 minutes here; at 10 after 11 we continue. I think we can continue. We already explained how we got this relationship.
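The fit itself is an ordinary least-squares line on the logarithmic axes; here is a Python sketch with a made-up exact power law as input, so the fitted slope must come out as minus beta:

```python
import math

def fit_loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x).  For a power law
    y ~ x ** (-beta), the fitted slope is exactly -beta."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Illustrative input: an exact power law sigma = 2 * x ** -0.2.
xs = [10.0, 100.0, 1000.0, 10000.0]
sigmas = [2.0 * x ** -0.2 for x in xs]
print(-fit_loglog_slope(xs, sigmas))  # 0.2
```

Applied to the per-class average sizes and standard deviations from the previous step, this is exactly how the empirical beta is read off the straight line.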
I hope that you are able to follow this discussion. We get this relationship by artificially increasing our data set, ignoring which year the data comes from. This is an important issue: we have a small sample, and you can still get this relationship if you pool the data the way we have described and keep, for each of these size values, the corresponding growth rate. This is a pair of values: the size in a given year, and the growth rate between this year and the next one, as I described on the slide with the pooled data. This means that we cannot infer any time dependence here, but that is all right, because we are talking about the volatility, or the variance, of the growth rate dependent on the size, which is measured via the size class. You can do the same thing — as I think I wrote in the note here — by looking into the growth rates themselves instead of the volatility of the growth rates. You do the same procedure and try to find out whether there is a dependency on the size of the firm, on the size class. Then you find there is no dependency. Maybe it is a good exercise if you do this at home; you can test it and find out what you see. There is no relation between the growth rate itself and the size. That is important: stylized fact number three is about the volatility of the growth rate and the size class. Please keep in mind that this is an important difference. Then we found this beta, which is approximately 0.2, but not exactly. If it is a stylized fact, then we should find something similar in different data sets. Keep this in mind; we will test it now. I do the same thing again by discussing some of the findings. Let us assume that we did this with different data sets. We first go for the data set of the US companies that I showed you last week, when we talked about this very nice log-normal distribution. You remember that I showed you a data set from US manufacturing firms over 20 years. Remember this?
We are investigating this now again by looking into the one-year growth rates and how volatile they are, dependent on the size class. Remember, this is the finding that we want to discuss: the variance of the growth rate depends on the size class. Here again, just to remind you of the definition of the growth rate: there is a log growth rate, the small r, and s0 is just ln of x, where x is the size of the firm — because in some of the coming slides, in the figures, there is a small s, and that is this s. What we are interested in is the distribution of the growth rate conditional on the size class, or the size in a given year. We do exactly the procedure that I described before. This is the finding — the same finding, if you want, as you have seen on some of the previous slides, but these were from different people. The previous slide was from Bottazzi and Secchi, and this is from Amaral and co-workers. Different data set, different people analyzing it, but the same stylized fact. You see this tent-like shape of the distribution, which reminds you of the Laplacian distribution that we have discussed. That is the first finding. Secondly, you see that dependent on the size class, there are different slopes in this distribution. You see this. They have chosen three size classes here, but in the investigation that I am going to show, they have investigated about 20 size classes; only three are shown, not only three analyzed — that is a different thing. If you show all of them, the figure looks a mess. This is also the expression for this particular distribution function. You notice this double exponential distribution here; you see the difference to the normal distribution — there is the absolute value, of course, and no x squared. Now they did the same exercise to find out how the volatility of the growth rates depends on the size class.
In this picture you already see that for some size classes the volatility is much larger, because the flanks of the distribution, these tails, go much more to the left and to the right. Where we have the little triangles, this is the smallest size class. For the largest size class, where we have the squares, things look a bit different. Already by looking at the distribution of the growth rates you see that for different size classes the volatility has to be different. Now they investigated the same way as we did before and get this kind of plot. You see this dependency here; remember, s0 was defined as ln x. Basically it is the same functional dependence of the volatility on the size as we have discussed before. Does someone remember what Bottazzi found? What was the value, two slides back? 0.19. What do these people find? 0.19. That's a very interesting fact; it's not a trivial thing. Even more interesting, you see they have now proxied size by very different measures. Before, we were talking mostly about employees. They have different measures. The data set that Amaral used was based on sales; we saw the log-normal distribution of the sales data. But they also have other proxies: assets; PPE, that is, plant, property and equipment; cost of goods sold; and the number of employees. What you find is that the relationship between the volatility and the size is always the same, no matter how you proxy the size. This is the important finding. It doesn't change if you go from number of employees to sales value; you find the same pattern. This is why we call it a stylized fact. If we found a completely different pattern for each proxy, then this would not be a robust observation, just something we see only in sales data.
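The estimation of the exponent can also be sketched. This is not the authors' code, just one plausible way to do it on simulated data: bin firms into logarithmic size classes, compute the volatility per class, and regress ln sigma on ln x. The data are generated with beta = 0.2, so the fit should roughly recover that value; all numbers are assumptions.

```python
# A sketch of estimating beta in sigma(x) ~ x^(-beta) from simulated data:
# bin firms into logarithmic size classes, compute the volatility per class,
# and fit a straight line to ln sigma versus ln x.
import numpy as np

rng = np.random.default_rng(2)
beta_true = 0.2
x = rng.lognormal(3.0, 1.5, size=200_000)      # firm sizes
r = rng.laplace(0.0, 0.1 * x**(-beta_true))    # growth rates with sigma ~ x^-beta

# Logarithmic size classes; skip sparsely populated bins
edges = np.logspace(np.log10(x.min()), np.log10(x.max()), 21)
mids, sig = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x < hi)
    if mask.sum() > 100:
        mids.append(np.sqrt(lo * hi))          # geometric midpoint of the class
        sig.append(r[mask].std())              # volatility of the class

slope, intercept = np.polyfit(np.log(mids), np.log(sig), 1)
print(-slope)                                  # estimate of beta, near 0.2
```

The negative slope of the log-log regression is the beta; on the real data this is the 0.19 quoted on the slides.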
Here you see we find this always, no matter whether we proxy size by sales, number of employees, asset value, or something like that. That is the meaning of a robust observation. You see that the beta is also more or less the same, which means the dependency is constant across all of these proxies. That's the first important message: it does not really depend on the proxy of the size. There is another interesting finding. It could, of course, be different for different markets or for different business areas of the firms. If you are a pharmaceutical firm, why should the growth rate depend on size in a similar way as for firms producing, say, computer hardware? This is a non-trivial thing. What they find here is that if you go for different sectors of manufacturing, which means basically firms in very different markets, you always find this relationship. I think in the notes I wrote what the difference between the sectors is: as I see it, between 2000 and 3000 is manufacturing of food, textile, paper, chemicals, whereas 3000 to 4000 is electronics, transportation equipment, primary materials, machinery and so on. This is a non-trivial thing. The economist tells you that of course your growth rate depends on various macroeconomic conditions. It also depends on your competitors how much you can grow, which means it depends on the market you are in. If you are in a market that is not saturated, like, let's assume, some eastern European markets, then you may expect much larger growth rates than doing the same business in the south of France, where the market has been established for a very long time already. The interesting finding is that if you look into the change, and this means basically also the predictability of how firms grow, you find a similar relationship for all of these: across different markets, across different proxies of the size. Now we go one step further.
We want to understand in detail what the dependency on the size is. We already discussed sigma proportional to x to the minus beta. That is the equation you should keep in mind; if you want to remember one equation, then that's the one. We want to use this finding now to rescale the data. This is a technique that tells us whether we have understood something of the underlying distribution or not. This is our reference figure that I want you to keep in mind. Whether I get this slope or that slope depends on the size class. But I know already from this finding how the volatility depends on the size class. Then I can use this relation to rescale my volatility for the different size classes: instead of a sigma that just depends on the size class, I put in the result I just obtained. That's what we see here. We have rescaled the distribution function by making use of the sigma-dependent-on-x relationship. Basically, we normalize small and large firms. We put them into the same distribution by taking into account that small firms have a much larger variance of the growth than large firms. Then you see something very nice. It is the same data as in the figure I showed you before, but now all curves collapse onto the same curve. This is called a master curve. It means we have understood how the growth rate depends on the size; otherwise we wouldn't get this collapse. No matter whether I talk about large firms or small firms, the growth rates for all of them follow the same distribution. In order to get this, we had to rescale our value, which is the log of the growth rate. We scale it, as you see, by sigma as a function of s, where s was the proxy for the size, ln x. If we scale it by this, then all the curves come together. Of course, you have seen that the maximum was also different.
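The rescaling step itself can be sketched on simulated data. The exponent and prefactor are assumed known here; empirically they come from the fitted sigma(x) ~ x^(-beta) relation. Dividing every growth rate by the volatility of its size class should make small and large firms follow one and the same master curve.

```python
# A sketch of the rescaling on simulated data (exponent and prefactor assumed):
# dividing every growth rate by the volatility of its size class puts small
# and large firms onto the same, unit-variance distribution.
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(3.0, 1.5, size=100_000)
b = 0.1 * x**(-0.2)                   # Laplace scale, shrinking with size
r = rng.laplace(0.0, b)               # growth rates: small firms more volatile

r_scaled = r / (b * np.sqrt(2))       # Laplace std is sqrt(2) * b

# After rescaling, small and large firms have the same (unit) spread
small = r_scaled[x < np.median(x)]
large = r_scaled[x >= np.median(x)]
print(small.std(), large.std())       # both close to 1
```

Plotting the histogram of the rescaled growth rates separately for each size class would show all of them lying on top of each other, which is the collapse shown on the slide.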
The maximum depends on which size class you are in. This has also been rescaled by the volatility, and then everything gets to the same height. So you understand what the meaning of scaling is. We found, first of all, that the distribution depends on two parameters, the mean value and the variance, or volatility. Then we understood how the volatility changes if I consider different size classes, and we put this into the dependence of the distribution on the variance. That means we rescale the two important values in order to match everything onto a master curve. That is the idea, and only if we see that all the data points collapse onto one curve do we claim that we have understood something. Let me take a completely different example to show you that we are talking about, if you want, a fundamental way of looking into data. I take one example from physics. Let's assume you are doing experiments. You are in a lab and you experiment with some chemicals. Then you have various ways of controlling the experiment. You can, for example, change the temperature, higher or lower; you can change the initial concentration of the chemical you put in, and then you see that sometimes the process is faster, sometimes slower, and so on. That means you have different variables, or parameters, that you can change, boundary conditions of your experiment. Then you want to understand how the outcome really depends on what you have changed. There is, of course, some dependency, for example, on the number of particles that you have observed, or on the size of the system, these kinds of things. There is also some dependency on the initial condition, like the initial concentration.
But you have to find out how the whole experiment really depends on this. The idea is to use these kinds of scaling relations to find this out. Usually you test only for a particular variable. That's what I wrote here: you have an idea of how the whole thing depends on one particular variable, then you try to get this variable out of the system, then you scale for other variables, and so on. You try different dependencies, like logarithms or exponents, these kinds of things. This is one example that I have taken from a physics book that I own. It's not important what the experiment behind it is; we don't care. In this particular case, you measured some quantity c experimentally, dependent on another parameter n. You do this experiment for n equals 100, 300 and so on, and you get these different curves. Here you plot ln c versus ln r. I won't talk about what r and c mean; it doesn't matter. It's the same problem as we discussed before. You see that if you vary n, you get these different curves. Now you need to understand what the influence of n is on the output. That means you try to find out how these different curves depend on n. No one tells you, but you can make a number of guesses that you then try. If you find the right dependency, then all these curves simply collapse onto one master curve. This is the visual confirmation that you got it right, because these ten curves, or however many there were, collapse onto one master curve. Why did they collapse? Because you rescaled your two axes in a way that depends on n, in the right way. Here you had to put in n to the minus 1 over d, whatever d is; that was another parameter that you could adjust. I think in the notes, in case you cannot read this here, we wrote what you have to use for the scaling. It's not trivial to guess this dependence, I agree. It's not as simple as what we have seen before.
It is a more complicated dependence. But once you find this dependence on n, how r and c depend on the influence of n, then you can scale the axes by this influence, and everything collapses onto a master curve. This is basically the visual proof that you understood how the whole process depends on n, because you were able to get rid of n. It is the same as the example we used in the very beginning, when we talked about the men's and women's shoe size distributions. After we understood how the shoe size distribution depends on gender, we were able to remove the influence of gender from the distribution. Here it is the same: after we understood how these processes depend on n, we were able to remove n from the figure by rescaling the axes the right way and, at the end, simply get this master curve. Then we understand what we are talking about. In our case, the relation is a bit simpler than this one, but the idea is the same. The idea is shown here on this slide: all our data collapses onto one master curve. In order to get this, we have to rescale the x and the y axes the right way. This is what I wrote here. In our case, I also described precisely how we knew how to scale it, because we had found before how sigma depends on s: this was x to the minus beta. This knowledge was used to rescale the data axes. With this, let me come to the last example. Nicely enough, we finish a bit earlier today. If this size dependence of the growth rate is a robust observation for firms, then we can also ask how robust it is for other economic entities: do we see similar things or not? That's an interesting question. Once you have this idea, you can test other data sets as well. You have already seen that Bottazzi and Amaral found the same relationship in different data.
In Bottazzi's case, if I recall correctly, it was also US data, so you could say, well, maybe that's country-specific or something like that. It's not, but that argument you could probably bring up. You can also look into other data sets. You can look into specific markets only, where you distinguish the SIC numbers; you have already seen that this doesn't play a huge role. You can use different proxies; you have already seen that this doesn't play a role either. And you can also go for other economic entities. If you think of a country instead of a firm, then people have tested how the growth rate of a country, or rather the volatility of the growth rate, depends on the size of the country. I come to the results here already. The second example is the growth rates of R&D expenditure of US universities. US universities are another economic entity that decides every year how much to spend on R&D, research and development. This of course depends on the income of the university, the endowments that they received in the previous year, the interest that they got on their existing endowments, and all these kinds of things. So it's hard to predict. The question is how the growth rate of these universities changes over time. And the same question applies to countries. You look into a country like Greece, and you see it has a negative growth rate for a couple of years. But this was not always the case; in other years it had a positive growth rate. That means you have a time series of growth rates, and you can look into how it depends on the GDP as a measure of the size of the country. I'm not sure if people have used the number of inhabitants, but that would be a bad proxy, I have to say. GDP is certainly a better proxy for the size of a country. GDP means gross domestic product.
And here you see how these growth rates for countries depend on size. Here we only distinguish two different size classes, small GDP and large GDP, and we do not look into the growth rate directly; we look into the volatility of the growth rate, that means the predictability. And this is what we get. It is very interesting to notice that for GDP we find, first of all, the same relationship between the volatility and the size as for the other economic entities. That's the first finding. And even the scaling exponent, the beta, is about the same: before we talked about 0.19, and here we see 0.15. That's very close. And in the other example, this one here, they looked into the growth rates of universities. You see the tent shape is the same. The exponent is a bit larger, which means it is even harder to predict, because you can have a larger range of growth rates over the year. But apart from this, everything is the same. That's a nice finding. You would not expect this; you can be a bit surprised. What is the underlying dynamics that makes countries and firms and universities so similar with respect to this one feature? Of course, in this course we will address this question after we have understood everything about the empirics, and we will ask: what kind of dynamics is really underlying these data? There is some kind of universal dynamics that captures the generic features of a growth process, independent of whether I talk about a firm or a country. And we are able to highlight this dynamics, and of course we understand why it creates this kind of universality. That's the idea. I hope that everyone got this message, and I think this is also the last message I have for today. Or not. So, yes: your self-study task is exactly what I have described today. You have to bin the data. You have to plot histograms. You have to test the distribution.
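For the distribution test, the course software does everything in one command (in R that would be ks.test). Just to give a feeling for what that command computes, here is a hand-made sketch in Python on simulated growth rates: the Kolmogorov-Smirnov statistic D is the largest distance between the empirical CDF and a fitted model CDF, here a Laplace. All data and numbers are assumptions.

```python
# A hand-made Kolmogorov-Smirnov statistic (sketch on simulated data): D is
# the largest vertical distance between the empirical CDF of the sample and
# the CDF of a fitted Laplace distribution.
import numpy as np

rng = np.random.default_rng(4)
r = np.sort(rng.laplace(0.0, 0.1, size=2_000))

# Fit the Laplace parameters, then evaluate its CDF at the sorted sample
mu, b = np.median(r), np.mean(np.abs(r - np.median(r)))
cdf = np.where(r < mu,
               0.5 * np.exp((r - mu) / b),
               1.0 - 0.5 * np.exp(-(r - mu) / b))

# D = largest distance between empirical CDF and model CDF
n = len(r)
D = max(np.max(np.arange(1, n + 1) / n - cdf),
        np.max(cdf - np.arange(0, n) / n))
print(D, 1.36 / np.sqrt(n))   # compare D with the approximate 5% critical value
```

If D stays below the critical value, the test gives you no evidence against the fitted distribution; interpreting that outcome is what matters for the self-study task, not the internals.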
As I have said before, you do not need to understand how the Kolmogorov-Smirnov test works precisely. You should recall what the command is that you have to type in; that's the important thing. And then you should be able to interpret the outcome: if R tells you the result, you should be able to interpret it. This is the idea, and that's also something we will certainly ask in the exam, just for your guidance. Then we come to these questions here, where I'd like to recommend that you pay some attention, because only if you are able to answer these questions can we make sure that you also follow the course to some extent. We will have our first online test very soon. We are now after lecture four, which means that in the next week you have to do the online test. The online test is a simple test. It is an open-book test. You can do it at home, and you can spend as much time as you like on it. It is simply for us and for you to get an idea of whether you are following the course or not. You have to pass the online test in order to take the exam, which is obvious, right? I cannot imagine that someone who fails the online test completely turns everything around in the exam; I would be highly surprised. Regarding the exercise, Pavlin told me that there are not more people there than here. Obviously, the solution you found is that you send a representative of your group who then has to volunteer to do the exercise. If you think that this is the best way, then I don't know. But what we would like to achieve is that you take this as a chance to discuss more about the lecture, more about the findings, more about R; that you use this really as a way to improve your knowledge, to double-check, to ask questions, also to your fellows. Yes, Pavlin told me this. I am fine with changing the time slot of the exercise; there is no problem with that. Every time slot is a compromise.
The question is not whether we can change the time slot; of course we can. The question is whether we find a better compromise. Maybe we find another time slot, but then you cannot attend. That's maybe the problem. Pavlin has opened this discussion, as far as I understood, so you can make suggestions if you like, and we will find a time slot where more of you can attend. That's fine with me. This is a seminar for you, not for us; you have to understand this. That means we want to offer you the possibility to get something out of it. We can certainly adjust and find another room. But I want to repeat: you shouldn't see these exercises as another way to push you. We do not want you to spend more time and effort on this; we want you to really learn something from the course. That is why we have structured the course the way we did. It is not to squeeze you and be nasty enough that you will always remember how much you had to work for this course. This is not what we have in mind. We want to make sure that you draw some useful knowledge out of this. That is the important thing. And therefore we want you to do these practical exercises. Later on, if you are in industry or academia and you have to analyze data, then you immediately know: okay, I got used to this program, and even if it is a different data set, I can try. And then you educate yourself. This is the first step: that you get confident, that you are able to do this, and that you get something out of it. And if you want to do your PhD, then, of course, you learn what to do with this, how to test a model against data, and all these kinds of things. It's useful for everyone. Okay, with this, I stop here. I will see you next week, and we will continue then with the last stylized fact. Afterwards, we move into modeling. Okay, thanks.