This paper will be a bit different from what you usually see at these kinds of conferences: we're not trying to estimate a causal effect or show some interesting correlations that might be indicative of one, but rather to make a methodological contribution showing what is possible with the new data that are available electronically nowadays.

The main motivation for the paper is that, as you all know, migration data are highly problematic. If you want to look at international migration between several countries at the same time, there are only a few sources with which you can reliably do that. You probably know most of them: the World Bank has nice data, and there is for example the OECD migration panel data that we use here. Even in those data sets where people have made a great effort to standardize definitions and so on, you have problems with inconsistencies, the data come out two or three years at best after migration has happened, and for some countries you get no information at all because they don't provide data that can be put into these kinds of formats.

At the same time, there are more and more geolocated data generated by smartphones and by activity on the internet. Think, for example, of sending a tweet, which, unless you turn off the setting, will be georeferenced, so you know the exact location of the person who just tweeted. Something similar is possible with Google, and that is what we will be using. And as you probably know, migrants try to gather information: they don't only ask friends, acquaintances and family; young people especially, in cities or in places which are electrified, look things up on the internet. There isn't much research on that yet, but I think it is very interesting how people obtain this kind of information.

So the research question can be summarized as: is online search behavior in some way predictive of migration moves from the origin countries where people search for information to destination countries, and might we be able to develop from it some kind of proxy for the demand for emigration, or the demand for information about migration possibilities?

In our concrete case we use Google Trends. If you don't know it yet, it is basically a tool that summarizes Google search volumes across the world. On the standard setting you enter a word; here it is "visa", which is somewhat problematic because it can be confounded with the financial service. You then get a map of search volumes in a particular year, where darker colors mean there have been more searches for that term, and you can also download the time series behind that data, down to a weekly level if you need it. This is an annual snapshot, which gives you variation over time and some information about search behavior in a particular country. You can also zoom in, but we're not going to do that today.

Potentially, people could Google hundreds of thousands of things. You could actually take the Merriam-Webster, or another big dictionary, and look through all the words which are somehow related to migration. That is a problem, of course, because we have annual migration data: we would end up with many more potential regressors, many more time series for individual keywords, than observations, and that just doesn't work.
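As described above, behind each Google Trends map you can download the underlying search-volume time series. Before turning to how we narrow down the keyword list, here is a minimal sketch of what pulling such a series could look like programmatically. The talk only says the keyword list was later run through software, so the client used here (the unofficial pytrends package), the example keywords and the country code are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: pull weekly search-volume series for a few keywords in one
# origin country and aggregate them to the annual frequency of the migration
# panel. Keywords, country code and the pytrends client are assumptions.
import pandas as pd
from pytrends.request import TrendReq

KEYWORDS = ["visa", "emigration", "work permit"]   # illustrative subset only
ORIGIN = "NP"                                      # e.g. Nepal, ISO-2 code

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(KEYWORDS, geo=ORIGIN, timeframe="2004-01-01 2013-12-31")
weekly = pytrends.interest_over_time()             # index 0-100, weekly

# Aggregate to annual averages so the series can be merged with yearly flows.
annual = (weekly.drop(columns="isPartial")
                .resample("YS").mean()
                .rename_axis("year"))
print(annual.head())
```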
To reduce the number of potential keywords, and to be sure we don't cherry-pick the words whose search volumes we include, we turned to a website called Semantic Link. It is based on a large corpus of written works and looks at how strongly particular words are correlated within that corpus. We entered the word "migration" and took the words that are most heavily correlated with it. I guess most of you can't read the slide, but these are terms like emigration, visas, undocumented, quotas, multiculturalism. This is another example, for "immigration", and you'll see in a second what you get for "emigration". This helps us narrow the list down to a manageable set of keywords. We then translated these into French and Spanish as well, always including different spellings (British and American), plural and singular forms and so on, and ran them through software to get the time series data from the website.

So this is our keyword list. We didn't only use migration-related words; we also used economic terms, because economic issues can indicate either a reason to be attracted to a foreign country or a reason to leave your own. There is a whole list of words, and the approach doesn't depend on any particular one of them.

We then took the OECD international migration panel data as a yearly panel. Since we started this project two years ago, we are still using the data up to 2013. Part of the reason is that we would like to run a proper out-of-sample experiment, maybe next month or next year, using the new data that have just come out, to see whether our approach still works. In this data set, if you haven't worked with it, you get migration numbers per year from almost all countries in the world to 33 OECD countries. A few OECD countries are not well covered because, for example, they don't have good migration data, but in general the data are as good as it gets, I think, for this kind of international cross-country regression. We combine that with the World Development Indicators, to get some idea of how widespread internet use and literacy are in the countries of origin; with a nice data set by Jacques Melitz and Farid Toubal on spoken languages in different countries, to see whether, for example, English is actually spoken in the countries where we think our English keywords should be working; and then, of course, the usual variables like distance and so on, which make a difference for migration flows.

We test two specifications. The first is basically a panel fixed-effects regression, where the dependent variable is the log inflow into the OECD as a whole for each foreign nationality. You can do the same for each destination country; it doesn't change much. Then we have a vector of the time series for each search term we include; origin-specific control variables, as I said, such as population size, GDP and so on; destination-specific control variables such as the GDP or growth rate of the OECD, indicating how attractive it is at any particular moment; fixed effects for the country of origin; time fixed effects per year; and an error term.
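As a rough illustration of this first specification, here is a sketch of a log-inflow regression with origin and year fixed effects using statsmodels. The file name and column names (panel_origin_year.csv, inflow, kw_*, and the controls) are hypothetical stand-ins for the paper's actual variables, and the control set is deliberately abbreviated.

```python
# Sketch of specification one: log inflow per origin and year on keyword
# search volumes plus origin-specific controls, with origin and year fixed
# effects. All variable names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel_origin_year.csv")          # hypothetical input file
df["log_inflow"] = np.log(df["inflow"])

keyword_cols = [c for c in df.columns if c.startswith("kw_")]
controls = ["log_population", "log_gdp_pc", "internet_users"]

formula = ("log_inflow ~ " + " + ".join(keyword_cols + controls)
           + " + C(origin) + C(year)")             # origin and year fixed effects
fit = smf.ols(formula, data=df).fit()
print(fit.summary())
```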
Our second specification is closer to what is used in forecasting. It basically puts us against the benchmark of simply expecting last year's inflow of migrants again this year, which is actually quite a good forecast if you look at its R-squared. But we want to do better than that, so we try to cancel these effects out: we control for last year's inflow, plus the percentage increase in the inflow from two years ago to last year. For example, if migration has been growing by five percent per year, we would expect it to grow by another five percent in the baseline, to make sure that this is not just picked up by our Google Trends variables. Then we add our trends variables and all the other controls as well.

The main result: I didn't put it in a table because it is not very informative to look at the individual keywords, but depending on which specification we use, the within R-squared of these regressions increases by at least a hundred percent, can triple and sometimes almost quadruple. Of course, if you have run these kinds of regressions, you know you start from a very low share of explained variation. As a benchmark case we typically use, for example, the model by Anna Maria Mayda that she published in a really good paper, and with her specification we get an R-squared of 0.006; that increases to twice that, or even up to 0.2, if we look at specific subgroups of countries. For example, when we exclude countries which basically by definition use English or Spanish as the local language, and keep countries where, say, only five percent of the population can speak English, our performance gets better, which I think makes sense: if people don't actually look for information in a language, searches in that language shouldn't be very predictive of migration.

But of course, if you add lots of different regressors to your regression model, you risk overfitting: just by chance, all this variation in the keywords can very nicely "explain" the changes in migration over time. That is a massive problem, and a mechanical one, simply from using too many variables that jump around a lot. There are a couple of approaches, mostly from the machine learning literature, that address this, and I'll go through them briefly, one slide each.

The first is variable selection models. These were developed, for example, in biostatistics, when people had millions of different genes they wanted to link to a particular condition. There you typically have one dummy variable as the outcome and, back in the day, very few people whose genome had been fully sequenced into a data matrix, so you had to come up with suitable models. These variable selection models usually work by selectively kicking out the least informative regressors, and they can tell you whether what remains of the model is correlated with your variable of interest. These models would suggest that keeping about half of our keywords in the model makes sense, and interestingly they drop some of the standard gravity variables, which don't add anything to the model.
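The talk does not name the exact variable selection estimator used in the paper, but the LASSO is the canonical example of a model that shrinks uninformative coefficients exactly to zero, so here is a hedged sketch of how such a selection step could look. As before, the file and column names are hypothetical.

```python
# Sketch of variable selection with the LASSO: regressors whose coefficients
# are shrunk to zero are the ones the model "kicks out". Purely illustrative;
# the paper's estimator, keywords and controls may differ.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("panel_origin_year.csv").dropna()
keyword_cols = [c for c in df.columns if c.startswith("kw_")]
gravity_cols = ["log_distance", "common_language", "log_population"]

X = StandardScaler().fit_transform(df[keyword_cols + gravity_cols])
y = np.log(df["inflow"])

# Cross-validated choice of the penalty strength.
lasso = LassoCV(cv=10).fit(X, y)
kept = [name for name, coef in zip(keyword_cols + gravity_cols, lasso.coef_)
        if abs(coef) > 1e-8]
print(f"{len(kept)} of {X.shape[1]} regressors kept:", kept)
```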
Then, out-of-sample estimation. The idea here is that if there is a mechanical overfit coming from adding too many regressors which are artificially correlated with the outcome, the fit should not survive if we take only part of our sample to estimate the model and then use the coefficients to predict the numbers for the rest of the data. One nice method that does this in a standard way is K-fold cross-validation: you draw sub-samples from the data, for example dividing it into ten folds, as they are called, use nine of them to train your model, or to estimate your regression, test it on the remaining ten percent, and then go systematically through all ten possible splits. That gives you pseudo out-of-sample estimates.

What do these look like? Two typical things to look at are the R-squared and the prediction error. This is the basic model, the Mayda model; sorry for the small print. The R-squared in specification two ranges between just under 0.4 and 0.6; note that here the fixed effects are counted as well. This is an empty model, which includes hardly anything: just last year's information and what we would expect from that, no other economic or social control variables, and it does just as well as the other model. And this is our model, basically the Mayda model with our Google keywords added. Since this is a histogram, depending on the standard you set you can call it significantly better; I think it is significantly better in terms of the R-squared. This is what you get when deleting all the other controls and keeping just the keywords: once you have added the keywords, the economic and other controls don't add that much, and this is something I'll come back to in a slide or two.

And what about the prediction error? If we were good at predicting in most years but every three or four years produced a prediction that is very far off reality, that wouldn't be good. So what you usually use is the root mean squared error, squaring the error to make large deviations from reality very painful. What you see here is that these are again the two models from the beginning, and these are our models with the Google search terms, which have a lower forecasting error. So that looks quite good.

Another way to reduce the dimensions and work against this problem of overfitting is a standard technique, principal component analysis, which extracts a small number of components from your data so that you have fewer potential control variables to add to your model. If you add, for example, five components which capture 80 percent of the variation in 50 keywords, then automatically the risk of an artificial increase in your R-squared is lower. And if we do that, we can look at these principal components, and it is our fifth principal component that seems to be driving the result. The problem with principal component analysis is that the components are very difficult to interpret; they are very abstract. What we can see is that the first component, and probably also the second, are somehow related to increased computer use. That is quite reassuring: increased computer use explains the general increase in Google search volumes, but it is not what explains migration in the following year. With this method we think we are able to disentangle the two to some extent, but there will be some work needed on that.
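For concreteness, here is a sketch that combines the two ideas from this part of the talk: reduce the keyword matrix to a handful of principal components, then evaluate a simple linear model with 10-fold cross-validation, reporting the pseudo out-of-sample R-squared and RMSE. The ten folds match the talk; the choice of five components, the plain linear model and the column names are assumptions for illustration.

```python
# Sketch: PCA dimension reduction on the keyword matrix plus 10-fold
# cross-validation of a linear model, with out-of-sample R-squared and RMSE.
# Column names and the number of components are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("panel_origin_year.csv").dropna()
keyword_cols = [c for c in df.columns if c.startswith("kw_")]
X, y = df[keyword_cols], np.log(df["inflow"])

model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=("r2", "neg_root_mean_squared_error"))

print("mean out-of-sample R2:   %.3f" % scores["test_r2"].mean())
print("mean out-of-sample RMSE: %.3f"
      % -scores["test_neg_root_mean_squared_error"].mean())
```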
Finally, we tried to put some meat on our rather abstract paper. You perhaps know the Gallup World Poll, the hugely expensive data set based on surveys in all sorts of countries, which has these nice questions on migration intentions. The advantage of working with someone from the OECD is that they had access to these data, so we could actually test our data against the Gallup World Poll questions, which are used a lot in the media and have been used by some researchers for nice papers. We ran a kind of horse race between our variables and one question: "Ideally, if you had the opportunity, would you like to move permanently to another country, or would you prefer to continue living in this country? And if yes, to which country would you like to move?" From that you can build a dummy variable and test it. The Gallup World Poll is unfortunately only available for a subsample of our data, not for all years and not for all countries. But in the subsample where we can include both, the Gallup question becomes insignificant while our keywords perform as before, so they seem to capture part of what the Gallup World Poll tries to measure with that question.

To wrap up, since I'm running out of time: our paper tries to show that using such search terms can improve the prediction of international migration flows, and that from this we could potentially develop an indicator of something like interest in migration, or interest in information about migration. I think it is part of the research process to understand better what exactly we are approximating here, yet we think it can help policy makers in several respects. One of the examples we are thinking about including is evidence on post-disaster situations. For instance, I have a few graphs, if you're interested, on the earthquake in Nepal two years ago, where search volumes for economic and migration-related terms dropped right after the earthquake, due to the lack of power and so on, and then surged for a week or two in the aftermath. In these kinds of situations our approach could potentially be interesting. Thank you.