Well, I'd like first of all to thank the organisers for inviting me to speak, and indeed for organising the conference. I'm not going to present a conventional research paper, because I want to discuss a large ongoing programme of research and the issues relating to big data that arise from it. I want to start by making some very brief remarks on what I'm going to call the Hypothesis First Dogma. Most of the presentation will be taken up with an overview of the Occupational Structure of Britain research programme, which Mohammed has alluded to. I'll start with the intellectual context. I'll run quickly through some of the data. I'll present some of the key findings from the work on occupational structure. I'll introduce our transport data, and then I'll say something about the longer-term goal of producing a high spatial resolution data infrastructure. Then I'll talk about some of the international comparative work and look at occupational structure and population density. That international comparative work, which Mohammed has alluded to, is more than just a plan. A lot of it is happening, and Mohammed is in fact one of the contributors, with his own work on occupational structure in Egypt. At the end I'll come back to my comments on the Hypothesis First Dogma, and then I want to discuss briefly the key obstacles to creating an open access, pan-European, high spatial resolution data infrastructure. I'll explain what I mean by that as I go along.

So I want to start with what I'm going to call the Hypothesis First Dogma. All I really mean by this is the rather wooden or dogmatic view that good research happens in a single sequence. You start with a testable hypothesis, you then go out and collect data, you test the hypothesis, and you might reject it, accept it, or modify it, in which case you've got another testable hypothesis and you go through the loop again.
Now, I don't for a moment wish to take issue with the fundamental importance of that Popperian view about having falsifiable hypotheses. It's the sequence that concerns me, and the dogma, which in my view is widespread amongst funding agencies and referees, that this is the only way in which you should undertake scientific work. That raises the question of where questions fit in, and where hypotheses come from in the first place, because in this sort of loop there's no room for new hypotheses.

Okay, so the Occupational Structure of Britain. This is a project that Tony Wrigley and I started working on about 20 years ago. So it's been long running, and a very large number of people have worked on the project in that time. The material I'm going to show you has very largely been the work of other people. I'm not going to try to tell you who they all are, but I want to make clear that this isn't just me beavering away on my own.

The intellectual context for our project when we started 20 years ago is captured on this graph, where we've got Deane and Cole's original estimates of GDP per capita growth from 1700 to 1850 in green and Crafts's estimates in red. I'm sure everybody knows that the big difference in Crafts's estimates was that he showed much slower economic growth during the Industrial Revolution than Deane and Cole had done. That's very well known. What's much less often commented upon is that the implication of Crafts's work is that the economy was much more developed and much larger in 1700 than Deane and Cole had thought, and that has the further implication of much more growth over some earlier period. I've just speculatively drawn a line back to Maddison's data point for 1500, for what it's worth, to indicate the difference.
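The arithmetic behind that implication can be sketched in a few lines. This is a minimal illustration only, not a reproduction of either set of estimates: the function name, the end level of 100 and both growth rates are hypothetical values, chosen just to show that for a fixed 1850 level a slower growth rate over 1700 to 1850 implies a higher starting level in 1700.

```python
def implied_start_level(end_level: float, annual_growth: float, years: int) -> float:
    """Back-project a starting level from an end level and a constant
    compound annual growth rate: start = end / (1 + g) ** years."""
    return end_level / (1.0 + annual_growth) ** years


# Hypothetical inputs, NOT taken from Deane and Cole or Crafts.
END_LEVEL_1850 = 100.0   # index value for GDP per capita in 1850
YEARS = 150              # 1700 to 1850
FASTER_GROWTH = 0.010    # illustrative "faster growth" scenario
SLOWER_GROWTH = 0.004    # illustrative "slower growth" scenario

start_if_faster = implied_start_level(END_LEVEL_1850, FASTER_GROWTH, YEARS)
start_if_slower = implied_start_level(END_LEVEL_1850, SLOWER_GROWTH, YEARS)

# With the same end point, slower growth means a higher implied 1700 level.
print(f"Implied 1700 level with faster growth: {start_if_faster:.1f}")
print(f"Implied 1700 level with slower growth: {start_if_slower:.1f}")
```

The same reasoning runs one step further back: a higher implied level in 1700 leaves more growth to be accounted for in some earlier period.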
Deane and Cole's work, in terms of any implication for the early modern period, would actually have contradicted several generations of work in early modern economic history. Crafts's work sits much more comfortably with that, because anyone who works on the early modern period knows it's full of economic development of many kinds. Tony Wrigley and I thought it would be possible to produce high-quality, long-run estimates of occupational structure which would fundamentally improve our understanding of the world's first industrial revolution. Part of our initial thinking was: if there's economic growth and development in the early modern period, and economic growth and development in the classic Industrial Revolution period, say 1760 to 1830, how do they differ? We thought occupational structure, and we both knew quite a lot about the sources as a result of earlier work we'd done, would allow us to say a great deal more about that.

Now, I'm going to talk a lot about occupational structure. I do not want to suggest that it's a substitute for national accounting work, because it does not tell you about output. Though I'd be critical about the accuracy of estimates within the national accounting framework, I still think the work's essential.

In terms of the wider themes, I want to stress that we did not start off with a hypothesis. We started off with questions, and the questions were pretty simple really: how did the occupational structure of the economy evolve over time, and what can that tell us about the Industrial Revolution and the economic changes that preceded it? So pretty open-ended questions. We also had the belief, which we had reasons for, that data on occupational structure would provide rich new perspectives on the Industrial Revolution. I'll talk very briefly about the data.
I haven't got time to describe any of the data sets in any detail, or how they were constructed, or all their limitations. The first one I want to mention is that we modified and improved a pre-existing GIS boundary data set, which has subsequently allowed us to analyse and map data from 15,000 different parishes reported in the census (that is, civil parishes), from 11,400 ecclesiastical parishes, and from a variety of other sources. We didn't initially plan to do this, but as soon as we started collecting occupational data, most of which came with the name of a parish or town attached, we realised that we needed to be able to map it to make sense of it. Crucially, linking all the different data sets we have to our underlying GIS data set allows all of them to be interrelated, because we can construct spatial units across which any two or more data sets attached to polygons can be made consistent. That's crucial to the whole data architecture of the project as it has subsequently developed.

Tony, with the late Ros Davies, produced the primary, secondary, tertiary (PST) occupational coding scheme, which at its finest level has about 1,600 different categories. That's also a linchpin of the project and of the international comparative work. We've also got population data that was published in the census decennially from 1801 for about 15,000 different units, though of course never quite the same set of units from one census to the next, something Tony spent several years sorting out. Not all of the occupational data sets, but the main ones, which I'll describe in a moment, have been coded to this occupational coding scheme, and all of them have been linked to the boundary data set, which we call EWCP, which stands for England and Wales census places, I think.
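One way to picture how consistent spatial units can be built across data sets is sketched below. It is a hedged illustration under stated assumptions, not the project's actual procedure: the function `consistent_units`, the parish identifiers and the unit names are all hypothetical. The idea is that if two base polygons are reported together in any one data set, they have to sit in the same consistent unit, so the consistent units are the connected components of that "reported together" relation, found here with a simple union-find.

```python
from collections import defaultdict


def consistent_units(*datasets):
    """Each dataset maps a base polygon id to the reporting unit it falls in
    for that source. Returns the smallest groups of base polygons that are
    never split by any source, as a sorted list of sorted lists."""
    parent = {}

    def find(x):
        # Find the representative of x's group, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for mapping in datasets:
        members = defaultdict(list)
        for polygon, unit in mapping.items():
            find(polygon)  # register every polygon, even singletons
            members[unit].append(polygon)
        # Polygons reported together in this source must stay together.
        for polygons in members.values():
            for other in polygons[1:]:
                union(polygons[0], other)

    groups = defaultdict(set)
    for polygon in list(parent):
        groups[find(polygon)].add(polygon)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical example: four parishes, two sources with different units.
census_units = {"A": "u1", "B": "u1", "C": "u2", "D": "u3"}
register_units = {"A": "r1", "B": "r2", "C": "r2", "D": "r3"}

units = consistent_units(census_units, register_units)
print(units)
```

In this toy case parishes A and B are merged by the first source and B and C by the second, so A, B and C form one consistent unit and D stands alone. Any counts attached to the original polygons can then be summed to these units and compared across sources.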
From 1851 to 1911, thanks to the I-CeM project run by Kevin Schürer and Eddie Higgs, we now have individual-level census data for the second half of the 19th century. It's 180 million different records, which creates some severe problems in coding the occupations, which we haven't fully solved: there are at least a million text strings that occur only once in the 1881 census alone.

I'm working backwards here. Decent occupational data are first available in the 1851 census. But for the beginning of the 19th century we sent teams of undergraduates out to archives around the country and collected three and a half million occupations of fathers from more or less all of the 11,400 Anglican baptism registers, which gives us very high quality data with complete spatial coverage at the beginning of the 19th century, for males but not for females. Between 1600 and 1800 we collected about a million male occupational observations deriving from testamentary documents, that is wills and probate inventories. These are systematically biased towards the better off, but Sebastian Keibek, then my PhD student and a former physicist, devised a robust method for correcting that bias, which we're very confident is correct: it's been cross-checked against other data sets that we know are not biased.

For the period 1418 to 1547 we've now got a very large number of occupations of male defendants from the Court of Common Pleas, with more currently being collected. We're still at a relatively early stage in this and we haven't yet corrected the biases in the source. We don't even know if we'll ever be able to do that. We've got a rough idea of what they are: the tertiary sector is over-represented and the primary sector is under-represented. The earliest data we have come from the 1381 poll tax. We've got 71,000 male occupations there, a relatively modest data set, and these have all had to be geographically reweighted to correct spatial biases, particularly
between urban and rural areas. That not being done properly is a major reason for our disagreements on this subject with Greg Clark, and perhaps with Robert Allen, though Robert Allen is much less explicit or clear about what was actually done with the data to get such a high non-agricultural share.

As soon as we started mapping the data, we realised, and perhaps we should have realised before, that there was such a striking geography to occupational structure that, if we wanted to understand occupational structure and its evolution, we would need to know where the roads, rivers, canals, railways and so on were. Nobody had done any of this, so over a number of projects we created a series of what I call time-dynamic transport GIS data sets. By time-dynamic I mean we can map each network for every year of the date range. So we've got the rivers and the canals. We've got railways and railway stations. We've got turnpike roads: these were roads which were key to improving the road network and, critically, on which users had to pay a toll. We've got complete data sets of main roads, which include roads that weren't turnpiked, at three key dates. We've got the ports, and we've got coastal sailing routes. At three dates we've glued all of these transport data sets together to create a multimodal model, and that allows us to calculate least-cost or fastest paths, or various other types of route, between all pairs of points on the map at each of those three dates, using the cheapest or fastest combination of the transport modes available.

We've also created a number of GIS data sets of natural endowments: coal fields, iron, copper, etc., soil qualities, soil capabilities, altitude, rainfall, water flow in rivers and so on, and various other odds and ends, some of which are illustrated here. For instance, we have property values for every parish at the beginning of the
19th century. We've got all the 18th-century stationary steam engines, which derives from Kanefsky's work.

But all the occupational data have to be coded, and since I'm going to show you a lot of graphs that rely on the coding, let me just make it clear. In our UK data, where we've used Wrigley's PST, the primary sector is basically agriculture and mining, also forestry, estate work and fishing, so it covers all the production of raw materials. The secondary sector is basically anyone transforming raw materials into something else, whether it's a village tailor or blacksmith or a worker in a factory. The tertiary sector is basically everything else: it's all the services, anything that doesn't result in a physical product, though the military can result in dead bodies. In the variant of the scheme that we have used for international comparative work, PSTI, which I'll say something about later, the key difference is that mining is in the secondary sector, as is normal in most common coding schemes. There are arguments for and against that, or indeed for making it a separate sector.

So, some key findings. This is our headline graph, where we have the occupational structure of England and Wales, because so far we don't have any data for Scotland before 1851. This is males, and I've put effectively four sectors here: agriculture, the secondary sector, tertiary and mining. One of the most striking and unexpected things about this was that the huge shift in the labour force towards the secondary sector, which almost all of 100 years of scholarship has considered to be the defining feature of the Industrial Revolution, did not take place. Now, it was my feeling by the time we started this project, and it was one of my reasons for wanting to undertake it, that economic historians over a long period of time have spent an enormous amount of
effort trying to explain the Industrial Revolution without first documenting properly what happened, and that there was a serious risk that a lot of this literature was trying to explain things which either didn't happen or happened at a different time. I think the graph here shows some justification for that view, but of course this isn't telling us about output directly.

The big shift into the secondary sector took place between 1550 and 1700. There's some growth between 1381 and 1550, and there's almost certainly growth between 1350 and 1381, but we don't have any data on that. So there's a long period of what I would call low-tech, labour-intensive industrialisation from 1550 to 1700, and it's pretty much over by 1700. Now, one knock-on implication of our data is that labour productivity growth in the secondary sector, in manufacturing industry, during the Industrial Revolution was twice what either Crafts or Broadberry have suggested, with the immediate implication that technological change was much more pervasive than the national accounts literature suggests. So we have a story here where, if you take Crafts as revisionist, this is kind of super-revisionist, in the sense that we're now pushing a lot of change back to before 1700. But it's also going partially back to an older story that stresses the importance of technological change in the later period, which has been systematically underestimated in all of the national accounts work since Deane and Cole; sorry, not since Deane and Cole, since Crafts.

Another very unexpected finding, though perhaps it shouldn't have been in retrospect, is that the dominant structural shift in employment during the Industrial Revolution was basically from agriculture to services. All of the national accounts literature, from Deane and Cole through to Broadberry et al., has a tendency to assume that tertiary sector employment, or at least large sections of it,
remain more or less stable as a share of the labour force during the Industrial Revolution. That's completely wrong, and it requires major rethinking of the role of the tertiary sector in industrialisation.

Now, we've got problems with women, because it's much more difficult to get data on women before 1851, and my colleague Amy Erickson is working on that. I haven't got time to describe how we did this, but we are able to model what we think female employment looked like in relation to the male data, and if you do that you get a rather different picture. It doesn't really change the picture down to 1750 very much, but then you get this big fall in the secondary sector. During the classic period of the Industrial Revolution the secondary sector is actually shrinking, and that's because the mechanisation of textiles, starting with spinning, takes an enormous number of women out of the labour force. Without data on women we seriously risk misunderstanding macroeconomic change and miscalculating productivity growth, and that's a major factor in what's wrong with the productivity calculations in both Crafts and Broadberry. And of course we're also blind to the social history of this and the gendered impact of early mechanisation.

This breaks down the secondary sector a bit more. That enormous fall in the red line: these are our estimates of total employment in textiles. So early mechanisation produces this massive contraction in the relative importance of that workforce. Older literatures tended to assume that just because textiles were growing, employment shares must have grown too, perhaps most notably stated in the work of Maxine Berg. That now looks wrong, and that's precisely because all this labour-saving technology is being introduced in this period. Our estimates are that female labour force participation on the eve of mechanisation was probably about 80 percent. It's down to about 40
percent in 1851, and it's highly spatially concentrated by 1851 as well. If our estimates are right, female labour force participation rates didn't recover till well after World War Two, possibly not till the 1980s.

Going back further, these are data at a more preliminary stage. We know that the tertiary sector is substantially over-represented. The secondary sector looks like it might be about right or slightly over-represented: it's quite close to the 1381 figure of 20 percent. The agricultural sector is under-represented. You probably need to take about 10 points off the tertiary and add them to the agricultural sector, but the trends are probably right. The trend is that there is no trend, and there are lots of good historiographical reasons for thinking that there wouldn't be much of a trend in the gross occupational structure across the 15th century and the first half of the 16th century.

Nonetheless, things change if you dig down deeper. If you look at large, common occupational groupings like blacksmiths or carpenters, that's basically trendless. But a much more fine-grained analysis, which is one of the beauties of this kind of occupational data, shows movement. Now, these numbers are very small, but there's a 25-fold increase in millwrights. These are people who specialised in making mills, so this is probably an indication that the quality of mills, which are the largest and most complex pieces of machinery in this period, apart perhaps from sailing ships, must have improved. Scythe makers, people making sharp-edged tools, just about triple as a share of the population over this period. Brick makers: that's the introduction of a new material. It stays at very small levels, but you can see it developing in the period, and that fits exactly with what we know about buildings in this period. It's starting to be used, but only for very high status
buildings. Later on you don't want to use brick for high status buildings, you use stone, to distinguish yourself from the people who are building their houses in brick. In the 16th century you're building in brick to establish your status, which is why in Cambridge Trinity College and St John's, the oldest and richest colleges, have big fat brick gatehouses where all the other colleges have stone. Wheelwrights and cartwrights nearly triple, sorry, go up by 50 percent over this period as a share. That's suggestive of significant increases in the use of wheeled vehicles during the late medieval period.

Now, having large and complex data sets like this allows a fine-grained analysis which can identify significant phenomena that a more aggregated and smaller data set would miss altogether. If we had a data set a tenth of this size, we couldn't begin to make these comparisons. You might think I'm in small numbers problems anyway, but if I showed you the data I think you'd be convinced that I'm not. But you need very large data sets if you want to look at these fine-grained changes, and some of these are what I would call key or marker occupations. Millwrights are a critically significant group all the way through the period that we're interested in, into the 19th century.

One of the great attractions to me of occupational data is that you can use exactly the same measure, basically a proxy for economic activity, at multiple spatial scales, I mean geographical scales. We can look at the whole country, so I've shown you graphs of the occupational structure of England and Wales. We can look at regions, or in this case counties. And we can come down to individual parishes: this is the parish GIS that I talked about before. We can look at the occupational structure of individual villages, and you can see the level of occupational detail that we have there once all the data have been standardised, variant spellings and
so on. Starting to look a bit at the geography, the 17th century saw a very general industrialisation. In almost every single county we see significant increases in the share of the population employed in manufacturing in one form or another. By the time we get into the 18th century we've got a very different pattern. One of the major findings of the project has been that deindustrialisation was extremely widespread in the 18th century. In fact more of the country was losing employment share in the secondary sector than gaining it, and that was even more marked in the second half of the 18th century, where you can see that, outside some key industrialising counties, most of the country was actually deindustrialising.

We can go on, but I'm not going to comment on these; it's just to show you we've got the data. Here we've got the ratios between the absolute numbers of workers in different sectors in 1851 and 1601, and you can see that the most dynamic sectors in terms of absolute numbers in employment (not output, in either case) are in fact services, but particularly transport and mining. A very high proportion of the tertiary sector is merchants, shopkeepers, people selling things, and transport. So a very high proportion of the sector is people who are, in one form or another, moving the products of the primary and secondary sectors around the country. They're moving more and more stuff greater and greater distances over time, and rates of productivity growth are not keeping up with that, so you need more and more of them.

So I'm just going to put up here a summary of some of the key findings that I've discussed already. I'm not going to read through them again. I just want to make a simple point relating to the wider issues, which is that none of these findings originated with a hypothesis. None of them were answers to questions that we
specifically or originally posed. Questions and hypotheses can be posed retrospectively: we can pose the hypothesis, then test it, then show it's true, and we can of course test other hypotheses and show they're not true. But these findings arose from the systematic collection of a very large body of data on economic activity, from systematically categorising it, from linking it to locational data and being able to map it, and then from exploring the data in a variety of ways. We wouldn't have thought about posing most of these questions in this form. But these findings immediately generate new questions and new hypotheses, and I would guess that when you saw the graphs many of you will have had questions.

To take this a bit further: when I was working on my PhD I started off with an absurdly ambitious project, as is normal. I wanted to work on the development, and the timing of the development, of agrarian capitalism from the 15th century to the 19th century. Of course that got chopped down to something much more manageable, and I looked at the proletarianising impact of parliamentary enclosure on one group of workers, agricultural labourers, in a relatively short period of time. So a brick in the wall, but not the wall. After that, as a result of engaging with sources that used occupations in my PhD, I changed topics, and Tony Wrigley and I put this project together, and I've worked on occupational structure for the last 20 years. But one day I was looking at the occupational data and I had a eureka moment, when I realised that the ratio of labourers to farmers (you can do this in the English language; it wouldn't necessarily work in all other linguistic contexts) allows you to track change over space and time in the share of the agricultural labour force that is wage labour. And I already had much of the data needed. Now, I hadn't sought to answer that question with this
data. I mean, I already had the question, because I'd thought about it a lot before, but it was having the data, having this enormous data infrastructure, that mattered. As soon as I mapped these data and others, it was immediately apparent to me that all the arguments in Bob Allen's book Enclosure and the Yeoman are fundamentally fragile, because he takes the south-east Midlands, where virtually all his data come from (that's the area outlined in purple), and treats it as representative of the whole of England and Wales, when it's certainly not even representative of England. What you can see is that there are much, much higher labourer-to-farmer ratios in the south and east than in the north and west. Of course it's very different from what you find in most parts of Europe. In the south and east you've got nearly nine labourers to every farmer, so a very highly capitalistic agriculture, if you like. It's small scale compared with what capitalist industry becomes later, but this is large scale in terms of agriculture. I was then able to show, with rather spottier data, that by 1700 in south-east England there were already three times as many labourers as farmers, so agrarian capitalism was already dominant at the beginning of the 18th century. Bob Allen's book argues, very problematically in my view, that this all takes place in the 18th century. So that became an article in the Economic History Review.

But the critical point I want to make here is that it was the pre-existence of this very large and rich data infrastructure that allowed me to make a chance observation in the data and then use it to answer a historiographical question. I would never have posed this as a question or a hypothesis we could answer with occupational data before I started, although theoretically that would have been possible.

There's a major spatial concentration that takes place over time, particularly in the textile industry, and you can see the counties that are so familiar from the Industrial
Revolution starting to gain share in the late 17th century but really becoming dominant after 1750. There's a big question then: what explains that timing? Because the locational advantages of cheap coal and abundant running water were always present. One possibility is that those advantages get switched on, if you like, by transport developments. I'm just going to whizz very quickly through this; I'm conscious I'm about to run out of time. We've created GIS data sets of navigable waterways (that's 1680, 1740, 1830) and turnpike roads over time. We have railways and stations; that's the rather unhappy bit at the end. That's allowed us to build a multimodal model where we stick all this together, and we've got coastal routes in there. This is ongoing work with Dan Bogart and others, and that's our multimodal model for 1911. We use network analysis software to measure the costs of moving freight or passengers, or speeds, between any pair of points, and then we can build up iso-cost lines or iso-speed lines like these. We're using that to try to explain developments in occupational structure and population geography. We've built up a lot of other data which I'm just going to whizz through without talking about.

I'll skip most of what I wanted to say about international comparative work, but I do want to make the point that occupational data are much more straightforwardly commensurable over time and space than GDP per capita or wages, because we don't have to face the index number problem: the structure of prices is very, very different in different places and when you compare long periods of time. But that only works if international data sets are consistently coded into the same categories, with the categories defined in the same way, so the data are commensurable. Work is now under way using a modified version of PST in all of the countries shown on the map, including Mohammed's data for Egypt
there, and this is all directly comparable in a way that's never been possible before, because we've managed to enforce a standardisation of the coding of the data. I'm just going to whizz through these, because otherwise I'll run out of time for things at the end.

I'll just put this map up very briefly. This is population density around 1870, which we've managed to put together with a lot of people having input. As soon as you see this map, questions and hypotheses immediately start to form. Of course this is a map of the concentration of economic activity, and you can see lots of first-order geography effects. You can see coal. You can see water power running off the Alps, for instance. You can see extremes of climate associated with low population densities, capital city effects and so on. I'd say population density is a very neglected variable in economic history, but nothing is more commensurable across time and space than that, because a person is a person is a person.

Okay, I'll try to finish in the next couple of minutes. So that's the conventional model, and this is more like what we did, I suppose. I was going to say a little bit more about that, but since I'm essentially out of time I'll leave it there. You can start with a question. I think with these kinds of data you can start without a question: if you're collecting occupational data where it hasn't been available before at this kind of quality and spatial and sectoral resolution, it's going to tell you a lot you don't already know. But you can have a question if you want.

Just a comment on the side: chance observation is also important. Nick Crafts's work on national accounts didn't start from a hypothesis. It started from the observation that there were serious problems in what Deane and Cole had done, particularly in the weighting of cotton. The whole work of the Cambridge Group on family and household size, which ultimately showed that, contra all previous thinking, early
modern English people did not live in large extended households, arose from Peter Laslett chancing to look at a couple of 17th-century population listings, which immediately made it clear that in those two villages this was false. I won't say more about that.

What I'll come back to, just to finish: there was a question circulated by the organisers about whether the use of big data will induce economic historians to pay much less attention to how data are created in the first place, a traditional strength of economic historians. I think that's undoubtedly the case, and the problem predates big data. Economists in particular, not so much economic historians but economists, are in my view generally pretty careless about data, sometimes very explicitly so. Certainly in Cambridge I regularly hear stories from students whose economics supervisors have told them that data doesn't matter. But this will happen. It's happened a lot already with the demographic data produced by the Cambridge Group: people have used it in ways that are complete garbage and have got published in top journals, because the referees aren't familiar with the data sets and their limitations.

So the solution lies with the clear and public documentation of data sets, clear evidence that authors have used that data set and recognised its limitations, and, in the refereeing process, referees who are reviewing work using large complex data sets of secondary provenance need some familiarity with those data sets if they are to be able to pass definitive judgments. Mostly that doesn't happen, and we have a lot of experience of this at the Cambridge Group. More generally there's a need for reality checks: we shouldn't just be using quantitative data, we should check whether it fits with qualitative evidence, and we should think very hard when it doesn't. I'll just put those up but stop there. But I'd say that
what we'd like to do, and have been trying to do, is create a large-scale, open access, high spatial resolution data infrastructure of the kind of data sets that I have just been showing you, to cover the whole of Europe. So far we've failed spectacularly in all of the big grant applications we've put in, partly because we were told that we didn't have a hypothesis, so it was all a waste of time. I don't really know where we go with that, because the problem with these data sets is that you can do so many different things with them. They're very expensive to create, so if you only put in two or three hypotheses that you're going to test, it looks extremely expensive; these are something like 10 million pound projects. You haven't got space to list 40 or 50 hypotheses or questions, and if you try to do that you get told it's too diffuse. So at the moment we're kind of stuck with that. If you think that having this kind of thing publicly available and covering the whole of Europe would be really valuable, then that's a big problem, and I don't know how to solve it. Thank you. I'm sorry I've overrun.