I've been dealing with multi-petabyte data sets for the last 10 years. Basically, we have a very large digital camera. This little guy here, that's a person to give you an idea of scale. This is the small one, as well; that one's much larger. At design luminosity, this guy will kick out 10 petabytes of raw data a year that's required to be stored long term, which goes to tape. The experiment is going to run for maybe 20 years. All the data that it takes probably needs to be accessible for 50 years, maybe longer. The curation of the data is something that we are not particularly scared of. Tape systems are pretty good and we can upgrade them as we go along. The thing that is actually harder, going back to the DCC talk, is curation of the hardware and software to actually process that data. So if I accidentally delete the version control system that contains the only copy of the software that can be used to process some data we've taken, then we're screwed. Likewise, if the x86 architecture suddenly disappears, data that's very tied to that might not be processable, or might give a different result if you process it again. We spend a hell of a long time comparing GCC compiler versions before we move forward. So that kind of thing is where people start getting antsy, if you want to annoy a particle physicist or something like that. This thing, basically the photos, the events that it records, they're about a megabyte each. At design luminosity, the plan was that we would record about 100 of these a second. In practice to date we've been running at between 300 hertz, so 300 events per second, up to a kilohertz, a thousand. When you're building a system and someone says, oh yeah, we'll never put more than 10 terabytes of data in that a day, they're lying. They will do much more. For every event that we record, so every actual physics collision that we record, we'll simulate at least as many events, if not more, using various Monte Carlo tools, to basically be able to compare prediction, which is the simulation, to the experiment. We basically use pretty much any and all technologies that we can get our hands on. So we use Oracle, MySQL, Lustre, GPFS, but also we've started using some of the NoSQL stuff: Hadoop, mainly the storage system; MongoDB, which is a sort of quick key-value store; CouchDB, which is another one of those sorts of things. So this is a kind of hard-to-read diagram of the data flows between various sites around the world. We have about 180 institutes involved in the experiment in 38 different countries. And we have this kind of tiered system, so the tier zero, the compute centre at CERN, sends data out to the tier ones, which are large national labs. And they then do some processing and then serve that data out to universities all over the world.
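As a rough back-of-the-envelope sketch of those rates (the event size and trigger rates are the ones quoted above; the assumption of a full 86,400-second day of continuous running is mine, purely for illustration):

```python
# Rough sketch of daily data volumes implied by the rates quoted in the talk.
# Event size and rates come from the talk; the 86,400 s "day of continuous
# running" and the factor of two for simulation are illustrative assumptions.

EVENT_SIZE_MB = 1.0      # each recorded event is about a megabyte
SIM_FACTOR = 2.0         # at least one simulated event for every real one

def terabytes_per_day(rate_hz):
    """Recorded-plus-simulated volume per day of continuous running, in TB."""
    mb_per_day = EVENT_SIZE_MB * rate_hz * 86400
    return mb_per_day * SIM_FACTOR / 1.0e6   # MB -> TB

for rate_hz in (100, 300, 1000):   # design rate, typical running, ~1 kHz peak
    print(f"{rate_hz:4d} Hz -> ~{terabytes_per_day(rate_hz):6.1f} TB/day (raw + simulation)")
```

Even at the design rate this is well past the "we'll never put more than 10 terabytes a day in it" estimate, which is the point being made.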
We need to have this data accessible by about 4,000 people, so that's not a huge number of people in Facebook terms, but the uptime needs to be about 24/7, 365 days a year. It's scary how many people submit jobs on Christmas Eve and Christmas Day, myself included. So we have a very large distributed database. The query language that we use is basically crufty C++, which isn't great. Everyone says, oh, why don't you put it all at one site? Why don't you put it all at CERN? And the main reason is politics, power budget and cost. We have to sort of do some money laundering to get it out and put it into universities. It's true. So this is some data. PhEDEx is the tool that we use to ship our data around. It also catalogues what files are in what data sets. And it's developed in Bristol. So we currently have a 50 petabyte, that's 50,000 terabyte, data set. We have deletions here; that's where we've removed some Monte Carlo data that we don't need anymore, because we've got a better version of it. The system's transferred about 100 petabytes of data to date, from not very much in 2004 up to fairly continuous running now. This is all done on commodity hardware and normal networks, Janet. And we have this very distributed system where we push work out to the data. So you don't open up your laptop and suck in a load of data; you send your job out to the data. And the reason for that goes back to about the time when people were discussing and thinking about the LHC and how we were going to do the computing. This is a plot of things that people at Stanford bought over time. So you can see that disk and CPU are following a nice Moore's law, so they were fairly confident that that was going to be fine and we'd be all right. The issue was that the network looked like it was going to be the big problem, right? There wasn't going to be a good enough transatlantic link to be able to take the LHC data and process it around the world. What happened is that about the time they were deciding they were going to have this very push-based system, these things called Google and Facebook and so on kicked off, and suddenly the transatlantic network got a lot better. So now we're in a regime where actually the network is probably our best resource. It's the most reliable. And we're now trying to work out ways of breaking that as well as storage and CPU, which is good. When we started doing this sort of planning, there was no Google, there was no Hadoop and tools like that. So everything is kind of a bespoke computing system, written in all kinds of languages and of various quality and various sustainability. If the LHC had turned on when we had originally planned for it to turn on, we would have been in trouble, because there was no way that there was enough compute power to process the LHC data. Similarly, if the LHC hadn't existed and then popped into existence now, the computing system that we have would be very different. We would be using tools like Hadoop or whatever. Okay, so we have this kind of... I like to think of it as a kind of ladder, and it's interesting that the Sangster talk had a horizontal one, I don't know. It's basically about the number of people interested in the data: the more data there is, the more people will be interested in it. Researchers work basically at every level of this data ladder.
And the job of people like me is to try and minimise the number of people who actually have to process a petabyte of data in a single go, and give them some nice summarised data set that they can deal with. And we do this by a periodic process of refining the data, skimming stuff out. We still keep all the raw data, but the stuff that people actually use on a day-to-day basis is factors of 10, maybe 100 times smaller than the raw data. If people have private resources, so a hidden-away cluster or a very powerful laptop or something, they need to be able to plug that into the system and use it appropriately. And it should be as easy to run a job on this massive distributed system, over thousands and thousands of files and thousands and thousands of terabytes of data, as running a job locally on your laptop. It's not yet, but it's getting there, it's improving. And that sort of thing of moving up and down the ladder requires a great deal of synergy between the tools, so that behaviour locally is reflected in the large distributed computing system, and a lot of integration work. So if you haven't tested it, it doesn't work, and most things don't work because they've not been tested. The other thing that keeps coming up is this idea of letting go of data. Getting people to let go of their data we've found, in the first few years of actually running, to be really hard. Everyone wanted a copy of their data down the corridor, so that if the machine that was hosting it died, they could go down and kick the post-grad that was responsible for it and make them turn it back on again. What we've found is that by providing a sort of equivalent level of service, not as good, because it's a lot harder to run a large distributed system than to run an NFS server at the end of a corridor, but a level of service that's close to that, people are actually happy to move on. You have to let them relax and give in to the system and then they're all right. Okay. So as a physicist, I have to have a formula in my talks. It's part of the contract, I think. So the formula is this. Oh, it's unfortunately a bit blurred. Roughly, S = G / (C_storage(t) + C_compute(t) + ∫C_network dt + C_people(t)). Do you understand that? Okay. So G is fixed, and those other things are usually fixed. Or if we put it into words: the grant is fixed. The cost of storage and compute over time is fairly fixed. We can plan that. We can buy a tranche of hardware and off we go. The integrated cost of network we can actually amortise by careful planning. But the cost of people, that's the place where we can vary that value, by giving them more access and more power to do stuff, right? Another way to look at this, the way other people have put it, is that technical problems are actually quite easy to solve. People and political and social problems, they're the things where you get stuck. So yeah, people are important. Be nice to people working on weekends, because they won't work the next weekend if you're rude to them. The cost of the person is the place where you can make savings. So if you can let them do more, let them do things quicker, or without having to go through a huge potential barrier to work on your data set, they will do more. And you can make savings there. The big issue, though, is building a sustainable team.
So you want a bunch of people who are able to do this complicated science or analysis, who might also have to be involved in some of the operations of the system. They might need to develop the system a bit because it doesn't quite do what they need it to do. There's a buzzword for this in technology companies called DevOps, right? So it's development and operations combined. And the idea is that this is a team of people who develop and manage the system together. The problem is that university funding doesn't really align very well with that kind of model. This is the sort of model that people like Yahoo and Facebook and Google pursue, because it's a good way of making sure that the system is reliable: if something breaks, the person who broke it probably knows how to fix it. In the university system, getting that kind of long period of continuity is quite difficult with short grants. And also, industry pay is significantly higher than the universities'. Okay, so what's interesting to me is that big data isn't actually interesting anymore. Unless you're basically trying to make a new Google, you don't need to write your own system. It doesn't mean it's easy, but you're not at that forefront of, crap, how do we do this. So instead of spending time writing complicated distributed computing systems, you can spend time, money and effort on understanding your problem better, so that you can use those systems more effectively. As well as this, the price of hardware has come down a lot. When I was in my first year of my PhD, one of the first things I was given by my supervisor was to build a terabyte disk array. It was about that big. It cut my finger as I was putting it together, and it was about £3,500, £4,000. For the same price now, I can get a 2U, 24 terabyte box, plug it straight in, easy. So commodity hardware has trended to become much cheaper, as you'd expect, and now it's about being able to use that hardware effectively. Okay, so NoSQL is this enabling technology for the big data movement, along with the cheaper storage stuff. Most of the systems are designed for web-scale applications, targeting large cluster deployments where you can scale out. The Yahoo clusters are multi-petabyte systems, and they're run by one person. They're designed from the ground up to be cheap to run, cheap to buy. You take a container, fill it with disks, plug in power and network, and only when half of it goes down do you worry about it. If a disk dies, who cares? You've got thousands of them. So they can experience massive economies of scale, and that means they can run these things relatively cheaply; in their business, they've got to make money. NoSQL is an intentionally provocative name that's been kind of sanitised a bit by marketing departments, I guess, into big data. It was also kind of a poor name. It's quite catchy and it's provocative, but a lot of the systems actually have kind of a SQL-like interface, so it's a bit silly to call it NoSQL. Nowadays you don't really need to write your own systems. You've got software-as-a-service vendors who you can go to and say, hi, can you do this for us? You can download the source; Hadoop, say, is an Apache open source project, you can just go and download it, install it, run it. It's also been largely commoditised by vendors, so you've got EMC or Oracle.
You can go and buy an Oracle big data appliance, it'll cost you a quarter of a million quid or whatever, plug it in and you've got a Hadoop system. What's interesting is that a lot of these NoSQL tools are open source software, and I think that collaboration is really vital. Because the problem is so large, you need to have lots of people involved who are testing different pieces of it. The problem that you maybe experience one time in a thousand, someone else maybe experiences once a week, so they can fix that problem and you don't have to worry about it; you fix the problem that you see, and both of you benefit. The other thing with the NoSQL stuff is that they're massively tuned towards specific use cases, so if your data is very, very relational and you're using Hadoop to process it, you're probably not doing it right. You also might find that the system does almost what you want to do, but not quite everything, and so you'll need to get involved, use the source, adapt the system and extend it a bit. Some people try to sell NoSQL as a silver bullet that solves everything. Hands up, people who believe in silver bullets? Good. It's no panacea. Anyone who tells you that it is is usually lying to sell you something. Another thing is that, because they're interesting and trendy and buzzwordy, people go, oh yeah, we can do this, and we can process this many files and stuff, and it's amazing. But actually, if your problem is to deal with 10 files, deal with 10 files, don't deal with 10 million. So avoid the technology lust. Similarly to the silver bullet side, there was also a converse pushback that said NoSQL was totally useless, because people just need to go and learn how to use SQL properly. And that was as unjustified as the silver bullet thing. That initial perception I think is now waning. Vendors like Oracle and EMC getting involved, and the big data meme kicking in, means that people are moving on; I think these things are actually quite serious and useful products. There are still a lot of reality distortion fields out there to watch for, though. The main thing that NoSQL gives you is a different toolset, right? So when all you've got is a hammer, everything looks like a nail. That's MC Hammer, if you're older than about 40. Maybe 45. So it gives you alternative solutions, right? It doesn't mean that you're going to stop using Oracle for your HR database, say. You will always find relational data, and relational databases are a very good, very established technology to use. NoSQL stuff tends to be somewhat less forgiving than SQL databases in terms of queries, and more forgiving in terms of your schema. So one of the benefits of nearly all the NoSQL stuff is that you don't have to worry about writing a schema anymore. If you then want to do a query on something that's arbitrary, you've probably got problems, because you haven't got a nice schema to bind your query down to. NoSQL is a broad term, obviously; individual tools in that space provide different benefits and costs, and often provide features that you wouldn't expect from a traditional database. So one of the databases we use is this thing called CouchDB, which is a web server as well as a database, and that's really useful for the particular use case that we have.
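As a minimal sketch of what that simple HTTP interface looks like in practice (the host, database name and document here are hypothetical; only the general CouchDB pattern of PUT-ing and GET-ing JSON documents is assumed):

```python
# Minimal sketch of talking to a CouchDB instance over plain HTTP.
# Host, database name, and document contents are hypothetical examples.
import requests

COUCH = "http://localhost:5984"        # assumed local CouchDB instance
DB = f"{COUCH}/slope_results"          # hypothetical database name

requests.put(DB)                        # create the database (returns 412 if it already exists)

# Store a small summary document; the full result file could later be added
# as an attachment to this same document, as described in the talk.
doc = {"slope_id": "slope-042", "storm": "1-in-100-year", "failed": True}
requests.put(f"{DB}/slope-042_storm-100", json=doc)

# Read it back; any HTTP client, or a plain web browser, can do the same.
print(requests.get(f"{DB}/slope-042_storm-100").json())
```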
I suspect that it's impossible to find a single tool from the NoSQL sphere that solves all problems, because such a thing would be a silver bullet, and they don't exist, remember? So spending a lot of time actually studying what your use case is and studying the various technologies is what you need to do. You don't want to bind yourself to a large Hadoop stack when you've got a fairly small amount of data that would fit into a MySQL database and would work better out of MySQL. So you've got to understand the problem you're trying to solve and then choose the appropriate tool, like leather pants, like MC Hammer. OK. So, NoSQL in general: combined with cloud, which is at a similar level of technology maturity, it's very good for internet start-ups who don't really care too much if they lose the data; they want to show off what they can do, and that's great. They're also good for the large companies like the Yahoos and the Facebooks, with these gigantic data centres that are customised and really, really purpose-built for their problem, who build a large DevOps team to run it and maintain it and keep it ticking along. How that fits in with university researchers I'll come back to in a bit, but you can obviously see that universities aren't really either of those sweet spots. We're smaller than the large companies. We have a larger exposure to risk than a cloud start-up or something. And a lot of the benefits that NoSQL has are these kind of soft benefits. So benchmarking and comparing product A and product B, while you might get a number out, is actually wrong; it doesn't make sense, you're comparing apples and oranges. So choosing the right tool is actually quite difficult, basically. OK. So, LHC computing evolution: as I said earlier, if we were building the thing today, if we were defining the computing system now, it would be different. The current system works really well. We publish papers, we can ship data around as I showed you in that illegible circle diagram. But it runs at a very high staff cost. So we have people who are on call 24/7 to make sure that various bits of the thing that we know are kind of hokey don't fall down. We know that that's unsustainable. We're not happy with it. So over the next few years we'll start simplifying, hopefully retiring bespoke components in favour of something that's generic. Something like Hadoop maybe isn't appropriate, but maybe it is and we can subvert it to our needs or whatever, and so you can move along that way. The other side of it is that, because we've got so many bespoke components, the software maintenance becomes quite difficult: a researcher does some stuff, writes some code, is influential, whatever, and then gets picked up by Google or Yahoo or Facebook. So they move on, and then someone else has got to maintain that code, and it's probably a very complicated system that someone's spent three or four years thinking about and designing and developing, and you can't just learn it overnight. So it's a high staff cost, both from the operational side and from maintaining the software that runs the systems. Okay, so that's a broad coverage of the NoSQL and big data stuff and the LHC computing. The thing I want to talk about now is re-use. Bristol got a nice transfer grant to take experiences and tools and techniques from the particle physics group and reapply them to a completely new domain.
And it turns out that Bristol has this world-leading group of landslide modelers, and they work with the World Bank to help plan disaster recovery from various landslide issues. So the idea was to take these ideas and these tools and apply them to their problem. Landslides are a fairly major problem, especially in developing countries. This is the El Niño 97/98 event. Note this is not a log plot; they're about the same value. So the amount of money that Ecuador spent on cleaning up El Niño is about the same as the US spent, but US GDP is much larger. So this is a huge problem for Ecuador; for the US it's a pain in the ass, but it's not so bad. So the landslide modelers have got about 30 years' worth of work developing their software that does all these simulations. They don't want to rewrite that to use some new system. They don't want to learn complex data management tools, because they're landslide modelers, they're not data managers. And they don't want to worry about managing hundreds or thousands of jobs in a system that's complicated and scary. They also have some quite interesting use cases. They need to have a system that's usable by people distributed all over the world, which is familiar to us, but those people are also very non-expert field engineers, not necessarily computer literate. They need to have an expert approval step: if someone puts in a simulation where you've got a slope with some kind of Escher geometry, you need to be able to kick that out before you start it off. It's got to be usable on low-end, cheap hardware, and they need to run thousands of simulations per slope and storm and analyse those results. So they basically need a nice GUI interface, usable from mobile phones or tablets or cheap laptops, for both data entry and the analysis. So I'm going to quickly skip through a few slides on the complexity of their problem. A single slope with a single angle of cutting back produces one file and takes a quarter of an hour to run through their system. If you increase that to 25 different angles, so you're actually doing some planning and asking whether it's better to cut it really steep or shallow, then it's obviously 25 times the number of files and takes 25 times as long. If you then put in stochastic parameters, so measurements that you're unsure of, and the amount of rainfall, it increases as you'd expect, until they end up with about 31,000 hours of CPU time, or three and a half years, which is obviously not very useful if you've got a storm coming in a week's time. And that's for one storm, right? They need to simulate many, many different scenarios, you know, a regular storm versus a one-in-a-hundred-years or one-in-a-thousand-years type expected event, and they can easily pull out more stochastic quantities to vary, and vary them in a more fine-grained manner. And they also probably want to start looking at things like: if I run it with this version of the software I get this result, if I run it with that version I get that result, which one's right, which one's better. So we were able to solve this side of things by parallelising their stuff using the LHC computing grid that GridPP and others developed, which was mentioned earlier. So we've got a nice parallel problem and it's solved, right? Yeah. But it makes a whole new one, right? They're able to do all the processing, but they've now got masses of data that they don't understand how to use.
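To give a feel for how quickly that sweep blows up (and where the mass of output files comes from), here is a small sketch; the quarter-hour-per-run figure is from the talk, but the particular parameter counts are invented for illustration:

```python
# Sketch of the landslide parameter-sweep blow-up. The 15-minutes-per-run
# figure is from the talk; the specific parameter grid below is illustrative,
# not the modelers' real one.
from itertools import product

MINUTES_PER_RUN = 15

cut_angles = range(25)                  # 25 cut-back angles, as in the talk
rainfall_samples = range(50)            # assumed samples of an uncertain rainfall parameter
soil_strength_samples = range(100)      # assumed samples of an uncertain soil parameter

runs = list(product(cut_angles, rainfall_samples, soil_strength_samples))
cpu_hours = len(runs) * MINUTES_PER_RUN / 60

print(f"{len(runs)} runs -> {cpu_hours:,.0f} CPU-hours "
      f"(~{cpu_hours / 8766:.1f} years on one core), and one output file per run, "
      f"for a single storm scenario")
```

With this made-up grid you land at roughly 125,000 runs and about 31,000 CPU-hours, the same order as the figure quoted above, and each run produces its own output file.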
They want to use an Excel spreadsheet, like probably everyone else, but now they've got data volumes that are in the terabytes; they can't just email around a terabyte-sized Excel file. I think they may have tried; it doesn't work so well. So yeah, to solve these quite reasonable use cases, they can't do it with their current manpower or tools. So they came to us, we talked about it, and we've ended up with a system that looks a bit like this. You have this thing called BigCouch, which is a fork of CouchDB that runs in a clustered environment. The reason why we chose CouchDB is that it's got a very simple API and you can just talk to it through a web browser. You lose some performance because you're talking HTTP, but it's ubiquitous and it's nice and simple. We store a summary of results from the simulations, and then the full result as an attachment to that metadata, so that people can do further investigation on the full raw data set if they want to. And then the other thing that Couch has is this incremental map-reduce system, so you can build secondary indexes to find other stuff. So we can build an index that says, show me all the slopes that fail, and we can pull that out very quickly. So they were sort of happy with this, and we were thinking it was a week's work. This is where the catch kicks in. They go, oh yeah, we've got one other use case: we need to access the result data in St Lucia. And we said, yeah, we knew that, it's fine. But, they said, we need to do it when there's been a tropical storm and the telecoms infrastructure's all gone. And you say, oh right, so you need to look at something hosted in Bristol when there's no link between you; that's kind of fun. Fortunately Couch comes in again here: it's got possibly the easiest database replication I've ever seen. You say replicate this to there, and it does it all over HTTP, so you don't have to worry about caching, or rather, you can put caching in, but you don't have to worry about that sort of scaling and stuff. So people in the field can upload results to their local instance or to the main central server, and then we replicate the stuff back and forwards, and you see the same interface from both systems, and then you get the result data back down there and visualise it through a web browser, either talking to the server in Bristol or to one on your phone. And that's incredibly powerful and incredibly great. This kind of thing, this sort of web 3.0, is a vague enough term that I'm going to steal it and decide it's mine. Combined with HTML5, we can have very data-rich applications, accessed via the browser, powered by these big data tools: this huge data processing ends up resulting in some nice simplified data read by the browser. And so now the focus moves to the user interface end, and hiding the complexity of the system. So it's query builders and planners; you're seeing things in Hadoop where interfaces to things like R and MATLAB are already appearing. And then from the drivers at the industry end you've got better ad targeting, product suggestions, things like Square, where you can have an online bank and see all your transactions over time. And this is arguably a return to the web's origins as a place for science. So for the landslide stuff, this is our visualisation. Basically any of these points is a measurement that some guy in the field has taken. You can drag them around if you don't believe him.
The blue line tells you where the water table is, and these circles here tell you whether that slope is stable or whether it's going to slide away. And it's all rendered client-side, using JavaScript and a fairly modest browser. Which is awesome. So the last thing is about how this is going to impact universities. Data-intensive research is already the norm for a lot of fields and will become the norm for the rest, basically. Universities will need access to big data resources. And one of the interesting things is that you'll start seeing significant use from traditional fields. So people will want to store their entire DVD collection of students' performance videos or whatever, and then want to index it and do interesting stuff with it. Universities need to be able to provide that resource to enable interesting research. And the other thing is that big data and NoSQL are these empowering technologies, so they will basically generate new fields. There will be something that you had never expected, or always wanted to do but not been able to, because you couldn't access enough compute resource to do it. Going back to this workflow ladder, you've got this idea of stuff at the top which is done centrally for the benefit of everybody, and stuff at the bottom which is done privately or shared between you and your friends. And I think there's a fairly nice alignment with some of the NoSQL and big data tools: you have something like the big yellow elephant of Hadoop sitting at the top doing this sort of multi-petabyte data processing, something like Couch, BigCouch, in the middle that's able to distribute and share your data sets trivially, and then something like Couch itself at the end-user level, where you can have the data for the thing that you're interested in on your laptop. Okay, so other implications for the future: the quality and scale of big data resources are going to have a direct impact on what universities are able to do. US universities, for instance, certainly ones involved in CMS, have much bigger compute centres than we have in the UK and they're much better funded, and that's going to cause problems, right?
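As a hedged sketch of what that bottom rung of the ladder looks like with Couch-style replication (the URLs and database names are hypothetical; only the standard CouchDB /_replicate call is assumed):

```python
# Sketch of pulling a result set from a central BigCouch down to a CouchDB
# instance on an end-user laptop, so it can be used offline. URLs and database
# names are hypothetical examples.
import requests

CENTRAL = "https://couch.example-central-server.org/slope_results"   # hypothetical central BigCouch
LOCAL = "http://localhost:5984"                                        # CouchDB on the laptop

# Pull the results database down to the laptop; running the same call with
# source and target swapped pushes new local measurements back up.
requests.post(
    f"{LOCAL}/_replicate",
    json={"source": CENTRAL, "target": "slope_results", "create_target": True},
)
```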
We also need to provide data-intensive compute resources to complement HPC, and the reason for that is that big data clusters are architecturally very different from HPC clusters. That whole setup of a very expensive storage system, a thin little network pipe and loads of CPUs doesn't really work if you're trying to process multiple petabytes of data; you need to have the storage as close to the compute as possible. Hadoop did this very well, where basically you use the disks inside the machine that's doing the processing. It's also harder to manage data-intensive systems than HPC systems, because data is stateful. If a CPU doesn't work, you turn off the machine and it usually comes back to life; if you do that to a disk, the disk corrupts itself, grinds to a halt, and you've lost that disk. There are also interesting legal issues that people much more expert than me will tell you all about. There's also this thing of software as a service, and vendors such as Cloudant, who are excellent, you should totally buy from them. But these software-as-a-service vendors are quite hard to use at an institutional level, because they expose the cost so well. Because it costs you, say, £500 per terabyte or whatever to store this data, researchers in turn have to scrabble around and find money down the back of the sofa to do it, and that makes it much more difficult to actually take it up and use it appropriately. As I said earlier, building DevOps teams is hard and university funding doesn't really support it very well; that has, I believe, been recognised in recent EPSRC fellowship calls on software development for novel research. So there are big issues coming out of the big data stuff for universities. Basically, last slide: big data is mainstream. It's hard to get computer scientists excited about it because they think it's a solved problem; it's not, it's an engineering problem, as people were saying earlier. But it should be seen as an enabling technology for all academics. You don't need to be a hardcore coder to do something interesting with a large data set anymore. It's not trivial to adopt; if you've got existing code or systems in place based on Excel or whatever, you're going to have to do some work, but it's also not horrendous anymore. And it means that universities are going to need to start building up teams to support these activities, because they're a new activity, or find a way to outsource it, either to another department, as might be done with the landslides, or maybe to another university or a national lab, or to a company like Cloudant, well, I'm sure there are other ones as well. That's it from me.

Thanks very much, Simon. I've let Simon overrun slightly, so I'm not going to go for questions at this point. We've got a tea break, so I'm going to let you get away upstairs for tea. Back at quarter to four, please, for our last talk and then our closing keynote. Thank you.