…and welcome back to stage one. It's already the second talk about physics on this day, and it's about big data and science. Big data became something like Uber in science: it's everywhere, every discipline has it. Axel Naumann works for CERN, the accelerator lab in Switzerland, and he talks about how physics and computing bridge in this area. He works a lot with ROOT, a program that helps transform data into knowledge. A warm welcome!

Thank you, thanks a lot. Well, you know, when I was discussing this abstract with the science track people, they told me that about 300 people might be in the audience. But whoa, hey, you're huge, that's much more than 300 people, so thank you so much for inviting me over, it's a real honor. Of course, originally, when talking to 300 people who are all interested in science, I thought I'd pick something fairly narrow, focus-wise, but then I learned I'm going to be in XI1, and that's different. So I decided to make the scope a little bit wider, and that's what I ended up with: I'll talk a little bit about CERN and society as well, if you so choose; you'll see what that means in a minute.

The things I'll cover here are: CERN, obviously, just a little bit of an introduction; how we do physics; how we do computing; what data means to us, and I can tell you it means everything, you heard about that already, right; and how we do data analysis in high-energy physics. And just because we've been doing it for a while, and because I've been doing it for more than 10 years, I'm one of the people providing the software for data analysis in high-energy physics, so, you know, because we know what we are doing and we have some experience, I thought you might be interested in hearing my forecast for data analysis in general in the future.

So let's start with CERN. If you wonder what CERN is: you've all heard about CERN, about the fantastic fonts we love to use, and then you probably also heard that we are doing science. We were
founded right after the Second World War, or soon after, basically as a way to entertain those freaky scientists, you know, that was the idea: peace, Europe-wide. And damn, that's working out really well; it's not just Europe anymore these days. We are located near Geneva. We do only fundamental research, so no weapons, no nuclear stuff, these kinds of things. The WWW was invented at CERN, but that was just a, you know, side effect; it happens sometimes that we invent things, but usually we just do science. What we do is we take money, lots of it, and brains who like to discuss and think and come up with ideas, and from that we generate knowledge. It's really all about curiosity.

The things we try to answer: what is mass? Which is a funny question, right? We all know what mass is, but actually we don't. We know what mass is in the universe: we understand that masses attract one another, gravity, which is beautifully correct. And at the small scale, for our particles, we know that mass is energy and we can convert between the two. But we don't understand how these two pictures go together; there is no bridge, they contradict one another. So we are trying to understand what that bridge might be. Part of that mass question is of course also what's out there in the universe. That's a big question: we only understand a few percent of it, 90-and-some percent is completely unknown to us, and that's scary, right? I mean, we know gravity really well, we can deal with freaky things like black holes, and yet we don't understand what's out there.

Now, to do all these things, we are probing nature at the smallest scale, as we call it, so that's particles, and we're dealing with things like the Higgs particle and supersymmetry. Here's a little bit of a fact sheet: we have about 12,000 physicists working with CERN. We are basically the workbench that you saw in Andrea's talk before; we are the table that physicists use. And so they come to CERN once in a while, about
10,000 physicists a year, or they work remotely most of the time, from about 120 nations. So you see it's not European anymore, this is a global thing. CERN itself has about 2,500 employees, you know, those scrubbing the table, setting things up and so on.

And our table is right here. In the far end we have the Alps; it's in Switzerland, as I said, so the Alps are always close, with Mont Blanc. We have Lake Geneva, and we have the Jura, the French mountains, at the lower end here. It's just beautiful, it's really nice. But we needed to stick a 30-kilometer ring in there somewhere, and people would have hated us had we put it like this. Luckily, people were smart back then in the 70s and built a tunnel, much better. So now we have this huge tunnel, and we send particles through it in both directions, near the speed of light. The tunnel is filled with magnets, simply because without magnets the particles would fly straight, but we need them to turn around. Here you see what it looks like; you also see these big holes there that have access shafts from the top, and that's where the experiments are.

That's sort of a sketch of one of the experiments. The LHC is one of the, no, it is the biggest particle accelerator at the moment: a ring with 27 kilometers circumference, 100 meters below Switzerland and France. It has four big experiments and several small ones, and we are expected to run until 2030. So you see that all of this is large-scale, simply because we are trying to make good use of the money we have. Here you see one of the caverns used by the experiments, while it was still empty; the experiment was then lowered through this hole in the roof, piece by piece. And these things are humongous. To give you an impression of how big it is, I put Waldo in there, so your job for the next three slides is to find Waldo; that gives you the scale. He's friendly, waving at you, so it should be easy to find him. So then we put a detector in there; here it's pulled apart a
little bit, so it looks nicer and you can actually see something. You can for example see the beam pipe: that's where the particles fly through, coming from both directions and colliding in the center of the detector, and then things happen and we try to understand what is happening. That's yet another view, a frontal view, of one of the detectors.

And now you have to imagine: you can't just open up Amazon and order an LHC experiment, right? That's not how it works. We do this stuff ourselves, PhD students, postdocs, engineers; that's all done by hand, just like the microscope you saw before. Of course you order the parts, but the design, the whole conception, and actually screwing these things together, making sure that it all fits, is all done by hand. And I find that just beautiful. I mean, that's close to a miracle, right, that people across the globe, no matter what nation, work together to build such a huge thing, and then you turn it on and it works. More or less, but you get it to work. That's not my applause, that's your applause, because you make this possible, really. It's huge. This is for me one of the things I love most about CERN, this international thing that just works smoothly.

Now, the detectors are like a massive camera: we have lots of pixels and we take many, many pictures a second. We do this to identify particles and then, sort of, estimate what has happened during the collision. Now, life at CERN is of course an important ingredient for scientists as well, and if you live at CERN, then actually it's just work at CERN, that's what it's about, but it's not that bad. We hang out together in our control rooms and make sure that the experiments work correctly. We also, you know, study the forces; we have scientific discourse in the sun, with a view of the Mont Blanc and a good coffee; we have lectures and we are lectured; and, like you, we have more laptops than people. And then we do stuff. And so this
presentation is going to introduce you to some of the things we are doing, more on the computing and the society side, as I said. But because I have so much to talk about, I decided that you just build your own talk: you tell me what you want to hear. So let's do this. You can choose between (a) physics and (b) model, simulation and data. You remember these books from the old days, when we were all young? It's that kind of thing. Okay, you decide, you design your own talk here. So, by applause: do you want to hear about physics? Okay. Or the model, simulation, data part? Okay, there we go. So this is what we skip; model, simulation and data it is. You're a strange crowd, it's the first time I meet people who don't want to hear about physics. No, I'm kidding. So, model, simulation and data it is.

Our theory is actually incredibly precise. It's so precise that our basic job is really, really boring, because we already understand everything: whenever there is a collision, we know what's going to happen, except for these very rare things. So we are trying to find those very rare things in this haystack of fairly boring things that we understand really well. The weird things are, for example, monopoles, supersymmetry or black holes. Now, the theorists' job is to tell us what we should be seeing in the detector given some fancy physics; we then use simulation to see how our detector would respond to that. Of course, when we do experiments we are basically just counting, and the question is: how often do we need to see something to say, well, that's not just the ordinary, that is something new, something that could be explained by a weird theory? We use the detector simulation, as I said, to predict how often we expect to see things; we use reconstruction software, which tells us what has happened, or might have happened, in the detector, to count how often we actually saw something; and then we use statistics to compare the two and say whether something is expected or not. Now, that's fairly abstract, but
it's a fairly common approach. Take, for example, climate versus weather. We always have temperature fluctuations because of weather, and the question is: is a rise in temperature a weather effect or a climate effect, a large-scale change or just a short-term fluctuation? There we have a very similar problem. What you do is measure temperatures and try to detect abnormal variations, and you can improve that by measuring longer, say for 300 years instead of 20; that gives you a better prediction of what to expect in the future. Larger deviations help, too: if you look for something that is just 0.1 degrees, you might not be able to find it; a deviation of five degrees you will definitely find.

For us it's very similar. Here we have a plot, one of the first Higgs discovery plots, and you can see that it has many ingredients. The black dots are what we measure, and they have a certain uncertainty, because when we measure, we count, and we might have missed something, or we might have seen more than we should have, so there's always an uncertainty. And then we also have theory, which tells us how many we should have seen. The red part is something that we know exists; it's nothing spectacular, it's simply what theory tells us we should be seeing. And you can see that the data follows the red part fairly well. But then there is this other bump in our dots, to the right of the center, and that does not make sense unless you take the Higgs into account, which is the light blue part. So here you can see how this interplay between different sources of physics and statistics works for us. Now, just as for the climate, more data helps, and there are two versions of more data: more data by having more collisions, which is why we are running 24/7, or more data by
combining different analyses, which is what's happening here. Here you see all these different analyses; if you combine them, you of course get a much stronger prediction, in this case of the Higgs mass, than if you take any single one of them. You see how similar what we are doing is to any of the big data analyses out there. Okay, so that was that part.

Now comes the obligatory part again: computing. When the LHC was being designed (not by me), people needed to project computing power from 1990 to 2000, 2010 and so on, and they said, well, we need massive amounts of computers. And, yeah, everybody has them, we have them as well: we have our racks of computers. This is something that the big companies usually don't show you: there is actually a ramp where the trucks arrive and offload the things, and then somebody screws them together, and then it looks shiny. This is how we are spending our CPU time: we have about 60,000 cores spinning all the time for us, distributed around the world. You can see that CERN, for example, is the red part there near the bottom. So we make good use of that. We also monitor the efficiency, and because 100% efficiency is for beginners, we are actually about 700% efficient. Don't ask why; they decided that if you're multi-threading, your efficiency gets multiplied by the number of threads you have. Makes no sense to me. We also have storage: currently we use about 0.7 exabytes, out of 1.7 exabytes available, so that's good, we make use of the storage we have. And it's, you know, tera, peta, exa, so that is a lot. Here on the right-hand side you see, for example, the tape usage at the bottom, and you see this dip: that was before we started the accelerator again, we needed to make some space. So we monitor our disk usage all the time.

Hey, here comes the next decision point: do you want to hear about, one, distributed computing,
or, two, measuring the effects of bugs? Who's for one, distributed computing? And who's for two, measuring the effects of bugs? Okay, so that's my call, and I would say we do measuring the effects of bugs, because it's shorter.

So this is one of the views, electronic views, you can get from a detector, and you see how we trace the particles that fly through it. Now, that's software, right? That's the result of software. And you might not believe it, but we have bugs in there. These bugs are sometimes wrong coordinate transformations, so things don't go this way but that way, which is kind of weird if you look at it, and the result is that our particles are not assigned the path they actually took; we attribute a different path to them. Now, the nice thing is that we are doing this a million times, right? So all of that is mirrored; we are not systematically doing it wrong in one direction, we are just always doing it a little bit wrong. The net result is that when we measure our particles, we will not measure exactly the right thing, but always a little wobbly to the left, wobbly to the right; things are not as precise. That's simply an uncertainty. So for us, just like counting has an uncertainty and predictions have an uncertainty, software bugs introduce another source of uncertainty. And here you can see how we track uncertainties for all of our analyses: we try to understand the different sources of uncertainty, and bugs are one of those sources. So if we find a bug, we reduce our uncertainty, and we can find new physics earlier, instead of having to wait and collect more data. For us, finding bugs is really key; we really love to find bugs, because it brings physics closer. I thought that was interesting: it's kind of rare to be in an environment where you're able to measure the effect of bugs.

Okay, so now we'll be talking about data. I told you that we are trying to find particle traces in
our data, and the way we do this is by using reconstruction programs. These are multiple gigabytes of binaries and shared libraries and stuff; they are huge, they are experiment-specific, and they are curated by the experiments, open source for some of them, and we want them to be correct and efficient.

The data format we use is not comma-separated values; it's binary, and for some strange reason it's our own custom binary format. The reason is that it's really targeted at the kind of data we have: we have collisions that are independent, so we only need one in memory at any time, and we have nested collections, which makes the regular table layout a non-starter. We actually generate the format from C++ objects, from C++ class definitions, and we can read the data back into C++, but also into JavaScript or Scala. Databases just didn't do it for us: they have the wrong model of data access, they don't scale, it's just not the kind of system that works for us. Also, using a file system as the storage backend might sound very traditional and boring, but it works amazingly well and seems to be future-proof, so that's just the way to go for us. There are many other structured data formats out there; many of them did not exist when we started ROOT, our own data format, but they also miss many things. For example, we wanted to make sure that we have schema evolution support: we can change the class layout and still read back all data. We don't want to throw away all our data just because we are changing a class. Also, we do not trust people; as a computer scientist, or whatever you are, you probably know what I'm talking about, right? If people have to write their own streaming algorithm, there will be bugs, and we will lose data, and we really don't want that. So we automate this, based on the class definition.

So, last decision point for the story: do you want to hear about Cling, our C++ interpreter, or about open data and applied science?
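That schema-evolution requirement, reading back old data after the class layout has changed, can be sketched with a version tag in each record. The following is a minimal, illustrative Python toy with invented field names; it is not ROOT's actual on-disk format, which stores a full description of the class layout in the file itself:

```python
import struct

# Toy schema evolution: version 1 stored (energy, charge); version 2
# added a timestamp. A one-byte version tag lets new code read old
# records, filling in a default for the field that did not exist yet.

def write_event_v1(energy, charge):
    # little-endian: 1-byte version, float32 energy, int32 charge
    return struct.pack("<Bfi", 1, energy, charge)

def write_event_v2(energy, charge, timestamp):
    # version 2 appends an int64 timestamp
    return struct.pack("<Bfiq", 2, energy, charge, timestamp)

def read_event(buf):
    """Read any version; old records get a default timestamp of 0."""
    version = buf[0]
    if version == 1:
        energy, charge = struct.unpack_from("<fi", buf, 1)
        return {"energy": energy, "charge": charge, "timestamp": 0}
    if version == 2:
        energy, charge, ts = struct.unpack_from("<fiq", buf, 1)
        return {"energy": energy, "charge": charge, "timestamp": ts}
    raise ValueError(f"unknown schema version {version}")

old = read_event(write_event_v1(13.6, -1))   # timestamp defaults to 0
new = read_event(write_event_v2(13.6, -1, 1700000000))
```

The point is that the reader, not the writer, absorbs the schema change, so no file ever becomes unreadable; automating exactly this from the C++ class definitions is what removes the "people write their own streaming code" failure mode the speaker describes.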
Let's start: who's for option one, the C++ interpreter? Okay. And open data and applied science? That many, yeah, I'm heading there. You can look at the slides later. Okay, so there we go. Really? No, the slide number is wrong. Oh, a bug! So, open data and applied science it is.

You really want to know about our budget, I understand that. We get from you about one billion a year, and the currency doesn't really matter anymore at this point in time. That is a lot of money, and we try to do really wonderful things with it. I mean, we really enjoy our job, we love it, it's fantastic to work in such an environment, and thank you very much for making that possible, really, I mean it. It also means that you decided, as a society, to enable something like CERN, which I think really deserves my applause, and probably yours as well; I think it's a great decision to do something like this. So we realize this, right? We realize that we can do what we do because of you, and we try to react to that by giving back what we do: software, research results, hardware, and data.

The way we share research results is through open access. We have it, finally; it took us a long time to fight with publishers and, you know, the establishment, but now we have it. Yes, thank you. We also put a lot of effort into communicating our results and what we are doing, and if you're in the region, it's definitely worth a visit. The URL is really easy to remember, visit.cern, you know, it works. And you should go there by April, actually, if you can, because then you can ask people how to get underground, as the accelerator is off at the moment. We also do applied research. For example, we have this super cool experiment where we study how clouds form based on cosmic rays, so the influence of cosmic rays on cloud formation, which is a key element in the uncertainty of climate models. We are trying to think about how to make energy from nuclear waste, so
getting rid of nuclear waste while making energy from it. And we are trying to repurpose detectors that we have, and, you know, develop things. We have open hardware, for example White Rabbit, deterministic Ethernet. We have open data, and we have LHC@home and some other programs where you can donate either compute power or your brain and help us get better results. We explicitly try to use open source as much as possible, and also to feed back whenever we see issues.

But we also create open source. For example, we created Geant4, a program that allows you to simulate how particles fly through matter, used for example by NASA. We have Indico, which allows us to schedule meetings, upload slides, these kinds of things, across the globe, lots of people, with access protection and so on, and it's open source. We have Davix; did I mention we love HTTP? That's the NeXT machine of Tim Berners-Lee, and that's his futile effort to prevent the cleaning personnel from switching it off; they don't speak English, or at least they did not back then. So, we use Davix to transfer files over HTTP with high bandwidth. We have CVMFS, which allows us to distribute our binaries across the globe without relying on admins downloading stuff and making sure it actually runs, these kinds of things; that is a lifesaver, it's really fantastic, it's a great tool, but nobody knows it. And we have ROOT, but that's coming up.

So now the last official part of this presentation: how do we do data analysis? Not like that. We use C++, and actually physicists need to write their own analysis in C++; we have very few people with an actual education in programming, so that's sort of a clash. As I said, we need to keep only one collision in memory, and what matters for us is throughput: we want to analyze as many collisions as possible per second. What we can do is specialize our data format to match the analysis, because we don't
want to waste IO cycles if we can make better use of the CPU instead. ROOT has allowed us to do this for 20 years; it's really the workhorse for analysis in high-energy physics, and it's also an interface to complex software. We have serialization facilities, we have the statistical tools that people need, and we have graphics, because once you have done your analysis you need to communicate it to your peers, convince people, publish, and so on; that's part of the game. All of that is open source, and of course all of that is not only used by high-energy physics.

So, to conclude: we are here because you make it possible, thank you very much, it's fantastic to have you. We want to share, and we have great people for science outreach, but basically nobody for software outreach, so maybe it's worth a look to see what CERN is producing software-wise. Scientific computing is nothing new, it has existed for a long time, but we had to start fairly early on the large scale. When we were building it up, we tried to take pieces that existed and did not find much, so now we ended up with C++, data serialization, and efficient computing for non-computer scientists. Even in the part that was skipped, in one of the alternate tracks, you would have seen that we have a Python binding as well for the whole software stack in C++. And for us, what matters most is scale. Now we are seeing that we are not the only ones: many more natural sciences are arriving at a similar challenge of having to analyze large amounts of data.

Now, I promised to be bold and make a few statements about what will happen with data analysis, and not just in science, because what we see is that we actually educate the people who will do data analysis, not just in science. In the past, data volume mattered most, so more data meant more power. That's not the complete truth anymore: it's a lot about finding correlations. So
even with the amount of data not growing anymore, because it's already humongous, we try to squeeze more knowledge out of it, and for that, IO becomes important and CPU limitations become the crucial factor. We see that multivariate techniques are still rising, and they will just be part of the toolchain of statistical tools, except for the generative parts, which I believe will change the way we model. Now, based on what I just described, this is not a big surprise anymore: as we need throughput, we need a language for the core analysis part that is close to the metal, something like C++. On the other hand, writing analyses is still complex, so you need a higher-level language, and for that people could, for example, use Python. So now language bindings become relevant all of a sudden; they are much more important in the future. And we need to tailor IO to the actual analysis, to not waste CPU cycles. Throughput is king, in my point of view, also in the future, and we will see much more effort going into increasing throughput.

Okay, so that was it. In case you want to discuss anything with me, like "that's just wrong", that's fine, I probably have several bugs in there. I'm still here until tomorrow, I don't know where yet, so I'll wander around, and you can contact me by email or Twitter. Thank you very much for your attention. Thank you.
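The "tailor IO to the analysis" point from the closing forecast can be sketched in a few lines. This toy, in plain Python with invented field names, contrasts a row store, where every field of every event must be touched, with a column store, where the analysis reads only the one array it needs; ROOT's actual branch-wise storage is far more involved, but the throughput argument is the same:

```python
# Toy comparison of row-wise vs column-wise event storage.
# Field names ("energy", "n_tracks", "trigger") are invented for the example.
events = [
    {"energy": 13.0, "n_tracks": 42, "trigger": "mu"},
    {"energy": 7.5,  "n_tracks": 17, "trigger": "e"},
    {"energy": 13.6, "n_tracks": 88, "trigger": "mu"},
]

def mean_energy_rows(rows):
    """Row store: every event record is deserialized just to get one field."""
    return sum(e["energy"] for e in rows) / len(rows)

# Column store: one contiguous array per field, built once at write time.
columns = {key: [e[key] for e in events] for key in events[0]}

def mean_energy_columns(cols):
    """Column store: only the 'energy' array is ever read from storage."""
    energies = cols["energy"]
    return sum(energies) / len(energies)

# Both give the same answer; the columnar one touches a third of the data.
assert mean_energy_rows(events) == mean_energy_columns(columns)
```

In ROOT terms, this is what reading a single branch of a tree buys you: the analysis pays IO and decompression only for the quantities it actually uses, which is exactly the throughput the speaker argues will dominate future data analysis.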