Live from Orlando, Florida, extracting the signal from the noise. It's theCUBE covering Pentaho World 2015. Now your hosts, Dave Vellante and George Gilbert. Welcome back to Orlando everybody, this is theCUBE. theCUBE goes out to the events. We extract the signal from the noise. We're here at Pentaho World 2015. Derek Matheson is here from CERN. We're getting all the excellence award winners here today, right in a row. Derek, thanks for coming on theCUBE. No problem, very nice to be here. So CERN, obviously a European organization, nuclear research, a controversial topic, a very interesting topic. Heard some of it in the Democratic debate last night. Oh, I missed that. So very interesting. So Derek, tell us more about CERN and your role there. Sure, okay. So CERN is the European Organization for Nuclear Research. That's its official title. But in fact, what we actually do is particle physics. The "nuclear" word is sometimes a bit controversial, and it's not the one we typically use. It's basically because CERN is an old organization, created about 60 years ago, when at the time the only thing we knew about particle physics was the nucleus of atoms. Therefore, it's nuclear. These days we know a lot more, so everyone gets confused, thinking you blow up bombs underground, where all you really do is accelerate those tiny particles. Exactly. Yeah. We don't make electricity; we actually consume quite a lot. We don't do anything with any kind of military application. Everything we do is open science. That's the fundamental purpose of CERN: basically to understand how the universe works, to understand the laws of nature. And basically we bash stuff together underground, subatomic particles, and then study what happens afterwards. I gather, a friend from Oracle told me, that you generate about a petabyte a second of data, the vast majority of which you couldn't possibly store.
In fact, it's actually slightly more than that. We've got four main detectors in the LHC. That's kind of the headline act of CERN, the Large Hadron Collider. And each of these detectors is basically like a 3D camera, but a 3D camera that takes 400 million photographs a second, with a raw data rate coming out of the detectors of around two petabytes a second. Wow. It's a crazy amount of data. In fact, that's not my main mission. I mean, my job isn't physics. I'm not a physicist. I'm actually a software engineer, a computer geek. And what we actually work on mainly is running CERN as a business. Because, as you can imagine, CERN is a big organization. We're an international, intergovernmental organization. And our role basically is to provide infrastructure for the visiting scientists. There are about 12,000 scientists who come to CERN from all over the world to work on the different activities, 1,500 of them, in fact, from the U.S. But really, from across the planet, all the particle physicists come together to work on CERN's activities. And the job of my group is to provide all the software infrastructure to run CERN as a business. We've got a billion-dollar budget entrusted to us by funding agencies across the world. And it's up to us to make sure that it's looked after, that it's spent correctly, that we're auditable, that we adhere to all the various compliance regulations and transparency requirements. So that's really our job. At CERN it's easy to talk about all the physics, but in order for that to happen, you have to have the infrastructure to make it happen. And that's basically my job: to make sure we've got the software infrastructure to run the lab as a business and make it operate. A nice example would be... We're a trans-border organization. Physically, we're actually sitting on the border between France and Switzerland.
So our logistics application is one of the most important services that the organization has to run. We have a lot of customs processes and customs forms that it has to be able to manage just to move stuff around the site. So you can imagine the complexity of the infrastructure we have just running this as a business. One of the nice things we've seen in the past is... basically, we have very exacting requirements. I mean, physicists are renowned for it. They're used to studying particle physics. They're used to knowing things to the nth decimal place. So of course all the business intelligence has to work the same way. And that's basically our job. Okay, so do they get kind of testy sometimes, like TV anchors? Exactly. It's precisely that, yeah. It's the environment we're used to working in. We've got these exacting requirements. And fortunately we also have the infrastructure. We've got the computing infrastructure, and we've got the smart people who actually build stuff that works. Yeah. So you get a chunk of this billion-dollar budget for infrastructure. Yep. Roughly how does it break down? I mean, is it mostly there? You guys don't have, like a lot of companies, to do a ton of promotion. No. Which is where all the startups spend their money. So you can actually spend it on engineering. Yeah, and a lot of it basically goes on infrastructure. Well, more than half of it goes directly on infrastructure. That'll suck up a lot of the budget. Yeah. Okay, so what are you doing with that infrastructure? So the scientists come in, they expect consistent infrastructure. They expect a good experience. Right. Your job is to provide that. Yep. I'm interested in how data has sort of changed CERN. And I mean, that's kind of a stupid question. You've always had data, but how has this ability to deal with data over the last 10 years changed CERN?
Yeah, it's really the accessibility of the data. I mean, way back when I first joined CERN, in 1989, a long time ago now, I started as an intern. And one of the first things that I worked on... Were you working with Tim Berners-Lee? I worked on the world's second web browser. Not quite as cool as working on the first, to be fair, but I worked on the second web browser. Was that on the NeXT? Was that also on the NeXT machine? Well, this was actually a terminal-based one. VT220 terminals, these kinds of things. Yeah, yeah. How does that work? That predates me a little bit, I'm afraid, but yeah. Old technology, the early web, yeah. No underlined hyperlinks or anything like that, but this was the beginning of all of it. And this technology was developed basically because there was a need for the physicists to talk to each other in an efficient way. So they built the web. Now the web has transformed the world. I mean, you can't imagine the world now without some kind of worldwide web infrastructure. Are you doing similar things that could have transformative effects, based on that sort of support infrastructure? Because that wasn't really helping people peel apart these boson particles; it was supporting the work of the facility. Are similar innovations going on? That's actually our principal target now: to make the data available to the people who need to get it. Our message is basically, in the same way search engines today are working really hard to make it easy for you to get on with your everyday life. So you know how long it'll take you to get to work in the morning. And it tells you that because it knows where you live, because it's worked that out. And it knows where you work, because it's worked that out already. It knows the traffic information. So it can tell you all this information.
So what we want is to put that kind of live information at the fingertips of the people who are actually running CERN. So the project leaders, the ones who care about the logistics, the guys who are actually building the different parts of the detectors, they need the information at their fingertips. They're not necessarily even located at CERN. They're probably working in a research institute in Helsinki or somewhere like that. And they still need to have direct access to all the data coming from CERN. So one of our main missions is to try to get it all into one place. This has been the major job we've been doing over the last three years. Like everyone, we've had siloed developments, where you have all the HR data in one place, all the financial data somewhere else. And we need to get that together somehow. And what we did was basically build a team to put everything together into a single data warehouse. And we can take advantage of the fact that we've got access to fantastic hardware, so we can put the whole thing in memory and have an enormous in-memory database to get sub-second, speed-of-thought business intelligence. I'd like to know: how many nodes, and how much total memory? Total memory of a single node is 256 gigabytes. A single node, 256, okay. And we've got access to several of them. At the moment we're just rolling this out, so actually we can run the whole thing on a couple of them. But basically, because we have this kind of infrastructure available to us, which, as you can imagine with Moore's Law, is going to become commonplace in five or 10 years' time, that will not be an impressive number. Vendors say 256 is now pretty much standard. But how big is that cluster, or clusters? Okay, so we're expecting to have on the order of one terabyte of data. That will basically be all the business data. Okay. I mean, remember, we're talking business data here. Like OLTP stuff. It's the usual OLTP stuff, exactly, yes.
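The in-memory warehouse idea described here can be sketched in miniature. This is a toy illustration only, not CERN's actual stack (which the interview doesn't detail): an in-memory SQL store standing in for the warehouse, with a query aggregating OLTP-style business records entirely in RAM. All table and column names are invented.

```python
import sqlite3

# Hypothetical in-memory store standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (dept TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("logistics", 120.0), ("logistics", 80.0), ("hr", 50.0)],
)

def total_by_dept(conn):
    """Aggregate business data without touching disk -- at real scale,
    keeping everything in memory is what enables sub-second answers."""
    return dict(
        conn.execute("SELECT dept, SUM(amount) FROM orders GROUP BY dept")
    )

print(total_by_dept(conn))
```

At a terabyte scale the same principle applies: the whole working set fits across a handful of 256 GB nodes, so queries never wait on disk I/O.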
So we've got stuff from logistics applications. We've got stuff from our standard financial system. We've got lots of information from maybe less common data sources, because we also have a pension fund running at CERN, we also have a school, we have a hotel. Basically it's like a small town: fire department, postal service. We've got all this kind of information going on. And because we have access to all this information, it's all tracked. We've got all the information about it. We can know things like: how long will it take for a parcel to arrive? How long does your order, on average, take to get through the approval process? We know that information about all the operations of the organization, so we can expose it to the end users and give them upstream information for planning. So the legacy systems of record, which were good for automating and standardizing processes and the business transactions they supported, now you're putting another layer on, which gives you information about how to optimize them. Exactly, and that's basically the main goal: to get this information out there so people can actually make decisions based on the information that's trapped in all these legacy OLTP systems. Get it out so we can actually use it. You talked about this operational data, but you also have a lot of scientific data as well. Absolutely, yes, yeah, yeah. So, I mean, you talk to people in the technical computing world, from the sort of HPC days, and they say: big data, big deal, I've always done big data. But you're blending a lot of different data types. So, back to my original question, how has the way in which you handle data changed in the last decade, with all this Hadoop stuff and Spark and Kafka and all these other cool technologies coming out? Has it had an effect, and what effect has it had?
Well, I mean, I think the effect is that, I mean, cloud computing is a kind of everyday buzzword now. CERN has been doing cloud computing since about 2001; we were talking about it then. The LHC Computing Grid is the largest computing grid on the planet, with hundreds of data centers linked together. I can't remember how many, is it 250,000 CPUs in one computing grid? We need enormous infrastructure. This is the bare requirement we need in order to process the LHC data. It's a must-have for CERN to have that kind of infrastructure. It's almost like showing up as a blip on the NSA-type scale. Yeah, it's really a huge amount of infrastructure that we have to put together. So how do you use Pentaho? How do we use Pentaho? To deal with all this mess. So Pentaho is basically our strategic choice for all of the business intelligence of the organization. That's the main use. So we basically got rid of all the legacy systems that we had and consolidated. No more do we have all the different homemade reporting toolkits from a variety of different people, and different vendor-provided reporting toolkits as well. We brought them all together and said: okay, one big data warehouse for all the operational information, one reporting toolkit on top of that, and that's Pentaho. And then behind it, we're using PDI basically to do all the orchestration for all the ETL processes. And there we're basically trying to do things in real time. So we want a data warehouse which is less than a minute behind reality. And that's... You said something, I'm sorry to interrupt, but something behind the data warehouse and the reporting toolkit, but before the data warehouse, providing, you know, near real-time results. What was that? So it's PDI. So basically we're using PDI to close the loop, having it running continuously. There's no longer a nightly extraction job or the typical things that we used to do.
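The "continuously running, not nightly" idea can be sketched as an incremental extraction loop. This is a hedged illustration of the general pattern, not Pentaho Data Integration's actual API: each polling cycle pulls only rows past a high-water mark and loads them into the warehouse, so the warehouse trails the source by one poll interval instead of a day. All names here are invented.

```python
# Hypothetical incremental-ETL cycle: extract only new rows, then load.
def extract_incremental(source_rows, last_seen_id):
    """Pull rows newer than the high-water mark (one poll of a continuous job)."""
    return [r for r in source_rows if r["id"] > last_seen_id]

def load(warehouse, rows):
    """Upsert extracted rows into the warehouse keyed by id."""
    for r in rows:
        warehouse[r["id"]] = r

warehouse = {}
source = [{"id": 1, "event": "order"}, {"id": 2, "event": "approval"}]

# First cycle: everything is new.
new_rows = extract_incremental(source, 0)
load(warehouse, new_rows)
last_id = max(r["id"] for r in new_rows)

# A later cycle sees only rows added since the previous one.
source.append({"id": 3, "event": "access-grant"})
load(warehouse, extract_incremental(source, last_id))

print(sorted(warehouse))  # [1, 2, 3]
```

Run continuously with a short poll interval, this keeps the warehouse within the sub-minute lag described in the interview.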
Now we have information in real time, and that allows us to have processes which can take advantage of the data warehouse in real time. For example, when a contractor arrives on site, that information is recorded in our system and has to propagate to the access system so they can go and actually work on the accelerator complex within minutes, because once they've done their access formalities, they want to go down and actually work on something. So this is really significant. This is what Mike Gualtieri from Forrester talked about earlier and what we've been talking about for a long time, which is the analytic data pipeline and what the latency is. And it sounds like you've gotten the latency down, even just for operational reporting, to a very, very short time. We have exacting needs where we want things to work in as close to real time as we can manage. And for us, close to real time is in the minute range. Yeah, or even under a minute. So where do you see this whole thing headed, as we wrap here? You've talked about the journey that you guys have taken. What's next? What's next? Actually, we're looking at using Pentaho in other areas. I mean, there's a lot of opportunity there, particularly in areas like predictive analytics. We've been using Weka a bit. We think there's a lot more mileage there, to not just predict but actually act on the predictions. We've obviously got a lot of people interested, for our governance, in making sure that what we do is done well, transparently, cleanly. Even things like fraud detection. We have to care about this kind of stuff. And tools like Weka are going to help us with that. Excellent. Well, Derek, thanks very much for coming on theCUBE and sharing your story. Congratulations on the excellence award. We didn't actually talk about that. Maybe let's put in a little plug for it. Why do you think you won it?
What does it mean for you? For us, it's the wholesale change from ad hoc, piecemeal business intelligence solutions built on siloed data, putting it all together into one place and allowing the kind of synergies you can get from having a single solution which serves HR data and financial data and logistics data all from the same point. Simplifying the chaos. Derek Matheson from CERN, thanks very much. No problem, thanks. All right, keep right there. We'll be back with our next guest right after this. This is theCUBE, we're live from Pentaho World 2015. We'll be right back.