Live from Orlando, Florida, extracting the signal from the noise, it's the Cube, covering Pentaho World 2015. Now your hosts, Dave Vellante and George Gilbert.

Welcome back to Pentaho World, everybody. I'm hanging in there, George. I know my voice is gone, and I apologize for that, but I'm going to be with you guys for the bulk of the day if I can. James Dixon is here. He's the CTO of Pentaho or, as his card says, the Lord of the Ones and Zeros. This is binary James. Welcome to the Cube.

Thank you. Thank you very much.

Tell us about your perspective on Pentaho and Pentaho World, and then we want to get into the architecture and your role.

Sure. The event is super. It's always fun to see our customers and partners, what they're doing with the platform, the ways that they're extending it and expanding it. So it's always a lot of fun to see this many people in one place that you can talk to and exchange ideas with. The focus for me at the moment within Pentaho is on our IoT initiative, and not just in Pentaho but all across Hitachi. Hitachi makes a lot of things. Over nine hundred and fifty companies make up the whole Hitachi group. They make a lot of sensors, they make a lot of devices, trains, power stations. So Hitachi is actually very good at the things, and obviously with the connectivity, the Internet of Things becomes something that's interesting, particularly from an analysis and visualization perspective. How do you handle all of that data? What do you need in place to orchestrate and manage the control of all those things? And how do you get the timely events processed in the right order, quickly enough?
So for instance, if I'm driving my car and all of a sudden I get something in my tire, I've got a flat that's about to become a blowout. If I'm notified in time to pull over two lanes and stop, that's a five hundred dollar problem. If I get a blowout, hit the railing in the median, hit three cars on the way over and end up upside down in a ditch, that's no longer a five hundred dollar problem. And if you look at a lot of the catastrophes in the oil industry, a lot of those could have been avoided by incredibly rapid automated detection to stop a chain of events. And it's complicated because, let's say a pump explodes. The pump can't tell me that it's exploded, because it's exploded. But with the pressure sensors on either side, maybe on one of them the pressure drops to zero, and on the other side the pressure goes through the roof. Those are the devices that can tell me about the problem with the pump in the middle. So it's not just a case of looking at the data coming from one device. You have to infer things, sometimes from other devices that give you information about what the actual problem is.

So what kind of infrastructure change has to occur for that vision to become a reality? It sounds like Pentaho is ready for it. Is the world?

There are a lot of layers in that IoT stack. There are 10 or 12 different layers of things that you need in there. One of the things that I'm working on at the moment is a thing that I'm calling state analytics, which is the analysis of state: exactly how are things right now? If you look at the classic applications, HR, ERP, CRM, from a technical perspective they're called state machines.
They basically store the current state. The HR system knows my address, it knows my name, my phone number. It knows all of those things, and it's actually redundant, because I know those things. It's storing that information about me in a central system so that people can access it easily and report on it. It's essentially redundant, but it's good that the system knows it, because people don't call me all the time asking for those pieces of information. The problem with those systems is that they only remember the current state of everything. They don't remember the history. If you ask a CRM system how many customers you have, it can tell you immediately. If you want to know how many customers you had yesterday, it has no idea. You have to go back through these things called change logs: okay, we added a customer, we lost a customer, we lost another one, we added three customers. You have to go back through this log, adding to and removing from the count, to find out the number of customers. And so the analytics industry has been focused on time series analysis, because time series analysis was really hard, because the systems didn't understand their own history.

So just to bump that up a little bit, because as Dave's suggesting, I like to draw the most technical conversations out of our guests and then attempt to make them more digestible for a larger segment. It sounded to me like you're saying the analytics industry had to backfill for what these old legacy systems of record could do, in that the legacy systems didn't really keep track of history. So we had to build up new systems that did.

Yes. Yeah, and so we now have those systems for doing history. What we haven't realized in the IoT space yet is that on your device, like an iPhone, there's a lot of information on here.
That's only stored on my phone. What my battery level is, how much space I have left, that's only stored on my phone. It's not stored centrally and redundantly like it would be in an old HR system.

So in other words, now we have a new set of systems, devices, that have some amount of history on them, but the systems that are keeping track of them, that are managing them, don't have that same notion of stuff that's happening over time?

Yeah. So the stuff that's happening over time is quite easy, because that's what you put into the data lake. Part of the IoT architecture is a data lake, where you take all the information you get from the devices as they're running and pour it into the data lake so you can do history and trending. The thing that's missing is this. Let's say you've got smart cars. They continue to send you their location until someone turns off the engine and gets out. Let's say you have a question: how many vehicles do I have with enough gasoline to get to Baltimore in the next two hours? Well, if the car has been sitting in the parking lot for three days without being on, we don't know where it is anymore. The car knows where it is, and it will tell you as soon as someone turns the ignition back on, but the central system has no idea where that car is. We would have to go back into the data lake and look at the history of that vehicle to work out where it is right now. So a practical application would be, let's say I've got a predictive model of when a pump will fail, and let's say the two main indicators of pump failure are pressure and temperature. We can deal with high pressures. We can deal with high temperatures. We can't deal with high pressure and high temperature together. And when these devices are running, they only send the information that's changing. So let's say the temperature is fairly constant, but the pressure changes frequently. So we'll get pressure changes every couple of seconds.
We'll get the current pressure, but the temperature might be constant for days. And let's say the pressure ticks up a little bit. Now the mathematical model needs to know the current temperature. We might have to go back through a billion records to find out when that pump last told us what the temperature is. And if we're getting a thousand events a second, we can't go back through a billion records a thousand times a second, trying to piece the parts together.

So tie this back to how we use Pentaho and traditional systems of record today. As Dave was asking, it sounds like something that's very different, because to understand pressure and temperature, when we haven't seen any temperature results for ten billion intervals...

Yeah.

So how does Pentaho have to change to accommodate this type of application?

So the system that's necessary for IoT, as I see it: there's an architecture called the Lambda architecture, which is basically a data lake and a real-time system...

Okay, and so we're now adding real time. Bump that one up a little bit.

Yeah, so you've got these three main systems. You've got the data lake for the history. You've got a real-time system for handling the streaming events. And then you've got the state repository for knowing exactly how each device is right now. So the challenge for Pentaho is that we have to work with all three of those data sources, be able to query them, and be able to join the results together as questions are being asked. We've also got to handle blending. If I've got a subscription service, I don't really care how many devices I'm selling, because I sell a subscription. I'm giving the devices away. I'm actually losing money when I gain a customer, because I'm giving the device away. So what I care about is: are people using my service?
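
The state repository described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Pentaho's implementation: the class name `StateRepository`, the `at_risk` rule, the device id, and the threshold values are all hypothetical. The point it shows is that if each change-only event is merged into a per-device record as it arrives, the model can read the current temperature directly instead of searching the history for the last time it was reported.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StateRepository:
    """Hypothetical store of the last reported value of each attribute per device."""
    _state: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def apply(self, device_id: str, changes: Dict[str, float]) -> None:
        # Devices send only the attributes that changed, so merge the new
        # values into the existing record rather than replacing it.
        self._state.setdefault(device_id, {}).update(changes)

    def current(self, device_id: str) -> Dict[str, float]:
        # Return a copy of the last known values for this device.
        return dict(self._state.get(device_id, {}))


def at_risk(state: Dict[str, float],
            max_pressure: float = 100.0,
            max_temperature: float = 80.0) -> bool:
    # Assumed rule for illustration: only the combination of high pressure
    # and high temperature is treated as dangerous.
    return (state.get("pressure", 0.0) > max_pressure
            and state.get("temperature", 0.0) > max_temperature)


repo = StateRepository()
# The temperature was reported once, possibly days ago.
repo.apply("pump-1", {"temperature": 85.0, "pressure": 60.0})
# Since then only pressure changes have arrived.
for pressure in (70.0, 90.0, 105.0):
    repo.apply("pump-1", {"pressure": pressure})

alarm = at_risk(repo.current("pump-1"))
```

Each event costs one dictionary update, and each model evaluation costs one lookup, regardless of how many records have accumulated in the data lake.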
So now I want to look at how many devices I sold, which is in my sales system, and how much my service is being used, which is in my data lake and my real-time system. So I need to blend those two data sets together in a reliable, accurate, and governed way. So blending becomes very important: blending these different data sources together and making sure that's accurate. Security, metadata, all these things play heavily into the complexity of the system. So for Pentaho, basically, it's the orchestration and the joining of all these different data sources.

So James, you think about this stuff all the time, I can tell. You live it. You think about it at night, you think about it during the day, while you're eating breakfast, I'm sure. What is your role specifically, and what's with the titles?

So, with the title first. Originally the title I gave myself as CTO was Chief Geek. I'd had that title for six or seven years without a promotion, and so I just decided to give myself a promotion from Chief Geek to Lord of the Ones and Zeros. I'm not sure what I'm going with next. Actually, in my email... it's the stupidest feature in Outlook: you can create multiple email signatures and tell it to randomly put one of those in as your email signature. So in my email signatures I've got Chief Geek, I've got Lord of the Ones and Zeros, I've got Duke of URL, Baron Von Tech.

Oh, Duke of URL. That's great.

Baron Von Tech, and I've got Big Data Ninja. So I've got a collection, so if you get an email from me, you don't really know what my title is, and it changes the next time I reply to you. So that's the story behind the name. So really, what do I do?
When we started looking at big data, that's when I came up with the whole idea of the data lake. The data lake concept was something I came up with six or seven years ago, when I started looking at the big data technologies. I was with our CEO, now chief strategy officer, Richard Daley. We drove up and down Silicon Valley, up and down 101, talking to the Hadoop early adopters. What were they doing with it? What were they trying to achieve? What was the business purpose? What were the technical hurdles? And then I tried to take all of those use cases and distill the common elements. Most of them involved structured data. But then I tried to come up with an analogy to help explain to people that there's a set of use cases that can be looked at one way. And I tried a number of different analogies before settling on the data lake. It's not an ocean, because a lake gives you the idea that it's confined in some way. It's constrained. It's bounded. But lake water is not clean water, so this is raw. It's not processed, it's not filtered, it's not distilled. So I compare the data lake to, say, a data warehouse, which is highly structured. You've got aisles and pallets. So it's like getting bottled water out of a warehouse. It's highly organized. The water is cleansed and purified, in bottles.

Yeah, so it's managed in a way that it's easily accessible and easy to navigate.

Yeah. But the data lake is the complete opposite. It's raw. It's unstructured. It's just an enormous amount of raw data.

So how does your vision of IoT affect architecture and product? How does it get translated into something that I can buy?

So first we look at... there are the technical aspects of what the platform needs to do, and there's the business purpose of this platform. Is it something that we sell? Is it an enabling technology for all of Hitachi's many businesses?
Is this an open source thing that we're making available to everyone? So we need to decide what the business purpose of this platform is. And then we get to the technical nature of what all the different pieces in there are. And it gets complicated if you consider something like: should this be cloud-based, or should this be on-premise? There are some people that will want this on-premise. There are some people that will want it in the cloud. So now we've got something that needs to run in two environments. And when it comes to the database, some people will want to use Hadoop, some people will want to use Vertica or Greenplum. So the platform needs to be very flexible. So really my job, with several other groups in Hitachi, is to define what this architecture looks like, and then we'll go off and build it.

So when you talked about driving up and down 101 and looking at the early use cases of Hadoop, I think many of them were... I mean, the earliest of course was the web crawl. But Yahoo talked very specifically about making it easy to change a data warehouse pipeline. Now we've got buzz around the Internet of Things, but we always have clear use cases in mind. Have you seen a couple of early use cases that have things in common, the same way the ETL offload did? Or data warehouse offload?

Yeah, absolutely. So some of it, you look at smart buildings and smart cars. There the exercise is around optimization of resources, so reducing electricity, reducing fuel costs. We're working with one of our customers, and their goal is basically to save five million dollars a year in gasoline costs for pumps.
Sometimes they leave pumps running longer than they need to, and they estimate they can save five million dollars a year just by shutting pumps down sooner, but they need mathematical models to work out exactly when to shut those pumps down. So you've got the optimization. That's one use case. Capacity planning is another one. So you're looking at: I've got all these devices, and as people buy more and more devices, I need more and more infrastructure to handle serving up the data that those applications need. So there's the capacity planning aspect, and then there's the failure prediction.

Failure prediction?

Yeah. So failure prediction, or something like fraud detection, where as each event comes in you run a mathematical model to determine whether there's a problem or not.

So would those three examples have common elements among them, or would each have its own optimal way of configuring Pentaho and some additional technologies, if any?

Yeah, so there's a large amount of overlap, technology-wise, in all three use cases. For instance, in failure prediction you need a mathematical model. You cannot build the mathematical model without looking at all of the history in the data lake. So you need the data lake there for the modeling purpose. But at runtime the mathematical model is running as the events come in, and you don't need the data lake anymore.

Okay, so there are some phases. So this is the classic: I build my prediction in batch mode and put it into production on near-real-time or real-time data.

Yes.

It's that simple?

Yes.
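
The two phases just described, building the model in batch over history and then running it on each incoming event, can be sketched as follows. This is a rough illustration only: the "mean plus k standard deviations" threshold is a stand-in for whatever mathematical model would actually be used, and the function names and sample readings are hypothetical.

```python
import statistics
from typing import Callable, Iterable, Iterator, List, Tuple


def fit_threshold_model(history: List[float], k: float = 3.0) -> Callable[[float], bool]:
    """Batch phase: scan the historical readings once and derive a limit.

    The history list stands in for data read from the data lake.
    """
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    limit = mean + k * spread

    def is_anomalous(reading: float) -> bool:
        # Runtime phase: only the fitted limit is needed, not the history.
        return reading > limit

    return is_anomalous


def score_stream(model: Callable[[float], bool],
                 events: Iterable[float]) -> Iterator[Tuple[float, bool]]:
    # Apply the already-built model to each event as it arrives.
    for reading in events:
        yield reading, model(reading)


# Batch: build the model from history.
history = [50.0, 52.0, 48.0, 51.0, 49.0, 50.0]
model = fit_threshold_model(history)

# Real time: score incoming events without touching the history again.
flags = [flagged for _, flagged in score_stream(model, [51.0, 49.5, 70.0])]
```

Once `fit_threshold_model` returns, the history can be dropped from memory. The scoring loop depends only on the fitted limit, which mirrors the point that the data lake is needed for modeling but not at runtime.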
Yep.

Would that same model work for capacity planning?

So capacity planning is looking purely at the data lake, basically looking at history and time series. With capacity planning you very seldom need access to the real-time data, because you're looking at trends over weeks or months, and you don't need the real-time aspect as much.

That's really clear. So then what about smart buildings and cars, where you're optimizing resources? I would guess that sounds like another one where you need both.

Yes, that is. So you need to look at the history. You might have an example where, okay, for the next few hours I only have one meeting booked on this floor. If we move that meeting to another floor, now I can shut down that entire floor: every light bulb, the AC. I can shut off that floor just by moving a meeting. So it's that kind of optimization, looking at all the different resources.

Saving electricity, saving power.

Yes, there's definitely an element of analyzing the history, but then it again comes down to the real-time: which devices are on, and which ones can I turn off?

James, we have to leave it there. Thanks very much for coming on the Cube and sharing your vision of IoT. It was great to have you on.

Yeah, thank you very much.

All right. Keep right there, buddy. We'll be back with our next guest right after this. This is the Cube. We're live from Pentaho World 2015. Right back.