at Big Data SV 2014, is brought to you by headline sponsors WANdisco, "We make Hadoop invincible," and Actian, "Accelerating Big Data 2.0." Okay, welcome back everyone. We're here live inside theCUBE, our flagship program. We go out to the events and extract the signal from the noise. Day 300,000 views later, live audience. Here at Big Data SV, our event covering all the Big Data action and the Strata conference. This is SiliconANGLE and Wikibon's production of theCUBE. We're here with Jeff Kelly, the analyst, and a great guest, Joe Hellerstein, Trifacta CEO, entrepreneur, professor at Berkeley, computer science guru. Joe, welcome to theCUBE. Thanks for having me. We love computer science. Our motto, looking at the angles where computer science meets social science, is something that we're really passionate about, and some of the action that you've been involved with over the past few years at Berkeley and in your career is now flowering out into the mainstream, with the computer science paradigm just really infecting the world in a great way. Obviously with Hadoop, Big Data, the systems, the large scale, everything's happening around the technologies and then people. So I wanna get your take on what's going on. Big Data, from your perspective: you've seen the movie, you've done a lot of pioneering work at GraphLab, and obviously all the work you've done. People weren't really watching five, six years ago. You were, you know, all the geeks knew, all the alpha geeks were there, but now it's like a dream, you wake up from a dream. What's your take, how do you personally feel about this? Well, it's terrific for me. I've been working in data since the 90s. I went to IBM Research after college, worked on some of the relational database technologies there, and some of my code apparently still ships in DB2.
So I've been doing this for a long time, and never has my job been more glamorous than it has been over the last few years, because data has really entered the common culture, not only across computing and technology, but in society, as this incredible asset to be managed, right? And it's just this incredible opportunity to build new things and do new things, and it comes with all the pros and cons and social complications and excitement that something really significant to society brings on. What's the biggest game changer about the data science revolution that most people might not be aware of, that you're watching, saying, hey, that's the most game-changing aspect of this trend? Well, you know, there's the inside baseball, and then there's the exterior. So on the inside, you know, industry stuff, I'd say the big change is that we saw a revolution in the software frameworks and the tooling and just the whole stack for data, which happened when open source suddenly became viable as an alternative to commercial data vendors. And so that has created all kinds of interesting opportunities, and it's also changed the way that people think about working with data. That was facilitated by and coupled with, you know, just the progression in Moore's Law. Things get twice as good every 18 months, storage, compute and all that. At some point that changes things radically, and we're there. So compute is cheap, storage is cheap, generating data is cheap. Suddenly data is this thing that everybody should be playing with. It's not a carefully governed thing deep in the company that a few experts are working with. So that changes the rules for what kind of software needs to be built, what kind of people are gonna use it, how they're gonna use it. So all those kinds of things are on the table. None of it invalidates the history of computing. All that stuff was important too.
It's just that there's this gigantic wave coming in with technology that's gonna be used by different people for different things. So it's just incredible to watch that. On the mainstream side, what do you think their view of it is? So outside of computing, right, what we're seeing is that people have expectations about data now and they have expectations about software now. They use Google every day, you know, and they expect software to work the way the software they use works. And so people talk about consumerization. You know, they want their software to be as good as their iPhone. But I think much more than that, people expect that their software will be as smart as Google is. It will predict what they want. It will figure out how they wanna interact. And this kind of expectation of technology is incredibly unrealistic. And we see that in the enterprise space. To build that technology is gonna take a long period of time. And if you don't have all the users in the world using your software, you're not gonna be that smart that fast. Before we get into some of the news about your 1.0 product, which you guys have announced and shipped, general availability, I wanna talk to you about the talent. You know, you've been a professor. You've seen the young guns come in over the years. You know, it's, again, our generation. You know, I'm in my late 40s and we were systems guys. We were doing low-level stuff, registers, hexadecimal. Then it got easier as the structured languages got more scalable. But now, what are you seeing in the computer science curriculums and in the kind of kids? What are they hungry for? I mean, what's the disruptive, changing appetite or discipline around computer science? You know, I've been teaching database classes since 1995 at Berkeley. When I started teaching, I'd have to explain, you know, there's all that fun stuff about computers. Well, we're gonna talk about data. And that's interesting too.
And, you know, let me explain to you why. You know, a big yawn, right? Right, yeah, yeah. So, then there was a period where it was like, oh, the web is cool. Is this about the web? And I said, well, the web is a lot of text with links and this is sort of about that. Now they come in and they're like, we get it. Data is what computing is for. Data is at the center of most technology. And this is what we wanna learn about. So my class sizes have gone from like 50 to 300, you know, in the course of just a couple of years, and it's not 'cause I'm teaching better. It's sexy to be a data scientist. And I think, you know, one of the things that we always talk about in theCUBE is the simplification, going from data science being hardcore to "I wanna be a data scientist," or "data science for dummies," the kind of books that are being published out there. It's a lot harder than most people think, but it's getting easier. So the trend is, how does it get easier? What do you see? What's your vision on that progression? Is it abstraction, some tooling, is it new stuff? I mean, how do you make that data science actionable? Is it more ontologies? Is it more machine learning? Is it all of the above? What's your take on that? Yeah, well, I think what happened with data science over the last few years is kind of analogous to software back in like the 70s, where there's been a lot of custom code and people have been holding it a little bit close. So there's this mythology in data science that it's supposed to be this task that requires you to be good at business, and it requires you to be good at computer science, and it requires you to be good at statistics. If you can't do all three of those things, game over. So we need a new way to teach students and all this kind of stuff. The truth is, when you talk to data scientists, they will tell you very directly that 80% of their time is spent doing things like cleaning data. And that is a task that is extremely hard to automate.
You can't come up with an algorithm to do it, but it could be so much more intelligently tooled, and you could take so much of the burden of that kind of work away from the data scientist. Two things come out when you do that, and this is the business that Trifacta is in. The first is that the people who are good at data science get to do all the stuff they really wanna do. They are doing statistical modeling and they're predicting what's gonna happen and really playing with data in a rich mathematical way. But also, people for whom it would just be useful to know what the numbers say can get the numbers now, right? They can get in there and they say, you know, I'm interested in sales data but I think weather's predicting what's going on. Let me buy some weather data. Oh, how am I gonna work with that? Now they can actually work with it. They can manipulate it. They don't have to get the IT guys to come in and do some kind of model for them. Yeah, weeks of provisioning, some sort of new connector, you know, a proprietary, weird kind of thing. Now it's just, okay, import, APIs, right? That's right. So talk about the success of your product. You guys launched with the funding and now there's all that build-out. Where are you guys at now? Talk about some of the things you're working on now and some of the success you're having. Yeah, so we were happy to be able to go from a December announcement of our Series B with Greylock and Accel to a February announcement of GA of the product, along with an announcement of some of the initial customers. Lockheed Martin came to talk with us in our session at Strata and they were very vocal about the way they've been using the product.
Lockheed approached us very, very early in our life cycle, when we were really slideware and research software, and they told us, look, you know, the federal government is building petabytes of Hadoop, they're gonna be moving things from legacy into that environment, and there's also de novo data coming for things like healthcare and defense and so on. There's just tons of data coming online. There's an enormous burden on these guys to go fulfill these contracts and get it done, and they're really looking for new software solutions, both at the federal level and in the community that works with the federal government, to do that kind of work. So that was one thing that was fun to be able to hear about in public, to get a customer out there talking with us, and then it's just great to have people seeing the software now that it's out from under wraps. Yeah, talk a little bit about it. Let's dig into the software a little bit. How do you do this? Because I'm sure you're very popular over at Strata these days. The data scientists must love the idea of removing that burden of doing a lot of that transformational work on the data and getting right to the analytics, the sexy part, if you will. But talk a little bit about the secret sauce, to the extent that you can, and how you actually go about doing that. Yeah, so we really faced this problem. We said these are people who work with data, and the problem is not a problem of technology alone, it's a problem of making people more productive. So people are very much at the center of our story. Although my background is in systems like yours, Jeff Heer, my co-founder, has a background in human-computer interaction and data visualization. And the two of us came together with our third co-founder, a grad student named Sean Kandel, who finished his degree at Stanford and is now a co-founder at Trifacta. And we said, look, we gotta attack this problem as an interface problem.
So you know how I alluded to the way people expect their software to work like Google? We approached that problem in some ways in the same way. We said, what could the software do for us in a kind of predictive fashion, the way that Google is able to guess what we're gonna ask and predict what we're gonna do? So the way this works traditionally is that an analyst would go in and write some code to clean up their data, and then they would visualize the data to see if it looks good, and then kind of go back and forth. We flipped the paradigm. We did something we call predictive interaction. We start the user with a visual representation of the data first. And then what the analyst will do is highlight features of the data that are interesting to them. It might just be a string in a printout of the data on the screen. It might be a bar in a bar chart. And based on those highlights, we'll predict what transformations they wanna apply to the data. And then, very much like Google search, you have a ranked list of predictions of what you might wanna do. You can mouse over them and see visually how each one will change the data. So it's very much a visual environment. There's code there that you can look at to make sure it's right, sort of very high-level simple code, but you don't author it. You just kind of look at it and go, yeah, that was it. Let's keep moving. So how do you do that prediction part? I mean, let's drill in on that secret sauce again, to the extent that we can, because that sounds like one of the key innovations here. Yeah, I'll give you a couple of different examples. One example: very often you'll get raw text fields, whether they're in web logs or, frankly, whether they're in relational databases and they're just in a comment field. And there might be sort of standard patterns that people are following in that text. I had a data set from the City of San Francisco that was restaurant inspections and violations.
And in every single comment, they had a date. And it was almost always in the same format. And you can look at this thing and go, the date's over there, it's over there. But for a programmer to write a program for that is actually kind of annoying. So in our software, you highlight an example of a date. You highlight an example of a second date and it goes, oh, I get it, it's number slash number slash number. You're interested in that? I can pull that out into a new column for you. Would you like it? You know what, that column is of type date. And here's a distribution of the dates over time. Does that look right? So all that happens with, you know, just two brushes of two different dates. And you get all this feedback and you go, yeah, that's it, bang. And you move on. Joe, I gotta say, when I met Amr Awadallah around 2007, when he was kicking around the Cloudera concept with Accel, we were like, this is the future. And then obviously my goals and things got together. On the masthead of your company profile, you've got yourself, Sean and Jeffrey all there with your LinkedIn profiles, your Twitter profiles and your GitHub profiles. These are the senior executives. So that's impressive. Now, will your board of directors have GitHub entries? So this is the future. I mean, you guys are pumping out code. Is that the culture of the company? That is the culture. I'll say two things about that. First of all, Jeff and I are professors. Sean was a graduate student and now has a PhD. All of us code. I still code. Jeff wrote, with two of his grad students, D3.js, which is one of the most popular repos on GitHub today, more popular than Linux. It is taking over data visualization in the world. Jeff is not just a coder, he's one of the leading coders of our time, I would say. Sean and I both code. Sean's actually a monster. I code for research, and it was a deliberate decision when I ran this company that I would not be writing code at the company.
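The date example above (highlight two dates, have the tool generalize to "number slash number slash number," pull the matches into a new date-typed column, then show a distribution) can be sketched in a few lines of Python. This is only an illustration of the idea as described in the interview, not Trifacta's implementation: the sample comments, the function names `generalize`, `infer_pattern` and `extract_dates`, and the assumed month/day/year order are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical inspection comments; not taken from the actual San Francisco data set.
comments = [
    "Follow-up inspection on 03/14/2013, violation corrected.",
    "Rodent droppings observed 11/02/2012 near storage.",
    "No date recorded for this visit.",
]

def generalize(example):
    """Turn one highlighted example into a regex string.

    Runs of digits become 'one or more digits'; everything else
    (such as the slashes) is matched literally.
    """
    parts = re.findall(r"\d+|\D+", example)
    return "".join(r"\d+" if p.isdigit() else re.escape(p) for p in parts)

def infer_pattern(examples):
    """Accept the generalization only if every highlighted example agrees on it."""
    patterns = {generalize(e) for e in examples}
    if len(patterns) != 1:
        raise ValueError("highlighted examples do not share one pattern")
    return re.compile(patterns.pop())

def extract_dates(rows, pattern, fmt="%m/%d/%Y"):
    """Build the new column: a date per row, or None when nothing parses.

    The month/day/year format is an assumption made for this sketch.
    """
    column = []
    for row in rows:
        match = pattern.search(row)
        if match is None:
            column.append(None)
            continue
        try:
            column.append(datetime.strptime(match.group(0), fmt).date())
        except ValueError:
            column.append(None)
    return column

# Two "brushes" of two different dates, as in the interview.
pattern = infer_pattern(["03/14/2013", "11/02/2012"])
dates = extract_dates(comments, pattern)

# A crude stand-in for the distribution-over-time view: counts per year.
by_year = Counter(d.year for d in dates if d is not None)
```

A real tool would presumably rank several candidate generalizations and let the user pick one; this sketch keeps a single candidate to stay short.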
And I think that you do have to have a healthy respect for that when you're building an engineering team. You can't be a part-timer. You can't be, like, putting graffiti on the walls. You know, you gotta focus on the business. Let those guys take care of it. So you've got your own little playground. Where do you push your code? Do you have private repositories? I wrote a bunch of code when I was on sabbatical a couple of years ago, and that's stuff I still work on with my graduate students. And actually, one of the things I was proud of at that time is that I started an open source library called MADlib while I was on sabbatical, which is a SQL machine learning library that Pivotal still supports. And it was on stage today at Strata. So that's still alive, although I haven't contributed to it since I started Trifacta. That's the new gold standard for investing. When you look at the masthead, you want to look for the GitHub there. I'm going to have to get my code skills sharp. That's interesting. I just love it. I just had to point that out. It really speaks to the culture. And I think one of the things we get excited about with some of the startups that are doing this work is really the open source leverage. You see it in a new generation. So I have to ask you, you know, folks were raised when open source was revolutionary, it was standing on each other's shoulders, and now, how high are we? What generation are we at with open source? What is the next innovation in open source? Obviously trust continues to be part of it. Code gets better. Linux, obviously, those days are kind of scaling beyond that. A new generation of young kids is coming in. What's your vision for the open source world of code? Is it more of the same? Do you see anything different that's innovating around the communities and efforts?
You know, I got to say, my head is spinning on this whole open source model and what's possible, because every time I think that things aren't possible, somebody has a project that comes out that works. So I've been so excited to watch the folks at Berkeley like Matei Zaharia build out Spark and, you know, the BDAS stack that they've got. I would not have predicted that they could have built the community that they've built around a university research project. It's like nothing I've ever seen. At the same time, I took a very different route with MADlib, where I said, look, the model that Hadoop and Linux and others have had with open source is that you find a corporate sponsor and it doesn't get real till that happens. So with MADlib, you know, because I had a relationship with folks at Pivotal, I said, I want to do this thing. You can see the value of it. Why don't you guys help me out and provide some QA and engineering support to this thing, and I'll get my research friends, and we'll all pour in together. So that was a very different model for getting an open source project going. So, you know, Spark and MADlib are two completely different models. There are others, I think, that are going to come forward, and I think it's very interesting. There's obviously all this energy around the Apache Foundation, which has been super, super productive. There's a bunch of friction there too. Any big organization introduces friction, and Apache does introduce certain friction. So I think there will be experimentation, which is part of the open source movement, and that's pretty fun. I think one of the things I'm seeing, and we're watching, it's just more of a vision, is what happens if you take the open source collaboration concept into the corporate world, with all the big corporate participants now recognizing that contribution is not just marketing but also participatory, where contribution is your marketing.
It's interesting, you know, you're seeing the trend where people cross corporate boundaries with projects like the ones you mentioned. So I think that's an innovation that we're watching, and there is an interesting dynamic there, right? The walls are coming down between companies. We say, hey, you know, I'm a CEO of a company but I've got this little side project, so you guys have that. It brings collaboration front and center and that's going to be fun to watch. Okay, let's talk about some of the things that you like and don't like about the current market. Obviously, one of the things we've heard from other folks here in theCUBE this week is that all these new companies are emerging. It might be too late, and some people are just kind of cloud washing, big data washing. What do you like and what don't you like about this current market? Well, my take on it, and this is where I've placed my bets and entered the market, is that things are going to move up the stack in this next round of innovation for data. So there was an opportunity to really overturn things in the aughts and in the early part of the 2010s, and we saw the MPP database wave go through with companies like Greenplum and Vertica and so on, and then we've now seen the Hadoop companies come through, but man, it is really crowded down there at the bottom now. And that honestly is where my technical roots are, so that's the kind of company you'd expect me to have started. But I look at the opportunity, and this is partly about Moore's Law too. Hadoop proved that a pretty good chunk of software that doesn't have to be perfect is enough to disrupt the likes of Oracle and IBM after 30 years of development, and those are beautiful, complicated pieces of software. So in that environment, the value is with people, it's not with making the machine go faster, in my mind, and that's what I'm excited about, things moving up the stack. You mentioned your partner was trying to solve the human interface problem around data.
The humanization of data is something that's been talked about since Big Data NYC a couple of months ago at Strata in New York. What's your vision around that? Obviously it's important. Is there anything cutting edge that's developing out of the humanization of data? Do we need more automation? Is there more AI? Is there more reasoning coming out of the tooling? Is there anything that you can share from your perspective around the humanization? Because the human role is important. Yeah, two things. First of all, I think data is a medium, like an artistic medium, like clay, or some people talk about data as oil that needs to be refined. I think those are actually apt analogies, because without people to work the data, without people to figure out what story this data supports and what story it's trying to tell, it's just data. It really is just numbers. It's ones and zeros. So that piece of it, it won't humanize itself, right? And then there's the flip side of that, which is how do you get the people who could do this work, who maybe have a question they wanna answer, or an idea they want validated, or who are just gonna explore data, how do you help them be agile and get the creative juices of the human mind running at the same time scale as the crunching of the big data sets? So that part is a challenge between technology and design that really does need more work brought to bear on it. What's your take on the graph database movement? Obviously there's some folks out there doing some stuff. Obviously we're in a network graph, and graphs are graphs, and graphs are great for math. Compare and contrast with relational databases and other databases that we've seen over the years, since there seems to be a cycle of databases. But what's your take on graph in particular? Obviously we hear about social graph, interest graph and relationship graph, things of that nature.
What's your database view of these emerging models? Yeah, a graph is a data structure, right? So I'm a computer scientist at the end of the day. I'm like, well, there's nodes and there's edges. What more would you like to know? There's labels on them, right? But realistically, first of all, relational databases are not great for working with graphs. It can be done, but it's not very pleasant to write the SQL, and second of all, they're usually not tuned for it. But what I think is more interesting is that graphs are not just about the shape of the data, they're about the computational model as well. So if you look at a company like GraphLab, which is both an open source project and a company, they're actually a machine learning platform. They happen to use graphs as their universal model for a whole family of machine learning algorithms. But at the end of the day, graphs are a means to an end, and the end is machine learning. So a lot of the algorithms are about graphs and, yes, a lot of the data is about graphs as well. It's a phenomenal area to do work in. I love it. It's really kind of intoxicating at many levels. So I've got to ask about the machine learning. For the folks out there trying to crack the code, a lot of people are working on machine learning. It's really hard to do unsupervised machine learning. What would you share with those folks who are trying to get that going? How do you get unsupervised machine learning going versus supervised machine learning? Wow, this is a pretty technical question actually. That's great. So unsupervised machine learning means that you take your data, nobody's told you anything about it, and you're going to run algorithms and extract patterns. It's pretty much the same thing as what people used to call data mining. And there's a handful of things that people have done over the years that have worked pretty well.
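The earlier remark about graphs, that they are just nodes and labelled edges and that a relational database can handle them but the SQL is not pleasant to write, can be illustrated with a small sketch. The edge data, the table name `edge`, and the function `reachable` are made up for this example; it uses Python's standard `sqlite3` module and assumes a SQLite build that supports recursive common table expressions.

```python
import sqlite3
from collections import defaultdict, deque

# Hypothetical labelled edges: (source node, destination node, label).
edges = [("a", "b", "follows"), ("b", "c", "follows"), ("c", "d", "likes")]

# Relational route: store the edges in a table and ask "what can I reach from a?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT, label TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", edges)

# Reachability needs a recursive common table expression; UNION removes
# duplicates, which also keeps the recursion from looping on cycles.
rows = conn.execute(
    """
    WITH RECURSIVE reach(node) AS (
        SELECT ?
        UNION
        SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT node FROM reach
    """,
    ("a",),
).fetchall()
reachable_sql = {row[0] for row in rows}

# Graph route: an adjacency list plus a breadth-first traversal.
adjacency = defaultdict(list)
for src, dst, _label in edges:
    adjacency[src].append(dst)

def reachable(start):
    """Return every node reachable from start, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Both routes give the same answer on this toy data. The point of the comparison is only that the traversal is the natural unit of work in the graph form, while the relational form has to express it through a recursive self-join.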
And then there's some exciting new results that we're starting to hear about in the press from Google and others, about kind of a next generation of that. And I would have to say that the new stuff is very nascent, and what's possible there is very unclear to me even as an academic. So there's a lot of promise there, but there's been a lot of promise in AI over many years, and the responsible folks in AI, including the people who are pushing the envelope here, are saying, let's wait and see, okay? So unsupervised machine learning, though, traditionally it's things like market basket analysis and that kind of stuff. And the family of techniques that people know there and have used there is well understood, I would say. And then in supervised machine learning, we've just seen so much done over recent years, in terms of, if you can find some way to get signal out of people clicking on ads or tagging things and so on, you can build models and predict all kinds of stuff. And that also assumes that you're playing with data, right? So again, if you're playing with the data, that kicks in nicely. So that's a flywheel, right? So you see that happening? Yeah, absolutely. And we've been seeing this on the internet now for long enough. Advertising is the place where it came through, right? Because there's a monetary need to do it and people interact with the content. So it's been easy in advertising. We're going to see it in lots of things, retail and so on. Joe, it's been great to have you on theCUBE. We really appreciate you taking the time. We know you're super busy. It's a long week of events and we're all getting a little punchy, but this has been fantastic. I want you to just share with the folks a final comment. This moment in history right now, Big Data in Silicon Valley, you've been involved in a lot of startups as an advisor and professor, you've seen it all. Now you're running your own company, you're hitting your milestones.
What about this moment right now in the industry, with all the stuff going on, Hadoop? Who would have thought that the work that Amr and Mike and the folks at Yahoo did, and at Berkeley and in all the academic areas, is now fully standard, if you will, in business? But right now, what is the core story in this moment? So to me, and this is completely self-serving but also completely true, computer science has been flipped on its head. Data is at the center, computation is the thing that happens to data, and that's how we build stuff. And when you realize that, then a whole bunch of things change, including things like, how do you program computers, particularly when you have a cloud full of them? Well, writing Java is really very painful if you're going to try to program 1,000 computers. But if you think about all the data that's going to flow amongst those computers, and then the computation follows from the data, things become much easier. We're going to see a paradigm shift, I think, in the way that people think about computation. It's going to be very much data focused. Disruption in computer science, it's really great. I think that's going to enable a lot of great opportunities, and on the machine learning side, it's wide open. Great stuff happens when things are on their head. That's opportunity. Joe Hellerstein, a professor of computer science at Berkeley, CEO of Trifacta, a funded company with the VCs involved. All the senior people on the masthead have GitHub profiles and well-colored GitHub contributions. So congratulations, that's great to see. I hope that's more of the future, and I think it will be. This is theCUBE, we'll be right back after the short break. Stay with us.