Okay, we're back live here at Hadoop World 2012. This is SiliconANGLE.com's continuous coverage, a SiliconANGLE.tv production with theCUBE, our flagship telecast. We go out to the events, extract the signal from the noise and share that with you. I'm joined by my co-host, Jeff Kelly. Jeff, we've got to get something on the record here for everyone about our sponsors. This great independent coverage would not be possible without our sponsors, so go ahead and give them a shout-out. No shortage of sponsors this week: Hortonworks, Cloudera, MapR, Syncsort, DataStax, Lucid Imagination. Thanks to all of them for helping us get here and give you this great wall-to-wall, live, independent coverage. Yeah, and I'd also put a special recognition out to Cloudera, because Cloudera has been a big supporter of SiliconANGLE from the beginning, when we were formed two years ago and doing some of the most cutting-edge predictive analytics around media construction and media developments. They also supported us when we went to HBase. So those guys are great citizens, and now Hortonworks as well for allowing us to do the exclusive coverage. So shout out to the sponsors, give them some extra love. We see them, and now back to our programming. So who's our guest here? We have David Mariani, Vice President of Engineering at Klout. Welcome to theCUBE. Thank you, nice to meet you. So this data business is what you're in, right? And what Hadoop does is open up you as an engineer, and your team, to do so much more with your product than forcing it into a schema of databases, MySQL, trying to figure out this and that, and it's just a disaster. It's a nightmare. So talk about the challenges you have with Klout, the product, and how Hadoop helps you. Well, you know, what's really interesting is what Hadoop brings to the table: a scalable way to store mass quantities of data, right?
A scalable way of not just processing it but also storing it, and storing it cheaply and inexpensively. So Klout is a startup, you know? We're a venture-backed startup, but we don't have all the money in the world. So the fact that we can use open source and build a cluster of commodity hardware to store that data, which is our business, is really important. But you know, you alluded to something there when you were talking about schemas and the world of schemaless. Part of my talk today is about how you still do need schemas on the unstructured data to get the most out of it. To roll it up. To roll it up. So business intelligence hasn't gone away. It's great to have Hadoop and Hive and HBase, and we use those technologies to store and serve our data at scale. But at the end of the day, you do need to do analysis, speed-of-thought analysis, and Hadoop today is still not the right environment for speed-of-thought analysis. Let's drill down on that, because I think fundamentally everyone has gotten to a nice place where they see Hadoop as great at batch. HDFS has got strong MapReduce; even with HBase, I call it the holy trinity of big data: HBase, MapReduce, HDFS. But still, getting the data out of, say, HBase or any other unstructured database is a chore. And we've found that it's hard to configure in a dynamic environment, but in the right use cases it works great. So we're seeing the analytical engines evolve on top of that, getting the data out where it is structured. Can you elaborate on how you see that unfolding, the decoupling of analytic engines from the core batch back end? Well, I can tell you actually what I want. What I want is the analytical engines running on HDFS on the Hadoop cluster, because what I love about Hadoop, what makes it so powerful, is that it's horizontally scalable.
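The roll-up Mariani describes, imposing a schema on raw unstructured events so that BI-style aggregation becomes possible, could be sketched roughly like this. All field names and events here are invented for illustration; this is nothing like Klout's actual pipeline:

```python
from collections import Counter

# Raw, schemaless events as they might arrive from different networks.
# Keys are deliberately inconsistent, as unstructured feeds tend to be.
raw_events = [
    {"net": "twitter", "type": "retweet", "user": "alice"},
    {"network": "facebook", "action": "like", "who": "alice"},
    {"net": "twitter", "type": "mention", "user": "bob"},
    {"net": "twitter", "type": "retweet", "user": "alice"},
]

def apply_schema(event):
    """Normalize a raw event into a fixed (network, action, user) schema."""
    network = event.get("net") or event.get("network")
    action = event.get("type") or event.get("action")
    user = event.get("user") or event.get("who")
    return (network, action, user)

def roll_up(events):
    """Count signals per (network, action): the roll-up BI needs a schema for."""
    return Counter((n, a) for n, a, _ in map(apply_schema, events))

counts = roll_up(raw_events)
print(counts[("twitter", "retweet")])  # 2
```

Without the schema step, the `Counter` would have nothing consistent to group on, which is the point being made: the data can stay unstructured at rest, but analysis still needs a schema imposed at read time.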
If I want to increase my capacity to process data, or my speed processing that data, by a factor of two, I double the number of machines in my cluster and I don't change anything in my software to make that happen. I would love to do that with business intelligence, but business intelligence is 30 years old, right? It started in the 80s. So we're still dealing with SMP types of architectures where everything has to fit on one single host with one dedicated disk. That is inherently constraining, and it makes it so that you can capture and store all this data in Hadoop, but trying to make sense of it means you've got to roll your own. So it's like a throwback to the early 80s, before there was BI; we're kind of back to raw text files on mainframes, trying to write some COBOL scripts to actually make sense of it. I think it's a real throwback, actually. So at Klout. Okay, that's a first: COBOL scripts mentioned on theCUBE, I love it. Okay, throw a little Fortran in there and we'll be rocking and rolling. I never knew COBOL, by the way. Not that old. Yeah, it would definitely put you to sleep at night with all that documentation. I'm showing my age there with my COBOL programming days. It was only a one-credit lab and I only did it in college, I swear. And then I didn't really do COBOL. Thank you. I just did one credit. So let's talk about the science side of it. So obviously Klout, you guys are kind of interesting. As a company, you're a startup, but you're on both sides of the equation. You're in the mainstream social media sphere, which is brutal; people are always complaining and they want a better score. Excuse me, we're getting a little freight train through here. You know, it's a tight venue, so you're going to have some things moving around. Yeah, you know, you want to be in the action. It's the fun of life. They're moving around, bringing the action.
So back to my question: the social media world is on a buzz, so you're always under scrutiny there. But you've also got a platform you're building at the same time. Like we've talked about with SAP, it's like changing the engine of an airplane at 30,000 feet. What is your critical path right now in engineering? What's on your to-do list right now at Klout? Well, you know what, for Klout to be an even more valuable service, we have to reflect changes in people's activities in as near real time as possible. And Hadoop is still really a batch processing system. So when we process our scores, when we process and detect different topics of your interest, your social graph, you know, we still do that in batch. There was a great session on Storm, and the Storm technology really is a combination of real-time processing combined with your own persistent storage. So what we're really moving to on the Klout side, technically, is changing our batch processing of our data and our science, applying the scores, and moving that to be real time. So that ensures the day you tweet, it affects your score and we can display that. So you guys have been busy with your APIs, right? You just have an enterprise API coming out? We just revamped our API last quarter, and it's comprehensive. Our initial API was a couple of calls, and to be honest with you, Klout didn't even use those calls ourselves; we had our own way to access the data. So last quarter we completely switched to an open API, where klout.com and our mobile app are now all using that API. My goal is to have not just klout.com; I want a thousand klout.coms, and I want that innovation to come from the community and the development community, not just from Klout. Because I think there's only so much we can do. There's so much more that can be done with our data, with the right access. Let's talk about data science.
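The batch-to-real-time shift described here boils down to folding each new signal into the score as it arrives, instead of recomputing everything overnight. A toy contrast, with weights and formula entirely invented for illustration (not Klout's model):

```python
# Toy contrast between batch and streaming score updates.
# SIGNAL_WEIGHTS and the additive formula are illustrative only.
SIGNAL_WEIGHTS = {"retweet": 2.0, "mention": 1.0, "like": 0.5}

def batch_score(signals):
    """Nightly batch job: recompute the score over all stored signals."""
    return sum(SIGNAL_WEIGHTS.get(s, 0.0) for s in signals)

class StreamingScorer:
    """Storm-style streaming update: fold each signal in as it arrives,
    so the score is current after every single event."""
    def __init__(self):
        self.score = 0.0

    def on_signal(self, signal):
        self.score += SIGNAL_WEIGHTS.get(signal, 0.0)
        return self.score

signals = ["retweet", "mention", "like", "retweet"]
scorer = StreamingScorer()
for s in signals:
    scorer.on_signal(s)

# Same answer as the batch job, but without waiting for the nightly run.
assert scorer.score == batch_score(signals)
```

In a real Storm topology the `on_signal` step would be a bolt consuming a stream of events, with the running score kept in persistent storage, which is the "real-time processing combined with your own persistent storage" combination mentioned above.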
So we hear a lot about the skills gap, and there's just a dearth of data scientists out there. So how do you guys approach that? How do you go about finding talent, scaling up that team so it can do the work you need to turn that raw data into, ultimately, the end product, your Klout score? Yeah, our value. You know, it's tough. The way we have the team organized is we have a data group, and that data group is made up of science, analytics and the data pipeline. On the science front, they're PhDs. They have a machine learning background or a natural language processing background, and they can code. What's really tough is to find engineers, scientists, who can think and create the models, create the algorithms, and who can also code. So we're actually still working through that process, to tell you the truth, but we definitely need the science and that machine learning, because that's really where the value comes into play. So my big focus is to make sure we have the right people coming up with the right science, and then I'll find somebody to code that up. Interesting. What do you think about the ecosystem here? Obviously Hortonworks has got their messaging, but, you know, not to buy into that right now, this is their show. But within the industry, this is much more of a tech show. A lot more go-to-market business use cases, but not a business-suit kind of show. Share with the folks out there your vibe on this show, because that's who's watching; they're not here. What's it like here? What are you seeing? What are the core themes? What's the audience like? Share with the folks out there what's happening at the show. Yeah, so I've been at Klout for a little over a year now. Before I was at Klout, I was at Yahoo. So Eric- So you know the guys? So I know all those guys. Oh, okay. All the Hortonworks guys that I used to work with when I was running the data engineering team at Yahoo.
So we were their customers, internal customers. I've seen the development of Hadoop from sort of our own custom version of clustered processing of data, because we had so much data, to something that's been open sourced and made more standard. And what I'm seeing, what I really love, the vibe I get, I'm very proud to be part of seeing that happen, being an early user of it, and to see so many different people, companies, startups really starting to do more than what we ever imagined. So I think it's really exciting. So let's talk about the Klout score and some of the science behind it. You don't have to reveal the secret sauce, because obviously there's some intellectual property, but a lot of people are saying, oh, it's just the number of retweets you do and how many mentions you get. Is there more? I mean, obviously there's science behind it. You can look at graph data. Talk about the data involved. And share as much as you can about the Klout score and how that's computed; everyone wants to game the score, obviously. That's what people in the audience will want. But for the tech geeks, talk about the data aspect of it. Yeah, so we have an architecture such that every single source of influence, think of the social media networks, right? Think of Facebook and Twitter and Foursquare; there are 15 of them that we process. They all have their own APIs. They all have their own way of, once you have a registered user, a Klout user, they give us permission with their token to go and access their data that's being generated on those networks. So that becomes a tough, non-scalable model, because for each network we have to write a translator to get that data into a form we can use to compute the Klout score. So part of the- Do you use LinkedIn? Yes, we do. And there are all kinds of signals; actually, every different network has its own signals that are unique to that network.
On LinkedIn, it's about who you're connected to and what their titles are. What companies do they work at? All the context of the subjects that you would be influential in is actually embedded in the people that you're connected to, as well as the things that you've done in your past. So- Does LinkedIn have a terms of service issue? Can you crawl their data? Do you have a deal with them, or how does that work? So we comply with all the use policies of the individual networks. All the permissions are specified and controlled by the end users themselves, the consumers. Do you use OAuth? When they opt in, we use OAuth. And if they opt out, or they disconnect us from the LinkedIn side, then we have no more access to their data. The great thing about Klout, though, is that because the more signals we collect, the better view we can compile about you, you're definitely encouraged to connect as many networks as possible. So on the Klout score going forward, we're always revising it and making it better and more accurate. And you'll see another revision coming out fairly soon that we think does an even better job of figuring out what you're influential in and how you can leverage that influence. Talk a little bit more about Hadoop as a platform for startups and the opportunity it really gives startups. We were talking before we went on air a little bit about your working at Yahoo, obviously large data volumes at Yahoo, but also a very large company. But small companies can also now, with Hadoop, start working with large volumes of data to produce the product, essentially. So talk about Hadoop as a platform for startups and the opportunity that really offers the startup community out there. Yeah, so our business is we collect a billion signals a day for our registered users. That's a lot of data. So even though we are a small company with limited resources, we still end up having a very large data set.
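The per-network translator problem described above typically takes the shape of a small adapter layer keyed by network name. A hedged sketch, where the network names, payload fields and signal tuples are all illustrative (a real collector would first call each network's API with the user's OAuth token):

```python
# Adapter layer: one translator per network, each turning that network's
# payload format into common signal tuples. Fields are illustrative only.
def translate_twitter(payload):
    return [("twitter", "retweet")] * payload.get("retweet_count", 0)

def translate_linkedin(payload):
    # LinkedIn-style signal: who you're connected to matters.
    return [("linkedin", "connection")] * len(payload.get("connections", []))

TRANSLATORS = {"twitter": translate_twitter, "linkedin": translate_linkedin}

def collect_signals(network, payload, oauth_token):
    """Dispatch to the right translator. A real version would fetch the
    payload from the network's API using the user's OAuth token; here the
    payload is passed in directly. No token means the user opted out."""
    if oauth_token is None:  # opted out or disconnected: no more access
        return []
    return TRANSLATORS[network](payload)

signals = collect_signals("twitter", {"retweet_count": 3}, oauth_token="tok")
print(len(signals))  # 3
```

The non-scalable part is visible in `TRANSLATORS`: every new network means writing and maintaining another translator function, which is exactly the cost being described.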
So what's great about Hadoop is the fact that I can purchase commodity hardware. I don't even have to purchase commodity hardware; I can just do it on Amazon EC2. So I can literally, with my credit card, get up and running and have a real infrastructure to process big data. That's awesome. We chose to have our own data center and our own hosts because we need to control those terms of service. But for a startup, it's very inexpensive and very easy out of the gate to get scale, and to scale data, for a very cheap price. So to me, that's what's great about Hadoop: it's inexpensive in all forms, not just from a software-being-free standpoint, but also from a hosting and hardware standpoint. So where do you see your use of Hadoop going in the next year, two years, five years? You mentioned Hadoop has been a fundamental part of your business and how you deliver value to the community. So do you see Hadoop as a long-term investment? Is this something that's going to be the underpinning of Klout for many years to come? What's your view of the long view, five years from now? Yeah, you know what? We can't do what we're doing without Hadoop. We're out of business without that infrastructure. Where I see, where I want Hadoop to go, and where I think there are opportunities, is real-time processing, right? Hadoop is still inherently a batch processing system, and that's just not good enough for the real-time world that we live in. And the other part is bringing intelligence back to the data. So now we can store all this data in an unstructured fashion, but the tools to actually analyze it, make use of it and make money from it, or provide value to consumers, are very, very limited. So those are the two areas where I want to see growth: real time, and intelligence on Hadoop. Intelligence that's scalable on HDFS.
From a technical standpoint, as you mentioned, Hadoop is fundamentally a batch-oriented type of system. So can Hadoop become that real-time system, do you think, or is it going to require a kind of complementary approach that stands next to Hadoop? You know what I mean? There are already lots of things happening, like with Storm, for example, from Twitter. We're actually using that technology, and that's what we're working on to develop our real-time data feeds. And it looks really great so far. And that's something that's compatible with and complementary to Hadoop; it runs on Hadoop, as opposed to being a different system than Hadoop. So I think you're going to see solutions like that, built open source on top of the Hadoop environment, to fill the gaps that we see today in Hadoop as a batch system. So are there any best practices you can share with folks out there who are building on Hadoop and playing with HBase? Maybe for other companies, in terms of what you've learned: off-the-shelf hardware that you could use, other technologies that fit well with Hadoop, either from personal observation or your experience at Yahoo. Because it's all about the speed, right? At the end of the day, you want to get near real time. Commodity hardware is great for storage, so, hey, home run there. The challenge is, what do I use for extraction? Do I write my own? Is there off-the-shelf, Vertica, IBM? Well, you know, so we use Analysis Services, SQL Server Analysis Services, to do our BI. SQL Server Analysis Services is a query engine. I think I alluded to it in the very beginning: Hadoop is not great for interactive queries. It's not an interactive query environment. Hive has a SQL interface, but if you write a Hive query, it kicks off a MapReduce job, and you go and get some coffee and come back and get your answer.
So it's, again, a throwback to the early days before there was business intelligence and before there was OLAP. So at Klout, what we've done is we've married OLAP to Hadoop. Using Hive as a data store, using Hive as a translation layer, I can connect that to a business intelligence OLAP engine like Analysis Services and get the best of both worlds. I can keep my unstructured data on Hadoop, not throw it away, but then put some structure on top of it using OLAP and get the interactive queries that I desire. Another cohesive element that's designed for that purpose. But you made another point I'd like you to elaborate on. You want to put the analytical engine on HDFS. Can you expand on that one more time? Because that was a nuance I want to capture. So if you're going to decouple some of these, the insight package, okay, treating Hive as the layer there, what did you mean by putting the analytic engine on HDFS? Yeah, I think if you think about what makes Hadoop so great, right, it's that when you store a piece of data, let's just say it's a file, it appears virtually to you, to the operator, as a single file, but that file is actually distributed across as many nodes as you have in the cluster, in little bits, stored three times for redundancy purposes. So when I do a query, when I go and access that file, it's a massively parallel table scan across all these individual hard disks out there that I get to take advantage of: dumb hard disks, not one single huge filer that needs to process everything. That's what I want to do with BI, right? That's where all my raw data is. I want to be able to do the BI, do the OLAP and the analysis right there where the data is already sitting, versus trying to pipe it and load it into something else. Got it, got it. It's almost like a caching layer, if you will. Yes. Okay, we're getting the signal.
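The parallel table scan Mariani wants BI to inherit can be simulated in a few lines: split a file into blocks, place each block on three nodes, then scan all blocks in parallel. This is purely illustrative; real HDFS block placement and replication are far more involved, and the tiny block size here stands in for HDFS's 64-128 MB blocks:

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4        # tiny demo block size; HDFS blocks are 64-128 MB
REPLICATION = 3       # "stored three times for redundancy purposes"
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data):
    """One virtual file becomes many little bits."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_blocks(blocks):
    """Assign each block to REPLICATION distinct nodes, round-robin style."""
    return {
        i: [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
        for i in range(len(blocks))
    }

def parallel_scan(blocks, predicate):
    """Scan every block concurrently, the way a distributed query fans
    out across all the disks holding the file, then combine results."""
    with ThreadPoolExecutor() as pool:
        counts = pool.map(lambda b: sum(1 for rec in b if predicate(rec)), blocks)
    return sum(counts)

records = list(range(10))
blocks = split_into_blocks(records)
print(parallel_scan(blocks, lambda r: r % 2 == 0))  # 5 even records
```

The design point is the one made in the interview: the query runs where the blocks already sit, so doubling the nodes doubles both storage and scan throughput without changing the query code.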
David Mariani, VP of Engineering at Klout, ex-Yahoo; he knows the team at Hortonworks. He's here at Hadoop World. You're like a kid in the candy store at Klout. A lot of science, a lot of market success, congratulations. Thank you. They're always trying to tweak the Klout score, so I hope you enjoyed this interview. Hey, I'm a 28, so that goes to show you that we don't game that score. My influence is going down, but that's okay, we have the key. We'll change that. We'll change that. Well, now that you've been on theCUBE, your influence is going to go up big time. So, okay, we'll be right back with our next guest. We're going to have Todd Lipcon come on, who is an HDFS guru and has also done a lot of HBase stuff. Looking forward to talking to Todd, another Cube alum. This is SiliconANGLE.com and Wikibon, theCUBE. We'll be right back after this. Thanks, you guys.