theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks: We do Hadoop. And headline sponsor WANdisco: We make Hadoop invincible.
Okay, welcome back everyone. We are live in Silicon Valley in San Jose for Hadoop Summit 2014. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE.com. We're here with esteemed big data analyst Jeff Kelly from wikibon.org, putting out the new survey. And of course our new guest is John Williams, Senior Vice President of Platform Operations with TrueCar. Welcome to theCUBE.
Thank you. Great to be here.
So we were just talking before we went on about gamification, data. We were kind of geeking out because you're in Santa Monica, which is the hub of pretty much a lot of data companies. Certainly you're there, League of Legends, Snapchat, many more. Silicon Beach, whatever they're calling it. Do they call it Silicon Beach, or?
Yeah, yeah. There's some coolness going on in LA right now, in some pockets.
But gamification in data is about the new asset. So before we get into that, share with the folks what you do and why you're here.
Sure, so TrueCar is an amazing company. Our mission is to bring transparency to complex marketplaces. Buying a car is certainly a very complicated transaction. It's something involving a lot of data, data that we don't traditionally have a lot of visibility into. So in order to make car buying fun and fair for everybody, including the guy selling the car as well as the guy buying the car, that's an issue where we bring a lot of data to bear on the problem. And big data has really transformed our approach to that. And it's not about sort of doing traditional things bigger and faster. Big data is really transforming our business on a fundamental level.
So let's walk through use cases.
Let me just, I can only imagine: in the old days, you had a website, you'd click a few buttons, a little e-commerce, get a follow-up from the local dealer, it goes into some database, some schema involved. You know, yawn, frankly. Been there, done that. So take us to the modern version of that. What goes on under the hood? Okay, pun intended, in the car big data business.
Sure, so there's a lot of data, things like vehicle taxonomies, configurations, options, everything about the pricing of vehicles. If you think about used cars, every used car is unique. If you think about vehicle imagery, obviously very important, you want to buy something that you can see. So there's fantastic work that we're doing. We handle about 250 million vehicle images. We've processed over a billion images, and we handle about a million new images a day. So when we look at data and how that affects car buying, having accurate data is really key, and processing it more quickly is key. And a typical scenario would be something like, you know, you take in the inventory from a car dealer. So what they've got on the lot, what the features are, what the options are, and there's a set of images that goes along with that. So we're doing things like, some boring stuff like cropping, rotating, scaling, but also trying to use machine learning to figure out what kind of car is that? Does that match the description? There's a tremendous data quality problem. So being able to use automation and analytics to figure out, no, that car is not really red. But of course, the names of colors for cars are in and of themselves a complicated thing.
So all right, break it down for me. How does TrueCar make money? Is it on any kind of referral through your site? Do you get a percentage of that? How does that work? And relate that to why it's so important to be able to do those kinds of things you just mentioned quickly.
That's a great question. So TrueCar only makes money when the vehicle is sold.
And that means we have to be very, very good at the data. We don't just sell a referral or an introduction. We actually have to master the data problem and facilitate a transaction that then comes to a successful conclusion. So our business model really makes it critical that we get this correct. And then the scale of the data makes it a tough problem. And of course, especially with used cars, timeliness is very important.
Absolutely. So yeah, I'm on your site now, just kind of checking it out. I was maybe shopping for an Audi here.
Cool.
Maybe if we get a raise here at Wikibon, that'll be my next purchase.
Q5 or Q7?
But so talk a little bit about the technologies you're using. We talked very briefly before we came on, and you mentioned how maybe some of the best practices around Hadoop are to start with a small problem, master that, and then kind of move to the next one. Whereas you took a different approach.
We did. So we jumped in with both feet. We sort of looked at it and said most proofs of concept are too small, not interesting enough, and the conclusions you reach aren't powerful enough to drive the kind of adoption that we really wanted. So we had a different approach. We started with a philosophy of big data. We looked at the mechanics of the new solutions and we realized things like the economics of data storage have changed. Storage is now so inexpensive. You can now effectively store all your data forever. And when you really think that through, you've now got an infinite timeline where you can go back and re-monetize the same piece of data many, many times in the future. And in fact, at the moment when you decide to store the data, you may have no idea how that data will make you money. You may add different data later that then unlocks the potential. And certainly you think about something like machine learning. You need to train machine learning. So having saved a tremendous amount of historical data means you get better results.
You have more models to train your machine learning. So the opportunities to monetize data over an infinite timeline are themselves infinite.
So, the language of love. It sounds like our CrowdChat product; we're always talking about the same thing around machine learning. Give an example, because this is kind of cutting edge. You're on the front edge of what I believe to be really modern tech, using all of the infrastructure of cloud and legacy. But on the user experience side, it sounds like you're using the new stuff to create an amazing user experience. So take us through the gamification and the machine learning and how that translates into user experience.
Sure, so when we looked at something like mobile, right? And mobile is a huge area for us. You know, we were trying to figure out how to approach this. And when we look at a new market, we look at what the video game guys are doing. And when we looked at mobile gaming, one of the really fascinating aspects of it is the interaction model. So a game like Words with Friends: turn-based, very interactive, multiple folks interacting directly. So that really informed our approach to mobile. You know, a car buying transaction involves a back and forth between a customer and a dealer, and we want to be there to supply information at every step of that. So really good interaction loops are key to transactions like this. Gaming was a fascinating sort of directional influence on us as we approached that space.
Awesome, so let's talk about databases.
Yeah.
Structured, unstructured, also you have a lot of loose data, semi-structured, it's somewhat loosely structured. But you've got to compile all this data and then roll it up using SQL. How do you guys, what does your database architecture look like? Mix and match?
It is a little bit, where we're trying to move away from SQL. You know, when we looked at what big data and Hadoop in particular can do for us, it's really a game changer. Everybody says SQL is slow.
And it's not just slow in terms of query execution time. It's also slow from a development process standpoint. So we have very ugly unstructured data in automotive. You know, car dealers supply information in crazy formats, pipe-delimited without quotes. It's a mess to parse it. So when we look at that, we actually realize that so much effort went into pure SQL. So figuring out what's the schema to put this data in, what's the indexing strategy, and writing queries against it. But then if that data changes in any way, I've got to go rebuild that whole process again. So we think about speed relative to data in terms of from idea to commercialized product. That's an area where we want to see speed. So Hadoop is really a different thing, right? We can put the raw data set directly in HDFS. Rather than writing SQL, we have Java programmers. They write MapReduce jobs. They can parse that raw data format in a much richer way, much more easily. And again, we can iterate much more rapidly. And the goal really is to take the brain power that we have and put it on the most interesting part of the problem space. So not into boring computer science-y stuff around SQL wizardry; put it where it really makes a difference for the business, which is automotive intelligence.
Well, so you mentioned transforming that data into insight. And that's, you know, we see in our surveys with big data practitioners, one of the key, I guess, barriers or challenges for them is cleaning up the data. We had Joe Hellerstein from Trifacta on earlier. And his whole reason for being, his company's anyway, I should say, not Joe himself, is to make that process a lot easier so you can let your data scientists do the stuff that they're really good at, which is the analysis. So how do you approach the transformation challenge with all this really messy data?
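[Editor's sketch, not part of the conversation: the parsing step described above, reading pipe-delimited dealer rows without quotes in Java, might look roughly like the following. The column layout, the class name `DealerFeedParser`, and the sample row are all hypothetical; nothing here reflects TrueCar's actual feed format or code. In a real MapReduce job this logic would presumably sit inside a mapper, which is omitted so the example stays self-contained.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the column layout is an assumption, not a real dealer feed spec.
class DealerFeedParser {
    static final String[] COLUMNS = {"vin", "year", "make", "model", "color", "price"};

    // Split one pipe-delimited, unquoted line into named, trimmed fields.
    // Missing trailing columns become empty strings; extra columns are ignored.
    static Map<String, String> parseLine(String line) {
        String[] fields = line.split("\\|", -1); // limit -1 keeps trailing empty fields
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            record.put(COLUMNS[i], i < fields.length ? fields[i].trim() : "");
        }
        return record;
    }

    // Clean up a whole-dollar price such as "$31,500".
    // Returns null when the value cannot be read as an integer (e.g. "call for price").
    static Integer parsePrice(String raw) {
        String cleaned = raw.replaceAll("[$,\\s]", "");
        try {
            return Integer.parseInt(cleaned);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up sample row with stray whitespace and a formatted price.
        String row = "TESTVIN0000000001| 2012 |Audi|Q5|Midnight Red| $31,500 ";
        Map<String, String> record = parseLine(row);
        System.out.println(record);
        System.out.println(parsePrice(record.get("price")));
    }
}
```

[Because the raw line is kept as-is and only interpreted at read time, a change in the feed means editing the parser rather than rebuilding a schema, which is the iteration speed the speaker is pointing at.]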
I can imagine, you know, coming from, I suppose the car dealerships have some incentive to supply you with good data, but it's probably not their top priority. So I imagine it comes in all different kinds of formats. So how do you go about actually transforming that data in as quick a timeframe as possible?
So I think there are two parts to that. One is writing rich parsers. And this is sort of the technological side of it. You know, we have to have very flexible strategies. Java gives you a very rich sort of environment to do text parsing, to clean up values. We also do a lot of correlation. Data quality is a huge issue. So, you know, figuring out that midnight red, which is a real vehicle color, is really neither midnight nor red. But being able to programmatically identify whether or not that's accurate leads to inferences about the rest of the data set. And I think the second part of this problem in mastering it is actually more about organizational design. So training up folks and getting them bought into an ambitious vision of what big data means. That was really important to us. And, you know, if left to their own devices, sometimes you get somebody started and you turn your back, and they turn everything back into SQL again, because it's what they know. So we really started with a philosophy, things like data is money, because of the ways you can monetize it, and storage is free, and compute is almost free. So that really was, you know, get buy-in at a high level, get everybody to buy into a very ambitious set of goals, and that really helps drive the adoption and the organizational change.
So, I'm just thinking about some of the audience members who maybe work at a more traditional firm, so maybe they're not as data intensive.
Can your approach be applied to them, or what advice would you have for maybe companies where, you know, it's not necessarily core to their business, or they don't think it is, perhaps? How can you apply some of the lessons you've learned to their business?
So again, I think that sort of philosophical buy-in. You know, think about it ambitiously, and what it really means to your business. You know, data is the product we sell, right? We don't actually sell cars. We supply data through mobile apps and websites. So for us, it's really core. But what we're discovering is, by doing things like having all the data in one place, we're writing these sort of open-ended correlation engines that just go out and hunt for facts. And I think that's something that could transform almost any business. You know, BI today is, we have very smart executives. They think of very clever questions. A bunch of analysts go into the data and come up with answers. But we're transforming that, and what we're seeing now is, an open-ended correlation job will go out and run against the data. And it will come back and tell you surprising things about your business. Things that you would never know to ask as an explicit question. And I'll give a great example. We put instrumentation from the front end, purely technological data, and mixed it in with all the BI data. And we suddenly found correlations like, you sell less cars when cache hit rate is low. And that's something: you sell less cars when cache hit rate is low, right? So a technological detail of the front-end implementation actually had a shocking effect on the business. We never would have had all that data in one place were it not for a solution like Hadoop. And we never would have known to ask a question like that.
So what you're doing is something that we always get excited about. Why I'm kind of smiling right now is that you're instrumenting your business.
Absolutely.
End to end.
That's right.
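[Editor's sketch, not part of the conversation: one very simple reading of the "open-ended correlation" idea above is to compute a Pearson correlation for every pair of metrics and report only the strong ones. The speaker does not say which statistic or tooling TrueCar uses, so Pearson, the class name `CorrelationHunt`, the threshold, and every number in the sample data are assumptions made up for illustration.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a brute-force pairwise scan, not a description of any real system.
class CorrelationHunt {
    // Pearson correlation coefficient of two equally long series.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("series must be non-empty and equally long");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) return 0.0; // a constant series carries no signal
        return cov / Math.sqrt(varX * varY);
    }

    // Check every pair of metrics and keep those with |r| at or above the threshold.
    static List<String> hunt(Map<String, double[]> metrics, double threshold) {
        List<String> names = new ArrayList<>(metrics.keySet());
        List<String> findings = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                double r = pearson(metrics.get(names.get(i)), metrics.get(names.get(j)));
                if (Math.abs(r) >= threshold) {
                    findings.add(names.get(i) + " ~ " + names.get(j)
                            + String.format(Locale.ROOT, " (r=%.2f)", r));
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Invented daily values: front-end instrumentation mixed with business metrics.
        Map<String, double[]> metrics = new LinkedHashMap<>();
        metrics.put("cache_hit_rate", new double[]{0.95, 0.90, 0.60, 0.92, 0.55, 0.97});
        metrics.put("cars_sold", new double[]{120, 115, 80, 118, 75, 125});
        metrics.put("page_views", new double[]{1000, 990, 1005, 1010, 995, 1000});
        hunt(metrics, 0.8).forEach(System.out::println);
    }
}
```

[Nobody has to pose the question in advance: the scan surfaces the cache-hit-rate and sales pairing on its own. A strong correlation is only a lead to investigate, not proof of cause.]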
And having real-time data, because you never know what could trigger...
That's right.
...the tsunami either way, positive or negative.
That's right.
Defects, what's going to drive user experience. So you're essentially instrumenting the entire end-to-end process.
That's right. And the value of that is tremendous, because instead of executives using their brain power to ask questions, we bring them startling facts about the business and then they spend their time figuring out how to operationalize that for the benefit of the company.
Well, we'll certainly have you on theCUBE many times, but I've got to ask this question. How hard is it to do that? Take us through it for someone who's not done it fully yet, or wants to do it, or sees the benefit, sees the vision, drinks the Kool-Aid. How hard is it to do, and what do you do?
Well, so I certainly don't think it's necessarily easy. It certainly wasn't easy for us. However, you know, we adopted an approach where, by getting folks to really buy into that value, right, by talking about the benefits of open correlation hunting, that really got people excited about the potential. So it was more than just, like, make SQL go faster; it was getting them to think about it in terms of, really ambitiously, what it could do to change the business in a fundamental way. They really appreciated those conclusions, and then that helped kind of drive the adoption, made the organizational change a little bit easier, and I think that was really key.
Open correlation hunting is so much safer than having actual guns in the real world, but you're talking about some progressive data science. And the next question is, where do you find these people? Do you pluck them out of certain schools, do you have certain targets, how do you write the Craigslist job description? Hey, ninjas, come to us? But is there...
That's a great topic. And first of all, TrueCar is hiring. So anybody that's excited about this, we'd love to talk to them.
But we actually looked at it in a couple of different ways. One component is retraining the folks that we have. And in many cases, they know the business really well. That's the value that we're trying to capture. So if you're a SQL guy that knows a lot about TrueCar, knows a lot about automotive, we want to really capture the value around automotive. So we're having great success retraining folks. We also came up with a good training package for new folks that we hire. And again, it's a three-part thing. We start with the philosophy, what it means. Then we get into sort of traditional training. We have partners that help supply that: what's MapReduce, how do you do it, things like that. And then the final component is, how do you apply that to your job, right? So taking it from philosophy to then making it really actionable. And telling folks how we would like to see our business evolve. And then they do all the hard work for us.
We are here with John Williams, Senior Vice President of Platform Operations at TrueCar. Great to have you on. We love the conversation. You're at the front edge of all the bleeding edge. It's really impressive. Certainly want to keep in touch with you and get that knowledge. I'll give you the final word. Share with the folks out there, in your own words, how exciting it is to do all this cool stuff. And what's it like?
So it is super cool. And it's really fun to be a part of it. It's great to work for a company that's willing to buy in and jump in with both feet. For TrueCar, it was transformative. Mastering data was a thing that, when we succeeded at it, we just recently had our initial public offering. So it's really taken the business to new levels. It's just super exciting to be a part of that.
This is theCUBE, extracting the signal from the noise here at Hadoop Summit. This is a great example of where data-driven businesses will be the future. Gamification, Internet of Things, instrumentation: it's all about the data.
We'll be right back to broadcast more data here on theCUBE.