Okay, we're back, this is Dave Vellante. I'm with Wikibon.org, I'm here with my co-host Jeff Kelly. Kay Young is here. Kay is the CEO of Mortar Data, a CUBE alum. Kay, it's been a while, we were talking off camera, it's been almost a year now since we've seen you. Welcome back, good to see you. Thank you, good to see you too. So yeah, we, I think, saw each other at the Atlas headquarters in Cambridge. We did a little discussion. And you and I have met before, you've been up in Cambridge quite a bit actually, but you're based in New York. That's right, yep. We heard today, and we've been hearing, that New York has surpassed Boston as the number two, so I hear, VC-backed region in the world. Still a distant second from the vortex that is Silicon Valley. Yeah, and I think it also matters what dimension you're looking at. New York hardly has any enterprise tech. You know, Mortar is still almost alone, except for 10gen, in enterprise tech here in New York. But a lot of it's up in Boston. So tell the audience a little bit about Mortar Data for those that don't know Mortar, and then we'll get into what's new. Sure, so Mortar is a platform and also an open source framework that is used by engineers and data scientists to process very large volumes of data. And what I mean by process is everything from ETL, to cleaning data, to doing natural language processing, to doing machine learning and building full-on recommender systems. And the reason that I say large volumes of data is because it's built on top of Hadoop. And there just comes a level of complexity when you're doing parallelization on top of Hadoop. As much as it's our job, and I think we do a great job abstracting that complexity, making the IT headache go away, making the engineering interfaces really easy, and making it easy to run over time, there's still extra complexity. So if you've got a small dataset, you're probably never going to use Mortar.
And recently we've focused in on one particular use case that people have found very useful. I think I just mentioned it: recommender systems. And so we're in the process right now of building out 10 recommender systems for 10 different companies, in a variety of fields, at a variety of stages, from the very largest publicly traded companies on down to little startups that are about our size. And if you're not familiar with what a recommender system is, it's anything that makes a recommendation about inventory that you have to your users, where inventory is loose. That could be literally something you're selling, or it could be somebody to go on a date with, or some music to listen to, or what have you. So a lot of companies have this problem, more stuff than every user can consume, and they want to know how to better match it. It's just a huge problem that we're focusing in on a little bit. That said, Mortar itself is still a platform, it's very general purpose. It can be used for recommender systems and anything else that involves processing large volumes of data. So that's sort of a first example use case, a get-a-foothold type of application. I mean, Amazon is the sort of poster child for that sort of recommendation engine, right? And others need to- They did some of the early pioneering in that work. Yeah, others need to build out similar capabilities, and Amazon, to my knowledge anyway, hasn't open sourced their recommendation system. No, they haven't, and it turns out it's a tricky problem. Everybody's a little different, their domain's a little different. And so what we're doing is we're building a bunch of open source components as part of this project with 10 companies that will be available to all of our companies to mix and match and customize as they need. So when we first met, you sort of described the strategy and the plan for the early formation of the company, this notion of making Hadoop a service, making it easier for people to consume. You mentioned, we were just talking about Amazon.
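To make the recommender idea concrete, here is a toy item-based collaborative filtering sketch in plain Python. This is purely illustrative: Mortar's actual components run as Pig and Python jobs on Hadoop, and all names and data below are made up for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarities(ratings):
    """Compute cosine similarity between items from {user: {item: rating}}."""
    dot = defaultdict(float)   # co-rating dot products per item pair
    norm = defaultdict(float)  # squared rating norms per item
    for user_ratings in ratings.values():
        for item, r in user_ratings.items():
            norm[item] += r * r
        for (a, ra), (b, rb) in combinations(sorted(user_ratings.items()), 2):
            dot[(a, b)] += ra * rb
    sims = {}
    for (a, b), d in dot.items():
        s = d / (sqrt(norm[a]) * sqrt(norm[b]))
        sims[(a, b)] = sims[(b, a)] = s  # store symmetrically
    return sims

def recommend(ratings, user, sims, top_n=3):
    """Score unseen items by similarity-weighted ratings of seen items."""
    seen = ratings[user]
    scores = defaultdict(float)
    for item, r in seen.items():
        for (a, b), s in sims.items():
            if a == item and b not in seen:
                scores[b] += s * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data: "inventory" here is products, but it could be dates or songs.
ratings = {
    "alice": {"book": 5, "album": 3},
    "bob":   {"book": 4, "album": 4, "movie": 5},
    "carol": {"album": 2, "movie": 4},
}
print(recommend(ratings, "alice", item_similarities(ratings)))  # → ['movie']
```

A real system at Hadoop scale would express the same joins and aggregations as parallel Pig steps rather than in-memory loops, which is exactly the kind of complexity Mortar aims to package up.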
So Amazon AWS, they have Hadoop too, right? It's one of the fastest growing services there. So help us understand why somebody would go with your sort of specialized service, that's part of the answer, versus say just throwing it on Amazon. Yeah, so in fact, we run exclusively on top of Amazon and we use Amazon's infrastructure. We run on top of their Hadoop infrastructure, Elastic MapReduce, and they really like what we're doing. In fact, in January of this year, we won grand prize in their global startup competition in the big data category, which is worth $100,000. That's like a very serious prize. And the reason that we won that is because they see the high value add that we're providing on top of their infrastructure. So to provide sort of a metaphor for the difference between Mortar running on top of Elastic MapReduce and running directly on Elastic MapReduce yourself: Elastic MapReduce you can think of as raw infrastructure. It will run Hadoop jobs for you if you have written them, and that's all it does. Mortar provides you all the things you would expect to go along with the software lifecycle. So knowing exactly what got deployed and run in the past, so your history, and when there are errors, showing you how to correct them. It's a solution. It's a full solution. Another metaphor that's sometimes helpful is that in the same way you could run a web app directly on EC2, that's raw infrastructure, you could run your data processing directly on Elastic MapReduce. As it turns out, most people choose to use Heroku or some sort of service to manage that raw infrastructure for them, and that's kind of what Mortar does, at a simple level. Okay, you mentioned off camera you've been hiring folks in the data science space. Actually, you mentioned some advisors in the data science space, but talk about that a little bit. What's the affinity with data science, and how is that helping progress your service?
Sure, so data science is very important to Mortar. We actually own the New York Data Science Meetup. We organize it, I guess, is a better way to put it. And so there's a thousand data scientists that are part of that group. We're always looking for new space, so if somebody's watching this video and says, oh, I would like a thousand data scientists to come into my space, I need to know about that. But the reason that data science is so important to us is because one of the fundamental problems we're solving is repeatable data science. Right now, everybody does everything in an ad hoc way. You try this, you take this little piece of data over there, then you try this other algorithm, and then you publish your results, but nobody can verify it or do it the way that you did. So we've created a framework, which, going back to the Rails and Heroku example, is a lot like Rails, but for data processing instead of for web applications. So it's a self-contained unit with all the computation that you need to do a full run, taking data from its raw form all the way to results. So this solves a fundamental problem of data science. How do you collaborate? How do you repeat? How do you build on prior art? And so folks in the data science community are excited about what we're doing. So to name some of those advisors we've brought on, we've got Hilary Mason out of Bitly. Yep. I saw recently there was a graph of the most influential data scientists from Twitter, and Hilary was right at the top. That's because she was on theCUBE, by the way. You're going to want to go look at it, because I think it's out there. So we've got Hilary, we've got Max Shron, who, if you are familiar with the famous OkCupid blog, which really kind of made data science popular, he was the author of that. We've got Drew Conway, who's the author of Machine Learning for Hackers from O'Reilly.
And we just most recently brought on Eric Colson, who used to be the VP of data science and engineering at Netflix. So we've got, you know, the right people. That's a nice team that you've got on. Yeah, it's a really great team. Those guys don't just, I mean, it's hard to get those guys to come on and pay attention, you know? No, yeah, it's a real serious team. And I think it speaks well to the fact that what we're doing, you know, it hits a nerve with them and they think it's important work. So what kind of traction are you getting with customers? What are you finding in terms of who's typically coming to you looking to use your services? Do you find that they're in specific verticals? Do you find it's more traditional versus the web companies? What are you seeing there? Since we are available exclusively on AWS, all of our customers have data on AWS. That tends to be younger companies, but it's becoming less and less so. The other day I was speaking with JPMorgan Chase, and they were saying, you know, we're actually starting to put data onto the cloud. And so that's becoming less of a hard and fast rule. But up till now, mostly it's been companies that are, you know, from 2006, 2007 or later. There has been no particular industry focus. We haven't tried to focus on anything, and we sort of thought, well, maybe there'll be some bucketing that naturally occurs in the folks that come to us. Hasn't been the case. It's been sort of all over the map. And, to tell you the truth, there hasn't really been a particular use case that stood out until this recommender system thing, where we just finally heard it enough that we decided we'd focus in on it. So yeah, I'm interested to talk a little bit about your take on, obviously you're delivering Hadoop as a service, so Hadoop in the cloud really, and delivering that as a service. And we recently did our market sizing of the big data landscape.
And you know, cloud-based Hadoop in big data, it's still a pretty small sliver of the market, but we see that growing. Talk about some of the advantages of doing big data, specifically Hadoop, in a cloud environment like you offer, versus having to do it in-house, maybe bringing in all your own infrastructure, et cetera. Well, I mean, right off the bat, you get all the classic things you get with cloud, which are: you don't have to set anything up, you don't have to worry about it, it's just not your problem, the infrastructure and the software are not an issue. Hadoop in particular benefits from the cloud, because you can use it elastically. If you don't need any processing power, you can shut it all down and you pay nothing. If you need these massive spikes, where suddenly everything goes wrong and you need to recompute everything, no problem. You actually have capacity to do that. I'm sure it's going to cost you a little bit more on that day, but if you're doing an in-house Hadoop cluster, you may just hit your ceiling; if you only have 10 physical machines, you're never going to get better throughput than that. So there's those reasons. Then in addition to that, there's a special one that sometimes people don't think about in connection with big data, which is you're really going to need to operate Hadoop wherever your data resides. You aren't going to move terabytes of data from here to there to there for processing. I mean, that's a fundamental concept, really, of big data: do as much of the processing where the data lives and move the data around as little as possible. Exactly, that's right. And so for all these companies that are coming of age in a cloud environment, their data's already in AWS, and they want to keep it there.
Right, so one of the potential challenges, of course, for the enterprises moving data, we've heard that actually moving data into the cloud can be a challenge, but I take your point that if you've built your infrastructure in the cloud anyway, that's going to be a logical place to do that. And then the other advantage that would strike me is when you're trying to bring in third-party data: by definition, it's not inside your own firewall anyway, it's living in the cloud. So it's a place to do that. Do you see yourselves, or do you offer, any kind of services to help companies bring in third-party data services and kind of mash that up with some of their own data to do some more advanced analytics? We haven't focused on that yet. There are other folks, obviously, that do that, and I hope at some point in the future to be able to work with them to make their services complementary to ours, but it's not something we've focused on so far. So we're here at MongoDB in New York City. So talk a little bit about the relationship between kind of Mongo's sweet spot and Hadoop. We've had a few people on today, as a matter of fact, who've talked about using Mongo to support their online, their web applications, their mobile applications, and then moving a lot of the data that's generated, user data, basically, how people are using their applications, maybe moving that into Hadoop for some analytics. Is that the core relationship, or how do you see the two sitting side by side? Yeah, so Mongo primarily is a data store, right? You put data in, you get data out. There are some exceptions to that. They've got their aggregation framework, and they've got a way to run MapReduce jobs in JavaScript. But those are solutions that often don't apply to complicated use cases, and they actually add load to your data store. So a lot of times you basically want to treat Mongo as a classic data store. You want to put your processing somewhere else.
Where is it going to go? Really, in most cases, people are considering Hadoop. That also makes sense because sometimes you want 10 nodes in your Mongo cluster and 100 nodes in your Hadoop cluster. You really just want to separate concerns. Plus, Hadoop is purpose-built for processing data and bringing in existing libraries, you know, and Mongo isn't, and so it's better to let each focus on the things that it's good at. I proposed a talk for next week at the Hadoop Summit out in California, and it was actually the same as this talk, which I just delivered, called Mongo and Hadoop Sitting in a Tree. So it's like, you know, how can I get these things together. And at the Hadoop Summit, they put up for public vote, you know, what the most popular proposals were. I don't remember, there were more than 200 proposals for talks that came in, and this was the most popular talk of all of them, according to the community. And so I think what that shows is that there's a really large hunger in the Hadoop and Mongo communities to figure out how to get these two powerful technologies to really work well together. And so during my talk today, I spoke at a very high level about here's what Hadoop is, here's the ways you might consider working with it, why you would consider working with Mongo, and then it was a very technical talk, I actually coded on stage, did a live integration of a Mongo database into Hadoop to do some work with Twitter data. So talk a little more about what you did. What were you doing with the Twitter data? So the first thing I did was I used Hadoop to crunch through the entire Mongo database and look at all the different elements, because, as you may know, MongoDB is a polystructured database, right? Not all the documents have the same structure, so you can lose track of what's in there. So I used- Polymorphic was the word we used earlier, I love that word, I just love to say it. Yes.
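The structure census he describes can be sketched in a few lines of plain Python. Assume each document is a dict (as a Mongo driver would hand it to you); the real demo ran this as a parallel Hadoop job over the whole collection, and the function names and sample documents here are invented for illustration.

```python
from collections import Counter

def doc_signature(doc, prefix=""):
    """Flatten a document into a sorted tuple of dotted field paths."""
    fields = []
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            fields.extend(doc_signature(value, path + "."))  # recurse into subdocuments
        else:
            fields.append(path)
    return tuple(sorted(fields))

def schema_census(docs):
    """Count how many documents share each field-path structure."""
    return Counter(doc_signature(d) for d in docs)

# Toy polymorphic collection: not every document has the same shape.
docs = [
    {"user": "a", "text": "hello"},
    {"user": "b", "text": "hi", "geo": {"lat": 1.0, "lon": 2.0}},
    {"user": "c", "text": "hey"},
]
for signature, count in schema_census(docs).most_common():
    print(count, signature)
```

Because signature extraction is independent per document and the counting is a simple aggregation, this maps naturally onto Hadoop: a map step emitting signatures and a reduce step summing counts.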
And so we used Hadoop to look at all those different structures, take sample values out, and basically give you a sense of exactly what's in your database, down to counts of all these different values. That was the first thing I did. Then the next thing I did was just a very simple example: let's look at what time of day, time adjusted to UTC, people tweet about coffee. Just took a look at that, and it's a simple enough thing, but it involved a little bit of Python, a little bit of Pig, and it's something that's much more easily done in Hadoop, and I was able to do that on stage very quickly. So you wrote that code on stage, and it worked in one go. To be fair, I had pre-written it, but then I wrote it again on stage and showed all the parts and how they worked together. So you pre-wrote it, tested it, made sure it worked. Yeah, I didn't want people to have to go through my debugging cycle, but yeah, that's right. So how long did that cycle take on stage? Well, on stage, about 20 minutes. And okay, how about your prep? A little longer than that, but not much longer. Really what took a long time- Under an hour. Making sure that, yeah, under an hour. What took a long time was making sure that I had all my steps correct so that it would make sense to people as they were watching. So your talk was accepted at Hadoop Summit next week, or? It actually was not, which was surprising to me. Given that it was the most popular, yeah. I think it has to do with, you know, who was funding the event, and we weren't. So we didn't get selected, unfortunately. Well, we'll have theCUBE next week at Hadoop Summit, so maybe we can unpack your talk there, you know. So I wanted to dive into the data science question a little bit, because it's something that struck me. Actually, Dave and I were just talking to a practitioner yesterday.
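The coffee analysis he sketched on stage boils down to a filter and a group-by-hour. A minimal pure-Python version follows; the field names and timestamps are hypothetical stand-ins for real tweet documents, and his actual demo did this with Pig and Python over Mongo data in Hadoop.

```python
from collections import Counter
from datetime import datetime, timezone

def coffee_tweets_by_hour(tweets):
    """Bucket coffee-mentioning tweets by UTC hour of day."""
    hours = Counter()
    for tweet in tweets:
        if "coffee" in tweet["text"].lower():
            # Interpret the Unix timestamp in UTC, matching the talk's
            # "time adjusted to UTC" framing.
            ts = datetime.fromtimestamp(tweet["ts"], tz=timezone.utc)
            hours[ts.hour] += 1
    return hours

# Toy tweets with Unix timestamps (seconds since epoch).
tweets = [
    {"text": "Need coffee now", "ts": 1370000000},
    {"text": "coffee break!",   "ts": 1370003600},  # one hour later
    {"text": "lunch time",      "ts": 1370007200},  # no coffee mention
]
print(coffee_tweets_by_hour(tweets).most_common())
```

In Pig the same logic would be a FILTER on the text field followed by a GROUP on the extracted hour, with the hour extraction delegated to a small Python UDF.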
And one of the things they mentioned was trying to make the business case for Hadoop to a non-technical person, to a CEO, for instance. And it's a little bit hard to make that case that, well, you know, we're going to invest in this technology, we're going to put a lot of data in it, and we're going to let some really smart data scientists go in and fool around with it. We don't know what they're going to find, but we're pretty sure it'll be valuable. Now, one of the benefits, of course, of Hadoop and kind of data science is that you don't know what you're going to find, because you're not pre-defining the questions like in a traditional environment, but that can be kind of a hard thing to explain to somebody who's not deep into the tech, who's not a technical person, not a data person. What advice would you give to people at a company who think they get the value of Hadoop but are having trouble articulating that business case? How do you articulate it to more of a CEO type, a business type, versus a hardcore tech type? So are you particularly talking about the case where we're just sort of saving a bunch of data and we don't really know what we're going to do with it? The whole concept of, you can ask questions you never asked before, and you're going to find insights that potentially you didn't even think to ask about, but that's a hard business case to make. You know, it is hard, especially because unless you're speaking the language of the CEO or the company that you're talking to, it's no good, they're not going to care, you know? So the way that I do that generally is I dig into their business a little bit. I'll say, okay, tell me about what you do. Okay, so this is kind of what you do. What do you think the limiting factors are on your revenue right now? Like, what's stopping you from selling more to individuals, or finding more individuals?
And then just dig a little deeper into that and say, oh, well, what if we could, you know, bring in Twitter, bring in Facebook, and actually factor in the strength of the relationship on Facebook, for example. Like, this is your father, and so you really care about his opinion, whereas this is a random connection you don't really care about; bring these things to bear and then put things that are relevant in front of the user. And then people say, oh, well, that is really interesting, and I say, well, yeah, but you wouldn't be able to do it if you weren't keeping track of Facebook likes, or maybe if you weren't keeping track of all your logs. And so that's how you start to justify imagining the world of possibilities if you start storing more data. Well, that's, I think, a very good point. You've got to talk about the business use case specifically and dig into the details of that business. I like what you just said about, I mean, well, you know, would you like to reach more customers here or there? What part of your business is maybe a pain point? Identifying that first and then starting to dig a little bit, that sounds like a way to help tell that story. It is, I mean, for me it always has to be personalized or else it doesn't resonate. But the nice thing is it's easy to personalize. It's always easy to dig to the point where you figure out where they'd like to be doing better, and then think about how data could apply to that. So tell us about going forward. You know, what's on your plate? What are some of the things? Obviously, talking about Mongo and those integrations is one thing. What else is in the future for Mortar? Yeah, so we've got a couple of things. On the product front, we're just continuing to improve and improve.
We've got a release coming out in the next two weeks where it becomes very, very fast to get feedback as you're developing with Hadoop, sub-second feedback, which is going to be a breakthrough. On the use case front, I was talking about building recommender systems for 10 different companies, pushing forward on all of those, getting great open source components put together. And then looking a little further into the future, in Q4 of this year we'll be going out for a raise and hitting the road with our story. And how many people are you now? We're 10 people now. Awesome. And you'll be at Hadoop Summit next week. No, you're not going. I'm going, I'm going. You are going, okay, good. I'm just not going to be giving that talk there. No sour grapes, Kay, right? No, none. Okay, and then any other events that we should look for Mortar Data at this year? Not that I have scheduled. Do you think you'll do AWS re:Invent? Yeah, we'll be at re:Invent as well. We did the AWS Summit in San Francisco, which was, I guess, a couple of months ago. I can't remember now. I think it was April or May. It's all a blur here on theCUBE. All right, and maybe we'll see you in Cambridge as well. That sounds great. Kay, great to see you. Mortar Data, making some awesome progress. So thanks for coming back on theCUBE. Keep it right there, everybody. Jeff Kelly and I will be back with our next guest. We're live here in New York City at the MongoDB event. This is theCUBE.