The Cube at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks ("We do Hadoop") and headline sponsor WANdisco ("We make Hadoop invincible").

Okay, welcome back everyone, here live in Silicon Valley in San Jose for Hadoop Summit 2014. This is The Cube, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, with my co-host Jeff Kelly, big data analyst at Wikibon.org, and our next guest is Chris Wensel, CTO and founder of Concurrent. Welcome back to The Cube.

Thanks.

Hey, so we've been following you guys. You have big news, big technology; the world is spinning in your direction. We last talked to you guys about your vision. I want you to give just a quick update. One, you got some fresh financing, so get that out of the way, knock that out, and then update on the company, and let's dive into the tech.

Sure. Yeah, so we just announced, I think this week, that we closed 10 million in funding, led by Bain Capital, with True Ventures and Rembrandt also participating. So we got some fresh money to keep delivering on the product vision, right? And also to continue growing the Cascading ecosystem. As you know, we founded the company behind Cascading, a technology for building data-centric applications on, if you will, big data infrastructure. And we're seeing a lot of growth and adoption.

Rob Bearden was on here from Hortonworks talking about the massive tsunami of data, which everyone talks about. But in the enterprise in particular it's even more massive. Certainly other verticals like financial services have a big problem with data in terms of dealing with it. The web-scale companies like Twitter are clients of you guys, or users of your product. So explain to the folks the technology, because a lot of people might not know how unique it is. Talk about the architecture, how you see it, and how it's rendering itself out through open source and also through deployments.
Okay, well, the philosophy is pretty simple. We want people to be able to build data-centric applications. It's not about queries. You start with queries, but there's a point in time where those queries actually need an SLA. They're actually important, especially after you go IPO, right? And so building a standalone application that is maintainable and testable using standard Java best practices is extremely important, with as little infrastructure as possible. So Cascading itself is just a Java API that abstracts away the complexities of MapReduce, or Tez, or Spark, and allows you to build these applications independent of the frameworks, so you don't have to re-implement the logic if you want to move to different infrastructure. You build an application that anybody can be trained on with existing Java, Scala, or Clojure skill sets and be successful. And so it's about maintainability. It's about operationalizing the applications, getting them into production and moving on to the next thing.

So this is the holy grail for DevOps. It sounds like what you just laid out, essentially, is that if I'm a developer, I don't have to go in and recode across multiple frameworks. That's pretty much what you're providing, right?

That's actually one of the stories, right? So in Cascading 3.0 we will be supporting a multitude of frameworks underneath. Apache Tez will be the first. We have Spark and other streaming technologies we're also working with other companies to address as well. And this is demand from our users. If you will, we plan to be the API you use to build your business logic, so you don't have to worry about the de facto or de jure infrastructure of the day underneath; those are actually different problems. I like to break things down into mama bear, papa bear and baby bear. So papa bear is MapReduce, Hadoop, right? Baby bear is running R, some Esper, something in a single process. But what's mama bear?
What's that 10-node-plus thing that does something interactive? Spark can be that, but there are other technologies emerging. So you should be able to choose that based on your needs, without all the legacy stuff that comes with what's underneath. So that's the importance.

Well, it's an interesting value proposition, because there's so much going on with the different computational engines, and I still think there's a lot of confusion around what's best for what use case. So this sounds like it would allow a developer to launch on one of those engines, and if that turns out not to be the right fit, potentially move to another.

That's exactly true. Sometimes you already know ahead of time what you want to use. Sometimes you just want to stick it on the thing that has the semantics that meet the needs of the application. But it also helps you work through the hype, if you will. Everything's good and everything's bad, and they all overlap, and what's right for the job? You want the minimum amount of infrastructure, so pick two or three things that fit the appropriate model. I mean, we don't know what's going to win, but our intent is to let you develop your applications in Cascading. Scalding and Cascalog are two higher-level languages that are extremely popular. And once you've built your business logic, that's where your IP is. You're not locked into what it was built on top of. And that's essentially it.

So let's dig into the tech a little bit. You're abstracting away the complexity for the developers, so you're handling the complexity. How do you actually make it possible to support applications like that?

That's a great question. So in Cascading 2.0 we broke out the dependencies completely, and we had two planners, an in-memory mode and a Hadoop mode, if you will. In Cascading 3 we've actually formalized a rule engine.
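The planner idea he's describing, one pipeline definition that different planners map onto different execution fabrics, can be illustrated with a toy sketch. This is not Cascading's actual Java API, just a hypothetical Python analogy showing the separation between business logic and the engine underneath:

```python
# Hypothetical sketch of the planner idea: one pipeline definition,
# interchangeable execution "fabrics". Not Cascading's real API.

class Pipeline:
    """Business logic as an ordered list of named transformation steps."""
    def __init__(self):
        self.steps = []

    def each(self, fn):
        self.steps.append(("each", fn))      # per-record transform
        return self

    def keep(self, pred):
        self.steps.append(("filter", pred))  # per-record filter
        return self


class LocalPlanner:
    """An 'in-memory mode' planner: maps steps onto plain iterators.
    A MapReduce, Tez, or Spark planner would map the SAME steps onto
    that fabric's primitives instead -- the logic never changes."""
    def run(self, pipeline, records):
        out = iter(records)
        for kind, fn in pipeline.steps:
            if kind == "each":
                out = map(fn, out)
            elif kind == "filter":
                out = filter(fn, out)
        return list(out)

pipe = Pipeline().each(str.strip).keep(lambda s: s)
result = LocalPlanner().run(pipe, ["  hadoop ", "", " tez "])
print(result)  # ['hadoop', 'tez']
```

The point of the sketch is only the shape: the `Pipeline` object holds the IP, and swapping `LocalPlanner` for a cluster planner is an operational decision, not a rewrite.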
I can get into all the academics, but with rules we can now declare how your logic maps onto different fabrics. And that will actually be exposed to users, so they can create new rules to experiment with new capabilities and new fabrics. But the point here is that the community, not us, is going to do the work. I mean, I'm going to do the Tez work, and I'm going to work with the community and the others, but over the years all these things are going to emerge, and if they want the 10,000 existing users on a project that's based on Cascading, all they have to do is go in, build the rules, plug in the APIs and fix all that up. And then they've got all those technologies on top. So it's not just a win for the businesses building business logic; it's a win for the companies building much more innovative infrastructure, who want people to be able to port onto their system without having to retrain on new APIs and new semantics, right? I mean, it's not perfect, it's an ideal, right? It's an idealism. But look at LLVM, right? Swift from Apple came out immediately, and all it is is a new syntax on top of LLVM compiled into something else. If it weren't for LLVM, Swift would probably not be here, right? A lot of things use LLVM now to get onto your architecture, and LLVM maps onto whatever's underneath, whether it's a RISC architecture, an x86 architecture, et cetera, et cetera, right? It's the same thing.

So how's this going to change the game? Talk about some of the people you're working with. You mentioned you guys are working with Twitter; they're a big supporter of what you're doing. A lot of really big early adopters are working with you guys. Explain who they are, why they're using it, and some of the use cases.

Well, some of the users, like Twitter obviously, everybody knows about. All of their revenue applications are based on Cascading. They have many teams internally.
They actually created the language Scalding, which is a Scala-based DSL for doing machine learning. They also use the raw Java APIs. We're seeing lots of companies adopting Scalding as a foundation because they love Scala and they want to do machine learning on Hadoop, so it's a perfect fit. The kinds of applications we see range from genomics to simple ETL to predictive analytics. One of my favorite early examples: a buddy of mine founded a company called FlightCaster that took 10 years of weather data and 10 years of flight information, built a predictive model to tell you when your flight was going to be delayed, and made an iPhone app out of it. They flipped the company in four months. Soup to nuts, right? Amazon, Cascading, some Clojure, Hadoop and lots of data. And it would have taken even less time if they didn't have to spend all that time making the data not suck.

Yeah, schemas evolve and all kinds of craziness.

Now people have a lower barrier to innovation and they can get into production. Again, he was actually doing all the ad hoc stuff, but the exact same code went to production. That's a key takeaway, right? Most data scientists will take any tool they can get their hands on and build a pipeline, but it's extremely brittle. You want some conceptual integrity across the applications, and anything built on top of the Cascading ecosystem retains that conceptual integrity and allows you to get into production, which leads to productization. So we have launched a product called Driven that allows you to observe running Cascading applications in real time and reason about how they perform: whether they're slowing down, whether there are anomalies in the behavior or anomalies in the data, and actually understand what's happened. It helps in operationalizing applications, and it helps to actually maintain those applications. That's the trend in the enterprise.
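The kind of slowdown and anomaly detection Driven does on running applications could be sketched, very roughly, like this. The thresholds, function names, and sample durations here are hypothetical illustrations, not the actual product logic:

```python
import statistics

def is_anomalous(history, latest, k=3.0):
    """Flag a run whose duration deviates more than k standard
    deviations from the historical mean -- a crude stand-in for the
    behavioral checks on running applications described above."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(latest - mean) > k * stdev

runs = [61, 59, 62, 60, 58, 61]   # hypothetical past durations, seconds
print(is_anomalous(runs, 60))     # False: a normal run
print(is_anomalous(runs, 300))    # True: something is slowing down
```

A real observability product correlates far more than wall-clock time, but the principle is the same: compare each run against the application's own history rather than a fixed limit.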
I mean, obviously some of the use cases at Twitter are huge scale, but we heard about TrueCar, one of the customers Hortonworks supports, and the stuff they're doing exploring correlations. They're finding new correlations in their data, and that is now part of their business model. So essentially what you're getting at is there's all this insight out there, all this new stuff that if you don't mash the data up you'll never know.

Everybody's in the business of data, whether they know it or not. Twitter didn't know it until later, right? I mean, I'm sure they knew what their business was, but their revenue is optimizing ads, right? A lot of companies are realizing that it's the event streams, it's everything else, that they're trying to monetize, trying to actually build a better organization based on that data. So it's not just the big data proposition; everybody is in the business of data. There's something to be had there, right?

Yeah, we're streaming data right now on theCUBE, right? It's providing data from you. So give us the update on some of the roadmap stuff, the product update, the open source piece, just for the folks out there that aren't familiar with your company layout: how they engage with you and the company, the value, what's shipping and whatnot.

Sounds great. So Cascading is free, open source, Apache licensed. It's actually supported in the Hortonworks distribution, so you can go directly to Hortonworks to get support, or you can come to us for support, but we're not really a support company. Go to cascading.org to get access to the software and get visibility into the full ecosystem of tools and applications built on Cascading. We have a mailing list with a huge community around it.
Cascading.io is where you can go when you're developing a Cascading application to actually get insight into how it performs at massive scale on your cluster, or at small scale on your little cluster, which is probably more likely. If you want deeper insights into your Scalding app, go to cascading.io; with no intervention, just a little plug-in, you can have full visibility into your existing applications. You don't have to do any developer work for that. And if you want to hear about the company, concurrentinc.com. And again, we're partnering with others, so Cascading is available through all the distributions. We hope to actually be available in the Red Hat distribution pretty soon, but Hortonworks, I think Cloudera, most everybody actually has Cascading or something from the ecosystem.

Who's your target developer that you really want to reach? Who are you trying to reach with this product and offering?

For the product itself? Today it's really focused on helping developers be successful, right? Helping them really understand how Cascading works, how it works at scale. So there's one thing where you build a data pipeline, and you're doing data, right? So it goes like this. But when you turn it sideways and you're running on a Hadoop cluster of a thousand nodes, or 10 nodes or whatever, it goes this way. The challenge is what happens when it goes that way, as it moves through the pipeline, and what are the pathologies that happen, right? A lot of companies are trying to do that, but what they do is tell you what Hadoop is doing. We can tell you what the application does and what your data looks like. There's a huge difference between what a mapper does and what your actual code is doing in a parallelized fashion.

So what's the developer persona you're referring to? Is there a specific one? Guys who are doing Java, guys who are writing Python? Data scientists?

Today it's purely Cascading, Scalding and Cascalog.
And in the future we'll open up the APIs for others like Pig and Hive, so you can get deeper visibility into those. But today, and for the next couple of months: Java developers, Scala developers, Clojure developers, complete deep visibility into their executing applications.

How is the role of the developer changing with the advent of Hadoop and some of these big data technologies? I wrote a piece a couple of months ago about how the data scientist and the application developer are starting to overlap: they've got to have a good relationship, and if you can find one person who can do both of those things, that's fantastic. Do you see that happening, or what's the evolution of that role?

So I used to tease that what we're seeing is the emergence of a dev-analyst, a guy who can write some code and actually knows numbers, right? That is breaking apart again, because the statisticians are coming in and calling themselves data scientists, and the Java developers are taking their lead. So it is actually a team effort. There's also an operations guy who comes in. And what we're seeing now, which is actually more interesting, is the former DBA becoming a Hadoop DBA, if you will. I haven't made up a fancy word for it yet. There's a guy who stares at Hadoop clusters all day long, and this is actually our product's audience, ultimately: helping that guy understand what's happening in the cluster and in the business. But yes, devs and analysts are working together to solve these problems. Sometimes it's the analyst dealing with Scalding to understand the problem, but it's the dev guy who will actually finish it up, put it into production and make it an SLA-managed application.

I'm reading a post that was just tweeted by Sean Connolly from pepperdata.com. It lists the top hottest words at Hadoop Summit 2014.
He's comparing the word clouds, I'm sure built from Twitter and some social media, from 2011 to 2014. HBase pretty much goes down to a tiny number. Hadoop is still pretty solid. Data got a little smaller. User is now smaller. Now analytics is bigger. What do you think about the trends? Obviously with HBase I can see why that's getting less relevant; for certain use cases it's great, but it's hard to deal with, some say. But you're seeing much more mainstream adoption now that the enterprise is in there. In your mind, what has changed in the landscape? Because this is a tooling conference. This is about the people building technology, and their customers are all the guys with data. Those developers that you're enabling are building it. So what do you think has changed since 2011 that's most impactful?

You know, it's hard to say. I think the amount of money being spent is definitely what's changed. I think the word data has now moved up the stack. I remember when I was at Thomson 10 years ago, trying to solve these things and get a data warehouse in our unit, right? And the Hadoop ecosystem, what's changed, I think, is that people are understanding what the things that have been around are good for, and now we're seeing new things and we don't yet know what they're good for. So, to the earlier point, I don't actually know the answer to that question. I mean, I guess Cascading's been around since 2007. I was at the very first Hadoop Summit, the very first user group, if you will, and I still have a difficult time understanding what's happening all the time. There's so much going on, so many technologies, and some pop up for a while and then go away. But we're in the Cambrian period, right? Things are supposed to come to life and then go away, and the winners keep going. So it's going to keep going.
I think that's so exciting, and one of the things you're talking about with your value proposition is that you're enabling people to have a one-hit wonder in building a company and then add more features, or have someone mash up data sets to create value, right? That's the kind of thing people are looking at right now.

Well, they are. But there are also very established enterprises that have these processes and want to lower the cost. They don't want a rarefied skill set to do the work. When I first got attracted to this space, it was because it's powered by two principles: more is better, and worse is better. The "more is better" is obvious. Big data: more computers, more data, more algorithms, et cetera. But the "worse is better" side is the fact that Hadoop, if you will, gives you root on your infrastructure, right? I installed VMware many years ago because I wanted to get root on a VM, because they wouldn't give me root on the host, right? So I snuck in VMware to have root, until they figured it out many years later. Hadoop does the same thing. Now we're figuring it out, but you put a Hadoop cluster in, it covers over a hundred nodes, and you've got storage and a hundred nodes of compute. I can do anything I want on a machine. No one's in my way. There's no bureaucracy. I'm not asking for storage. I'm not asking for an account. I'm not asking for anything. I just use the tools available to me and get my job done. And that's the attractive part.

Talk about visualization, data visualization. You guys are doing some of that. What's that market like? How do you guys work with visualizing data?

We don't specifically visualize data; we visualize applications. Cascading is used to prepare and cleanse data for visualization. It can be used for integration, so you can have a Cascading app that reads from Oracle and writes to Elasticsearch.
And then you use Elasticsearch and Kibana to do visualization, right? But it'll run at scale. Or you can run it locally, or in a different system; it's irrespective. But you can build an application. The other challenge people have is that you're moving through a development cycle. You start with inception, you get to production, and you've got all these stages. As you do that, you move onto different infrastructure, so you have to build an app that can live across that infrastructure: read JSON, read binary formats, et cetera, et cetera, in different scenarios for debugging. You're also moving through scale, from small data to huge data. And every time a new outlier arrives, you deal with it, but then it turns out that outlier is not something to be dealt with; it's actually a feature. And so you just keep working through that. Cascading opens that up and allows them to build these applications reliably. And you can actually have much richer visualizations with the integration. But we don't do visualization.

So a tweet came across the wire here: best kept secret in big data, Cascading, at Concurrent, Chris doing his thing. I saw some of the little pictures in your slides. How was the talk? What did you learn? What did you present? What were some of the reactions?

That was actually in the Netherlands; I was in Amsterdam.

Oh, recycling. Throwback Thursday.

Yeah. So I didn't speak this year. I tease that I never actually speak in the US, just in Europe, where English is the second language.

Because it was better?

Because it was better. But that talk actually went really well. We did a user group in Amsterdam and then I gave a talk. So on top of Cascading there are lots of projects. One of them is called Lingual. It's an ANSI SQL parser on top of Cascading: cut and paste SQL onto Hadoop, build a standalone application to run existing legacy SQL. I mean, it's simple. You don't need to install anything.
It's just a library, right? There is a JDBC driver if you want to use it. But the point is I was speaking on that, and on just building simple things using existing technologies, getting them into production, and not having to worry about it.

You guys are one of those companies Jeff and I were talking about earlier in the opening segment of day one. There are three classes of company: early stage, seed through B round; C-plus and rapidly growing; and then the big whales like IBM and Cisco. Your challenge is to break through that B round and hit that growth curve. That's your business model right now. So I've got to ask you the question. The old saying is you become what you're known for in life, right? So what do you want people to know Concurrent for, as a company and as a vision?

Dang, you should have prepped me on that one earlier.

It's theCUBE, baby, bring it on.

I think simplicity. I mean, we're not proposing complex things. Worse is better; Richard Gabriel got it right. And so we're out to actually make people successful. We're not trying to push anything down your throat. Concurrent is building simple technology to solve really hard problems, I think, is one way to look at it. Whenever we hire a CMO, I guess we'll come up with a better way to say that.

Well, when you're growing, if you can help people do their job, reduce the time it takes to do stuff, and save time and money, that's always a good thing.

I mean, ultimately our company, Concurrent, is about business process management, about understanding what your business is doing, because you're in the business of data, because everybody is, and getting visibility into the organization. Because worse is better, you don't have to govern everything. Things just happen.
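The Lingual point, legacy SQL running inside a standalone app with nothing to install, just a library behind a standard driver, is the same shape that SQLite demonstrates in Python's standard library. This is an analogy, not Lingual itself; Lingual instead parses ANSI SQL into Cascading flows that run on Hadoop:

```python
import sqlite3

# SQL as "just a library": no server, no installation, an embedded
# engine behind a standard driver API. Table and data are invented
# for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delayed INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [("AA", 1), ("AA", 0), ("UA", 1)])

# Cut-and-paste legacy SQL, unchanged:
rows = conn.execute(
    "SELECT carrier, SUM(delayed) FROM flights GROUP BY carrier"
).fetchall()
print(sorted(rows))  # [('AA', 1), ('UA', 1)]
```

Swap the embedded engine for a query planner that emits cluster jobs, and keep the JDBC-style driver on top, and you have the Lingual idea: the application code and the SQL don't change, only what executes underneath.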
But if you can come back and actually look at what happened in your organization in the last 10 minutes, or 20 hours, or 20 days, you can make business decisions around that, or you can improve your business.

Well, it's a complex story, because it's a simple technology idea but the implications are diverse, right? So you have breaking down the silos, because I can read and write across different apps. That's compelling. That's contrarian, if you're a big whale. You want people to use your technology; you want to own it all and guard everything.

Right. Well, Hadoop's never used alone. Data doesn't magically show up on Hadoop, unless you're using Hadoop and reading its own logs, right? So you have to integrate with everything, so making that first class is primary.

Yeah, awesome.

But what happens with us, for the product itself, is that when you build a Cascading application, everything you did is captured, and you can actually see it at any point in time and understand it. And you can look at all the other things and look at the relationships. That's the power. So using our APIs, you magically keep a documented history of everything your organization has done, if you standardize on us, and you can go back and reason about it. It's a very, very simple concept, right? But it's at the business level. It's not which job did what; it's which applications succeeded or failed, which developer wrote the best code, which applications failed the most or succeeded the most, if you want to take it that far.

Yeah, the timing of the value proposition is interesting too, because you mentioned business logic and the API, right? That's interesting. At this conference you're hearing a lot of discussion that's not on the word cloud; you're hearing words like business outcomes. That's an advanced industry term.
When people are talking business outcomes, you're not in who-contributed-the-lines-of-code anymore. Now we're talking about result objectives.

Right. I see it as, and maybe this is a horrible metaphor, but once you're in the business of data, and data's a raw material, et cetera, et cetera, it's a supply chain problem. It's a lot of upper-level business problems. But as companies mature, your most important customer is probably your CFO, if you're a public company. So having applications that succeed 100% of the time, versus failing 1% or 5% of the time, is extremely important.

Yeah, I had a conversation with a CFO two days ago, and he said we'd like to get theCUBE content to the CFOs, because the CFOs aren't having the big data conversation. They don't really get the tech yet.

They're getting there. They're getting there. I mean, there are companies whose asset is their data, and how do they measure its value? They're using things like Hadoop to actually come up with a metric to set a value on it for when they do their reporting. You know, I'm not a CFO, I don't really know all the magic words, but these are things I hear, and they're important. That's who the customer is for that particular set of applications.

So take a step back, take your founder hat off and put your industry-participant hat on, your tech-athlete hat on, and look at the landscape. What gets you excited right now in terms of the trends, the tech, the projects out there? What's out there that gets your attention? Say, hey, I'm really jazzed about these three things.

You know, not that they need any more press, but Elasticsearch is one of my favorite technologies. They got 70 million from NEA. NEA, thank you.
And so we love that technology, but it's been around forever. Other technologies? I don't know. I love anything on top of Cascading, obviously, but I've been heads down and haven't really been paying attention. You mentioned Spark, right? Spark is actually extremely interesting. Tez is extremely interesting, out of Hortonworks. There's Storm and the other streaming technologies; I know Continuuity is working on some interesting streaming technologies. I haven't seen them all in play, so I can't speak firsthand. I've had many hats, right? So I've used a lot of these technologies, but the new ones I haven't used, and being a pragmatist, I only speak to the things I really know something about. But Tez is going to be extremely interesting when we ship that under Cascading, or Cascading on top of it. And the same is true with Spark and any other technologies. There are even new, very similar ones coming out of research groups into the Apache Foundation; I think Stratosphere is one of them.

We have some questions from the crowd chat, crowdchat.net/hadoopsummit, our new innovative social engagement container, in beta, being announced on Monday. What kind of reporting does Cascading support so that developers can receive input back on how their program is behaving at scale?

It's an interesting question. So today, what we've got gives you a visualization of the actual application, broken up over all the units of work on your Hadoop cluster, and you actually observe it executing. And if you see something that looks like it's not performing the way you'd like, you can drill all the way down, in the case of MapReduce, to the MapReduce job, and actually visualize the inner data pipeline that's executing and see how it's been parallelized. And with that, Hadoop keeps lots of counters and information about how it performs.
It's not the raw counter value that matters; it's the distribution of values. So if one task has a long delay here, do the other counters have a similar delay, similar lengths? Looking at distributions, in comparison with the data pipeline, simultaneously while it's running, is extremely important to diagnosing different issues and slowdown problems. Some of those can be remediated by changing a property file. Some of them have to be remediated by changing your algorithm. Some of them are just temporary, because you're Twitter, and you did a join on Lady Gaga and it exploded your Hadoop cluster, right? 40 million followers. Boom.

So I did a project a long time ago where I had a multi-thousand-node cluster in Amazon, right when EMR was still in beta. We had a 9 p.m. SLA and we were delivering ad-targeting information; you can read about the case study on Amazon. But the issue is, one day it just never stopped. The cluster never met the SLA, and I couldn't tell if it was hardware, or software, or data. In Amazon, you want to blame Amazon, right? And killing a node could have been the solution. If it was software, then I could just kill it and fix the bug, but I didn't know where the code was running. But if it was data, then killing a node isn't going to solve anything; it's just going to start over. So do I filter the data, kill it, then rerun it? Do I need a new algorithm, or do I just keep waiting? And if I keep waiting, and I ever use that key that's causing the explosion again, where does the key recur in the code? I'm going to have the explosion over and over and over again. Having that visibility and insight is what we're solving today, right?

Folks, that's what a DevOps ninja sounds like. That is what DevOps is all about. That's such a hard problem. I mean, that is really under the hood.

Dev-analyst ops.
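The diagnostic he's describing, comparing the distribution of per-task counters rather than the raw values, can be sketched in a few lines. The numbers here are invented for illustration; real counters would come from the Hadoop job history:

```python
import statistics

def find_skewed_tasks(records_per_task, k=2.0):
    """Return task indices whose record counts are outliers relative
    to the rest -- the 'join on Lady Gaga' pattern, where one hot key
    sends a wildly disproportionate share of records to one reducer."""
    med = statistics.median(records_per_task)
    # Median absolute deviation: robust against the outlier itself,
    # unlike mean/stdev, which the hot key would drag upward.
    mad = statistics.median(abs(x - med) for x in records_per_task)
    return [i for i, x in enumerate(records_per_task)
            if mad and abs(x - med) / mad > k]

counts = [1_050, 990, 1_020, 40_000_000, 1_010]  # one hot key
print(find_skewed_tasks(counts))  # [3]
```

Crucially, this only works across the distribution: task 3's counter alone tells you nothing until it is compared with its siblings running the same pipeline stage, which is the point being made above.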
I mean, think about the complexity of that, right? That scale. That is, I think, one of the things that we hear all the time: I want to simplify that. I have developers who just want all of that to scale across multiple apps. And you talk under the hood, that's a problem. Do you see that getting better? Obviously you guys have Cascading, but what are the problems out there that you see that you can solve?

Underneath Cascading?

Under the hood in general. You've got cloud. So cloud, and you said software, hardware, data. Three major problems.

Well, first off is being able to identify which and what it is, right? For us, connecting the dots between your code and what happened in the data is extremely important, and we try to do that. But what other problems need to be solved? I don't know. I'd love to be able to run much faster. I'd love to be able to spin up a cluster over a stack of Mac minis with Vagrant really quick, just push a button. If you can solve that for me, I'll hire you.

There it is. This is the challenge. So what's next? Give us the roadmap real quick before we end the segment. We'll give you the last word. What's next with the company? What are your plans to break through the B round? And then we'll call it a day.

Well, obviously hiring. Cascading 3.0 will be out this summer with Apache Tez support, and we're working on Apache Spark and a bunch of other technologies. Driven 1.0, the data engineer product, will be out in days; you can actually download a beta trial right now on cascading.io. We're going to continue hiring and iterating on the product. We have an extremely large roadmap and a lot of things ahead of us to focus on.

Congratulations on the funding. Chris Wensel, CTO of Concurrent: great talent doing some great work, the best kept secret in big data, as they're saying, and really a good profile of the kind of innovation out there. Congratulations. Appreciate it.
This is theCUBE right back after this short break.