 signal from the noise, and we have a great guest here: Jonathan Ellis, co-founder and CTO of DataStax, who also heads up the Cassandra project within the Apache Foundation. Welcome to theCUBE. Thank you. I'm joined by Jeff Kelly, our top big data analyst at wikibon.org, our research team. Jonathan, you were up there in the keynote, so welcome to theCUBE, first time on. We get a little technical; we heard you want to get down in the weeds and talk some tech. But first, before we get into some of the specifics around what's going on in the marketplace and the tech, take us through your keynote this morning. It wasn't live streamed, but you guys had a packed house. Billy was up there doing the welcome, kind of chilling out, had a nice ding on HBase, which I have on Socialcam; we'll play it later. But talk about what you covered in your keynote. What was your key message? My key message, well, I wanted to give people a flavor of what problems people are solving with Cassandra. So I talked about how eBay and Disney and SimpleReach and SourceNinja and Netflix are using Cassandra. I picked those specifically because those are companies giving talks here at the conference about how they're using Cassandra, so people can go and get more detail on that. And then I talked about where we're going with Cassandra in terms of what features we're working on and when you can expect those. So just kind of, here's where Cassandra is and where we're going. What I find really exciting about this market is, and obviously for the folks who don't know, SiliconANGLE's motto is where computer science meets social science, and I guess I'm in my late 40s, so old-school database degree, and I've lived the days of the mainframe, and you saw the same kind of movement, where you had different approaches from relational databases but things kind of circled back around.
So you have a different range of demographics and ages in the marketplace. They're trying to solve a lot of these big problems around new data and mobile and cloud. So you have kind of old school, you have kind of new school, you have open source, scale up, scale out, all this stuff going on. So the question I have for you is, what do you see as the key problem being solved right now in the marketplace? Obviously NoSQL, it's pretty obvious from the press coverage and a lot of the hype, but from a functional standpoint, people are adopting NoSQL and bringing it into their production environments, yet balancing that with the relational database infrastructure that they have and/or will need on top of NoSQL. What are the key use cases you see, the problems being solved in that area? Well, that's a good question, and I think it's worth elaborating a little bit on one of those terms you brought up, the NoSQL movement, if you will. That's been a blessing and a curse for us. I was actually at the inaugural NoSQL conference that Johan organized in San Francisco about three years ago, and that was where it really started taking off. It's been good because it really called people's attention to the fact that relational databases are a hammer but not every problem is a nail. So there are other tools that are better at solving specific problems. And then the problem with the NoSQL term, of course, is that lumping in everyone that's not a relational database is a little bit of a disservice, because my personal take is there's nothing wrong with SQL as a domain-specific language for accessing data. In fact, it's actually pretty good at that. And so we've been taking steps in that direction with Cassandra. We added the Cassandra Query Language about a year ago. We've been improving it steadily since then.
So I think the main thing that we're focused on with Cassandra is saying that we're interested in the scaling part, not in the language you access it with. We're interested in scaling applications that are millions of requests per second, that are terabytes of data. This is something that the engineering trade-offs that relational databases chose are not a good fit for. And so we've said engineering is trade-offs, right? There's no magic bullet. But we can make choices that are more appropriate for these scaling real-time applications, choices that make Cassandra a better fit than MySQL or Oracle. On the general purpose versus specialist argument, obviously we play with Hadoop, we play with HBase, and we don't use Cassandra yet because we're a small media group. But that's a hard question. Everyone's talking about that, whether it's a data scientist, analyst type, and/or a programmer, developer. Specialism is really important in these new emerging areas. The road to general purpose is where the crossing-the-chasm moment kind of starts to hit, where you can start attacking those problems. What are you guys seeing with Cassandra? As you see more production use cases develop, what are those specific general purpose problems? Is it scale out, scale up? Is it online transactional? Is it more writes, fewer reads? What are you seeing that Cassandra is really hitting the groove on? I mean, I know at DataStax you guys are doing a lot more to make it more general purpose, but specifically what beachhead are you navigating off of and pivoting off of, solution-wise, that you guys are kicking ass in? If I had to pick a single thing that Cassandra does better than anyone else, it's support for multiple data centers. So we're firmly in the scale-out camp. We're tackling the more OLTP side of things. The problem is that there's not really a great term for what we do, because when you say OLTP, people think, well, I have to begin and commit and roll back, right?
But then what else can you call it? I could say it's short requests: we're about answering these small requests millions of times per second. We're not doing joins, we're not doing aggregation. That's what we're doing. So I think it's fair to say that a couple of years ago, when we first got started, we were firmly in kind of, you use Cassandra for social media. We've moved beyond that now. We know of literally hundreds of production Cassandra users across all kinds of industries. And yeah, people are recognizing- It's interesting, man. You brought up multiple data centers. Obviously we covered the data center at the Blanket, our blog on Wikibon, and we covered converged infrastructure. It's interesting, these new use cases where NoSQL and big data are really relevant are these new environments where you're ingesting a lot of data; it might be write-intensive sometimes, more read-intensive at others, so very variable. And those are emerging solutions. But when you talk about the data center, one of the things that we've noticed, and I want to get your take on this, is that with the advancement of solid state, we've seen a massive shift over the past 18 months from off-premise deployments to maintaining on-premise solutions, because with SSDs and solid state, we can actually put in caching layers to scale up MySQL or other legacy environments, and that has changed some of the paradigm of what's going on in the data center. What have you seen with Cassandra, if you're playing in the data center? Has that made an impact on you? Because obviously SSDs impact the spinning disk market, which is critical when talking about read-write. So talk about how that has impacted this over the past year or two. So SSDs have been a revolution, a slow revolution, because we've had early adopters starting to use them, but it's finally starting to go mainstream.
And you can see that because Amazon announced that they're doing cloud instances based on SSDs, and there's a rumor that the next version of Microsoft Azure will be entirely SSD based as well. So this is something that we've known was coming and- You don't have to be a rocket scientist to figure out that they've got to move off the spinning disk at some level. And two years ago at a Python conference I gave a talk on database scalability, and one of the things I said was that SSDs are as close to a silver bullet as you're going to get, because you just get that random IO performance, so you can serve a much larger hot data set than you could on spinning disks and a RAM cache. Well, we've seen SSD not just from the database aspect but from a complete re-architecture of data center topologies, where clients and large enterprises were looking at off-premise solutions in the cloud and are completely throwing those away by staying on premise, because they get the economics and the performance with SSD in the areas where storage was critical. And that's the database. So the advantage I see to cloud is that you can say, I'm not going to focus on hiring ops teams and training them. I'm going to focus on my core business and what I'm good at, and kind of outsource that infrastructure. You do pay a price premium for that. So I don't think that there's one right answer for everyone. It's hybrid. I mean, right now we see hybrid really dominating, where it's a combination of on-prem and off-prem, but the whole everything's-going-to-the-cloud thing, it's just not happening in droves. I mean, we're seeing that pretty clearly, but that's not to say that cloud is going to die. I mean, cloud will be around. I just don't think cloud is as revolutionary as big data is, in the sense of big data being about using data to redevelop applications and whatnot. So we're watching that, and I'll let Jeff ask a question; I know he's been waiting to chime in. Go ahead, Jeff. Thanks, John.
Can we not play full conversation? No problem. So I would love to get your take: help our audience put Cassandra in the context of the NoSQL movement, in the sense of, are all these NoSQL databases moving towards the same goal? Or, you mentioned the right tool for the right job. Help our audience understand; okay, they understand on a conceptual level what Cassandra does, what Mongo might do, what HBase does, but really, are these going for different use cases ultimately, or are they all going to eventually converge, and the winner will be based on whoever can provide the best support, the best use cases, et cetera? There's some of both, right? So there's definitely things like, for instance, Voldemort, another project in kind of the same general area as Cassandra, dealing with large clusters and many, many requests per second. And we're seeing that there's some consolidation there; not to pick on them, but there isn't a whole lot of community around Voldemort. And so there's some consolidation in that respect. Now, especially when you're using an open source project, it's like, does an old soldier ever die, right? It can live on as long as someone's interested in maintaining it. But I think there are going to be a few leaders that are going to accrue more momentum around them. But then there is room, I think, for diversification, in that MongoDB, for instance, is tackling something different. They are looking at being very easy to get into, a very developer-friendly thing for hobbyists and small companies. MongoDB isn't really tackling the, I want to scale to dozens and hundreds of machines. So we don't really see them as competitors in that respect.
So you mentioned the community, and I think that's going to play a really important role in terms of which of the databases comes to the fore over the next couple of years, because really, as you move into the enterprise, with more traditional, maybe risk-averse IT departments, what they're looking for is the community to support regular upgrade cycles, a company like DataStax providing security and management capabilities, et cetera. So John asked a little bit about this before, but could you elaborate a little more on, you know, what's the community like? What's the vibe in the Cassandra community? Maybe compare it to what we're seeing in the HBase community, you know, is it. What's the personality? Yeah, and are you confident that it's the right community to move this to the enterprise level? Well, we are seeing a little bit of a shift in the community as we've started to move beyond the really early adopters. So two years ago we had the first Cassandra Summit, and we had less than 200 people there, right? Now we have over 800, and so the audience has changed a little bit as that's happened. But if I were to pick one description for Cassandra users, I would say it would be problem solvers who are practical, and there are some communities that I would say might be characterized as being elitist. So we're concerned about having the best technology and we really want to push the technology forward, but we want to make it accessible at the same time, if that makes sense. Rather than, on a theoretical level, we have this great technology but practically what can you do with it, what you're saying, I think, is that you're really focused on practically getting down to use cases and solving real problems for the business.
I had a conversation with a guy in academia where he's saying this product is way better from these theoretical standpoints, but oh, by the way, it crashes and loses data, right? So that's the kind of thing that really matters in the real world. So let's talk about developers. Obviously this is the wheelhouse for developers. Open source is fantastic. We're on, like, what, our third generation of open source; we see open source is no longer the other approach, it's the approach. Everyone's using open source, from startups to big companies. Developers, programming languages, and frameworks have been all the rage. Spring was bought by VMware, you've got Rails, you've got Python, all these different languages and frameworks. Hadoop versus Java versus C. Is there a difference? I mean, obviously each one attracts a different kind of programmer. C is more for, I would call them, hardcore coders. I think that's what you're saying, kind of problem solvers; maybe that might be more of a description of you guys, more hardcore coders than, say, I don't want to say Java's not for hardcore coders, but Java's easier. But which one lends itself better, or is there a tool? Is there a hammer versus a chainsaw, or what's the, break that down, Java versus C and C++. So I guess I can make an analogy there to the NoSQL space. Java was designed as kind of a response to C++; the people who designed Java were all C++ users, and they were saying, these are the problems we have with C++ on a day-to-day basis. And the big one of those was memory management: I have to malloc and free, I have to new and delete, and I have to take care of that, and tracking down memory leaks is a big part of my time that the language or the environment should be taking care of for me. So Java's easier because you have that memory management built in. But it's not easy because I'm giving up power; it's easy because it's solving a real problem.
So Cassandra's kind of a second-generation NoSQL system, in the sense that HBase, as you know, is inspired directly by Google's Bigtable. And Cassandra came out- But Bigtable, I believe, wasn't Java based; that was C++. Yes, it was built on C++, yes. But the analogy I want to go with is that the people behind Cassandra were able to look at the Bigtable design and say, well, these are the sharp edges there. It has these single points of failure. So we're going to architect Cassandra as a fully distributed system. We're going to give you a similar data model, where you have rows and columns, unlike some other first-generation NoSQL products that were strictly key-value. So Cassandra's going to give you the best of both worlds: a fully distributed system with a powerful data model. Awesome, well, that brings up a lot of different arguments. But again, back to the point: explain to the developers your philosophy, from your personal perspective, around this no one size fits all, no one tool for the job. And as a quote on Horai tweeted earlier last night is that any developer worth their salt is not going to be hung up on one stack. Explain your philosophy on this; do you believe in that? Or should it be like, oh, I like American League baseball versus National League baseball? I mean, in a way, isn't that what we're talking about here? It's like, what's your flavor? Yeah, I mean, I can relate that to my own career in a couple of ways. I'm a big fan of the Python language. I've spoken at PyCon probably five times, and I really like the elegance there. But when I started investigating scalable databases for Rackspace back in the day, building a database in Python just doesn't make sense. You've got some problems around concurrency with the global interpreter lock. And more fundamentally, Java is five to 10 times faster. And so I'm willing to give up some of the expressivity of Python to get that performance.
And so I make no apologies about being a Python guy at heart, writing Java code all day. Yeah, it's kind of like, what kind of artist are you? You want to do a quick hack and then find a solution and then really build it out. Which brings up my next question. Being kind of a student of entrepreneurship in tech over the past 30 years, we've had an interesting past decade. We moved from Web 1.0 to seeing things like iTunes and YouTube, stuff that wasn't around a decade ago. But from an entrepreneurial perspective, with open source, you can start a company up and pop in some MySQL. Amazon's been an amazing resource for getting started. Look at Twitter, look at Zynga, look at Facebook; these are great examples of companies that just explode on the scene and develop really fast. But then they realize, damn, we have to really re-architect. It's well documented on Quora, you go anywhere, it's all well documented. It's like, you've got the problem of, we've got critical mass, we've got our funding, and we've got technical problems, which is shit. We've got to change the airplane engine out at 30,000 feet. Zynga went through it, Twitter tried to do it. I know Cassandra's been talked about as plugging in, plugging out. Talk about that dynamic from a, quote, historical perspective. This is a great thing: build a company, get it funded, prove the model, everyone's happy. But then you're at a critical juncture of, I've got to really rebuild to scale. There's some serious computer science involved, really serious tech. What is that? Where is that line? What is that benchmark? And what do developers do today to avoid that, knowing what we've learned? You know, I'm not sure avoiding it is even necessarily the right thing to do.
Even if I had a crystal ball that says I'm going to be at 100 million users in five years, maybe I still start off with building it on PostgreSQL, because if my team's more familiar with that, it's okay sometimes to accrue technical debt, as long as you know what you're getting into and you're making that decision deliberately. So an example I gave in my keynote this morning was SourceNinja, which is using Cassandra, but that's not the system they started with. They started off with something they were more familiar with, ran into the limitations of that, and then they said, okay, well, now we have the revenue, we have the users, and it's time to start paying down some of that technical debt. I think that's a totally reasonable thing to do. Take us through your advice if you were brought in as a consultant, knowing what you know now and looking forward. Obviously Quora's littered with, I tried Cassandra, it was too hard, the UI tools aren't there, I need more tooling, and obviously DataStax is doing some things there. But take us through what Cassandra can do for that startup, that company that says, hey, you know what, we're doing some black souls out in the cloud, we started tweaking around, we built an app around it, we've done this app, this is a random example, and now it's, shit, we've got to scale this. Where would Cassandra fit, and how does someone get from, hey, we have technical chops, but we have some technical debt to get out of? Where does Cassandra fit? Can you just take us through a quick example of where that would be great? So what people typically do is they start by saying, you know, these five tables are 80% of my reads and writes. So those are the logical candidates. I'll start by moving those to Cassandra. And then when you're using Cassandra, Cassandra forces you to think efficiently and it makes you discard some bad habits.
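The migration heuristic described here (find the handful of tables that account for most of the reads and writes, and move those first) can be sketched roughly as follows. This is only an illustration: the table names, the request counts, and the `hot_tables` helper are hypothetical, and the 80% threshold is taken from the example in the conversation rather than from any Cassandra tooling.

```python
def hot_tables(request_counts, threshold=0.80):
    """Return the busiest tables that together cover `threshold` of traffic.

    request_counts maps a table name to its combined read + write count.
    """
    total = sum(request_counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    # Walk tables from busiest to quietest until the threshold is reached.
    ranked = sorted(request_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if covered / total >= threshold:
            break
        selected.append(name)
        covered += count
    return selected


# Hypothetical per-table request counts for an existing relational system.
counts = {
    "timeline": 500_000,
    "messages": 250_000,
    "sessions": 100_000,
    "users": 90_000,
    "billing": 40_000,
    "audit_log": 20_000,
}
candidates = hot_tables(counts)
print(candidates)
```

With these made-up numbers, three of the six tables already account for more than 80% of the traffic, which is the kind of short list one might consider moving first while the rest stays in the relational database.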
I've heard a rumor that, you know, the first thing that eBay does when they bring a new hire onto their core team, you know, they're built on Oracle, well, except where they're using Cassandra now, but, you know. How to quickly get the guy to quit: work on the Oracle system. But, you know, they train people on Oracle, at least this is what I've heard, to not do joins. You know, we use Oracle, but we don't do joins, because that just doesn't scale. So Cassandra says, you know, we're not going to give you that security blanket. We don't support joins at all. It's kind of like real estate. They show you the crappy houses first and then you say, I want that house. Play with Oracle and then move to Cassandra, so it's a dream. Okay, that's cool. Jeff, anything else? Well, I know your talk today was all about the future. So I just wanted to get your take on, you know, where are we going if we're at this table a year from now. What are some of the key limitations or the key projects you're working on that we're going to see addressed over the next year, both in the community and, to some degree, at DataStax, specifically what they're working on as well. I guess if I had to pick a theme for what we're doing, it would probably be better support for large cluster members. So right now the sweet spot for a Cassandra cluster is lots of fairly small machines. So maybe eight cores and a terabyte of disk. So relatively thinner machines. And for a bunch of reasons, there's demand for supporting 32 cores, eight terabytes, and being able to scale up better as well as scaling out. So where's that coming from? So, well, I don't know if I can name names, but for instance, one Cassandra user is using Cassandra as part of an appliance. And so they want to ship this messaging platform that they have on a relatively small number of machines. And so they'll make those beefier.
They would rather make them beefier than add more machines. Within reason, right? Obviously, past a certain point, it becomes not cost effective. But up to that point, they would rather have fewer, beefier machines. So this is something that we're working on, supporting those better. So I mentioned in the keynote support for letting Cassandra manage the disks in a JBOD configuration so it can get the maximum use out of those. And the virtual nodes that I mentioned, so that you can parallelize the rebuild if you lose a machine. I want to be able to parallelize that rebuild across my entire cluster, because if I'm streaming eight terabytes from just a single source, that's going to take me a while. But if I can stream it in from a dozen sources, I cut that by a factor of 10. So that's one theme, I think, of the improvements we're making. Quick breakdown as we wind up the segment for our final questions. Just break down the horses on the track for us. Cassandra, HBase, Mongo, Couch. Okay. Okay, Jeff's favorite line, or as Dave Vellante, my co-host, says, horses for courses. Some run better in the mud, some run better on the dry track, grass, whatever. I mean, break those down, because that's really on the table, those guys. So I mean- People are talking about it; just give us your quick personal take on that. Yeah, so I would say that Couch and Mongo are kind of going after the same market, and then Cassandra and HBase are going after the same market. Couch and MongoDB are both talking about, let's make things easier for developers. Couch especially is talking about being able to take data offline and resynchronize later on. But they're both focused on relatively small data sets. MongoDB has this global write lock, and that is problematic as you try to scale up. And so it's kind of a different market than what HBase and Cassandra are going after. HBase is one of the systems that I looked at when I was at Rackspace.
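The rebuild arithmetic from the virtual nodes remark earlier in this answer can be made concrete with a small back-of-the-envelope sketch. The eight terabytes and the dozen sources come from the conversation; the per-source streaming rate of 100 MB/s is purely an assumed number, and the `rebuild_hours` helper is hypothetical. The model also ignores the receiving node's own network and disk limits.

```python
def rebuild_hours(data_tb, sources, mb_per_sec_per_source=100):
    """Estimate hours to re-stream `data_tb` terabytes from `sources` peers in parallel."""
    total_mb = data_tb * 1_000_000  # decimal terabytes to megabytes
    seconds = total_mb / (sources * mb_per_sec_per_source)
    return seconds / 3600


single = rebuild_hours(8, sources=1)
parallel = rebuild_hours(8, sources=12)
print(f"1 source:   {single:.1f} h")
print(f"12 sources: {parallel:.1f} h")
print(f"speedup:    {single / parallel:.0f}x")
```

Under this idealized model a dozen sources cuts the rebuild time twelvefold, which is in the same ballpark as the "factor of 10" quoted above; in a real cluster the receiving machine would likely become the bottleneck before that.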
So I wasn't one of the founders of Cassandra, right? I came onto it after it was moved into the Apache project. I was looking at these different systems because Rackspace knew that they needed scalable infrastructure. They're playing with Amazon and Google; they know that if they're playing with the big boys, they need big boy toys. And so- Yeah, and I've had many conversations with Lew Moorman over there about that. Yeah, yeah. So I looked at- It took a few tries. HBase was one of the systems that I looked at, and it was more mature than Cassandra in a lot of ways back then. But architecturally, it's built around HDFS, and the HDFS NameNode is a single point of failure. Then you have the HMaster and the region servers. And it's not a system that can run continuously when you have machines failing. And there's still a lot of complexity involved there too. You have those three services that I mentioned. You also have ZooKeeper thrown into the mix. You have HDFS data nodes. So there's a lot of complexity around that. And so that's why I think you tend to see HBase deployed in shops that have heavily bought into HDFS already and have a team of deep experts that can dig into that. We were at the HBase conference in San Francisco, the first one, and we called HBase the tailored suit. Once you have it tailored, it's rocking and rolling, but don't try to give that to somebody else. So use case wise, really, if you have the team in place, you have the expertise and the use case, HBase can be great. Yeah. So I think Cassandra improves on that. Do you agree with that? Yeah, I guess I would agree with that. I mean, you can make it work if you have the expertise, but even if you have the expertise, that's not going to make some fundamental limitations go away.
For instance, Facebook has one of the deepest Hadoop and HBase teams around, but their HBase cluster, they've actually sharded that into multiple HBase shards because of that NameNode single point of failure that they have, and so- And they've also limited it to only a specific aspect of the platform. Yeah, I mean, there's some politics there. Their MySQL team thinks that MySQL is the way to go and so forth, right? So it's not entirely a technical decision, but what's the point of using a distributed database if you have to go ahead and shard it again afterwards? So yeah, these are some of the things I was evaluating when I was looking at these systems. So I do think that Cassandra has the edge architecturally. I said earlier, no, we're not trying to be elitist about it, but we do want to be aware that this kind of architecture difference does matter, and that it will affect your site uptime. So yeah. I love talking to computer science dudes coming right out of college to see what they think, and the more senior guys who have been around the block in the industry, in computing and systems, can have a different kind of blend. The old-school systems guys see certain things a certain way and the new-school guys see things other ways, but that being said, really I look at it in terms of applications. Mobile, geo for example; I mean, Mongo has been credited, I think Foursquare uses Mongo for some of their geo work. Is there a generalization where you could say this is better for that? You mentioned multiple data centers for Cassandra. Would you say that mobile-type developers would be better off with Mongo, or is there a general use case? I don't want to be specific. So I mean, really, MongoDB stops being appropriate once you hit data sets that don't fit in memory, just because of the way their storage engine works. The performance really falls off a cliff.
So mobile-focused developers, Urban Airship for instance, do a lot in the mobile space, but they tried out Mongo and then ran away from it because of these limitations. So I'd have a hard time breaking it down into application domains. I think it's more- It's not that mature yet either way. It's more about the scale of the problem: if you have a Cassandra-size problem, then MongoDB isn't really an option. Got it, cool. Well, I mean, we've seen a lot of politics, and you mentioned Facebook. I read a comment from Adam D'Angelo on Quora saying, oh yeah, we're just going to use MySQL and we'll throw more caching layers on top, so their answer is just scale up more and throw more cache at it. Yeah, so Adam said that in 2010. At the time, obviously, Cassandra was a lot less mature. I'm not going to try to convince anyone who's not comfortable with Cassandra that they absolutely should use it. If you're more comfortable sharding MySQL, go for it. But I think- There's trade-offs either way. There are trade-offs, and I think you can make an excellent case. If I talked to Adam today I would definitely make a good case that in 2012 you should be going with Cassandra instead. All right, final question for me and Chad will give you a final question. Looking out over the next year to Cassandra Summit '13, in your mind, shoot the arrow forward. What do you think is going to happen? How do you think the market will evolve? How do you think the community will evolve? And potentially, what will your slides look like next year? I'm not asking you to tell us what the slides are, but just conceptually, where do you hope to be, given the macro tech environment and some of the conditions of the market? Well, I want to see us continue broadening the scope of the developers that we can appeal to, and make it so you don't have to be quite so hardcore to use Cassandra.
We have a screencast up, I think it's called Cassandra in Two Minutes, where one of our engineers, Jake Luciani, walks you through standing up a four-node Cassandra cluster literally in two minutes. And so from the operational standpoint we have a really good story: by having a fully distributed cluster where every node is the same, you don't have to special-case these things or build them on special hardware. We want to continue improving our developer story to match the ease of use of operations. So I talked about some of the things we're doing in CQL for 1.2, and we're going to continue to move that ball forward. And continue to make lives easier when people build applications on Cassandra. Okay, all right, Jonathan Ellis is co-founder and CTO of DataStax. Also running the Cassandra project over at Apache. Thank you for coming on theCUBE, great conversation. We'll be right back with our next guest after this break. Thanks for having me. Thanks, good job.