But all anyone really cared to talk about was the data log issue, and that became the product and then expanded from there. Talk about the analytics space and compare that to some of the database confusion going on in the marketplace, because we just had Jonathan on talking about HBase, Cassandra, Mongo, Couch. You know, a lot of young developers out there coming out really with a high desire to start coding and playing with the open source community. It's just kind of the instant direction. So give your take on where the Cassandra marketplace is relative to things like analytics, things like frameworks. Yeah, so it's one of the interesting reasons that I joined Splunk. I've spent the last two years evaluating a lot of different technologies and looking at what's in the space from an analytics perspective, as well as what's available on the data store and SQL store side. The company I came from is a very big Cassandra user, heavily committed to that space. Tested HBase, tested Mongo, tested Couch. Very small footprints of Mongo in the space, but mostly the use cases were all around Cassandra. A couple hundred nodes on DSE. But the reason for that is just looking at the space and figuring out what's out there. It's very complicated. It requires a lot of heavy developer input and developer savvy, let's say, to be able to deploy a lot of the other solutions that are out there. HBase took us about a month to even get a stable cluster. We started having some of our developers test it. It was very challenging. If you're a hardcore Java dev and you want to spend your whole life focusing on one technology, that's great. But given that we had a lot of things to do and a lot of things to accomplish in the architecture team and the prototyping that we did, it was really valuable to work with stuff like DSE, where it was easy to deploy. We started with Cassandra at 0.6 in production, 0.5 in the lab environment. So we were kind of early to the space. 
Share with us your experience and give us some editorial in your own mind's eye around what's happening in the marketplace. Because there are two schools of thought around analytics right now. There's the old school, which is business intelligence, data warehousing: fence the data, park it way out there, pull it in, run a report. Our joke is from the movie Office Space, the TPS report. "Where are those TPS reports?" Old-school BI is like, ugh, you cringe, it's not near real time. Now all the sex appeal is real time, near real time, low latency, fault tolerance, high availability, distributed databases, the stuff Jonathan's talking about. And then you have the new analytic world, which is, again, real time. But sometimes analytics goes faster. Like in your case, you guys have done some really good work on the front end of analytics, and now you've got to kind of shoehorn that back to other databases. So what is the state of the market relative to analytics and these environments like Cassandra, HBase and Mongo? Yeah, it's really interesting on the analytics side, the things that are changing every day. I've talked to a whole bunch of BI guys; I was involved in the data warehousing side of the house back in the 90s with telcos. When it was a huge... Oh, God, sorry. Yeah, when we had 12 HP V-Class cabinets full of drives to do nothing but store call records so they could do analytics, where analytics meant SQL joins that took developers weeks to write. So what's changed to now? It's very interesting that with the real-time components of Splunk and various other tools that are out there, you take all of that pain out, you make it easy to do, and you put it in a UI where it doesn't matter. I mean, in my previous job, we had the CEO and the CFO of the company log in to look at a dashboard in Splunk, which connected to Cassandra on the back end to pull in a lot of the transactional analytics. 
It connected to SQL data stores, which are on the legacy side, for bookings because of SOX compliance, so you could still go to the data warehouse if you needed to run a report over the last 18 months. But that's not that interesting to 90% of the people. What happened in the last 30 days? I mean, people are blown away when you show them those dashboards. People fall out of their chairs. Oh yeah. I mean, pretty much, right? Absolutely, and I will say that you made some profound changes in the way that the marketing teams, the SEO and SEM folks, did their job just by building these dashboards where they could actually see what the investment was versus what they were actually getting in click-throughs. So Billy Bosworth and I talked at Hadoop Summit. We were with the Hortonworks folks at Hadoop Summit, which was a great event. We did theCUBE there, presented there. And Jonathan was just on talking about making things easier, tooling, UI, and you guys play a big role in that, you know, working with Cassandra. How do you make it easier? Because the new analyst, the new data scientist, is not so much the PhD or the master's in computer science. The new category of data scientist is someone like a quant jock who knows SPSS or some other tool. They're not super programmers, but they need to actually manipulate the data. So talk about the dynamic between that person and the developer who needs to be the Cassandra guru, the computer science guy. What does the marketplace look like? Because you guys play on both sides. Yeah, so it's funny that part of this big data movement has made developers more and more critical to getting anything done, which is good for the developer community. But there aren't as many of them, because it's a four-to-six-year-old market; there just aren't a lot of experts in six years. 
So it's very interesting to see how that's evolving. But at the same time, your analysts who are looking at the data, whether they're called data scientists or not depending on the company, some of them have absolutely no technical chops whatsoever. They just need to know how it works, and they need to get at the data from a business perspective. So kind of what was built before in my previous job, and then working now with Splunk, part of what we're building is the front end to be able to go look at and visualize the data and explore the data, whether it's in a Hadoop store or a Cassandra store or in the Splunk native index. So you can pull all that data together and get a view for the user in the business who really has no technical ability at all. So HBase, Mongo, all the different approaches. It's easy to ingest data; it's hard to get it out. What's your take on that current situation? Because getting large sets of data out in near real time is very challenging. What best practices and solutions have you seen out there that allow a dashboard to render a lot of unstructured data? Well, obviously I'm a little bit partial to Splunk, working there. Yeah, of course, yeah. Which is a very good engine for doing that no matter what the backing store is. Writing a connector to Lucid or Mongo or Couch or something else is not complicated. It can all be done in the SDKs, and you can quickly pull that data in. What you have to do is know what you're actually indexing, and the real problem with a lot of the NoSQL stores is that you have to know what you're putting in ahead of time so you know how you want to get it out. So... In a way, the whole NoSQL thing is kind of a weird term, because there's no such thing as schemaless. Right. You always have schema. There's just less schema than the big schema. No hard-and-fast schema that means every column has to say exactly what it is. 
And the same thing with columnar data stores, right? The whole point in moving that direction is getting away from the hard-and-fast schema where, every time you have an application update, you've got to go do 50 SQL scripts to update your schema so everything works. Yeah. So the notion of schemaless is really not the right word. It really means you're not reliant on schema to do stuff. Right. You have to have some schema at some level. Yeah, there is a schema of some sort, and even on the Splunk side there's a schema there too. It's just that the schema is only applied at search time instead of being applied at ingest time, so you don't have that overhead. Explain that. That's a really important concept. Yeah, so from Splunk's perspective, you ingest all the data. When it comes in, no metadata or anything is applied to it. It's literally written as it streams in, which lets you get the data in faster from multiple sources and do a lot of your filtering, if you need to, at that level. But then when you actually go into the UI and do a search, that's when the intelligence is applied to the data. It's extracted at that point, and you actually apply a schema at search time. So if it's a Tomcat log and you have the app loaded for Tomcat, it's immediately going to know what every field in the Tomcat log means and how to interpret it. So when you get the visualization back in the UI, it's going to put them in names that a user can actually understand, rather than some generic Tomcat name like field1234=6, where I have no idea what that means. I actually will know what that value is. And that's true of a lot of the apps that live on Splunkbase. Interesting. Can we dive a little bit more into what you talked about with visualization, making the tools easier for somebody who doesn't really have the technical chops? Over the last year or two we're seeing companies like Tableau getting very popular, and QlikTech to some degree. 
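The schema-at-search-time idea described above can be sketched in a few lines. This is a toy illustration of the concept, not Splunk's actual implementation; the log format, regex, and field names are all assumptions for the example.

```python
import re

# Raw events are ingested as-is: no parsing, no metadata applied at write time.
raw_index = [
    '10.0.0.1 - - [28/Aug/2012:10:00:01] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [28/Aug/2012:10:00:03] "POST /checkout HTTP/1.1" 500 99',
]

# The "schema" is just an extraction rule, applied only when a search runs.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<bytes>\d+)'
)

def search(index, **filters):
    """Apply the schema at search time, then filter on the extracted fields."""
    for event in index:
        m = ACCESS_LOG.match(event)
        if not m:
            continue
        fields = m.groupdict()
        if all(fields.get(k) == v for k, v in filters.items()):
            yield fields

errors = list(search(raw_index, status="500"))
```

The point of the design is visible here: ingest stays a cheap append, and changing the schema later means changing only the extraction rule, never re-ingesting the data.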
So what are you guys doing in that respect, and are you trying to compete in some way with those kinds of tools? Or what are you doing in terms of visualizations, and what's your approach to making all this data that comes from Cassandra and other sources through Splunk actually usable by someone who's not a trained expert? Yeah, so the whole point of Splunk from the beginning was to make it very easy for anyone to use. Originally it was only available to work on the Splunk data store, the rapid indexing, because that was the easiest thing to solve when it's your own technology. Since then there have been a number of others: there's a MySQL app that's out there, there was a first-generation Hadoop app that was released last year, and the new generation is coming out at our user conference this year in a couple of weeks, as well as some early work on the Cassandra integration and generic NoSQL integration longer term. But Cassandra is the first focus point, obviously, because I care about that community a lot and have been involved in it for a while. So with those pieces, you're able to use those storage engines on the back end and visualize directly in Splunk with all the same query language, all the computation, all the nice graphs and different charts and tables that you have available out there. That whole approach is making it much, much easier for people, whether they store their stuff in HDFS for long-term storage and they're okay with waiting minutes, hours, days, however long it takes to run that query and return it in the UI on the back end, so we can populate a dashboard that someone logs into every day. They've got their Hadoop job that runs in the back and populates their month-over-month or year-over-year or whatever it happens to be from a summary data perspective. 
Or whether it's knowing what happened in the last few minutes and you're pulling it directly from Splunk, or whether it's knowing what's happened in, let's say, the last 45 to 90 days and we're pulling that out of Cassandra or whatever other data store is out there. The whole point is to be able to visualize that together and enrich the data that's coming directly from the servers into Splunk, enrich it with data from the transactional backing stores. That's what I did at the previous company with Cassandra and DSE: we were storing actual steps in the transaction process, which are very large XML payloads, which is not ideal for Splunk. So store it somewhere else, but then be able to marry that back. I have an error on a web server in Splunk. I want to marry that with data on the transaction, what the user actually inputted on the front end that caused that error to happen, and then I can marry that to the system data for the back-end application stream. Draw correlations and really drill down to what caused the problem. Right, it gets you to the RCA process very, very quickly, and the more data sources you can add, obviously, the better that gets. Splunk is very good at specific data types, you know, textual, temporal data, not large payloads in a single key-value store. You want the normal line-oriented log type rather than the big payloads. So you need something else for that, and that's where the NoSQL play comes in, as well as HDFS and all the other stuff. Yeah, I mean, your point is well taken. I think that when we talk about big data and analytics, it's all about bringing in multiple data sources to enrich maybe some of the analytics you've already been doing. I think if you're not bringing in external data sources like that, you're not really doing big data analytics, in my opinion. So I mean, that's the key. 
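The "marry the web-server error to the transaction and the system data" flow described above is essentially a join across three stores on shared keys. Here is a minimal sketch of that correlation; every field name, value, and record in it is invented for illustration, not taken from the actual systems.

```python
# One error event, as it might come out of the log store.
web_errors = [
    {"txn_id": "a17", "time": "12:01:07", "status": 500, "host": "web-03"},
]
# What would live in the transactional store: what the user actually submitted.
transactions = {
    "a17": {"user_input": {"qty": -2, "sku": "XK42"}, "step": "checkout"},
}
# Back-end system metrics for the same window, keyed by host.
system_data = {
    "web-03": {"heap_pct": 97, "gc_pauses": 14},
}

def correlate(errors, txns, sysdata):
    """Join the three sources on txn_id and host to drive root-cause analysis."""
    for err in errors:
        yield {**err,
               **txns.get(err["txn_id"], {}),
               **sysdata.get(err["host"], {})}

report = list(correlate(web_errors, transactions, system_data))
```

One merged record per error is what lets an operator see in a single row that a negative quantity was submitted and the host was near heap exhaustion, which is the fast path to RCA the conversation describes.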
To the extent that you can, talk a little bit about what you were doing at your previous job and how that came about. I know Cassandra is a fairly new approach, and I'm interested to hear how you came to work with Cassandra and how you got involved in the community. Sure, yeah. So like I mentioned, back at 0.6, I ran an architecture team, reporting to the CTO at the end of my previous position. What we were doing was evaluating various data stores to replace our legacy RDBMS stack, which was huge and costly and had to live on SAN that was very expensive, getting rid of a $60 million a year sinkhole in technology that really wasn't serving the right purpose anyway. So the point of that is: let's get to the right technology for the right solution. My team was involved in evaluating technologies, both open source as well as commercial products, figuring out what was the right solution from a performance vector as well as a cost vector for each of those scenarios. Cassandra came into play originally, 0.5 in the lab and 0.6 in production, for a pricing index solution: being able to build a quick index of recent pricing so that you could keep track of what the change rate was for specific pricing for specific items. That was step one, and it was all about building a package deals engine, so you could package multiple components in the shopping cart and price it appropriately. And that's painful, because it touches a lot of different systems. But if you can get that in an index where it's very rapid, you can return at least your best guess early, and then when a customer drills down and makes some selections it may price a little differently, but at least you're close based on the most recent data. So that was the first use, and there really was no other solution that did that well. 
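The "best guess early" pricing index described above can be sketched as a small in-memory structure. This is a toy stand-in for the Cassandra-backed index, with item names and history depth made up for the example.

```python
from collections import deque

class RecentPriceIndex:
    """Keep the last few observed prices per item and answer instantly with
    the most recent one, instead of re-pricing through every back-end system."""

    def __init__(self, history=5):
        self.prices = {}  # item_id -> deque of recent prices
        self.history = history

    def observe(self, item_id, price):
        self.prices.setdefault(item_id, deque(maxlen=self.history)).append(price)

    def best_guess(self, item_id):
        seen = self.prices.get(item_id)
        return seen[-1] if seen else None

    def package_quote(self, item_ids):
        """Quick estimate for a bundle; exact pricing happens later in the cart."""
        guesses = [self.best_guess(i) for i in item_ids]
        return None if None in guesses else sum(guesses)

idx = RecentPriceIndex()
idx.observe("flight-BOS-SFO", 420.0)
idx.observe("hotel-SFO-3n", 510.0)
quote = idx.package_quote(["flight-BOS-SFO", "hotel-SFO-3n"])
```

The design choice matches the use case in the conversation: a fast, possibly slightly stale answer up front, with the authoritative price computed only after the customer drills down.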
On top of that, they actually did a Solr/Lucene implementation on the front end and then an app server in the middle that handled getting the data in. Then when we moved that solution to DSE, of course, it's all in one now. The app's there, the store's there, the Solr's there. We got rid of a very complicated stack; instead of having to have someone who knew Scala, which is not easy to find by the way, to get this working, we went to something that just works. So that was the next-generation evolution when we went to DSE. Another really big use case for it was holding search payload results. Like I said, those are very large data sets. If you search for a particular market with everything in that market, that could be gigabytes of data returned in XML. You need a place to put that, and you want to break it out. So Cassandra was the natural way to parse that XML out and put the key values in for each GUID, for the inventory item and the pricing structure, which was not always just one price but multiple pricing structures. Identifying those was much, much easier to do in Cassandra. Interesting. So Ed, tell me a little bit about your impressions of the community itself. We talked to Jonathan a little bit about this, and I think the community is very important when it comes to an open source technology like this as it starts to cross the chasm, so to speak, and move into the larger enterprise. The more risk-averse IT departments want regular updates, they want a solid, reliable community, companies like DataStax supporting it. So what's your take on the community? How has that evolved since you've gotten involved? Well, I will say I felt really on my own early on with a lot of the stuff that was going on. Nobody was really sure what was going to happen, and we were adopting it knowing that it might become something we had to support ourselves. You know, a year later, it's amazing what a difference it's made. 
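The earlier point about parsing the XML payload out into key values per GUID can be sketched as follows. The XML shape here is invented for illustration; the real payloads and column names would differ.

```python
import xml.etree.ElementTree as ET

payload = """
<results market="SFO">
  <item guid="g-001">
    <price type="base">199.00</price>
    <price type="weekend">249.00</price>
  </item>
  <item guid="g-002">
    <price type="base">89.00</price>
  </item>
</results>
"""

def to_rows(xml_text):
    """Flatten the payload into (guid, column, value) rows, the kind of
    wide-row layout that maps naturally onto one Cassandra partition per GUID."""
    rows = []
    for item in ET.fromstring(xml_text).findall("item"):
        guid = item.get("guid")
        for price in item.findall("price"):
            rows.append((guid, f"price:{price.get('type')}", price.text))
    return rows

rows = to_rows(payload)
```

Breaking the gigabyte XML blob into per-GUID rows like this is what makes the individual inventory items and their multiple pricing structures addressable without re-parsing the whole payload.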
I mean, the number of people that are getting involved. I've been on any number of forums talking to people; we've shared code snippets back and forth on exactly what we've done and interesting ways that we've approached the problems. And there were a couple of things we were going down the path of prototyping that were just never going to work, but we of course didn't know that. Then I was able to communicate with some other people in the community, Netflix being one of them, who were able to quickly say, oh, no, no, you don't want to do that. Here's why that doesn't work. Here's what we found; you might want to look at this. And then I was able to share some data back with them on what we were doing with Solr and the DSE side. So it's very nice to have that community of people who can communicate and figure out what's going on and help set the direction for what we're doing, and obviously, being named an MVP today was really important to me in helping to continue that. I just tweeted that you're on theCUBE, but I also noticed that you tweeted and said, honored to be named an Apache Cassandra project MVP at the summit this morning. Right. So congratulations. Thank you very much. What does that mean? Tell us what that means for you personally, but also from a functional role that you're working on. So personally, I mean, I've been involved in this for a long time. A long time in this space, anyway. It's important to be recognized for that, and to be nominated by peers and get this kind of role. As for what the role entails, obviously I think that's going to continue to evolve, but at least I'm involved with this group of people who were named this morning. How many MVPs are there? I don't know how many there were on the list. 15? 15? So it's not 100. You're not just a number. Yeah, I'm not one of the 850 or whatever attendees here. It's a real honor. Yeah, it absolutely is. 
And to be someone who's tested some pretty extreme use cases for Cassandra and then be able to do this and help set future direction for both Cassandra and the team. Well, Jonathan was being polite when I asked him what the personality of the community was. You know, the Cassandra community is hard-charging alpha geeks. It's pretty well known in the back channels. You've got to have some chops to play in the Cassandra community. Not known for their marketing. I mean, obviously, Hadoop is out-marketing Cassandra, and others are getting a little bit more marketing. But I think that's going to change given what DataStax is doing. But I want to ask you something that Jonathan mentioned. I asked, you know, where does Cassandra fit best? And one of the things he did say, among other things, was multiple data centers. So you lived through, and you mentioned, the world back in the day: huge disk arrays, spinning disks. We've seen a shift, a movement we've documented on SiliconANGLE and Wikibon: converged infrastructure is actually happening across storage, networking and servers. With Moore's Law, servers have always been getting faster. Networking, you know, Nicira recently got bought by VMware for software-defined networking. Storage has been the last area where spinning disk is still kind of hanging around. But with SSD, Jonathan's quote was, it's the closest thing to a silver bullet you're going to find. And we've seen that really change the converged infrastructure space. So given that you've lived in those old days of, you know, the latency and the huge server farms and disk farms, with SSD it's pretty game-changing economically. How is that changing the cloud, mobile, social data center? And how does that change things like databases, where the future is obviously dashboard-driven? You're going to see real-time management of data as table stakes. 
It's pretty obvious. You know, people fall out of their chairs; we talked to SAP, and they're running stuff with SSDs now that went from five minutes to five seconds. So that's near real time; five seconds is acceptable for, you know, a major batch job. So it's a game changer, no doubt. How is that changing implementations and deployments of Cassandra and other environments where data-centric information is part of the business? Well yeah, as a customer using DSE, it was an interesting use case when you go ask for money, right? You've now prototyped it, you want to implement it in production, and you've got to build a business case for it. And you say, well, here's our business case. We were going to buy two new SAN arrays next year. Instead we're going to buy one. And a lot of the data warehouse load that's on the SAN array we're going to move to HDFS, and a lot of the active database loads that have to be on there because of clustering (you have no other way for clustering to work in SQL but a SAN, really), we don't need those. We're going to put that stuff all on the 200 nodes of DSE that are going to be deployed in production. So let's just not buy a SAN. And it got a very interesting reaction. They're like, wait a minute, can we use that money somewhere else? No, no, no, you can't. Well, it's very interesting, because you basically take a two-and-a-half-million-dollar investment in another EMC VMAX, right? And you say, I don't need that. What I need is a million bucks to go build out all these DAS boxes with distributed file systems that don't need any of this. And yes, we're going to need SSDs in some portion of that, because we have some data that needs to be live in memory all the time. But that's a small portion of what we're buying. So it's not like we're buying a bunch of expensive disks. From the HDFS perspective, everything is sitting on one- or two-terabyte cheap drives. 
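The business case sketched above is simple arithmetic. Here it is back-of-the-envelope style; the SAN figure and the million-dollar DAS build-out come from the conversation, while the per-node cost and SSD budget are assumptions chosen only to make the totals line up.

```python
san_array_cost = 2_500_000   # one more EMC VMAX-class SAN array (from the talk)
das_nodes = 200              # DSE cluster size mentioned in the talk
das_cost_per_node = 4_000    # assumed: commodity chassis plus cheap 1-2 TB drives
ssd_budget = 200_000         # assumed: SSD for the small always-hot portion

das_buildout = das_nodes * das_cost_per_node + ssd_budget
savings = san_array_cost - das_buildout
print(f"DAS build-out: ${das_buildout:,}  savings vs SAN: ${savings:,}")
```

Even with generous assumptions, the distributed DAS approach lands around the million-dollar figure quoted, which is why the "let's just not buy a SAN" pitch got the reaction it did.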
It costs almost nothing. So obviously this is disruptive. Absolutely. So basically I'll translate. Folks, that means it's disruptive. So again, that validates some of the things we've been seeing around the economics and performance advantages of SSD. Now you have a whole new palette, if you will, to draw from as a developer to solve problems. You can now do more things. So the EMCs are threatened; they bought a company recently around SSDs. And you see Fusion-io, and Violin Memory is right down the street. You've got Violin here, killing it. I mean, they're basically crushing it. But what that means is new raw materials for developers. What does that change for the developer? Like, did you just essentially make that up? I mean, how did you get to that conclusion? You said, hey, common sense. I mean, it's something that wasn't available before. Well, it's very interesting that even developers have the ability to build and prototype so quickly now. Literally, when you're doing real-time builds, from a DevOps perspective, the teams that I worked with a lot on these solutions are able to do real-time builds four or five times a day, because they have Fusion-io cards in the Ant build servers and they have SSDs sitting in the Perforce servers to serve up the content very quickly and then do the builds live. You can do four builds a day when it used to take an entire day just to do a single build. So for velocity of release, for velocity of development, the difference it's making is huge. Yeah, I mean, what you just said about that use case with the SAN is totally what we're seeing. Spinning disk is going to be gone. We had Scott Dietzen, the CEO of Pure Storage, on. He believes that spinning disk will be a thing of the past. I think it's going to be more backup. I don't think it's going to go away. 
I think it's just going to be more of a backup tier. But that's why I think Pat Gelsinger going to VMware is a big deal. Yeah, I agree. EMC is really kicking ass right now, and I think they know where the future's going. They know. Pat Gelsinger was on theCUBE and said, if you don't get out in front of the next wave, you're driftwood. And I think what you just talked about is the next wave. So with that, my final question to you is DevOps. We have a new site called devopsangle.com, so I'm going to pimp that out right there. It doesn't get a lot of traffic because it's a small community, but DevOps is really about the future of development. Absolutely. Where dev and operations are intertwined. You know, whether it's ops-dev or dev-ops, it doesn't matter; they're merging together. What's your view on DevOps, and what's your forecast for the future, the relevance of that kind of global description? It's really funny, because I think DevOps came to pass because of SOX. Developers wanted access to production. They had to be part of ops in order to get that, which was a funny way for it to start, but now it's become key, because you really need the developers to understand how their stuff's running in production, especially with these new solutions that rely so heavily on code being deployed to even work. They have to be intimately involved in that. And I think it's going to do nothing but get more prevalent, and more and more companies are going to have to move in that direction just to keep up with the velocity of release you have to have and the expectations of a customer to get a six-millisecond response. They're not okay with a second. They need milliseconds, under 10 milliseconds. You can't get that without having your developers intimately involved in operations and understanding how customers are using their code. We're here with Eddie Satterley from Splunk, a seasoned veteran with a lot of knowledge, a lot of experience. 
And again, a lot of that experience comes with a lot of scar tissue from his years in the field. Cutting-edge work with DevOps at Splunk, congratulations. And with Cassandra, nominated to be an MVP, congratulations. I suggest the younger folks out there, the younger developers, follow Eddie on Quora; he's heavily active. Not even knowing he was going to be on theCUBE today, I favorited one of his Quora posts that I thought was pretty epic. He's very active on Quora. Congratulations, you're a great resource, and thanks for coming on theCUBE. Appreciate it. Thank you very much. We'll be right back with our next guest after this break. SiliconANGLE.com, this is theCUBE. We go out to the events and extract the signal from the noise, and we'll be right back.