That's where Node Summit was. We just talked about Node.js for a game on it. This is SiliconANGLE.com's production of theCUBE, live in Silicon Valley at the Cassandra Summit, where all the action is happening. End of the day, we're on our 15th minute interview and we're here with Matt Stump, the CTO of SourceNinja. Matt, welcome to theCUBE. Jeff Kelly, co-host, welcome. Give the folks a quick update on SourceNinja. I was going to do a good update about the Meta4 to code.
So basically what we do is we find the security exploits and performance enhancements that are available for the applications that you depend on. So if you're building an application on top of Ruby or Node, you integrate us. We tell you which packages are out of date, which contain performance enhancements, and sort of help you focus on what you need to update, when, and why.
So obviously, big data. We've had a lot of great conversations today. We talked with a lot of alpha geeks, a lot of principals of DataStax, Netflix, a lot of people in the community, and there's a great vibe. The Cassandra community's pumping. I'm saying this in my words. It's kind of a chip on the shoulder, like, hey, we're not going away. We're the real deal. We've taken some lumps on some documentation and some stuff, and we're fixing all that or have fixed it, but the community's strong. There are a lot of use cases out there and they're demonstrating the proof points. So kudos to DataStax, kudos to the community. But the action really has come down to here. Our report is that solid state has created massive innovation around infrastructure that affects databases, and big data creates new things that could never be seen before: online dashboards, and real time in particular. So I'd like to talk to you about the real-time aspect. We've seen the need for querying, whether it's for data stores or for dashboards. So things like Apache Solr, which you have some knowledge of. And we've heard people talk about Storm. So let's talk about real time.
What's happening with real time and how does that relate to the Cassandras of the world, the HBases, the MongoDBs of the world? Obviously, near real time, real time, how do you want to define it?
To me, real time is near real time. Seconds, that's fine. Minutes, if it's a huge complex data warehouse thing, minutes is real time. If it's a five-week, five-day process done in minutes, I'm okay with that. But to me, it doesn't have to be milliseconds. Near real time. So we use Storm internally. And the way that we use it is that we're consuming data from all these different repositories. So take, for example, Node.js. We monitor every single Node package. And we're looking at every single commit that comes into the system, every single change log, every single release. And that doesn't really fit into the Hadoop model very well because we're getting these constant streams of data. And so what we do is this data goes into queues, and then it's pulled off by the Storm workers, and then it's fed directly into the database, in our case Cassandra. And Cassandra is actually fast enough that it can serve these really wide rows, these really big data sets, fast enough for web requests. So I can run everything off of just my Cassandra cluster and not have to worry about a memcache layer or all these other NoSQL solutions that you would have to worry about with some of these other providers. And so that was one of the things that led us to choose Cassandra. So we've had a really good experience with Cassandra and with Storm. And we're actually one of the contributors to Storm. So I work on some of the tabs and contribute back patches.
So we're not that heavily involved with Storm. And for the folks out there who aren't totally familiar with it, just take us through what Storm is. Storm is a very, very cool project and we're aware of it. We're actually investigating it heavily. We're not in the weeds like you guys are.
But for the folks who might not have heard of it, Storm is related to Twitter, etc. Go share that.
So Storm started out as a project from a small startup called BackType. They were later acquired by Twitter, and so it powers the analytics dashboard for Twitter. If you take a look at all your big data solutions, most of them are batch-oriented processing. So I've got a bunch of data that comes in. It gets dumped to, like, an S3 bucket or HDFS, and I run some big analytics job. That analytics job could take hours, days maybe. Storm is really for streams of data. So one example is the Twitter feed. I've got however many messages coming in per day, let's say every single tweet. And I want to collect impressions on an ad or a link or something like that. There was no good way to do that. People were stitching together solutions using background worker processes and stuff like that. And so it was really hard to track data as it flows through all these different stages of my pipeline. With Storm you write a job using these very simple abstractions, and then you define a topology in a JSON markup. Then you submit that to the cluster, and it allocates resources for you, automatically distributes code across the cluster, and then automatically starts pulling in data from your queues, distributes it seamlessly across the cluster to all your different worker processes, and at the end synchronizes that to whatever data store you choose, which in our instance happens to be Cassandra.
So I got that, but I want to translate it for the users who might not have. So, as a way to compare and point to more use cases: streaming data, and there are other use cases, like mobile data. So anything coming in fast, not necessarily the old ways of handling data, where some data pops in, gets dumped on the table, and you've got to ingest it. A stream of data is a continuous stream, and the Twitter firehose, as you've experienced, is just a really good example, right?
Exactly.
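The streaming model described above (events arrive on a queue, workers process them one at a time, and running results land in a data store) can be sketched in plain Python. This is only an illustration of the idea, not Storm's API: the names `spout`, `extract_links`, and `run_pipeline` are hypothetical, a `Counter` stands in for the data store, and the sample tweets are made up.

```python
from collections import Counter
from queue import Queue


def spout(queue):
    """Yield events from the queue until a None sentinel arrives."""
    while True:
        event = queue.get()
        if event is None:
            return
        yield event


def extract_links(tweet):
    """Bolt-like step: pull link tokens out of one tweet's text."""
    return [word for word in tweet.split() if word.startswith("http")]


def run_pipeline(queue, store):
    """Consume events as they arrive and update per-link counts in the store."""
    for tweet in spout(queue):
        for link in extract_links(tweet):
            store[link] += 1
    return store


# Made-up sample feed; a real feed would never end.
feed = Queue()
for text in [
    "see http://a.example now",
    "http://a.example again",
    "try http://b.example",
]:
    feed.put(text)
feed.put(None)  # sentinel so the sketch terminates

impressions = run_pipeline(feed, Counter())
print(impressions.most_common())
```

The point of the sketch is that counts are updated per event as data arrives, rather than by a batch job run over a dump after the fact.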
So would you agree, and can you point to other use cases for the folks out there, obviously mobile, machine data?
It could be sensor data, it could be check-ins, like, say, Foursquare. A lot of people are using it for distributed RPC. So let's say I have a call that comes into the system, but in order to run, let's say, a pricing model, it would require resources larger than a single machine can handle. And so what it'll do is it'll farm that request out to all these different machines and then return the result to the caller. So it's a way to actually speed up my RPC request as well.
Okay, so there we got it. So the importance of that with Storm, in my opinion, and what you just agreed on, was that streaming of data is relevant. It's going to happen more and more, not less and less. I would argue. Okay, that being said, that kind of changes things a little bit right now. Now, okay, so the traditional search, let's get to Solr now. Talk about some search action. So search is different now, right? So search is about metadata, right? In the old days, you know, robots.txt would crawl your web page, bring it back to an index, run a string query against it. That's the search result. Yep. À la 1997. But now you have the ability to take data in on the ingest. So manipulating data right at inbound is a new concept. Yep. So can you talk a little about how that's affecting some of the searching? Because in order to get real-time low latency, you've got to do stuff like this. Yeah. What are your thoughts on this whole paradigm? And how is it affecting some of the projects?
So one of the really big issues that you see in a lot of projects is that some of the data is in your Solr cluster. So all my search data is there. And then I've got my big batch data in my Hadoop cluster. I've got some of my data in the SQL cluster. And you have to write a lot of code to keep all of these different data stores in sync. And you've got streaming in. It's really complicated.
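The distributed RPC pattern mentioned above (split one request across many workers, then combine the partial answers for the caller) can be sketched as a scatter-gather in Python. This is a minimal single-machine illustration using threads, not Storm's DRPC interface: `price_chunk` and `distributed_price` are hypothetical names, and the "pricing model" is just a sum of quantity times unit price.

```python
from concurrent.futures import ThreadPoolExecutor


def price_chunk(chunk):
    """Hypothetical per-worker job: price one slice of the positions."""
    return sum(quantity * unit_price for quantity, unit_price in chunk)


def distributed_price(positions, workers=4):
    """Split the request, price the slices in parallel, and combine the results."""
    # Scatter: deal the positions round-robin into one slice per worker.
    chunks = [positions[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(price_chunk, chunks))
    # Gather: the caller only ever sees the combined answer.
    return sum(partials)


positions = [(10, 2.5), (4, 10.0), (1, 100.0), (20, 0.5)]
total = distributed_price(positions)
print(total)
```

In a real cluster the slices would go to separate machines rather than threads; the shape of the call (one request in, one combined result out) is the part this sketch is meant to show.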
Oh, and that's not including the data mining that you need to do on a corpus of data you don't know anything about. Yeah. That hurts your near-real-time capabilities.
Yeah. Exactly. You know, you may have to run a Hadoop job in order to do it. And that could be, you know, whatever. So the nice thing is that DataStax integrated Solr into Cassandra. You've seen some of this in the HBase community. But what happens is, as soon as my data is written to Cassandra, it's also indexed by Solr. And then the Solr data is stored alongside the Cassandra data on the same data store inside Cassandra. So I get all the linear scalability of Cassandra with my Solr data as well. And so my Cassandra nodes can also act as Solr nodes. And so I don't have to worry about writing synchronization code between the different data stores, because they are the same data store. Yeah. And then all of my data is indexed in near real time.
HBase doesn't have this. HBase doesn't have it. Yeah. They're working hard on it. I know that. Yeah. Doug Cutting, they had a meeting two months ago in Boston. That's their top agenda item. Yeah. But Cassandra's got it right now. So. There it is. There's an advantage of Cassandra right there. Okay. So let's get back to the importance of code, because obviously you're dealing with some cutting-edge stuff. It happens to be a cool implementation because it's cutting-edge technology in an area that's also around developers, whom we care about. So what is some of the trend data that you see in the developer community? We were talking earlier with Mark about, was it Matt? He's co-founder of DataStax. Developers now are becoming pretty versatile in their programming capabilities. The ones that I would call the elite tech athletes, tech developers, can code anything, but they pick the tool for the job. So we're getting into the mode of pick the tool for the job. Hey, I'm a Rails guy. We're doing it this way. Or hey, I want to use Python for the data.
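The write-time indexing idea from the Solr discussion above (a record becomes searchable the moment it is written, with no separate sync job between a data store and a search cluster) can be sketched with a toy in-memory store. This is an assumption-laden illustration, not how Solr or Cassandra are implemented: `IndexedStore` is a hypothetical class, the index is a simple whitespace-tokenized inverted index, and the sample rows are made up.

```python
from collections import defaultdict


class IndexedStore:
    """Toy store that updates its search index as part of every write."""

    def __init__(self):
        self.rows = {}                 # key -> stored text
        self.index = defaultdict(set)  # term -> keys containing that term

    def write(self, key, text):
        """Store the row and index it in the same step."""
        old = self.rows.get(key)
        if old is not None:
            # Drop stale index entries when a row is overwritten.
            for term in old.lower().split():
                self.index[term].discard(key)
        self.rows[key] = text
        for term in text.lower().split():
            self.index[term].add(key)

    def search(self, term):
        """Return the keys whose text contains the term, in sorted order."""
        return sorted(self.index.get(term.lower(), set()))


store = IndexedStore()
store.write("pkg:1", "Fixes security exploit in parser")
store.write("pkg:2", "Performance enhancement for parser")
print(store.search("parser"))
print(store.search("security"))
```

Because the row and its index entries live in one store and change together, there is no synchronization code to write between them, which is the advantage described in the conversation.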
So you know what I'm saying, right? So what trends are you seeing in the developer community that you can share with the folks out there? And I know generalizing is sometimes difficult. But put things in buckets. Node's good at IO. Python's good at this. What are you seeing from the data that you could share?
So Node, I'm seeing a lot of uptake from anybody that needs to do streaming sockets. So like game companies, really, you know, or if you need really fast response times, like serving ads, things like that. Python, we're still seeing a lot of interest in the scientific computing community. Just, you know, SciPy, NumPy, those still have a lot of interest. For the really big JavaScript applications, you're starting to see some uptake of ClojureScript. And that's new. I mean, that didn't exist a year ago. And so that's just recently come on the scene.
What's driving that?
It's really hard to write large applications in JavaScript or CoffeeScript. And ClojureScript takes a very robust language that handles namespacing, libraries, and reusability very, very well. And it has a lot of very nice libraries that make doing a lot of tasks very easy, and it gives you that environment in JavaScript. And you can take the same code and you can, you know, run it on the front end. You can run it on the back end. You can use the same.
Kind of how Node.js kind of came about.
Exactly.
Very efficient in a use case that's cool.
And it's even more so because, for us, we're using Clojure on the front end, in Node.js on the middleware. And we're also using it inside of our Hadoop jobs. So we get to share that same logic across all of them. And it actually meant that we got to write a lot less code.
Yeah, very efficient. What about analytics software? What packages are you seeing? Because analytics is a big problem right now. You've got obviously the big hype on visualization software.
But one of the problems with getting data in, assuming you can take care of stuff at the ingest point, is pulling it out fast, because it's large files. So let's take big data. I want to do the training analysis on some big data. You've got to pull stuff out. I mean, you've got data scientists out there that have older packages. What kind of coding are you seeing? Is there a language that's really good at programming for extraction?
So I think most of the community is still using Hive, because it's a paradigm that's familiar. There are a lot of books written for it. So I would say most of the data and analytics community is still using that or transferring to it. There are some people that are using Cascading, things like that. But for the most part, I'd say the bulk of the traffic is still around Hive.
Great. Well, we're getting close to time here, Matt. Thanks for coming on. I still want to spend some more time, so I'll ask you a few more questions, because you're a great guest with great data to share. Share with the folks out there just what you're seeing within your community, your company, and some of the problems that you guys are solving. And just do a quick plug for your company. So I know you're probably hiring. Everyone's hiring. You're doing this kind of work. So I want to talk about some of your momentum and talk about what you're looking for in new folks to join the team.
Really, curiosity. That's the biggest thing. And unfortunately, that's one of the things that you can't teach. But it's people that are always asking, like, how can I go farther with this data? What new things can I learn? What things are my customers interested in? What interesting ways can I present that data? And if you can find somebody that has that internal drive, that wants to go farther, then that's golden. And so really we're looking for people like that.
Have you found any disciplines that kind of pop out?
And we're seeing, I mean, obviously with CS and math, you're seeing music kind of correlate to that. I mean, I can't tell you how many guys I bump into who have a degree in music and literature that are coders. Some damn good ones, you know? So it's kind of interesting, right? And then the other one, you've got math guys who can code. Then you've got pure computer science guys. So is there anything that pops out on the trend lines? You say, wow, we seem to see a lot of these guys love big data. They're not the consummate CS dudes. Or psychology?
Psychology is a good one, because the big problems aren't getting data into my system or just being able to run an algorithm. It's being able to present the information in a way that it can be understood. It's to be digested. And that's not a standard tool set. It's not something that you can teach in school. It's got elements of design to it. So how are my users thinking? How do they want data presented? How can we pull the data out and make it meaningful to them so that they can make good decisions?
Yeah, it's perfect. I love this marketplace right now. It's a combination of art and science. And where I grew up, it was simply just discipline: code, code, code. Lines of code were your measurement. Now it's the creation and it's the teamwork. So Matt, thank you for coming on theCUBE. We'll be right back with our next guest. Go check out SourceNinja.com. You guys are doing great work here at Cassandra. This is early on. We're going to look back at these times and say, wow, I remember 2012, how small the community was then, even though it's a packed house right now. So thanks for coming on. We'll be right back with our next guest right after the short break, and we'll be ending the day very shortly.