Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017.

Hey, welcome back everybody. Jeff here with theCUBE. We're at Big Data SV in San Jose at the Historic Pagoda Lounge, part of Big Data Week, which is associated with Strata Hadoop. We've been coming here for eight years and we're excited to be back. The innovation and dynamism of big data, and its evolution now with machine learning and artificial intelligence, just continues to roll, and we're really excited to be here talking about one of the nasty aspects of this world, unfortunately: malware. So we're excited to have Darren Chinen. He's the Senior Director of Data Science and Engineering at Malwarebytes. Darren, welcome.

Thank you.

So for folks that aren't familiar with the company, give us just a little bit of background on Malwarebytes.

So Malwarebytes is basically a kind of next-generation antivirus software. We started off with humble roots: our founder, at 14 years old, got infected with a piece of malware. He reached out to the community and, with the help of some people, wrote his first lines of code to remediate a couple of pieces of malware. It grew from there, and by the ripe old age of 18 he had founded the company. He's now, I want to say, 26 or 27, and we're doing quite well.

It was interesting, before we went live you were talking about his philosophy and how important that is to the company. Now it's turned into a real strategic asset: no one should have to suffer from malware, and he decided to offer a solution for free to help people rid themselves of this bad software.

That's right.
Yeah, so Malwarebytes was founded on the principle, Marcin believes, that everyone has the right to a malware-free existence, so we've always offered a free version of Malwarebytes that will help you remediate if your machine does get infected with a piece of malware. That's actually still going to this day.

And that's now given you the ability to have a significant amount of endpoint data, transactional data, trend data that you can bake back into the solution.

That's right. It's turned into a strategic advantage for the company. It's not something I think we could have planned at 18 years old when he was doing this, but we've instrumented it so that we get some anonymous-level telemetry and we can understand how malware proliferates. For many, many years we've been positioned as a second-opinion scanner, so we're able to see a lot of trends happening out there, and we can actually now see that in real time.

So starting out as a second-opinion scanner, you're basically finding what others have missed. What do you have to do to become the first line of defense?

Well, you know, with the new product, Malwarebytes 3.0, I think some of that landscape is changing. We have a very complete and layered offering. I'm not the product manager, so as the data science guy I don't know that I'm qualified to give you the ins and outs, but I think some of that is changing as we've combined a lot of products, and we have a much more complete suite of layered protection built into the product.

And so maybe tell us, without giving away all the secret sauce, what sort of platform technologies did you use that enabled you to scale to these hundreds of millions of endpoints, and then to be fast enough at identifying things that were trending, that were bad, that you had to prioritize?
Right, so traditionally, I think AV companies kind of have these honeypots, right? They go and collect a piece of a virus or a piece of malware, take the MD5 hash of it, and then basically insert that into a definitions database. That's a very exact way to do it. The problem is that there's so much malware and there are so many viruses out there in the wild, it's impossible to get all of them. One of the things we did was set up telemetry, and we have a phenomenal research team that's able to catch entire families of malware. That's really the secret sauce of Malwarebytes. There are several other levels, but that's where we're helping out in the immediate term.

What we do internally, we sort of jokingly call a Lambda 2 architecture. We had considered Lambda long ago, and by long ago I mean about a year ago, when we first started this journey. But Lambda is riddled with a number of issues, as you know, right? If you've ever talked to Jay Kreps from Confluent, he has a lot of opinions on that. One of the key problems is that if you do a traditional Lambda, you have to implement your code in two places. It's very difficult, things get out of sync, you have to have replay frameworks. Those are some of the challenges with Lambda.

So we do processing in a number of areas. The first thing we did was implement Kafka to handle all of the streaming data. We use Kafka Streams to do inline stateless transformations, and we also use Kafka Connect. We write all of our data into HBase, which we may swap out later for something like Redis, and that serves as a thin speed layer. Then we also move the data into S3, and we use some ephemeral clusters to do very large-scale batch processing. And that really provides our data lake.
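The traditional definitions-database approach Darren describes can be sketched in a few lines. This is an illustration, not Malwarebytes' code: hash a file's bytes and look the digest up in a set of known-bad signatures. The sample digest is the widely published MD5 of the harmless EICAR test file; a real vendor would ship millions of entries.

```python
import hashlib

# Toy "definitions database" of known-bad MD5 digests.
# The entry below is the standard EICAR antivirus test file's digest.
KNOWN_BAD_MD5 = {
    "44d88612fea8a8f36de82e1278abb02f",
}

def md5_of_file(path):
    """Hash a file in chunks so large binaries aren't loaded fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_known_malware(path):
    """Exact-match lookup: precise, but blind to any sample not yet catalogued."""
    return md5_of_file(path) in KNOWN_BAD_MD5
```

The exactness is also the weakness he points out: flipping a single byte of a sample changes the hash, which is why catching whole families via telemetry matters more than cataloguing individual binaries.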
And when you call it Lambda 2, is that because you're still working essentially on two different infrastructures? So your code isn't quite the same, and you still have to check the results on either fork.

That's right, yeah. We did evaluate doing everything in the stream, but there are certain operations that are difficult to do with purely stream processing. So we did need a thin, what we call real-time indicators speed layer to supplement what we were doing in the stream. That's the differentiating factor from a traditional Lambda architecture, where you'd have everything in the stream and everything in batch, and the batch is really a truing mechanism, whereas our real-time is really directional. In the traditional sense, if you look at traditional business intelligence, you'd have KPIs that allow you to gauge the health of your business. We have RTIs, real-time indicators, that allow us to gauge directionally what is important to look at this day, this hour, this minute.

If this thing is burning up the charts, therefore it's priority one.

That's right, you got it.

Okay, and maybe tell us a little more, because everyone I'm sure is familiar with Kafka, but the Streams product from them is a little newer, as is Kafka Connect. So it sounds like it's not just the transport: you've got some basic analytics, and you've got the ability to do the ETL because you've got Connect, with its sources and sinks. Tell us how you've used that.

Well, the Streams product is quite different from something like Spark Streaming, right? It's not working off micro-batching, it's actually working off the stream. And the second thing is it's not a separate cluster, it's just a library, effectively a jar file, right?
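The RTI idea above can be sketched in plain Python. This is not the Kafka Streams API (which is a Java library); it's a minimal illustration of the two pieces Darren describes: a stateless per-record transform of the kind a Streams job runs inline, and a directional "what's burning up the charts" view that a thin speed layer would serve. Field names are hypothetical, not Malwarebytes' actual telemetry schema.

```python
from collections import Counter

def enrich(event):
    """Stateless, per-record transform -- the sort of inline operation a
    Kafka Streams mapValues step performs. No state, no windowing: each
    record is normalized independently."""
    return {
        "family": event["family"].lower(),
        "endpoint": event["endpoint"],
    }

def realtime_indicators(events, top_n=3):
    """Directional RTI view: which malware families dominate right now.
    In the pipeline described, a thin speed layer (HBase, maybe Redis
    later) serves these counts, and batch later trues them up."""
    counts = Counter(enrich(e)["family"] for e in events)
    return counts.most_common(top_n)
```

The point of "directional" is visible here: the counts don't need to be exactly right to tell you which family is priority one this hour; the batch layer corrects the exact numbers later.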
And so, because it works natively with Kafka, it handles certain things quite well, right? It handles back pressure, and when you expand the cluster it's pretty good with things like that. We've found it to be a fairly stable technology. It's just a library, and we've worked very closely with Confluent to develop that. Whereas Kafka Connect is really something we use to write out to S3. In fact, Confluent just released a new direct S3 connector. We were using StreamX, which is sort of a wrapper on top of the HDFS connector, and they kind of rigged that up to write to S3 for us.

So maybe tell us, as you look out, what sorts of technologies do you see as enabling you to build a richer platform, and then how would that show up in the functionality that customers like us would see?

With respect to the architecture?

Yeah.

Well, one of the things we had to do is evaluate where we wanted to spend our time, right? We're a very small team. The entire data science and engineering team is less than, I think, ten months old. All of us got hired, we've stood up this platform, and we've gone very, very fast. And we had to decide: we've made this big investment, how are we going to get value to our end customer quickly, so that they're not waiting around and you get the traditional big data story where, you know, we spent all this money and now we're not getting anything out of it? So we had to make some of those strategic decisions, and because the data is really, truly big data in nature, there's just a huge amount of work that has to be done in these open source technologies. They're not baked. It's not like going out to Oracle, giving them a purchase order, installing it, and away you go. There's a tremendous amount of work.
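For context on the Connect piece, a sink like the one Darren mentions is configured declaratively rather than coded. Below is a hedged sketch of building such a config: the connector class and property names follow Confluent's S3 sink connector documentation, but the connector name, topic, bucket, and region are made-up examples, not Malwarebytes' settings.

```python
def s3_sink_config(topics, bucket, flush_size=1000):
    """Build a Kafka Connect S3 sink connector config.

    Property keys follow Confluent's S3 connector docs; all values here
    (name, bucket, region) are illustrative placeholders.
    """
    return {
        "name": "telemetry-s3-sink",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": ",".join(topics),
            "s3.bucket.name": bucket,
            "s3.region": "us-west-2",
            "flush.size": str(flush_size),  # records per S3 object
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        },
    }

# POSTing this dict as JSON to the Connect REST endpoint
# (e.g. http://connect-host:8083/connectors) registers the sink.
```

This is why Connect "commoditizes" the landing step: moving a topic into S3 becomes configuration, and the hard work shifts to the transformations upstream.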
And so we've made some strategic decisions on what we're going to do in open source and what we're going to do with third-party vendor solutions. One of the areas we decided on was workload automation. I just did a talk on this, about how Control-M from BMC was really the tool we chose to handle a lot of the sophisticated coordination and the workload automation on the batch side, and we're about to extend that into a data quality monitoring framework. It's turned out to be an incredibly stable solution for us. It's allowed us to not spend time on open source solutions that do the same things, like Airflow, which may or may not work well but has really no support around it, and instead focus our efforts on what we believe to be the really hard problems to tackle in Kafka, Kafka Streams, Connect, et cetera.

Is it fair to say that Kafka plus Kafka Connect solves many of the old ETL problems? Or do you still need some sort of orchestration tool on top of it to completely commoditize, essentially, moving and transforming data from an OLTP or operational system to a decision support system?

I guess the answer is: it depends on your use case. I think there are a lot of things that Kafka and a Streams job can solve for you, but I don't think we're at the point where everything can be streaming. I think that's a ways off. There are legacy systems that don't natively stream to you anyway, and there are certain operations that are just more efficient to do in batch. That's why I don't think batch is going away for us anytime soon, and it's one of the reasons workload automation in the batch world initially was so important. And we've decided to extend that into building out a data quality monitoring framework, to put a collar around how accurate our data is on the real-time side.
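The "collar" idea, batch as the truing mechanism for directional real-time numbers, can be illustrated with a simple check. This is a hypothetical sketch, not Malwarebytes' framework: compare speed-layer counts against authoritative batch counts and flag anything drifting beyond a tolerance.

```python
def truing_report(rti_counts, batch_counts, tolerance=0.05):
    """Flag keys where the real-time indicator drifts from the batch truth
    by more than `tolerance` (relative). Threshold and key names are
    illustrative; a real data-quality job would also check freshness,
    completeness, and schema."""
    flagged = {}
    for key, batch in batch_counts.items():
        rti = rti_counts.get(key, 0)
        # If batch saw nothing, any nonzero RTI is 100% drift.
        drift = abs(rti - batch) / batch if batch else float(rti > 0)
        if drift > tolerance:
            flagged[key] = {"rti": rti, "batch": batch, "drift": round(drift, 3)}
    return flagged
```

A scheduler like Control-M would run this after each batch cycle; an empty report means the directional numbers can be trusted for prioritization.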
Because it's really horses for courses. It's not one or the other; it's application-specific: what's the best solution for that particular instance?

Yeah, I don't think there's a one-size-fits-all. If there were, there'd be one company, and there would be no need for architects. You have to look at your use case, your company, what kind of data, what style of data, what type of analysis you need. Do you actually need the data in real time? And if you do put in all the work to get it in real time, are you going to be able to take action on it? I think Malwarebytes was a great candidate. When I came in, I said, well, it does look like we can justify the need for real-time data and the effort that goes into building out a real-time framework.

Right, right. And we always say, what is real time? In time to do something about it. Depending on how you define real time, what difference does it make if you can't do anything about it?

That's right.

So as you look out into the future, with IoT and all these connected devices, there's a hugely increased attack surface; we just saw that at RSA a few weeks back. How does that work into your planning? What do you guys think about a future where there are so many more connected devices out on the edge, in various degrees of intelligence, and opportunities to hijack, if you will?

Yeah, I don't think I'm qualified to speak about the Malwarebytes product roadmap as far as IoT goes.

More philosophically, then, from a professional point of view, because every coin has two sides: a lot of good stuff coming from IoT and connected devices, but, as we keep hearing over and over, just this massive attack-surface expansion.

Well, I think for us the key is we're small, and we're not operating, like, I came from Apple, where we operated on a budget of infinity. So we're not...

Not the building, not the address of infinity.
That's right, the actual budget. We're small, and we have to make sure that whatever we do creates value. What I'm seeing in the future is, as we get more into the IoT space and logs begin to proliferate and data grows exponentially in size, it's really: how do we do the same thing, and how are we going to manage that in terms of cost? Generally, big data is very low in information density, right? It's not like transactional systems, where you get data that's effectively an Excel spreadsheet and you can run some pivot tables and filters and away you go. Big data in general requires a tremendous amount of massaging to get to the point where a data scientist or an analyst can actually extract some insight and some value. And the question is, how do you massage that data in a way that's going to be cost-effective as IoT expands and proliferates? That's the question we're dealing with. We're at this point all in with cloud technologies. We're leveraging quite a few Amazon services, serverless technologies as well. We're in the process of moving to Athena as an on-demand query service, and we use a lot of ephemeral clusters, which allows us to run all of our ETL within about two hours. These are some of the things we're doing to prepare for this explosion of data and to make sure we're in a position where we're not spending a dollar to gain a penny. Does that make sense?

Yeah, we always make fun of that business model. You want to drive revenue? Sell dollars for 90 cents and watch what happens.

That's the dot-com model. Exactly. I was there: make it up in volume.

All right, Darren Chinen, thanks for taking a few minutes out of your day and giving us the story on Malwarebytes. It sounds pretty exciting, and a great opportunity.

Thanks, I enjoyed it.
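Athena's fit for the cost constraint Darren describes is that it's pay-per-query over data already sitting in S3: nothing idles between ETL runs. A hedged sketch of how such a query is submitted (the parameter names follow the AWS SDK's Athena API; the database, SQL, and bucket are made-up examples):

```python
def athena_query_request(sql, database, output_s3):
    """Build the kwargs for boto3's Athena start_query_execution call.
    Keys match the AWS SDK; the values passed in are illustrative."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

# With boto3 available and AWS credentials configured, this would run as:
# boto3.client("athena").start_query_execution(**athena_query_request(
#     "SELECT family, count(*) FROM detections GROUP BY 1",
#     "telemetry",
#     "s3://example-results-bucket/athena/"))
```

Billing by data scanned also rewards exactly the massaging he mentions: partitioned, columnar data in S3 makes each query cheaper.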
Absolutely. He's Darren Chinen, he's George, I'm Jeff, you're watching theCUBE. We're at Big Data SV at the Historic Pagoda Lounge. Thanks for watching, we'll be right back after this short break.