from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert.

Welcome back to Midtown Manhattan, everybody. This is theCUBE, we're live here at Spark Summit East. One of the things we like to do for our community when we come to these events is run little events within the event, and we love to have a panel of what we call doers. Some people call them end users, and I've often called them practitioners, but they're really doers: people trying to take technology and apply it to solve a business problem or build out a business capability. So we have a panel of doers along with our own George Gilbert, and I want to introduce them before we get into how they're utilizing Spark and leveraging big data to solve business problems.

Let me start with Tamer Hassan, who's the CTO of White Ops, a New York-based company focused on info security. Welcome, Tamer, thanks for taking the time to come on theCUBE. We just heard from George Gilbert. George, great job on the big data forecast. Thank you. And Beth Logan of DataXu, a very cool Boston-based analytics company that I've known and followed for quite some time. Patriots fan? All right, good, we like to see it. Hey, it's not so bad, life could be worse. And Danny Rogers is the CEO of Terbium Labs, also an info security company, based out of Baltimore, right? So folks, thanks very much for coming on theCUBE. Welcome.

So Tamer, let me start with you; we'll just go from my left to right. Tell us a little more about White Ops, the problems you're solving, and where big data and analytics fit in.

Sure, yeah. We focus on a very specific slice of cyber security, namely bot detection. Our goal is to verify that there's a human on the other end of the screen every time a page loads. There's a variety of use cases for us; one of the larger ones is ad security, where an estimated $7 billion or more is lost annually by US advertisers. So we collect a lot of data to do that. The core component of our capability, our product, is called bot prints: essentially detection algorithms. We fingerprint signals of automation and remote control. We're constantly developing these algorithms, and they run on rather large, arbitrarily large data sets, in stream and in batch.

Okay, Beth, Director of Optimization. Obviously all that talk this morning about optimization was, I'm sure, interesting to you, but you've been in this business for a while now. I think you told me you've been with the company seven years. Seven years, before the big data meme even started. So talk a little bit about that journey and what you do.

All right, so DataXu's business is to provide a marketing cloud for our customers to optimize their ad spend, and this is across display media, mobile, video, the whole thing. At the heart of it, we're buying media in real time on different ad exchanges, and my team works on the product that optimizes that media buy. So again, we have lots of data, lots of previously shown ads, and we know whether or not people converted, whether they clicked, what they did. We have tons of data to process. We started with a Hadoop-based homegrown system, and lately we've been transitioning to Spark because it gives us much more flexibility.
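Beth's setup, piles of labeled historical impressions feeding a model that decides what to bid, is easier to picture with a sketch. Here is a minimal PySpark/MLlib version of that kind of job; the paths, column names, and features are invented for illustration, not DataXu's actual schema:

```python
# Hedged sketch of a conversion model over historical ad impressions.
# All paths and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("conversion-model").getOrCreate()

# One row per previously shown ad, with a 0/1 click/conversion label.
impressions = spark.read.parquet("s3://bucket/impressions/")  # hypothetical path

assembler = VectorAssembler(
    inputCols=["hour_of_day", "bid_price", "site_quality"],  # hypothetical features
    outputCol="features",
)
train = assembler.transform(impressions)

# Fit a click-probability model; the same MLlib API exists in Scala,
# which is the prototype-to-production point made later in the panel.
model = LogisticRegression(labelCol="clicked", featuresCol="features").fit(train)
print(model.coefficients, model.intercept)
```

The point is less the particular model than that the library ships with Spark, so none of it has to be written from scratch.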
And then, Danny, give us the elevator pitch on Terbium Labs, if you would, please.

Absolutely, yeah. At Terbium Labs we run a system called Matchlight, which is the world's first fully private, fully automated data intelligence system, focused on searching the internet on behalf of our clients for elements of their data leaked to places such as the dark web and other places they would not want to see their data. So we're very much a big data company applied to security, really taking advantage of all these technologies and the ability to build large-scale search systems, which is relatively new to the market.

So Beth, I wonder if I could start this line of questioning around business problems with you. Because you're in the marketing space, it seems like brands are scrambling to understand new demand patterns and customer preferences. As consumers, we have so much information now about pricing, about products, what's good and what's not. And it seems like brands have to find a way to capture information they didn't have before, both to gain competitive advantage and to learn more about consumers, oftentimes learning more about a consumer than the consumer knows about himself or herself. Is it fair to say that's a big part of the problem you're helping customers solve? Talk about that a little bit.

Yeah, so part of the problem is that the industry is changing almost every month. There's something new coming along, some new threat, some new idea that someone has, so we're having to adapt our algorithms all the time. That's partly why we're transitioning to Spark: it provides flexibility we didn't have before. There's a whole set of machine learning libraries available to us through Spark, so we don't have to write everything from scratch every time.

Okay. And Tamer, you and Danny are probably solving similar problems, maybe from different angles, but the problem you're solving is the big, chewy problem of security. What's the big data angle on that? Talk about the problem and your unique approach to solving it.

Sure, yeah. Any crime worth doing online is better done by a million or 10 million machines than by an individual actor, so there are all these use cases. And what we found, especially in advertising, is that it fuels a billion-dollar black market of malware. All that spam you see, all that phishing to infect your computer: there's a variety of things they can do with your computer once it's infected, anything from denial-of-service attacks, stealing credentials, and keystroke logging to wide-scale ad fraud, which is a lot of easy money. It's one of the few recurring revenue models for cyber crime; usually you're one and done, but this is one where you generate a lot. Our goal is to detect all forms of that. Anytime you have that kind of money involved in a cyber crime, you have real adaptation. You're talking about some of the world's best hackers making millions to tens of millions, or organized groups making tens to hundreds of millions of dollars, adapting to get around detection mechanisms.

So we've developed a method that's evidence-based, where we look at a variety of technical measurements of browser behavior and performance, and we fingerprint signals of automation and remote control. Every time a page loads with our line of JavaScript or something else in it, we collect anywhere from 500 to 2,000 data points. Some of it is structured, some of it is very unstructured, and that's where the challenge of arbitrarily large data comes into play: you have 2,000 data points of nested data structures, and anyone who's worked on the web knows that what you get from browsers and devices is a wild west of data. It's a difficult problem at that scale. We see anywhere from 10 to 15 billion web transactions a day, about 20 terabytes of data a day depending on the volume, and obviously petabytes a year. Being able to analyze that in real time and deliver results matters: we have an analytics platform for monitoring that's up to date within two to three minutes of seeing an event, and a prevention product that responds to an event in five to 10 milliseconds. So yeah, it brings those challenges with it.
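To make the "nested, arbitrarily large" point concrete, here is a hedged sketch of flattening that kind of per-pageload telemetry in PySpark. The field names are invented; White Ops' actual signals aren't public:

```python
# Hedged sketch: flatten nested per-pageload browser telemetry for batch analysis.
# Field names (signals.automation.webdriver, etc.) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bot-signals").getOrCreate()

# Each event is a nested JSON document with hundreds of measurements;
# Spark infers a schema, and fields a given browser didn't send come back null,
# which matters when devices send wildly varying payloads.
events = spark.read.json("s3://bucket/pageloads/2017-02-08/")  # hypothetical path

flat = events.select(
    F.col("session_id"),
    F.col("signals.automation.webdriver").alias("webdriver_flag"),
    F.col("signals.timing.first_paint_ms").alias("first_paint_ms"),
)

# Batch-style question: how many page loads carried an automation signal?
print(flat.where(F.col("webdriver_flag") == True).count())
```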
And Danny, when we talk to practitioners in our community, they tell us, well, we used to spend all our money digging the moat around the castle, trying to protect the queen, but these days the queen has left the castle and data's flying all over the place. I mean, Hadoop was all about bringing the code to the data. So I'm presuming you're seeing that trend change the way people secure their data and information. How does that affect what you do and the way you use analytics to solve the problem?

This is a great question, and it actually underlies our whole philosophy as a company. We founded the company on the premise that defense, while still necessary, is no longer sufficient: you can build all the moats you want, but given the sophistication of threats these days and the myriad ways data can leave an organization, you have to assume data will get out. So we decided early on to take a big data approach to this problem and use large-scale computing to go outside the organization and look proactively beyond the network borders for these signatures. If you can catch it first, if you can see your data out there before others can, you can mitigate a lot of the damage that typically occurs when data breaches are discovered by third parties, often by the public at the same time the affected company is discovering it. By bringing that breach discovery into the organization, and focusing on this streaming, real-time nature, if we can bring the discovery time down from what used to average hundreds of days to minutes, we can provide a lot of value. These technologies are what underpin that capability.
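The "signatures" Danny mentions map onto a simple idea: fingerprint the monitored data one-way, then look for the same fingerprints in whatever you crawl. Matchlight's actual scheme is proprietary; this toy Python version, with made-up inputs, just shows the shape of the computation:

```python
# Toy illustration of one-way data-fingerprint matching.
# Not Terbium's real algorithm; inputs and shingle size are arbitrary.
import hashlib

def fingerprints(text: str, shingle: int = 16) -> set:
    """One-way fingerprints over overlapping character shingles."""
    norm = " ".join(text.lower().split())  # crude normalization
    return {
        hashlib.sha256(norm[i:i + shingle].encode()).hexdigest()
        for i in range(0, max(1, len(norm) - shingle + 1))
    }

monitored = fingerprints("Jane Doe, card 4111-1111-1111-1111")
crawled = fingerprints("dump: jane doe, card 4111-1111-1111-1111, exp 09/19")

# Any overlap is a candidate match worth alerting on.
print(len(monitored & crawled), "matching fingerprints")
```

Because only hashes are compared, the search side never needs the client's cleartext, which is the "fully private" claim, and the matching itself stays a fast set intersection.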
George, I wonder if you could kick off a discussion. You just gave a presentation where you showed your forecast and took us through the journey of big data 1.0, 2.0, and 3.0. You and I have talked a lot about how those with built-up Hadoop expertise can bring in things like Spark and evolve their Hadoop infrastructure, while others may start from scratch. What are you seeing there? And maybe get a discussion going around what these folks are specifically doing at their companies.

It was interesting to hear the starting points. If I recall, Beth, you started with Hadoop on Qubole. And Tamer, you were on Databricks, so you were always Spark?

We were on Hadoop; we're actually just getting started with Databricks and migrating into it.

Interesting, okay. And Danny, you were on MapR, so probably traditional Hadoop. I guess the big question I'd ask each of you is: from the time you get your hands on data to the time you get an answer, what is that time frame, that latency, and how has it changed as your platform has evolved?

Yeah, that's a great question, and that's obviously one of the drivers in adopting new technologies in this world. For us it was anything that spanned multiple days. In the early days you had to write a query or code, so there was time there, and there was a queue there, and then that had to run, and sometimes it...

I mean, for each new job, you had to write a query? It wasn't repeatable?

In some cases, yeah; it depends on the complexity of the query. In the early days it was Hadoop for us. We moved to some SQL-on-big-data solutions that made at least that part a little more decentralized and less of a queue, but it could still take several hours or even overnight to run a basic query to get intel.

Okay, and do you have any estimates, now that you're moving to Databricks, of how much faster that will be?

It's still early days for us, and that's the first use case we're tackling. We're just starting to build our Parquet data warehouse internally and running test cases. It looks rather fast, but the jury is still out.

Order of magnitude? 2x, 20x?

Depending on the type of query, I'd say one to five x. Once you're in that world, it depends whether the data is structured or unstructured, whether you have nested queries, whether you have a strongly typed schema. All those cases come into play here.
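Tamer's "Parquet data warehouse" step is worth a concrete sketch. A minimal version in PySpark, with hypothetical paths and partition columns, might look like this:

```python
# Hedged sketch: land raw JSON events as partitioned, columnar Parquet,
# then query them with Spark SQL. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-warehouse").getOrCreate()

raw = spark.read.json("s3://bucket/raw-events/")  # hypothetical path

# Columnar, partitioned storage is where much of the query speedup comes from:
# scans touch only the columns and partitions a query actually needs.
(raw.write
    .mode("overwrite")
    .partitionBy("event_date")  # hypothetical partition column
    .parquet("s3://bucket/warehouse/events/"))

spark.read.parquet("s3://bucket/warehouse/events/") \
     .createOrReplaceTempView("events")

spark.sql("""
    SELECT event_date, count(*) AS n
    FROM events
    WHERE event_date >= '2017-02-01'
    GROUP BY event_date
""").show()
```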
And Beth, what about you? The elapsed time from question to answer, and how has that changed?

Question to answer is probably going to be a little faster with Spark, but that's not the really big win. The really big win is the time from prototype to production, because in the old days the data science team would prototype, probably in Python or a mixed bag of things, and then eventually that had to be rewritten in Java and run on Hadoop. That's a pretty big undertaking. With Spark, you might work in Python for the prototype and maybe in Scala in production, but the libraries and the calls are the same, so the time to production is going to be a lot faster. That's a big win for us, because our customers are definitely demanding more flexibility.

Can you quantify that change?

Again, early days for us, but it's probably going from months to weeks, that sort of change, and that would be huge.

That is huge.

It's funny, because we've had a similar experience: one of the attractive things about Spark for us is the Python integration. We're a big Python shop and real believers in that approach. At the same time, the real-time nature of our system is one of its big value points, so response times measured in days or months are scary-sounding things for us. We go for minutes; we're shooting for sub-15 minutes between some event happening and our customers knowing about it. That real-time nature is a key value of our product.

But how did you get to such low latency on a Hadoop-based solution?

Well, our computations are relatively straightforward. We're simply trying to match the data fingerprints we're monitoring for our clients against the things we're collecting, and that's a pretty simple computation compared to a lot of the other things people are doing with these technologies. So we focus a lot more on speed and simplicity than on achieving really deep computational milestones, and that helps speed things up.

Okay. How did you all evaluate the trade-offs between bringing Spark into your Hadoop ecosystem versus putting in separate infrastructure like Databricks, or using some other stack like Kafka or Cassandra or Mesos? How did you think about those trade-offs? Maybe, Danny, you could start.

Sure. For starters, we're big fans of the MapR Hadoop distribution, partly because it's natively implemented and incredibly fast, and that again comes down to our particular product characteristics: speed is important, and time is one of our key differentiators. The other attraction of Spark is, again, the Python integration and the simplicity of integrating with the rest of our analytics and the rest of our stack. Having that really beautiful computational ability so simply wrapped up in something we use every day, without having to do a translation step, is very attractive for us.

Beth, you're going through that evaluation process now?

We are. We've been on HDFS and S3 for a while now, so for us it's natural to stay there and just put Spark on top. In the future we may move to some of the other technologies if it makes sense, but we're still in that transition phase, so let's change one thing at a time. For now we're changing from Hadoop to Spark but keeping the file structures the same.

Okay, and Tamer, am I correct that you came into Spark more aggressively via Databricks? Can we talk about that a little?

Yeah, sure. We already had infrastructure in Kafka, Cassandra, that type of thing, so we started with Spark early and independently, trying it out in small components of the data pipeline, really as an exercise to get familiar and see what the capabilities were. Where Databricks came in was in moving to broader use cases. One of our bigger ones is ad hoc exploration of large and unstructured data sets, which is a challenging problem, and data accessibility is a real thing. What we're trying to optimize in an engineering department is velocity, right? Velocity of delivering, of migrating, and when you're talking about wide-scale shifts to different frameworks, that's where you start to hit bottlenecks. So the more you can build up tooling and minimize the pain of that transition, the better. Databricks fits well across the full spectrum, as Spark does in general: I can give somebody access to Spark, let them explore and play, that eventually turns into a prototype, and that can eventually go into production, and, as Beth said, in a variety of languages, whether it's Python or R, which is very powerful.
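Since Kafka keeps coming up, here is a hedged sketch of what the streaming half of such a pipeline can look like in Spark's Structured Streaming (Spark 2.x syntax). Brokers, topic, and fields are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath:

```python
# Hedged sketch: minute-level monitoring over a Kafka event stream.
# Broker address, topic name, and event fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("stream-monitor").getOrCreate()

schema = (StructType()
          .add("session_id", StringType())
          .add("ts", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "pageloads")                  # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Per-minute counts, in the spirit of "analytics within two to three minutes".
counts = events.groupBy(F.window("ts", "1 minute")).count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()
```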
So thinking about this agile, DevOps world you're describing, if you can use that term: it's different from the classic IT project that has a beginning and an end, a project plan, funding, and a team. How do you know when you're done? Do you approach it project by project? Is it a weekly sprint? Do you have a goalpost you're trying to reach? How do you manage your organization?

Sure, yeah, that's a great question, and it varies widely from organization to organization. At White Ops we have a defined set of stages: research, which is optional, code, deploy, and delivery, and we tend to break these projects, which we call epics, into milestones. So we'll have very defined milestones: this is what it looks like at the end of this milestone. Sometimes it's open research and you don't know what the milestones should look like, so you put a research spike on it for two weeks or a first sprint, and then we say, at the end of this we should be able to run these 10 common queries, get this kind of performance back, and have this schema for a warehouse use case. We can define that, and then you're in a race to do it.

Okay, so you've got that checklist. And Beth, you've got a platform, and I presume every project goes into that platform?

Yes. As a company, of course, we have a roadmap out several years, and the roadmap gets more and more detailed the closer in you get. So we have a roadmap for this quarter, my team has a roadmap, and that has deliverables and milestones and stretch goals and all the rest of it, and then we do planning every few weeks to see how well we're tracking against those goals.

Anything you'd add to that?

The only thing I'd say is that it's a little different for us, because the technology is core to our product and part of the everyday operation of the system, so it's a bit of a different question. We're done when the product works, but at the same time it's always an improvement process, so it's less project-based and more a foundation stone of the architecture of the whole offering.

Is it fair to say that each of you is a relatively early adopter of big data, Hadoop, analytics, leading-edge practices? Is that fair to say?

Yeah, it's an interesting question. I would say yes, but it's been a long early-adopter phase, right? It's a challenging space with challenging problems that's slowly becoming more and more solved.

The reason I ask is that a while back we did a little exploration within our community, where you sit down privately with people and dig to find out what they're really getting out of their investments, and it came out that for every dollar people were spending on modern big data, on average they were getting a 50-cent return. Not too good.
It's a long journey, so maybe the early days were like that; you may have had a different experience, I don't know. So what's your return, generically speaking, at a high level, on the whole big data initiative? Obviously you're building companies around it, so that's kind of a dumb question, but how about that question on Spark? Are you getting a return on your Spark investments today, and if not, when do you expect to?

I can talk about the ROI a little bit.

Go ahead, Beth.

It's very early days for us. We know that if we can cut the time from prototype to production, that will be a huge win. We also expect major speed-ups and other cost savings from moving to Spark. Even the lines of code saved will be a big return on investment for us.

So it's speed and cost?

And robustness, all those things.

And Danny, do you have visibility on that?

I can speak generally. As you said, we built our whole company around the existence of these technologies, and I always say that five years ago we couldn't have made the product we've made. I also find myself saying "what a time to be alive" a lot, just because the scale at which you can deal with these large data sets, and the efficiency, is kind of mind-blowing when you step back, given that when I was in school we were building Beowulf clusters, and even basic cluster management was something you had to write from scratch. I sound like an old man here, but the pace at which you can deal with larger and larger scales of information is really exciting.

And Tamer, your Spark experience?

Yeah, the value spans so many facets of the system. It's not just about performance, as Beth said; it's about velocity of delivery. The fact that I can onboard an engineer and they can write Python, my PhD data scientists can write R, some of my more forward-thinking engineers can dive right into Scala, or use SQL to explore data, and all of that exploration can evolve into prototypes, is a very powerful thing. At White Ops we have pretty diverse teams, one of them being what we call the detection team. They're white-hat hackers, some of them legitimately hackers, not Scala engineers. The fact that they can write in Python and SQL, and my data science and data intelligence teams can go into R and Java and things like that, is pretty powerful. And the moment my PhDs start becoming Scala data engineers, you start building a unicorn, right? Unicorns aren't found, they're made. That kind of enablement is of tremendous value.

Okay, a question on expectations. Thinking about your journey here: what was unexpected, what obstacles did you hit, and if you had to do it over again, what would you do differently? You hear a lot about getting executive buy-in and business alignment, but what else? Beth, maybe you could start us off.

You mean technical obstacles?

You choose.

Okay. So even though we build our models in batch mode, they actually have to run in real time on a decisioning system that makes 1.6 million decisions per second, so there are memory constraints and time constraints. One thing we found is that we were hoping to just use the Spark models as-is, but you can't bring the whole Spark context into that system, so we had to write some extra code to handle that transformation. Talking to other people at the conference, we found they'd hit this problem too, so we weren't the only ones who had to solve it. But it's working now, so it's great.
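Beth's obstacle, serving a Spark-trained model inside a system making 1.6 million decisions per second without dragging a SparkContext along, is common enough to sketch one workaround: export the fitted parameters and score with plain NumPy on the hot path. This is a hypothetical illustration, not DataXu's actual code:

```python
# Hedged sketch: take a fitted pyspark.ml LogisticRegressionModel offline
# and score it with no Spark dependency on the serving path.
import math
import numpy as np

def export_model(spark_model):
    """Run once, offline: copy the learned parameters out of the Spark object."""
    return spark_model.coefficients.toArray(), float(spark_model.intercept)

def score(weights, intercept, x):
    """Pure NumPy sigmoid scoring; no SparkContext on the hot path."""
    return 1.0 / (1.0 + math.exp(-(float(np.dot(weights, x)) + intercept)))

# weights, intercept = export_model(model)     # model from a training job
# p = score(weights, intercept, feature_vec)   # cheap enough for real time
```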
Danny?

I think the only thing we've discovered as we use all of these technologies more and more is that it really is pretty early days. We talk about being early adopters, and I think all of these technologies are still new; they still have, for lack of a better phrase, that new-technology smell. Things aren't necessarily documented perfectly, and that's not any one specific vendor or group, it's just the result of this stuff being relatively new. So I'd say we found things were not as mature as we had expected, but then again, we kind of expected to find that, and I think it's only going to get better, more mature, and more seamless as we go.

Okay. All right, last question: each of you, describe your Nirvana. Five years down the road, what does Nirvana look like to you, Tamer?

Yeah, that's a great question. Five to 10 years ago, a unified data platform wasn't even part of the conversation, right? The classic case was ETL-ing 10 systems into a warehouse overnight, with jobs that could fail and had dependencies, and maybe a variety of different data stores. But just in the past few years it's become possible to have an almost fully unified data platform where all the layers sit on top of each other and you're not duplicating data. All the problems that come with ETL-ing several components or data sources into supporting databases only get magnified as big data scales larger and larger, so this will only become more important. So yeah, my dream, which some people still laugh at, is that I don't need seven databases; I don't even need three. Granted, you still need the right tool for the job, but I believe we're much closer to a unified data platform than we were several years ago.

All right, George, what's your Nirvana?

Well, given that our panelists are all rooted heavily in the real world, I'd point to the usage scenarios our CUBE guests are all trying to achieve today: combining the rich historical information we've got in the data lake with the really low-latency streaming information that we want to integrate with our systems of record, so we have both freshness and context to make better decisions. At a high level, I think that applies to a lot of applications.
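George's "freshness plus context" Nirvana corresponds to a pattern Spark already supports: a stream-static join, where each micro-batch of live events is enriched against historical data from the lake. A hedged sketch, with hypothetical sources and keys:

```python
# Hedged sketch: enrich a live event stream with historical context
# (a stream-static join in Structured Streaming). All names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fresh-plus-context").getOrCreate()

# Rich historical context from the data lake (static DataFrame).
history = spark.read.parquet("s3://bucket/warehouse/user_profiles/")

# Low-latency events arriving over Kafka (streaming DataFrame).
live = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
        .option("subscribe", "events")                     # hypothetical topic
        .load()
        .selectExpr("CAST(key AS STRING) AS user_id", "value"))

# Each micro-batch of fresh events picks up its historical context
# before any downstream decision is made.
enriched = live.join(history, on="user_id", how="left")

enriched.writeStream.format("console").start().awaitTermination()
```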
Great. All right, Beth.

So for our system, many of our customers are big brands, and they have their own data science teams who are pretty smart and know what they're doing. My Nirvana is that they could bring their own algorithms to the table, write them themselves, bring their own data, bring whatever they need, and it just comes in and works in our system. My team can still provide our own algorithms, but we'd have all the algorithms available to run.

How about you, Danny?

I guess I'd say there's a lot of talk about how these technologies scale limitlessly, but when you actually try to scale them to the limits, you find there are still limits. If we're talking about Nirvana, beyond generally being amazed at what a neat time it is to be doing this stuff, I think the ultimate goal is when there really is limitless scaling, when you never hit those limits and you just keep going. I don't know if you ever get there; every time you take the technology from the whiteboard to the real world, there are always going to be limits. But when you can finally say, okay, this really can handle whatever you throw at it, and really mean it, I think that'll be a good mark of maturity.

Awesome. My Nirvana, for what it's worth: we've been building these communities for five or six years, and I really hope we can continue to leverage them to create information that helps doers like yourselves and your peers get stuff done, apply technology to solve business problems, and create business capabilities. So thank you all so much for coming on theCUBE and sharing your insights.

All right, keep it right there, everybody. This is day one at Spark Summit East; we'll be back with a full day of coverage tomorrow. This is theCUBE, we're live from Manhattan. See you tomorrow.