From New York, extracting the signal from the noise, it's theCUBE covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert. Hey, welcome back everybody. Jeff Frick here with theCUBE. We are live in Midtown Manhattan at Spark Summit East. They've got Spark Summit East, Spark Summit West, Spark Summit Coastal, really loud audio in the background, but that's what happens when we're live. But we're excited to have Tamer Hassan back, CTO from White Ops. He was on yesterday on our customer panel, but we wanted to get him back and dig a little bit deeper into the story. I'm also joined in this segment by George Gilbert from Wikibon. So first off, Tamer, welcome back. Thank you, it's good to be back. Absolutely, so for the people that missed the segment yesterday, give us just kind of a brief overview of what White Ops is all about. Sure, we're a cybersecurity firm. We specialize in a very specific slice of cybersecurity: bot detection. So our goal is to verify that there's a human on the other end of the screen for anything that we're protecting. There's a wide array of use cases. Our major focus right now is ad security, because there's quite a large market for malware there. So for the people that aren't familiar with bots and ad security, what are the issues that people have? Why is this an important issue that you built your company around? Yeah, it's interesting. We never thought as a cybersecurity company we would end up in the advertising space, but it was a pretty clear journey all the way across, and a pretty direct one. So the losses to US advertisers are estimated at $7 billion plus, and that's conservative, lost to advertising fraud. Now this is essentially automated agents spoofing interactions with ads, spoofing retargeting engines, a lot of things like that. And you can imagine the amount of dollars in ad and brand advertising. It's a big game to play.
Now of that $7 billion, about a billion or more actually goes to cyber criminals or organized groups. Now you're talking about a billion-dollar black market fueling the spread of malware, right? So anytime you have that level of money you have real talent, real adaptation and things like that. So this is things like: I set up a fake account, GM is advertising their new car, I go in and basically game it. So I'm getting paid for impressions that were really just some computer software banging away, and that's how people are really gaming and ripping off the systems. Is that a good example? That's an example of one of the primitive models. Now if anyone's ever been in the ad tech ecosystem, it is a very complicated system and very layered. So there's many layers, and there's little holes and cracks in a variety of places that get interesting. So there's a lot of ways to make money. Probably the most important principle fueling this, though, is that it's one of the few recurring revenue models for cyber crime. Usually you steal a credit card number and you're gone. Somebody's looking for it, the money's missing. But this is a scale revenue model that makes money month over month, year over year, and sometimes for years, which has drawn a lot of action away from identity theft and credit card theft. That stuff still exists, but you're going to make more money and have fewer people chasing you in this fraud model. Right, right. So we're here at Spark Summit. Your company's been around for a while. What role does Spark play? Why are you here? How has Spark kind of changed your game? Sure, yeah, we're early days here in adopting it, and there's a variety of use cases. The big one up front is exploratory and ad hoc analysis, and it's a difficult problem that's magnified, obviously, with the scale of data. You think about arbitrarily large data sets, and there's a whole spectrum of what big data means.
I don't even know where we sit relative to others, there's a wide range, but every time a page loads, we collect anywhere from 500 data points to 2,000, and sometimes it's very unstructured, nested data structures, sometimes it's not. We're talking about 20 terabytes a day of a real-time data pipeline that we have to make decisions on, and then after that do exploratory research. Now that becomes difficult when it's untyped, when it's unstructured. And we have a team of white hat hackers and a data intelligence team that's constantly querying and iterating; that's our core function for innovation, for producing these things. We produce algorithms that detect humans and bots. So the quicker that iteration cycle is, the more impact we can bring. So that's been the whole drive here: driving down that iteration cycle at all the stages. Who's your client? Is it the advertisers? Is it the network? Who's employing you to help reduce the amount of fraud? Yeah, it's the full spectrum in advertising. Our primary direction is brand advertisers and the Fortune 500 and all the brands. We published this annual study on advertising fraud called the Bot Baseline. This year was the second year. It was a pretty in-depth 40-page analysis of all the crazy things that bots do to game the system. And it was done in partnership with a lot of the major brands and their advertising. In addition to that, we work with the ad tech platforms in the middle that provide all the pipes back and forth, as well as publishers. Because when you think about it, the money should be going to real content generators, and it's not. The New York Times puts a tremendous amount of money into producing real content for all of us to go visit. Now I can put up a blank webpage and by the end of the day have just as many unique visits as the New York Times homepage, just by using fraudulent bot traffic and malware-driven bot traffic and things like that.
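To make that exploratory problem concrete, here is a minimal, hedged sketch of the kind of work involved: flattening nested, semi-structured per-pageload records so they can be queried ad hoc. The field names are invented for illustration, not White Ops' actual schema, and at 20 terabytes a day this would run as a Spark job rather than plain Python, but the flattening idea is the same.

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical per-pageload record; real collection runs to
# 500-2,000 data points per page load.
page_load = {
    "url": "http://example.com",
    "browser": {"ua": "Mozilla/5.0", "plugins": 12},
    "timing": {"dom_ready_ms": 340},
}

row = flatten(page_load)
# Flattened rows share a sparse, dotted-column namespace, which is
# one way untyped, nested data becomes queryable.
print(row["browser.plugins"])  # → 12
```

Once every record lives in one flat namespace, ad hoc queries stop depending on each record's particular nesting, which is what makes fast iteration over messy data possible.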
So there is a model of price suppression there when you think about an endless supply and what that does to the market. So we work across the full spectrum. So how does someone who has malware, how do they benefit from sending false traffic to the New York Times? Yeah, so there's a variety of fraud models. Here's two of them. So let's say your computer is, in fact, infected with malware. Now, I don't know how many thousand people are at this conference today, but assume somewhere from 10 to 20% infection rate is where we're at now, and when you include crimeware, which is things like voluntarily installed toolbars and adware that do strange things in the background, that can get upwards of 25 to 30%. So you're talking about a real footprint here. So let's assume that something like that is running on your computer, and there's a 20, 30% chance it actually is. It's using your computer's resources, and also your identity. So when you drive around the internet and you get all those cookies dropped, and you're identified as a certain segment, and you either have intention to buy a car or you like a certain type of clothing or you're this demographic, all those are embodied in your cookies, and anything your browser does is attached to those cookies. So if the malware is actually using your system, it's also using your identity. Okay. So there's a whole triage and arbitrage of players in the system where any number of traffic flows can end up on the New York Times, and everything downstream takes a cut of the money. So even if it's not them, and it's generally not the premium publishers doing this, it helps fuel it. Oh, you mean one of the other publishers might be one of the principals in this? That is the first-order fraud.
Now, if I was going to do this fraud model today or tomorrow, I would set up maybe 50 blog sites, all off the same template, I would just copy-paste content from all over the internet, and I would drive a bunch of traffic and sell those clicks through some mechanism. Now what you see is this arbitrage of four or five resells of these impressions, these users, and at some point some of these premium publishers, they're doing their own advertising, maybe a social media campaign or something else, and these pipelines actually end up in those. They're good enough, and if the New York Times is targeting you, your malware could be targeted instead, right? So there's a lot of vulnerabilities in the system and a lot of victims in that layer, but it gets complicated. So tell us, we keep going back to this issue of how much of this is repeatable versus how much is custom for each engagement? What can you bring with you that works out of the box, and then how much has to be tailored, either configured or built custom, for each customer? As far as our capability and technology, it's generally not custom built; it's built generically across all customers, the ability to detect and prevent this kind of activity. What gets customized sometimes is the reporting and data feeds, right? And that's where Spark comes in, and it's very powerful, right? I can give this ad tech platform that funnels most of the traffic in the industry, or a big part of it, a different data feed by setting up a different transform in Spark and pushing it to them. Maybe they want a more real-time feed and I can actually do a live feed there, or they want a custom format. All that is the back half of a good data pipeline, after your data lake. That's what I was going to ask. So sort of at the tail end of the data pipeline, and some of the new stuff that might be relevant is where you can run dashboards off a stream, that sort of thing, so super low latency analytics.
Yeah, that's a very powerful feature and it's a very interesting thing, because right now we process upwards of 10 to 15 billion web transactions a day, and we have a dashboard that's up to date within two to three minutes. Now that's generally a very powerful product feature, usually, to have such freshness. Yeah, right, but now you're talking about real time. Yes, it takes on another look. Park the real-time thought, and for all our practitioner or doer viewers out there, people who are trying to build that sort of thing, how do you get that two-minute latency? You know, what are the pieces you have to put together? Yeah, there's a lot of tenets of a streaming pipeline, and we use a lot of the modern things. Everything from Kafka, we're starting to move into Cassandra and some very powerful databases. Druid.io is one up-and-coming data platform that's incredibly powerful, and we have Spark scattered throughout that. Now what we're doing now is kind of moving that more in-stream. You talk about the three components of a data pipeline, extract, transform, load, and Spark is really good at a couple of them, and if I can unify that and also provide a platform for all of my engineering teams to use easily (and accessibility is a very important part of any framework), then it works well. By unify, you don't mean to have Spark do all the ETL, or do you mean have Kafka at one layer, Spark at the sort of transform layer, and then maybe Spark also at the query layer? Sure, yeah, so we would keep Kafka and things like that. What we're moving to is a model we call a data decorator. So in our pipeline, we take this raw bundle of data, we transform things, we add attributes, maybe some reputational stuff, then it goes into a decisioning engine, a decision is made, and it goes into another transform for reporting. Now all those are different parts of the ETL, or some part of it.
We're moving to something we call a data decorator, where it's all on Spark and that can do all of those things. So we're talking about actually taking it off the queue, doing a thing, putting it back, taking it off, making a decision, putting it back. Exactly. And then what are the actions that are taken once you've determined that a particular feed is fraudulent? Are you shutting that off? Is it just reporting back to the advertisers and the network? I mean, what are some of the actions? And also your thoughts on real time, because I'm looking at, 10 to 15 billion is a humongous number. So what does real time mean in that type of world? Because obviously it's not instantaneous. But we always think of real time as in time to do something about it. How do you guys look at it, and what do you do about it once you figure out what's going on? Yeah, sure, so we'll classify it as near real time and real time. Real time is where we're actually taking action as the thing is happening, right? Real-time prevention. So we classify detection and prevention separately. We have separate outputs. So when I talk about a dashboard that's up to date in two to three minutes, that is a monitoring and optimization dashboard. It's a decision tool and a monitoring tool. We have another machine learning driven platform that responds in five to 10 milliseconds. Now this is fast enough to actually prevent the thing from happening. So that transaction never happens. It never happens. You just nip it out ahead of time. Right, so we have this global network of these prevention nodes that all of our customers use. And it's a closed-loop cycle. All of that stuff we do in detection powers these machine learning models. And those respond in five to 10 milliseconds. That loop is also quite important, and that's where we get into real-time blocking.
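The queue-decorate-decide-requeue flow described above can be sketched in miniature. This is an illustrative toy, not White Ops' pipeline: deques stand in for Kafka topics, and the reputation table, attributes, and threshold are all made up. In the real system the decorate and decide stages would be Spark transforms operating at billions of events a day.

```python
from collections import deque

# deques standing in for the raw and decided Kafka topics.
raw_topic = deque([
    {"ip": "10.0.0.1", "clicks": 40},
    {"ip": "10.0.0.2", "clicks": 1},
])
decided_topic = deque()

# Invented reputation data; 1.0 means known bad.
IP_REPUTATION = {"10.0.0.1": 0.9, "10.0.0.2": 0.1}

def decorate(event):
    """Enrich the raw event with reputational and derived attributes."""
    event["reputation"] = IP_REPUTATION.get(event["ip"], 0.5)
    event["click_rate_suspicious"] = event["clicks"] > 10
    return event

def decide(event):
    """Decisioning engine: turn decorated attributes into a verdict."""
    is_bot = event["reputation"] > 0.8 or event["click_rate_suspicious"]
    event["verdict"] = "bot" if is_bot else "human"
    return event

# Take it off the queue, decorate, decide, put it back.
while raw_topic:
    decided_topic.append(decide(decorate(raw_topic.popleft())))

print([e["verdict"] for e in decided_topic])  # → ['bot', 'human']
```

Keeping decoration and decisioning as separate, composable stages is what lets the reporting transform downstream consume the same decorated events without re-deriving anything.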
Right, and George has brought up an interesting point on a number of other interviews, which is this real-time monitoring, but also the ongoing adaptation of the models based on the new data, and the impact really on the machine learning to be able to continuously adapt. How are you guys kind of dealing with that? Yeah, that's exactly right. Right now it's a batch process. It's a fast one, but you know, my product and sales teams want it to be faster. Right now when we see something, within four to eight hours, not 48, all of our prevention nodes across the globe are updated with that knowledge to block. Now, you know, we think about bot adaptation. It's really usually on a greater cycle than that. Right, that somebody's going to adapt to defenses. But there is a lot of... They're getting faster? Yeah, well, no, they're a little bit slower. Right, so you're a malware author. You're going to do some work to adapt. But it does come into play with new sources that we haven't seen before. So Spark is really interesting there. Now, you know, in a lot of the talks here you see you can actually plug a real-time machine learning model into that detection pipeline that's running, you know, over the course of two minutes. And then it can become actually a two-minute cycle instead of an eight-hour cycle. And that is where the real power of that comes in. Yeah, it's pretty interesting. Fascinating topic, and, you know, the ad tech world is so interesting. I think it's often pooh-poohed by a lot of people in enterprise tech: it's ad tech, you know, whatever, it's serving ads, and does that warrant the great power? But it's huge amounts, huge amounts of money, and a real high-performance benchmark to be able to execute these auctions and delivery in real time. It's pretty amazing. I think it was Jeff Hammerbacher who said the brightest minds of my generation are going into figuring out how to get people to click on ads.
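The closed loop described here, where batch detection produces new blocking knowledge that fans out to prevention nodes answering in milliseconds, can be sketched as follows. Everything here is an assumption for illustration: the signature strings are invented, and a real model pushed to the nodes would be far richer than a set of bad signatures.

```python
class PreventionNode:
    """Toy stand-in for one node in a global prevention network."""

    def __init__(self):
        self.bad_signatures = set()

    def update(self, signatures):
        """Absorb fresh knowledge produced by the detection cycle."""
        self.bad_signatures = set(signatures)

    def allow(self, signature):
        """Answer on the hot path: an O(1) lookup, no heavy compute,
        which is how single-digit-millisecond responses stay possible."""
        return signature not in self.bad_signatures

# A (tiny) global network of prevention nodes.
nodes = [PreventionNode() for _ in range(3)]

# Detection cycle completes; new knowledge fans out to every node.
# Shrinking how often this fan-out happens (hours down to minutes)
# is the batch-to-streaming shift discussed above.
fresh_knowledge = {"sig-botnet-42"}
for node in nodes:
    node.update(fresh_knowledge)

print(nodes[0].allow("sig-botnet-42"))   # → False
print(nodes[0].allow("sig-legit-user"))  # → True
```

The design separates the slow loop (learning what to block) from the fast loop (blocking it), so speeding up the first never jeopardizes the latency of the second.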
But he missed the point, which was you guys are pioneering a whole new class of applications. So we do it first here, and then sort of that knowledge spreads, you know, in terms of how do we do continuous apps? Yeah, it's a very sophisticated system. I would say the only thing that really eclipses it from a technology standpoint is real-time trading in the financial industry, as far as transactional throughput and real-time processing. So, Tamer, I'll give you the last word; we're running out of time. One is, again, where can people go to get the document that you referenced earlier to kind of learn the basics of this world? And what are we going to be talking about a year from now, if we're back here at Spark Summit East 2017? What are you looking at over the next six months, nine months, how are things evolving? Yeah, so, check out whiteops.com; that's where the Bot Baseline is, with a lot of analysis and in-depth study of how these threat models work. You know, the next 12 months are going to be interesting and exciting. A lot of the groundwork laid here will carry on for the next 12 to 24 months. What I hope to see in the next, you know, one to three years is this move towards a unified data platform, whereas in the last five to 10 years there were very traditional ETL and warehouse approaches. You might have 10 transactional systems piping into a warehouse running on a 24-hour cycle and things like that, and, you know, every component of that adds a layer of fragility and also a layer of duplicated data and work. Now I think for the first time in a long time we have the opportunity for a more unified data pipeline, where we don't have to duplicate functionality and data, and that matters at the petabyte scale a lot more than it did at the gigabyte scale of the last 10 years. Awesome. Well, Tamer, again, thanks for stopping by. Really appreciate it. From White Ops, I'm Jeff Frick. You're watching theCUBE. We are live in Midtown Manhattan at Spark Summit East.
Next week we're going to be in Las Vegas at the Mandalay Bay at IBM Interconnect. So stop by, say hello, we got a big production. The whole team will be there. We're looking forward to it. We'll be back with our next guest here from New York City in just a few minutes. Thanks for watching.