from Union Square in the heart of San Francisco. It's theCUBE covering Spark Summit 2016, brought to you by Databricks and IBM. Now here are your hosts, John Walls and George Gilbert.

Well, welcome back to the Spark Summit 2016 coverage here on theCUBE in San Francisco, as we're starting to wrap things up on our second day of coverage from the summit, along with George Gilbert. I'm John Walls, and it's a pleasure to welcome now a guest from, well, let's see. What's in your wallet? That's probably all the clue you need. Alec Baldwin's not here with us, but Chris DiAgostino is the VP of Technology at Capital One, and Chris, thanks for being here with us.

Thank you for having me.

I appreciated that great keynote this morning, centering mostly on fraud detection and how you're using the Spark technology as a real nice complement to that. But let's set the stage for that, generally speaking. Obviously, a huge, huge influence on your business is trying to reduce fraud, and you quoted a number of $16 billion in 2014 of U.S. credit card fraud. So that tells you about the magnitude of the problem.

Yeah, that figure I quoted was available from a NASDAQ article, and that was across all U.S. banking, so it wasn't specific to Capital One, but it's a large figure no matter how you look at it, and one of the things that we're trying to do is obviously reduce the risk of having those losses inside of Capital One, but then also continue to protect our customers and their identity.

So then how does big data come into play here? Because you had a demonstration that, I think, was very effective in terms of what you were trying to illustrate, but for those at home who are watching right now that didn't get a chance to see it, let's talk about that integration, what goes on, what you're doing with Spark, what you're doing with some other systems as well.
Sure, I mean we're really trying to figure out ways to combine data from various data sources, some internal to Capital One, some external third-party data sources, as well as the information provided by the applicants themselves, and validate that information and try to determine a risk associated with fraud as quickly as we can. And so what we demonstrated today was the use of Spark and our ability to combine those data sources into a cluster, be able to run some machine learning against it, map that data into a graph database, and be able to traverse that graph looking for connections between applicants.

And I think what I found, well, there were many impressive aspects of that, but when you were doing it, I think you limited it, and it was like, let's just take these 25 apps, or it was 20 or 25, but you were able to detect, in almost real time it seemed, a pattern of possible fraudulent activity that provides you with obvious insight that didn't exist before, right?

Yeah, I mean it's a combination of taking the historical data, running algorithms against it to compute connectedness for prior applications that came into the business, and then handling new applications as they flow in. So we had six million historical fake applications that we generated with certain attributes, and then we had 25 new applications that we streamed in through the Databricks Notebook and scored against the fraud model that we trained using the six million prior applications. And so we were able to just basically compute the connectedness on the fly, and we had the ability to evaluate an incoming app against a very simplistic model for the demo. Obviously we would have a much more sophisticated model internal to the business, but for the purposes of the demo, we could compute that connectedness and fraud score within a couple hundred milliseconds.

And the time for that six million, for the time you ingested that six million, that took five minutes, right?
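As a rough, hypothetical illustration of the connectedness idea Chris describes, grouping applications that share identifying attributes, here is a minimal union-find sketch in plain Python. The field names, sample records, and linking rules are all invented for illustration; the actual demo used Spark and a graph database, not this structure.

```python
from collections import defaultdict

class UnionFind:
    """Minimal disjoint-set structure for grouping linked applications."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def connected_groups(applications, link_fields=("phone", "address")):
    """Group application IDs that share any value in the given fields."""
    uf = UnionFind()
    seen = {}  # (field, value) -> first application ID that carried it
    for app in applications:
        uf.find(app["id"])  # register the node even if it links to nothing
        for field in link_fields:
            key = (field, app[field])
            if key in seen:
                uf.union(seen[key], app["id"])
            else:
                seen[key] = app["id"]
    groups = defaultdict(set)
    for app in applications:
        groups[uf.find(app["id"])].add(app["id"])
    return list(groups.values())

# Invented sample applications: 1-2 share a phone, 2-3 share an address.
apps = [
    {"id": 1, "phone": "555-0100", "address": "12 Oak St"},
    {"id": 2, "phone": "555-0100", "address": "98 Elm Ave"},
    {"id": 3, "phone": "555-0199", "address": "98 Elm Ave"},
    {"id": 4, "phone": "555-0142", "address": "7 Pine Rd"},
]
print(connected_groups(apps))  # applications 1, 2, 3 form one connected group
```

In the demo this grouping would be done at scale with graph traversal in the cluster; the sketch just shows why shared attributes chain applications together transitively.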
So training, yeah, training the model using the six million historical records took about 10 minutes overall: five minutes to ingest, and another five minutes to train the model.

And the evaluation was done in milliseconds.

Yeah, hundreds of milliseconds.

It was incredible.

Yeah, so good performance for us.

So is it on your sort of holy grail list, or a Christmas wish list, to be able to take that very large semi-batch model, or semi-batch pattern, and turn that into a near real-time kind of evaluation?

Yeah, I mean, obviously we've got a lot of insights that are generated from our historical data, and from a historical perspective, depending on the needs, you could process that in a more or less batch-oriented approach. For the incoming data, that's new applications, new credit card transactions, we want to marry up the two: take the insights and profiles that we've been able to glean out of the historical information, couple that with the inbound information, and try to extract the features as quickly as possible and score things like fraud.

Does the historical model drift based on the new incoming data, and is there a way for you to sort of prevent that drift without running the whole batch model again?

I guess I wouldn't describe it as drift. I think what we're looking to do is be able to have the new data and the feedback loops that we've built into our products help inform the validity of the model and then be able to improve the model over time. So maybe, I'm not sure if by drift you just mean the model gets out of sync. We actually want to be able to run several models in parallel and compare them, horse race them a bit, and see how they perform.

A/B testing for the models.

Exactly.

Okay, so a job like that with fraud, I guess there are new patterns of behavior coming in all the time that are fraudulent. So you're always trying to enrich the patterns you're looking for, I assume with new data sources and A/B testing more models. How does that process work?
Where, how do you find new data sources that you think might be relevant? And then I guess, how do you tell, when you're A/B testing a bunch of different models, which one has higher fidelity?

Yeah, well, we've got a team of data scientists and a team of business analysts that are chiefly responsible for identifying what those models look like and what source data we need in order to execute against the models. And then we've got historical records on which accounts actually had fraud, and you could, you know, sort of replay those events and see if the new models perform better at predicting fraud.

Okay, so in other words, you use the same training data. But for you versus other credit card companies or banks, it's a permanent arms race to get richer and richer data sources and models.

Well, sure, although I don't view it as an arms race against the other banks. I think the entire financial community is trying to work in a way that we reduce fraud across the board, because any fraudster that gets in through one bank, prior to the bank detecting that it's fraudulent activity, you know, creates additional information in their credit record or their bureau record, and so it potentially improves their chances of getting a credit card at another bank. And so I think, holistically, we're trying to weed that out as quickly as we can.

Yeah, do the capabilities change, or does your approach change, based on the kind of fraud activity? I mean, there's more than one way to, unfortunately, skin that cat, and people misbehave in all kinds of different ways. So depending on how they're misbehaving, do you have a different way to, you know, suss that out, if you will? And are you applying different techniques to do that?

I mean, it really depends on what type of financial product they're trying to exploit.
I mean, obviously the ability to just shut down an account and not allow additional transactions is sort of the lever you can pull to turn things off, but depending on the attack vector, we've got different systems in place and different procedures that try to provide countermeasures for that.

Is this potentially like a public good, your anti-fraud models, where you might want to share them with other banks, and each one is contributing to that greater good? In the sense that you wouldn't look at your model as a competitive differentiator; you would look at it as a cost reduction or risk reduction technology, and if everyone's contributing, you're all going to do better.

It's an interesting question. I don't have an answer for you in terms of what Capital One's long-term approach is. We are very committed to open-source software, we're very committed to community-driven software engineering, and we're trying to open-source and contribute back, and have open-sourced and contributed back, to many different libraries, to the Spark platform, and to others. The models, my guess is, near-term we've got a lot of work to do before we're at a point where we're going to start open-sourcing the models themselves. Mainly, when I say that, the work to do is on the newer technologies and newer platforms. What I demonstrated today was a prototype that we've got running in Spark and using Databricks, so we've got plenty of work to do to tighten that up.

And fraud's not the only game, obviously. I mean, you're in the credit business. Translate what the capabilities are into the other realms of Capital One's business, and again, how are you applying that? What's new and different about the way you're using technology today in that respect on the upside, as opposed to maybe what you were doing 12 months ago or 18 months ago?
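Stepping back to the model horse-racing Chris described a moment ago: replaying labeled historical applications through several candidate models and comparing how each predicts known fraud outcomes might look roughly like this sketch. The models, feature names, and events below are all invented for illustration; Capital One's actual evaluation pipeline is not described in this interview.

```python
def replay_compare(events, models):
    """Replay labeled historical events through candidate models and report
    each model's accuracy, precision, and recall on known fraud outcomes."""
    results = {}
    for name, model in models.items():
        tp = fp = fn = correct = 0
        for features, was_fraud in events:
            flagged = model(features)
            correct += flagged == was_fraud
            tp += flagged and was_fraud
            fp += flagged and not was_fraud
            fn += (not flagged) and was_fraud
        results[name] = {
            "accuracy": correct / len(events),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return results

# Hypothetical candidates: flag on shared identifiers vs. also on velocity.
models = {
    "champion": lambda f: f["shared_ids"] >= 2,
    "challenger": lambda f: f["shared_ids"] >= 2 or f["apps_last_24h"] >= 5,
}

# Replayed history: feature snapshot at application time plus known outcome.
events = [
    ({"shared_ids": 3, "apps_last_24h": 1}, True),
    ({"shared_ids": 0, "apps_last_24h": 6}, True),   # fraud the champion misses
    ({"shared_ids": 0, "apps_last_24h": 1}, False),
    ({"shared_ids": 1, "apps_last_24h": 0}, False),
]
for name, stats in replay_compare(events, models).items():
    print(name, stats)
```

Because the outcomes are already known, both models score the same replayed events, which is what makes the head-to-head comparison fair: here the challenger catches the velocity-based fraud the champion misses.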
Yeah, I can't speak historically for what our position was 12 to 18 months ago, but I know now we are really focused on trying to improve the customer experience and make sure that our products are as intuitive and helpful as possible. We're also trying to help people manage credit better, and so we're using analytics to do a better job of informing them when a bill might be due and giving them the opportunity right then and there, in, say, the mobile application, to pay that bill, or pay the minimum payment, or something beyond the minimum payment. So our objective is really to combine data sources to assess risk, data sources to help customers manage their money and their credit portfolio better, and then obviously the fraud defenses are sort of the underpinning.

Can you use data then for corrective behavior, or for modifying behavior, in your customer base, to keep them from going over the ledge, more or less, and maybe identifying those who are at risk? I hate to say "save us from ourselves," but in a way that's kind of what data can do here.

Yeah, we've got a lot of different applications that we've built. There's an app out in the App Store now called Inform, which is really designed and targeted for people to better understand where they are in terms of their budget slash credit limit, and so prior to making a purchase decision they can run that app and they can see kind of where they are.

It's kind of like, you know, almost along the lines of the wearables and sort of full-time awareness of your fitness level.

That's my fiscal health, right?

Yeah, it's your fiscal health kind of concept.

Yeah. Are you finding the borderline between what was, say, a credit card company and other financial institutions is dissolving because of the ease with which you can build new risk-related or credit-related apps?
Is something changing, you know, or is the rate of change accelerating now, because it's getting to be so much easier, you know, with this app on your mobile phone, to guide you on your, you know, financial transactions?

When you say rate of change, I'm not sure I caught the first part of the question.

I guess I mean, like, where technology makes it easier for you to act like a bank, not just a credit card company, you know, where you can take on activities of entities that, you know, you used to regard as outside your industry.

Yeah, I mean, to the extent that we've got a customer that's using one of our financial products, you know, clearly we'd love to have the customers using more of Capital One's financial products, and there are teams of people that are responsible for how you market to those customers in a way that doesn't violate the customer's communication preferences or regulatory requirements. But at the end of the day, having a holistic view of a customer across the different products that we offer is something that we use mainly to provide a better customer experience, to provide more defenses against identity theft, and also to inform them about how to better manage their money. The ability to offer additional products is kind of outside the space that I focus on.

What are the limitations or walls you have to keep pushing back to make progress? Is it the messiness of the data? Is it, you know, the difficulty of finding, you know, the data scientists who can find the needles in the information haystacks? What, you know, what is it that are the critical resources that are in short supply for you to, you know, keep expanding?

I mean, we're always looking for talented individuals, right?
Whether they're data scientists or software engineers or business analysts. You know, I think the technology and the rate of change with the technology, the amount of data that's coming in, and the ability to clean that data into a usable format as quickly as possible and build algorithms that generate the types of insights that can move the business forward, I mean, those are sort of ongoing challenges, but as you mentioned, prior to sort of getting started with this interview, that's what Capital One's famous for, right?

Right. As you were concluding the demonstration today, and you drilled down into Arlington, Virginia, where I live, and you started focusing on a neighborhood, I thought, my God, he's going right to my house.

Oh no.

So yeah, if you'd gone about three blocks north, just the other side of 66, you would have been in my neighborhood, and I thought, what do you know about me that I don't? So, well, I got a chuckle out of that. I thought, look at all that fraudulent activity going on in my neighborhood.

Yeah, so again...

No, no, totally, that was all fake data. No, no, no. The fun thing is, and as I said in the keynote, you know, the idea is we don't want to use real customer data for a demo like this, obviously, and we didn't use any customer data. But we do want to generate realistic data so that the code can function in a meaningful way and the analytics can function in a meaningful way, and so I think we came up with a fairly clever approach to generating, you know, six million fake accounts: realistic social security numbers that map to the actual age of the applicant, and using census data for the distribution of households and incomes by zip code, and surnames and given names and all of that. And so that by itself was kind of a fun, interesting exercise, and, you know, it'd be interesting to find out if there's ever any real overlap between the real world and that fake data set.
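A toy sketch of the kind of fake-data generation Chris describes: census-style weighted name sampling, incomes tied to zip code, and synthetic SSNs loosely keyed to applicant age. This is not the demo's actual generator; the name pools, weights, and zip-to-income bands below are invented, and a real generator would load actual census frequency tables. One deliberate design choice: the sketch uses area numbers 900-999, which the Social Security Administration has never issued, so the fake SSNs cannot collide with a real person's.

```python
import random

# Invented weighted pools standing in for census frequency tables.
SURNAMES = [("Smith", 24), ("Garcia", 11), ("Nguyen", 6), ("Johnson", 19)]
GIVEN    = [("Maria", 8), ("James", 10), ("Linh", 3), ("Sarah", 7)]
ZIP_INCOME = {"22201": (60_000, 140_000),   # zip -> (low, high) income band
              "94103": (45_000, 160_000),
              "63101": (30_000, 90_000)}

def weighted_choice(rng, pool):
    names, weights = zip(*pool)
    return rng.choices(names, weights=weights, k=1)[0]

def fake_ssn(rng, birth_year):
    """Deliberately invalid SSN (area 900-999 is never issued) whose group
    digits grow with birth year, loosely mimicking 'SSN consistent with age'."""
    area = rng.randint(900, 999)
    group = max(1, min(99, birth_year - 1920))  # older applicant, lower group
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"

def fake_applicant(rng):
    birth_year = rng.randint(1940, 1998)
    zip_code = rng.choice(list(ZIP_INCOME))
    lo, hi = ZIP_INCOME[zip_code]
    return {
        "name": f"{weighted_choice(rng, GIVEN)} {weighted_choice(rng, SURNAMES)}",
        "birth_year": birth_year,
        "ssn": fake_ssn(rng, birth_year),
        "zip": zip_code,
        "income": rng.randrange(lo, hi, 1000),
    }

rng = random.Random(42)  # seeded so a test run is reproducible
for _ in range(3):
    print(fake_applicant(rng))
```

The point of the internal consistency (age matching the SSN, income matching the zip code) is the one Chris makes: the analytics only behave meaningfully if the fake records have the same statistical shape as real ones.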
And don't get me wrong, I'm glad you're on my side.

Yes, all the way. We're there.

Chris, thanks for being with us.

Sure, thank you very much.

We appreciate the time.

Yep, take care.

theCUBE continues here from Spark Summit 2016.