Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to Boston, everybody, where it's a blizzard outside and a blizzard of content coming to you from Spark Summit East, hashtag Spark Summit. This is theCUBE, the worldwide leader in live tech coverage. Joel Cumming is here; he's the head of data at Kik. Welcome to theCUBE.

Thanks, thanks for having me.

So tell us about Kik, this cool mobile chat app. Checked it out a little bit.

Yeah, so Kik's been around since about 2010. As you mentioned, it's a mobile chat app startup based in Waterloo, Ontario. Kik really took off in 2010, when it got two million users in the first 22 days of its existence. So it was insanely popular, specifically with US youth, and the reason for that is that Kik started off at a time when text messages cost money, back in 2010, and not every kid had a phone like they do today. So if you had an iPod or an iPad, all you needed to do was sign up, you had a username, and now you could text with your friends. So kids could do that just like their parents could with Kik, and that's really where we got our entrenchment with US youth.

And you're the head of data, so talk a little bit about your background. What does it mean to be a head of data?

Yeah, so prior to joining Kik I worked at BlackBerry, and I like to say I was there probably from just before you bought your first BlackBerry until just after you bought your first iPhone. So kind of in that range, but I was there for nine years.

Yeah, can I do that with real estate?

Yeah, I'd love to be able to do that with real estate. But it was a great time at BlackBerry. It was very exciting to be part of that growth.
When I was there, we grew from 3 million to 80 million customers, and from 3,000 employees to 17,000 employees. And of course, things went sideways for BlackBerry, but toward the end of my time there I was working on BBM, leading a team of data scientists and data engineers. BBM, if you're not familiar with it, is a chat app as well, and Kik is headquartered across town. So the appeal of moving to Kik was a company that was very small and fast-moving, but that wasn't leveraging data at all. When I got there, they had a pile of logs sitting in S3, waiting for someone to take advantage of them. They were good at measuring events and looking at how those events tracked over time, but not really combining them to understand or personalize any experience for their end customers.

So they knew enough to keep the data?

They knew enough to keep the data. Weren't sure what to do with it.

Okay, so you come in, and where did you start?

So the first day that I started was the first day I used any AWS product. I had worked on the big data tools at the old place, with Hadoop and Pig and Hive and Oracle and those kinds of things, but had never used an AWS product until I got there. And it was very much sink or swim. On my first day, our CEO said in a meeting, okay, you're the data guy here now; I want you to tell me in a week why people leave Kik. Man, we don't even have a database yet. So the first thing I did was fire up a Redshift cluster, the first time I had done that, and look at the tools available in AWS to transform the data, using EMR and Pig and those kinds of things. And I was fortunate enough to figure that out in a week. I didn't give him the full answer of why people left, but I was able to give him some ideas of places we could go based on some preliminary exploration.
So I went from leading a team of about 40 people to being a data team of one, writing all the code myself. Super exciting. Not the experience that everybody wants, but for me it was a lot of fun. Over the last three years we've built up the team; now we have three data engineers and three data scientists, and data is more important to people every day at Kik.

What sort of impact has your team had on the product itself and the customer experience?

So in the beginning it was really just trying to understand the behaviors of people across Kik, and that took a while to wrap our heads around. Any good data analysis combines behaviors you ask people about and behaviors you actually observe. I had an old boss who used to work at Rogers, which is a telecom provider in Canada, and he said if you ask people what they watch, they tell you documentaries and the news and very important stuff. But if you see what they actually watch, it's reality TV and trashy shows. So the truth is really somewhere in the middle.

There's an aspirational element.

So for us, really understanding the data we already had, instrumenting new events, and then in the last year and a half building out an A/B testing framework has been instrumental in how we leverage data at Kik. We were making decisions by gut feel in the very beginning. Then we moved into an era of A/B testing, very focused on statistical significance and rigor around all of our experiments. But then we stepped back and realized maybe the bets we were making weren't big enough, and we needed to bet a little more on bigger features that have the opportunity to move the needle. So we've been doing that recently with a few features we've released, but data is super important now, both to stimulate the creativity of our product managers and to measure the success of those features.
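The statistical-significance checks Joel mentions for A/B experiments can be sketched with a standard two-proportion z-test. This is a minimal, illustrative example, not Kik's actual framework; the function name and the retention numbers are made up for the sketch.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a/n_a and conv_b/n_b are conversions over sample sizes for
    variants A and B. Returns (z statistic, p-value) for H0: rates equal.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: variant B retains 5,300 of 50,000 users
# versus 5,000 of 50,000 for the control.
z, p = two_proportion_z_test(5000, 50000, 5300, 50000)
significant = p < 0.05
```

For small lifts on small samples the same test will come back non-significant, which is exactly the "are our bets big enough" tension Joel describes.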
And how do you map to the product managers who are defining the new features? Are you a central group? Are you sort of point guards within the different product groups? You make evidence-based recommendations, but presumably they ultimately make the decisions. What's the dynamic?

It's a great question. In my experience, it's very difficult to build a structure that's perfect. In the purely centralized model, you've got the problem that people coming to ask for something may get turned away because you're too busy. And in the decentralized model, you tend to have lots of duplication and overlap, and maybe not share all the things that you need to share. So we tried to build a hybrid of both. We had our data engineers centralized, and we tried doing what we call tours of duty: our data scientists would be embedded with various teams within the company. It could be the core messenger team, it could be our platform team, it could be our anti-spam team. They would sit with those teams, and it's very easy for product managers or developers to ask them questions and get answers. Then we would rotate those folks through a different tour of duty after a few months, and they would sit with another team. We did that for a while and it worked pretty well. But one of the major problems we found was that there's no good checkpoint to confirm that what they're doing is right. In software development, when you're releasing a version of software, there's QA, there's code review, and there's structure in place to ensure that, yes, this number I'm providing is right. It's difficult for a data scientist who's out with a team to come back and get that peer review. So now we're reevaluating that. We use an agile approach with primes for each of these groups, but now we all sit together.
So the accountability is, after the data scientist makes a recommendation that the product manager agrees with, how do you ensure that it measured up to the expectation? Sort of after the fact.

Yeah, in those cases it's our A/B tests. It's nice to have that unbiased data resource on the team, embedded with them, who can step back and say, yes, this idea worked, or it didn't work. So that's the approach we're taking: having that prime resource, not a dedicated resource, for each of these teams, who is a subject matter expert and then evaluates the results in an unbiased kind of way.

So you've got this relatively small data team, even though it's quadruple the size it was when you started, and then there's the application development team. Are they sort of colleagues? How do you interact with them?

Yeah, we're actually part of the engineering organization at Kik, part of R&D. At different times in my life I've been part of different organizations, whether it's marketing or IT or R&D, and R&D really fits nicely. The reason I think it's the best fit is that if there's data you need to understand users better, there's much more direct control over getting that element instrumented within the product when you're part of R&D. If you're in marketing, you're like, hey, I'd love to know how many times people tap on that red button, but no event fires when that red button's tapped. Good luck trying to get the software developers to put that in. But when there's an inherent component of R&D that's dependent on data, and data has that direct path to those developers, getting that kind of thing done is much easier.

So from a tooling standpoint, thinking about data scientists and data engineers, a lot of the tools that we've seen in this so-called big data world have been quite bespoke: different interfaces, different experiences. How are you addressing that? Does Spark help with that? Maybe talk about that.
Yeah, so I was fortunate enough to do a session today that talked about data V1 at Kik versus data V2 at Kik, and we drew this kind of line in the sand. When I started, it was just me, trying to answer these questions very quickly on the three- or five-day timelines we got from our CEO.

You've been here a week, come on.

Yeah, exactly. So you sacrifice data engineering and architecture when you're living like that, so that you can answer questions very quickly. It worked well for a while, but then all of a sudden we looked up and we had 300 data pipelines. They were a mess, hard to manage and control. We had code sometimes in SQL, sometimes in Python scripts, sometimes on people's laptops. We had no real plan for Git integration, and no real scalability out of Redshift. We were doing a lot of our transformation workloads in Redshift, just because it was easy: get the data into Redshift, write some SQL, and you have your results. We were running into contention problems with that. So we decided to stop, step back, and say, okay, how are we going to house all of this atomic data that we have in a way that's efficient? When we started with Redshift, our database was 10 terabytes; now it's 100, and we get five terabytes of new data coming in per day. Putting all of that in Redshift doesn't make sense; it's not all that useful. We don't want to get rid of the atomic data, so how do we control that data under supervision? We decided to go the data lake route, even though we hate the term data lake: basically a folder structure within S3, stored in a query-optimized format like Parquet. And now we can access that data very quickly at an atomic level, at a cleansed level, and also at an aggregate level.
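The layered folder structure Joel describes can be sketched as a key-naming convention. This is an illustrative layout under assumed names, not Kik's actual bucket or partition scheme; the point is that Hive-style `year=/month=/day=` folders let a query engine like Spark prune to one day of Parquet files instead of scanning the full history.

```python
from datetime import date

# Assumed bucket and layer names for illustration only.
BUCKET = "s3://example-data-lake"
LAYERS = ("atomic", "cleansed", "aggregate")

def parquet_prefix(layer: str, dataset: str, day: date) -> str:
    """Build the folder prefix where one day of a dataset's Parquet files live."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return (f"{BUCKET}/{layer}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

prefix = parquet_prefix("cleansed", "chat_events", date(2017, 2, 8))
# -> s3://example-data-lake/cleansed/chat_events/year=2017/month=02/day=08/
```

Each layer answers a different question: atomic keeps the raw events forever, cleansed holds deduplicated and normalized records, and aggregate holds the rollups that get pushed into a warehouse for analysts.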
So for us, data V2 was the evolution of stopping a lot of the ways we used to do things, which was lots of data pipelines, code that was all over the place, and aggregations in Redshift, and starting to use Spark, specifically Databricks. We think of Databricks in two ways. One is managed Spark, so that we don't have to do all the configuration we used to have to do with EMR. The second is notebooks that we can align with all the work we're doing, with revision control and Git integration as well.

A question to clarify: you've got the data lake, which is the file system, with the data in Parquet files. This is where you want to have some sort of interactive experience for business intelligence. Do you need some sort of MPP server on top of that to provide interactive performance? Because I know a lot of customers are struggling at that point: they've got all the data there and it's organized, but if they really want to munch through that huge volume, they find it slows to a crawl.

Yeah, it's a great point. We're at the stage right now where, at the top layer of our data lake, where we aggregate and normalize, we also push that data into Redshift. What we're trying to do with Redshift is make it a read-only environment, so that our analysts and developers know they have consistent read performance on it, where before, when it was a mix of batch jobs and read workloads, they didn't have that guarantee. So you're right, and what we think will probably happen over the next year or so is that the advancements in Spark will make it much more capable as a data warehousing product, and then you have to start to question: do I need both Redshift and Spark for that kind of thing?
But today, I think the cost-based optimizations that are coming, or at least the promise of them coming, I would hope those will help Spark become more of a data warehouse, but we'll have to see.

So to carry that thread a little further, in terms of things you'd like to see on the Spark roadmap, things that could be improved, what's your feedback to Databricks?

Yeah, we're fortunate, we work with them pretty closely. We've been a customer for about half a year, and they've been outstanding to work with. Structured streaming is a great example of something we've worked pretty closely with them on, and we're really excited about it. We have certain pockets within our company that require very real-time data: obviously the operational components, is your server up or down, as well as our anti-spam team, who require very low-latency access to data. Typically, if we batch every hour, that's fine in most cases. But with structured streaming, our data streams come in through Kinesis Firehose and we can process them without having to worry about checking whether it's time we should start this batch, or whether all the data is there so we can run it. Structured streaming simplifies a lot of that workload for us. So that's something we've been working with them on.

The other things we're really interested in, we've got a bit of a list, but the other major one is: how do you start to leverage this data for personalization back in the app? Today we think of data in two ways at Kik. One is data as KPIs: the things you need to run your business, maybe it's A/B testing results, maybe it's how many active users you had yesterday, that kind of thing. The second is data as a product: how do you provide personalization at an individual level, based on your data science models, back out to the app?
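The hourly-batch bookkeeping that Joel says structured streaming eliminates looks roughly like this: before kicking off a batch job, you had to confirm that every expected input file for the hour had actually landed. A minimal sketch, with illustrative file names and an assumed fixed shard count:

```python
# Batch-era readiness check: don't start the hourly job until all
# expected shard files for the hour are present in the landing area.
EXPECTED_SHARDS = 4  # e.g. one file per delivery shard (assumed)

def hour_is_ready(landed_files, hour_prefix, expected=EXPECTED_SHARDS):
    """Return True once all shard files for the given hour have landed."""
    present = [f for f in landed_files if f.startswith(hour_prefix)]
    return len(present) >= expected

landed = [
    "logs/2017-02-08-14/shard-0.gz",
    "logs/2017-02-08-14/shard-1.gz",
    "logs/2017-02-08-14/shard-2.gz",
]
ready = hour_is_ready(landed, "logs/2017-02-08-14/")  # still waiting on shard 3
```

With structured streaming, the engine tracks which input has been processed and handles late data itself, so this kind of polling-and-counting scaffolding goes away.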
So we do that, and I should point out that at Kik we don't see anybody's messages; we don't read your messages and we don't have access to them. But we have the metadata around the transactions you have, like most companies do. That helps us improve our products and services under our privacy policy: okay, who's building good relationships, who's leaving the platform, and why are they doing it? But we can also surface components that are useful for personalization. If you've chatted with three different bots on our platform, that's important for us to know if we want to recommend another bot to you. Or the classic people-you-may-know recommendations: we don't do that right now, but behind the scenes we have the kind of information that can help personalize that experience for you.

Those two things are very different at a lot of companies. There's an R&D element: at BlackBerry, the App World recommendation engine was something a team ran in production, but our team was helping those guys tweak and tune their models. It's the same kind of thing at Kik, where our data scientists are building models for personalization and then we need to serve them back up to the rest of the company. And the process right now of taking the results of our models and putting them into a real-time serving system isn't that clean. So we do batches every day on things that don't need to be near real-time, things like predicted gender. If we know your first name, we've downloaded the list of baby names from the US Social Security website, and we can say, for the name Pat, 80% of the time it's male and 20% of the time it's female, but Joel is 99% of the time male and 1% of the time female. So based on your tolerance for whatever you want to use this personalization for, we can give you our degree of confidence on that.
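The name-frequency lookup Joel describes can be sketched in a few lines. The counts below are made up for illustration (the real table would be derived from the US Social Security baby-name data), and the threshold parameter models the "tolerance" he mentions: different features can demand different confidence levels.

```python
# Illustrative name -> gender frequency table; values are assumptions.
NAME_FREQUENCY = {
    "pat":  {"male": 0.80, "female": 0.20},
    "joel": {"male": 0.99, "female": 0.01},
}

def predicted_gender(first_name, min_confidence=0.90):
    """Return (gender, confidence), or (None, confidence) if below threshold.

    min_confidence is the caller's tolerance: a low-stakes recommendation
    might accept 0.6, while a user-visible feature might demand 0.99.
    """
    freqs = NAME_FREQUENCY.get(first_name.lower())
    if freqs is None:
        return None, 0.0
    gender, confidence = max(freqs.items(), key=lambda kv: kv[1])
    return (gender if confidence >= min_confidence else None), confidence

joel = predicted_gender("Joel")                      # confident at the default
pat = predicted_gender("Pat")                        # abstains at 0.90
pat_loose = predicted_gender("Pat", min_confidence=0.6)  # accepts at 0.6
```

Scores like these are what get batch-computed daily and surfaced through the API, since they don't need to be near real-time.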
That's one example of what we surface right now in our API, back to our own first-party components of our app. But in the future, with more real-time data coming in from Spark streaming, with more real-time model scoring, and then the ability to push that over into some sort of capability that can be surfaced up through an API, it gives our data team the capability of being much more flexible and fast at surfacing things that provide personalization to the end user. As opposed to what we have now, which is all this batch processing, loading once a day, and knowing that we can't react on the fly.

So if I were to try to turn that into a Spark roadmap, it sounds like it's the process of taking the analysis and doing perhaps even online training to update the models, or just re-scoring if you're doing something slightly less fresh, but then serving it up from a high-speed serving layer. That's when you can take data that's coming in from the app and send it back to improve the app in real time.

Exactly, yeah.

That's what you're looking for.

Yeah. You and a lot of other people.

Yeah, I think so. Okay. So how's the event been for you?

It's been great. There are some really smart people here. It's humbling when you go to some of these sessions. We're fortunate in that we try not to have to think about a lot of the details that people are explaining here, but it's really good to understand them and know that there are smart people fixing these problems. As with all events, there have been some really good sessions, but the networking is amazing. So I'm meeting lots of great people here and hearing their stories too.

And you're hoping to go to the hockey game tonight?

Yeah, I'd love to go to the hockey game. That would be great.

See if we can get through the snow. Who are the Bruins playing tonight?

San Jose.

Oh, it could be a good game.

Yeah. A little rivalry, yeah. You guys get to the hockey game, you're in.
All right, good. All right, Joel, listen, thanks very much for coming to theCUBE. Great segment, really appreciate your insights and sharing.

Okay, thanks for having me.

You're welcome. All right, keep it right there, everybody. George and I will be back right after this short break. This is theCUBE, we're live from Spark Summit in Boston.