This is George Gilbert. We're on the ground at the Data Science Summit at the Marriott Marquis in San Francisco, and we're with Jan Neumann, who denies he's related to John von Neumann, but based on the work he's done it sounds like a distinct possibility. He works for Comcast, and their journey with machine learning, applied to technologies we all experience, is very interesting. Jan, why don't you tell us what the initial application was, how it got started, and how it got applied?

Yeah, the group I work for, which is Comcast Labs in Washington, D.C., originally was a startup that focused on video search applications. Comcast bought them in 2005 (I wasn't part of it at the time) to power the video web search behind their Comcast.com website. The secret sauce there was really to use natural language technology to not just identify when a word is spoken within a video, but also what the context around it is. So there was a core team that used NLP, which falls under machine learning nowadays. And then after a couple of years, when they saw how successful that basic research was, Comcast decided to let this group develop additional machine learning algorithms to support content discovery within Comcast.

Before we go to the next application, how was the NLP applied, using context, to find content?

The idea was that you index every single word that is spoken within the video, and at the same time you also look at the related words next to it, to identify the segment that is semantically related to the word you're searching for. So if I type in a word, it will not just take me to the moment the word was spoken; it starts a little earlier, at the beginning of the semantically related segment.

Now, would the end user for this be, say, the news department looking up related clips, or would it be end users who are searching for...

This was for end users who were using the Comcast website to look for interesting web videos. They were typing in search terms they were interested in, and it would return a ranked list of clips, subsets of the larger web videos, that hopefully matched what they were looking for.

Was it Comcast content, or was it broader content?

It was not necessarily Comcast content, but content distributed via the Comcast website at the time.

Okay. And just to touch on the NLP tools: were these open source tools that you applied to extract intelligence from the video, or were the tools themselves also developed based on, say, academic research?

These were developed in-house; at that point the secret sauce of the startup was its in-house developed capabilities. And at that point you probably wouldn't call it big data; it was more like traditional small-scale machine learning, handwritten C code, et cetera. But since then we have grown up, so to speak, and Comcast decided to give us the responsibility to develop all the content discovery algorithms that now power the new X1 entertainment system.
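To make the indexing idea concrete, here is a minimal sketch, in Python, of the word-plus-context approach Jan describes: every spoken word is indexed with its timestamp and the segment it belongs to, and a query returns the start of the semantically related segment rather than the exact moment the word was spoken. All names, data structures, and the pre-computed segmentation here are illustrative assumptions, not Comcast's actual system.

```python
# Hypothetical sketch: index every spoken word with its timestamp and
# segment, then answer a query with the start of the surrounding
# semantically related segment instead of the exact word timestamp.

from collections import defaultdict

def build_index(transcript):
    """transcript: list of (word, timestamp_seconds, segment_id) triples."""
    index = defaultdict(list)
    segment_starts = {}
    for word, ts, seg in transcript:
        index[word.lower()].append((ts, seg))
        # Remember the earliest timestamp seen for each segment.
        segment_starts[seg] = min(ts, segment_starts.get(seg, ts))
    return index, segment_starts

def search(query, index, segment_starts):
    """Return playback start times: the beginning of each matching segment."""
    hits = index.get(query.lower(), [])
    return sorted({segment_starts[seg] for _, seg in hits})

transcript = [
    ("the", 12.0, 0), ("senate", 12.4, 0), ("voted", 12.9, 0),
    ("today", 13.3, 0), ("on", 14.0, 0), ("healthcare", 14.2, 0),
]
index, starts = build_index(transcript)
print(search("healthcare", index, starts))  # -> [12.0], not 14.2
```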
Okay, so tell us more: what was that growth path, and what challenges did you have to overcome, skills-wise, data, scalability, things like that?

The challenge was that initially we were a relatively small team, only about 8 to 10 PhDs at that time, with backgrounds mostly in NLP. My personal background is computer vision, so I was asked to join to add image analysis and video analysis capabilities to the mix. But in the end the underlying technologies are all machine learning based. So they asked us, could we develop a recommendation algorithm in-house? And one of our researchers came up with an algorithm that we then compared against external vendors, and we found that our internal algorithm was performing as well as or better than what other vendors were offering us.

As in the distributor of movies via DVDs?

Well, they obviously did not give us their algorithm, so we can't really compare one to one. But against the companies who spent a lot of time saying, hey, we can offer you our recommendation solution, and who actually worked with TV providers in Europe and other places, our internal algorithm performed as well in customer testing. So then Comcast made a strategic decision to develop the algorithm in-house and grow the team. We hired data engineers, people who were familiar with Hadoop and other big data technologies, and then scaled up a solution that is now powering the recommendation systems behind the X1 system.

So it sounds like it's a Comcast secret sauce, and it's not something that you'll license out as a piece of advanced technology?

Correct. It's developed in-house because of the advantage of being able to adapt to our unique constraints, and given the size of Comcast it was worthwhile to do it in-house versus licensing other technologies. We're definitely building on top of other technologies like Hadoop, like Spark, and also GraphLab here, which is also part of what we're using, to basically build a best-of-breed model, but one that takes into account a lot of the peculiarities of our data to give our customers the best solution in the end.
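As an illustration of building on those technologies (and emphatically not Comcast's in-house algorithm), here is a minimal sketch using Spark MLlib's stock ALS recommender from PySpark: the same few lines a data scientist prototypes on a laptop run unchanged on a cluster. The data and column names are made up.

```python
# Illustrative only: a stock collaborative-filtering recommender in PySpark,
# sketching the paradigm of prototyping locally and scaling out unchanged.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recs-sketch").getOrCreate()

# Toy viewing data: (viewer, program, watch score).
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0), (2, 12, 4.0)],
    ["user", "item", "rating"],
)

# Fit a small matrix-factorization model.
als = ALS(userCol="user", itemCol="item", ratingCol="rating",
          rank=8, maxIter=5)
model = als.fit(ratings)

# Top-2 program recommendations per viewer.
model.recommendForAllUsers(2).show(truncate=False)
```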
Okay. Can you give us a little more sense of how the third-party technologies gave you a higher starting point?

I mean, the advantage nowadays is that machine learning technologies are so advanced that, as you saw in the keynote earlier this morning by Carlos, it becomes easier and easier to try out ideas at scale, right? As data scientists, we are good at the math, we're good at understanding machine learning models, but we're not necessarily good at writing high-performance code from the bottom up. What the existing technologies allow us to do is build upon all that distributed work, add the secret sauce on top of it, and take the models that we built on our laptops and scale them up to Comcast scale.

Sort of the way Spark says, we want to make it easy for you to take what was done perhaps in a single-user notebook and scale it.

So we're definitely benefiting a lot from this paradigm change.

Tell us, what are some of the scalable pieces that you're building on that weren't available a couple of years ago?

The hardest problem we had initially was that it was very difficult to transfer complex machine learning models that we built on our own machines, using Python for example, onto the Hadoop platform. We had to use Java; we had to rewrite from scratch to fit the MapReduce paradigm. Now, with the advent of Spark and other tools like GraphLab, it becomes much easier: the data scientist who develops the algorithms can use the same tool to also deploy the solution, and you avoid mistakes being made in the translation from the model to the actual deployment.

Let me just jump in there for a second. You're saying that the development platform is also the deployment platform, which is potentially significant. So you operationalize the models, maybe on a different cluster, but with the same technology. Is that what you're saying?

That's what I'm saying. I mean, we're not all the way there, but that's the process going on.

Okay, wow. So what might this look like in the future?

Well, I think that convergence will continue. Given the quick growth of the tools, the ease of deployment, and other technologies like Docker, which make it very easy to have the same technology stack on your laptop that you have in the data center, we will be able, as data scientists, to develop a solution and directly translate it into a company-wide deployment.

Now, Hadoop is getting hardened, but there are some seams in there in terms of operational complexity and development complexity, because it's more of an ecosystem than a product. Spark, though, is making great strides in coming together, so that you can integrate streaming, machine learning, SQL querying, and graph processing, and with the Tungsten project, run at speed on the metal. Is that something that looks like it's going to pull you in that direction?

Well, we're definitely very interested in Tungsten, and we're monitoring these technologies very closely. Some of the work I presented today about real-time recommendations we also presented at Spark Summit East this year, so we're definitely very active in exploring how Spark and other technologies can help us improve our infrastructure and deployments, make our pace of development quicker, and overall improve the quality of our product.

Okay. And with that, I'm George Gilbert on the floor. Thank you for watching, and we'll be back shortly.
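As a closing illustration of the laptop-to-data-center convergence discussed above, here is a minimal sketch, assuming PySpark, of how the exact same application code can run locally during development and on a cluster in production, with only the master setting changing; in practice that setting is usually supplied via spark-submit rather than hard-coded.

```python
# Sketch of the dev/deploy convergence: identical code on laptop and cluster.

import sys
from pyspark.sql import SparkSession

def main(master):
    spark = (SparkSession.builder
             .appName("same-code-everywhere")
             .master(master)
             .getOrCreate())
    # Any job would do; a trivial aggregation stands in for a real pipeline.
    df = spark.range(1_000_000)
    print(df.selectExpr("sum(id)").first()[0])
    spark.stop()

if __name__ == "__main__":
    # "local[*]" on a laptop; e.g. "yarn" or "spark://host:7077" in the
    # data center. The application code itself is unchanged.
    main(sys.argv[1] if len(sys.argv) > 1 else "local[*]")
```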