Okay, I think we'll get started. Thanks, everyone, for coming. I know it's late in the day and it's been a long two or three days. My name's Ian Houston, and I'm going to talk today about operationalizing data science on CF. What I mean by that is moving data science models from, say, a data scientist's laptop, where they work fine, into production. I'm not coming with all the answers: some of these are things we've tried, some are questions I have about what other people have tried, and I'm trying to start a conversation about what we should do in the future to get this to actually work. So who am I? I'm a data scientist working at Pivotal Labs, working with clients on predictive analytics and machine learning projects. But I'm also a Cloud Foundry user. I've been using CF for two or three years now, and I've gone as far as writing a build pack, so I've got quite far into the process of how you stage an app and how you make your applications run on Cloud Foundry. Hopefully I can bring a little of that to bear in this talk. What we're really going to be talking about is how we get those machine learning and predictive analytics smarts into applications. The reason this is important is that it's becoming the expectation. As Marc Benioff said recently, everyone wants to see systems and applications that are smarter, that have more predictive capabilities. They want things scored; they want to see the next best opportunity, the next best offer. Basically, they want their applications to get smarter. At the moment we might call these smart apps, applications that have this extra machine learning inside them.
But very soon we're probably just going to call them apps: it will simply be expected by consumers and business users that their applications have these kinds of capabilities. So what's the problem at the moment? Well, if your machine learning model is to provide any business value, it has to go into production. A lot of data science up to now has been in an exploratory mode: you give your data scientists some data, they go off and do some analysis, and they come up with some toy models, maybe something that works on their laptop. But if that only ever lives in a presentation or a report, it's not providing real business value. The way to think about it is that you're investing a lot in your data scientists, and if their work is not making it into production, you're not getting the business value out of that investment. Slide decks don't count as production, by the way. A few people seem to think they do, but it's not true. And a lot of people have this problem. Data scientists certainly have it: we want to see impact from the work we're doing; we don't want our carefully built predictive model to lie on a shelf somewhere and never be used. Developers are starting to feel the need to include these kinds of capabilities in their applications, and we're seeing the growth of an ecosystem. Google talked earlier about some of their machine learning APIs, which make it easier for developers to add these capabilities. And then you've got CIOs and chief data officers who have invested quite heavily in the big data space and now want to see some return on that. The only way that return is going to materialize is if these models get put into production, if they're actually part of the systems that are running.
At Pivotal Labs, we see quite a few of our clients starting out on this journey: they want to implement their first machine learning models and see what results they can get. So it's very important for them that what we're building with them isn't just a one-off experiment that we do, stop, and move on from; it's something that actually lives and can contribute. To do all of that, you need a few things. The first thing you need is a way to actually run your models. That's the day-one problem: the data scientists have put a lot of effort into building a predictive model, and now it has to run for real. There's a lot that goes into that. I'm not going to talk about each of these in turn, but part of the process is loading and transforming the data and training the model. In machine learning we normally have a training phase and a scoring phase, and training is the actual learning part. Then we connect it to real incoming data, and that sounds simple, but in a complex system it can be one of the hardest parts: cleaning and transforming that data so it's in the right form to go into the predictive model. Then there's actually applying the model, which again may sound easy (you've basically got a mathematical formula, and you just run the new data through it), but there's a whole load of issues around that, not just the computation side but also simple things like library versioning. And then finally you take some action. That can take multiple forms: maybe you show someone the next best offer, or the recommended video they should watch.
Or you do something else, like increase their quality of service. Or maybe you do nothing, because you've predicted they're going to churn and you don't actually want to keep them as a customer, so the action is an absence of action. These are all things you need to do just to have anything in production. I'm going to run through a few high-level ideas (maybe architecture is too grand a word for the next few slides) about how to do this on Cloud Foundry, and they follow the typical ways that machine learning and predictive analytics models get put into production. The first one is scoring as a service. In this case, you build your model somewhere else, maybe on some big data system, some kind of distributed cluster, using a lot of data over there. The output of that is something you store in a data service in Cloud Foundry, Redis or something like that. Then you ingest the data in CF and apply the model inside an application; maybe these are separate applications, an ingest app and an apply-or-score app. And then you do something with the result, whether through your front end or by connecting to some business back-end system. This is the simplest thing you can do. You could argue that all the hard work is being done in that gray box in the bottom left, which is doing the predictive part, and you're just using the results of that on Cloud Foundry. Yesterday I gave a talk about build packs and showed a demo of a sentiment analysis model that had been pre-built somewhere else. We pushed it to CF, and we were able to send it new text and see whether the sentiment for that text was positive or negative. That's an example of this first case.
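That scoring-as-a-service pattern can be sketched in a few lines of Python. This is only an illustration under assumptions: a plain dict stands in for the bound data service (Redis in the slide), and the "pre-built" model is a toy dictionary of word weights rather than a real sentiment model. The key names and function names here are made up for the sketch.

```python
import pickle

# Stand-in for a bound data service such as Redis: the external big-data
# system writes the serialized model here; the CF app only ever reads it.
data_service = {}

# --- done elsewhere, e.g. on a distributed cluster ---
prebuilt_model = {"great": 1.0, "good": 0.5, "bad": -0.5, "awful": -1.0}
data_service["sentiment:model:v1"] = pickle.dumps(prebuilt_model)

# --- inside the scoring app on Cloud Foundry ---
def load_model(key="sentiment:model:v1"):
    return pickle.loads(data_service[key])

def sentiment(model, text):
    # Sum the weights of known words; a non-negative total counts as positive.
    total = sum(model.get(word, 0.0) for word in text.lower().split())
    return "positive" if total >= 0 else "negative"

model = load_model()
print(sentiment(model, "What a great demo"))   # positive
print(sentiment(model, "That was awful"))      # negative
```

The point of the split is that the expensive part (building the model) never runs on CF at all; the app just deserializes and applies it.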
The next case includes CF in the model-building part as well. This is something I've called CF-powered learning; we may need to come up with a better name. Here you're ingesting all the data through Cloud Foundry, using something like Spring Cloud Data Flow to transform it and split out different streams, and then you're building the model. Maybe it's built on a Cloud Foundry application instance; maybe it's a CF app controlling something else, sending command and control to your other big data system. There's still a batch update of the model, which is stored somewhere and applied by another application instance. That batch update normally happens at a relatively low frequency compared to the rate at which new data comes in. But you have to start thinking about how you do this update. How does your app know that there's a new version of the model to download and use? How do you keep all of this in sync? If you've been looking at microservices architectures, you can squint and imagine the different parts here composed as different services. I'm going to show a video later of a predictive model that uses this kind of architecture, with three different microservices: a training application, a scoring application, and a front end that presents the results. One more high-level way of doing this is to drop the batch building of the model entirely, in what's called online or in-stream learning: as the data comes in, you update the model as you go.
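A classic example of a single-pass, in-stream algorithm is Welford's method for a running mean and variance: each reading updates the model and is then thrown away, so the stream is never stored or revisited. A minimal sketch, where the "model" is just summary statistics standing in for a real online learner:

```python
class OnlineStats:
    """Welford's single-pass algorithm: constant memory, one look per value."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = OnlineStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)   # each reading is seen once, then dropped
print(stats.mean, stats.variance)   # approximately 5.0 and 4.0
```

Real in-stream learners (online gradient descent, streaming clustering) have the same shape: an `update` per record and no second pass over the data.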
Online or in-stream learning poses a lot of challenges. It might look conceptually simple on the screen, but it's probably the hardest one to do from an algorithmic point of view, because the idea is that you don't get to keep the data; you see each record once as it goes past. So you have to use a single-pass algorithm; you don't get to go back and look through the whole data set again. Google has been doing a lot of work in this area, and there are a few blog posts where they talk about how careful you have to be about the timing of the data ingest: if the records don't arrive in the right order, you might have to reorder them yourself before applying an algorithm that expects things in, say, date-time order. Okay, so how can we do this? Well, you can build it yourself. I've been using things like Spring Cloud Data Flow and Spring Boot applications to build these kinds of infrastructures for some clients, and you can obviously make use of marketplace data services. Today we had a really good talk about Google's services in this area being available now through a service broker. As a data scientist, I'm often using Python to build these models, so being able to deploy those to Cloud Foundry is really important as well, and the official Python build pack now has the ability to deploy the PyData stack, the standard scientific packages for Python, which it wasn't able to do previously. There are also some initial offerings in this area: GE Predix has some of this, IBM's Bluemix has some of this, and Alpine Data have started looking at it. They've got a publish-to-CF button that uses PMML, which is an open, interoperable model-interchange format.
PMML has its issues, though, and they're actually moving to a newer format called PFA, the Portable Format for Analytics. So there are a few different options at this point, no one has standardized on any particular one, and if you weigh it up, building it yourself for your particular use case is probably the best bet at the moment. Okay, so let's look at this video. This is an application that was built by some of our team at Pivotal Labs. It takes sensor readings from the accelerometer in your phone and builds a model of what activity you're doing at any moment. You can see here that our lovely volunteer is first going to link his phone into the application. There's actually just one front-end app driving what you see on both the mobile and the desktop browser, and it uses, as I said, the accelerometer in the phone. This is an example of the CF-powered learning model I showed earlier. So here we're training the model: we give it some data, it stores that data, and it builds a model from it. Here he's walking in place, and the timer is telling you how much data you have to generate before the app will be able to recognize that activity in the future. He's very, very happy about this. This is a fairly simple example in terms of the activities, but you can imagine something like a fitness tracker using these kinds of predictive models to determine whether you're really taking 10,000 steps a day or actually sitting in a cab pretending to take 10,000 steps. Now he's moving side to side; he's basically doing a little dance. It's called the Pivotal Moves app, but maybe we could have called it the Pivotal Dance Moves app.
Eventually we get to the point where we can hit the build-the-model button. What happens there is that the front-end app hands off to the training app sitting in the background, which actually does the predictive analytics. It uses a random forest, an ensemble of decision trees, to take in the accelerometer input and predict one of these three activities, so it's a classification problem. So he's got some training data now; we build the model, it goes off, builds it, and comes back. Then we can go to the scoring phase: as new data comes in now, what happens? You can see, as he's walking in place, the big caption at the top says walking in place, and you can see the accelerometer details. He switches to moving side to side, and as the scoring happens, the caption changes to side to side. So it's determining, as he goes, what the activities are. That's a really good example of CF-powered learning, because the whole thing lived on Cloud Foundry. So that's great: we've got our model, it's working, it's in production, it's taking in real data. What do we need to do next? Well, on day two you want to update the model. You've figured something else out; you want to change the parameters, or you want to use a different type of model. And you suddenly start hitting problems. You now need to know which predictions were made with the old model and which with the new. If you're storing predictions for auditability, you need to know: did you predict that this person could have a mortgage or not, and which version of the model did you use, when the auditor comes along to ask questions? And do you need to continue serving the old predictions? Is it important that the old version of the model is still available to consumers who, in whatever form, have built against it?
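As an aside on the modeling in that demo: a random forest classifier can be caricatured in plain Python. This is a deliberately simplified sketch, not the demo's actual code: each "tree" is just a decision stump trained on a bootstrap sample of the data, and the toy accelerometer features are invented for illustration.

```python
import random

def train_stump(rows):
    # One "tree": bootstrap-resample the data, pick a random feature, then
    # choose the threshold on that feature that classifies the sample best.
    sample = [random.choice(rows) for _ in rows]
    feature = random.randrange(len(sample[0][0]))
    best_correct, best_t = -1, 0.0
    for x, _ in sample:
        t = x[feature]
        correct = sum((xi[feature] > t) == (label == 1) for xi, label in sample)
        if correct > best_correct:
            best_correct, best_t = correct, t
    return feature, best_t

def train_forest(rows, n_trees=25):
    return [train_stump(rows) for _ in range(n_trees)]

def predict(forest, x):
    votes = sum(x[f] > t for f, t in forest)   # majority vote over stumps
    return 1 if 2 * votes > len(forest) else 0

# Toy features per window of readings: (mean magnitude, peak magnitude);
# label 1 = walking in place, 0 = standing still.
rows = [((1.0, 1.2), 0), ((1.1, 1.3), 0), ((0.9, 1.1), 0),
        ((8.0, 9.5), 1), ((7.5, 9.0), 1), ((8.2, 10.0), 1)]
random.seed(0)
forest = train_forest(rows)
# On this well-separated toy data the vote recovers the two activities.
print(predict(forest, (7.9, 9.3)), predict(forest, (1.0, 1.2)))
```

A real random forest grows full decision trees and randomizes feature choice at every split, but the shape is the same: many weak, randomized learners combined by voting.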
Maybe you're building this for customers, and some of them want to move to the new version while some want to stay on the old one. It's a bit like Spotify bringing in a new prediction algorithm for which songs you'll like: maybe some people want to stay on the old one. There's the whole usual problem of dependency management, so that needs to be looked at. And there are data schema changes: is the data coming in different from before? Has it changed in the underlying system? Has the data you're storing changed? Are you doing a different transformation? And can you replay the stream if necessary? If someone comes back and asks, okay, why did we decide that this person liked this song, that this person could have this mortgage, that this person's card transaction should be denied, can we replay it so we understand what decisions were made? Some predictive models have more explainability than others, so sometimes you can get that out of the model itself, but you're going to need to have the inputs, and you're going to need to have them all versioned. So to me, this starts to look like we can't just do what was done in that demo, storing the model as a serialized object in some data store, maybe Redis, then pulling it out and using it again. Maybe we need something more. Maybe we need something like a model service: something that, when you ask for a model, gives you the right one for your use case. Maybe it's versioned. Maybe it can parse data with different schemas, so it has the ability to deal with legacy data. Maybe it serves the appropriate version depending on who's consuming it. And it's possibly storing the underlying data in some more general or unstructured form, so that you're able to retrain and reproduce the results later on. I don't think there's anything quite like this on Cloud Foundry yet. But there are a few other things out there.
They're kind of going in this direction. There's a project called Palladium from the Otto Group that does something like this: you send it data and you get back a versioned model. And there's PredictionIO, which I think was just bought by Salesforce, with the technology donated to the Apache Foundation, so it's now Apache PredictionIO (incubating). They have something like this, but it's a whole system in itself: they have a server (I don't know exactly what the underlying technology is), and you can implement replaceable algorithms as little extensions, effectively. I think some people have looked in the past at moving that onto Cloud Foundry, at whether we'd be able to deploy it as a service or an application on CF, but I don't think that went very far. So maybe there's a need for someone to build something more general that could be reusable in different scenarios. That's my hope for this talk, really: to start a conversation about what's needed here. I have a few ideas myself, but I'd like to hear from other people who need this functionality. How do we build it? And more generally, what's next for doing data science on Cloud Foundry? I think we're still very much at the beginning of this journey, even just going by the number of talks at this summit on data science topics. There's a lot now about IoT, and suddenly this year there's a lot more ingesting of IoT data. The thing that immediately screams at you then is, okay, we need to do something with all this baggage tag data, for example, that we heard about earlier. To do that, we probably need more data services and more data computation services, and I know there are a few different projects to incorporate Spark, and some other distributed computation systems, into a CF workflow.
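Coming back to the model service idea: at a minimum it would serve the right model version per consumer and record which version produced each prediction, for later audit or replay. Here's a minimal sketch of that shape. All the names (`ModelService`, `predict_for`, the mortgage-style models) are invented for illustration, not an existing API.

```python
import datetime

class ModelService:
    """Versioned model registry: serves the right model per consumer and
    logs every prediction with its model version for later audit/replay."""
    def __init__(self):
        self.models = {}        # version -> callable model
        self.pins = {}          # consumer -> pinned version
        self.latest = None
        self.audit_log = []     # (timestamp, consumer, version, input, output)

    def register(self, version, model):
        self.models[version] = model
        self.latest = version

    def pin(self, consumer, version):
        self.pins[consumer] = version   # legacy consumers stay on old versions

    def predict_for(self, consumer, x):
        version = self.pins.get(consumer, self.latest)
        y = self.models[version](x)
        now = datetime.datetime.now(datetime.timezone.utc).isoformat()
        self.audit_log.append((now, consumer, version, x, y))
        return y, version

svc = ModelService()
svc.register("v1", lambda income: "approve" if income > 30000 else "decline")
svc.register("v2", lambda income: "approve" if income > 40000 else "decline")
svc.pin("legacy-app", "v1")

print(svc.predict_for("legacy-app", 35000))   # scored with the pinned v1
print(svc.predict_for("new-app", 35000))      # scored with the latest, v2
# When the auditor asks, every entry records model version and inputs.
```

In a real deployment the registry and log would live in a bound data service rather than in memory, but the contract is the part that matters: version in, version out, everything recorded.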
We need more examples of how this works, some demos; I'm trying to collect some of those and make them available to the community, or at least just list them. And we need examples of successful projects where this has worked. Hopefully at the next CF Summit someone will give a talk about a project where they've used Cloud Foundry with machine learning, and the community can learn from that. I think we also need to start building some of these building blocks, so that it's a bit easier to get started, not everyone is reinventing the wheel every time they want to build something like this, and this model-service capability is there. And that's really all I've got. This dsoncf.com site is my list of examples of data science on CF that I've found; if anyone has any other examples, please let me know, either on the website or via Twitter. Apart from that, thank you very much for being here. Are there any questions? [In response to an audience question:] No, I don't think there's any limit on the kind of learning you can do. If you think about CF at a basic level, it's just another compute system in some ways, so as long as you can get the data in, do something to it, and store the result, you can do some training. I think the limitations are technical: if you want to do a very large batch training run over terabytes of data, doing that on a single application instance in CF is probably not the way to do it. And maybe in the future, with isolation segments, where you can have different zones of computational power, you might be able to hive off a part of the platform with very high-powered machines for your data science work and keep the relatively smaller machines for your web apps.
And there are a few other things coming. We were talking earlier about tasks being a really good way of running a one-off job like a training run, and when tasks come into CF proper, it will be interesting to see how they can be used in the data science context. Any other questions? Okay, thanks everyone.