OK, I think that's the last few stragglers coming in now, so maybe we'll get started. My name is Ian Huston, and along with Alexander Kagoshima here, we're going to talk about data science on Cloud Foundry. Something Andrew Clay Shafer said in his talk this afternoon really resonated with me, about how we're trying to build a community of practice. And I think that's really what we're doing here as well. So we're going to talk a little bit about how we think about doing data science on CF. But we'd also really like to hear any input from you: what you've done, what you've tried, what worked, what hasn't worked. And we'll talk a little bit about how you can maybe get involved later. So first of all, who are we? We're both working as data scientists at Pivotal Labs, which is the agile software development arm of Pivotal. We both actually work in Europe, Alex in Berlin, myself in London, and we've been using Cloud Foundry for the last few years to deliver data-driven applications for our customers. What we really do with our customers is try to work with them to get value out of their data. Maybe just have a quick show of hands. Who here would identify themselves as a data scientist? OK, we've got a few. So it's maybe not as rare as I thought. And who works with data scientists, or provides services or operations for data scientists? OK, so a lot more hands going up. And who doesn't know, or has heard the buzzword but doesn't really know, what a data scientist is? Anyone? So you all know what data science is. OK, that's great. A really brief recap, then: what is a data scientist and what is part of their job? This Venn diagram was famously created by Drew Conway, and it shows the mix of skills you need to have to be a data scientist. You definitely need programming skills, your hacking and coding skills.
But you also need quite a lot of maths and statistical knowledge. And then to actually apply that to a problem, you need domain knowledge in some area. When you get the intersection of all three of these, you get data science and a data scientist. Maybe a different way of saying it is this quote from Josh Wills, which says that a data scientist is a person who's better at statistics than any software engineer and better at software engineering than any statistician. And the point about this is that we're not really software engineers. In the main, we don't have computer science backgrounds. I have a physics research background, for example, and some machine learning background, but we didn't really go through a traditional software engineering education. And I think what that means is that a platform like Cloud Foundry is actually really ideal for us, because we are the people who really don't want to get bogged down in setting up and configuring servers, maintaining them and doing operations on them. Really we're trying to get quickly to business value by understanding data and providing some insights. So where software developers in the past had to stand up servers themselves, provision them and do those kinds of things, as a data scientist that is really not my core skill, my core competency. I want to actually be doing a data science task; I don't really want to be doing that. So that's why Cloud Foundry is kind of interesting for us. Briefly though, what are the types of projects that we actually work on? Well, there's a wide variety of them. Here are three fairly straightforward examples. For example, you could be an insurance company that wants to understand its risk. You have insurable risks, buildings in different places, and maybe you want to understand how natural disasters like earthquakes or flooding will affect those buildings. So how much money would you lose if a particular country or a particular region flooded?
So we have a client who's trying to do this, and they're trying to run large-scale, very computationally intensive tasks. What we're trying to do is help them run those in a parallel way, maybe using in-database systems, and go from being able to run one or ten of these statistical procedures to being able to run 1,000 or 10,000 of them, to get a better understanding, better insight and, in effect, reduce the risk that they have. We've heard a lot today about the internet of things, or the industrial internet, and predictive maintenance comes under that sort of heading. This is where we have some mechanical thing, maybe hard drives, or maybe it's an oil drilling platform, and you're trying to predict when it'll fail, because the cost of having that system out of production is very high. I've heard of people with systems where the cost is hundreds of thousands of dollars if they're out for one hour or one day. And if we can predict when those outages might happen, we'll be able to either repair them in advance, or send the right spare parts that need to be there, or maybe take them out of production and put something else in their place in time, so that we don't actually get that downtime. We do that with a mixture of large-scale machine learning processes, understanding the live data feeds that are coming in from those industrial internet applications, and trying to predict and then take action based on that. And then the third one here is understanding your customer. Lots of enterprises and large companies have siloed data, where they understand a little bit about their customer over here and another little bit over here, but these never talk to each other. So we try to bring those together, to understand your customer from a holistic point of view, and then to be able to provide better services and a better customer experience because of that. And that's quite a lot of what we do.
But there are a lot of other things too, for example trying to reduce fraud in banking, or trying to predict the destination of your journey in a car. We do a lot of these different things, and we want to be able to provide the data science services we do in a quick and easy way and get to those data-driven apps. So what does a data scientist really need out of a platform, or what sort of infrastructure do they need to do their work? Really, I think it boils down to three things. We need somewhere to store data and some easy way to capture that data. For example, in the internet of things, with the wide variety of different types of data coming in from different devices, we need a way to channel that data somewhere, store it long-term, and access it easily as well, not have it in long-term storage that is very hard to get at. For example, I'm working with a client at the moment, and we tried to do a data extract of a relatively small size, like it fit in my free Dropbox account, but it took over 24 hours to get that extract out, and for those 24 hours we couldn't work on the data. So we need somewhere easy to put data and access it. We need somewhere where we can do large-scale, intensive computations, running at scale with distributed computation systems like Apache Spark, or on top of Hadoop and the MapReduce paradigm, that kind of thing. But finally, and this is where we really get to value, we need to be able to deliver results, whether that's purely a list of results on a website, or a data API where someone can go and get predictions for different things, and Alex is going to talk a little bit about that, or simply an interactive data visualization where you're able to explore the data and see what the consequences are. So we need all three of these things. I'm going to talk about the first one, then Alex is going to talk about the next two.
So the first of these is data storage: how do we get data in and how do we keep it somewhere? In Cloud Foundry terms, platform terms, these are data services. We want an easy way to get access to these services without me having to go and download Redis myself, install it and try to tune it. I want an easy way to get a key-value store and just push things towards it. And I also want to be able to build an application that can actually feed it relatively easily as well. So instead of just having someone deliver me a hard drive that I have to load up somewhere, with the internet of things and other online, real-time streaming data, we're going to get these streams of data in and we're going to need to be able to do something with them quickly. There's a natural way of doing this in CF with data services, so you can have your managed services. There are lots of examples now, and I think we've heard a lot about these today and will tomorrow: things like highly available MySQL, or Redis, or RabbitMQ. We want to be able to create them easily, and we want to be able to bind our applications to them as well. But lots of people have dedicated standalone big data infrastructure. They might have their own Hadoop installation, something like an Apache Spark cluster, or whatever else. User-provided services allow you to connect to those really quickly and easily, and enable you to use your existing infrastructure without having to manage it through Cloud Foundry. Now, you may want to get to the point where you manage and provision it using something like BOSH, but using user-provided services for now gets you to meet that distributed data requirement today, if your service isn't managed by CF at the moment. And one good way of thinking about this is the ease with which you can switch from a test data store to a real-life production data store.
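That switch is easy because a CF app reads its connection details from the environment rather than from a config file you edit by hand. As a minimal sketch, here is how a Python app might look up the credentials of a bound service in the VCAP_SERVICES environment variable that Cloud Foundry injects; the service name used below is made up for illustration:

```python
import json
import os

def get_service_credentials(service_name):
    """Look up the credentials of a bound service in VCAP_SERVICES.

    Cloud Foundry injects VCAP_SERVICES into every app instance, so
    rebinding the app to a different service instance (test versus
    production) changes what this returns without any code edits.
    """
    vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    for instances in vcap.values():
        for instance in instances:
            if instance.get("name") == service_name:
                return instance["credentials"]
    raise KeyError("no bound service named %r" % service_name)
```

With that in place, moving from test to production is just an unbind, a bind to the other instance, and a restage; the application code itself never changes.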
A traditional way of doing this in data science might be to actually go and edit your files and change the way the data flow happens. Here we can just bind to a different service. So I can have one app pushed to CF that's bound to my test Postgres instance, and then I push another app but bind that to my production instance, or I switch between the two. So that provides a really easy way of going from one to the other. So that was the data services part. Alex, you're going to talk a little bit about the computation and the delivery of results. Sure. Thanks, Ian. So I'm going to talk a little bit about the compute part. On the one hand, I'm going to explain a little bit what the typical challenges are when we work on actual customer projects with this, and show the concept of a little prototype we developed. But first of all, as data scientists, what we usually do in our work is, obviously, we implement code. Some people have this image that we stand in front of a whiteboard in a lab coat and then code stuff in C or something like that. That's not how it is. What we use mainly is Python and R. These are two fairly high-level languages, and the reason we use them is because they have really, really good library support for a lot of machine learning algorithms. So these are really our favorite tools. When Ian and I started out working on Cloud Foundry, the first thing we found is that there's no R buildpack, and the Python buildpack that was there, let's say, doesn't really have a lot of the libraries that we usually need out of the box. So what Ian did is he took the Anaconda Python distribution by Continuum Analytics and built his own buildpack out of it. If we use that, there's a lot of stuff like scikit-learn, for example, which is a machine learning library, that we can use out of the box. So that was very handy. I used a lot of R, especially in university, so I'm a big R guy.
So what I did is I created the buildpack for R, which was kind of challenging, but at some point I got it done. These two things were really helpful and really essential before we could actually do anything on Cloud Foundry, right? So first things first, we had the buildpacks, which was good. So let's take a look at our usual work. Ian already mentioned briefly that we work as kind of consultants for our customers of the Pivotal Big Data Suite, and what we do there is try to get some meaningful, valuable information out of big, big data sets. The way this happens in practice: we work with a lot of enterprise customers, so you see these siloed data and siloed systems at the customer, and what we do is we get a big data extract and put all of this into some kind of distributed big data platform, which nowadays is usually HDFS, and then we work on top of it with Spark or something else. It could also be Greenplum, which is an MPP relational database. And once we have it there, we are happy data scientists. We can access all the data with great speed, so we don't need to go through long-running extract processes, because those already took place. We already pushed everything over there, and what we do over there is develop the actual model. So we think about how we can, for example, predict the lifetime value of a specific customer. We use different statistical models, machine learning models that we train there, so we show a lot of data to a particular algorithm, and that algorithm somehow learns how valuable a customer is. So everything happens over there. But the big problem is actually, how do we push this model back here? Because the business actually needs the prediction here, in their legacy system landscape, right? So that's actually kind of a big issue that we face in a lot of our customer engagements.
Very often, after we've created a really fancy model, a really great algorithm, we show a PowerPoint, and then the model kind of dies in the PowerPoint, as we say. So not a lot happens. This is the issue that we have, and we were looking at some ideas on how to solve it with Cloud Foundry, which leads to roughly two thoughts on how you can actually do data science on Cloud Foundry. This is just a very rough idea of how we think about this; there are a lot of different variants to it. Let's start here on the right side. What you can do is keep using your big data platform, which is good because there are a lot of libraries there, you can use Spark, and you can do the computation on the data in place, which is very good, and you use Cloud Foundry mainly as a visualization thing. So once you have some aggregated results, you're able to show them to your customer in a web app that you deploy on Cloud Foundry. The other approach is that you actually somehow try to leverage the compute power that's available in Cloud Foundry, and use the big data store just for storage, so you don't do any computation in there. So these are the two different approaches. There are also some variants to them. Let's say you don't want to store the data for some reason; then you can just leave that out and do some online learning computations up there. So there are different variants, but these are the two rough ideas of how you could do it. So what we did is we created this prototype of a prediction API, as we call it. What we want to do with it is basically have a better way of actually interfacing with other software. This is actually deployed at dsoncf.cfapps.io. When you go on there, you just get the readme landing page, basically, which tells you how you can send JSON there to do stuff with the API.
And if you're in the Pivotal organization on GitHub, you can actually get the code here. So what does this do? Basically, you have this REST API endpoint and you can send it a request that says, hey, create me a model, which then creates a model in the backend. That model is then able to ingest data, so you send the data as a JSON blob as well. And it kicks off some periodic retraining. So in machine learning there's this notion of training: you show the model a lot of data, and then the model gets smarter and smarter about the data. This framework is actually able to do some periodic retraining. It saves everything in Redis for now, which you can bind really easily on Cloud Foundry. And then you can also send scoring requests to this API. So you let it know about a data point, for example all the transactions of a customer, and then the model gives you a prediction back on how valuable that customer is, for example. So that is the API idea that we have on how we can actually leverage Cloud Foundry for data science. We created this kind of interface, which means that if you want to create a model in that framework, you have to implement this class interface: you need to have a train function, a score function and a get-parameters function. And this is all done in Python. Oh, and by the way, this is using Ian's Python buildpack I mentioned previously. So what are some data-driven applications that we did? What are some examples of our work? One thing which is really cool, which Ian created, is this Transport for London demo. What this does is it scrapes a live feed of all the disruptions on London streets. You can see the current disruptions that are happening, but it also gives you a prediction of how long these disruptions are going to last, and that is based on historical data.
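Before the demos, a quick sketch of what that class interface might look like in Python. The three method names (train, score, get parameters) come from the talk; everything else, including the trivial running-mean model, is our own illustrative stand-in, not the prototype's actual code:

```python
class Model:
    """Interface the prediction API expects: implement these three
    methods and the framework can handle ingest, periodic retraining
    and scoring requests for you."""

    def train(self, data):
        raise NotImplementedError

    def score(self, data_point):
        raise NotImplementedError

    def get_params(self):
        raise NotImplementedError


class MeanValueModel(Model):
    """Toy implementation: predicts customer value as the running
    mean of all values seen during training."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def train(self, data):
        # data: an iterable of observed customer values.
        for value in data:
            self.total += value
            self.count += 1

    def score(self, data_point):
        # Ignores the incoming data point's details and just
        # predicts the historical mean.
        return self.total / self.count if self.count else 0.0

    def get_params(self):
        return {"mean": self.score(None), "count": self.count}
```

Periodic retraining then just means the framework calling `train` again on newly ingested data; any class with these three methods can be plugged in the same way.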
So we scrape this data feed, store it, show the live status, put some predictions in there, and the model also gets periodically retrained on the historical data. And you can access it right there. I think it's fair to say this is the simplest possible way of using Cloud Foundry. Yeah, so this is basically the right-hand approach that you saw earlier. Another thing that I created with my R buildpack is what we call the insurance demo. It's based on an insurance data set, and the app basically allows you to explore the data a little bit. The goal here is to find valuable new customers. What you can do in this app is try to create some rules manually, but it also lets you just train a model that picks out these customers for you, and then you can compare the performance of your manual rules and the model. And the model is usually a lot better. And that's an example of the second approach, where the computation is actually happening in the Cloud Foundry app itself, not on the big data platform. Yes, and it's possible in this case because the data set is really small, like a megabyte or something like that. OK, so those are two examples of data-driven applications we did. With that, I'm going to hand it over to Ian again. Yeah, well, I think these are two public examples. We've done quite a lot of customer work as well, where we've used these ideas and gone a bit further. But what we really want to hear about is the rest of the community and what they're doing. We've already gone down to the GE booth and heard a little bit about Predix, and I'm sure there are a lot of other examples in the community of people using Cloud Foundry to not only display results but maybe provide data APIs, and who understand some of the sort of issues we're talking about. So we'd be really happy to hear anything that anyone has to say about that.
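As a toy illustration of that manual-rules-versus-model comparison: a hand-crafted rule up against a one-parameter "model" that learns its threshold from labelled data. The data, the rule and the threshold model below are entirely made up, not the demo's code:

```python
def manual_rule(income):
    # A hand-crafted rule: flag anyone earning over 50,000 as valuable.
    return income > 50_000

def learn_threshold(examples):
    """A one-parameter 'model': pick the income threshold that
    classifies the labelled training examples most accurately."""
    def accuracy(threshold):
        return sum((income > threshold) == valuable
                   for income, valuable in examples)
    return max(sorted(income for income, _ in examples), key=accuracy)

# Tiny made-up training set of (income, is_valuable) pairs.
examples = [(20_000, False), (25_000, False),
            (35_000, True), (40_000, True), (60_000, True)]

threshold = learn_threshold(examples)
rule_hits = sum(manual_rule(i) == v for i, v in examples)        # manual rule accuracy
model_hits = sum((i > threshold) == v for i, v in examples)      # learned model accuracy
```

The pattern is the same as in the real demo, just vastly simplified: evaluate the hand-made rule and the learned model on the same labelled data and compare how many customers each classifies correctly.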
And we set up this website as a place where you can show examples of how to do these kinds of things, and you can send us something on that Twitter account. But we'd also be happy to hear right now if anyone's doing any of this, or if you have any other questions as well. Questions? OK. Well, I think it depends. If you're setting it up internally on your own CF instance, you just have to provision it as you will. On PWS, which is Pivotal Web Services, that's a cloud-provided Redis, so you keep paying for higher and higher tiers of service. But, you know, I think it goes up quite a far way at the moment. I mean, Redis isn't really made for storing really large amounts of data. It's in-memory, so its feature is that it's quick. But for this prediction API architecture that I showed, we think of it mainly as a prototype, a proof-of-concept type of thing, that we eventually also want to hook up to something like Spark, for example, so we can also do batch training on really large data sets. So that's a good question. In the example Alex showed, the ingest is just, again, a REST API that you can send data to. For example, you've helped build this connected car demo, which is streaming data live back from a car, like a widget stuck into your car, and it just hits an HTTP endpoint that's provided by a CF app. And then the CF app knows, oh, you've sent some data to me, I'll go store it and I'll run my scoring on top of that. Yeah, yeah, exactly. So basically, there was a talk earlier about how CF right now uses HTTP as the transport, but there are moves to also use other things like TCP and AMQP and other ways of getting data in, because the internet of things runs on a lot of different protocols, not just HTTP. I think there were some more questions. Yeah, they're different.
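Very roughly, that store-then-score behaviour of the ingest endpoint could be sketched as a handler like this; the payload fields and the `store` and `model` objects are illustrative stand-ins, not the demo's actual code:

```python
import json

def handle_ingest(body, store, model):
    """What the CF app does when a device POSTs a reading:
    validate the JSON, persist it, then score it immediately.
    `store` stands in for a bound data service such as Redis."""
    try:
        reading = json.loads(body)
    except ValueError:
        return {"status": 400, "error": "invalid JSON"}
    if "value" not in reading:
        return {"status": 400, "error": "missing 'value'"}
    store.append(reading)
    return {"status": 200, "prediction": model.score(reading)}
```

In the real app this would sit behind the HTTP route the device posts to; the point is simply that ingest and scoring happen in the same request.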
So, one thing you could think about is, I'm actually using Spark on some customer projects now, and I'm using PySpark, and since the prototype that we wrote is written in Python, you could actually easily integrate it via PySpark with some Spark backend that you have running somewhere. So I think right now, you're using a Spark instance that's separate, on your standalone big data infrastructure. Where I'd like to get to, and I don't know what sort of timescale this is on, is to be able to provision a Spark cluster the same way as I can provision a Cassandra cluster in PCF today. So I can do that all through Cloud Foundry, maybe with BOSH provisioning Spark, and I can just bind to that. That would be the ideal thing for me. I want the minimum amount of fuss to get access to something. So it's the difference, we heard Andrew Clay Shafer talk a bit about the old days of provisioning a web application, where you go and request a server and it takes three months and then someone has to set it up. That's kind of still the same for big data infrastructure today, in many ways, especially on-premises. It's sometimes even longer than three months. People obviously spin up AWS and have Spark living on there, but we also have large bits of kit being moved around. So making it easier for people to start their data science work as quickly as possible, provisioning it through CF or BOSH, would be the way I'd see that going. Question there? So that's a good question. I don't think I have a hard and fast rule. I think the way everything is going is to be more distributed rather than one single large VM.
Obviously there are some overheads, but you have people running large-scale machine learning systems purely on top of AWS, with all the overhead that entails, and someone like Netflix is able to run their machine learning pipeline purely on that infrastructure without having to go down to bare metal at any point. So it's definitely doable. You probably have to be a little bit clever about it, because you're not getting 100% of the speed that you would on bare metal, but what you're gaining is the ease of provisioning and the ease of getting started. Whereas on bare metal, you then have to be responsible for maintaining all of that. Yeah, exactly. Yes. I would say there's no general rule of thumb that you can apply here. It also highly depends on the use case that you have. In general, if you follow that traditional approach where you have one big data extract and you put it somewhere, then I would definitely do that on top of Spark, so not inside of Cloud Foundry, but somewhere where I have the storage and the compute together. If the data is streaming in, though, it might be a different use case, and it might make sense to deploy some online learning on Cloud Foundry directly. And I think the other thing to think about is that Cloud Foundry installations were maybe originally set up for web applications, which don't need large volumes of RAM compared to some machine learning applications. For example, on the hosted versions there tends to be something like a two-gig RAM limit, but really I'd want a lot more memory if I was to do this as a CF app inside the CF installation. So maybe what you get is resource pools that have much better infrastructure for big data computation, and you let your app pick those when it needs to. I think this is going to be part of Diego, if I'm not wrong; someone correct me here.
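On that online-learning point: instead of retraining in batch, the app updates the model a little with each streamed observation and then discards it, so memory use stays flat regardless of how long the stream runs. A minimal hand-rolled stochastic-gradient sketch for a one-feature linear model (the learning rate and the simulated stream are made up for illustration):

```python
def sgd_update(w, b, x, y, lr=0.1):
    """One online-learning step for the model y ~ w*x + b: nudge the
    parameters against the squared-error gradient for this single
    observation, then forget the observation."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error

# Simulate a stream of observations drawn from the true relation y = 2x.
w, b = 0.0, 0.0
for step in range(2000):
    x = (step % 10) / 10.0          # inputs cycling through 0.0 .. 0.9
    w, b = sgd_update(w, b, x, 2.0 * x)
```

After consuming the stream, `w` is close to 2 and `b` close to 0, even though no observation was ever kept around; that is what makes this style of model a good fit for a memory-limited CF app sitting on a live feed.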
So I think it's about how much you give the application each time. Any more questions? OK, thank you very much. Thank you very much.