This is George Gilbert on the ground at the Data Science Summit at the Marriott Marquis in San Francisco. I'm with Olia Farinas, master data scientist at Pivotal and one of the pioneers in taking advanced tools and helping customers apply them. So you have several years of experience taking customers on this journey toward applying machine learning. Tell us how that journey typically starts and some of the obstacles customers have to get over to move toward high-impact solutions.

A lot of our customers actually have some sort of data scientist in house, but they're limited by the traditional tools they use. Data is very siloed, so in order to take advantage of the full potential of the data, it needs to be co-located first, and co-location isn't just putting the data into a single place. A lot of data-science-related activities go into taking advantage of disparate data sources: you want to be able to recognize that this entity in one database is the same as that entity in another, so there's a lot of entity resolution modeling.

And common metadata and catalogs, that sort of thing?

Some companies would like to have a top-down approach, master data management. They think, oh, my data is too messy, so I can't really use data science. We don't believe that. It can be very messy data; you don't need to have a common framework or a common way to categorize things first. Data science can help. Data science is not something to bring in at the very last moment. Don't worry about the master data management side of it first. Let the data scientists take a look at your whole data set, understand the biases in the data, understand the commonality between different data sets, and then find ways to interrogate the data so that you know which parts you can actually believe and which parts you cannot. It's all part of the data science practice. So you bring in all these proprietary data sources, partner data sources, and external data sources and build an analytical capability on top. That's how we help our customers.

So step one is to inventory your data and determine its quality and its source, and make sure you know what you can rely on.

Absolutely, absolutely. Bring all your data assets into a single place and find correlations and causalities between them. But it's not going to be very simple; it's not just putting the data into a single location. A lot of the data scientists' time will be spent on understanding the biases in that data and then interrogating it.

Okay. So to be clear, it sounds like whatever we call the data lake really does have a very different role from the data warehouse, which is highly curated, organized, and accessed by business analysts. This is for the data scientists, the hardcore practitioners who have to work out lineage and meaning.

Absolutely. Give them access to the raw data in its rawest form possible. It's not data that has been through ETL, that has gone through all kinds of processes that embed logic in it, because you don't want your data scientists to be rediscovering all the business logic that was embedded in your ETL processes. So create a data lake, put your data in any shape or form, structured, unstructured, or semi-structured, and let them understand the biases in the data and build trust in particular data sources using data science.
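To make the entity resolution step mentioned above concrete, here is a minimal Python sketch. The two record sources, the field weights, and the matching threshold are all illustrative assumptions, not Pivotal's actual tooling; real engagements would use richer features, blocking, and proper probabilistic matching.

```python
# Minimal illustration of entity resolution across two siloed sources.
# Records, weights, and the 0.6 threshold are hypothetical examples.
from difflib import SequenceMatcher

crm_records = [
    {"id": "crm-1", "name": "Acme Corp.", "city": "San Francisco"},
    {"id": "crm-2", "name": "Globex Inc", "city": "Boston"},
]
billing_records = [
    {"id": "bill-9", "name": "ACME Corporation", "city": "San Francisco"},
    {"id": "bill-7", "name": "Initech", "city": "Austin"},
]

def similarity(a: str, b: str) -> float:
    """Simple string similarity in [0, 1]; real projects use richer features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_entities(left, right, threshold=0.6):
    """Pair records from two sources that likely refer to the same entity."""
    matches = []
    for l in left:
        for r in right:
            score = (0.7 * similarity(l["name"], r["name"])
                     + 0.3 * similarity(l["city"], r["city"]))
            if score >= threshold:
                matches.append((l["id"], r["id"], round(score, 2)))
    return matches

print(match_entities(crm_records, billing_records))
# Expected: the two Acme records pair up, e.g. [('crm-1', 'bill-9', 0.78)]
```

The point of the sketch is only that "the same entity" has to be inferred from noisy attributes rather than looked up by a shared key, which is why this work happens before any modeling on the combined data.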
Taking just one step up from there, is there a role for tools that manage the lineage, you know, the governance issues around this data, where you've taken it in its rawest form, but in one repository you're going to track how it gets refined?

Sure. A lot of very highly regulated industries are interested in solutions like that, and there are many different technologies that answer that question. It's not necessarily in the domain of data science to create that lineage, but of course, in some particular industries, financial services or pharmaceutical companies, et cetera, if they have to give some sort of report to regulators, they will want to understand the lineage of the data. That's absolutely true.

Okay. So now that you've got an inventory of the data, how do you decide which applications, or perhaps better said, which problems are best addressed first with machine learning technology, the ones that would have the biggest impact but also the lowest risk to start with?

That's something that we help a lot of our customers with. First we build that understanding of their data assets, then we identify the business decisions that are being made, understand what kind of data science can support those decisions, assign a value and a level of effort to each, and build those companies an analytics roadmap. Understanding how big an impact a particular project is going to have on the business is very important, and it shouldn't be skipped, because in prior years people were interested in the technology, they didn't really know what it would do for them, and they started a lot of little science projects: can this technology do this faster, or would you be able to use this extra data source that we've never been able to look into?

Can you give us some examples of a roadmap, anonymized if necessary, where you balance the impact and the cost or risk?

Since these are extremely strategic exercises, I can't give customer examples, even without the names. But there are very different dimensions of value. A project may be valuable because it's going to be consumed internally and it's either going to generate new revenue or reduce cost, so being able to understand how big the opportunity is is extremely important, and sometimes that requires you to actually dive into the data. For fraud detection kinds of projects, you first need an understanding of the size of the pool; whether machine learning will be able to identify all of it or not is another question, but you first need to understand how big the opportunity is. Level of effort has completely different dimensions as well: data accessibility, and whether the data has the right granularity to answer the question you're asking, is one aspect; the delivery of the analytics is another. Maybe the people whose decision making you're trying to influence are not open to decision support systems, right? If they're not going to use the tools that you build, that level of effort is going to be very high. Technology, of course, is another one, and how messy the data is also contributes to the level of effort. The delivery of the analytics, as I mentioned, is another thing that we pay a lot of attention to.

When you say delivery, you mean operationalizing it?

That's right.

Tell us, what are some of the methodologies for doing that?
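As a rough illustration of the value-versus-effort scoring described above, here is a toy Python sketch. The candidate projects, the scores, and the simple value/effort ratio are all made-up assumptions for illustration, not Pivotal's methodology.

```python
# Illustrative only: rank candidate analytics projects by estimated
# business value versus level of effort. All figures are made up.
candidates = [
    {"project": "fraud detection",    "value": 9, "effort": 7},
    {"project": "churn prediction",   "value": 7, "effort": 4},
    {"project": "demand forecasting", "value": 6, "effort": 3},
    {"project": "image-based QA",     "value": 5, "effort": 8},
]

# Higher value and lower effort float to the top of the roadmap.
roadmap = sorted(candidates, key=lambda p: p["value"] / p["effort"], reverse=True)

for rank, p in enumerate(roadmap, start=1):
    print(f"{rank}. {p['project']}  (value {p['value']}, effort {p['effort']})")
```

In practice the value and effort estimates are the hard part, and, as noted in the interview, often require diving into the data itself before a number can be attached.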
One thing is to start thinking about how the model is going to be used by the end users. The Pivotal Data Science team is now part of Pivotal Labs' rapid application development team, so we have an increased focus on users: how they make decisions, when they make decisions, and whether particular pieces of information are going to be available at the time of decision making. When you start with that mindset, you increase the chances of that particular model being operationalized. That's the key. It's about understanding users' everyday lives and how they make decisions, and building a product that fits those needs.

So it's a matter of understanding the workflow or business process more than any technical integration issue?

That's right, that's absolutely true. I think you said it better than I did. It's embedding analytics into their everyday lives so that they'll take advantage of it, because there are so many predictive models that are built that no one takes advantage of. It could be the most insightful model, but if it's going into a PowerPoint presentation, it's dead. That's where models go to die.

Are you noticing an acceleration in the change of the tools that enable the data preparation and the modeling? Because to the casual user, it looks like we have dozens of tools on the preparation and the analytics side that can be applied any which way.

Consider us tool agnostic. We pick the right tool for the problem that we're solving, and almost all the tools that you can take advantage of work on top of our platform anyway. If it is unstructured data like medical images and you want to be able to extract features from it, that's something you might have to do using MapReduce; we have a couple of data scientists who are experts in computer vision, so they take advantage of that. The majority of the time, though, the data that we have access to has some sort of structure in it. If that's the case, being able to use an advanced SQL capability on top of Hadoop is extremely exciting for a lot of our customers, so using HAWQ for that purpose is another way to go about it.

Okay. And with that, this is George Gilbert, Data Science Summit, San Francisco, and we'll be back. Thanks.
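To give a concrete sense of what "embedding analytics into everyday decisions" can look like, here is a minimal Python sketch, assuming scikit-learn and synthetic data. The model, the threshold, and the decision logic are hypothetical and do not represent Pivotal's actual stack; the point is only that the application calls a scoring function at the moment the decision is made, rather than reading results off a slide.

```python
# A minimal sketch (not Pivotal's stack) of operationalizing a model:
# train once, then expose a scoring function the operational system
# calls at the exact moment a decision is made. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two features per historical case, 1 = bad outcome.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0.8).astype(int)

model = LogisticRegression().fit(X_train, y_train)

def score_case(feature_1: float, feature_2: float) -> float:
    """Called inline by the business application when the decision is made."""
    return float(model.predict_proba([[feature_1, feature_2]])[0, 1])

# The decision point lives in the workflow, not in a PowerPoint deck:
if score_case(1.2, 0.4) > 0.7:
    print("route this case to a human reviewer")
else:
    print("approve automatically")
```

Whether the scoring runs in an application service, inside the database, or as SQL on Hadoop is a technology choice; the design choice the interview emphasizes is that the score arrives where and when the user actually decides.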