Hi, I'm Peter Burris. Welcome to Wikibon's Action Item. No one can dispute that big data and related technologies have had a significant impact on how businesses run, especially digital businesses. The evidence is everywhere. Just watch Amazon as it works its way through any number of different markets: it's highly dependent on what it can get out of big data technologies to do a better job of anticipating customer needs, predicting best actions, making recommendations, et cetera. On the other hand, nobody can deny that the overall concept of big data has had significant issues when it comes to everybody being able to get similar types of value. It just hasn't happened; there have been a lot of failures. So today from our Palo Alto studio, I've asked David Floyer, who's with me here, along with Jim Kobielus, Ralph Finos, and George Gilbert on the line, to talk about where we are with big data pipelines from a maturity standpoint, to better increase the likelihood that all businesses are capable of getting value out of this. Jim, why don't you take us through it? What's the core issue as we think about the maturing of machine learning and big data pipelines?

Yeah, the core issue is the maturation of the machine learning pipeline. How mature is it? The way Wikibon looks at the maturation of the machine learning pipeline, independent of the platforms used to implement it, comes down to three issues. One: to what extent has it been standardized? Is there a standard conception of the various phases, tasks, functions, and their sequence? Two: to what extent has this pipeline, at various points or end to end, been automated to enable greater throughput and consistency of capability? And three: to what extent has this pipeline been accelerated, not just through automation, but through standardized handoffs and collaboration, handling things like governance in a repeatable way? Those are the core issues in terms of the ML pipeline. But in the broader sense, the ML pipeline is only one work stream in the broader application development pipeline, which includes the code development and testing pipeline. So DevOps is really the broader phenomenon here; the ML pipeline is one segment of the DevOps pipeline. We need to start thinking about how the ML pipeline creates assets that the business can use in a lot of different ways, where those assets specifically are machine learning models that can be used in higher-value analytic systems.

This pressure has been in place for quite a while, but David Floyer, there's a reason why right now this has become important. Why don't you give us a quick overview: why now?

Why now is because automation is in full swing. You've just seen Amazon having the ability now to automate warehouses, and they've just announced the ability to automate stores, brick-and-mortar stores. You go in, you pick something up, you walk out, and that's all you have to do. No lines to check out, no people at the checkout: a completely automated store. That business model of automating business processes is, to me, what all of this has to lead up to. We have to take the existing automation that we have, which is the systems of record and the other automation that we've had for many years, then take the new capabilities of AI and other areas of automation, apply those to that existing automation, and start on this journey.
It's a 10-year journey or more to automating as many of those business processes as possible. Something like 80% or 90% of them are there and can be automated. That's an exciting future, but what we have to focus on is being able to do it now, and to start doing it now.

So that requires that we really do take an asset-oriented approach to all of this. At the end of the day, it's impossible to imagine a business taking on increasing complexity within its technology infrastructure if it hasn't taken care of business in very core ways, not the least of which is: do we, as a business, have a consistent approach to thinking about how we build these models? So Jim, you've noted that there are three overarching considerations. Help us go into them a little bit. Where are the problems that businesses are facing? Where are they seeing the lack of standardization creating the greatest issues?

Yeah, well, first of all, the whole notion of a machine learning pipeline has a long vintage. It actually descends from the notion of a data mining pipeline; years ago the data mining industry consolidated, or at least reached consensus, around a model called CRISP-DM. I won't bore you with the details there. Taking that forward to an analytics pipeline or a machine learning pipeline, the critical issue we see now is that the type of asset being built and productionized is a machine learning model: a statistical model that's increasingly being built on artificial neural networks to drive things like deep learning. Some of the critical things are, well, upfront, the preparation of all the data in terms of ingest and transformation and cleansing. That's an old set of problems, well established, and there are a lot of tools on the market that do that really well. That's all critical for data preparation before the modeling process truly begins.

So is that breaking down, Jim? Is that the part that's breaking down? Is it the upfront understanding of the processes, or is it somewhere else in the pipeline process?

It's in the middle, Peter. The modeling itself for machine learning is where a number of things have to happen for these models to be highly predictive. First, you have to do something called feature engineering, which is fundamentally looking for the predictors in large data sets that you can then build into models using various algorithms. Feature engineering is a highly manual process that increasingly can be automated, and is being automated, but a lot of that is really bleeding-edge technology in the research institutes of the world. How to automate more of the upfront feature engineering is a huge issue. That feeds into the second core issue, which is that there are ten zillion ways to skin the statistical-model cat in terms of algorithms, from the older approaches, regression models and support vector machines, to the newer artificial neural networks, convolutional and so on. So a core issue is: okay, you have a feature set through feature engineering; which of your ten zillion algorithms should you use to actually build the model based on that feature set? There are tools on the market that can accelerate the selection and testing of those alternate ways of building up those models, but that traditionally manual process of selecting the models still needs a lot of manual care and feeding to really be done right. It's a human judgment; you really need high-powered data scientists. And number three, once you have the models built, there's training them. Training with actual data is critical to determine whether the models actually are predictive, or do face recognition or whatever it is, with a high degree of accuracy. Training itself is a very complicated pipeline in its own right. It takes a lot of time, a lot of resources, a lot of storage; you've got your data lake and so forth. The whole issue of standardizing the training of machine learning models is a black art in its own right. And I'm just scratching the surface of the issues that are outstanding in terms of actually getting greater automation into a highly manual, highly expert-driven process.
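To make the algorithm-selection step Jim describes concrete, here is a minimal sketch, assuming scikit-learn: cross-validate a few candidate model families against the same engineered feature set and keep the best performer. The synthetic dataset and the three candidates are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch: score several candidate model families on one engineered
# feature set via cross-validation and keep the best. The synthetic data
# and the candidate list are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for a real engineered feature set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "support_vector_machine": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validation accuracy per candidate family.
scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
print(f"best family: {best} (mean CV accuracy {scores[best]:.3f})")
```

Automated selection tooling runs essentially this loop at much larger scale; as Jim notes, the judgment around which results to trust remains manual.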
Go ahead, David.

Jim, can I just break in there? You've mentioned the three things, and they're very much in the AI portion of this discussion. The endpoint has to be something which allows automation of a business process, and fundamentally it's real-time automation. I think you would agree with that. So the outcome of that model then has to be a piece of code that is going to be part of the overall automation system in the enterprise, and it has to fit in. And if it's going to be real-time, it's got to be really fast as well.

In other words, the asset that's created by this pipeline is going to be used in some other set of activities.

Correct. So it needs to be tested in that set of activities as part of the normal cycle. So what is the automation? What is that process to get that code into a place where it can actually be useful to the enterprise and save money?

Yeah, David, it's called DevOps. And really, DevOps means a number of things, including especially a source code control repository. In the broader scheme of things, that repository for your code, for DevOps, for continuous release cycles, needs to be expanded in scope to include machine learning models, deep learning models, whatever it is you're building based on the data. What you're getting is a deepening repository of what I call the logic that's driving your applications. It's code: Java, C++, C#, or whatever. It's statistical and predictive models. It's orchestration models you're using for BPM and so forth. It's maybe graph models. There's a deepening, thickening layer of logic that needs to be pushed into your downstream applications to drive these levels of automation, and it has to be governed in a consolidated way.

So Jim, the bottom line is we need maturity in the pipeline associated with machine learning and big data so that we can increase maturity in how we apply those assets elsewhere in the organization. Have I got that right?

Right.
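Jim's point about widening the source code repository to cover models can be pictured with a small sketch: serialize a trained model, fingerprint the artifact, and record metadata so it can be governed like a code commit. The registry layout and metadata fields here are hypothetical assumptions, not any particular product's API; serialization assumes joblib.

```python
# Minimal sketch of versioning a trained model as a governed asset next to
# application code. The registry layout and metadata fields are
# hypothetical, not a specific product's API.
import hashlib
import json
import pathlib

import joblib

def register_model(model, name, registry_dir="model_registry"):
    """Serialize a model, fingerprint the artifact, and record metadata."""
    entry = pathlib.Path(registry_dir) / name
    entry.mkdir(parents=True, exist_ok=True)
    artifact = entry / "model.joblib"
    joblib.dump(model, artifact)
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    metadata = {"name": name, "sha256": digest, "framework": "scikit-learn"}
    (entry / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return digest

# Usage (hypothetical model and name): register_model(trained_model, "churn_v1")
```

The fingerprint gives release and governance processes the same kind of immutable reference they already have for code commits.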
George, what is that going to look like?

Well, I want to build on what Jim was talking about earlier. My way of looking at the pipeline is actually to break it out into four different ones, and actually, as Jim has pointed out, there are potentially more than four. The first is the design time for the applications, these new modern operational analytic applications, and I'll tie that back to the systems of record. The second is the runtime pipeline for these new operational analytic applications. And those applications really have a separate pipeline for the design time and the runtime of the machine learning models. The reason I keep them separate is that they are on a separate development, deployment, and administration scaffolding from the operational applications. The way it works with the systems of record, which of course we're not going to be tearing out for decades, is that they might call out to one of these new applications, feed in some predictors or have some calculated, and then get a prediction or a prescription back for the system of record.

So George, let me just wrap up. What has to happen is we have to be able to ensure that the development activities that actually build the applications the business finds valuable, the processes by which we report some of the outcomes of these things into the business, and the pipelines associated with building these models, with the artifacts and the assets created by those pipelines, all come together. Are we talking about a single machine learning or big data pipeline? George, you mentioned four. Are we going to see pipelines for machine learning and pipelines for deep learning and pipelines for other types of AI? Are we going to see a portfolio of pipelines? What do you guys think?

I think so. But here's the thing: I think there's going to be a consolidated data lake from which all of these pipelines draw the data that's used for modeling and downstream deployment. But look at the training of deep learning models, which, as their name indicates, are deep; they're hierarchical. They're used for things like image recognition and so forth, and the data there is video and speech and so forth. There are different kinds of algorithms used to build them, and there are different types of training that need to happen for deep learning versus other machine learning models versus whatever else.

So, Jim, let me stop you; those are different processes. I want to get to the meat of this, guys. Tell me what a user needs to do from a design standpoint to inform their choice of pipeline building, and then, secondarily, what kind of tools they're going to need. Does it start with the idea that there are different algorithms, different assets being created at the model level? Is that really going to lead to a choice of tools? Or is it the application requirements? How mature, how standardized are the conventions we can really put in place for doing this now, so that it becomes a strategic business capability?

I think there has to be a recognition that there are different use cases downstream, because these are entirely different types of applications being built from AI in the broadest sense, and they require different data and different algorithms. But you look at the use cases. Chatbots, for example, are a core use case now for AI, and that's a very different use case from, say, a self-driving vehicle. Those need entirely different pipelines in every capacity to be able to build out, deploy, and manage those disparate applications.
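The call-out pattern George described, a system of record feeding predictors to a new analytic application and getting a prediction or prescription back, might look something like this minimal sketch, assuming Flask; the endpoint path, payload shape, and model artifact are illustrative assumptions.

```python
# Minimal sketch of a scoring service a system of record can call out to:
# it posts predictor values and receives a prediction back. Endpoint path,
# payload shape, and model artifact are illustrative assumptions.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
# Hypothetical artifact produced by the registry sketch above.
model = joblib.load("model_registry/churn_v1/model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # The caller sends its predictors as JSON, e.g. {"features": [...]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=8080)
```

Keeping the model behind a network boundary like this is one way the ML pipeline and the system of record stay on separate deployment scaffolding while still composing at runtime.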
So let me make sure I've got this, Jim. What you're saying is that the process of creating a machine learning asset, a model, is going to be different at the pipeline level, not at the data level. George, does that make sense? Is that right? Do you see it that way too, as we talk to folks?

I do see what Jim is saying, in the sense that if you're using operational tooling or guardrails to maintain the fidelity of a model that's being called by an existing system of record, that's very different tooling from what's going to be managing your IoT models, which have to get distributed and which may have a central canonical version and then an edge-specific instance. In other words, I do think we're going to see different tooling, because we're going to see different types of applications being fed and maintained by these models.

So organizationally we might have a common framework or approach, but the different use cases will drive different technology selections, and those pipelines themselves will be regarded as assets that generate machine learning and other types of assets that then get applied inside these automation applications. Have I got that right, guys?

Yes, and I'll just give you a quick example to illustrate exactly what we're referring to here. George brought up IoT analytics, with AI built inside edge applications. We're going to see a bifurcation between IoT analytic applications where the training of the models is done in a centralized way, because you have huge amounts of data that need to be training these very complex models, which run in the cloud but drive all these edge nodes and gateways and so forth, and another pipeline for edge-based training of models for things like autonomous operation, where more of the actual training happens at the edges, at the perimeter. It'll be different types of training using different types of data, with different time lags and so forth built in. But there will be distinct pipelines that need to be managed in a broader architecture.

So issues like the ownership of the data, the intellectual property control of the data, the location of the data, the degree to which regulatory compliance is associated with it, and how it gets tested are all going to have an impact on the nature of the pipelines we build here. Now look, one of the biggest challenges that every IT organization has, in fact that every business has, is that when you have this much going on, the slowest part of it slows everything else down. There's always an impedance mismatch organizationally. Are we going to see data science and application development routines, practices, and conventions forced to come together, because the app development world, which is being asked to go faster and faster, is at some point going to say, "I can't wait for these guys to do their sandbox stuff"? What do you think, guys? Are we going to see that? David, I'll look at you first, and Jim, then I'll go to you.

Sure, I think that the central point of control for this is going to have to be the business case for developing this automation, and therefore, from that, what's required, and that system of record.

Where the money is.

What is required to make that automation happen, and therefore, from that, what are you going to pick as your ways of doing it? At the moment, it seems to me as an outsider that it's much more driven by the data scientists than by the business line and, eventually, the application developers themselves. I think that shift has to happen.
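One way to picture the bifurcation Jim describes is a model trained centrally on pooled data and then incrementally updated on data collected at an edge node. A minimal sketch, assuming scikit-learn's SGDClassifier, which supports incremental updates via partial_fit; the data sources are synthetic stand-ins, not real cloud or sensor data.

```python
# Minimal sketch: central training on pooled data, then incremental
# edge-side updates on local data via partial_fit. Synthetic data stands
# in for cloud-scale training sets and local sensor streams.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Central pipeline: train on a large pooled dataset in the cloud.
X_central = rng.normal(size=(10000, 8))
y_central = (X_central[:, 0] > 0).astype(int)
model = SGDClassifier(loss="log_loss", random_state=0)
model.fit(X_central, y_central)

# Edge pipeline: the deployed copy adapts to a small local data stream,
# with its own cadence and time lags, without shipping raw data back.
X_edge = rng.normal(loc=0.5, size=(200, 8))
y_edge = (X_edge[:, 0] > 0.5).astype(int)
model.partial_fit(X_edge, y_edge)
print(f"accuracy on local edge data: {model.score(X_edge, y_edge):.3f}")
```

Each side of that split carries its own training cadence, data ownership, and compliance constraints, which is why they end up as distinct pipelines under one broader architecture.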
Yeah, well, one of our predictions has been that the tools are improving and that that's going to allow for a separation, an increased specialization, in the data science world. We'll see the difference between the people who are really doing data science and the people who are doing support work, and I think what we're saying here is that the people doing support work are going to end up moving closer to the application development world. Jim, I think that's basically some research that you've done as well, so have I got that right? Okay, so let me wrap up our action item here. David Floyer, do you have a quick observation and a quick action item for this segment?

For this segment, the action item to me is putting together a business case for automation: the fundamental reduction of cost and improvement of the business model. That, to me, is what starts this off. How are you going to save money? Where is it most important? Where are new business models most important? What we've done in some very recent research is put out a starting point for this discussion, a business model of a $10 billion company where we're predicting that it saves $14 billion. We'll come to that. But the action item is basically to start getting serious about this stuff, based on business cases.

All right, so let me summarize very quickly for Jim Kobielus and George Gilbert and Ralph Finos, who seems to have disappeared off our screens, and David Floyer. Our action item is this: the leaders in the industry and the digital world are starting to apply things like machine learning, deep learning, and other forms of AI very aggressively to compete, and that's going to force everybody to get better at this. The challenge, of course, is that if you're spending most of your time on the underlying technology, you're not spending most of your time figuring out how to actually deliver the business results. Our expectation is that over the course of the next year, one of the significant things happening within organizations will be a drive to improve the degree to which machine learning pipelines become more standardized, reflecting good data science practices within the business, practices which themselves will vary with the nature of the business, regulated businesses versus non-regulated businesses, for example. Those activities will be reflected in tooling choices; those tooling choices will then be reflected in the types of models we want to build; and those machine learning models will ultimately reflect the needs of the business case. This is going to be a domain that requires a lot of thought in a lot of IT organizations, with a lot of invention yet to be done. But it's going to, we believe, drive a degree of specialization within the data science world as the tools improve, and a realignment of crucial value-creating activities within the business, so that what is truly data science becomes data science, and what is more support, more related to building and operating these pipelines, becomes more associated with DevOps and application development overall.

All right, so for the Wikibon team, Jim Kobielus, Ralph Finos, George Gilbert, and here in the studio with me, David Floyer, this has been Wikibon's Action Item. We look forward to seeing you again.