Hi everyone. We have a special program today. We are joined by Dinesh Nirmal, who is VP of Development for Analytics at IBM. Dinesh has an extremely broad perspective on what's going on in this part of the industry, and IBM has a very broad portfolio, so between the two of us I think we can cover a lot of ground today. So Dinesh, welcome.

Oh, thank you, George. Great to be here.

So just to frame the discussion, I wanted to hit on four key highlights. One is balancing compatibility across cloud, on-prem, and edge versus leveraging specialized services that might exist on any one of those platforms. Then, harmonizing and simplifying both the management and the development of services across these platforms; you have that trade-off between "do I do everything compatibly, or can I take advantage of platform-specific capabilities?" Then, we've heard a huge amount of noise on machine learning, and everyone says they're democratizing it; we want to hear your perspective on how you think that's most effectively done. And then, if we have time, how to manage machine learning data feedback loops to improve the models. So having started with that: you talked about the private cloud and the public cloud, so how do you manage the data and the models, or the other analytical assets, across the hybrid environments of today?

So if you look at enterprises, a hybrid format is what most customers adopt. You have some data on the public side, but your mission-critical data, the data that's core to your transactions, exists in the private cloud. Now how do you make sure that the data you push onto the cloud can be used to build models, and that you can then take that model and deploy it on-prem or on public cloud?
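The hybrid pattern Dinesh describes, building a model in one environment and deploying it in another, hinges on the model being a portable artifact. A minimal sketch in Python; the trivial least-squares fit and the JSON artifact format are invented for illustration, standing in for whatever framework actually produced the model:

```python
import json

# "Train in the cloud": a trivial least-squares fit of y = a*x + b,
# standing in for a real training framework.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return {"slope": a, "intercept": my - a * mx}

# Serialize to a portable artifact that can move between public
# cloud and on-prem without carrying the training data along.
artifact = json.dumps(fit([1, 2, 3, 4], [2, 4, 6, 8]))

# "Deploy on-prem": load the artifact and score new transactions.
model = json.loads(artifact)

def predict(model, x):
    return model["slope"] * x + model["intercept"]
```

The artifact here contains only coefficients, which is what makes the cross-environment move cheap; as Dinesh notes next, not every model type is so clean.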
Is that the emerging mainstream design pattern, where mission-critical systems are less likely to move, because of latency or because they're fused to their own hardware, but you take the data, the research for the models happens up in the cloud, and then the result gets pushed down close to where the transaction decisions are?

Right. I mean, there's also the economics of data that comes into play. If you are doing a large-scale neural net, where you have GPUs and you want to do deep learning, obviously it might make more sense for you to push it into the cloud and do that using Watson or one of the other deep learning frameworks out there. But then you have your core transactional data, which includes your customer data, or your customers' medical data, which some customers might be reluctant to push onto public cloud, but you still want to build models and predict and all those things. So I think it's a hybrid nature: depending on the sensitivities of the data, customers might decide to put it on public cloud versus private cloud, which is on their premises, right? So then how do you serve those customer needs, making sure that you can build a model on public cloud and then deploy the model on private cloud, or vice versa? I mean, you can build that model on private cloud, on-prem, and then deploy it on your public cloud. Now, one last point on the challenge: people think that once I build a model and deploy it on public cloud, then it's easy, because it's just an API call at that point to execute the transactions. But that's not the case. Take a support vector machine, for example. That still has support vectors in there. That means your data is there, right? So even though you're saying you're deploying the model, you still have sensitive data there. Those are the kinds of things that customers need to think about before they go deploy those models.
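Dinesh's support vector machine point is worth making concrete: a kernel SVM's decision function literally stores rows of training data (the support vectors), while a linear model stores only aggregate coefficients. A toy illustration in Python, with all values and field names invented:

```python
import math

# A linear model ships only coefficients: no raw records inside.
linear_model = {"coef": [0.42, -1.3], "intercept": 0.07}

# A kernel SVM must ship its support vectors, i.e. actual
# (potentially sensitive) training rows, to score new points.
svm_model = {
    "support_vectors": [[63, 1], [41, 0]],  # raw customer records
    "dual_coef": [0.8, -0.8],
    "gamma": 0.5,
    "intercept": -0.1,
}

def rbf(a, b, gamma):
    # Gaussian (RBF) kernel between two feature vectors.
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

def svm_decision(model, x):
    # Scoring a new point requires the stored training rows.
    s = sum(c * rbf(sv, x, model["gamma"])
            for c, sv in zip(model["dual_coef"], model["support_vectors"]))
    return s + model["intercept"]
```

So "deploying just the model" can still mean deploying data, which is exactly the caution raised above.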
So, and this is actually a topic for our Friday interview with a member of the Watson IoT family, but it's not so black and white when you say "we'll leave all your customer data with you and we'll work on the models." It's sort of like tea bags: you can take the customer's tea bag, squeeze some of the tea out in your IBM or public cloud, and give them back the tea bag, but you're getting some of the benefit of that data.

Right. So it depends on the algorithms you build. I mean, you could take linear regression, and you don't have the challenges that I mentioned with support vector machines, because none of the data is moving; it's just the model. So it depends. I think that's where pre-canned pipelines, like what Watson has done, will help tremendously, because the data is secure in that sense. But if you're building on your own, it's a different challenge. You have to make sure you pick the right algorithms to do that.

Okay. So let's move on to the modern operational analytic pipeline, where the key steps are ingest, process, analyze, predict, serve, and you can drill down on those and make them more fine-grained. Today those pipelines are pretty much built out of multi-vendor components. How do you see that evolving under the tension between simplicity, coming from one vendor, one throat to choke, with the pieces all designed together, and specialization, where you want a unique tool in one component?

Right. So you're exactly right. You can take a two-pronged approach. One is, you can go to a cloud provider, get each of these services, and stitch them together. That's one approach, a challenging approach, but it has its benefits, right? I mean, you bring some core strengths from each vendor into it.
The other one is the integrated approach, where you ingest the data, you shape or cleanse the data, you get it prepared for analytics, you build the model, you predict, you visualize. I mean, that all comes in one. The benefit there is that you get the whole stack in one. You have one throat to choke. You have a whole pipeline that you can execute. You have one service provider who's giving you the services. It's managed. So all those benefits come with it, and that's probably the preferred way, where it's all integrated together in one stack. I think that's the path most will go towards, because then you have the whole pipeline available to you, and also the services that come with it, right? Any updates that come with it. If you take the first path, the one challenge you have is: how do you make sure that all these services are compatible with each other? How do you make sure they're compliant? If you're an insurance company, you want to be HIPAA compliant. Are you going to individually make sure that each of the services is HIPAA compliant? Once you get it from one integrated vendor, you can make sure that they are HIPAA compliant, that the tests are done. So to me, all those benefits outweigh putting unmanaged services together and then creating a data lake to sit underneath all of it.

Would it be fair to use an analogy where Hadoop, having originated in many different Apache projects, is a quasi-multi-vendor kind of pipeline, and the state of the machine learning analytic pipeline is still kind of multi-vendor today? If you see that moving towards a single-vendor pipeline, who do you see as the last man standing?

So I can speak from an IBM perspective. I would say the benefit that a vendor like IBM brings forward is the choice of public or private cloud or hybrid; you obviously have the choice of going to public cloud.
You can get the same service on public cloud, so you get a hybrid experience. So that's one aspect of it. Then you get the integrated solution, all the way from ingest to visualization. You have one provider. It's well tested. It's integrated. It's compliant. It works well together. So going forward, if you look purely from an enterprise perspective, I would say integrated solutions are the way to go, because that's what will be the last man standing. I'll give you an example. I was with a major bank in Europe about a month ago, and I took them through our Data Science Experience, our machine learning project and all that, and the CTO's take was: "Dinesh, I get it. Building the model itself only took us two days, but incorporating that model into our existing infrastructure, it has been 11 months and we haven't been able to do it." So that's the challenge enterprises face, and they want an integrated solution to bring that model into their existing infrastructure. So that's my thought.

So today, though, let's talk about the IBM pipeline. Spark is the core compute; ingest is often Kafka.

Right, so you can do Spark Streaming, right? You can use Kafka, or you can use InfoSphere Streams, which is our proprietary tool.

Right. Although for ingest, you wouldn't really use Spark Structured Streaming because of the back pressure.

Right, and so, yeah, I agree.

The point that I'm trying to make is that it's still multi-vendor. And then on the serving side, once the analysis is done and predictions are made, some sort of NoSQL or NewSQL database has to take over. So today it's still pretty multi-vendor. How do you see any of those products broadening their footprint so that the number of pieces decreases?
Good question. They are all going to get into the end-to-end pipeline, because that's where the value is. Unless you provide an integrated end-to-end solution for a customer, especially an enterprise customer, it's all about putting it all together, and putting these pieces together is not that easy. Even when you ingest the data, IoT kinds of data, a lot of times, 99% of the time, the data is not clean. I mean, unless you're in a Kaggle competition where you get cleansed data, in the real world that never happens. So I would say 80% of a data scientist's time is spent on cleaning the data, shaping the data, preparing the data to build that pipeline. So for most customers, it's critical that they get that end-to-end, well-oiled, well-connected, integrated solution rather than taking a very isolated solution from each vendor. So to answer your question: yes, every vendor is going to move into ingest, data cleansing, transformation, building the pipeline, and visualization; if you look at those five steps, they have to be together.

But just building the data cleansing and transformation, having it native to your own pipeline, doesn't sound like it's going to solve the problem of messy data that needs human supervision to clean up.
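The cleaning and shaping work that eats "80% of a data scientist's time" often starts with steps as mundane as filling in missing fields. A toy sketch of rule-based imputation; the field names and rules are invented, and a real pipeline might use a trained model for this step instead:

```python
# Fill a missing `gender` field from other signals in the record;
# records that can't be resolved are flagged for human review.
def impute_gender(record):
    if record.get("gender"):
        return record["gender"]
    procedures = set(record.get("procedures", []))
    if "prostate_exam" in procedures:
        return "male"
    if "gynecology_visit" in procedures:
        return "female"
    return "needs_review"  # human supervision takes over here

records = [
    {"gender": None, "procedures": ["prostate_exam"]},
    {"gender": "female", "procedures": []},
    {"gender": None, "procedures": []},
]
cleaned = [dict(r, gender=impute_gender(r)) for r in records]
```

The "needs_review" escape hatch is the point: automated cleansing narrows the messy cases down, but some residue still needs a human.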
Right, oh yeah. I mean, there is some level of human supervision needed. I'll give you an example. When data from an insurance company comes in, a lot of times the gender could be missing. How do you know if it's a male or a female? Then you've got to build another model to say, if this patient has gone for a prostate exam, it's a male; if she's visited a gynecologist, it's a female. So you have to do some inference work in there to make sure the data is clean, and then there's some human supervision to make sure this is good for building models, because when you're executing the pipeline in real time, it's all based on the past data, so you want to make sure that data is as clean as possible to train the model that you're going to go execute on.

So let me ask you, turning to a slide we've got about complexity, first for developers and then for admins. If we take the steps in the pipeline as ingest, process, analyze, predict, serve, and the products or product categories as Kafka, Spark Streaming and SQL, a web service for predict, and MPP SQL or NoSQL for serve, even if they all came from IBM, would it be possible to unify the data model, the addressing and namespace, and, just ticking off a few that I can think of, the programming model, persistence, transaction model, workflow, testing, integration? I mean, it's one thing to say it's all IBM; it's another thing for the developer working with it to see it as one suite.

Yeah, so it has to be well integrated, and that's the benefit that IBM brings, because we obviously test each segment to make sure it works well. But when you talk about complexity, building the model is one part, the development of the model, but the complexity also comes in with the deployment of the model. Then we talk about the management of the model: how do you monitor it, right? When was that model deployed? Was it deployed in test, was it
deployed in production? And who changed that model last? What was changed in there? And how is it scoring, is it scoring high or low? You want to get a notification when the model starts scoring low. So the complexity is all the way across, from getting the data, bringing the data in, cleaning the data, developing the model, and then deploying the model. It never ends. And the other benefit that IBM has added is the feedback loop, which, when we talk about complexity, reduces the complexity. Today, if the model scores low, you have to take it offline, retrain the model based on the new data, and then redeploy it. Usually for enterprises there are slots where you can take it offline and put it back online, all those things, so it's a cumbersome process. But what we have done is add a feedback loop where we are training the model in real time using real-time data, so the model is continuously learning online.

Online learning, and champion/challenger or A/B testing to see which one is more robust?

Right, so you can do that. I mean, you could have multiple models where you do A/B testing, but in this case you can continuously train the models to say, okay, this model scores the best. Another point is that if you look at the whole machine learning process, there's the data, there's the development, there's the deployment. The development side is getting more and more commoditized, meaning picking the right algorithm; there are a lot of tools, including IBM's, where we can say linear regression is the right one for you to use for this. So that piece is getting less complex, I don't want to say easier, but less complex. But the data cleansing and the deployment pieces are the challenge for enterprises: when you have thousands of models, how do you make sure that you deploy the right model?

So you might say that the pipeline for managing the model is separate from the original data pipeline; maybe it includes
the same technology, or much the same technology, but once your data pipelines are in production, the model pipeline has to keep cycling through.

Exactly, yeah. The data pipeline could be changing. If you take a loan example, a lot of the data that goes into the model pipeline is static. My age, it's not going to change every day, I mean it is, but the age, my salary, my race, my gender, those are static data that you can take and put in there. But then there's also real-time data coming in: my loan amount, my credit score, all those things. So how do you bring that data pipeline, between real-time and static data, into the model pipeline so the model can predict accurately? And based on the score dipping, you should be able to retrain the model using real-time data.

I want to take you, Dinesh, to the issue of a multi-vendor stack again, and the administrative challenges. So here we look at a slide that shows, again, just me rattling off some of the admin challenges: governance, performance monitoring, scheduling, orchestration, availability and recovery, authentication, authorization, resource isolation, elasticity, logging, testing, integration. So that's the y-axis, and then every different product in the pipeline is the x-axis, say Kafka, Spark Structured Streaming and SQL, a web service, MPP SQL, NoSQL. So you've got a mess right now. Most open source companies are trying to make life easier for customers by managing their software as a service for the customer, and that's typically how they monetize, right? But tell us what you see the problem is, or will be, with that approach.

Great question. Let me take a very simple example. Probably most of our audience know about GDPR, which is the European law with the right to be forgotten, right? So if you're an enterprise and I come to you and say, George, I want my data deleted, you have to delete all of my data
within a period of time. Now, that's where one of the aspects you talked about, governance, comes in. How do you make sure you have governance across not just your data but your analytical assets, right? So if you're using a multi-vendor solution in all of that stack, let's take governance: how do I make sure that that data gets deleted by all these services that are tied together?

Let me maybe make an analogy to CSI. When they pick up something at the crime scene, they've got to make sure that it's bagged and the chain of custody doesn't lose its integrity all the way back to the evidence room. I assume you're talking about something...

Yeah, yeah, something similar, where the data asset moves between private cloud and public cloud, and the analytical assets that are using that data, all those things need to work seamlessly for you to execute that particular transaction to delete Dinesh's data from everywhere.

So it's not just an administrative task; these are regulations that are pushing towards more homogeneous platforms, right?

Right. And then, even if you take some of the other things on the stack, logging, monitoring: the platform provides some of those capabilities, but you have to make sure, when you put all these services together, how are they going to integrate? Do you have one monitoring stack? So if you're pulling your IoT kind of data to do a data center or whole-stack evaluation, how do you make sure you're getting the right monitoring data across the board? Those are the kinds of challenges that you will have.

It's funny you mention that, because we were talking to an old Lotus colleague of mine who was CTO of Microsoft's IT organization, and we were talking about how the cloud vendors can put a machine learning management application across their properties or their services. But he said one of the first problems you'll encounter is the
telemetry. It's really easy on hardware, you know, CPU utilization, memory utilization, noisy neighbor for I/O, but as you get higher up in the application services it becomes much more difficult to harmonize, so that a program can figure out what's going wrong.

Right, right. I mean, like anomaly detection, right? How do you make sure that you are seeing patterns where you can predict something before it happens?

And is that on the roadmap?

Yeah, so we are already working with some big customers to say, if you have a data center, how do you look at patterns to predict what can go wrong in the future? Root cause analysis, I mean, that is a huge problem to solve. Say a customer hit a problem, you took an outage: what caused it? Today you have specialists who will come and try to figure out what the problem is, but can we use machine learning or deep learning to figure out, was it a fix that was missing, or did an application get changed that caused the CPU spike that caused the outage? The root cause analysis is the one that's the hardest to solve, because you are talking about people's decades' worth of knowledge, and now you're teaching a machine to do that prediction.

And from my understanding, root cause analysis is most effective when you have a really rich model of how, in this case, your data center infrastructure and apps are working, and there might be many little models, but they're held together by some sort of knowledge graph.

Oh yeah.

One that says here's where all the pieces fit, above these other pieces, below these, sort of as peers to these other things. How does that knowledge graph get built, and is this the next-generation version of a configuration management database?

Right. So I call it the self-healing, self-managing, self-fixing data center. It's easy for you to turn up the heat or the AC when the temperature goes down; I mean, those
are good, but the real value for a customer is exactly what you mentioned: building up that knowledge graph from different models that all come together. But the hardest part is that predicting an anomaly is one thing, and getting to the root cause is a different thing, because at that point you're saying, I know exactly what caused this problem, and I can prevent it from happening again. That's not easy, and we are working with customers to figure out how we get to the root cause analysis. It's all about building the knowledge graph with multiple models coming from different systems. Today, enterprises have different systems from multiple vendors. We have to bring all the monitoring data into one source, and that's where the knowledge graph comes in, and then different models will feed that data, and then you need to prime that data using deep learning algorithms, neural nets, to say what caused this.

Okay, so this actually sounds extremely relevant, although in the interest of time we're probably going to have to dig down on that one another time. But at a high level, it sounds like the knowledge graph is sort of your web or directory into how local components or local models work, and then, knowing that, if it sees problems coming up here, it can understand how they affect something else tangentially.

So think of the knowledge graph as a neural net, because it's just building new neural nets based on the past data, and it has that built-in knowledge where it says, okay, these symptoms seem to be a problem that I have encountered in the past; now I can predict the root cause because I know this happened in the past. So it's kind of like using that net to build new problem determinations as it goes along. It's a complex task, it's not easy to get to root cause analysis, but that's something we are aggressively working on in analytics.

Okay, so let me ask, let's talk about sort
of democratizing machine learning and the different ways of doing that. You've actually talked about the big pain points, maybe not so sexy but critical, which are operationalizing the models and preparing the data. Let me bounce off you some of the other approaches. One that we have heard from Amazon is that they're saying, well, data munging might be an issue, and operationalizing the models might be an issue, but the biggest issue in terms of making this developer-ready is: we're going to take the machine learning we use to run our business, whether it's merchandising fashion, running recommendation engines, managing fulfillment or logistics, and, just like they did with AWS, they're dogfooding it internally, and then they're going to put it out on AWS as a new layer on the platform. Where do you see that being effective, and where less effective?

Right, so let me answer the first part of your question, the democratization of machine learning. That happens when, for example, a real estate agent who has no idea about machine learning is able to come and predict the house prices in an area. That to me is democratizing, because at that point you have made it available to everyone; everyone can use it. But that comes back to our first point, which is having that clean set of data. You can build all the pre-canned pipelines out there, but if you're not feeding the right set of data into them, none of this works: garbage in, garbage out, that's what you're going to get. So when we talk about democratization, it's not that easy and simple, because you can build all these pre-canned pipelines that you have used in house for your own purposes, but every customer has very unique cases. If I take you as a bank, your fraud detection methods are completely different; my limits for fraud detection as another bank could be completely different. So there's always customization involved, and the data that's coming in is different. So while it's a buzzword, I
think there is knowledge that people have to feed in, there are models that need to be tuned and trained, and there's deployment that is completely different. So there is work that has to be done.

Okay, so what I'm taking away from what you're saying is: you don't have to start from ground zero with your data, but you might want to add some of your data, which is specialized or slightly different from what the pre-trained model used. You still have to worry about operationalizing it, so it's not a pure developer-ready API, but it up-levels the skills requirement so that it's not quite as demanding as working with TensorFlow or something like that.

Right, right. I mean, you can always build the pre-canned pipelines and make them available, and we have already done that. For fraud detection we have pre-canned pipelines; for IT analytics we have pre-canned pipelines. So it's nothing new; you can always take what you've done in house and make it available to the public or to the customers. But then they have to take it and do customization to meet their demands, bring their data to retrain the model; all those things have to be done. It's not just about providing the model, because every customer use case is completely different. Whether you are looking at fraud detection from one bank's perspective, not all banks are going to do the same thing. Same thing for predicting, for example, the loan, right? Your loan approval process is going to be completely different from my loan approval process as another bank.

So let me ask you then, and we're getting low on time here: if you had to characterize Microsoft Azure, Google, and Amazon as each bringing to bear certain advantages and disadvantages, and you're now the ambassador, not a representative of IBM, help us understand the sweet spot for each of those. You know, you're trying to fix the two sides of the
pipeline, I guess, thinking of it like a barbell. Where do the others, based on their data assets and their tools, need work?

So there are two aspects to it. There's the enterprise aspect. As an enterprise, I would look at it and say it's not just about the technology; there's also the services aspect. If my model goes down in the middle of the night and my banking app is down, who do I call? If I'm using a service that's available on a cloud provider which is open source, do I have the right amount of coverage to call somebody and fix it? So there are the enterprise capabilities, availability, reliability, and that is different from a developer who comes in with a CSV file and wants to build a QC model to predict something. Those are two different aspects. So if you talk about all these vendors, if I'm wearing an enterprise hat, some of the things I would look at are: can I get an integrated solution, end to end, on the machine learning pipeline? And that means end to end in one location, so you don't have network issues or latency and stuff like that; it's an integrated solution where I can bring in the data and there are no challenges with latency, all those things. And then, can I get the enterprise-level service, the SLAs, all those things? So there the named vendors obviously have an upper hand, because they are preferred by enterprises over a brand-new open source vendor that comes along. But then, within enterprises, there are lines of business building models using some of the open source vendors, which is okay, but eventually those have to get deployed, and then how do you make sure you have those enterprise capabilities out there? So if you ask me, I think each vendor brings some level of capability. The benefit IBM brings is, one, you have the choice, the freedom, to bring in cloud or on-prem
or hybrid. You have all the choices of languages: we support Python, Spark, I mean Scala, Spark ML, SPSS. So the choice, the freedom, the reliability, the availability, the enterprise nature, that's where IBM comes in and differentiates, and for our customers that's a huge plus.

One last question, and we're really out of time. In terms of thinking about a unified pipeline: when we were at Spark Summit, sitting down with Matei Zaharia and Reynold Xin, the question came up that Databricks has an incomplete pipeline, you know, no persistence, no ingest, not really much in the way of serving, but boy are they good at data transformation and munging and the machine learning. But they said they consider it part of their ultimate responsibility to take control, and on the ingest side it's Kafka, and the serving side might be Redis or something else, or the Spark databases like SnappyData and Splice Machine. Spark is so central to IBM's efforts; what might a unified Spark pipeline look like? Have you guys thought about that?

It's not there yet. I mean, obviously they could be working on it, but for our purposes Spark is critical for us, and the reason we invested in Spark so much is the execution engine, where you can take a tremendous amount of data and crunch through it in a very short amount of time. That's the reason why we also invest in Spark SQL, because we have a good chunk of customers who still use SQL heavily, and we put a lot of work into Spark ML. So we are continuing to invest, and probably they will get to an integrated end-to-end solution, but it's not there yet. As it comes along, we will adopt it if it meets our needs and demands and the enterprise capabilities. Definitely. I mean, we saw that Spark, the core engine, has the ability to crunch through a tremendous amount of data, so we are using it. I mean, 45 of our
internal products use Spark as our core engine. Our DSX, Data Science Experience, has Spark as its core engine. So yeah, today it's not there, but I know they're probably working on it, and if there are elements of this whole pipeline that come together that are convenient for us to use, at an enterprise level, we will definitely consider using it.

Okay, on that note, Dinesh, thanks for joining us and taking time out of your busy schedule. My name is George Gilbert. I'm with Dinesh Nirmal from IBM, VP of Analytics Development, and we are at the CUBE studio in Palo Alto. We will be back in the not-too-distant future with more interesting interviews with some of the gurus at IBM.