Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager at DataVersity. We'd like to thank you for joining this DataVersity webinar, How to Consume Your Data for AI, sponsored today by IBM. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we'll be collecting them via the Q&A box in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share highlights or questions on Twitter using the hashtag DataVersity. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout.

Now, let me introduce our speaker for today, Jay Lindberg. Jay is an IBM Distinguished Engineer and Director of Product Management at IBM. Working within the IBM Watson Data and AI organization, Jay is responsible for driving the business and technical strategy for the next-generation set of cloud-based cognitive governance and data management capabilities that power data science and self-service analytics. Jay is an expert in the fields of data governance, data management, and MDM, and has worked with some of IBM's largest clients to define and develop industry-leading and innovative solutions. As an IBM Master Inventor, Jay holds 17 patents in areas such as machine learning, mobile device interaction, and application generation. Jay, we're honored to have you with us today, so with that, I will turn it over to you.

Thanks, Shannon. Good morning, good afternoon, or good evening, everybody, wherever you are. I'm delighted to be here today to take an hour of your time and walk you through our view of how you can really start to leverage the data you've got across your organization and consume it for AI. I'm going to spend about an hour taking you through the role that we believe modern asset catalogs play in giving you that stepping stone into AI.

Okay, so let's get started. We all know that there's a huge amount of digital disruption taking place. We can see that the large organizations, the incumbents within their industries, are increasingly threatened by smaller, more nimble organizations that are able to take advantage of things like readily available compute and elastic scale through the cloud. These new kids on the block can really start to disrupt the traditional businesses that many of us find ourselves in. And so there's a growing recognition, in fact everybody is talking about it, that AI can be the thing that allows the disrupted to become the disruptors, letting the incumbents build AI and disrupt their own industries themselves rather than be disrupted. So what I really want to talk about today is how these organizations can start to apply AI technology inside their business processes to drive that disruption themselves. But there are a huge number of challenges that hinder progress towards that. We know that the growth of AI is going to be prolific, and you can see some quotes there about the type of growth we're talking about.
This is really a new wave of technology that we're at the beginning of here, taking us into the early 2020s. And these challenges are the things that are already starting to inhibit how enterprises can move towards AI, and they're the things we believe we're able to start to tackle to remove some of those roadblocks. But let's talk about them a little. On the data side of things, it's the traditional data management problems, right? Data residing in silos, unstructured data, structured data, the differences in how data is persisted. How can we organize that data, understand it, and then put it in a format which is ready for consumption by our data science team? How can we apply governance in a modern way that allows and encourages self-service access to information by our data science community? How can we improve the skills associated with data science and building AI? There's a huge amount of growth taking place that's driving the need for data science skills, and those skills are in incredibly short supply. So what can we do through technology to lower the entry point to becoming productive with data science and AI in spite of that skills shortage? And then finally, how can we provide tools and infrastructure that allow you to fail fast? How can you innovate around your data, and try out how your data can be consumed by AI inside those business processes, without spending a huge amount of time, money, and energy on large projects that only fail at the very end? How can we fail fast?

Let's look at a typical high-level flow of what it takes to take a model and bring it into production. Well, as I said, there's the data problem. There are huge amounts of enterprise data being produced. That may be data we have in our traditional systems, data being generated by sensors, external data sources, cloud-provided systems, and so on. All this data is being generated, and it requires governance, obviously, to make sure it is used and consumed in the correct way. And once we've understood that data, we need to put it into the hands of our data scientists, who are going to build out our AI models. They not only need access to data; they then need tools that let them shape and cleanse that data and get it into a format where they're ready to use it to train those models. In terms of the actual model building, there's a whole host of technologies and tools available, and data scientists each have their favorite tools, their favorite libraries, and so on. There's really not much standardization there, so they want to be able to work in the tools that suit their needs. And finally, once they've built their models, they need to continuously train them, push them into production, and have them monitored in production. So there's an entire lifecycle there that is relatively new to some organizations. Each step in that lifecycle requires complex processes to be put in place, and it's therefore extremely difficult to get to a robust, efficient practice for doing data science and building AI.
And so this is where IBM has tried to reduce some of those barriers and make things simpler. IBM Watson is our AI platform for business, and we have a whole host of capabilities designed to drive AI into your organization and make it a slick, efficient AI and data science practice. Watson includes out-of-the-box business solutions and applications that allow you to deploy chatbots and improve your customer care experience at the front end. But for the purposes of today, we're really going to focus on the Watson Studio layer. Watson Studio is our building environment, our AI development environment. It lets you easily support the full end-to-end lifecycle it takes to create AI and deploy those models into production with continuous learning. Underpinning the entire Watson Studio is our Watson Knowledge Catalog, and this is where we're going to focus most of our time today. We took a very early decision with our Watson platform to bake into it an intelligent asset catalog, one that ensures all of the relevant information your data science practice needs, to be efficient and effective at building models quickly, is baked into the platform, so that the data and AI assets that need to be consumed for data science are readily available and easily discovered. Watson Studio and the Watson Knowledge Catalog together really support that end-to-end lifecycle.

So let's talk in a bit more detail about the day-in-the-life data science capabilities that are required here. If I'm a data scientist, I may get a new project where I need to build a model to predict customer churn, for example. The first thing I need to do as a data scientist is find the relevant data. I need to connect to different data sources, access that data, and figure out whether it's structured or unstructured and how I need to work with it. And that's actually a very big obstacle and a time-consuming drain on my path to creating my overall model. So with our Knowledge Catalog, we've decided to focus on alleviating those first pieces. Once I've found the data, I then need to prepare it and do some analysis. Included in the Studio is our Data Refinery capability. Data Refinery is our self-service data preparation tool, so I can use it to shape and understand the data; I can visualize different aspects of the data set and ensure it's in a format that is optimized for building my models. Then I can use the Studio to start building out those models. We've built an easy-to-consume set of tools that leverage a whole bunch of open source capabilities and frameworks as well, where, at one end of the scale, as a very deep technical data scientist, you can go straight in and start writing Python and R scripts inside the tools.
But in the spirit of lowering the entry point to the platform and addressing some of the skills shortage, we've also provided lots of wizards and lots of help, so that if you haven't come from a data science background, or you're not that strong on the coding side, you can really start to click around and build out some robust machine learning and deep learning models through the tool. And once you've built those models, you can click a button and deploy them into the Watson Machine Learning runtime. Once they're deployed, you can monitor and maintain those models very easily, tracking things like accuracy and bias detection and everything else you need as you run these models in production. So the Studio and the Catalog combined really solve a number of challenges. They allow you to find your data very, very quickly and understand it, so that you can consume it and use that insight to build a robust set of models and data science assets that you can deploy and run in production.

I'm going to focus on the data side of that diagram, because I think this is where many of our customers are really starting to figure out how they can get into the AI space. We estimate, and this is a widely known figure, that data scientists typically spend in the region of 80% of their time just looking for data. That's huge, because more and more customers recognize they need to do AI, so they're investing in spinning up data science practices, hiring the best data science talent available to them, using the best data science tools they can find. And those individuals are really not demonstrating their value to the company, because they're struggling to find information. So we wanted to provide a way in which our platform could solve that problem for the business, by making data accessible and discoverable to data scientists much more quickly, really starting to move the needle on that figure so that our data scientists actually spend much more of their time doing data science.

The other thing is, we have endless customers that have spent many years building out their data lakes. And very, very few of those customers have told us that their data lake has delivered all the promise they thought it would when they set out on the initiative. There are a number of reasons why this has become a challenge. They were sold on the idea that data lakes, as a concept in the industry, would allow you to put all of your data in the hands of individuals, and it hasn't really happened. Here are some of the reasons. Number one, lack of data governance. If we've got a whole bunch of data that's important and I'm the data owner for it, well, maybe that's sensitive data and it needs to be governed. I'm not going to allow that data to be released into a data lake, because I'm ultimately responsible for its governance. If I put it in a data lake, then anyone can access it, so I can't release it: you're not having my data. As a result, these data lakes haven't really built up a view of all the information that's available across the business. Second to that, it's actually quite a long, complicated process to bring data into a data lake.
Once you've discovered the data and gone through the business-level approvals to get it in, you then need to start looking at the data, looking at data mappings, and maybe doing some data masking or cleansing to move that data into the data lake and make it available. That can take in the region of three to four months. So bringing one source in takes that long; what about the 20 other data sets I've got, or the 50 other data sets I want to bring into the data lake? It takes a huge amount of time. Spiraling Hadoop costs have also been a factor. We've had more and more clients put in requests for new Hadoop clusters to grow the data lake, and these requests now seem to be trending towards being pushed back, because the CFOs are saying, hey, where's the value in this? Why do I keep throwing money at this thing when you're not demonstrating value? So there are some challenges there. And the final challenge, for some of our customers that have gone a little bit further down that road: they've got these data lakes running, there are good amounts of data in there, there's good awareness of that data, and they're finding that the data is not being consumed efficiently, or it's not being consumed at all. That's because nobody can find anything. The users being told to consume data from the data lake aren't actually able to find it. They don't trust it. They don't know its provenance. It's the data swamp story that I'm sure many of us are aware of. So how could we, not replace these data lake initiatives, which are incredibly important, but ensure that the information inside the data lake is consumable by the data science practice? Again, this is where we believe intelligent data catalogs fundamentally give you the way to transition to AI.

This curve is a very high-level view of where we think this evolution to AI comes from for our clients. We've worked with our clients for many years, and we have some fantastic products that allow you to run your information architecture. They focus on understanding where your data is and what your data is, providing intelligence and understanding around it, collecting data from your operational systems, tracking your operational lineage, feeding that information into your data lake. These are traditional metadata management systems: it's about collecting, understanding, and governing. But to get to the right-hand side of this diagram, where you really have machine learning and AI as a practice and everybody can access and understand this information, it requires a different way of thinking about data cataloging. It requires you to think less about the process of collection and more about the sharing and the activation of that data across the business. The analogy I think about is an online marketplace, eBay, Amazon, whoever: they have systems that allow them to inventory all of the things they sell through those marketplaces, but they don't open those systems up to us as consumers.
For us, they build a marketplace portal where we can easily go in and understand which products are available for us to buy and use. And that's really the difference here with a data catalog. It's providing a business-user-friendly interface where people can easily find data, understand whether it's relevant to them, and get access to it easily, with a whole bunch of other capabilities that put the data in the hands of those business users. Because without that, you're not going to be able to make your data scientists more efficient, allow them to find data more effectively, and let them consume that information for AI and data science. And going back to my earlier point on the first chart: for the organizations that are struggling to disrupt, or worried about disruption from others, we truly believe it's the organizations able to make the most efficient use of AI, baking it into their business processes and developing new business models, that are going to be the disruptors and ultimately the leaders in their market space.

Okay, so let's talk more specifically about our own intelligent asset catalog. The Watson Knowledge Catalog is, as I said, part of the Watson portfolio, and I really talk about it through three key bullet points. Let's start with discovery. Discovery is really the creation of the catalog. We make it really, really easy to populate your asset catalog and discover all of the sources of information you have across your business. Those sources could reside on IBM Cloud or on somebody else's cloud; they can exist behind your firewall; they can be IBM data sources and non-IBM data sources. We can connect to a whole host of different data sources, and once we connect them, we classify and profile the data that's available there. So as we connect to a database, for example, we will run our own profilers, and we'll detect that there are credit card numbers in this data source, or social security numbers, or PII data. We store all of that information inside our rich metadata index, alongside the technical metadata we discovered in that data source. We can also catalog unstructured data sources. With unstructured data, we profile the contents using our natural language understanding APIs, pulling out the key entities from that unstructured information, and again we store that inside the rich metadata index. We also integrate with metadata management systems. We obviously integrate as a first class with IBM Information Server, so we can populate the catalog with the information that's already been collected inside our clients' ecosystems, but we have APIs available so you can integrate your other metadata systems into this catalog as well.
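To make that profiling step concrete, here is a minimal Python sketch of the kind of pattern-based classification a profiler might run over sampled column values. The data classes, patterns, and threshold are illustrative assumptions, not the Watson Knowledge Catalog's actual classifiers, which are far richer.

```python
import re

# Illustrative data-class patterns; a real profiler uses much richer
# classifiers than these regexes.
DATA_CLASSES = {
    "credit_card": re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_column(values, threshold=0.8):
    """Return the data class whose pattern matches most sampled values,
    provided the match rate clears the threshold."""
    sample = [v for v in values if v]
    if not sample:
        return None
    best_class, best_rate = None, 0.0
    for name, pattern in DATA_CLASSES.items():
        rate = sum(bool(pattern.match(v)) for v in sample) / len(sample)
        if rate > best_rate:
            best_class, best_rate = name, rate
    return best_class if best_rate >= threshold else None

# Example: profile one column and record the result as rich metadata.
column = ["4111-1111-1111-1111", "5500 0000 0000 0004", ""]
metadata_entry = {"column": "payment_number",
                  "data_class": classify_column(column)}
print(metadata_entry)  # {'column': 'payment_number', 'data_class': 'credit_card'}
```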
Once we've done the discovery and we've got this rich view of all the information across the business, we can roll into the catalog phase. The catalog phase is where we open all that information up to our business users. We have a built-for-purpose, business-friendly, shopping-for-knowledge portal where anybody can go and find the information that's available to the business. But it's really important that the portal has certain capabilities to drive the intended outcomes. First of all, we catalog more than just data. If we're going to improve data science and make it easier and more efficient to do, then we need to make sure we can catalog all information assets. So it's not just data: we can catalog notebooks that are being created, machine learning models, analytical dashboards, a whole host of different assets that can be added to the catalog and shared and reused to drive those efficiencies for our knowledge workers. But also, because we've captured all this rich metadata, we know where our data is, what source it came from, and what it is, whether it's credit card numbers and so on. We store that information and provide our own machine learning model around it: we've built a Watson recommendation engine into our catalog experience, so we can start to make suggestions about other assets that could be relevant to individual users. Think about what Netflix does for TV or Spotify does for music. If I watch a TV show on Netflix, Netflix will start to suggest other shows I may be interested in based on my behavior, all powered by AI. We do the same thing for data. Based on my usage of the system, the types of data I work with, and the other individuals I work with, Watson will start to suggest other assets that I may never have found on my own. It's exactly the same as Netflix: I would never have known I needed to go and watch Breaking Bad, but because it was recommended to me, I found a TV show I really enjoyed. The same is true for data. The most valuable data may exist in a silo in a completely different part of the business, but the Watson recommendation engine will surface those things up to me, and maybe, just maybe, that would be the best data in my entire business to train my model more efficiently than anybody else in my industry. We use the intelligent metadata to do that.
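As an illustration of the idea, and emphatically not the Watson recommends algorithm itself, here is a toy Python sketch of recommending catalog assets from co-usage alone. The asset names and usage log are invented, and a real engine would draw on profiling metadata, collaborators, and project context as well.

```python
from collections import Counter
from itertools import permutations

# Toy usage log: which catalog assets each user has worked with.
usage = {
    "ana":   {"sales.csv", "churn_model.ipynb", "regions.parquet"},
    "bram":  {"sales.csv", "regions.parquet"},
    "chloe": {"sales.csv", "campaigns.xlsx"},
}

# Count how often each ordered pair of assets is used by the same person.
co_use = Counter()
for assets in usage.values():
    for a, b in permutations(assets, 2):
        co_use[(a, b)] += 1

def recommend(user, top_n=3):
    """Suggest assets the user hasn't touched, ranked by how often they
    co-occur with assets the user already uses."""
    mine = usage[user]
    scores = Counter()
    for seen in mine:
        for (a, b), n in co_use.items():
            if a == seen and b not in mine:
                scores[b] += n
    return [asset for asset, _ in scores.most_common(top_n)]

print(recommend("bram"))  # ['churn_model.ipynb', 'campaigns.xlsx']
```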
The final point on the catalog is that it really has to be integrated for productive use. Just having a catalog where people can find information is great, but then what? What do I do with it? How does that help me do data science? So we've integrated our catalog with our productive-use tools. I can go and find the best information, then click a button and it's there, available for me to use for data preparation and data shaping, or to use inside a notebook, or to train a model, at one click of a button. It goes back to my marketplace example: if I were a user of an online marketplace and I found the perfect coffee machine, but I wasn't able to add it to my shopping cart, that would be kind of useless. With our catalog, you can add it to your shopping cart and use it productively at the click of a button.

And then finally, on the activate piece: this is really how we've rethought data governance. Data governance is all about protecting, understanding, and ensuring that data is used in the correct way, and it's absolutely, incredibly important that we still do that. However, if we focus on the consumption of data, how can we turn governance into something that really enables consumption? Included in our catalog is our active policy engine. This active policy engine uses the rich metadata index to know what the data is, and then uses the policies that have been defined by the business to ensure the data is used in the correct way. Let me give you an example. Say I have a data set with 20 columns, and one of those columns contains credit card numbers. And there's a policy that says I'm in a division or a department that shouldn't see credit card numbers. In the old world, that data wouldn't have been available to me at all. But in that data set of 20 columns, the 19 other columns could be the most relevant data, the best data I need to build the best data science model or run the best analytics. So rather than lock that data away, the policy engine determines what I'm trying to do with the data, who I am, and what the data is, and masks those credit card numbers on the fly so that I can't see them, while opening up the other 19 columns for me to use. And if somebody else has different rules applied to them, they may be able to see those credit card numbers. I can then use the data shaping and preparation tools to get an extract of that data set tailored to me, with the correct policies and rules applied. It's really powerful to be able to apply this to the data so that we can open up more data. Because as I said, this is all about driving those efficiencies, putting the most relevant and most important data easily into the hands of our knowledge workers so they can consume it efficiently and effectively and deliver those better business outcomes.
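Here is a minimal Python sketch of that on-the-fly evaluation, assuming columns have already been tagged with data classes by the profiler. The policy shape, role names, and mask token are illustrative assumptions rather than the actual Watson Knowledge Catalog policy model.

```python
# Illustrative policies: (data_class, role, action).
POLICIES = [
    ("credit_card", "marketing_analyst", "mask"),
    ("credit_card", "fraud_investigator", "allow"),
]

def action_for(data_class, role):
    for cls, r, action in POLICIES:
        if cls == data_class and r == role:
            return action
    return "allow"  # columns without a matching rule stay visible

def apply_policies(rows, column_classes, role):
    """Return rows with governed columns masked for this role; the other
    columns stay open rather than the whole data set being locked away."""
    out = []
    for row in rows:
        out.append({
            col: "****MASKED****"
            if action_for(column_classes.get(col), role) == "mask" else val
            for col, val in row.items()
        })
    return out

rows = [{"customer": "J. Doe",
         "card_number": "4111-1111-1111-1111",
         "region": "EMEA"}]
classes = {"card_number": "credit_card"}
print(apply_policies(rows, classes, "marketing_analyst"))
# [{'customer': 'J. Doe', 'card_number': '****MASKED****', 'region': 'EMEA'}]
```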
Now let's take a look at the value proposition and some of the stories we've had with clients in this space. We've been working really closely with some of our clients as we've developed this concept, focusing on how data catalogs can be the key to how you consume your data for AI. The first three use cases here are related: they're all about driving data science efficiencies, but the detail matters. The first is about putting data in the hands of users much more quickly and efficiently. As a data scientist with an intelligent asset catalog, I can find data quickly, I can have data suggested to me quickly, and, importantly, I can understand the data quickly: I can visualize it easily and understand what's there and what the different types are. The second is about how that feeds into the overall accuracy of our models. We discovered that a typical behavior of these fledgling data science teams is that they may have two weeks to build a project and deliver a model. In that two-week period, if they're spending 80% of their time looking for data, they get to the middle of the second week and think: I've got to get this done, I've got to deliver this project, so the data I've got now is good enough; I've run out of time. Do you really want your machine learning and AI models built on data that is merely good enough? Is that going to give you the type of differentiation you need in your business? We didn't think that was acceptable, and that's another good reason why we wanted to focus on making data consumable and understandable. The third use case is around driving a data-driven culture across the consumers of your data. We discovered that in a typical data science project, say a project running for two weeks, the data science team would actually create some very valuable assets. The thing they cared about at the end of it was delivering the model, making sure it was delivered to the correct individuals. But to get to the point of building that model, they gathered data, maybe did some data shaping, wrote some notebooks, maybe built some models they optimized and then decided not to use. All of those assets are extremely valuable if they're fed back into the catalog, where they feed the Watson recommends algorithms and may be useful to someone else in the business. You can jump-start other projects and drive those efficiencies by sharing the artifacts created in earlier data science projects. That information was being thrown away and lost, so every project that got started was starting from day one. Now these assets are easily shared back into the catalog at the click of a button, available to be discovered and consumed. Number four is more of a CDO play, really. If we're building out these intelligent views of data and AI assets, and we're classifying them and we understand where our data is, then we can start to answer questions like: where is all my PII data? Who's consuming that data? Which data assets are most relevant to my business, and which are not? That gives you a view of the actual value of your data and its different parts, which is incredibly important. And then number five is, as I mentioned earlier, the data lake initiative. We don't want to replace your data lakes. We want to allow you to extract more value from them, so that the knowledge inside those data lakes can actually be consumed. You can layer an asset catalog on top of those data lakes. It can catalog your data lake along with other data that's not in your data lake, and then you can surface that up in a business-friendly way with all of the intelligence around it, feeding the recommendation engines and the productive-use tools, so you can really put that data in the hands of your data science community. So this is just a higher-level view of what we're able to do with our Knowledge Catalog. It's really about building an intelligent metadata index of all of your information. And like I said, that data can reside wherever you want it to reside. You do not have to move your data into IBM Cloud to make it part of the catalog. If your data resides in an on-premise Oracle database today, or your Teradata warehouse, or on AWS, your data can stay there. What we do is allow you to catalog it, understand it, and provide it in a central index that is easily consumable for your data science, your machine learning, your deep learning, and the consumption of that data.
And importantly, it allows you to easily start to bring in new sources of information. If you want to bring in your social data or your sensor data, you can do that very easily. If you have department-level spreadsheets and information that otherwise would not be discoverable across the company, that data can be brought in, used, and made available to its consumers. Okay, so some of the differentiating capabilities. This is more of a summary, really, but these are the key things we believe are truly important to consider when looking at how catalogs can get you to AI. This is cataloging for a reason: we are cataloging to ensure that you are able to do AI quicker, smarter, and better than anybody else, delivering the most efficient models into production using the best knowledge you have available to you. To do that, we've built in, as I said, an AI-powered recommendation engine. We've called it Watson recommends, but we're effectively using AI, using Watson, to improve your own AI by ensuring the best, most relevant data is available to be used. There's a whole host of social collaboration capabilities that can be used as part of that, which further train the AI model. And really, the more interaction you have in the system, the more users and data you've got, the better that model gets at making recommendations. These catalogs have to focus on AI. As I said, we're cataloging for a reason, and therefore it has to go beyond data. The catalog has to make sense of the data, the models, the notebooks, the connections you have available, everything at the enterprise level that works with data. Having those things cataloged, reused, and easily consumable is extremely key. Closely related to that is the integration with productive use. As I said, a marketplace with no add-to-cart button is frustrating. So, from a data perspective, if you find data, it's extremely important that it can be easily used and brought into your project so you can do something productive with it. Otherwise, if you have to go through importing and exporting data and requesting access, and so on, you're not going to get the types of benefits that can drive productivity into your data science practice. Then there's modern policy activation. This is the fresh take on data governance: how we can use the intelligence of our data, along with this activation engine, to mask data on the fly or deny access on the fly, and how we can use that to open up access to more information. Structured and unstructured: catalogs have to ensure they've got a full view of all of your information. Structured and unstructured data is handled very differently at the collection phase, for really important reasons. However, as a consumer of data, as a data scientist, I really don't care whether it's structured or unstructured; I need to get the information out of it in the most efficient way. So using Watson Natural Language Understanding to extract the key entities, sentiments, and concepts from documents, so I can use those alongside structured data to further inform and train my models, is extremely important.
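For a feel of what that extraction step can look like in code, here is a short sketch against the ibm-watson Python SDK. The API key, service URL, version string, and file names are placeholders, and the exact option names should be checked against the current SDK documentation.

```python
import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, KeywordsOptions, ConceptsOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholders: substitute your own credentials and service URL.
nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlu.set_service_url(
    "https://api.us-south.natural-language-understanding.watson.cloud.ibm.com")

with open("contract.txt") as f:
    document_text = f.read()

# Pull out entities, keywords, and concepts so a data scientist can use
# the document alongside structured data without reading it end to end.
result = nlu.analyze(
    text=document_text,
    features=Features(entities=EntitiesOptions(limit=10),
                      keywords=KeywordsOptions(limit=10),
                      concepts=ConceptsOptions(limit=5))).get_result()

# Persist the extracted metadata as JSON, ready to catalog and consume.
with open("contract_entities.json", "w") as f:
    json.dump(result, f, indent=2)
```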
And then finally, your data your way. You don't need to move your data into IBM Cloud. Of course, we would love you to move your data to IBM Cloud, but you don't have to. Your data can reside where it is today, and we're able to provide the same level of intelligence around it and put it in the hands of your data scientists so you can start to build AI for business.

Okay, so I was going to do a demo, but I didn't think we really had time for that. So, just to show you the realness of this, I've got some screenshots I'm going to take you through so you can get a feel for the type of experience we're talking about. This is all available on IBM Cloud: you can log on to IBM Cloud today, click a button, and provision an instance of this. It's completely free for you to try, so you can try out some of these capabilities yourselves. With Watson Studio, as I said, we're integrated for productive use, so you can find information in the catalog, click a button, and it's available inside your data science environment in Watson Studio. Watson Studio has a whole host of different tools and libraries available, right through to SPSS, so you can do data science in the way that you want to. We've got a whole host of different runtimes: whether it's TensorFlow or Caffe or PyTorch, you can use the runtime of your choice inside that environment, so you can work with the data using the tools you want to build the models that matter most to you. It's also integrated for productive use for data preparation. Like I said, if I go and find data and I've found the right thing, I may actually want to do some prep on it, and going into the Refinery allows me to do that. There's a wizard-based approach I can use as part of the Refinery: I can select a number of operations through wizards and apply those operations to the data. But if I'm more of a coder, I can use scripts: I can write my own transformations and build my pipeline of operations in code. Also inside the Refinery, it was really important that we put in a capability that lets data scientists understand their data before they start to use it, so we've got a whole host of visualizations you can apply to your data, so you can be confident that this is the right data to work with. And then once you've built your pipeline and laid out your shaping operations, you can click a button, and it will connect to the source, apply those transformations to the entire data set, and deliver the extracted data set, with all those shaping operations applied, as a CSV file, or write it to another database somewhere. And if the policy engine has been enabled and there's data in there that I shouldn't be able to see, that data will be masked as well. So I get a tailored, masked, governed extract of the data set that I know I'm safe to go and use for my data science project.
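As a rough picture of what such a shaping pipeline amounts to, here is a small pandas sketch, not Data Refinery itself: a declared list of operations applied to a full source and delivered as a CSV extract. The specific steps, column names, and file names are illustrative assumptions.

```python
import pandas as pd

# A declared shaping pipeline, in the spirit of Data Refinery's ordered
# list of operations; the steps themselves are illustrative.
def drop_empty_rows(df):
    return df.dropna(how="all")

def normalize_country(df):
    df = df.copy()
    df["country"] = df["country"].str.strip().str.upper()
    return df

def filter_active(df):
    return df[df["status"] == "active"]

PIPELINE = [drop_empty_rows, normalize_country, filter_active]

# Connect to the source (here a CSV stands in for a database connection),
# apply every operation to the full data set, and deliver the extract.
df = pd.read_csv("customers_raw.csv")
for step in PIPELINE:
    df = step(df)
df.to_csv("customers_shaped.csv", index=False)
```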
I talked a little bit about our recommendation engine. Here you can see the recommendation bar along the top of the screen. The Watson recommendation engine really starts to learn from the digital exhaust of the system. It understands what the data is and how it's been profiled, and once we've done that, we can see how it's related to other data; we can see who's using data with other data; it can see what I'm doing with data and who I'm working with. All of this information gets fed into our model, and it learns over time, so it starts to improve its recommendations. When I use Netflix on day one, it makes some standard suggestions, but the more I watch, the better those recommendations get, and it's exactly the same with this engine. And the reason we built it was not just because it's cool, but because it's all about putting the most relevant data in the hands of the data scientist as quickly as possible. We've also brought in some social collaboration features. This is all about consuming data, and the best people to determine whether data is valuable are actually the consumers. So we've put the curation of the data in the hands of the consumers, the people using it the most, with social capabilities so people can like and rank and comment on data. Of course, all of that exhaust feeds into the Watson recommendation engine as well, so as new data sets come in and get recommended, you can very quickly see that there must be something good about this data set for this particular purpose, and that guides people to the best data for their purpose. I mentioned unstructured data. We announced at the Think conference back in March that we brought unstructured data in, and these are just some screenshots of it. Not only can you bring in unstructured information, with viewers for the unstructured documents, but the key thing that makes it valuable to the consumer of the data is that we're running Natural Language Understanding over it. The data scientist doesn't have to read the document; they haven't got time for that, they've got models to build. Using the AI-based entity extraction, we automate that job for the data scientist: it pulls out the key terms, the concepts, the sentiment, the emotion of the document, and all of that is stored inside a JSON file. That JSON file can then be consumed for data science alongside any other data sources. So, like I said, we don't care that it's unstructured; what we care about is whether the information in there is relevant to me completing my task of building a model. I talked about data masking. This is really cool; we're pretty proud of this one. Working alongside IBM Research, we were able to redefine how we could open up more data than ever before. As individuals use the system, it determines whether or not data should be masked based on the rules that have been created, and that data is masked on the fly. And we're bringing in different masking algorithms all the time. Currently we've got a hashing of values and a randomization of values, and we're about to bring in a redaction, an X-ing out of values, as well. So there's a whole bunch of different transformations we're putting in place to further enhance that.
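The transforms below are a minimal Python sketch of those three flavors of masking, hashing, randomization, and X-ing out, using only the standard library; the real catalog's algorithms are more sophisticated, so treat these as illustrations of the idea.

```python
import hashlib
import random
import string

def mask_hash(value: str) -> str:
    """Deterministic: the same input always masks to the same token,
    so joins across data sets still line up."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]

def mask_randomize(value: str) -> str:
    """Replace each digit or letter with a random one, preserving format."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_letters))
        else:
            out.append(ch)
    return "".join(out)

def mask_redact(value: str) -> str:
    """X out everything but the separators."""
    return "".join("X" if ch.isalnum() else ch for ch in value)

card = "4111-1111-1111-1111"
print(mask_hash(card))       # deterministic 16-char hex token
print(mask_randomize(card))  # e.g. a random value like 7302-9518-0046-3371
print(mask_redact(card))     # XXXX-XXXX-XXXX-XXXX
```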
But the key thing is that it really allows you to open up more information and make sure the most relevant information is in a format that can be consumed safely by the data science team, so we can build those smarter models for our business. Another good example of how we're really thinking about the consumer is what we've done with data quality. Data quality is extremely important for the collection and understanding of the data we have across the enterprise, but we wanted to think about what data quality means for a consumer of information. So we came up with the concept of treating data quality as a currency. Everybody on this call knows that 10 bucks is 10 bucks; they know the value of 10 bucks and what it means when they go into a store to buy something. Let's apply that same concept to data quality. What if data quality could become a trusted currency, something we could use at a quick glance to determine whether or not this data is of value to me? So that's what we've done: our data can be used to generate a quality score, and that score can then be used to determine whether the data is relevant. Because if I'm a data scientist, maybe I want data that's a little bit rougher, data that has some anomalies in it. But if I'm a business analyst and I've got to produce a report for my boss, I want more accurate data, data that scores over, say, 90 on this trusted index. And then we can use this alongside the policy engine, so we can say: you know what, maybe I'm a business analyst and I'm not allowed to use data that's lower than 90% quality. So we can apply the policy engine together with this concept of a data quality currency to further inform users whether or not data is relevant to them.
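Here is a small Python sketch of the currency idea: a single 0-to-100 score computed from completeness and validity checks, then gated by a per-role threshold. The dimensions, checks, and thresholds are illustrative assumptions, not the product's scoring model.

```python
# Score each required column of each row for completeness and validity.
def quality_score(rows, required_cols, validators):
    total_checks, passed = 0, 0
    for row in rows:
        for col in required_cols:
            total_checks += 1
            value = row.get(col)
            if value is None or value == "":
                continue  # completeness failure
            check = validators.get(col)
            if check is None or check(value):
                passed += 1
    return round(100 * passed / total_checks) if total_checks else 0

rows = [
    {"id": "1", "email": "ana@example.com"},
    {"id": "2", "email": "not-an-email"},
    {"id": "3", "email": ""},
]
score = quality_score(rows, ["id", "email"], {"email": lambda v: "@" in v})
print(score)  # 67: four of six checks pass

# Policy gate: a business analyst's report needs data over the 90 mark,
# while a data scientist hunting anomalies might accept a lower score.
MIN_SCORE = {"business_analyst": 90, "data_scientist": 50}
role = "business_analyst"
print("usable" if score >= MIN_SCORE[role] else "below quality threshold")
```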
So I'm getting towards the end now. One of the other points I wanted to cover is the IBM Cloud and IBM Cloud Private on-premise deployment models. IBM is one of the only vendors able to provide a proper hybrid cloud capability. All of the collection and understanding of data can occur using IBM Cloud Private for Data, and we have synchronization built in, so you can synchronize that collection and understanding across into Watson Studio and then use that data for AI. It was really important that we built this knowing that a lot of our clients' data is on-premise or in a private cloud, so we needed to integrate as a first-class citizen with those systems for collecting, understanding, and governing data, while letting clients put that data in the hands of data scientists for consumption. These capabilities are completely complementary: we can use our unified governance capabilities alongside our Watson capabilities, so we can collect and understand data and then consume it for AI. We've built a first-class synchronization between the two that allows you to use data in the way you want to. And of course, when ingesting into those systems, we can also bring in industry domain experience, as well as integrating with third-party metadata management systems, so you can take advantage of the entire integrated portfolio to further govern and consume all of your data sources. We don't see a world of one uber-catalog. We believe in many catalogs serving many purposes, but able to integrate and share intelligence between them, so that users and consumers, whether for data science or anything else, can use a set of tools that are fit for purpose and aimed at empowering the people consuming the data.

Okay, so just to wrap up: I started on a slide about digital disruption. Digital disruption is happening; we've all seen it. And AI is the way in which we believe our clients can remain competitive and be the disruptors. To be efficient at AI, to be the team building the best, most accurate chatbot, or the most engaging customer experience, or models predicting customer churn, whatever it may be, you need to make sure you're taking advantage of your data. Because the one thing enterprises have over the smaller start-ups is the volume of data. Ensuring that you can understand that data and put it in the hands of data scientists, who become extremely efficient at data science because they have self-service access to that information, is fundamentally key to how you win in your industry in the world of AI. To do that, you need an intelligent catalog that allows these individuals to consume data extremely easily. And as I mentioned, it's available today on IBM Cloud. You can go and use it, and we would love to hear your feedback. Thank you very much.

Jay, thank you so much for this great presentation. We've already got a lot of questions coming in, so feel free to submit questions in the Q&A section for this portion of the webinar. And to answer the most commonly asked question: I will be sending a follow-up email to all registrants by end of day Friday with links to the slides, links to the recording, and anything else requested throughout. So, diving right in here, Jay: what is the best approach for developing a data catalog inside an organization with hundreds of potential data sources as well as multiple pockets of analytics teams? In other words, how does an organization such as Kaiser begin?

Yeah, great question. There are capabilities that can help you do that. First of all, there may be existing metadata systems that describe pockets of your data; those are easily integrated into the data catalog. Typically, though, our customers have started small. They've discovered that, besides their existing data lake initiatives and other data strategies, they have a whole host of new sources of information they'd like to understand. So they pick out a handful of those data sources and add them to the catalog, and it's quick and easy to do. You can add a source to the catalog very quickly, and not only can you do it manually, you can also automate the cataloging: there's a button you can click, and it basically crawls that data source and sucks in all the metadata automatically. So you can build up a very quick catalog of a handful of sources.
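As a sketch of what that click-to-crawl step does conceptually, here is a short Python example using SQLAlchemy's inspector to pull table and column metadata from one source into catalog-style entries. The connection string is a placeholder, and a product crawler would also profile and classify the data rather than just listing it.

```python
from sqlalchemy import create_engine, inspect

# Placeholder connection string: point this at your own source.
engine = create_engine("postgresql://user:pass@warehouse.example.com/sales")
inspector = inspect(engine)

# Walk every schema and table, recording the metadata as catalog entries.
catalog_entries = []
for schema in inspector.get_schema_names():
    for table in inspector.get_table_names(schema=schema):
        columns = inspector.get_columns(table, schema=schema)
        catalog_entries.append({
            "source": "sales-warehouse",
            "schema": schema,
            "table": table,
            "columns": [{"name": c["name"], "type": str(c["type"])}
                        for c in columns],
        })

print(f"Cataloged {len(catalog_entries)} tables from one crawl")
```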
And depending on the data, if the data is related and you detect the relationships between the data sets, you can start to demonstrate instant impact, because you can instantly see from that view what the data is, what the classifications are, what's in that data. And you can open that up instantly to the business if you want to. So I guess it's really: use the tools available to start small. But also, because of the ability to catalog the data lake and to integrate with existing metadata systems, it's an incremental path to the point where there's enough useful information in the catalog that you can open it up to the business. And once it's opened up to the business, the consumers can share their own assets back into it, so the catalog grows through the consumers in the business as well. I hope that helps.

Definitely. And would the catalog be on the cloud, or can it be on-premise?

So the Watson capabilities I've shown today are currently available on the cloud. However, IBM's unified governance portfolio is obviously available on-premise as well, and we have the integration between the technologies so that you can deploy this as a hybrid capability for doing data and AI.

Great. And is there any use of graph databases? If so, what's the use case?

I'm assuming the question is whether we're using graph under the covers. We can catalog graph databases, if that was the question. If the question is whether we use graph underneath: we don't use a graph database as part of the Watson Knowledge Catalog at the moment. We do capture relationships, and we have a graph model that we use to store those relationships, and we do have graph technologies across our unified governance portfolio to understand the relationships between those nodes. So yes, across the portfolio graph is used, and you can also catalog graph sources.

And does IBM have any predefined machine learning models for insurance companies?

I believe we do, yes. I don't have them to hand, but if whoever asked that question drops me an email or sends me a tweet, I can respond with the details.

Fantastic. And maybe, I know Lynn, you're on the line, maybe we can get that into the follow-up email as well. Most organizations today don't have an understanding of the necessary data modeling, data architecture, data governance, the data foundation that would enable data science and AI. Machine learning has to learn the proper data rules from humans. How do you advocate for that?

I think it's a technology solution as well as a business solution. We've got a long, deep history of solving those types of challenges and helping clients evolve their data strategies, and that's part of the vision I've laid out here. You really need to get your house in order in terms of understanding your data and getting it prepared and understood for AI, and then this is the final step towards AI, when you want to consume that data. We've got a whole host of technologies available on-premise, on cloud private, and on cloud public that address that whole spectrum of challenges.

Related to that, here's a question we get in almost every webinar: how do you get executive buy-in?
What is the elevator pitch that enables executives to buy into the proper use and management of their data?

I think that's a great point. This has become much easier because, to my point earlier, we are now cataloging for a reason. We need to be smart about AI. We need to be able to challenge our own business models, to consume AI, to drive new products to market, to reduce customer churn, whatever it may be. Because if we as an organization choose not to do this, and I don't mean IBM, IBM is obviously doing this, then everybody else will, and they will be the disruptors in our industry moving forward. So we can now have very easy business-level discussions: we need AI to fend off disruption, to be the disruptors, to change the way we operate, to reduce costs, to do things smarter. And to do that, we need to ensure we can monetize and operationalize our data so that it's easily consumable for AI, so we can do AI better than anybody else.

I love that answer. Moving on here: could the intelligent catalog federate information from distributed sub-catalogs?

Yes, it could. At the moment, we can do that through APIs, so it would be a custom services engagement where you use our APIs and build the framework around them as part of an AI platform. IBM is a big supporter of the idea of open metadata, whereby you have many catalogs as part of an information ecosystem, and those catalogs capture, collect, or allow consumption of metadata for different purposes, from different sources and different geographies, but can federate and synchronize information between them. Some of the things we're working on at the moment are absolutely along those lines, so that this becomes native to the technology rather than requiring custom consumption of APIs around it.

Going back to an expansion of a previous question: how would an on-premise implementation of the IBM catalog benefit from the Watson recommendation system?

I lost you a little bit there, you broke up. Sorry, could you repeat that?

Sorry. How would an on-premise implementation of the IBM catalog benefit from the Watson recommendation system?

Yeah, so at the moment we've got capabilities that have been built out across the on-premise, cloud private, and cloud public ecosystem. Our roadmap is to bring all these things much more closely together, so that regardless of where you want to deploy, you can take advantage of these different capabilities. Today, it would be through a series of API integrations. However, we obviously want to make sure this is natively and easily available, so you can deploy and run it wherever you like.

All right. That seems to be all the questions we have; I'll give it a few more seconds here. Jay, this has been a fabulous webinar. Thank you so much.

I love the opportunity to geek out here; this is one of my favorite things to talk about.

Anything else that you want to wrap up with?

No, just to thank everybody for their time. And please do reach out if you have questions.

Absolutely. And thanks to all of our attendees for being so engaged in everything we do. We just love it, and we love all the questions that came in.
Again, I will send a follow-up email by end of day Friday with links to the slides, the recording, and the additional links that Jay has in here for you, and we'll get some of that other information to you. I hope everyone enjoys their day, and thank you very much. Thanks to IBM for sponsoring today. Thank you.