So, next up we have a couple of different presentations that we'll walk through back to back. The first is on the power of insights from data science projects, and the second is on data science best practices, and we'll have time at the end for Q&A. So if you do have questions while watching the two presentations, please put those into the Q&A panel and we will get to them at the end of the presentations. I'll now introduce our speakers for both of the presentations. First is Wouter Luster Bach, who's Chief Data Scientist, Europe, for the Advanced Analytics Center of Competence. Wouter is a neuroscientist by trade and a data geek at heart, and an experienced cross-industry data science consultant who has seen companies through the phases from an unplanned data landscape through to actionable results delivery. Wouter will be joined by Beth Rudin, who's a data science and analytics executive with IBM. Beth is a global executive leader with 20-plus years of IT and data science experience at IBM. So over to you, Wouter and Beth.

Thank you so much. We are very happy to be here, and what I'm going to do is bring up my screen to share. And I want to just do a quick check to make sure that everybody is able to see the screen, Wouter.

Awesome, looking good. We're seeing the screen.

All right, thank you very much. So, as part of the position that I'm holding currently, I am an IBM Distinguished Engineer and principal data scientist. I have gone through this certification, and I really got to be one of the very first people to take it and make sure that the certification is a really good path for all of our IBMers, because one of the things that I believe in very strongly is the ability to take your own goals into your own hands. And, you know, I have people all over the world who ask me every day, am I allowed to go get certified as a data scientist?
And it's interesting because I think that that type of perspective is something that we don't often get. I got to work with George Stark and Dan Riley and Maureen Norton on creating the data science profession within IBM, really rolling it out and making sure that it is something that is sticky. Also, the data scientist in me was super curious, and I wanted to see what all these amazing projects are that people are submitting. And, of course, we used some data analysis to do this, and we put together this package, and it's a setup: it's a setup for all of the various hypotheses of really starting to understand this new data science profession. I think it was 2012 when the Harvard Business Review said that the data scientist would be the sexiest job of the 21st century. And that's when we really started thinking about how we can create data scientists that have the right type of training. That's why we're so excited to present this Academy of Technology study, as well as our data science method, in order to show you the Open Group accreditation system and what we can do with the data. And just to make sure that everybody knows: the Academy of Technology is IBM's academy, and it has been around for well over 50 years. It's definitely evolving, but one of the things that we get to do is work with people across business units. We are 250,000 humans in 175 different countries, and we have such a wide diversity of culture and thought that we have these Academy initiatives that we create, and I sit as the vice president on the Technology Council for Ethics and Data Science. And our Academy is really about gathering people together and doing projects like this that aren't necessarily revenue generating or client facing.
But this is something that we do because we want to be able to test hypotheses, and the best way to do that is to gather people from all kinds of disciplines, across brands and business units, and allow them to look at data or do experiments in order to really showcase what we can do. So our Academy does many of these different initiatives and we produce a lot of different reports across the board. And this is something that I think is a little bit different, because we actually used our own data set. So with that setup, these are insights from 900 different projects by IBMers. We have all of our data scientists submit their project profiles for the certification, and this is a treasure trove of information, because I can start seeing some of the various shapes take form: who is a data scientist, and what kind of work are they doing? Who are the people they're interacting with? What kind of algorithms are they using? What kind of business problems are they solving? And all of the data was anonymized, so we didn't know who was who. We do have lots of more specific, mathematical things that, of course, we didn't anonymize, and I think they're just absolutely fascinating. So we had 28 different IBMers from 10 different countries contributing all of this shared knowledge. We also used this to connect to all of the master data that I'm going to talk about a little bit more, because for most of the data scientists that I talk to, the thing that I hear the most is that the data preparation is 50 to 80% of the problem. And if you think about roughly every single data scientist creating a notebook, if they have the master data to connect to, this accelerates their ability to use the master data that is already well formed.
Well formed and governed, that is, as opposed to redoing the normalization and standardization to get to the master-data dimensionality that you would use within some of your data science projects. We have stories on client industries, business units, data science algorithms, and tools. And this is just the beginning; like I said, this is a setup. We're really taking this to a different place, modeling and meta-modeling how we can work in multidisciplinary teams in order to tell the stories and put the soul underneath the data, because data is a story, but it needs the soul of the people to really tell that story. And then we have a recommendation on augmenting and enriching with our data connection, and we've already initiated phase two. So, when we first started looking at the shape of the data, we wanted to make sure that we were looking at a good variance from each of the different badges. So we really focused on level-one and level-two badging. We did augment with some of the level three, and both Wouter and I are level-three certified. The application obviously includes all of the free text, but then we were able to link it with most of the metadata that we have on the employees: what role do they play, what is their job role and skill set within the HR system, what industry did they connect to, and obviously your time dimension on when they put it through. Obviously the data preparation and engineering is the hardest part of getting the data into the well-formed, ordered data sets that we need in order to be able to perform calculations. And this is where we started connecting to metadata, and I'm going to talk a little bit about ontologies later on.
But this is where we really were able to abstract a lot of the structure from the unstructured text, using NER and LDA and some of the things that we can use from a text analytics perspective, to put this together and see some of these stories. And if you see the global reach of this, of our projects and our corporation, this is something I am hugely proud of, because through the last two years in the pandemic, my friend Wouter, who is sitting in Munich, just offers me such a different perspective on a personal level. So having a global company and being able to connect to people all around the world, especially as we're locked inside and hopefully being careful, and taking advantage of understanding different perspectives on what is happening in different countries: this is something that I have a huge amount of pride in, that IBM is able to showcase this level of cultural diversity. And just a side note, one of the biggest reasons that we chose L1 and L2 is that they had very explicit questions, and L3 obviously has some different questions, which I think you all know, but we wanted to have that control of having the L1 and L2. So, in the middle of this project, I had connected with my friends within the global chief data office. Prior to taking the role that I have now, I was the chief data officer for our company formerly known as GTS; it's now known as Kyndryl. And one of the things that I believe in so strongly is this construct of creating a data fabric or a data mesh, or something where you can really start to centralize and govern your master data, and then allow access to that master data through APIs.
I think that it really trains your data scientists to start interacting with cloud services as well as DevOps systems, and to be able to access simple things like industry as an API instead of rebuilding the entire dimension of industry. So, we were able to accelerate this quite quickly, because when I found out that the data scientists who were doing this particular project were struggling with the simple normalization and standardization needed to get the industry aspect out, I was like, hey, did you know that our global chief data office has 55 different APIs for access to master data? And it eliminated a huge amount of work. It's a simple thing, but I mention it because even within a corporation like IBM, not everybody had the knowledge that they could access these APIs so that they could simply map the project badge to the industry. And that, I think, is a very, very good lesson: especially we as executives need to communicate better to all of our people who are doing this work, so that they understand what's out there as opposed to trying to recreate it.

The business units and industries. So this is where I'm exposing a little bit about IBM, but I think it's so significant, because we are one of the only product-and-services companies that have this type of distribution of all of our different project profiles across business unit and across industry. And when we first took a look at this, and this is probably one of my favorite charts, I started with the people who are doing things in the automotive industry in GBS, which is now known as our IBM Consulting organization. We had five in Cloud and Cognitive, which is a product division, and then 26 in our consulting division.
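The pattern Beth is describing, calling a governed master-data dimension instead of every notebook re-deriving it, can be sketched in a few lines. This is an illustrative stand-in only: the dictionary below is a hypothetical in-memory substitute for the real master-data API, whose endpoints and values are not shown in the talk.

```python
# Hypothetical stand-in for a governed "industry" master-data dimension
# that would normally be served over one of the chief data office APIs.
INDUSTRY_MASTER = {
    "bank": "Banking & Finance",
    "banking": "Banking & Finance",
    "finance": "Banking & Finance",
    "auto": "Automotive",
    "automotive": "Automotive",
    "health": "Healthcare",
    "healthcare": "Healthcare",
}

def canonical_industry(raw: str) -> str:
    """Map a free-text industry mention to its governed master-data value."""
    return INDUSTRY_MASTER.get(raw.strip().lower(), "Unknown")

# Each project profile is mapped through the shared dimension once, instead
# of every data scientist re-implementing their own normalization.
mentions = ["Banking", " auto ", "Health", "retail"]
print([canonical_industry(m) for m in mentions])
# → ['Banking & Finance', 'Automotive', 'Healthcare', 'Unknown']
```

The point is not the lookup itself but where it lives: one governed dimension behind an API, rather than a slightly different copy inside every notebook.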
And I'm an anthropologist by training, so I'm the social scientist, so I'm always sitting here going: what does a data scientist look like? What do they wear, what do they do, what do they eat, what do they sing? And I started to formulate some hypotheses about people who were working in our product division as data scientists, versus people who were working in our consulting division as data scientists, versus people who were working in our managed services division as data scientists, or research, or global markets. And we started to be able to see the flavor differences in how they started to think about the problem: really understand what our client's issue was and formulate that into a hypothetical statement that we could test. We needed to start to look at the data in detail and, again, extrapolate these stories so that we can shape and start to understand all of the various flavors of the data scientists, not from their resumes or the things that they have self-reported, but from their project profiles and how they have helped serve our clients and customers with their data science tools and techniques. So I found this really fascinating, and again I was really proud to see this level of variance across IBM in the people who submitted their project profiles to talk about what they did in the different industries, and then the width and the breadth of the industries that we were able to cover. This is another Sankey diagram, and this one I thought was very interesting, because like all companies and corporations, like all human beings, we kind of wear our tribes and we wear our team, and we say, well, I come from research, so this person in services, I don't think that they truly passed the data science problem, or I don't think that they really got it.
And I use that as a negative example only because I want to talk about this: when we have different data scientists that are using data and the scientific method to truly solve a client's business problem, we have to make sure that we're speaking the same language, and we have to have that common repertoire, in order to make sure that people understand that a services engagement will be very different from a research engagement, where the goal in services is to deliver for a client and the goal in research might be to publish a patent. And so we had to deal with these various flavors in a way that we could start to make sure that we engendered respect from one division to another, from one industry to another. Using this sort of Sankey diagram to look at who started it, who passed, who failed, which countries it came from, which business unit it flowed through, and where it went to within the industry, I found an excellent way to study the behavior of our data scientists and our human beings. Both the ones that were submitting the projects, as well as the ones that were looking at the projects in a way of, okay, I am going to pass this person or I'm going to fail this person. And seeing if we could start to tease out some of the biases that we wear and that we are part of, because we are part of the system, or it's part of our team and our team culture and how we behave.

Stories on data science tooling. Now, this is where I really think it's very interesting. The thing that just jumped out at me is that our IBM Services, or IBM Consulting, organizations really used far more open source than any of our product divisions. Shocking. And the data science applications from global markets and our Cloud and Cognitive, or product, divisions obviously reported Watson and SPSS more frequently.
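The Sankey views Beth describes reduce to weighted source-to-target links between adjacent stages of each project record. A minimal sketch of that data preparation, using synthetic (country, business unit, outcome) tuples as an assumed stand-in for the anonymized certification data:

```python
from collections import Counter

# Synthetic project records; the real fields and values will differ.
projects = [
    ("Germany", "Consulting", "pass"),
    ("Germany", "Consulting", "pass"),
    ("USA", "Cloud & Cognitive", "pass"),
    ("USA", "Consulting", "fail"),
    ("India", "Research", "pass"),
]

# A Sankey diagram is a set of weighted source -> target links; counting
# each adjacent pair of stages yields the link weights.
links = Counter()
for country, unit, outcome in projects:
    links[(country, unit)] += 1
    links[(unit, outcome)] += 1

for (src, dst), weight in sorted(links.items()):
    print(f"{src} -> {dst}: {weight}")
```

The resulting (source, target, weight) triples are exactly the input a plotting library's Sankey trace expects, so the counting above is the whole data-preparation step.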
And I think it's interesting to note that you started to see this flavor of open source be more apt for services, because I do think that we're seeing a shift in the building of the AI software to the front line of delivery, where we are looking at the more open-source applications and how we can use them, as opposed to some of the products that may or may not give us the facility of things like Python. However, I will tell you that I would love to do a time series analysis, because when you use open source you're making a trade-off: it's less maintainable in the future, and you always have to update it. So if you looked at it from a time series analysis, if you look at something that went into production as open source, what did that application do over time, versus an application that was fully supported from our product division, for instance? These are the types of questions and hypotheses that I think of every single time I look at some of this data and this analysis. The detailed investigations that we're doing now use ontologies and a knowledge graph to look at all of the language variation and the ontological hierarchies that we are starting to create, in order to be able to do some inference and reasoning. So, the stories on the data science algorithms. Now, this I thought was even more interesting, because we have a wider range of algorithms within the services division. And again, we're going to tie this out in some of the hierarchies and build out an ontological knowledge graph, so that we can actually start to look at some of the unstructured text; the ontology that we're building out is based on all of these project profiles. And we're starting to see the skew that is happening. And there's definitely something that I'm starting to see.
It's that when you're looking at the use of different algorithms, some data scientists obviously have a preference over other data scientists, especially when you're trying to do some of the models and the model connections here. So we're going to see what we can actually infer from some of this work. Again, I just think that this is a phenomenal way to look at the fact that banking and finance obviously loves tree-based models. Fascinating, where we had this skew that you can see in the banking and finance industry, looking at it with your X and Y axes of what model and algorithm you're using against the industry, and where your clusters are. So regression and classification and exponential-family linear models within banking and finance, as opposed to some of your other industries. And I just think it's fascinating to start to develop your hypotheses. Look at the health one, which is obviously more skewed to tree-based models, but has a little bit more variance when you're looking at regression and classification. So, again, we're working to understand what we can do and see if we can come to conclusions by industry; my hypothesis is that it's based probably more on your network of people and how you were trained. So, again, just a little bit of an ode to knowledge graphs, to understand both the ontologies and the entities that we are extracting, as well as their relationships to one another. And this is some of the hierarchy we are working with, like: mean shift clustering is a mode-seeking algorithm, which is a clustering model, which is an unsupervised model.
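The is-a chain Beth sketches can be represented as child-to-parent edges and walked upward; that walk is the simplest form of the inference a knowledge graph engine generalizes to arbitrary hierarchies. A minimal sketch, using one reading of her example chain:

```python
# Child -> parent ("is-a") edges for the example hierarchy; a real
# knowledge graph would hold many such edges, possibly forming a DAG.
IS_A = {
    "mean shift clustering": "mode-seeking algorithm",
    "mode-seeking algorithm": "clustering model",
    "clustering model": "unsupervised model",
}

def ancestors(term: str) -> list:
    """Walk the is-a chain upward from a term."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def is_a(term: str, category: str) -> bool:
    """Infer whether a term falls under a broader category."""
    return term == category or category in ancestors(term)

print(ancestors("mean shift clustering"))
# → ['mode-seeking algorithm', 'clustering model', 'unsupervised model']
```

With this in place, a query like "find every unsupervised technique mentioned in the profiles" reduces to checking `is_a(term, "unsupervised model")` for each extracted term.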
And understanding how to build out these hierarchies in a dynamic way using the unstructured text, so that we can detect the hierarchy and then put it into more formal knowledge graphs, so that we can use the knowledge graph to feed the text analytics, to feed the ontology, to get back to this taxonomy. So, currently we're using named entity recognition to detect some of the algorithms within the unstructured text, to give you an example of how we're doing this, and then using a human neocortex to create the hierarchy, but also to place that into a knowledge graph that can be used across the organizations, and then be expressed as an API with code architecture. So, here's where I want to pause for a bit, and I wanted to ask Wouter to talk a little bit about when he went through the data science certification. As an L3, he went through a combined path in order to really look at L2 versus L3. And this is a slide that really starts to showcase how we are looking at the level of performance across the different badges in the project profiles that we see. But I wanted to give Wouter a little bit of air time to just talk about his experience here, because I think it is seminal in thinking about how we are actually able to solve our business problems and connect to our client issues, and understand the different type of variety that we're starting to see between L2 and L3, because of the ability for us to start seeing the similarities between all of the L2s together, doing some clustering and cosine similarities there.

Absolutely, and thanks. You know, to take that first part of the question: what was my experience like?
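The algorithm detection Beth mentions can be approximated, in a much-simplified form, with a gazetteer lookup standing in for a trained NER model; the term list and sample text below are illustrative assumptions, not the real system.

```python
import re

# Gazetteer of known algorithm names; in the pipeline Beth describes, a
# trained NER model plus the ontology would drive this instead.
ALGORITHM_TERMS = [
    "random forest", "logistic regression", "mean shift",
    "gradient boosting", "k-means",
]

def extract_algorithms(profile_text: str) -> set:
    """Return the known algorithm names mentioned in a free-text profile."""
    lowered = profile_text.lower()
    return {
        term for term in ALGORITHM_TERMS
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }

text = ("We trained a random forest baseline, then compared it against "
        "logistic regression for the churn model.")
print(sorted(extract_algorithms(text)))
# → ['logistic regression', 'random forest']
```

The extracted terms are then what gets placed into the hierarchy sketched earlier, linking the free-text project profiles to the formal taxonomy.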
So one of the things, and you all may know this, is that this is a fairly structured process to go through, in the sense that we have different sections for the different aspects of the whole data science workflow, in essence. And what we do is, of course, we look at: how do you see this particular section? What did you do, for example, to understand your business users? What did you do later on in deployment, et cetera? So one of the things is that having that overview now makes you look at some of the newer projects and go, okay, do I have the same type of breadth now in a project? Do I have the same type of QA as I did for the projects that helped get me to this level? Because sometimes, and this was, I think, one of my other big experiences, you tend to think differently about some of these projects afterwards, versus the pressure you sometimes get in data science projects to get the immediate value, to deal with things that are perhaps an inflated perception or an inflated expectation as to what the tooling can do versus the reality of what the data looks like sometimes. So to go through that, to really have to write that down in these different steps, and also to talk about things like stakeholder management and impact: it makes you look back and go, well, in these and these and these cases, I was able to take the long-term view with regards to impact and stability of the system, to link back to what Beth was saying, the impact on the users, for example, and really take that and balance it all out, versus "I built a model that was as good as it could be," which is a very different goal and metric to try to optimize. And it got to the point, and I was telling this to Beth before, that I happened to be coaching an L3 candidate last week.
And this L3 candidate was looking at, hey, these are some of the questions I have for L3, but he said, I'm not sure I get all of this, because some of this seems very business-strategy focused; some of this is very consulting focused, almost, in a sense. And the conversation I had with him was: well, to me, some of that is the essence of an L3, because of the thought leadership that you're going to look to apply. What we're looking for is not necessarily that you can grab data, apply a model, and report an outcome, but that you can understand how the business works, how and why the people that input numbers into a database do that, how that affects the rest of the system you should be building, how that affects what you should do with missing values, what you should do with normalization, what you should do with model selection and strategy, and how you should design human-in-the-loop systems. To me, thought leaders there can express what that interaction is like and why they make choices. The thought leaders we're looking for are people that can motivate their actions when building a human-centered AI system. And those were some of the things where I think the structure we have with the Open Group certification helps, let's say, fill in and round out this picture of building human-centered AI systems.

Thank you so much, Wouter. I wanted to pause and ask if there are any questions, because we're going to go into our next session. If there are any questions that anybody would like to ask at this time, I think this would be a great time to do so.

So I see one in the Q&A, Beth, and let me just read it to you, and either one of you can pick it up. How do you ensure privacy, ethical, and security guards are in place when providing access to master data across industries?
So, one of the lovely things that we have within IBM, and it's something that both Wouter and I actually provide services and consulting for, is to be able to create a centralized chief data office that governs and manages the master data and then provides a thin consumption layer across the entire organization. That layer is fully secured and has traceability and auditability for who is accessing what type of data when, and we provide that in the 55 different APIs that we have to date, but they are growing and growing. This is something that we have within IBM that we are always trying to replicate and help our clients get to. It's very difficult to do over an enterprise organization at that level, but the concept is very simple: you centralize and you govern the information that you need to replicate and get out to the rest of the organization. And then you allow that consumption layer to be open, so that all of your developers can use the correct API access, which has the tied-in auditability and security you need to make sure that you have the access you are required to have. Not a simple answer, but I will tell you that it is incredibly effective once you have the correct investment to push this out into an organization. It allows people to have a common language, and I cannot emphasize enough how great an impact that can have across your organization.

So I have one question myself, after looking at your data about the projects and the industries they came from. It occurred to me that in oil and gas, we have a forum in the Open Group called the OSDU Forum that was really founded to solve their data access problems and to enable the move to the cloud. Things were very fragmented in that industry prior to the advent of the OSDU data platform. So I wonder about other industries; that fragmentation, I would think, would be a barrier to really doing effective data science work for those kinds of companies.
And I just wonder, among other industries, if you see structural barriers like that, and what the experience is with those.

What I have found is that the domain expertise is far more important sometimes. The knowledge and the language needed to resolve those types of fragmentation are a larger feature of being a successful data scientist within an industry. And that is my opinion because I have seen so many data scientists struggle; healthcare is one where the acronyms and the language and the understanding of that domain are far more critical to the success of solving the business problem than dual PhDs in linear algebra. Wouter, do you have a comment there on domain expertise and fragmentation of data?

So I agree with you on the domain expertise versus the dual PhD. I wanted to touch upon the fragmentation bit, and we'll touch upon this briefly in the next section as well. This is also something that companies that have tried building data science capabilities have gone back and forth on. What we saw, I think, four or five years ago, when a lot of the data science within industry was "we'll build our first teams and make our first attempts with data science and AI," is that people did indeed say, well, it is tricky to get all of this data together. So we are going to go to a data lake setup, we're going to put everything together, data scientists are going to plug right in there, and then they're going to build models, and that's going to be the easiest section. What that has ignored, and this is my hypothesis, is that within larger enterprises there is nothing to properly incentivize that kind of behavior to keep it going. When you centralize your data and you clean it all up, it is nicely centralized and cleaned up for about a week, and then things change again.
So what we are now actually seeing is companies starting to ask: okay, how can I distribute data, how can I have fragmentation by design, and have that sit within the domain with the experts, while still maintaining a modicum of control? And you accept that less efficient setup, because then you acknowledge the fact that companies are made up of humans that tend to do things that are in their own interest. Trying to centralize everything worked in some cases, but in a lot of cases it caused some type of conflict. So what we're actually now seeing is more fragmentation, or rather federation, by design, and companies asking: what is the level of federation that works for me, that works for my setup, that works for the people I have?

Well stated. And I see another question in the chat, in the Q&A. My role is mainly client facing, and we actually have some things that are going out more recently on ethics. If you want to connect with me on LinkedIn, I can steer you to that, because part of what we're going to go through next is the data science method. And I think that this would be a good lead-in, unless there are further questions that you would like to ask us.

It looks like that's it for the questions at this point. So I'll turn it over to you to go into your second presentation.