So, I will tell you that in my background, I was an archaeologist and anthropologist, and back in the 90s I was using SPSS on a mainframe. If I had been able to submit my project profile back then, it would have had SPSS on it. What a lot of people don't know is that SPSS stands for Statistical Package for the Social Sciences, and I was using it for demographic distributions as well as for looking at sherd distributions across geographies. So one of the things I have simply been trained to do is think like a social scientist. And when we started to steer toward differentiating what we do with data and the scientific method to solve our business problems, or our clients' business problems, we started to think about how we want to show up as a corporation. Thankfully, we have a very long history of data responsibility, where the data and the insights belong to our users, to the creators of that data. With that, we have now established an ethics board, and I am a representative of our trustworthy AI center of excellence, where we have really enabled people to think a little bit differently.

And when we created this, and I think this was out of sheer necessity, we needed to show people the difference between a data science experiment and scaled AI. When we set about creating this method, which is a formal method within IBM, we remixed CRISP-DM, the methodology that comes with SPSS and that IBM inherited when we purchased SPSS. We chose CRISP-DM with intention, because it starts with the business problem and with understanding that business problem. When I was an archaeologist and anthropologist, before I would go listen to a story and do an ethnography, I was required to document how I was feeling, how I was thinking, what my mood was. That is something I believe strongly we bring to the table when we start to look at data and scale out our understanding of how to look at data: we need to understand our own cognitive biases. So the mixture of social science and mathematics, of soft and hard, quantitative and qualitative, semantics and statistics: all of these things come together.

And with that, I'm going to let Rader give his interpretation of how we have taken this and created the scaled data science methodology, and why we put it out there so that everyone can at least see and understand the 254 different work products and 53 different roles that you need in order to really perform a scaled data science project. So with that, Rader, over to you.

Thanks, Beth. What I often have to do when I talk about this topic is explain a bit why it says "best practices," because sometimes that leads people into thinking things we don't intend. In the very beginning, especially when we were working with junior data scientists, that came up often. A lot of the time the expectation was something like, "I expect a best practice to tell me: in this type of situation, this type of model will work best."
And that is not the level of detail, or let's say the level of deterministic certainty, that a lot of industry applications of data science have. I'm reminded of this because Alex asked a question in the Q&A a bit earlier: yes, sometimes we do publish research papers. A couple of years ago, I was working on a joint research paper with an oil and gas client about predictive seismic modeling, where we were benchmarking several physical approaches against more machine-learning-driven approaches. And the research team told us we were going to be reviewed by an expert, not a dual PhD but a quadruple PhD. That was a tiny bit intimidating. So we went into those day-long review sessions thinking, okay, this person is going to tell us, "You've done this wrong, and this hyperparameter is wrongly tuned," et cetera. What we came away with was him saying: "The approach is solid, the model selection strategy is solid. You could try some Bayesian modeling, but without trying it, I don't know whether that will be better." That is the level of certainty about approach that someone with years and years of study and experience was able to put on a very ill-defined problem.

So when we talk about best practices for our practitioners, for our professional population, we are talking much more about the almost philosophical, big-picture items that we want you to get right and to be aware of. Then, indeed, yes, you want to do proper hyperparameter tuning, and yes, you want a solid model selection strategy; that is one level down. And, and this links back to that conversation I had with those three candidates, no one is here at this point to put your project under a loupe and say, "You should have used random hyperparameter search here versus a grid search." No one is here to provide that level of scrutiny, because we don't know yet whether that matters. We are here to see: are your choices properly motivated, and do you understand the context in which you operate? That is the important caveat I tend to give people when I talk about data science best practices.
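To make that grid-versus-random remark concrete, here is a minimal, hypothetical sketch of the two strategies in scikit-learn; the estimator, parameter ranges, and synthetic data are illustrative assumptions, not a recommendation of either strategy:

```python
# Hypothetical comparison of grid search vs. randomized search.
# Estimator, parameter ranges, and data are illustrative only.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Grid search exhaustively tries every enumerated combination.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 3, 5]},
    cv=5,
)

# Randomized search samples a fixed budget of candidates from distributions.
rand = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={"learning_rate": loguniform(1e-3, 0.5),
                         "max_depth": [2, 3, 5, 7]},
    n_iter=9,  # same candidate budget as the 3x3 grid above
    cv=5,
    random_state=0,
)

for name, search in [("grid", grid), ("random", rand)]:
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Either choice is defensible here; the review-style question is whether the search space and budget are motivated, not which helper function was called.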
And the reason I care about the same storyline that Beth was expanding on here is that we talked about her being an anthropologist and everything that comes with that. I'm a psychologist, so that makes us quite a duo on some of the IBM calls, as you can imagine. It also means that for me, having gone from psychology to neuroscience to, let's say, more general data science, what I care a lot about is the scientific method. I also spent quite a bit of time coding up hypotheses in SPSS, and I care a lot about how questions are formulated and what kind of impact we tend to have. For me, "data scientist" is therefore a very general term. I think we mentioned the Harvard Business Review piece from a couple of years ago before, and in the wake of that, some companies have tended to define data scientists as people who take data, apply an algorithm, and take the output. For me it is much more than that: it is taking the right kinds of decisions based on the information available, whatever the tooling may be. And when you think about the scientific method and how we apply it in data science, that aspect has sometimes been a bit less illuminated, less reaffirmed, by the way some of the conversations in the profession have been going. So we started thinking about how we can help with that, because ultimately, when we work with clients, and also when clients work internally, that is something that has impact.

I don't know how many of you have seen it, but earlier this week a paper was published on arXiv called something like "How to Fool the Masses with Machine Learning," or "Ten Ways to Fool the Masses with Machine Learning." It catalogued all the, well, inadvertent tricks people can pull with numbers, with data science methodology, and with machine learning to give the appearance of impact without the actual realization of impact. And this is a huge problem, in the sense that of all the companies that tried dipping their toes into the data science field five or six years ago, 50 to 60% of them were scratching their heads two or three years later, going: "It's weird. I have 30 people working on this and no impact. I have 60 people doing data processing and no models in production, or I don't know what these models are doing." So the question we were often asked as consultants in the field was: where is that gap coming from? "I did all the things the strategy agencies and everyone else told me to do, so why did I end up with, let's say, a mirage of value rather than something that changes the way I operate?"

This is really where we started thinking about what we need to change in the way we tend to do things. As Beth mentioned before, CRISP-DM has long been the standard data science methodology. You probably all know the phases: business and data understanding, some feature engineering, model building, deployment. It is the cycle for answering the question people used to ask: "Tell me something about my data that I didn't know yet and that is valuable." But the way that knowledge would have impact was often undetermined; companies were not thinking through: once I know this, what does that mean, and how will that change things for me? It was also a very difficult question, because when companies said, "Here's my data, tell me something I didn't know yet," data scientists were in a very tricky position. If the data scientist came up with something that was weird in the data, something that stood out, companies would either say, "I knew that already, that is too generic,"
"I've seen that, that's just the way my business works." Or: "That is so far outside of how I thought I was operating that it can't be true. There must be something you did wrong; there must be something wrong with this data." Whatever the excuse of the day was, it was very hard to calibrate that finding into something that was both novel and accepted.

So we started thinking about what we need to change in that life cycle to go straight through to impact, so that we are no longer doing science just in the lab, but have a properly scaled version, and we solution for impact. And that means it's not just data scientists. We talked at the end of the previous section about data access, about domain-driven design, about domain expertise, about exposing the standard data through a set of APIs so that the master data gets properly followed up, et cetera. Those are all the things we started bringing together into that scaled data science method. Yes, it has the aspects of CRISP-DM, and it has proper data engineering and proper ML engineering. But it also has things like DevSecOps and security by design, because a data scientist will need to work with people before putting something in production, so that we can prevent malevolent data extraction from taking place, or prevent high-volume calls to the API from revealing data that should be secured by design. We talked about test-driven design, because we want to make sure that the systems we build hold up under all the circumstances of the data we'll throw at them. And this matters because failures are all too common: we've worked with companies that said, for example, "My model worked when I tried it in my notebooks; I deployed it to production, and things don't seem to work that well." It turned out that all the data coming from Japan, China, and Russia, where the characters of the text were non-Western, was getting filtered out. That is something that was not tested for and not checked in production, and it obviously had an impact on how that company was operating.
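As a hedged illustration of the kind of test-driven check that would have caught that bug, here is a minimal sketch; clean_text() is a hypothetical stand-in for a real preprocessing step, not the client's actual pipeline:

```python
# Hypothetical regression test for a text-preprocessing step.
# clean_text() is an illustrative stand-in, not an actual pipeline.
import re

def clean_text(text: str) -> str:
    # Python 3's \w is Unicode-aware by default, so non-Latin scripts
    # survive. An ASCII-only pattern such as [A-Za-z]+ would reproduce
    # the bug above and silently drop Japanese, Chinese, or Russian text.
    return " ".join(re.findall(r"\w+", text))

def test_non_western_text_is_not_dropped():
    # pytest-style check: records in non-Western scripts must not come
    # back empty from preprocessing.
    samples = ["hello world", "こんにちは 世界", "привет мир", "你好 世界"]
    for sample in samples:
        assert clean_text(sample).strip(), f"silently filtered: {sample!r}"
```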
Seth talked about enterprise design thinking for data and AI, I think, so that goes in as well. All of these different disciplines are things we are putting together into that scaled data science method, because from the beginning, when we came to this profession, we have always said that data science is a team sport. This is really how we try to formalize that, because CRISP-DM by itself did not acknowledge that fact explicitly. If you go to the next slide: what we have taken are these sections of CRISP-DM, these critical steps we expect data scientists to be able to take, business understanding, data understanding, data preparation, et cetera, and embedded them into what we call the Garage innovation loop. The enterprise design thinking workshops are really our guidance for doing business understanding and data understanding, with those ethical considerations and with the human at the center that Seth was talking about. When we are talking about data preparation and modeling, that goes together with the people who know the master data, who know the availability of these APIs, who know what needs to be changed, and who know what the lineage and provenance of this data has been within the company, et cetera. And when it comes to showing impact with these models, this kind of workflow also takes into account, okay, an A/B test to show impact. If that is successful, how would a team then operate such a model? How would we make sure the model gets refreshed from time to time? How do we make sure the model is still valid when a law changes, when a business rule changes? And how do we make sure that all those different parts of the business, of the world we built this model in, act in concert?
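A minimal sketch of that A/B-test step, with made-up counts and a standard two-proportion z-test from statsmodels; the metric, sample sizes, and numbers are all hypothetical:

```python
# Hypothetical A/B impact check: did the model-driven variant (B)
# move a conversion metric versus the incumbent process (A)?
from statsmodels.stats.proportion import proportions_ztest

conversions = [130, 162]    # successes under A (control) and B (model)
exposures = [2000, 2000]    # users shown each variant

stat, p_value = proportions_ztest(conversions, exposures)
rate_a, rate_b = (c / n for c, n in zip(conversions, exposures))
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p = {p_value:.3f}")

# A small p-value suggests the lift is unlikely to be noise, but the
# team still has to decide whether the effect justifies operating the
# model: refresh schedule, monitoring, handling of rule changes, etc.
```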
And I just want to pause here for a second, because we have these beautiful loop-de-loops. This represents over 254 different work products that we have examples for, and 53 distinct roles. Front to back, it is an incredibly large effort to get your data science team to express their models as an API, with good code architecture, so that a human being can get the right information at the right time to make the right types of decisions. This is how we build what we call intelligent workflows. But it is not a solved problem to know what information which human being needs at what time in order to take what action. More than anything, we want people to understand that if you don't have test-retest reliability in your data science output, you lose trust. This goes back to what Seth was talking about earlier as well: you lose trust with your users, and then you cannot have impact with your data science project. And that is something I want to steer people toward: it is the human being we are augmenting with the right information, and that takes an incredible amount of understanding of the business process and the domain expertise to tie it back to what we have. In addition, Rader, do you want to take a couple of minutes to talk about what we have open sourced as far as our best practices?

Yeah, absolutely, Beth, because I did talk a bit about all the things that go into that full system design. We had a challenge there internally for our data science population as well, because we had so many practitioners who grew up with CRISP-DM and "building models is my job," and we had to think about how to enable thousands of practitioners to know what they need to know about data management and how it feeds into the models they build, and about ongoing quality measurement. So we developed an internal training, and we also wrote down everything we taught in that training and open sourced it: about 21 chapters, on everything from what a diverse project team looks like to recommendations on scalability, with cloud deployment patterns in there, to empower our data scientists and really widen their view. The link is there, and like we said, it is completely open sourced. We also used it to drive some publications, for example with IEEE, but this is how we built our internal community and drove that next step. And like Beth said, this is not a solved issue. That is also why we open sourced it: so that there is iteration and a community around it. But this is the approach we took: let's just put it out there and continuously work on it over the next few years.

And part of what we were thinking, too, is that we wanted to show people where to start. Where to start is always with your data, and then AI, and with understanding the problem that needs to be solved. I have to tell you that working with our designers on creating this very comprehensive set of workshops, work products, and procedures to get at the intent has been a phenomenal discovery, because most of where people want to use AI, or want to apply data science, typically comes down to understanding the business need and getting the right information at the right time to the right person, so that the person can take the right action, hopefully to grow the business. So with that, I definitely wanted to stop sharing and see if we can field any questions or comments.

I don't see any questions in the panel, but I am curious if you can talk a little more about the decision to open source the data science best practices. It seems like a really great thing for the industry and for practitioners; what went into that decision?

So, with our acquisition of Red Hat, a lot of our company is shifting culture, from proprietary products to more open source, and I think we can see this shift across the industry. I briefly mentioned before that there are always trade-offs; the Log4j issue that just happened to all of us is a great example. But I will say that the wider your variance, the more standard your mean: look at what can be constructed when you make space, put something out there, and invite people to contribute. I am always surprised, and I love being surprised and delighted, by the next generation's version of what we are doing, by how the next generation takes some of the principles of the past, remixes them, and makes them relevant for today. I think that is something we believe in strongly; it is part of our ethos and our ethics to share our expertise with others. I don't know of any other company that could really have put together our data science methodology and some of our best practices, because we have that variance and we have that breadth and depth. So we want everyone to contribute. We don't think we have solved it, and we don't think we can solve it alone; we need everyone to help, and we have tried to make as much space as possible. And one of the things I am personally always learning is to welcome feedback.

Can I add to that, Beth? Because I think, from, let's say, our culture and direction, that is completely accurate. The other thing that I think is a factor, or at least was to me, is time, in the sense that a lot of data scientists are, to perhaps use the term a bit wrongly, growing up with open source: that culture is far more ingrained in the group that now holds mid-level management roles and is getting to senior positions than it was 10 or 20 years ago.
But I think Beth and I both talked about how we still did a lot of things in SPSS, and coded our analyses that way. A couple of years after I stopped doing that, most of the faculty were using R, as open source software. So it is also an acknowledgement of the fact that a big chunk of this population puts a lot of trust in open source and in the open source community, and has that as its home. It is a reflection of us seeing the makeup of this profession change over time, and wanting to be a part of that.

Okay, so a couple of other questions have come in, one from my colleague and your former colleague: In this day and age, how do we ensure data transparency and avoid bias? I have seen researchers look for the statistical analysis that supports the original thesis, but this may not actually reflect ground truth. How is IBM working to avoid this in data science and AI?

I think you have to train the human, as I alluded to earlier. Some of the work we get to do is through our Academy. Recently we did a session using the Titanic data set to understand survivability: people in steerage would never have gotten onto a lifeboat because of their socioeconomic class. We tend to think of bias in terms of statistical bias and skew and lots of other mathematical things. But when we are talking about AI that impacts human beings, we need to talk about cognitive biases, and about having a diverse set of people thinking through the problem. Because it is always the underrepresented person, the person who is not on the data science team, who is typically the one asking, "Hey, how come Alexa doesn't hear my voice?" Every single time, we need to make sure we have enough variance, and we need to start thinking differently about algorithms and about whom they benefit: is the output benefiting a certain group of people over another? I think that is a really good way to start thinking through some of these things. It is a very complex, hard problem to solve, and we really are just at the beginning. I will add that this is another reason I want so many more people involved: I want to democratize the understanding that algorithms are only probably correct some of the time, and that probabilistic thinking is something we need to engender in this next generation, if not in ourselves.

Okay, one final question: where can people find information about the IBM data science apprenticeships?

That is a great question. I am positive it is on our IBM.com website somewhere, and we do have an excellent apprenticeship program. Part of the work I get to do is with our P-TECH students, our Pathways in Technology program, where we take students who are first-generation high school graduates; when you graduate from a P-TECH high school, you graduate with a two-year associate's degree. We are making them part of our apprenticeship program and part of our IBM culture, again looking for that variance of thought, for people who come from a very different socioeconomic background than some of the PhDs who work on our research team, for instance.
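Coming back to the Titanic exercise mentioned above, the core of that survivability analysis can be sketched in a few lines; this hypothetical version uses seaborn's bundled copy of the data set and its column names:

```python
# Hypothetical sketch of the Titanic survivability exercise; uses
# seaborn's bundled copy of the data set and its column names.
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Survival rate by passenger class: third class (steerage) fares far
# worse, the socioeconomic bias the Academy exercise highlights.
print(titanic.groupby("class", observed=True)["survived"].mean().round(2))
```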