Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to a very special virtual Friday edition of the Data Services Office Hour. I am joined by my fellow Red Hatters Carl Eklund, Sophie Watson, and Guillaume. Yeah, you got it. All right, I'm winning this morning. We had kind of a rough night here in the Short household: multiple thunderstorms, lots of child holding and such. But we are here and we are ready to party, I guess. So, the topic du jour is data engineering versus data science. To the layman here (I'll fully admit it, I'm no data nerd) they kind of mean the same thing, but they don't, right? Like, I get it: there's CS, there's the science behind it, then there's the engineering behind it, the application of it. But with data, how does that slice? Don't all jump at once.

I can start if you want, because I'm old. That means I've seen those definitions change over the past few years, so maybe we can start with that. It will help people figure out what's what and how things are settling.

Yep. Everybody introduce yourselves first.

Yeah, sorry, I forgot to do that. It's early; I haven't had coffee yet. But please, everybody introduce yourself.

Since I have the mic, okay: I'm a data engineering architect in the cloud services BU, so the title is not that interesting. What's most interesting is what I do: I try to help various organizations set up their data science platforms, which includes the data engineering box. I help those people figure out the infrastructure, the applications they have to put up, even some aspects of the project, how they want to run it, and I help them sort this out. That's what I do.

Awesome. Well, I'm Carl Eklund. I am also an architect here at Red Hat. I have a background in academic numerical analysis.
I wouldn't always call that data science, but there's certainly data science happening in academia. The specific field I was focused on wasn't quite data science, but that's where I gained all of my interest. Then I moved into the private sector and focused on technology and data strategies for enterprises surrounding data science, and, you know, actually getting a return on your data science investment. So that's kind of where I'm coming from.

Awesome. And I'm Sophie Watson. I am not a data engineer; I'm a data scientist. So I'm really excited today to hear from the experts about how they work hard to make my life easier. I just want the data. I'm kind of with you, Chris, in that regard. Data just happens. Data just arrives and exists, and I don't really think about how it got there. I just want to use it to, you know, iterate through my workflow and not think about what they do, except for this hour, where we're going to learn about what they do.

Wonderful. Okay.

Yeah, and I think introducing ourselves is a nice introduction to this discussion, because you see we have different backgrounds. I have a background in infrastructure and, you know, general architecture for organizations. That was at Laval University here in Canada, so also in academics, but not from a pure data science perspective: all the other systems that you have to set up to sustain a 45,000-student university, which is much closer to a big enterprise than what is done inside the research labs at the university. So I have this background; Carl's is purely academic originally, but then he shifted to the enterprise world to make this happen for real organizations. Sophie, well, she characterizes herself as a pure data scientist, but I know she's doing some data engineering, even if she doesn't want to.
She doesn't want to, and that's exactly the point she made, which is really important. But still, at some point you have to do that data engineering, so what's it about? If I go back to what I was beginning to explain a few minutes ago: when I started working on things related to data science, let's say four or five years ago, there was this common knowledge that data scientists did not exist. The data scientist was described as a unicorn, because to be a data scientist you had to know advanced mathematics, algorithmics, and advanced statistics. That's the academia-related part, needed to understand the algorithms, especially at the beginning of deep neural networks and that kind of stuff, where if you have not done eight years of mathematics at college, you will struggle. Maybe not with understanding it, but explaining it or using it can be hard. At the same time, you had to know a little bit about computer science, because you had to be able to program this and enter it into the machines. Deriving from that, you had to know how to build those machines; you had to know your requirements for processing power, storage, networking, and the rest. And you also had to know about the business itself, because if you don't understand what you are trying to achieve, most of the time you will prove something that the business already knows. And it's true: the first thing you normally learn is that if you are trying to prove something the organization already knows, you're just wasting your time. So there is a business goal, which may differ from academia. In academia you will try to prove that your algorithm, your technique, works, that it produces interesting results. But in an enterprise, those results also have to be useful, which is totally different.
You can find something that is fantastic but has no application whatsoever. So there's this differentiation. Back to my story: four or five years ago, the data scientist did not exist, but you could build a data science team by bringing all those skill sets together. Some people are more versed in infrastructure and computing, some more in mathematics, some in business, and together you have this data science team. For whatever reason, in my opinion partly marketing, partly HR, partly "it's cool," the definition has shifted a little bit. Now the common expectation for a data scientist is more focused on the science aspects: algorithmics, and all those TensorFlow and PyTorch algorithms that you would use to train a model to recognize images, patterns in data, and whatnot. That's what would now be the common description of a data scientist. But still, the rest of the work has to be done: gathering the data, storing it, transforming it, cleaning it, securing it, making it available. All this not-so-interesting stuff that you have to do still exists, and so along came the name to describe it: the data engineer. That's what the data engineer does. And it's not only the data engineer, because after, let's say, you have trained a model, you have to put it into production. That's another skill set, which is now called a software engineer or ML engineer, machine learning engineer, someone who uses those models. So we have seen these roles emerge. Maybe it's a refinement process: we started with a general name for this new role, this new science that emerged, and now, because all of this is still really young, two to three years tops, we are seeing these different categories emerge with different skill sets. And I have here, maybe I can share this very quickly.
I have this image, and I'm not even sure it's true; it's meant to challenge. It's part of a presentation asking, "Has the data scientist vanished?" So it's more of a challenging blog post than anything else, but it shows those different aspects: data engineering, software engineering, research, and data business analysis. You see the different aspects of the work that you have to do around this. And take this example: it's Lyft, the ride-sharing company (sorry if I'm wrong and it's not the right company). They made this change back in 2018 or 2019. The position that they called data scientist back around that era is now called research scientist, and people who were called data analysts are now called data scientists. That's just to say that those definitions are still floating and quickly evolving. So it's much more interesting to focus not on the name of the role but on what's being done, which is data engineering: all the things I began to explain about data processing, storing, and so on. That's my personal view of the story, but I'll let Carl and Sophie come in, because I'm sure they have another perspective on this.

Yeah, that was an awesome intro. I can't find anything in there that I would disagree with. One of the ways that I like to start conversations like this is through a story, or just personal experience. Years ago now, probably five or six, I was working with a not-for-profit institution, and we were trying to figure out the prospective donors. So how can we start to, you know, not only regress and predict how much a donor would offer, but how do we take limited resources and our staff to go meet with these people and talk about the donations? You know, where do we apply our resources, right, to prospects?
What we found, as we went through this data science process and as I looked at the data, tried to understand it, and built a model, was the conclusion that I just needed more data. Anyone who spends time around a data scientist is going to hear those words: "the model's promising, but I need more data." And I stopped for a second and thought, wait a second. If I had more data, in other words more donations or transactions, I wouldn't need to build a model. It seems kind of trivial, but at that point in my career, pretty early on, it was kind of earth-shattering. Then I started doing the research and figured out, oh, there's a whole data strategy component to this; there's data engineering that needs to happen. You can't just take more of something and expect the end result to simply be better, because in that case, more of that something would eliminate the need for modeling. So I shifted my perspective from just more data to higher quality or different dimensions of data. And that's really where I view data engineering and how it fits into the puzzle. Cleaning, sanitizing.

Okay, make sense of it for me, right? Like, where do you put it? Do you just stick it in Spark? Do you throw it into a MySQL database? I mean, there's these things called data lakes, and all that fun stuff, right? Let's cut through some of the terminology and get to the real "where does the data go" kind of thing.

Yeah, can we start with where does the data come from?

Yeah, great: where does it come from? Starting from that perspective, I guess it will be easier for the audience to follow. Of course, that's the first problem of the data engineer. We're pretty much agreed that the data scientists take the output of everything that happens before, and they have a clean set of data that they can use.
But before that, of course, you have to gather the data, and that's what Carl explained, that's what he faced at some point: where do I get this data, and how do I acquire it? There's a whole set of solutions, software, and techniques to help gather this data. It could be as simple as plugging a sensor into your environment, and then you are harvesting the data directly from the sensor. It can be web scraping: every day you go onto a web page and scrape the latest results, I don't know, from a stock exchange or whatever, and then you play with the data. So, myriad possibilities, and each of them comes with its own challenges. Do you want real-time data? Do you want to do this web scraping as a batch every day? Or once a year? That's part of the refresh question, and that's part of the data engineering job. But okay, let's say you have figured this out. Of course, what you have is raw data. So you first have to select what you want to use, because in the sensor data that I took as an example, maybe there are five, or five thousand, entries that the sensor is measuring, but in fact only two or three of them are really useful. The interesting part is that you may not know that from the start. Meaning that you have to be clever enough to prepare yourself for rebuilding and rebuilding and rebuilding this pipeline again, because you will start with something: okay, I will take these 100 things and give all this data to my data scientists. Then they will say, no, you know, number 1 and number 10 are heavily correlated; in fact it's the same data, so you can just remove number 10. Oh, now you have to go back to your pipeline and filter out number 10. And then another iteration: oh no, in fact number 10 was useful because it was a little bit different. That's exactly what happens in the real world.
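As an aside for readers, the "drop the correlated column" iteration described above can be sketched in a few lines. This is a minimal illustration, assuming pandas is available; the sensor column names and values are invented for the example.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])  # keep the earlier column, drop the duplicate signal
    return df.drop(columns=sorted(to_drop))

# sensor_10 is (almost) a scaled copy of sensor_1, so the filter removes it
readings = pd.DataFrame({
    "sensor_1":  [1.0, 2.0, 3.0, 4.0],
    "sensor_10": [2.0, 4.1, 6.0, 8.2],
    "sensor_2":  [5.0, 1.0, 7.0, 2.0],
})
clean = drop_correlated(readings)
```

In practice this check would live inside the rebuildable pipeline the panel describes, so that each feedback loop from the data scientists only means changing a threshold or a column list, not rewriting the ingestion.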
And that's why, and this dates back to a survey in 2016, I guess, but nobody has challenged it so far, so I assume it's still the right number: 80% of the work you do in data science in general is spent exactly on this. Figuring out the data part: data cleaning, aggregation, ingestion, or whatever. So you do this ingestion, then you do the cleaning, selecting the things. But of course you may have to group different inputs of data, so you have some aggregation and refining to do. And finally, after some heavy work, you store it, because you've invested in this data at this point; you want to save it somewhere. You want to save it for two different purposes: to be able to work on it, making your inferences and things like that, or to train your models. So maybe two different sorts of data storage. There is some cold storage, because maybe the data scientists will run their magic only next year, but they will still want access to the past 10 years of data. And there is other data that you want in real time, to be able to do the inference and everything. So, different techniques so far. That's what we would call the data lake. And more and more, we see these named feature stores, because features are points of interest that you can use for your model. We're evolving from this data lake thing, where you put everything in raw more or less and figure it out later, to something more refined, which is a feature store, where you have already done the refining part. Then the data scientists can pick directly from it to do their stuff. So, to your question about the tooling: depending on the size of the data, you will have different solutions. Of course, if you run into big data, a few terabytes if not petabytes of data, you definitely have to use distributed engines.
The old ones like MapReduce; more commonly now, Spark; or the new things like Starburst Galaxy, which is the new name for Presto, or Trino, the upstream version. Those heavily distributed query engines will help you do the sorting out and cleaning, because most of the time it's lots of SQL queries; it's not that different from a standard database. But one interesting thing I've seen is that people tend to use what they know. Which is never a bad thing, but sometimes it's good to take a step back and think of something else. I saw in a chat a few days ago: "Oh yeah, I intend to use Spark to do this." And, okay, but you've got 500 megabytes of data. Don't bother Spark with that; you'll spend a lot of CPU time doing nothing. Spark is great, but it's still some heavy lifting to get it tuned the right way. Maybe just pandas, a DataFrame in Python, is more than enough for what you want to do. And, well, the data scientists won't like me, but I don't care: most of the time Excel, or any spreadsheet, is just perfect, depending on what you want to do. That's what differs, maybe, between academia and enterprises: it's about finding the right tool, where you choose just what is necessary but no more. Otherwise you're overspending on resources and implementation time.

So if you do work in a... yeah, I was about to say, I called you a unicorn earlier, by the way, right?

He did give you credit for being a unicorn.

Oh, well, no, that was the data scientist of the past, who knew how to do all of the things. I'm the more modern, less unicorn-y version, I think. But I'm sure it's fine; it's a virtual Friday, after all.
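For readers, the right-sizing point above can be made concrete. A minimal sketch, assuming pandas: at a few hundred megabytes, a single in-process DataFrame handles the whole ingest-and-clean step with no cluster at all. The CSV contents and column names here are invented stand-ins.

```python
import io
import pandas as pd

# Stand-in for a "small" raw export; at a few hundred megabytes a single
# pandas process is usually enough, no Spark cluster required.
raw_csv = io.StringIO(
    "ts,reading,status\n"
    "2024-01-01,12.5,ok\n"
    "2024-01-01,,ok\n"          # missing reading: drop
    "2024-01-02,99.0,error\n"   # bad status: drop
    "2024-01-03,13.1,ok\n"
)

df = pd.read_csv(raw_csv, parse_dates=["ts"])
clean = (
    df.dropna(subset=["reading"])   # remove rows with missing measurements
      .query("status == 'ok'")      # keep only valid readings
      .reset_index(drop=True)
)
```

The same three-step shape (read, filter, reset) would translate almost line for line to Spark DataFrames if the data later outgrows one machine, which is part of why starting small is cheap.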
So, again, when you're talking about the data engineering process there, and particularly this movement toward using feature stores: first off, it sounded like there was a lot of interaction with the data scientists, a lot of back and forth to determine what the data scientists needed. And that led me to wonder: I feel like you're talking about a specific type of data engineering, where the data scientist knows what problem they have to solve. They've already been given their tasks; they know what they're trying to optimize for the business. Is there a different approach when you've just got these sensors up everywhere and no one's actually using the data yet? Because that's what usually happens, right? People just collect data for data's sake. What would you do with that? Do you still call it data engineering if you're just dumping it somewhere, not looking at it at all? Is that a different role?

Okay. Collecting the data is never a bad thing, because you're right, you never know what you can do with it. That's what has been said to organizations for the past few years: collect all your data; at some point, it will be useful. And I believe it's true, because of all the fantastic things happening in this domain every day. It's not even every year; it's every day that there are new techniques, new algorithms. So data can always be useful. Now, for the part where you look at the data trying to find something without any precise business goal, it comes down to business decisions. Exactly like the data store: if you have the money, if you have the resources to store 10 petabytes of data a year, yeah, go ahead, do it. But it costs quite a lot of money. It's the same for data scientists.
They are unicorns, so they are really rare, so you have to pay them a lot. Not as much as they should be paid, even. I have the utmost admiration; I'm definitely not a data scientist, and the mathematics behind this is baffling, so I have huge respect for data scientists. But still, if you are paying them without any precise goal, at some point you will have the CEO looking at the lines at the end of the year: okay, I hired those 10 data scientists, it cost me two million this year, or whatever. What's the result? Exactly as a CEO would look at, okay, this factory that I bought last year because it was a bargain: what's the return on investment? Zero? Okay, let's scrap it, let's resell it, or whatever. Unfortunately, in the chats, in the community, I've seen this beginning to happen. Business people are beginning to challenge the usefulness of having a team of data scientists: okay, that's perfect, but what's in it for me? At the end of the day it has to be useful, and I would say it was, in my opinion, a wrong business decision from the start. You don't start a project without any specific goal, or at least you call it a POC and give it a defined timeframe. Okay, for the next two years, I will put a bunch of data scientists on it. If they find something useful, perfect. If not, it was an investment. As a businessman, I might start maybe 10 projects like this a year, and out of them two will succeed, two will be so-so, and the other six will be just crap. That's investment; that's Investment 101. And I'm definitely not a businessman, so I won't pretend it's a rule, but I've managed many projects, and some projects are successful, some are not. So, we are beginning to see this trend of people asking, okay, but what am I doing exactly with this? So, to your point, if that's your question:
Yes, it's important to gather and keep as much data as you can. But in my opinion, don't start without at least a precise business objective: I want to increase my revenue by 10%, or I want 10% more customers, or I want to know more about them to target my marketing campaigns, or I want to improve the overall quality in my factory, I want only 1% defective products instead of the 10% we have now. Maybe the data science part won't solve the problem, but at least the data science team has a goal; they can focus on something. And, you know, it's not that different from standard science. In science, it happens that serendipity gives you something that you were not looking for. But most of the time you have a reason, you have a project goal, you have a research goal. You want to cure cancer, generally speaking, or you want to understand the interaction of this molecule with that one. You have a goal; you don't just come in in the morning and say, oh, okay, let me mix some magnesium with some other thing and see what it does. Nobody does science like that. So you cannot expect data scientists to do the same thing in an organization: oh, let's see what happens when I mix this data set with this data set. If you start like this, well, you're in for much disappointment.

Yeah, because you've seen this before. What you're describing is taking a very logical, rational, scientific process and trying to just use luck or pure chance, or turn it into an art form. And yes, there is art as a part of this, but that comes from experience, that comes from practice. It doesn't come from willfully combining chemicals in a laboratory setting. I mean, I'm pretty sure that's very much frowned upon, and probably very dangerous as well.

Yeah, don't mix magnesium; it's very dangerous in certain forms.
Yeah, basic science taught me all that: don't mix random chemicals together. I think we need a PSA now: don't mix chemicals at home, people. But I want to go back to what we were talking about, the poorly scoped business definitions and the lack of a return on investment in data science, because this is something I hear all the time. I mean, all the time; we used to get paid to fly around the country to talk about this sort of thing, right? And to me it doesn't make sense. When we think of the diagram that Guillaume put up earlier, there was research and data science and data engineering and data analysis, right? That's the whole picture. And it's my personal opinion that a lot of people focus on the lack of a return on investment in just the data science aspect, because they're not considering all of the ancillary but required actions that need to be taken when we look at it in this view. And if you go back to my experience with low-quality, insufficient data: if we had a robust and funded data engineering process, and the models happened to not work out, the business is still all the better for having a data foundation. That's what everything starts from, that data foundation. So often I would see groups say, well, let's just hire these people over here, they're data scientists, they're smart, they can do something with that data, because no one else in our organization has been able to yet. Well, that's not really the right approach. So a lack of return on investment is possible, but not at the numbers that we see out in the community, right?

Yeah, and I think there's that notion that data scientists will save the world.

That's all on you, Sophie.

Yeah, I mean, I don't need that kind of pressure, but also...
I think this goes back to what Chris Chase and I were talking about with you, Chris, at the last session we had, where we talked about how to integrate the work of data scientists into what application developers are doing. You know, we've seen these silos of data scientists off working in their own snowflake environments. And then when you want to take the artifacts that they've created, be it a model they've trained in a Jupyter notebook, or a model they trained in some Python file, well, we're not application developers; we don't have experience putting things into production. So when we go to put that into production, it just kind of doesn't happen, and the businesses aren't seeing the return. If we think about that machine learning workflow, with the data scientists kind of here, and then the application developers, there's some kind of communication and hand-off there. And with the data engineering, it's just at the other end. I don't know whether the monitor is going to be flipped as this gets streamed out to the world, so maybe my arm makes no sense at all right now, but it's kind of a timeline, right? So I'm seeing data engineering as such a critical thing that is needed so that Chris and I can get to the point where we can put that model into production, because we need all of that data, we need that communication. And if you take any one of these people out of this arm, then you're not going to have that nice workflow that lets you put these intelligent applications into production.

That makes sense. And that comes back to your prior question about the link between data engineering and data science. It's teamwork. It has to be teamwork, because at the end of the day, you can choose to spend a few months trying to refine your algorithm to gain some more efficiency, or maybe you can throw more CPUs at it and that's it, it works.
So you see, or you can try to figure out, okay, why is my calculation taking so much time? I can tell you: because you have a crappy network, or your storage has some bottleneck, or whatever. All these things are totally intermixed. You cannot take each and every part by itself and say, okay, data engineers, you prepare the data and that's it; data scientists, you do the calculation; and then software engineers, you put it into motion. No, it's a continuum, and all the pieces have to be integrated together, because of the scenario I described at the beginning: you will have constant feedback loops. You modify the data that you ingest; then, of course, that modifies your algorithm; then you have to redeploy into your production systems. That's a perfect example of continuous development and integration, what we have been advocating in standard development for years. It's not just applicable to data science; it's compulsory for data science. That's how it works. Data is constantly changing, and that's even more true compared to standard applications. Let's say you have developed something to manage your finances. Ten years later, people will still be entering spending or earnings, and the application itself won't change much. That's why we still have tons of COBOL lines in those banking systems; it doesn't change much. Of course, it has to be adapted to regulations and some evolutions. But it's totally different from data science, where those models have to be constantly retrained on fresh data, or at least tested continuously against the data that is coming in, and then redeployed into production. That's a huge difference versus a standard application, where in most projects, even if it's in agile mode, you have this waterfall process: okay, these are the business requirements, they're fixed, finished; then you do the development, and then you deliver it into production.
And that's over. That's not true for data science. It has to stay alive; it's a continuous project. And that's really challenging for any organization.

Right. And on top of that, something I always think about with data science, which becomes more and more important, is repeatability and traceability. If I train a machine learning model, I need to be able to tell somebody at some point, some stakeholder or some customer, why I declined their mortgage. I need to know what data it was trained on, what day it was trained, what hardware it was trained on. Can I reproduce that result, down to bit-level reproducibility, so that we can go back and potentially advocate for the decisions that we made? Like, we made this decision because of X, Y, and Z. So for me, I need to know exactly where my data came from and what happened to it. And so I'm leaning on Guillaume and Carl to save the world.

Yeah. Team effort here. We all have capes, it's true. Nice.

But Guillaume, you reminded me that I was an unfortunate data engineer at a company I used to work at, because I worked with the data scientists to help them get their data where they wanted it, to help them run their models against it. I had to teach them a lot about building containers and about using AWS (it was Aurora, I think, that they were using). Having to go between that DevOps mode and the "you just need to store the data and clean the data" mode was a very hard shift. The skill sets didn't necessarily jive at the time, many years ago. And the fact that I didn't realize what I was doing was actually data engineering was probably part of the problem, right? I just worried about spend at that point: how much is this going to cost? The business problems were what I was worried about.
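For readers, the traceability requirement described above often starts with something as simple as recording a manifest alongside every training run. A minimal sketch follows; the field names, dataset snapshot, and parameters are invented for illustration, and a real setup would also capture hardware and library versions.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def training_manifest(data_bytes: bytes, params: dict) -> dict:
    """Record what a model was trained on, so the run can be audited later."""
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),  # exact dataset snapshot
        "params": params,                                       # hyperparameters used
        "trained_at": datetime.now(timezone.utc).isoformat(),   # when it was trained
        "python_version": platform.python_version(),            # part of the environment
    }

snapshot = b"applicant_id,income,approved\n1,52000,yes\n"
manifest = training_manifest(snapshot, {"learning_rate": 0.01})
print(json.dumps(manifest, indent=2))
```

Storing a manifest like this next to the model artifact is what later lets you answer "what data was this mortgage decision trained on, and when?"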
I was trying to run enterprise operations, not necessarily trying to help the data scientists figure out what problem they were trying to solve. I was just trying to get the data to them.

Yeah, data engineering is a really results-focused area, meaning you're the data plumber. Nobody cares how you fix things, but at the end you have to stop the leak, or you have to bring the water from the water heater down to the faucet. I don't know how you do it, but you have to do it. That's why, in my opinion, it's a little bit more of a jack-of-all-trades skill set. You have to know some SQL; if you have a background as a DBA, that's perfect, you know this. But you also have to know something about storage, because at the end of the day you have to store the data, and the format, the performance, and all the different techniques around that are really important. Some networking will help: if you understand the protocols and their efficiency, it will be easier to move your data around. So it's a little bit of all the things related to data that, in my opinion, make if not a good, at least a somewhat efficient data engineer. And that's what I see in my practice. We have started this project called the data engineering jumpstart library, which I talked about last time with Chris Blum and you, Chris. It's about bringing directly usable patterns, what we call patterns, that you can mix and match together to create your data pipeline, which is data engineering. Those patterns can be simple: Kafka to S3, for example. I have some data flowing into Kafka; how do I persist it in Parquet format into object storage so that my data scientists can use it?
Okay, this is a pattern that's in the library: we describe it, and we are publishing and sharing the code on how to implement it on OpenShift. That's one pattern, and we are devising many more. Maybe I can quickly show you this; let me just share my screen. So you see, another example of a pattern is: trigger a Kafka event when an object is stored into my object storage. That's to create an event-driven architecture: whenever something new is stored in my bucket, I send information to Kafka. Another one would be: whenever something is happening on my Kafka bus, how do I trigger a serverless function, a Knative function, on my OpenShift cluster that will be able to process this event? That's another pattern, and so on. We are creating all these things in the data engineering jumpstart library. The idea is mixing all these things together so that, at the end, you can create the pipeline you need to ingest, transform, store, and make your data available. So whatever is needed in terms of skill set to achieve this will define you as a data engineer. That's why it's not a pure background in this or that specific thing; I've seen more and more people coming from various horizons, each with their own strengths. But I guess there are some things you cannot skip, the main one being SQL. You have to know your basics in SQL. At the end of the day, even if it's not the SQL language itself, at least relational databases and those types of queries: you will have to use them at some point, in some form or another, definitely. And the dog agrees. Perfect. Chaos here this morning. Containers. Building containers for data science, we were talking about that a little bit in the pre-show, right? Data scientists lean on the bleeding edge of stuff.
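The event-driven half of those patterns, where a message arrives and a function runs, can be sketched without any cluster at all. Below, a tiny dispatcher maps event types to handler callables, which is the in-process shape of what bucket notifications plus Knative Eventing do for you at scale. The event names and handlers here are invented for illustration, not taken from the jumpstart library.

```python
from typing import Callable, Dict, Optional

class EventDispatcher:
    """Route events to registered handlers by event type.

    This mimics, in one process, the serverless pattern: the
    platform matches an incoming event (a bucket notification,
    a Kafka message) to a subscribed function and invokes it
    with the payload.
    """

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[dict], object]] = {}

    def on(self, event_type: str, handler: Callable[[dict], object]) -> None:
        """Subscribe a handler to an event type."""
        self.handlers[event_type] = handler

    def dispatch(self, event_type: str, payload: dict) -> Optional[object]:
        """Invoke the matching handler, or return None if nobody subscribed."""
        handler = self.handlers.get(event_type)
        if handler is None:
            return None
        return handler(payload)

if __name__ == "__main__":
    d = EventDispatcher()
    # Hypothetical handler reacting to a new object landing in a bucket.
    d.on("object:created", lambda e: f"processing {e['key']}")
    print(d.dispatch("object:created", {"key": "data/batch-0001.parquet"}))
```

The point of the real pattern is that the "dispatcher" is the platform itself, so handlers scale to zero when idle and scale out under load, which a plain in-process loop cannot do.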
Speaking as, you know, an RHCSA, or a DevOps engineer, whatever you want to call me, I'm focused on stability and resilience. So I'm going to run things like RHEL and OpenShift. It's not necessarily the latest and greatest, so how can data scientists consume those bleeding-edge things in an enterprise environment? That's kind of my question. Or engineers, for that matter, because I'm sure you all use some fancy stuff too. This is the other part. Okay, you wanted to go in. Yeah, please, take a stab at it, Carl. Yeah, I was just going to say, what Guillaume just described is highly technical, it requires unique skills; these are non-trivial problems. I mean, these are hard technical problems that data engineers face all the time. The same is true for data scientists, but their hard and difficult problems are slightly different: we're talking about Bayesian statistics, loss functions, tons of math. What Red Hat brings to these fields is that stability and that consistency. There are a lot of repeatable patterns we can apply as data engineers, and things data scientists may want to start adopting from the teachings of DevOps engineers, that will give us more organization, more logical patterns, and allow us to go from one model to 10, 50, 20, I mean, 100. That's the goal: the ability to cleanly scale. Because if we have complex environments, if we have complex pipelines, if we have data scientists writing their own data engineering code that differs from the data scientist a few cubicles away, or a few streets away in their guest bedroom, whatever, you can't just add people and expect linear scale. At some point this will taper off unless we bring in the consistency and the patterns that we are so good at at Red Hat.
And that's one of the first value propositions for Open Data Hub, or Red Hat OpenShift Data Science: it brings this reproducibility. Because, you know, that was the case before, and it's still the case in many organizations, that a data scientist is told: okay, here is your PC, your laptop, whatever, go do some data science. Okay, first thing, I install Python, I install a bunch of libraries, and Carl will do the same on his own laptop, and Sophie will do the same on her own laptop. And then at the end of the day: hey Carl, Sophie, can you look at my model? Okay, they try to load it. Oh no, it doesn't work, because I have library version 1.2.1 and you built it with 1.2.2, which I don't have yet. Doesn't work. End of story. And again, that's generally how it works: this artisanal mode of doing things, everyone with their own skill set and their own set of tools, fine-tuned to their own way of working, which is not compatible with the industrial mode, the enterprise mode, the scale mode that we want. So the value proposition of Red Hat OpenShift Data Science and Open Data Hub is to bring this reproducibility, because each and every data scientist just spawns their environment, and it comes with the exact same installation of Jupyter, of curated and selected libraries, and all the other tools that come with it.
That's one way to ensure this reproducibility. Because then, if you do things properly as Sophie described (okay, this is the training of this model, with this version of the data, at this point in time, and there are tools to help trace those aspects), you are able to reproduce it in another environment. Which then simplifies the job for the software engineers, because they can say: okay, this model was trained with this set of libraries, so when I do my packaging, I know exactly which image and which libraries I have to use, and then I just fetch the model trained with these parameters. And this I can automate, because I have the consistency, I have all the data. Sometimes people ask: okay, but I could install Jupyter notebooks by myself. Yeah, go ahead, but that's not the problem. The problem is to install it exactly the same way for each of your data scientists, and to do it fast. You cannot rely on Bob installing everyone's laptop and doing exactly the same thing every time. You know, a quick backstory from my old times: at some point, as a project manager, I ordered a cluster of three machines. Each one came configured differently. They were supposed to work as a cluster, but: "Oh yeah, this one I did the week after the first one, and then I changed this parameter," the technician told me. "Oh yeah, I forgot to install this package." Guys, it's supposed to be a cluster! So you see, that's how it works in any organization. It happens. So you have to shield yourself from this; you have to automate everything. And that's part of what we do with Red Hat OpenShift Data Science.
We have automated the deployment, the curation of the libraries, and everything, so that you have this reproducible environment that will be exactly the same now, in six months, and whenever. And you are then able, as a data scientist, to work on data science, and not to spend your time figuring out why a specific package doesn't work anymore in this environment because of blah blah blah. That's the value proposition, much more than the tools themselves, which are open source anyway. Yeah, all the tools are open source. Right, and I think, on top of that, we've talked so much about the communication that has to happen between data engineers, application developers, and data scientists. And it just makes sense to have everybody working on the same platform. If everyone's on OpenShift, it makes those communications, that sharing of work, and that reproducibility easier, not just between all the data scientists, but for the data scientist to app dev handoff, the data engineer to data scientist handoff. Maybe handoff is the wrong term because it's such a back and forth, but it's early here. No, handoffs, like we say all the time: a handoff is an expense, a transaction that occurs between two entities within an organization. So it's always going to be a conversation; it's just whether you have that conversation programmatically, or whether you're setting the thing up and you need to whiteboard out exactly what you need to do. It's always a discussion. But I have a funny little aside about me creating a model and passing it off to another data scientist for some QA; that was our process, right, we would check each other's work. For the longest time, I couldn't understand why my models would not build the same way on her laptop.
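A minimal sketch of the "exact same environment" idea: before doing any science, diff the libraries actually installed against a pinned spec. The helper below only compares two version maps; in a real setup the "installed" side would come from something like importlib.metadata or pip freeze, and the pinned side from a lockfile. Those tooling choices, and the function name, are assumptions for illustration, not anything shown on the stream.

```python
from typing import Dict, List

def env_mismatches(pinned: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """Report drift between a pinned package spec and an environment.

    Returns one human-readable line per problem: packages missing
    entirely, and packages whose installed version differs from
    the pinned one. An empty list means the environment matches.
    """
    problems: List[str] = []
    for name, want in sorted(pinned.items()):
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: missing (want {want})")
        elif have != want:
            problems.append(f"{name}: have {have}, want {want}")
    return problems

if __name__ == "__main__":
    pinned = {"numpy": "1.24.0", "pandas": "2.0.1"}
    installed = {"numpy": "1.24.0", "pandas": "2.0.3"}
    for line in env_mismatches(pinned, installed):
        print(line)
```

Running a check like this at notebook startup, or better, baking the pinned versions into the container image so the check can never fail, is exactly the "1.2.1 versus 1.2.2" problem made visible before it costs anyone an afternoon.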
It eluded me for an embarrassingly long time. In that case, we were getting different answers from these models. It wouldn't converge for her, or we were getting wildly different results, but it wasn't truly failing. And it all came down to the linear algebra libraries installed on your laptop, which most people don't even know are there. I was truncating floating points and she wasn't, and neither of us knew that. So, you know, there are times when failing with dramatic red text and stack traces is a good thing. If you're not running containers and you don't have that consistency, you want the errors; you don't want to be stuck in this endless loop of silent failure and confusion. I could see that. Yeah, if you're not standardizing, if you're not creating a consistent environment for people to do whatever it is they're doing with the data, then you're going to have wildly different results, which leads to a lot of confusion, which then feeds into that business question of "what am I investing in?" Whereas if you create a consistent environment to run models against, one that's flexible enough to say, okay, today we're going to upgrade this version of PyTorch and this version of JupyterHub, and we run everything, and you consume lots of CPU in the process and churn through all the data, you potentially get better results, and everybody benefits, because you can redistribute that model. And when I say model here, I mean the model of containerizing and the actual applications in use, not the data model. Sorry, Sophie.
That's exactly why containers are so fantastic for this kind of thing: they're one way to achieve this reproducibility. It's true for many applications, but in data science it's especially true because of the mathematical aspects of those algorithms, which always have some randomness and seeding aspects. Let's say you have the exact same data, the exact same everything, but you change the version of TensorFlow you're using, just one minor revision: the result will change. And even if you run the same data with the exact same version of the library, the results can change because of these fuzzy randomness aspects of the algorithms. So intrinsically, things are already skewed a little bit; you don't want to pile on and add more diverging aspects to what you're doing. You have to stick with: okay, I built this, this is the container version with everything I need, and this is the truth. And then you keep it, because that's another important aspect: you have to keep it, because two years from now you may want to run the exact same thing against a new set of data, and it has to be the exact same thing. Otherwise it won't work. Awesome. Well, we are rolling up on the top of the hour, we've got about three minutes left. Is there anything anybody wants to share as we wrap up here? Any little thing to help the data scientists out on a virtual Friday? Maybe a second cup of coffee. Yes, I would need another pot. Maybe what I could add is, because of what I explained at the beginning: don't focus too much on the name or the title or whatever. There's a huge divergence now between what people are doing, the title they have, and the title with which they were recruited.
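Those fuzzy randomness aspects are the one source of drift you can partially tame yourself: seed every random number generator your run touches. Here is a stdlib-only illustration of the principle; real training code would also seed NumPy, PyTorch or TensorFlow, and any data shufflers, which are assumptions beyond this sketch, and even then, library versions and hardware can still shift results, which is why the container gets frozen too.

```python
import random

def noisy_sum(seed: int, n: int = 5) -> float:
    """Sum n pseudo-random draws from an explicitly seeded generator.

    With the same seed, the same library version, and the same
    platform, this is fully repeatable; change any one of those
    and the "same" computation can drift.
    """
    rng = random.Random(seed)  # isolated generator, no hidden global state
    return sum(rng.random() for _ in range(n))

if __name__ == "__main__":
    a = noisy_sum(seed=42)
    b = noisy_sum(seed=42)
    c = noisy_sum(seed=7)
    print(a == b)  # True: same seed, same sequence
    print(a == c)  # False: different seed, different result
```

Using an isolated `random.Random(seed)` instance instead of the module-level functions also keeps one component's draws from perturbing another's, which matters once a pipeline has more than one random consumer.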
You have to focus on what you want to do, what you want to achieve, your skill set, because in six months it will be named something else. That's why it's important to understand all those aspects: gather the data, clean the data, make it available, the feature store and everything like that, then work on the data, then package it. Whatever name you put on this, data engineering or whatever, that's not even what's relevant. It's the whole process. It's the team aspect: as a team of data engineers, data scientists, and software engineers, you have to put something into production at the end of the day. If you have this clear sight and understanding of everything, that's for the best. That's the advice I would give. Don't focus too much on "I'm a whatever engineer, I'm doing only this or that." Yeah, maybe. Maybe today you're only doing that. All right, awesome show, everybody. Thank you all for joining today. I appreciate all the insights, as always. Thanks for having us, Chris. Hey, anytime. Coming up next on the channel, I'm going to sit down with the one and only John Chapman Jr. for an episode of In the Clouds. John is an advisor to our B.U.I.L.D. group, Blacks United in Leadership and Diversity, as well as the head of Red Hat IT, or a director in IT, which is by far the best IT department I've ever worked with, hands down, bar none. They make life easy, so it'll be an interesting discussion to see how they do that, and how they go about problem solving in that enterprise kind of scenario. So please tune in for that at 11 a.m. Eastern, 1500 UTC. And until I see you again, please stay safe out there, and come back and watch us soon.