Hello everyone, my name is Phil Winder and I run a business called Winder.ai. We are AI consultants: we do development and consulting in the fields of machine learning, reinforcement learning and MLOps as well, so the application and the operation of those AI applications. I'm also the author of O'Reilly's book on reinforcement learning, so I know quite a lot about that. But today we're going to be talking a lot more about data engineering. I'm here with my colleague Enrico, and Enrico has done a load of really cool work investigating the ecosystem surrounding general-purpose compute. If you come from the Hadoop world, well, you might want to plug your ears or something, because there's going to be a bit of Hadoop bashing in this presentation. In terms of the outline, I'm going to start by digging into some high-level concepts about what data science is, and specifically the life cycle of data engineering, because it is relevant to this project and it is relevant to what Enrico is going to say, but you do need a little bit of background information in order to understand it. Then Enrico is going to jump into a load of examples of how you do this in various frameworks, and he'll talk about his choice of frameworks to analyse a little bit later. Then he's going to talk about the pain. If you just wanted the 10-second version of this presentation: the main reason why we're so interested in this project is that most data engineering and most data science tooling is rubbish. It's really not very nice to use. It's really hard to use. So if we have an opportunity to build something that is nice to use, that's going to be great. Then I'm going to try to map those solutions back to the portfolio that I presented a little bit earlier and pull out some key themes and ideas that I think Bacalhau, or the subsequent version of Bacalhau, needs to adhere to.
OK, so to begin with, I wanted to describe what data science is, the workflow and the life cycle, but specifically where the pain points emerge in this life cycle. There are a number of dimensions you could use to describe this; I chose these two. I can't quite see the arrows at the top there, but it doesn't really matter. Basically we've got two axes here. On the y axis, up and down, we've got the traditional data science life cycle. This is the thing that's taught in most data science courses: how you go from having some data to a serving, usable product at the end, and we'll go through those phases in a little bit. The other axis I've called AI maturity, but really it's just normal engineering project maturity. You've got different phases in the development of your application: it starts off with trying to prove the viability of something and you go through various different stages. Again, you may have more or fewer of these stages and phases, but eventually you're trying to get to production. That's the ultimate aim. One of the hardest things to describe to non-data scientists about this process is the jump from proof of concept to MVP, or the jump from MVP to production. There's like a massive gulf, a big trench, in between those two things, and you've got to try and span that trench. It's a bit like with the projector today: all you want to do is plug that cable in and you want the projector to work, but the projector gods are just saying, no, no, your laptop's not good enough, it's not going to work. It's the same in data science. You've got your code, you've got this pandas code that looks really simple and really nice and it's pre-processing your data. I just want to put it out there on the internet, or on a cluster or something, and you've got to go through a complete translation process in order to do that, and it's really horrible.
On the left-hand side, we've got working with the data. We've got feature engineering, so that's taking raw data and turning it into something that's useful in the model itself; an important data engineering step. Then we've got modelling, which is principally the role of a data scientist: they're responsible for investigating, analysing and deciding upon which algorithm and which implementation they're going to use to build this model. You go through some training process, the evaluation is trying to figure out how you're doing, and then the deployment. This is the traditional data science process. When you mix that with the AI maturity, you effectively get this pattern where, when you're doing a POC, you're just trying to prove whether the whole idea is viable. You want to fail fast, you don't want to spend much time or money, so it's all local, completely local. You're working on your laptop, you're working in notebooks, you're working with local data. You're just trying to prove that it's even possible. When you make that jump to try and get something in front of a user, then you have a whole load of rework to make these things more scripted, more declarative, more definitive basically, so that a larger system can take over and run it. Finally, you get into production and you've got to be reactive. So there are three key phases there. The other important dimension in this picture is that this isn't one data scientist's job. In the past, when you said you were a data scientist, you did all of this, but in fact that doesn't happen anymore. In most companies where there's more than one data scientist, this role has been split out into specific sub-domains: into data engineers, ML engineers, DevOps obviously, MLOps platform engineers, and so on and so forth.
And so there's actually a real mix of people all working on this one so-called data science project. It's not just a data scientist. And I think when we're talking in the context of Bacalhau as it currently stands, we're not actually talking about data science at all. We're talking about data engineering: we have the data in place, and we need to operate upon that data to produce other data. That is exactly the job description of a data engineer. So if I focus a bit more now on the specific role of a data engineer, then again, they have very different needs and very different tasks to achieve at different points in the project life cycle. When you're at the proof of concept phase, the whole goal is to give the data scientist the data they need in order to do their job. But in order to do that, you need to find it, so it needs to be well catalogued. It needs to be understandable. You need to mine it, which is kind of a way of describing mixing data together in order to produce something useful. You've got access control issues. And finally, that all needs to be locally accessible to the data scientist. When you get into some kind of almost-production state, then we need to start turning these things into pipelines, so that the feature engineering that's going on is encoded and repeatable. We've got schemas that need to be defined, because if you don't have schemas for your data, then it's very hard to test that your data conforms to that schema, and so on and so forth. And then finally, in production, you get higher-level governance-related issues with regards to your data: monitoring, making sure it's OK; testing, making sure it's good quality; scalable, governable, et cetera, et cetera. I'm going to come back to that slide a little bit later on, because I think Bacalhau does map onto some of these requirements.
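Just to illustrate that schema point, here's a minimal sketch of testing that records conform to a declared schema. The field names are assumptions, and in practice you'd reach for a library like pandera or Great Expectations, but the idea is the same:

```python
# A declared schema for the data: field name -> (expected type, required?).
# These field names are invented for illustration.
SCHEMA = {
    "zip_code": (str, True),
    "price": (float, True),
    "bedrooms": (int, False),
}

def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (ftype, required) in schema.items():
        if field not in record:
            if required:
                problems.append(f"missing required field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(
                f"{field}: expected {ftype.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems
```

With a schema written down like this, "is the data OK?" becomes a test you can run inside a pipeline, rather than a surprise at training time.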
And so I'm going to attempt to do that a little bit later. But for now, I'm going to pass you over to Enrico. Thank you, Phil. So here's an overview of some of the popular general-purpose computing frameworks. I just want to run a little test here: raise your hand if you've heard of or used any of these tools before. I heard some good laughs before. Well, it's cool. OK, so just to recap quickly, we have the Python, PyData kind of stack in the top left corner, we have database systems in the top right corner, and the big data players at the bottom. Let's take a look at some of these. So, Pandas is a Python library for data analysis. It's very popular because of its DataFrame object: a table you can run lots of interesting data manipulation on. It comes with utilities for munging and sampling, handling missing data, and so on. Very, very versatile syntax. Postgres is essentially a relational database. You can use it from essentially any language; it comes with lots of adapters. Its syntax is the popular SQL declarative syntax. Now, when do you want to use the other two players on the list, Dask and Snowflake? Essentially, when your data set doesn't fit on your laptop, you may want to consider these other two. But Dask and Snowflake, in terms of syntax or approach to data, are quite similar to Pandas and Postgres respectively. Let's take a look at a couple of examples. Now, this example is structured, tabular data processing. The idea here is we have a real estate housing data set, and we just want to compute the average price aggregated per zip code; just an example. On the left-hand side, you have a Pandas example. As you can see, after importing the library, data parsing and reading is just a matter of a one-liner. And processing it is also very concise: you do groupby, mean, bang, you have your result. Now, in Postgres, it's slightly different. You first need to create a table, which is a way to specify your data schema.
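Roughly, the two sides of that comparison look like this; the column names zip_code and price are assumptions about the housing data set:

```python
import io

import pandas as pd

# Toy stand-in for the real estate CSV.
csv_file = io.StringIO(
    "zip_code,price\n"
    "90210,1000000\n"
    "90210,1200000\n"
    "10001,800000\n"
)

# pandas: reading and aggregating are one-liners.
df = pd.read_csv(csv_file, dtype={"zip_code": str})
avg_by_zip = df.groupby("zip_code")["price"].mean()

# The Postgres equivalent: declare the schema first, then load and query.
POSTGRES_VERSION = """
CREATE TABLE houses (zip_code TEXT, price NUMERIC);
COPY houses FROM 'houses.csv' WITH (FORMAT csv, HEADER true);
SELECT zip_code, AVG(price) AS avg_price FROM houses GROUP BY zip_code;
"""
```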
And it's going to be very strict on that, and verify that your data set complies. Data loading is also fast, and what's highlighted by the red arrow is the SQL query to compute that average aggregate; quite similar to Pandas, I would say. So, let's take a look at these two players that have been making history in the big data world. Hadoop is essentially a distributed storage system, HDFS, and it also comes with a computing engine that implements the MapReduce processing paradigm. It's mainly a Java tool, quite famous for being very verbose, and it has a clumsy API compared to modern tools. It scales out well, though its intermediate processing steps are stored on disk, and this makes the whole computation a little bit slow. Spark is the newer successor trying to solve that problem, essentially by running in-memory computation, which makes it faster. This is also a Scala/Java kind of tool. What I want to highlight here, though, is that these tools are production-ready, right? It means they provide stability, high availability and configurability; as we'll see in an example later, they introduced the concepts of jobs and pipelines, some try to have a more declarative syntax, and they scale out to very, very large data sets. Pandas, on the other hand, is really hard to operationalise at the deployment phase; it's very, very good on large laptops, though. Let's take a look at an example of these two. Now, Spark. This is the same average house price example I showed before, this time in Spark; that's Scala code. As you see, it's a little bit more verbose than the previous options. You can see that, essentially, you start by creating some context that allows you to communicate and connect to your remote cluster. Then there will be a session set-up where you configure your parameters and maybe pass credentials or whatever, and you define your job there. Then, Spark also comes with a DataFrame structure that somewhat reminds you of the Pandas DataFrame.
Here, you can just chain method calls: groupBy, average, as highlighted by the arrow. But this stuff doesn't just run right away. You need a way to package this up into a jar file, which is essentially a package that ships your code along with the required dependencies and so on. One possible way to run this is by using the spark-submit CLI, where you point it at your cluster (a local cluster in the example, but it can be remote) and you pass along your jar file as well. This creates a job and creates a DAG, which is essentially deferred for execution at a later stage, when the cluster is able to process it. You can really see it's quite different from the local kind of setup. How about Hadoop? Where does it fit in all this? I'd say it's really hard to cage an elephant. In fact, as we said, a Hadoop example doesn't even fit on this slide, but let me show you something later. You can't really see the two arrows here, but we have data scale on the X axis and flexibility on the Y axis, and there are other possible axes too. The idea here is that there are different ways to look at these landscape analyses. As we said, it's really important to look at production readiness, for example; in that case, you'd be looking at the right-hand side of the screen. But if we look at the trade-off between the flexibility a tool gives you and the data set scale, we see that as you grow in the capability of managing large data sets, your flexibility decreases and your complexity increases. That's definitely a hurdle for data practitioners. Talking about hurdles, it's kind of intuitive: simple tools have lots of pros. They're very easy to set up and maintain, easy to use, maybe friendly to your usual tech stack or language, and easy to debug; and here, for instance, when you need to debug a Hadoop job or a PySpark job, that can be very, very painful. You also want to consider community support.
Some frameworks are doing very well in pushing that. However, when it comes to shareability, the capability of sharing pipelines and artifacts and being declarative in some way, that's where most frameworks are struggling. Another example we mentioned before is the headache in moving from a local setup to a cluster with the big data tools. Here's an example. You have pandas, which runs locally with very concise syntax, but then you have Hadoop on the right, and this is a MapReduce job that runs on a large cluster. The problem is, that's not even the whole thing, because there's much more behind it. First of all, you're going to need to connect, and test that connection, to your distributed storage (there should be arrows pointing at those rectangles). When you process a large data set, there's always that single data point that's going to fail your execution, so you want to introduce exception-handling code, and maybe add schema validation. At some point, you need to distribute that: you need to allow your cluster nodes to take your application and run it. You may want to package it in a Docker container, maybe push it to a Docker registry, distribute that across your nodes, figure out notifications, and so on. Finally, you're able to do your Hadoop submit or spark-submit run. You'll have to configure resources such as CPU and so on, launch your job (again, installing CLIs and so on), and then monitor it. If that fails, you'll have to go back to configuration, or maybe even to coding, and retry. It can be very painful. And the thing is, it's not a one-man show: you'll probably need a whole team next to you. Well, with that, I'll give it back to Phil. Thank you. It must be this chair, it's like a static chair, my hair's sticking up on end. Just coming back to some of these ideas, I'll try to sum all of this up.
So firstly, in order to do data engineering and data science at scale, you don't do it yourself; you do it as part of a really big team. It takes a lot of people to make this work. I think that's one big takeaway. Another big takeaway is that there is no debuggability in any of these high-level orchestrators. If you want to test your data engineering code, you just try it, and you wait until it fails. When it fails, you then go through the horrible Java stack traces that are 600 lines long, and you figure out which little thing caused it to fail. Stack traces if you're lucky; most of the time, it fails silently, or routes your output to somewhere you have no idea about. That's right, yeah. Even in Spark, which is somewhat better in terms of usability and stuff: Spark excels in situations where it's doing analytics-like workloads, and they have put a lot of effort into trying to optimise the performance of SQL-like jobs. One of the ways in which they do that is that they dynamically control the scale-out, the fan-out, of the job. That's all the code that we had before. This code here actually defines the DAG. This defines the pipeline that is going to run in Spark, but you as a person don't really know what that DAG looks like, because Spark doesn't know at this point. It doesn't know until you submit it. Then it figures out how it's going to run it, and that makes memory management incredibly difficult, because any one of those steps could be a little bit bigger than 64 gigabytes, or however big your node is, and it'll fail. So it's really hard to do memory management in Spark. It's really hard to figure out what your pipeline and DAG is. All of these things sort of sound like little nits, but they add up to being really painful. Why is my Spark cluster spending 80% of its time shuffling data between Spark nodes? Yeah, exactly. In fact, operating Spark clusters is a job now in big companies.
There are people at Uber whose entire job is to optimise the Spark cluster. So I wanted to come back to this slide again and try to place some of these tools into the lifecycle. As you can see, when you're doing your POC, everybody uses Pandas. If you wanted to scale that job, then you can use something like Dask, which is really nice to use, and it does allow you to scale, but it doesn't provide the production-level quality, declarativity, monitoring, all of the things that you need for intensive jobs. Finally, on the right-hand side, you get the schedulers of the world. They're okay, but they're not great. I think that this is really the landscape that we're competing with when we're talking about Bacalhau. These are the things that we want to be comparing ourselves to. So going back to what we need from Bacalhau as an ML engineer, we go back to this picture again. This is an arbitrary picture; speak to another data engineer and they'd pick another set of criteria, but I thought this was a pretty good representation of the day-to-day job. What I'm proposing is that we could possibly either build Bacalhau to solve some of these problems, or build tools on top of it to solve some of these problems. So, for example, on the data access level, companies need to have access controls in place to be able to control the use of data: for legal reasons, for regulatory reasons, for business reasons as well. They need that data to be exposed in a catalogue. It needs to be searchable; it needs to be observable for the people that need to see it. The data needs to work locally. When it comes to pipelines, in my opinion, I quite like specifying the DAG myself, specifying the pipeline myself, as opposed to letting the tool optimise it for me. It's only because, in my head, I've got a mental map of what the DAG is doing, and that helps me debug.
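As a sketch of what "specifying the DAG myself" means, here's a tiny pipeline declared as an explicit DAG; the step names and functions are invented for illustration, not taken from any real framework:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each step reads from and writes to a shared context dict.
def load(ctx):
    ctx["raw"] = [3.0, 4.0, None, 5.0]

def clean(ctx):
    ctx["clean"] = [x for x in ctx["raw"] if x is not None]

def aggregate(ctx):
    ctx["mean"] = sum(ctx["clean"]) / len(ctx["clean"])

# The pipeline, declared up front: step -> the steps it depends on.
DAG = {"load": set(), "clean": {"load"}, "aggregate": {"clean"}}
STEPS = {"load": load, "clean": clean, "aggregate": aggregate}

def run(dag, steps):
    # Because the DAG is explicit, the execution order is known (and
    # inspectable) before anything runs, unlike a lazily built Spark plan.
    ctx = {}
    for name in TopologicalSorter(dag).static_order():
        steps[name](ctx)
    return ctx
```

The point is that you can print the topological order, and reason about memory per step, before submitting anything anywhere.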
You could argue that if you provide a better way of doing debugging, then maybe you can allow the framework to control the DAG a bit more. Schema: I don't know how you'd handle schema. We could talk at length about how we could implement a lot of this. I think basically what I'm saying is that we need to present Bacalhau in this way in order to get data scientists on board. We need to be familiar; we need to talk in familiar terms. We can't use the Web3 terms. You can't use words like DHT, which I didn't know until about three hours ago, because I didn't know what that acronym meant. We need to talk in terms of data catalogues, not DHTs, and so on and so forth. There's going to be some sort of interface that we're going to have to cross in order to talk in their language. Other things have popped into my head over the day that I find really interesting. We've been talking about verification a lot this morning, in the sense that we're trying to verify and prove that nodes did the work that they said they did. But because ML and pipelining and data are so uncertain in the first place, most of the time I'm not going to get attacked. It's not going to be some guy pretending to do work that causes my things to error; it's going to be my own fault when things error. The data is crap and I haven't checked it and I haven't cleaned it properly, or my ML model diverges and doesn't actually learn. So I started to think about whether we could actually use the verification interface as part of the testing and quality element of the lifecycle, and the monitoring as well. So in that little table that you showed, you'd have a little cross when my model diverged and I ended up with a complete nonsense model. Things like that. Then we've got governance.
A big thing that I'm a big advocate of is lineage and provenance in data science, and the idea there is that it's really important to know how your models were built and what they were built from. You need to know that this particular artifact was trained on that data. We've been doing some work with Ofcom, a UK regulator that's about to start regulating social media platforms to prevent online harm. Part of their remit is that they want the platforms to be able to demonstrate that whenever they serve some content to some users, people don't come to harm because of that content. If a regulator comes into that business, they're going to want to know what model was running at that time and what data that model was trained upon. The vast majority of businesses that we've spoken to, even really big ones, can't do that, because they don't know how to do it or they haven't got anything in place. Anyway, I think there's a big gap in the lineage market that we could also tackle: basically chaining data together to verifiably prove that this data set originated from that other data set. So that's a feature map; that's what you need to try and provide on a feature basis. There are then other dimensions that sit on top of that, in terms of how well you do it. The first one is familiarity and ease of use; it's the UX. We've been talking about that a lot already. Flexibility: this is a big one. Spark and Hadoop are very locked down in terms of what you can use and what you can do. Keeping it as flexible as possible, to allow the data scientists and the data engineers to use whatever libraries they want to use, would be great. Scalability: when people talk about production-ready, they're usually really saying scalability underneath. Declarativity: that's the lineage and provenance and declaring your pipelines up front. And production readiness, of course.
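As a toy illustration of chaining data sets together verifiably, in the spirit of content addressing: each derived data set records the hash of the data set it came from. This is just the idea, not how Bacalhau or any particular system implements it:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Identify a data set by the hash of its contents."""
    return hashlib.sha256(data).hexdigest()

def derive(data: bytes, parent):
    """Record a data set along with the hash of its parent data set."""
    return {
        "hash": content_hash(data),
        "parent": parent["hash"] if parent else None,
    }

def verify_chain(records):
    """Check that each record's declared parent matches the previous hash."""
    return all(
        rec["parent"] == prev["hash"]
        for prev, rec in zip(records, records[1:])
    )
```

So a regulator asking "what was this model trained on?" becomes a chain walk, and any tampering with a recorded parent breaks verification.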
That's it from a data engineering point of view, but I did want to touch on the ML side as well. Everyone talks about training ML as this big, single-step thing. People have been talking about it in the sense that we want to distribute our ML training: we just want to submit this one job that scales over massive numbers of nodes, for performance reasons or something like that. But we've been talking to more and more people that have to have local models in locations and localities, specific models for specific localities, because of different languages, different cultures, different things that are deemed harmful, and things like that. So some of the larger ML companies and tech companies out there aren't just running one model for one thing; they're running hundreds of models for one thing, and all of those models are completely different. So there's an idea here that locality is important, and where the data resides is important, because that specific model for that specific locality is based upon that specific data. The data doesn't need to go to a central location to do that training, because generally, in these situations, the model architecture and the model definition are the same; it's just the data that's changing. So I can imagine a situation, a world, where you have training that is distributed in the sense of many models, all trained on different data, all the way down to the local level. You've got data in your home and you're training models for your specific needs in your own home. That's certainly not possible at the moment, so bear that in mind. And there's a corollary there with business processes as well. When we talk about business processes, they always follow this kind of flow; this one happens to be the brewing process, but it's just another process, and businesses have processes like this everywhere.
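The many-models point can be sketched in a few lines: one shared model definition, trained independently on each locality's own data. The localities and numbers here are made up, and the "model" is deliberately trivial:

```python
# One shared "architecture": a trivial model that predicts the local average.
def train(values):
    mean = sum(values) / len(values)
    return lambda: mean

# Data that never leaves its locality; only the model definition is shared.
local_data = {
    "oslo": [300, 320, 310],
    "lisbon": [150, 170, 160],
}

# Train one model per locality, where the data lives.
models = {place: train(values) for place, values in local_data.items()}
```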
And we're working with a client at the moment who is trying to fully automate paper manufacturing. They're trying to automate going from wood pulp to bleached white paper, and it looks very much like this: they've got lots of tanks, they've got lots of steps, and they have lots of models all powering each of those tanks. But again, the data is local to that one specific phase. The data is local to that one silo within the business. It doesn't exist anywhere else, and it doesn't need to exist anywhere else. So if we can have distributed training for distributed data, that's really cool. And just to drive that home: I'm a big home brewer, and this is my cooling stage. I've just got a copper pipe that cools it down. That's what home brewing is. And this is my bottling stage here; this is my packaging stage. So that's what it looks like in real life. Because it is home brew, any of those bottles can explode at any time, so it's a bit dangerous to open that safety door. OK. So, yeah, the future: we've talked about IPFS, IP... What did you call it, Luke? CS? IPCS. I'm going for IPAI. And it was totally accidental that the IPA and the beer thing came together as well, but I'm going to take it. All right. That's it. Thank you very much.