All right, so hi everyone. I'm very happy to be here today speaking about a topic that's very important to me. I've been working a while in the industry with data, and I would also like to say hi to my girlfriend, who's watching the live stream and couldn't join today.
With that out of the way, I would first like to start with some disclaimers. I'm not affiliated with EuroPython, the Python Software Foundation, the Rust Foundation, or any of that. I'm not a Rust expert, so this is purely my opinion and what I learned over the year that I've been working with Rust. And this is not about bashing Python. It's about finding some shortcomings that Python has and how we can complement them with Rust. Another thing: it's half an hour, so this is a very light and condensed overview. If we don't have time for questions, we can talk over any questions you have right after. Opinions are my own, and there will be a lot of memes. Hopefully you'll find them funny; I found them funny, at least.
All right, who am I? I'm Karim. I'm French-Tunisian, living in Germany right now, in Ingolstadt. It's a city north of Munich. Not many people know it, but it's the headquarters of Audi. I've been writing code since 2012, mostly Python, and I'm still enjoying it so far, so all good. I focus on data engineering and operations: making data available in the different companies I've been working at. I worked in consulting and ad tech for three years, in automotive, mainly cars, for four and a half years, and, for the past one and a half years,
I've been working at a blockchain company. My role at this company is lead data engineer. The company is called Parity Technologies; the link will be available in the slides afterwards if you want to look at it. There, I manage and lead a growing team of data experts, essentially people who work with data. Our main goal within the company is data warehousing for on-chain and off-chain data. On-chain means data about what is happening on the blockchains themselves: getting all of that and seeing what's going on there. Off-chain means data around it: activity, blogs, Twitter, and whatnot. We make all of that available for the people who would like to base decisions on that data. I'm also the author of a website, guide, slash book called Data with Rust, which is a free guide on how to do what I will show you today.
A short introduction of who this is for. I tailored it to people who know Python but have heard about Rust and would like to know a little bit more, but also to people who have slow and messy Python-based data pipelines and would like to improve them, hopefully using Rust, and maybe to people who are just curious and want to try out something new. This is not for people who are absolute experts and know what they're doing. And obviously this is my cat Boo Boo, who was helping me do the presentation, but he wasn't that much help.
Joke aside, the agenda. I'll start with a short overview: what is data engineering, for those who don't know, and why even bother with Rust, right?
How it is an alternative, in my eyes, to Python's approach of bolting on different tools. The benefits of Rust in data engineering: the pros and cons, some perspectives of mine, and how it also simplifies the downstream operations of the applications that you build. Also, just like the previous speaker showed beautifully, I'll briefly show how to interface Rust with Python and how that can be beneficial. And then, of course, some guidelines and hints on how to get started with Rust for data Pythonistas. I would consider this talk a success if afterwards you just try it out: you open up Rust, you try to figure it out, and you build some data thing with it. That's the one expectation I'd be happy with.
Okay, Python data engineering. What is data engineering? Here I'll show everything at once so you can read. Generally, data engineering is moving data from one place to another, processing it in between, and making it available for different use cases, which are not themselves the focus of data engineering. I would say it's mainly for answering questions. How many users did we have last year compared to this year? What was the context of last year's activity? It's about making that data and these insights available for anyone who wants them. The main interfaces are business analysts and the business itself, so people who want to know which decisions to drive and which data to base them on.
So a data engineer generally writes data pipelines, meaning code that does something with data. Why is it needed? Usually the data that we generate on social networks, on systems, on blockchains doesn't fit on one machine; it fits on multiple machines. And it's not easy to manage, at the same time, the networking, moving the data, the processing, and making sure the thing finishes processing before the heat death of the universe. So we need to make sure that whatever we're doing happens during our lifetime.
Anyway, accessing data is more and more relevant. I don't know if you saw lately: Reddit, Twitter, everyone's closing down their APIs, because they provide data that is very valuable to do things with. Data is good business. Essentially, data also only has value in its package. There's a lot of things happening, but not all data needs to be stored and processed. Either way, what I think is very useful, and why data engineering is something I really enjoy, is that when you gather data and interact with it, this is where you can, I don't know, get an edge, either as a business or as a person or in whatever it is you do. Essentially, dealing with the data is how you get GPT: they write the algorithms once, but then it's a matter of tweaking the data to get the best possible algorithms.
Lastly, what I want to say is that this role is at the intersection between DevOps and infrastructure, so we do a lot of Kubernetes, a lot of Docker, a lot of infra stuff, and also the business. They come to us: "Hey, we need these insights."
We have to figure out the way back to the data.
For data engineering right now, Python is the absolute king. Whatever you look for in data engineering, you will find something in Python that answers it, say, converting some JSON to another data format. Working with Python is, I would say, ubiquitous: you can find anything there. There are many tutorials, there are all the libraries that you could wish for, and they're mostly up to date and maintained, so you won't have any trouble there. It does, however, require a lot of tooling, a lot of tools that you add on to Python to make sure the whole system works properly. And this makes it a great business, because people build those tools, provide those tools, and also sell those tools. So it's a beautiful ecosystem, and nothing against it, but right now it's almost the only option, I would say.
The data engineering tools, however, are on the other side. If you want to build a whole system for managing data, you usually use some of the tools on the left: a cloud provider, some mix of, I don't know, Kafka and S3-type storage, or whatever. So you mix and match the technologies on the left. Almost all of them have one thing in common, though, as you might have guessed: they're not written in Python. Everything that does the heavy lifting is not being done in Python. That has its reasons, which I'm pretty sure you already know. But just to bring the point home, at the bottom right is a usual error that you would get when you work with Spark, for example. You write your Python code in Spark, and you get a Java exception. What are you doing now?
Right, you're stuck. And this type of bolting on different tools on the infrastructure side introduces errors that are impossible to debug. There you're at the mercy of your infrastructure team, who will or will not help you fix it, and meanwhile you cannot do anything. Also, let's say you write something purely in Python and you want to improve the performance. There are some limits that you will hit if you just focus on writing Python. This is why there are tools like Cython and others that help squeeze out some of the performance while staying in Python. But if you use, for example, Spark, you would be stuck with no options if you want to improve something or rewrite it in Python.
So despite its flaws, it's an excellent choice: the available libraries, as I said, the growing ecosystem, events like today. There are so many good things going for Python. It gets the job done, and for, I'd say, 20% of the effort you get 80% of the stuff. If something is slow, maybe adding more machines will make it fast. These are the usual trade-offs that we can still make. And I would say that's okay if the guardrails that you have in your infrastructure and in your system are okay: as I said, offloading the issues to the infra team and letting them figure it out.
So why even bother with Rust? I will just tell you what Rust is. This is really not an introduction to Rust itself; I'll mention the things that are relevant to a data engineer. Rust is a systems programming language. I like to say it's a language to program systems, right? And it's useful for data pipelines, so writing code that transforms and moves data. Building the data tools and services themselves: you can do that too.
There are many of those currently available, like Ballista and other tools, which are written in Rust but have interfaces to Python, which is different from something written in Java. It's also useful for developing data libraries, as was shown earlier: you can write something in Rust, interface it with Python, and it works great for that one use case. And of course everything in between, so you can mix and match and use Rust across the whole data engineering life cycle. It has many interesting features, mainly memory safety. It's extremely performant, supports concurrency, is statically typed, and has an excellent compiler that helps you write code and avoid some types of mistakes for which you would have to write tests in Python. So it can be used to build both the tools and the processing.
And also, of course, it's popular. Everyone's speaking about Rust, and I think if this room is full right now, it's maybe because you read somewhere that Rust is the most admired language. Some people say we should write everything in Rust, and some people even think that AGI, artificial general intelligence, will be written in Rust. So at first glance it's good. When developers talk about it, in all the talks that you will find online, they usually say good things about it: it saves costs, it simplifies operations, it brings a huge performance boost. And, from my perspective, it forces you to think about data first, instead of what's an object and these other kinds of things. And there's a running gag in the community that anything written in Rust is blazingly fast.
So that's, I think, a valid enough point. To explain what I'm saying, I took a small Python data engineering project, set up the classic way, the way I've seen it being set up. We'll have a look at how it's set up in Python, what it looks like in Rust, and where the trade-offs and benefits are in each.
Usually when you start a Python project, this is what it looks like. You need to set up the environment. People need to figure out how to install Python, install pip, install all of the other stuff, configure their Python, configure tox, configure linting, configure how they want to manage their projects. There's a lot going on before you even start working with data. This is what I mean when I say Python's bolt-on approach. You add Pylint, you add some typing library or use the built-in typing within Python, but still some errors slip through the cracks and end up in production.
So, the Python project. I took a few lines; I will not go through a long example, since, as I said, it's half an hour and we have about 17 minutes left. You have this function here, and everything works. What it does is count the number of times that "o" is seen in "hello world". You build it, deploy it, pack it in a Docker container, get it into your CI/CD. It's running, everybody gets their data, and they're happy about it. This is, I would say, the state of the infra at the time the Python code runs. It ran locally, but only at the end do you actually see its real output in, I would say, the real world. After that, a new developer joins the team, adds a few lines of code, and it goes unnoticed, right?
You can take this code and try to run it at home with some typing libraries added. Some of them catch it, some of them don't, and I think that is the big problem. When you try to convert some dataset from one type to another and these things slip through the cracks, the insights that come out at the end are wrong. Again, this is a very small example. Imagine this is, I don't know, 50,000 lines of code, and it's not very obvious where the problem is. If you didn't write a test for it, or if the typing library didn't have a check for this kind of case, you will see it in production, and you'll have a bad day. Now, as I said, the example is not complicated. It just focuses on typing, one simple aspect, which some may say should be solved in 2023. And the only time you see it is at runtime. I added two Twitter quotes from experts in the business that say that even with typing added in, it looks bad, it doesn't work, and why are we even doing it?
Next, the same project in Rust. I'm simplifying a lot, but there is no room for interpretation: if you want to start a project with Rust, you use Cargo, which is the package manager for Rust. It handles all the dependencies and initializes your project, and then you add the dependencies that you want to use in the project to something called Cargo.toml, a manifest. It's the requirements.txt of the Rust world. Cargo comes with a lot of functions and a lot of command-line flags that help you during the preparation of the project and while writing the code. It also runs the compiler at the end, where you see the errors or mistakes that get caught by Rust, and it builds the project. So essentially it's a Swiss Army knife for everything that you would do with Rust. So here, when you start out, you don't need to add any of the libraries that I mentioned previously, right?
The same code in Rust. I added the Python code below it for reference. It almost reads like Python, although it's absolutely different: there's a filter, there are chars. Why is a char different from a string, and these kinds of things? So there are some things specifically that you need to watch out for, but in the end, with a little bit of training, you can start reading the code almost like you read Python. There are type annotations in both, just to be sure that things look the same, though they might have different names.
Now this new developer joins the project and adds the same line of code as before. In some random place it calls the function with a different kind of variable, an integer instead of a string. And the compiler, on your local desktop, not at runtime, not in production, tells you: "Hey, you made a mistake. Here's the mistake. Please fix it. Otherwise we're not building this stuff." This is how you catch some of the errors without even writing a single test. It's immediately available, and it's immediately obvious that something is wrong and needs to be fixed before we can continue. This is extremely valuable with data, because once you transform the data and store it somewhere and somebody builds something on top of that data, all of their stuff is wrong because of some errors that could have been avoided by, for example, using Rust. Again, this is not to bash Python or the typing libraries built for Python. Rust just has a different approach: this is embedded in the language itself. So the same project in Rust, as I said, is focusing just on the typing features.
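The slide code is not part of this transcript, so the following is only a minimal sketch of the kind of function described above, not the speaker's actual code. The name `count_char` and its signature are assumptions made for illustration. It shows the `chars` and `filter` calls mentioned in the talk, and the commented-out line shows the sort of call (an integer where a string is expected) that the compiler would refuse to build.

```rust
/// Counts how many times `target` occurs in `text`.
/// `count_char` is a hypothetical name; the original slide is not in the transcript.
fn count_char(text: &str, target: char) -> usize {
    // `chars()` walks the string one `char` at a time; a `char` is a single
    // Unicode scalar value, while a `&str` is a whole UTF-8 string slice.
    text.chars().filter(|c| *c == target).count()
}

// The mistake from the talk, passing an integer instead of a string,
// never reaches production because it is rejected at compile time
// with a mismatched-types error:
//
//     let n = count_char(12345, 'o');
```

Calling `count_char("hello world", 'o')` counts the two occurrences of "o", which matches the example described in the talk.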
It's already bringing a benefit. The same thing that was shown previously works with complex structs and enums, so you can be sure that if you define your data in Rust code, it will be correct, and any operation you do on it will, in the end, be kind of correct. It takes a lot more work and effort to fool the compiler. The tooling is great, so if you run this you will not get a Java stack trace. Maybe you get another type of error, but then it's pretty clear where that error happened. There's less need for additional tools, and the infrastructure can be simplified: Rust builds binaries or libraries that you can attach directly to the services that run in production, eliminating the need for things like Docker, for example. You just run the binary on whatever system you have.
To visualize what Rust has to offer, I made a small graph. From left to right: on the left, easy to do the wrong thing, and on the right, hard to do the wrong thing. I hope it's clear that these two are opposites, but if there are any questions I can take them afterwards. I thought this was the best way to represent it. And top to bottom: at the top, easy to do the right thing, so you write the code that you actually want, and at the bottom, hard to do the right thing, so you need to do a lot of things to get what you'd like. I would put Python at the top left. With Python it's very easy to do the right thing: you write the five lines of code and you're done for the day, or you copy it from somewhere and it works immediately. It's easy to do the wrong thing as well, because when you write the code Python won't tell you, "Hey, it's wrong." You just have to run it, and then you'll figure out that it's wrong. Rust I would put at the opposite.
It's hard to do the wrong thing, usually because the compiler catches some types of errors that usually only show up in a running system. It's less easy than in Python to do the right thing, because you need to write a lot of code, you need to consider the types, you need to fight the compiler, you need to do a lot of different things. And just as a small joke, I added JavaScript and COBOL. Of course, interpretations can vary, but that's what I think of them today. Maybe in the past it would have been different.
So, the benefits of Rust for data engineering. I think we have roughly 10 minutes left, so I tried to summarize it in this table and in the next pros-and-cons table. Rust, at compile time, will tell you, "No, I don't take this code." Python might catch these errors ahead of time if you set up your typing, your tests, and everything else correctly, but otherwise you will see the problem at runtime. With JavaScript you're left to guess and need to figure it out yourself.
So, the pros. The compiler catches errors early, giving you consistent data and consistent workflows on top of that data. It's performant out of the box: even without using the advanced systems features of Rust, like in the previous talk, you can get a huge performance boost. It has data and typing guarantees built in, so no guesswork, none: you just hover over the thing and you know exactly what the data type is. It has excellent tooling; Cargo covers most of your needs, I'd say, in the first few years of starting out. And Rust knowledge is portable: if you know how to write Rust, you'd know how to build your own browser or build something for embedded devices and whatnot, so there is a benefit in learning it. It gives you low-level control over the system resources, so to improve or tune some of the data jobs that you have, you can use it as well, right?
You don't need to use something else.
The cons, of course. Simple things take a lot more time, so I wouldn't recommend it for tinkering. If you're trying to figure stuff out, figure it out in Python; if you want to make it run reliably, then move over to Rust. It's easy to over-engineer and make things absolutely unreadable. There are a lot of repositories out there that you cannot parse, you cannot read. There are types everywhere, and it's complicated to understand what's happening; even simple operations are hidden behind different layers of types. So it's easy to overdo. It has long compile times for big projects: once you have 10 developers and a huge codebase, it might be minutes, it might be hours, depending on what you're doing. And it's not immune to logical errors, obviously. If you write a logical error but it's consistent with what the compiler thinks should happen, it will just run. But a common theme is that every bug that ever happened passed the tests, the typing, and all the checks, so there's not much more we can do there.
So, the trade-offs. If you favor development speed, go with Python; if you want low maintenance, go with Rust. Last year, when I first joined a new company, we wrote a system in Rust. It's still running to this day and has had zero errors. Nothing has gone wrong with it, and it has run, I think, five or six times a day since then. It's not much, but still, some of the Python code that I wrote was less robust. If you favor performance, go with Rust. If you want a smaller learning curve when starting out, just try with Python, try to understand what writing code is, and then go to Rust. If you want a good developer experience, DX, developer UX, go with Rust: it has excellent tools.
You don't need to figure out how to install virtualenv, or whether you need to install per project or whatever, so it's very straightforward. And I believe the best of both worlds is using Rust with Python, better than, I don't know, using Scala with Python or C++ with Python, because of the readability, the tools, and the simplicity. Of course, which to choose will depend on what you would like to achieve.
I'm closing in on the end of my talk, and I will show you a quick example of interfacing Rust with Python. I shortened it, because you might have seen it already, but just for the recording I'll go over it and explain how it would look for a simple data engineering task. What we want to do is call Rust from Python. Python is used as the interface, essentially as the keyboard, and Rust is the computer running all the stuff. One of these libraries for data engineering is Polars, essentially replacing pandas with a Rust-based implementation. Just by making this switch in your code, replacing pandas with Polars, you already get a huge performance boost on whatever you would have run in pandas. Let's see how you can do that yourself.
The simple example: we have one CSV with one million rows. What we want to do is, for each row, take the data, take each digit of the data, compute its factorial, sum those up, and put the sum in the corresponding row of the result. So it's a CPU-intensive operation, not an I/O-bound kind of thing, where we can clearly see the benefit of using Rust. And this is what I would recommend using Rust for: if you have something CPU-bound that needs to run multiple times, you will get improvements with Rust. And just for the record, a factorial is when you multiply all the numbers up to the current number. All right, so how would you do that in Python?
You would write something like the thing in red at the top: the sum of the factorials of all of the digits. Then you add that up, write your CSV, and call it a day. It runs in 3.1 seconds, and that's just the running time of this thing. It's pretty straightforward: 17 lines, you've done your job, and it's all working pretty well, I would say. Now, the thing is, you need to run this a million times a day. How many seconds would that take? How many services, how many machines do you need to provision to make this happen? That's when, I would say, adding Rust starts to make sense.
If you want to do it with Rust, as was shown before, with maturin and PyO3, this is how the files would look. You set up a Cargo.toml with the dependencies of your project, and you write this src/lib.rs file. I will zoom in on it a little bit. In green is the function that runs the operation, the one you would call from Python. At the top here, where my mouse is, there's a decorator that tells maturin and Rust that this is a function called process CSV that I would like to call from Python later on. In red is the actual operation, what we're running sequentially in Rust. And at the bottom, in blue, is the module definition, called rusty pi, where I add this function process CSV and expose it to Python. If you run this code, you see that it takes two seconds. And this would be, I would say, the API, the interface that you can provide to your developers or the people who would use your library.
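The src/lib.rs from the slides is not in the transcript, so here is a rough sketch of only the part described as "in red", the sequential operation itself, using nothing but the standard library. The function names are hypothetical, the input is assumed to be one value per line, and the file reading, the PyO3 function wrapper, and the module definition are deliberately left out rather than guessed at.

```rust
/// Factorial of a single decimal digit (0 to 9). 9! is 362,880, so `u64` is ample.
/// An empty range multiplies to 1, which gives the correct value for 0!.
fn digit_factorial(digit: u32) -> u64 {
    (1..=u64::from(digit)).product()
}

/// Sum of the factorials of every decimal digit in `value`.
/// Characters that are not decimal digits are skipped (an assumption of this sketch).
fn sum_digit_factorials(value: &str) -> u64 {
    value
        .chars()
        .filter_map(|c| c.to_digit(10))
        .map(digit_factorial)
        .sum()
}

/// Processes CSV-like text with one value per line and returns
/// "value,result" lines. The single-column layout is assumed for illustration.
fn process_lines(input: &str) -> String {
    input
        .lines()
        .map(|line| format!("{},{}", line.trim(), sum_digit_factorials(line)))
        .collect::<Vec<_>>()
        .join("\n")
}
```

For the value `145`, the digits give 1! + 4! + 5! = 1 + 24 + 120 = 145. In a setup like the one described in the talk, a function of this shape would sit behind the Python-facing wrapper, so that all rows are processed in Rust and only the input and output locations cross the language boundary.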
They just need to pass the folder and where they want to store the result. All the heavy lifting is done in Rust, and there's no shuffling of data from Rust to Python and Python to Rust. The operation happens in Rust, and just by doing that we get a huge performance gain. Just for reference: three seconds versus two seconds. Do that one million times a day and you see that there's a huge benefit you can gain from that. The benefit is extremely obvious when whatever you're doing either costs money or makes money, because this is the difference between, I don't know, extending your runway by two years or going bankrupt in a month from your AWS bill.
Either way, to recap: Python plus Rust clearly works better than Python on its own. There's a little bit more code and a little bit more care needed, and Rust gives you the guarantees without any additional tools. Right now we added PyO3, but still, we can make that work. It's possible, as I said, to add Rust data types into the mix, structs and enums, so more complex operations than what I just showed, and you can have some sort of executable data: you have your definitions of the data, you can see what happens, and you can predict what the results will be without running it, just by reading the code. It also, as I said, saves a lot of money.
So, getting started with Rust. Right now the status is that there are many missing tools and libraries in Rust, so there's a huge opportunity to write some and establish your name within the community. The online guides are too technical, so there's a benefit in making that a little bit simpler, right?
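To make the recap's point about structs and enums a little more concrete, here is a small sketch of data defined directly in Rust types. The `Source` and `Event` types and their fields are invented for illustration and do not come from the talk; the on-chain and off-chain labels only echo the speaker's earlier description of his work.

```rust
/// Where an event was observed (hypothetical categories for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    OnChain,
    OffChain,
}

/// One record in a pipeline. A missing amount has to be spelled out as
/// `None`, so it cannot be confused with zero or silently ignored.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    source: Source,
    amount: Option<f64>,
}

/// Sums the known amounts for one source, skipping records without an amount.
fn total_amount(events: &[Event], source: Source) -> f64 {
    events
        .iter()
        .filter(|e| e.source == source)
        .filter_map(|e| e.amount)
        .sum()
}

/// The `match` must cover every variant: adding a new `Source` later
/// makes this function fail to compile until the new case is handled.
fn label(source: Source) -> &'static str {
    match source {
        Source::OnChain => "on-chain",
        Source::OffChain => "off-chain",
    }
}
```

Reading the type definitions alone already tells you which values a record can hold and which cases every consumer has to handle, which is one way to understand the "predict the results just by reading the code" remark.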
For a light start with Rust, if you want to start and you don't know much about it, focus on the things on the left: statements, types, functions, structs, enums, modules. Don't spend too much time trying to understand the Rust internals. After a year, I still don't understand 100% how they work, but I know how to use it, and by focusing just on that I think you will get the most benefit out of it. If you would like a light start with Rust for Python, I wrote this website where you can get started from zero. As I said, it's free, it's available, and it's task-focused, so exactly the task that I showed is explained in a little bit more depth. I will finish this talk by thanking you for your attention and adding two links to resources: the website, or my own website, to learn more. Thank you very much.
Yeah, thank you very much, Karim, for inspiring us to get into Rust. That was very good for us. And I hope that you're going to post the slides in the Discord channel, so that if something went too fast, people can revisit them or maybe ask you other questions later. We do not have time for Q&A, so we just have time to ask for another round of applause for Karim. Thank you very much.
Thank you.