So Fabian is going to be talking about building reproducible distributed applications at scale. All yours, good luck.

Okay, thanks. Thanks for having me at EuroPython. I'm going to talk about reproducible distributed applications at scale. A few words about myself: I'm a software engineer at Criteo in Paris, working on the machine learning platform. You can find me on Twitter and GitHub.

What is Criteo? Criteo is a major advertising company with offices in France and the US. We have about 600 engineers, and about 100 of them work on machine learning. We don't deploy our tools in the cloud; we have our own on-premise Hadoop cluster with 3,000 nodes, and every day we process petabytes of data. Not in one job, but by the end of the day we have processed petabytes of data. So scale is something we need.

These are the tools we deploy on our machine learning platform. We are big users of Apache Spark: we use it for analytics and data processing, and we have our own ML models built on top of Spark. We use TensorFlow for deep learning, scikit-learn for some smaller models, Jupyter and JupyterHub for experimenting, pandas and matplotlib for visualization and displaying metrics, and MLflow for the whole workflow. We're also contributors to MLflow and have made quite a lot of contributions.

In this talk I will mostly be speaking about how to run a PySpark job on a cluster, because we are big users of Spark, but you will see at the end that basically all those ideas also apply if you want to run TensorFlow, Dask, or other tools; it's the same idea.

Running on a cluster is particular because you're not in your local environment. Generally you have some kind of distributed storage and some compute nodes.

Let's take an example. What I will try to do is just execute the PySpark example from the documentation, using a pandas UDF, also called a vectorized UDF. What is this? Historically Spark has been implemented in Scala, and the Python layer was added on top. So when you want to execute a custom Python function, your execution flow is in Scala, on the JVM, which means you have to serialize your data into Python and then serialize it back when you get the output. Historically this was quite slow. So they added pandas, or vectorized, UDFs, which means you can work directly on vectors, on pandas Series, which is much more efficient. They also used Apache Arrow, an efficient in-memory data format, to pass data between Scala and Python, so it's now much more efficient and works quite nicely.

What I will do here is write an example. The example itself is not so important; it's more about having a real use case. I use a grouped aggregate: I create a DataFrame with two columns, and then for each group I call my custom Python function, mean_fn, which just computes the mean. I can execute this on my local machine with the Spark shell, and I get my pandas DataFrame: for group one the mean is 1.5, and for group two it's 6.
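For reference, the grouped aggregate from the PySpark documentation looks roughly like this, using the Spark 2.4-era pandas UDF API (values and column names follow the docs example):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").getOrCreate()

# DataFrame with two columns: a group id and a value, as in the docs example.
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# Grouped aggregate pandas UDF: the custom Python function receives a
# pandas Series per group, exchanged efficiently through Apache Arrow.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_fn(v):
    return v.mean()

# Mean per group: 1.5 for id=1, 6.0 for id=2.
df.groupby("id").agg(mean_fn(df["v"])).toPandas()
```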
What happens now if I want to execute this on the cluster? In my case, our cluster is already set up. Apache YARN is the scheduler of Hadoop, so if I just set the master flag to yarn, the job gets deployed to our cluster. And if I execute it now, what I get is something like this: it won't find your pyarrow installation, because, as I said before, pyarrow is used to make the vectorized UDF work.

Why doesn't it work? Because locally you have installed all the dependencies — pyspark, pyarrow, pandas — but on the cluster you only get the plain Python interpreter, and when Spark tries to execute your task it won't find them. So the example taken straight from the docs won't work. What you could do now is install pyarrow and pyspark on each compute node, and when you run it again, it will find them.

Some words about spark-submit. The standard way to deploy your jobs is by using spark-submit, which is a bit more complicated than the default MPI use case. Basically, you call spark-submit with your entry point; your entry point gets a Spark session and defines your transformations and actions. You give this to spark-submit, spark-submit takes care of putting it on your distributed storage, instantiates your execution graph, creates your executors and your tasks, and then your tasks start running in Java. When Spark needs to hook back into the Python interpreter to execute your mean function, it just calls the Python interpreter with a PySpark worker module, hands it the function, and pipes stdin and stdout. The details are not that important; what matters is that you call your script, you define your execution graph, Spark ships it, and then Spark calls the Python interpreter and tries to execute your function. So if you install pyarrow and pyspark on the machines, at system level, it will work out.

What happens now if you want to use another version of Spark? Basically you're stuck, because you have installed, for example, Spark 2.4 and you want to use Spark 2.4.5. You could take your whole cluster down and bring it back up again, and then you want another job with another library. Or you could label your nodes somehow and have dedicated nodes, but that doesn't scale; it doesn't really work. Don't forget also that in reality your Python environment will look something like this: a huge dependency graph. I'm sure every one of you has already encountered this: you have your environment, you try to install something, and you just can't because of conflicting dependencies. So this won't scale up either.
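Before looking at the way out, here is what the cluster run itself looks like once pyarrow and pandas are installed on every node: the only change from the local snippet is the master setting (a minimal sketch; the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Same grouped-aggregate job as before, but submitted to the cluster:
# the only change is pointing the session at the YARN scheduler.
# This only works if pyarrow, pandas, etc. are installed on every node.
spark = (SparkSession.builder
         .master("yarn")
         .appName("grouped-agg-example")
         .getOrCreate())
```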
The way out of it is using Python virtual environments, which are quite nice — I guess everybody here knows them. You just isolate your environment in a separate directory: it contains a link to a Python interpreter, and that interpreter will find everything installed in this directory. So what you can do is install a dedicated virtual environment for each Spark version. There is also an environment variable called PYSPARK_PYTHON: if you set it before the spark-submit, it instructs Spark to use that interpreter. So you can target the Python interpreter of your virtual environment, and when you do spark-submit it will pick the right one. It usually works when you only have a couple of different versions to support.

Don't forget about the lifecycle of all this stuff. If you just install the environments once, you have the problem of how to release them. You could do it as pre- and post-run hooks of your scripts: first you install your virtual environment, then you target it, and at the end you release everything. But it's not so easy to do this with Spark, because everything you do happens at the task level; you generally don't reason about which node actually executes your code.

Also, now that we are in a mode where we install our dependencies all the time, what happens when a new PySpark version is released? A few weeks ago they released Spark 3.0.0. You tested everything with Spark 2.4, a new version comes out, and it won't behave the same way. Here is something I copied from a Spark ticket: you create your executors, all your tasks are running, and the workers use a shared cache which they can access concurrently. In some cases you get race conditions, and then the job just gets stuck — it won't even fail, it just runs forever. It's quite nasty because you don't know when it will happen; you can do some monitoring, but it's quite an ugly bug.

Let's come back to the title: we want reproducible distributed applications at scale. One example from our use case: we are doing machine learning, and one model is trained on several terabytes of data. In general we use gradient descent, and when we execute it on 80 nodes it takes one hour. If we executed it on a single node, it would take several days to train the model, which is not reasonable, because for the models we use — basically logistic regression models — it's quite important to have an up-to-date model. The model should be at most a few hours old; we train it, we put the new model in use, and otherwise our performance degrades quite quickly. Also, we run thousands of jobs, many of them in parallel: people run Spark, TensorFlow, Dask, and different teams run different jobs.
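Coming back for a moment to the per-version virtual environments from above: in code, targeting one of them looks roughly like this (a minimal sketch; the paths are illustrative):

```python
import os

# One dedicated virtual environment per Spark version. PYSPARK_PYTHON tells
# Spark which interpreter to invoke for the Python workers, so set it before
# calling spark-submit or building the session.
os.environ["PYSPARK_PYTHON"] = "/opt/venvs/pyspark-2.4.5/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/venvs/pyspark-2.4.5/bin/python"
```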
One other thing that is quite important for us is reproducibility. We want to build reproducible applications, which matters a lot in machine learning: if you run an experiment and add a new feature, you must be sure that the uplift in a metric is linked to this feature. If it isn't — if it's just random noise — then you are basically working randomly. So what can be sources of non-determinism in machine learning? Quite a lot. For example, depending on how you initialize your layer weights, you can get different results. If you do distributed training, you generally shuffle your dataset, which also gives different results. If you use deep learning with dropout, by design you drop out random nodes. Most of this you can fix: you can be careful to always initialize your weights the same way when you experiment, and you can set the random seed so that your shuffling is deterministic and you always drop out the same nodes. The source of non-determinism I'm talking about here is really updates to your dependencies or ML frameworks: you use Spark 2.4, a new Spark comes out, and everything is different because the underlying implementation has changed. So it would be cool if we could somehow ship our whole environment.

There are some talks at EuroPython about Docker, so obviously you could use Docker, which is quite cool: if you interact with, say, an SQL server, you put it inside your Docker environment and you know you get the same environment everywhere. Nevertheless, Docker is difficult to master — it's easier than virtual machines, but still. You have a lot of options, you have to handle the ACLs and the Dockerfile. I use it every day but I'm still not really used to it; I find it quite cumbersome, and the Docker cache sometimes gets messed up. And we are in a pure Python environment — and in the Python world I'm also including native libs: TensorFlow uses a C++ implementation, so does numpy, and all of this is shipped as wheels, which are pre-compiled Python packages. So for me the Python world also includes packages that use native C APIs. It would be nicer if we could deploy our jobs with pure Python tooling.

It turns out you can do this with conda. You can just use conda virtual environments and, like I showed before, target the conda interpreter with the environment variable. The trick is basically this: you create your local conda environment, zip it, and push it to the cluster; on the cluster it gets unzipped and you target the Python interpreter inside it. In our case we started off with this, because many people in data science use conda — basically everybody in data science uses conda. But our case is special, because we have our own internal PyPI package repository, which means we have conda environments and then we have to install stuff with pip inside those conda environments. Why? Because when we started off, there was no open-source implementation of a conda package repository, and we have our own code that we need to deploy internally. It was quite easy for us to use Nexus Repository Manager to set up a PyPI server, and we found it easier to republish internal packages with pip. We also have a big R&D organization — about 600 engineers — and while all the data science people use conda, the people doing other stuff, like tooling or web development, basically use pip. We had to choose, and we chose this because it was easier for us. So basically we ended up using conda and pip at the same time.
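Before getting to the caveats of mixing the two, here is a rough sketch of the zip-and-ship conda recipe described above. The environment name, the zip step, and the HDFS path are illustrative; the pattern is the standard archive-plus-PYSPARK_PYTHON recipe, not Criteo-specific code:

```python
import os
from pyspark.sql import SparkSession

# Assumes a conda environment was created, zipped and uploaded beforehand, e.g.:
#   conda create -p ./job_env python=3.6 pyspark pyarrow pandas
#   cd job_env && zip -r ../job_env.zip .
# (tarballs, as produced by conda-pack, are also common).
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (SparkSession.builder
         .master("yarn")
         # YARN localizes and unpacks the archive into each container's
         # working directory under the alias "environment".
         .config("spark.yarn.dist.archives",
                 "hdfs:///user/me/envs/job_env.zip#environment")
         .getOrCreate())
```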
Here is something I copied just last week from the conda documentation about what happens when you use conda and pip at the same time: you have to respect certain rules, otherwise it doesn't work. You should only use pip after conda, you should always create a dedicated environment, and you should never install into the root conda environment. I tried this out again — when we started off, about a year ago, it really didn't work very well, so I tried again last week — and my impression is that mixing conda and pip works much better now; a lot of issues have been fixed. Still, when you use both, you can end up with two different versions of numpy. The main reason is that they don't resolve the same dependency graph: conda does some optimizations by default, like activating MKL and packaging things differently. So you can end up with an inconsistent environment.

Spoiler ahead: we succeeded in deploying all those tools — TensorFlow, Spark — without conda, just using Python standards, and I'm sharing here how we did it. You can use the same trick as before with Python virtual environments. The good news is that virtual environments are now supported out of the box: in Python 3 you just call `python3 -m venv` and it gives you a new virtual environment. Then you can target that virtual environment on the remote executor nodes, the same way as before: you zip your virtual environment, ship it to distributed storage, and target the right Python interpreter from inside the virtual environment, and it works out. It works out in maybe 80% of the cases, but not all the time, because virtual environments were not designed to be shipped to other nodes: they can contain symlinks targeting your local machine, and inside a remote container those just won't work.

Here's where PEX comes in. It would be nice if you didn't have to zip your local virtual environment at all, but could build an environment from scratch and then execute it like an executable. This is functionality Python has had for a long time: self-contained executable zip files, specified in PEP 441. What this does is put your zipped environment into a single file with a shebang line pointing at a Python interpreter; when you execute it, the shell invokes the Python interpreter, which is able to load all the zipped content from inside this file. It's basically the Java equivalent of uber-jars, which are used all the time: an executable with everything packaged inside. You can execute it, distribute it, put it anywhere, and you're sure it works. This is the one place where we don't use the Python standard implementation: there is the stdlib zipapp module, but we didn't succeed in making it work for this use case.
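For illustration, the bare PEP 441 mechanism is exposed in the standard library as the zipapp module; a toy sketch (this is the stdlib route that didn't cover our Spark use case):

```python
import zipapp

# Bundle a directory that contains a __main__.py (and whatever it imports)
# into one self-executing zip with a shebang line, as described by PEP 441.
zipapp.create_archive(
    "my_app_dir",                        # directory with __main__.py inside
    target="my_app.pyz",
    interpreter="/usr/bin/env python3",  # becomes the shebang line
)
# ./my_app.pyz can now be copied anywhere and run directly; the interpreter
# named in the shebang loads everything from inside the zip.
```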
So what we're using is PEX, which is just another implementation of this standard. What is cool about PEX is that you have a nice CLI: you just pass it all your requirements and you get an executable file as output. While doing this, PEX builds real wheels: when a dependency only ships a source distribution, it takes the source distribution and compiles the wheel. You also have some flags: you can say you want the archive to be fully self-contained and use nothing else, or you can allow it to pick up packages already on the Python path, from the outside world, which can be handy, as you will see later.

Here's the schema again, now with PEX: you create your PEX archive and push it to the storage, and then, instead of targeting the Python interpreter or the virtual environment, you just target your executable PEX file. When Spark executes your task, it will invoke your executable file, and it works with the same options as the Python interpreter: you can run modules, scripts, or just a .py file. On the command line it looks like this: I create a PEX with pandas, pyarrow and pyspark, and when I launch this PEX, by default it drops me into a Python interpreter where I can import all the stuff — everything is inside. Using the PySpark shell it looks like this: I set my environment variable and I set another flag instructing Spark to ship the PEX to the cluster; Spark will pick it up on the cluster and everything will work out.

This is nice, but it's not very reproducible, because you have to set environment variables and different flags, while your actual application code is in Python. So what you can do now is put all of this in Python: you say, I want to build a reusable module, and I build my Spark session in Python. I define something like a spark_session_builder, where I create the Spark session and set my environment variable — basically I can move all this setup into a reusable function, and whoever uses it can be sure that it sets everything up correctly. My application code I also put into a function. The PEX I can create from the command line, but I can also create it with a function call through its Python API: I create the PEX and just upload it. Then I assemble all those methods: I package the current code, create the main module, and assemble everything into one method. Now I just launch the Python interpreter, call my main.py, and it does all the magic: it creates the Spark context, builds the PEX, uploads it to the cluster, and everything is available. You can just use VS Code, change one line, ship it again, and everything is available in your distributed job in a reproducible way. And when you are ready for production, you take it to production and it works as is. Here's the schema again.
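To make that concrete, here is a rough sketch of such a reusable builder, assuming the dependencies PEX has already been uploaded to distributed storage. The function names, the use of the pex CLI through subprocess, and the exact Spark config keys are illustrative assumptions, not Criteo's actual implementation:

```python
import os
import subprocess
from pyspark.sql import SparkSession

def build_pex(requirements, output="deps.pex"):
    # Build one executable archive with all dependencies inside,
    # e.g. build_pex(["pyspark==2.4.5", "pyarrow", "pandas"]).
    subprocess.run(["pex", *requirements, "-o", output], check=True)
    return output

def spark_session_builder(pex_on_storage, app_name="reproducible-job"):
    # `pex_on_storage` is the HDFS/S3 URI of the uploaded PEX; the upload
    # step itself is omitted here (see the storage helpers later on).
    pex_name = os.path.basename(pex_on_storage)
    # The PEX behaves like a Python interpreter, so once Spark has shipped
    # it to the executors they can use it directly as PYSPARK_PYTHON.
    os.environ["PYSPARK_PYTHON"] = f"./{pex_name}"
    return (SparkSession.builder
            .master("yarn")
            .appName(app_name)
            .config("spark.yarn.dist.files", pex_on_storage)
            .getOrCreate())
```

A main() then just chains these steps: build the PEX (or reuse a cached one), upload it, call spark_session_builder, and run the job.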
Okay, this is nice now: it's reproducible, but slow. You create your PEX and it takes about a minute. The reason is the pip resolver, but basically all packaging is slow: whether you build a PEX, use pip, use conda, or even Docker, nothing is really fast. But you want to iterate fast when you develop, and one minute just to launch a job is quite a long time to wait. So the next idea is: how could you separate the code that you are actively developing from the dependencies that always stay the same?

One way to do this is pickling, which is used by some tools, for example PySpark and Ray. Pickling serializes your Python functions: you take a function, serialize it into a binary stream, put this binary stream into a file, and on the cluster it just gets loaded, unpickled, and executed. With plain pickle this doesn't work, because pickle only keeps the name of your function. But there is another library, made for distributed computing, called cloudpickle, which pickles the body of your function and its dependent functions. So you have a function, you pickle it, and you can re-execute it on the remote node. It's already used by PySpark: when you define a vectorized UDF or a custom function — like the mean function — and it's pushed, as if by magic, into the execution graph, what PySpark does is pickle it locally and unpickle it on the cluster.

The hiccup with this is that we want to build reproducible applications, so we have moved all of our code into shared packages that we can reuse, and packages are not pickled. Also — I won't go into the details of this ticket — pickling in Python is quite difficult and there are a lot of issues with it, a bunch of corner cases where it doesn't work; sometimes you just end up with code that doesn't work.

So the idea now is this: remember that we put everything into the current package. There is already a Spark function that lets you upload files dynamically, so you can just zip the current package you are working in and upload it every time, and your code will magically be available on the cluster. You can even do this for multiple packages, because you can install them in editable mode with pip: every package you develop, you install in editable mode, and when you do pip list you see where each one is installed. So you take all those packages and upload them every time. Now you have really separated the dependencies that never change from the code that always changes. The code that always changes you upload on every run, and the PEX archive with the TensorFlow and PySpark dependencies you put once and for all on the distributed storage and reuse all the time. It's much, much faster: if you want to iterate fast, you can launch a job almost instantly.
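A minimal sketch of shipping the fast-changing code separately; the helper name and paths are illustrative, not the exact Criteo implementation:

```python
import shutil
from pyspark.sql import SparkSession

def upload_editable_package(spark, src_root, package_name):
    # Zip a package you develop locally (e.g. installed with `pip install -e`)
    # so that `package_name/` sits at the root of the archive...
    archive = shutil.make_archive(package_name, "zip",
                                  root_dir=src_root, base_dir=package_name)
    # ...and ship it with the job: addPyFile puts the zip on the executors'
    # Python path, so it never needs to be baked into the big PEX archive.
    spark.sparkContext.addPyFile(archive)

# Usage: the heavy dependencies live in the PEX uploaded once and for all,
# while the code under development is re-zipped and shipped on every run.
# upload_editable_package(spark, "./src", "my_project")
```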
I also promised S3 storage. So how can you connect Spark to S3, and how can you push to S3 storage easily? Well, there are a lot of packages readily available. One cool one, developed by the Dask folks, is called s3fs; it's a layer on top of boto, the default Amazon client library. The API is basically POSIX-style, like the Linux file API, and uses default Python idioms: to open a file you just call open with your bucket and your archive path, and then you can read and write to it; you can also list files. So you can use this to upload your file to distributed storage, and then you need to instruct Spark to use those distributed files. Again, it's a few flags: you pull in the JARs and instruct Spark to use the S3 file system, you add some helpers, you upload your environment, and then, based on the same method we defined before, you reuse it from S3. If you want to use something else, like Google Cloud Storage, Azure Blob Storage, or HDFS, it's quite easy, because s3fs inherits from another package, called fsspec, which is just the interface to those APIs. As long as implementations respect this interface, they are all compatible, so you can push to any file system and then instruct Spark or TensorFlow to use that file system.
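A small sketch of the POSIX-style s3fs API mentioned above; bucket and key names are illustrative:

```python
import s3fs

fs = s3fs.S3FileSystem()  # credentials are picked up from the environment

# Upload the PEX / environment archive with plain open/read/write calls.
with open("deps.pex", "rb") as local, \
     fs.open("my-bucket/envs/deps.pex", "wb") as remote:
    remote.write(local.read())

# List what is stored under the prefix.
print(fs.ls("my-bucket/envs"))
```

Because s3fs implements the fsspec interface, other backends (for example Google Cloud Storage or HDFS implementations) can be swapped in with the same style of calls.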
Nearly done. As a sample implementation, we have put all of this into our own package, which is called cluster-pack, and which just follows this workflow. It supports PEX, it can push to Amazon S3 and to HDFS, and when you use TensorFlow or Spark you get default builders, so basically everything is set up for you; it's just a wrapper around everything I've shown. If I show the same example again: I upload my environment, connect Spark to S3, add the packaged environment and the editable requirements, then I execute my example, and magically everything works in a distributed way. You can do this for thousands of jobs, and it just works.

What about conda? You have seen that the main idea is still, nevertheless, that you invoke your Python interpreter remotely. This works for PEX, but you can do the same with conda: once you have all the machinery in place, all the APIs in place, what you do is, as usual, zip your conda environment, push it, and invoke the interpreter the right way. We still have this support, and we still use it sometimes, because one difference between pip and conda is that conda embeds the Python interpreter, whereas with pip and virtual environments you need the Python interpreter at the system level. So if you want to experiment with Python 3.9 — I'm not sure it's already supported by conda, but it would be nice — or, say, with Python 3.8 while your system is still on Python 3.6, you can do that experimentation with conda. We don't use this in production, but it's still nice to have sometimes.

Also TensorFlow: honestly, it's the same idea. It's even easier, because you don't have spark-submit. Distributed TensorFlow really follows the MPI style: you define your function, you put all the TensorFlow stuff inside, you push it to storage, and remotely, on each node where you want to execute, you invoke the Python interpreter — basically you invoke your main.py. Then the execution workflow starts, and your distribution framework takes care of stopping the execution workflow, synchronizing gradients if necessary, and moving on. And since you just target a Python interpreter, you can target a virtual environment, you can target a PEX, you can target anything; it's the same idea.

So we're finished a bit early. That's all I had to say, so I can take questions if there are some.

Cool. Okay, thank you very much, thank you for presenting. Have a good day and see you next year.

Yes, thank you.