Hello, my name is Max and I'm a software engineer on the Beam project, and today I want to tell you about Beam and how Beam realizes its vision of portability.

What do I mean by portability? Portability can mean a lot of things, but the short answer is that it enables you to run your data processing jobs on top of various execution engines like Spark, Flink, Samza, or Google Cloud Dataflow, and you can do that in the programming language of your choice. That sounds pretty good, doesn't it?

So I've put this agenda together. First of all, some of you might know Beam already, but I will give a short introduction. Then we will talk a little bit more about portability, and then about how we can actually achieve it, because there are multiple ways to do that. And then we'll recap and see how far we actually are with portability.

So what is Beam? First of all, Beam is an open source project at the Apache Software Foundation. If you don't know it, the Apache Software Foundation is like a framework for developing open source software: it provides infrastructure and a kind of guide for how to develop software in the open. Beam is a project there, and it focuses on parallel and distributed data processing. You typically run your Beam job on multiple machines, and you mostly have large data, but you can also run it on a single machine if you want. It has a really cool API which can do batch and stream processing at the same time. Often you have a batch API and a stream API which are separate, and you have to port your batch job to streaming, but in Beam
it's all the same.

Once you've written your job, you can run it on multiple execution engines. That's why we sometimes say it's like an uber-API: you use one API, but you can execute with multiple backends or execution engines, and you can also use your favorite programming language.

A little more detail on this, because this is the vision of Beam: we have the SDKs on the left side, that's Java, Go, Python, Scala, and SQL, and then we have some magic happening in Beam, which is the runners. There is a runner for every execution backend, and the runner translates the Beam job from the SDK into the language of the execution engine. You can see a bunch of them there, and more and more are coming. It's really nice to have that choice, right?

So how does the API work, concept-wise? In Beam there are PCollections. First of all, there is the pipeline: the pipeline is the object that holds all your job information, and you create it from some options which you can pass in. Then you create PCollections by applying transforms to the pipeline; you always apply transforms, which is really easy. You can chain multiple transforms after each other, or you can branch, like here, where you create this PCollection 2, which is a branch of PCollection 1. And then you can run that pipeline. That's pretty sweet.

Transforms are actually quite a nice abstraction, because transforms can be either primitive or composite. What does that mean?
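Here is the pipeline-and-transforms idea as a tiny plain-Python toy model (this is not the real Beam API, just an illustration of the concept): applying a transform yields a new PCollection, so chaining and branching fall out naturally.

```python
# Toy model of Beam's pipeline concept (NOT the real Beam API):
# applying a transform to a PCollection yields a new PCollection,
# so you can chain transforms, or branch by applying two transforms
# to the same PCollection.
class PCollection:
    def __init__(self, elements):
        self.elements = list(elements)

    def apply(self, transform):
        # A "transform" here is just a function from elements to elements.
        return PCollection(transform(self.elements))

p_begin = PCollection(["a", "b", "c"])                    # pipeline input
p1 = p_begin.apply(lambda xs: [x.upper() for x in xs])    # chain
p2 = p1.apply(lambda xs: [x + "!" for x in xs])           # chain further
p3 = p1.apply(lambda xs: [x * 2 for x in xs])             # branch off p1

print(p2.elements)  # ['A!', 'B!', 'C!']
print(p3.elements)  # ['AA', 'BB', 'CC']
```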
In Beam we only have a few primitive transforms: ParDo, GroupByKey, window assignment, and Flatten. I will explain two of them in a bit. What this means is that you can define composite transforms which use these, and the composite transforms are expanded into the primitive ones. That is really convenient, because as a runner author you only need to implement the primitive transforms; you can do optimizations for composite transforms, but implementing the primitives is enough.

Of course, because this is a big data framework, we have to do a little word count. For those of you who don't know word count: you have a list of words, like "to be or not to be", and you count how often each distinct word appears in that list. The way to do that in Beam is to use a ParDo, which stands for "parallel do": you transform your words into key-value objects, with the word as key and a one as value, which stands for the number of occurrences. Then you do a GroupByKey, which basically shuffles the data and gives you a list of all the values for every distinct key. Then you can sum them up, and you know that "to" and "be" appear twice in this list and the others just once.

Don't get confused now: this looks really ugly, but it is how you would write it in Beam at the lowest level, and we will see that we can simplify it a lot. We create the pipeline, we have our list of words, in this case "hello, hello, FOSDEM", we have the first ParDo which assigns the ones, then we do a GroupByKey, and then we have a loop in the second ParDo which sums it all up. Yeah, that was pretty ugly, I agree; I don't know a better way to write it at this level, and it's maybe incomprehensible.
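Since the slide itself isn't reproduced here, the same logic in plain Python makes the semantics of the two primitives concrete (this is not Beam code; `count_per_element` is a made-up helper standing in for a composite count transform):

```python
from collections import defaultdict

def count_per_element(words):
    """Made-up helper mimicking a composite count transform,
    expressed purely via the primitive steps described above."""
    # "ParDo": emit a (word, 1) pair for every element.
    pairs = [(w, 1) for w in words]
    # "GroupByKey": shuffle so each distinct key maps to all its values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Final "ParDo": sum up the ones per key.
    return {key: sum(values) for key, values in grouped.items()}

print(count_per_element(["to", "be", "or", "not", "to", "be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because the composite is just packaged primitives, a runner that implements ParDo and GroupByKey gets every such composite for free, which is exactly the point about runner authors only needing the primitives.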
So luckily we have composite transforms, and we can simplify this further. Instead of the first ParDo, where we write that DoFn, we just use a MapElements function, which is somewhat simpler, and instead of the second part we use a sum-integers-per-key composite transform, which basically sums up the number of occurrences for each key. We can simplify this even further by just using the count-per-element composite transform. That looks pretty simple, right?

There are a lot of these transforms in Beam, and if you read the documentation you can write really readable code, even in Java, because that was the Java API. We fortunately also have a Python API, which looks much nicer. Here is the same initial example; we just use lambda functions to do that word count. In Python we of course also have these composite transforms, so this gets slightly simpler with the CombinePerKey transform, to which we pass sum as an argument.

This was just a very quick look into the Beam API, which I thought would be useful. There are lots more composite transforms, and you can create your own. We have lots of IO. We have windowing, event time, watermarks, side inputs, state and timers, which may not make sense to you at the moment if you haven't tried it, but these are really useful concepts once you learn more about Beam and your pipeline gets more complicated.

So what does portability mean now? I showed you Java, I showed you Python; shouldn't that already be working? Let's first see what the two different kinds of portability are in the Beam context.
You have engine portability, which is the ability to run a pipeline on different execution engines, and language portability, which is using different SDKs for composing the pipeline.

If we look back at the vision I showed you at the beginning, this is really how it should work, and in terms of engine portability it is actually true. In the Java API, in the options we pass to the pipeline, we just set the runner to FlinkRunner, then we call run, and it really runs on Flink. That's pretty amazing, so we have that part covered already.

Now, what about language portability? Why would we use other languages? Syntax, expressiveness, and communities are a big point, because a lot of people simply don't like Java, for various reasons which I can understand; I actually really like Java, but it's okay. We also have libraries as an important factor: really huge libraries like TensorFlow are simply not available in Java. That's a good reason to use Python.

So I was actually lying to you a bit: this language portability didn't really work. It used to be the case that in the open source world we basically only supported Java and Scala, and you could only run Python on Google Cloud, which is not so cool, right? It kind of breaks the promise. So what we needed, and what we worked on for almost the past two years, is to build a language portability framework into Beam and its runners, so that we can actually realize the full vision.

So how do we achieve it?
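The runner switch mentioned a moment ago boils down to a single pipeline option. A hedged sketch of what the options can look like in the Python SDK (flag names as I recall them from the Beam docs, and the Flink master address is a placeholder for your own cluster):

```python
# Engine portability in practice: the same pipeline, different --runner flag.
# (Flag names as recalled from the Beam docs; the master address below is
# a placeholder, not a real endpoint.)
args_local = ["--runner=DirectRunner"]      # run locally, e.g. for testing
args_flink = [
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",        # placeholder Flink address
]
# A pipeline would then be constructed roughly as:
#   beam.Pipeline(options=PipelineOptions(args_flink))
print(args_flink[0])
```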
If we look at the very abstract translation process of a pipeline, it used to be like this: we had Java and then a bunch of runners, and they all executed in Java, so each of them needed to implement its own translation, but once the pipeline was translated everything was fine. Now that we have language portability, it is certainly possible, though maybe not a very good idea, to just let every SDK figure out its own way to translate to every execution engine, and let every execution engine have its own ways of supporting each language. But that seems like a terrible idea: very complicated, and replicating a lot of work.

So what we did is introduce the Runner API, which takes the pipeline from the SDK and transforms it into a language-agnostic format. It is based on protobuf, but that doesn't really matter; it is just a format that is consistent across languages.

Then we also needed something for the language-dependent parts during execution. The execution engines, most of them, actually all of them, are written in Java, so when you have Python you need to figure out a way to send data to that Python process, exchange state, and all of that, and this is called the Fn API. That way we only have these two extra layers; we just have to make sure the runners are compatible with them, and then we are good to go.

So let me simplify this a lot. The old way was: we have the SDK and the runner, and an execution engine like Flink with a bunch of tasks, and all of these were in Java, and that worked pretty well. The new way is a bit different.
In the new way, we have the SDK, which uses the Runner API to produce this universal pipeline format, and then we have the Job API, which is a way to send this pipeline to the job server. The job server is really a Beam concept now; it used to be that every execution engine had its own way of submitting applications, but we wanted to really get everything portable, so we created the job server. In the job server, the runner translates this Runner API pipeline and then executes it on the engine of your choice.

But of course we have these Python blobs or Go blobs in between, which the runner doesn't really understand, and whenever we have that we have a special task called an executable stage, which is the fancy name for "we don't know what to do with this". So we have to send it to an external process called the SDK harness, and that harness exists for every language: Java, Python, and Go. We create the harness when we start the job, with the Python code for instance, and then whenever we receive data in that task, we send it to the external process; the external process does its processing and sends the results back. This is very simplified.

There are some challenges with that, because there is some cost, not a huge one, when you send data to an external process: you need to serialize that data and deserialize it again. So we built in an optimization called fusion, which tries to combine as many of these Python stages as possible into one SDK harness, so we don't do duplicate serialization work.

How does the SDK harness work? First of all, the SDK harness needs to be bootstrapped somehow. What we typically do is use Docker: we have an environment which contains all the dependencies, like my TensorFlow or my NumPy, and we just use this Docker image directly.
We can specify that image in the options. That is a really easy way of deploying, because you have an image registry, and the image is downloaded and started automatically. But some people don't want to use Docker, for various reasons, so you can also use a process-based execution, but then you have to make sure you set up the environment manually. It is also possible to run the harness embedded, in case you are using Java.

There is a lot of communication happening between the backend and the SDK harness: we have a control plane and a data plane, a way to access state, progress reporting, and also logging. Everything is logged, so you actually know what is happening inside the external process, because otherwise debugging would be really hard.

What is still missing, and this is kind of a problem: an SDK is only complete if you can read and write data, right? It is not really worth anything to support all the primitive transforms if the connectors we have in Java are not available in every SDK, and you can see there are a lot of them. It would be a lot of work to replicate them, and the library support differs per language: for example, when you want to create a Kafka connector, the support in Python is not so good, while in Java it is really good. So ideally we would just use the Java connector from Python and not recreate it in Python. It turns out we can actually do that, and it is a pretty amazing solution: we can simply use the process I've described to run cross-language pipelines.

So how does it work? We are finalizing the specification at the moment, but it is sort of like this.
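As a toy model it can be sketched like this; every name below is made up for illustration (the real mechanism exchanges protobuf pipeline definitions with an expansion service):

```python
# Toy model of cross-language expansion (all names are illustrative only).
# The Python SDK emits an opaque placeholder; an "expansion service" swaps
# it for the real Java transform subtree before the runner translates it.
placeholder = {"urn": "external:kafka:read",   # hypothetical URN
               "config": {"topic": "events"}}  # hypothetical config

def expand(transform):
    if transform["urn"].startswith("external:"):
        # Pretend the Java expansion service returned this subtree.
        return {"urn": "java:KafkaIO.Read", "config": transform["config"]}
    return transform

pipeline = [placeholder,
            {"urn": "python:Map"},          # stays a Python stage
            {"urn": "native:GroupByKey"}]   # handled by the engine itself
expanded = [expand(t) for t in pipeline]
print([t["urn"] for t in expanded])
# ['java:KafkaIO.Read', 'python:Map', 'native:GroupByKey']
```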
You have a Python job, and you use something like an IO expansion placeholder; it is probably not going to be named exactly that, but it is a kind of dummy object where you specify your IO, like Kafka IO, maybe by the full Java name, though it will be made a bit simpler, and you pass in some configuration. Python does not understand this, of course, but when we do the translation to the Runner API, we have an expansion service running, a Java expansion service in the case of Java, and we take that placeholder and expand it into a native Java Kafka transform. Then we do the rest of the translation.

When the job runs, we now have two different kinds of SDK harness running: a Java one for our Kafka source, and then maybe some Python data processing afterwards, where we do some map and count. And of course we also have the native transforms of Flink or whatever execution engine you are using, like a GroupByKey, which just does not need an SDK harness or anything, because it is supported by the execution engine.

So this is roughly how portability works. There are a lot of details, of course, but this is a 20-minute talk. So how far are we? We have engine portability, and we have language portability, almost, I would say. For developers: you can try it out yourself, and I have a link for you at the end. You can try it out; it works.
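On the options side, trying it out looks roughly like this, a sketch assuming a locally running job server (the endpoint and image choice are placeholders; the Beam portability page has the real instructions):

```python
# Submitting through the portable stack (a sketch): the PortableRunner
# hands the pipeline to a job server over the Job API, and
# --environment_type selects how the SDK harness is bootstrapped.
portable_args = [
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",  # job server address (placeholder)
    "--environment_type=DOCKER",      # or PROCESS (no Docker needed),
                                      # or an embedded/loopback mode
]
print(" ".join(portable_args))
```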
We just have to make it a bit better: we have to tune the performance a bit, although we have estimated only five to ten percent overhead in most cases, and cross-language pipeline support needs a bit more specification, but that is going to happen in the next weeks. There is also this fancy thing called Splittable DoFn, which you can read up on, but that is not so important here.

There is a compatibility matrix which tracks the status of portability for all runners; there is a link here. Flink is actually the best runner, I would say, because it supports the most features at the moment, and the others are going to catch up.

That brings me to the end of my talk. Please check out the portability website, or just go to the normal Beam website, if you want to learn more about Beam. We have mailing lists and an awesome Slack channel, where there is a lot of help for people. Yeah, and that's it. Thank you.

Compiled to what, sorry? A common bytecode? Yeah, so the question is: why not use something like Apache TinkerPop, which uses a common intermediate format between the languages, like bytecode, which can then be executed. There are a lot of other frameworks that do that; for example, Flink has a Python API which uses Jython, which is sort of the same idea, generating bytecode from Python. But we want to be able to support all kinds of libraries, like TensorFlow, which is a native C library, and you can only achieve that if you run a CPython interpreter, and not some custom version of Python which only supports a subset of the language. That is the reason.

So, I'll repeat the question: how is the debugging experience with these
Python libraries? When you run into an error in Python, how fast do you see it, given that you execute on what is essentially a Java runtime? It is actually pretty good, and it has been part of the design. When you see an exception in Python, it is forwarded directly to the Java operator, which catches the error there, and due to the logging and so on you see immediately what happened. The error is also sent back, so you see the error message immediately, and your pipeline will fail, because if the runner receives a failure, it should fail. Yeah, good question.

So, is Python 3 supported or not? It is supported, but it is like 99% done. It is there, you can use it, there are test cases and everything; it is just not officially released yet. I am not working on the Python side myself, but I expect it to be done in the 2.11 release, which is the next Beam release and should be out next month. Yeah.