Alright, I'm Jason Schlesinger, I'm an undergraduate at Arizona State University in Computational Mathematics, and here's my presentation. This is what we're going to cover today. We're going to cover what cloud computing is in the context of this presentation, because it's kind of a vague term. Then I'm going to introduce and describe MapReduce — hopefully for most of you that will be a review. I'm going to introduce what Hadoop is, and then we're going to get into the security issues with Hadoop. We'll discuss some solutions and workarounds, and then we'll wrap up with some final thoughts on the situation in general. My main goals for this presentation are to raise awareness of Hadoop and its potential, to raise awareness of its security issues, and hopefully to inspire current and future users and administrators of Hadoop to be aware of these kinds of issues.

Now, what cloud computing is, for our purposes: it's computing distributed across multiple machines that are linked together. It's fault tolerant to hardware failures, so if a hard drive or a power supply fails, the whole system keeps working. The applications are abstracted somewhat from the operating system, and it's often used to offload tasks from user systems that would be difficult to maintain or run locally. Considering all this, Hadoop is an incarnation of MapReduce in a cloud environment.

Now, what MapReduce is: it's good for huge datasets that need to be indexed, categorized, sorted, culled, analyzed, and so forth. It allows data to be distributed across a large cluster, and then the jobs are distributed out to the data, where they can work independently and in parallel. Google published MapReduce back in 2004, and Apache then created an open source version of it.

An example of the speedup MapReduce gives you, in my MS-Paint-tastic diagram here: if your data is laundry and you're doing laundry in a serial environment, you can only run one load at a time, and when a load is done you need to split your laundry up by material so that your dryer dries everything properly — because if you put in a pair of panties with a bunch of towels, the panties are going to be burnt and the towels will still be wet. Even if you get creative with this, it takes at least two hours to do your laundry, and if not, it's going to take three. MapReduce is like going to a laundromat: you distribute all your work across a bunch of washers, and that only takes as long as the longest wash load. Then you break your laundry back up into the different materials, and those go out to the dryers, which are analogous to reducers. This is just an illustration of how the speedup works, but as you can see, it definitely gives you one.

Now, a real-world example of how you could use MapReduce — not really efficient, but illustrative: say you need to do a word count across a lot of text files. You break the text files up across your mappers, and each mapper counts the number of times each word appears in just one text file. The mappers output word/number pairs, those pairs get grouped together using the word as the key, and pairs with the same key get sent to the same reducer, which then adds up each of those numbers.
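To make that concrete, here's roughly what that word count looks like as a Hadoop job in Java. It follows the shape of the standard WordCount example; I'm writing it against the newer org.apache.hadoop.mapreduce API, so a cluster from this talk's era would use slightly different class names, but the mapper/reducer structure is the same.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every word in its slice of the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives every count emitted for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The grouping step in the middle is what the framework does for you: everything the mappers emit under the same key lands at the same reducer.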
Some other potential uses of MapReduce, perhaps more realistic ones, are indexing large datasets, image recognition, and processing geographic information systems data, such as combining vector data with point data. It's good for analyzing unstructured data, or stock data, or even machine learning tasks — pretty much any situation where the data is just incredibly cumbersome.

Now, a little walkthrough of the terms I've been using. "Map" is a functional programming term that means applying a function to each element in an array; a mapper performs a function on one element of the dataset. So in this graphic here, which I know is kind of hard to see, the data gets split up, and the workers on the left are the mappers. "Reduce," or "fold," is a functional programming term that means iterating across the data, using the result from the last element as an input to the next function call. In our word count example, a word might come in with the counts five, one, and two; the reducer adds five and one to get six, then adds six and two to get eight. The reducers — the workers on the right here — reduce across the results of the mappers. (There's a short Java sketch of map and fold at the end of this section.)

Now, a brief overview of what Hadoop is. It's a cloud platform and framework that allows programmers to distribute data for MapReduce jobs: you use the Hadoop API to write programs in Java, which you then run on a Hadoop cluster to process your data. As I said before, it was developed by Apache based on Google's paper, and it runs on Java. It allows businesses to process large amounts of data quickly by distributing the work, and it's one of the leaders among the open source implementations of MapReduce — there are other implementations, but it's one of the biggest. It's very good with large datasets on large clusters; with that said, it's also bad with small datasets on small clusters.

It's also growing as a business tool, and some large content distribution companies use it. Yahoo uses it for many tasks: they've got over 25,000 computers running Hadoop, and they use it so much they recently released their own version of Hadoop with their own modifications. A9, which is Amazon's search engine, uses it to index their user-generated content as well as their product data to make it all searchable. The New York Times is a really cool case: they took their public domain articles and sent them to Amazon Web Services — which is really cool if you haven't already checked it out — made a virtual Hadoop cluster on Amazon's Elastic Compute Cloud, processed the data, got the results back, and then could just get rid of the virtual cluster. And Veoh uses it to, quote, "reduce usage data for internal metrics, for search indexing, and for recommendation data."

It's also used by non-content-distribution companies — companies that hold personal information, such as Facebook, eHarmony, and Rackspace, an ISP — and the NSA is using HDFS for storing intelligence data, according to a recent Slashdot article. Other early adopters would include anyone with medical records, tax records, network traffic — any large quantity of data they really need to process. Pretty much wherever there is lots of data, a Hadoop cluster is a good thing to put in.
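Before we get to security, here's the promised sketch of those two functional ideas, map and fold, in plain Java using streams. The words and numbers are made up for illustration; the fold is the same five-one-two arithmetic from the word count example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        // "Map": apply a function to each element independently — each
        // application is self-contained, so they could all run in parallel
        // on different machines.
        List<String> words = Arrays.asList("towels", "panties", "begat");
        List<Integer> lengths = words.stream()
                                     .map(String::length)
                                     .collect(Collectors.toList());
        System.out.println(lengths); // [6, 7, 5]

        // "Reduce"/fold: combine elements pairwise, feeding each partial
        // result into the next step: (5 + 1) = 6, then (6 + 2) = 8 —
        // exactly the word count arithmetic from a moment ago.
        List<Integer> countsForOneWord = Arrays.asList(5, 1, 2);
        int total = countsForOneWord.stream().reduce(0, Integer::sum);
        System.out.println(total); // 8
    }
}
```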
Now, that kind of leads us into the security framework and access control. I mentioned companies that either would benefit from or are using Hadoop, but consider a company that could potentially benefit from Hadoop — say, Intel. They have a lot of business data, and they have a lot of chipset data. However, as I'm going to show shortly, if they were to put both in the same cluster, then people would have access to both sets of data.

Which leads me to my point here: HDFS, the Hadoop Distributed File System, has no read control. If you're any user on the cluster, you can read anyone else's data. The client identifies which user is running the job by calling whoami — which can be forged. And HBase, which is BigTable for Hadoop — kind of a database, what you use on top of Hadoop when you want something close to, but not quite, a relational database — has, as of version 0.19.3, no read or write control either. The LAMP analog of this would be any application being able to access any database just by asking for it. What this means is that any business running a Hadoop cluster gives all of its programmers and users the same level of trust: any job running on that cluster can access any of the data on that cluster. Even if a user only has a limited number of jobs they're allowed to run, they can run those jobs against any dataset in the cluster. Finally, malicious users could potentially modify other users' data on that cluster.

Now I'm going to give a little demonstration of this. I have a Hadoop job that I'll run as two users. First is user A, so I su into that account. A little ls of what they have: the Bible is their dataset, and there aren't any access controls on any of their stuff, which is unfortunate. I'm going to run a MapReduce job that counts the number of times "begat" appears. It starts by distributing the job out, and then the mappers count the number of times they see the word "begat" and send that out. It has to finish the map job before it begins the reduce job, and the reduce job takes a little longer because it needs to send data around, do remote reads, and so on. I actually sped this up — if you look at the timestamps, there's a good five seconds between each of these ticks. All right, it finishes up. Let's take a look at the result... there we go — oh, it went away — it was 215. And then we have hackeray, a different user, running the same job against that same data. And... 221 — that's what I meant to say. The point is that a completely different user can run a job over user A's data without being stopped. (There's a short code sketch of this problem right after the workarounds below.)

Now, some possible workarounds. You can keep each dataset on its own Hadoop cluster: if the attacker doesn't have any access to data they shouldn't have access to, the point is moot. It's also possible to run each job on its own cluster using Amazon Web Services, and with Elastic MapReduce it's even easier — all you need to do is upload your data, send it the job you want, tell it how many nodes to run, and pay for the time. Then there's Hadoop on Demand, where you load your data into a real cluster, and every time a job is run it creates a virtual cluster with access only to the data you want it to have. Another possibility is to simply not store any confidential information in Hadoop — if it's public information, then maybe people should be reading it. And another possibility is to encrypt all your sensitive data.
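On the read control problem the demo just showed: here's a minimal sketch, in Java against Hadoop's FileSystem API, of what any user on the cluster can do. The path is hypothetical — it stands in for user A's dataset from the demo — and I'm using the stable FileSystem calls, which is an assumption about what a client of this era would look like.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical client run as *any* cluster user. With no read control in
// HDFS, nothing stops it from opening a path that "belongs" to someone else.
public class ReadAnyonesData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the cluster config
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path: user A's dataset from the demo.
        Path someoneElsesFile = new Path("/user/userA/bible.txt");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(someoneElsesFile)))) {
            // Reads user A's data just fine, no matter who we are.
            System.out.println(in.readLine());
        }
    }
}
```

Nothing here is an exploit — that's the point. It's an ordinary read, and the file system never asks who's doing it.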
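And on that last workaround — encrypting sensitive data before it ever reaches the cluster — here's a minimal sketch using Java's standard javax.crypto library. The class name and the sample record are made up; the point is just that the cluster only ever stores ciphertext, which is exactly where the overhead I'm about to describe comes from.

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class EncryptBeforeStoring {
    public static void main(String[] args) throws Exception {
        // The key stays with the data owner; the cluster never sees it.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // ECB is used only to keep the sketch short; real data would want
        // CBC or GCM with a proper IV.
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] ciphertext =
            cipher.doFinal("a sensitive record".getBytes(StandardCharsets.UTF_8));
        // ...ciphertext is what would actually be written into HDFS...

        // The overhead: every job has to do this step before it can process
        // anything, and the reverse for anything it writes back.
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] plaintext = cipher.doFinal(ciphertext);
        System.out.println(new String(plaintext, StandardCharsets.UTF_8));
    }
}
```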
Now, I've heard of some theoretical ways to encrypt data and still analyze or process it, but as of right now it would be difficult, because you have to decrypt and re-encrypt the data every time you want to run an operation on it. And even for data a job doesn't touch, you still have the overhead of pushing around encrypted bytes you're not going to use.

A real solution would be to develop something that sits at the file system level — or at least to write a concerned email to the Hadoop developers. The problem is that access control is held at the client level when it should really be at the file system level. Access control checks should be performed at the start of any read or write, and user authentication really should use a tried-and-true method such as password or RSA key authentication.

Now, some final thoughts. Hadoop is a rising technology. It's not quite mature and still has plenty of its own issues; however, it's starting to take hold in the marketplace, a lot of businesses are using it, and despite its issues it's really worthwhile to try out and use on your larger datasets. We have the power to shape the future today, and with all our problems, we really should learn from our past mistakes and move on from there.

Now, if you want to know more about Hadoop, there's another presentation on Saturday in track four that covers more of Hadoop's capabilities and shows different ways to use it; it also explains some of the features I talked about in a little more detail. All right — if you have any questions, please meet me outside the room or around the con. The next thing I'm going to is the beverage cooling contest, so you can meet me there.