So just excuse me real quick, can everyone smile for me? Alright, here we go. One, two, three, cheese! That's for my favorite app, which is called Foursquare. It's a social app: when I check in somewhere, it goes up online and gets shared with all my friends. In the background, Foursquare is using all of that data to track check-ins, give suggestions to me and to my friends, and visualize what people are checking in to. I'll come back to that in a bit.

So my name is Joe Zaker. I'm an evangelist for Amazon Web Services, based out of Australia. I'm happy to be here, and I want to thank the organizers for having us here today. Amazon loves big data.

I'm going to cover three basic ideas. First, I'll talk a little bit about some characteristics of big data, and I have no doubt that many of you already know these characteristics, but I want to frame them around why the cloud helps enable big data. Second, I want to talk about how the cloud is big data's best friend, its BFF. And third, how big data is used in the cloud today, in the real world. So I'll walk through a few case studies, specifically ones I'm familiar with, which are ones using Amazon Web Services, but they apply just as well to anybody using the cloud.

So what are some of the characteristics of big data? Well, when we talk about big data, we mean data that's so big and so encompassing that we actually have to innovate just to collect, store, organize, analyze, and share it. It's such a massive amount of information that it takes innovation just to hold on to it.

For big data, the bigger the data, the better the data. The more you can collect, the more valuable it is. Take a company like Yelp. Yelp uses big data to power the recommendations and the spelling corrections for what people type into its search. The more data that comes in, the more mistakes and the more correct answers it sees, and the better the recommendations get. If you've ever shopped on Amazon.com, that recommendation engine running in the background is completely powered by the analytics collected while you shop, and while everyone else shops, and it gets better and better as more data comes in. Those recommendations, where you buy one book and it suggests another, get better because the data gets bigger.

Bigger data is also harder data. It's not just harder because there's more of it; it's also coming from different sources. You're not dealing with just one source of data anymore. You have to collect multiple sources in multiple formats. And once that data gets bigger than you can store on one machine, you have to come up with some methodology to spread it across different storage. In a traditional IT infrastructure, if you're trying to store petabytes of data, you either need very high-end infrastructure, or you're going to have a lot of hard drives stacked up against the wall at a terabyte each.

And big data is getting bigger; it's accelerating. We're going to have more and more data. According to IDC, we'll have 2.7 zettabytes of data by this year. Who knows what a zettabyte is? It's 1,000 exabytes. That's a lot of data.
And a lot of it's going to be unstructured. This is data coming in from lots of different sources, from different places, spread out and stored in different areas. On top of that, the requirements are changing; the way we use this big data is changing too. For example, people want immediate answers to the questions that big data can answer. We want answers right now. When I type a book name into Amazon, I want those recommendations right now. There's no value to Amazon in sending me an email two weeks later saying, hey, by the way, you know that book you bought? Here's another one that's kind of like it. My transaction is done at that point.

People don't want sampling anymore, either. In the old days, back when I was young, you could sample: take a little bit of the data, project from it, and make an estimate. Those days are gone. Now you can get accurate answers, because you can process all of the data. But it's getting more complex. The old tools are no longer adequate for the amount of data. SQL is no longer enough of a language to describe how we're going to process it. Relational databases can't contain all this information. And people want to experiment with data. We want to use the data to ask questions of the data, but to do that, I don't want to book time on a supercomputer or have to provision tens or hundreds of thousands of dollars of hardware just to ask those questions.

So where is some of this data coming from? A lot of it is computer generated. We have more and more online services: Twitter, Facebook. Systems are also talking to each other; web servers are writing logs and talking to other back-end systems. Sensor data, cameras, traffic data, all of it being captured, growing and growing. Logs, emails, pictures. I don't know about you, but look at the number of pictures I had as a kid: my parents have about six pictures of me from the 80s. Maybe they just love me. I've already taken about 20 pictures today. Think about how much that's growing. Everyone here has a camera in their pocket right now, and it's all generating data.

So the role of data is changing. It used to be that we would construct the model we wanted first and then collect the data to fit it. I come from a software engineering background, and I would design schemas based on how I wanted the data to look. I'd design a schema for my ordering system, maybe a schema for my customers, and I'd do it ahead of time. That's changing now. The data is coming in, we want to collect it, and the data drives the model. We've flipped it around, because we don't know what questions we're going to ask. This is a move to a data-first philosophy: collect as much data as you can, don't worry about the structure, don't worry about how you're going to query it, and figure that out later.

So you get innovations like this: in Singapore, the government has collected a massive amount of traffic data over the years. Their taxis have sensors tracking wherever they go. They collected all this data without knowing how they were going to use it, and now it's publicly available, so anybody in Singapore who wants to use it to improve life in Singapore has access to it.
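To make that data-first idea a bit more concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the file name, the event fields, the question asked at the end); the only point is that raw events get captured with no schema decided up front, and the structure and the questions come later.

    import json
    import datetime

    def record_event(path, event):
        """Append one raw event as a JSON line. No schema is decided up front."""
        event["received_at"] = datetime.datetime.utcnow().isoformat()
        with open(path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def ask_later(path, predicate):
        """Months later: scan the raw events and answer a question nobody planned for."""
        with open(path) as f:
            return sum(1 for line in f if predicate(json.loads(line)))

    # Collect first...
    record_event("events.jsonl", {"type": "checkin", "venue": "cafe", "city": "Singapore"})
    # ...and ask questions later, in whatever shape turns out to be useful.
    print(ask_later("events.jsonl", lambda e: e.get("city") == "Singapore"))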
Those innovations will be driven not because anyone knew the questions ahead of time, but because the data is there. They can improve the traffic situation; they can improve the taxi system.

So data is now really a first-class citizen. Data ranks alongside people and capital; it's that important. It's the industrial revolution of this decade. Some people say it might be Web 3.0. There are going to be entire economies built on it in the next decade.

Right now we're in a transition. We've got a lot of tools based on traditional IT infrastructure: relational databases, SQL, hard drives, vendors who want to sell you SANs. But the tools are starting to change. I'm sure many of you know about Hadoop. It's the game-changing tool right now, and you can see it as the beginning of a whole set of new tools built on top of MapReduce. I'd compare it to DOS in the 80s, for those of you who are a bit like me, or maybe Java in the 90s. It's that kind of watershed moment.

And why is Hadoop so important? Because it provides a couple of really basic features that enable all these other tools and make it a lot easier to work with big data. One big thing is that it lets you use commodity computing: if you've got an x86 server sitting somewhere, you can use it to process big data; you don't need any special equipment. And if you happen to have a thousand x86 computers lying around, Hadoop will help you use all of them, because the other big thing about Hadoop is that it lets you scale out across machines. Those two features together make Hadoop very, very important. It solves a really basic, fundamental problem: how do you break the data down, move it around, chop it up, slice it, analyze it, and put it back together?

But at the end of the day, Hadoop is still kind of hard. Has anyone here ever managed a Hadoop cluster? Tried to? It's hard, right? There's a lot of setup, a lot of work. And while you're doing that, are you using the big data? Are you actually asking any questions of it? You're not. You're setting up infrastructure. And these skills, the skills of the people setting up Hadoop, are really rare IT skills right now. IT itself is already a specialized skill; big data IT skills, good luck hiring people with that skill set. Do you want them setting up servers, or do you want them using the data? Do you want them building something for your customers?
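Since the talk leans on that break-it-down, analyze, put-it-back-together idea, here is a minimal sketch of the kind of job Hadoop runs: a word count written as a Hadoop Streaming mapper and reducer in Python. The scripts and the choice of counting words are purely illustrative, not anything from the talk.

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by word, so sum the counts per word.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Hadoop Streaming takes those two scripts, splits the input across the cluster, shuffles the intermediate word and count pairs so each reducer sees one slice of the keys, and writes the combined results back out. The splitting and moving around is exactly the part Hadoop handles for you.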
So how does the cloud work? I'm going to speak very generically about the cloud here, not specifically about the Amazon cloud, although that's the one I know best. The way we talk about the cloud is to define it with five customer-centric attributes. The cloud has to be elastic: you should be able to spin up as many machines as you want, anytime you want, and you should be able to shrink that back down for cost savings. If it's not elastic, it's not the cloud to us; if it has limited capacity, it's not the cloud. There should be no capital expenditure and no commitments: no contracts, nothing you have to lay out up front. You should pay on a per-use basis; for every hour you use it, you pay, and when you turn it off, you stop paying. And it should enable faster time to market, because you're not setting up machines. You don't care about that. Well, maybe some of you like building your own machines, and that's fine, but when you want to use that computing power, you don't want to sit there racking machines, cabling them up, and getting everything just right. If you're a scientist, that means you're not wasting time configuring machines; you're getting something done.

So why does that make the cloud big data's best friend? Well, as we said at the beginning, there are a few big pain points: you want to collect the data, store it, organize it, analyze it, and share it. But we all have limited resources; there are only so many people. First of all, if you're not standing up a new cluster, if you're not trying to scale out databases yourself, you're saving a very precious IT commodity: skilled people, who are still extremely important to your organization. If you're collecting data and you realize you're running out of hard drive space, and somebody has to run out, buy more drives, and cable them together, is that person working on delivering value? No, that work is keeping them from it. An interesting statistic you might have seen is that data is going to grow by 75 times, but the number of people available to manage it is only going to grow by about 1.5 times. So that IT commodity is only going to get more precious, and if those people are spending their time on plumbing instead of using the data, you're falling behind. Hadoop is a hard thing to run: you have to program for it, you have to set up the machines, you have to keep it working.

So what happens in the old world, when you're not using cloud computing? Gartner says that in the old world, big IT shops spend about 30% of their time doing something creative, creating something for their customers, differentiating themselves in the marketplace, writing code, and 70% of their time just keeping the lights on, just keeping things running. How many times have you needed resources and had to wait for someone to procure a server? Or your email went down, or your database went down? What we want to do with cloud computing is switch that around. You'll still have to manage some things, there's still some overhead, but we want to flip it so that roughly 70% of your time is actually spent using the data.

It really comes down to four big points. First, the cloud creates reusability. If you set up your infrastructure a certain way, you can snapshot it and reuse it. If you set up a Hadoop cluster, even one you built by hand, you can copy it and replicate it. If you have a production environment that, say, the processing team built and you've hit an issue, you can copy it into your own environment for development.
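As a sketch of that reusability point, here is roughly what snapshotting and replaying an environment looks like on AWS with boto3. The instance ID, names, region, and instance type are placeholders, and in practice you would wait for the image to finish building before launching from it; this is an illustration, not the setup any particular team used.

    import boto3

    ec2 = boto3.client("ec2", region_name="ap-southeast-1")

    # Capture a configured production box as a reusable image (an AMI).
    image = ec2.create_image(
        InstanceId="i-0123456789abcdef0",        # placeholder instance ID
        Name="processing-node-snapshot",
        Description="Copy of the production processing node",
    )

    # Later, once the image is available, launch a development copy of it.
    ec2.run_instances(
        ImageId=image["ImageId"],
        InstanceType="m5.large",                 # placeholder size
        MinCount=1,
        MaxCount=1,
    )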
Second, a lot of cloud services are managed services. That 70% gets lifted out of your hands, and the cloud provider deals with it. For example, relational databases are available as a managed service. At Amazon we've also got something called Elastic MapReduce, which is a managed, hosted Hadoop environment. You don't have to set it up yourself.

Third, scale. You get a massive amount of scale in cloud infrastructure, and it's scale on demand. You'll find that workloads, especially data-processing workloads, are spiky. We've got customers who have to process a massive amount of data on the weekend, and during the week they're done. So how do they scale up? How do they get tens of thousands of servers to process their data, and, just as importantly, how do they get rid of those servers when they're not using them? And it's not only compute: what about storage? How are you going to deal with pulling in terabytes of information?

And fourth, innovation. Do you want to be inventing new technologies to process big data, or do you want to be using that big data? Do you want to be working out how to get more bytes onto a hard drive? Is that your job? For some of you it might be, but for the rest of you, maybe you don't want to work on that. When you use the cloud, you'll find these providers are innovating on behalf of their customers, because customers are asking for it: hey, I want support for this version of Hadoop; hey, I want to use Pig.

The cloud also optimizes your capacity and resources. You don't have to plan far ahead when you use the cloud for big data. You don't have to say to yourself, you know what, I have an idea, let's collect all this data and build a recommendation engine out of it, and I'm going to need to store, say, a thousand terabytes, and then go out and provision all of that, and then maybe the idea doesn't work out, and now you've created risk and waste on hardware you no longer use. With the cloud you're not worried about that, because it has that elasticity.

I talked a little bit about computing capacity. You're going to have a very jagged kind of workload when you're doing big data analytics. It might be an on-and-off pattern, a fast-growth pattern, peaks and valleys, maybe even predictable peaks. That could be an internet startup; it could be an office that's only busy during working hours; it could be a peak that comes along once a year. So what happens with traditional infrastructure? If you provision for the peak, everything above what you actually use is wasted. Those are computers you're paying for that aren't doing anything. You go home for the day and they're just sitting there. Maybe someone comes into the office at night and plays games on them, but they're not making any money. Even with fast growth, where you provision in steps and try to come up with a methodology so you're not wasting as much, you're still wasting; it's basic calculus, the waste being the area between the capacity you bought and the capacity you use. A really bad situation is when you don't have enough capacity, because now you can't do the work you want to do. It's really a lose-lose situation when you're not using the cloud: either you guess, and maybe you're close to right and you've only over-provisioned by that extra 15% or whatever, but you've still wasted all of that; or you guess wrong, and you can't do your job or deliver for your customer.
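To make the Elastic MapReduce point concrete, here is a rough sketch of asking for a managed Hadoop cluster that exists only for as long as the job does. It's written against boto3; the cluster name, bucket paths, instance counts, and the default EMR roles are placeholders and assumptions, not details from the talk.

    import boto3

    emr = boto3.client("emr", region_name="ap-southeast-1")

    # Ask for a managed Hadoop cluster sized for the job; it shuts itself
    # down when the step finishes, so you stop paying for it.
    response = emr.run_job_flow(
        Name="weekend-processing",                  # placeholder name
        ReleaseLabel="emr-6.15.0",                  # an example EMR release
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 20,                    # dial this up or down
            "KeepJobFlowAliveWhenNoSteps": False,   # terminate when done
        },
        Steps=[{
            "Name": "process-weekend-logs",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://example-bucket/scripts/mapper.py,"
                              "s3://example-bucket/scripts/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://example-bucket/raw-logs/",
                    "-output", "s3://example-bucket/word-counts/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])

The same request with InstanceCount set to 200 instead of 20 costs roughly the same per unit of work; it just finishes sooner, which is the cost-versus-time point that comes next.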
So what you need is the ability to provision capacity that follows your demand. That's the power of the cloud: whatever the pattern is, whenever you need to process, your capacity matches what you actually need to get done. And that lets you balance cost against time. Here's an interesting fact, especially when you're using Hadoop or any kind of distributed computing architecture: it doesn't matter how many machines you have running, the cost stays roughly the same. If you've got 500 instances running for one hour, it costs you about the same as one instance running for 500 hours. So you're now empowered to decide how fast you want that data processed. It's your choice, which is great; that's exactly the massive scale-out that MapReduce gives us.

It also allows you to experiment. You don't have to provision a massive amount of physical infrastructure, so you can ask questions you might never have asked before, because the cost is so low. You're not putting up a bid to buy hardware, you're not going through a procurement process; you're empowered to kick off, say, 10,000 servers for an hour and try something out. You can ask new questions, and that allows you to innovate, because you don't have to fear failure and the high cost of failure. I come from the startup world, and we always talk about failing fast. The whole idea is that if you don't really know the answer to something, don't guess: try it, find out from the market, take that information back, and iterate. We do that at Amazon a lot. We'll launch a very basic service and let the market tell us what's needed; sometimes we don't launch quite the right thing, and we adjust. The same applies to big data.

The cloud also lets you collect and store as much data as you want. Storage in the cloud is almost unlimited in scale, so you never reach the point where you have to say, look, I'm out of space, I have to decide what to keep. Collect as much as you want. The cost keeps going down, and the durability of cloud storage is very, very high. I'm going to use an example from one of our storage products, but it applies to many kinds of storage in the cloud. We've got a storage product that's used for quite a lot of big data, called Simple Storage Service, or S3. Does anyone here use Dropbox? Dropbox is all on S3; that's where it keeps all the data it collects. We've also got a company in the U.S. called Netflix. Netflix sometimes accounts for up to 25% of all internet traffic in America; it's effectively the biggest video provider in America now. When people are watching those movies, and when Netflix is working out its recommendations, that's all sitting on S3. They don't have to worry about how it scales out. Behind it there are no doubt an enormous number of drives across many data centers, but Netflix doesn't have to worry about that.
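For a sense of what using S3 looks like in practice, here is a small boto3 sketch. The bucket name, keys, and payload are placeholders I've made up; the calls shown, put_object and get_object, are just the basic store-and-fetch operations.

    import boto3

    s3 = boto3.client("s3", region_name="ap-southeast-1")
    BUCKET = "example-big-data-bucket"   # placeholder; bucket names are globally unique

    # Store an object without thinking about disks, RAID, or how many
    # data centers sit behind it.
    s3.put_object(
        Bucket=BUCKET,
        Key="raw/checkins/2012-06-01.json",
        Body=b'{"type": "checkin", "venue": "cafe", "city": "Singapore"}',
    )

    # Later, a Hadoop job or anything else can read it straight back.
    obj = s3.get_object(Bucket=BUCKET, Key="raw/checkins/2012-06-01.json")
    print(len(obj["Body"].read()), "bytes retrieved")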
The level of scale is so massive that right now, across all the S3 buckets, there are something like 156 objects for every man, woman, and child on the planet, and an object is just any kind of file. You can store a massive amount of data, and not only that, you can access it at a massive transactional rate. If you point any kind of MapReduce job at an S3 bucket, S3 is handling on the order of 750,000 requests per second right now. So it removes one of the bottlenecks when you're trying to process data, and it's accessible from basically everywhere.

Another big problem, besides storing the data, is how you share it. If you've got a massive data set, what's the benefit unless you can share it with other people who can process it? So we have, for example, public data sets stored in the cloud. The Human Genome Project data is out there; you can go and process it any time you want. The Google Books Ngrams data is there, which is essentially the scanned books broken down so you can run tools over them and learn things about the literature. All of that is stored in the cloud with open access. And the interesting thing is that the cost of storing it keeps going down. That matters fundamentally for big data: if the cost of storing data is high, you won't store it, and in a world where big data is such an important pillar of a business, not storing it means the company loses that competitive edge. Now you can store it without worrying.

So I'm going to talk about a couple of different customers that we have. This is by no means everyone who's using the cloud for big data; they're just the ones I'm most familiar with, because I get to work with these fantastic companies. The use cases generally break down into a few areas. There's a lot of targeted advertising. A lot of image and video processing. Oil and gas companies use it for seismic data studies, which is big, of course, in Australia. Recommendation engines for retail. Genome analysis: Harvard University uses it for genome analysis, and hopefully they're going to solve a few medical problems with it, which is fantastic to be able to enable. Financial services do risk analysis. Antivirus companies use it to create new inoculations. And, of course, gaming, with an amazing amount of game data. And social networks: Facebook, Twitter, Foursquare.

So this is Foursquare's data, a visualization of people checking in. It's a little dark, so maybe you can't see it clearly, but you can see how it spreads out. It starts in the U.S., goes over to Europe, gets very big in Japan, and for some reason, and it's not that clear on this slide, it gets really big in New Zealand. I don't know why it's that big in New Zealand; there are only about four million people there. I think there might be more people in this audience than in New Zealand. I guess if you live in New Zealand you say, well, what should we do today? I don't know, let's play Foursquare. It's a great place, no offense.

The next one is a company called Bankinter. They're a bank in Spain, and no, it wasn't the cloud that caused the global financial crisis, so let's leave that aside. They run a large number of Monte Carlo simulations for credit risk analysis, and the more simulations they can run, the more value they get. They need to run about 400,000 simulations to get a meaningful result. On their own infrastructure, that took about 23 hours. With the elastic scale of the cloud, they can do it in about 20 minutes.
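The reason a job like Bankinter's compresses from 23 hours to 20 minutes is that each simulation run is independent, so the work spreads across as many cores or machines as you're willing to rent. Here is a toy sketch of that shape in Python; the loss model, the 2% default rate, and the portfolio size are invented for illustration and have nothing to do with Bankinter's actual models.

    import random
    from multiprocessing import Pool

    def one_trial(seed):
        """A made-up portfolio loss model standing in for the real simulation."""
        rng = random.Random(seed)
        # Count defaults in a toy portfolio of 1,000 loans at a 2% default rate.
        return sum(1 for _ in range(1000) if rng.random() < 0.02)

    if __name__ == "__main__":
        trials = 400_000
        # Independent trials parallelize trivially: more workers means less
        # wall-clock time for the same total compute (and, in the cloud,
        # roughly the same cost).
        with Pool() as pool:
            results = pool.map(one_trial, range(trials))
        print("average defaults per 1,000 loans:", sum(results) / trials)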
Another customer, a company in South Australia, does analytics for wind farms: they work out exactly how to lay out a wind farm for maximum efficiency. Originally they ran their own software on their own infrastructure, and it took about 22 days to process one customer's data. When they moved to the cloud and got that scale, they pushed it down to two days. So imagine, on their own infrastructure, if one customer's data took 22 days, how long would two customers take? You're all good at arithmetic, right? 44 days. And how long does it take them in the cloud? Two days. Three customers would be 66 days on their own gear; in the cloud it's still about two days, maybe a little more cost because more machines are running, but you can see how it scales.

Etsy is a really neat company; I actually use it. It's a little like eBay, but for crafts, all handmade and personal items. When you go to Etsy, you can go to this URL any time you want, and the first thing it does is try to build a recommendation profile for you, from day one. It shows you a set of tiles and you pick the ones that are most interesting to you, and they feed that back in. They use Elastic MapReduce to break that data down and create better and better recommendations. It's really cool, and you can try it out. The other thing they do is gift recommendations. If you allow them to connect to your Facebook account, they'll use that social graph, go through your contacts and their likes and preferences, pull that together, and match it against their recommendations to try to find a better gift for your friends. I can tell you, I'm really, really bad at that. It's the same pattern I talked about earlier: a community-driven recommendation engine, and as they get more data, the recommendations get better.

Razorfish is a very interesting company. They do behavioral targeting analysis: as you traverse the web, you build up a set of preferences, and they process those preferences and make them available to advertisers. So for example, say there's someone in America who's a fan of NFL football. If I can see that this person visits a lot of sports sites, and eventually they go to a gaming site and buy a game, I can feed that back to the gaming company, which can assume that sports games are more relevant to that person. Their workload is extremely seasonal: we have a holiday in the U.S. called Thanksgiving, and right after it is when everybody buys everything, so their usage shoots through the roof. They were facing an investment of about half a million dollars in hardware just to process the data for their customers. Instead, they spend about $15,000 in the cloud to process it, and when they're done processing, they just turn it off.
They were able to improve returns for their customers by 500%. Their customers saw that, by using Razorfish, the chance of a person buying was 500% higher, return on investment was 500% higher, and they're handling about 1.7 million requests a day.

So that's pretty much what I have for you. We've talked about some of the characteristics of big data, why the cloud is big data's best friend, and some real-world uses of it. At this point, if you have any questions, I'm happy to answer them. Here's my contact information; feel free to email me. And if you have some really deep big data questions that I don't know the answer to, we now have someone based in Singapore who specializes in big data on AWS, and I'd be more than happy to put you in touch if that's something you're interested in. Just let me know.