Live from the Moscone Center in San Francisco, California, it's theCUBE at AWS Summit 2015. Welcome back everyone. You are watching theCUBE live in San Francisco for the Amazon Web Services Summit. I'm John Furrier, the founder of SiliconANGLE, joined by my co-host Mark Farley. Our next guest is Matt Wood, the general manager of the data science team at Amazon Web Services. Welcome to theCUBE. Thanks, great to be here. Looking great in your suit there. Look at the tie, he's got the tie on over here. That's a great tie though. You can have it, it's yours. Oh, my pleasure. Let's talk about machine learning, let's get right to it. So, data science. Obviously, great announcements, congratulations. Andy, you can tell he's excited about the sports stuff, the MLB app, and all the new things. One of them was machine learning. Clearly data is big, we heard from Splunk earlier. The Internet of Things is pouring in a ton of data. People are using the cloud to spin up and capture all that data exhaust, all that data, not only what's out there in the Internet of Things, but from apps. So what is going on for your group? Describe what you do. You're the GM of data science. Quickly tell us what that is, and then we can go to some questions about the data. I'd be happy to. So if you rewind three to five years, the role of data was very different inside organizations. It was very different inside governments. It was very different inside academia. What was going on was that data was being generated at sufficient scale that collecting it, computing against it, and then collaborating around it was starting to become challenging. And the truth was that the cost of generating data was becoming lower and lower, so the economics favored more and more data being generated. Now the risk there, and the challenge that a lot of companies ran up against, was that they had large data centers, but the walls of those data centers couldn't move. The resources allocated inside those data centers couldn't change. They were effectively frozen in time. And so what happened was they would collect this data, and then their infrastructure started becoming a rate-limiting step in how they could use that data. Particularly as a lot of data analytics, with the rise of Hadoop and similar software platforms, has become a lot more experimental. Customers want to take data, aggregate it in various different ways, and then play around with it and experiment with it to see what they can do, and then take a lot of those ideas and put them into production. So it's very challenging for an organization to be able to take that data and experiment with it in a constrained environment. So what's happened with the cloud today is that- By the way, in the time period of what? Weeks, days, months, hours? In the time period of experimentation? Yeah, and inside the data center, it would take longer for them- It would take much longer, and in some cases you couldn't even do it, because what happened was customers would start to frame the questions of what they wanted to do with their data not on the questions that they wanted to answer, not on the challenges that their business was facing, but on the resources, the arbitrarily limited resources, that they had available in their data center. So they'd frame the question based on the number of cores they had. At that point you've already lost, you've already lost. 
So when you go from that to thinking much more about a data center whose walls can move, you can store any amount of data, you can ask any question, and everybody gets to focus more on the questions that they want to ask and the answers that will impact their business, instead of spending a whole lot of time feeling artificially constrained. So the cloud removes those constraints, and it allows people to do a lot more with their data and be a lot more productive with any data that they do have. So talk about what's going on here inside the stack, because obviously the evolution of Amazon is pretty amazing, just more and more features being launched and launched and launched. That's right. Basic building blocks, and now you're filling in the gaps. We saw Kinesis last year, Redshift, the fastest growing service in Amazon's history. Machine learning comes out this year as one of the announcements, one of the many big ones, those three big ones, this is one of them. Describe machine learning as a service. Sure. So what does it mean and how do I use it? How do I get on board? Sure, those are all good questions. So one of the trends that we've seen is that as these constraints have started to melt away, more and more developers are interacting with data on a day-to-day basis, and they're doing that in three basic ways. Number one, they're using data warehousing and Hadoop to look retrospectively at what's happened on their platform. So they're doing log analysis, they're using things like Splunk for analysis of operations and everything you can think of. But it's all retrospective, it's what's happened in the past. Second, with things like Kinesis, which is a real-time data streaming service, they're able to work in real time, build dashboards, and see what's happening on their game or their mobile app right now. So the third area is what's going to happen next. How can we make predictions about what hasn't happened yet? And one way you can do that is to build predictive models using machine learning. Machine learning is just a way of identifying patterns in large amounts of data and then using those patterns to predict what will happen next, by applying those patterns to new data that you haven't seen yet. And what we've seen is that developers have a huge opportunity: they've got a lot of data on AWS that they want to be able to use for prediction. But there's a very high barrier to entry to using that data. They have to learn all about algorithms and data transformation, and they have to worry about scale and production systems, and a lot of that isn't in their wheelhouse. So the activation energy is sufficiently high that machine learning just doesn't get used. So what we did today was we announced Amazon Machine Learning, which is a fully managed machine learning service specifically geared towards developers. It helps developers focus on working with their data. It connects directly up with data that's already in Redshift, S3, and RDS. And then it has a collection of visualization and interpretation tools, which allow you to experiment, very quickly understand your data, and then build, train, and validate predictive machine learning models. So you've basically built some algorithms into the stack with developer tools. Pretty much, right? 
Yeah, so basically customers supply the data, and the service runs through it and automatically evaluates that data, creates summary statistics, all the sorts of things that you would want to do as a first pass. And then it presents that in a visualization right in the AWS Management Console. So you can start to experiment, and you can actually interact with and segment your data in interesting ways to prepare it for training your model. Then we have, again, visualization and tools to help validate that machine learning model, and then you're able to take it into production at very, very high scale with real-time predictions. So what kinds of industries or business functions has machine learning been most successful in? We can look at a movie like Minority Report where people were trying to predict behavior. That's probably not that realistic, but predicting behavior is at the core of a lot of this, right? Right, so I think, first of all, Amazon Machine Learning is a very general purpose platform. It's designed to appeal to as many industries as possible and take away a lot of the heavy lifting associated with machine learning. But where we've seen machine learning be successful is in everything from recommendation engines, like when you first sign into Netflix: you're not presented with an entire catalog you have to browse, you're presented with Netflix's best guess of what you'll find interesting, and the more that you use Netflix, the better those recommendations get. So you can do things like recommendations. You can use it for speech analysis. So take a service like Amazon Echo, which is a connected speaker, a connected device, and you interact with it purely by voice. You just say, hey Alexa, play me Foo Fighters. And the- Good choice. There you go, you're welcome. And what Alexa will do is take that speech, convert it to text, and then infer, using these sorts of predictive models, what it is you really want it to do, and it'll go off and find your Foo Fighters album and start playing it for you. So speech recognition is one example. We also use it in fulfillment at amazon.com, using it to power vision systems. So inside our fulfillment centers, we need to be able to take inventory from physical trucks and move that physical inventory from the truck into our fulfillment centers. We wanted to make that as efficient a process as possible, for obvious reasons, so we have all these computer vision systems which monitor that process. And now we've gone from unloading a truck in hours to unloading a full truck in less than 30 minutes. Interesting. So these are automated, intelligent systems, everything from connected devices aimed at consumers, to vision systems inside fulfillment centers, all the way through to large, high-traffic websites that can take advantage of machine learning. So is it safe to say you took some algorithms that you'd already used internally and just bundled them in? Is that pretty much the base code? Yeah, I think, I mean, I wouldn't say just bundled, that's a bit reductive. There's a lot of- I didn't mean to oversimplify. It's just some of what you guys have done for your own- Yeah, we took all of our experience of delivering this stuff in an easy to use environment at scale. And we basically do binary classification, so male or female. We do category classification, horror or comedy. And then we do regression as well, so you can predict a numeric value, 
like the temperature it'll be in your house tomorrow. So those are the three things we're starting with. They're built into the platform right now, and you can work with very, very large, 100 gigabyte data sets and start plowing those into these models to build a predictive API. So talk about the developer team now. Let's just make something up: a typical developer team might not have a data scientist or a machine learning algorithm developer, or some might. Usually it's one guy among a bunch of other full stack developers. So what would be the role of that guy? Can they still add to the code? How does that- That's a good question. So what we're seeing is that developers are very excited about this platform. I was actually just at the AWS booth, which is just behind you, John, and there's a queue of people waiting to ask questions about Amazon Machine Learning, which is great. So developers are definitely excited, but we also see that data scientists, those people that do have more in-depth knowledge of the algorithms, are able to use this as a service to apply more of their differentiated skills and remove some of the undifferentiated heavy lifting, like building those summary statistics. Their role is in interpretation, versus the actual deployment of a cluster to build summary statistics against hundreds of gigabytes. So it's seamless to that developer. So there's a- It's seamless to that developer, but machine learning experts will also find value, because they have to do less of the undifferentiated work associated with building out their models. So the question I'm getting on the CrowdChat is supervised versus unsupervised, explain that. Does that matter in this machine learning? Just to explain it, there are basically a couple of different ways you can do machine learning. One is supervised, where you basically have data, you use that data to train a model, and then you validate that model against your data. This is what the Amazon Machine Learning service does today. It takes a portion of your data, 80 to 90% of it, and uses that to train the model. Then you take the remaining section, 10 to 20%, and use it to validate the model and test its predictive accuracy. And there are various cross-validation techniques that you can apply to that. Unsupervised learning is very different: it doesn't start from any known answers, and you don't have labeled training data to learn from. So we're very much focused on the algorithms I mentioned earlier. On the unsupervised? On the supervised. On the supervised, okay. The goal is to be able to take data which is already available. Okay, got it, yeah. And then provide a low friction way to start analyzing it. Unsupervised is much harder because you pretty much have to find your own patterns. It's just a separate problem, a different problem set. It works well for a subset of problems, but we started where the meaty challenges are. And for developers, who are handling a lot of data, it's about doing this in a way that's easy to use, fast and quick for them, and low cost, so they can actually put this stuff to use in production. 
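As a rough illustration of the supervised approach Matt describes, holding back a slice of labeled data to validate a trained model, here's a minimal sketch in Python. The CSV file, the column names, and the scikit-learn logistic regression are all illustrative assumptions, not how Amazon Machine Learning works internally.

```python
# Minimal sketch of supervised learning with a held-out validation set.
# The file, column names, and model choice are hypothetical; input columns
# are assumed to already be numeric.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")        # one labeled example per row
X = df.drop(columns=["will_churn"])      # input columns
y = df["will_churn"]                     # the column we want to predict

# Train on roughly 80% of the data, hold back 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the held-out 20% to estimate predictive accuracy (AUC here).
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```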
Yeah, so if you are a developer, what would you recommend as the best way for somebody to get their arms wrapped around this, so they can start learning about it? I mean, to implement machine learning, you have to learn about it first, right? What path makes sense? That's actually one of the benefits of the platform: you don't have to go deep on machine learning in order to be able to start applying it. Okay, tell us more. Yeah. So all you have to do is provide a data source. You provide that to the platform, the platform will go off and learn the structure of the data for you, and it'll start to make recommendations on transformations it can apply to the data to make it more useful. Then you validate the model, and then you just put it into production. It's as easy as that. And we've actually built this directly into the console. If you don't have data available and you've never used machine learning before, we have a walkthrough in the console with some pre-formatted data already stored in S3, and you can just run that data through the machine learning service to get a feel for how the models work and how validation works, a risk-free environment to build an actually useful model at the end of it. So is this mostly unstructured data then that you're looking at? It's anything you can put in a spreadsheet, think of it that way. So columns: you're basically saying, here are all the columns of my data in my spreadsheet, and here's the one column that I want you to predict. And then the service figures out the relevant weighting to give to the columns you do have data for, to make a prediction on the one you don't. Okay, so let's bring it to the next level. Redshift, Kinesis. We said on theCUBE two years ago, when Kinesis was launched, that it kind of closes the loop. It's really interesting, right? You get a lot of data streaming in, and Redshift obviously stores the data in a data warehouse and so on. How are those impacted? How does that factor in? The natural reaction to this is, you've got more tooling going on at the machine learning level that extends beautifully into real-time and mobile computing. Break that down for us. How should we think about that? Sure. So I spend a lot of time with customers, and the way I'm hearing customers describe it is that their big data needs, or usage, break down into three different areas. The first is that they're building sources of truth inside their organization. So that's a single canonical store for a particular type of data, and that may be in S3, it may be in Redshift, it may be in RDS, it may be in DynamoDB. And then right next to that, they're using a lot of real time. So this is Kinesis, to both collect information, do some processing, and then store the results of that processing, usually in DynamoDB or in Redshift, as you said. And then the third piece is using large scale distributed software frameworks such as Hadoop in order to run experiments. We think of these as task clusters: clusters which are specifically designed and configured for a single task. They may run for an hour, they may run for several weeks, but they're built for that specific task, so you can get the right balance and mix of resources inside them, while the majority of the data resides in your sources of truth. So you have the sources of truth, you have real time on top of that, and then you have these experimental task clusters with things like Hadoop sitting right next to it. And then there's kind of a fourth piece on top of that, which is using that data in a slightly different way: not just collecting it, not just storing it, not just running reports on it, but using it to predict what's going to happen next. And that's where machine learning fits into it. And that's why, from the get go, we've integrated Amazon Machine Learning with S3, Redshift, and RDS, because those are the primary sources of truth. Yeah, that's a really interesting relationship there. 
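To make that flow concrete, here's a minimal sketch using the boto3 machinelearning client: point the service at a CSV in S3 with a schema that names the target column, train a binary model, evaluate it against a held-out data source, and then call a real-time prediction endpoint. The bucket, column names, IDs, and schema details are illustrative assumptions, and a real script would wait for each resource to finish building before moving on.

```python
# Minimal sketch of the Amazon Machine Learning workflow described above.
# Bucket, IDs, and column names are placeholders; the schema layout is an
# assumption about the service's expected format.
import json
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# The schema describes the spreadsheet-style columns and names the one to predict.
schema = json.dumps({
    "version": "1.0",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "targetAttributeName": "will_churn",
    "attributes": [
        {"attributeName": "age", "attributeType": "NUMERIC"},
        {"attributeName": "plan", "attributeType": "CATEGORICAL"},
        {"attributeName": "will_churn", "attributeType": "BINARY"},
    ],
})

# 1. Point the service at training data that's already sitting in S3.
ml.create_data_source_from_s3(
    DataSourceId="ds-train",
    DataSpec={"DataLocationS3": "s3://my-bucket/train.csv", "DataSchema": schema},
    ComputeStatistics=True,  # produces the summary statistics shown in the console
)

# 2. Train a binary classification model from that data source.
ml.create_ml_model(
    MLModelId="ml-churn",
    MLModelType="BINARY",
    TrainingDataSourceId="ds-train",
)

# 3. Validate the model against a held-out data source (built the same way).
ml.create_evaluation(
    EvaluationId="eval-churn",
    MLModelId="ml-churn",
    EvaluationDataSourceId="ds-validate",
)

# 4. Turn on real-time predictions and score a new record.
endpoint = ml.create_realtime_endpoint(MLModelId="ml-churn")
result = ml.predict(
    MLModelId="ml-churn",
    Record={"age": "42", "plan": "premium"},
    PredictEndpoint=endpoint["RealtimeEndpointInfo"]["EndpointUrl"],
)
print(result["Prediction"])
```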
All right, so I've got to ask you a personal question. Just take your Amazon hat off, put your data geek hat on, tech hat, PhD, academic hat, and do a hot tub time machine back to the 80s, okay? Go back to the 80s, think about what was available then, and then come to today and say: what's different, and what are you going after right away? What are you getting your arms around? What are you chewing on? What are you wrangling? I actually have a really good example for this. I did my PhD in machine learning at the University of Nottingham in the UK, did it for four and a half years. And in that time, I was trying to use machine learning to predict how proteins fold. So you take a string of amino acids, you stretch them out, there are just 21 choices for each amino acid, you drop it into water, and they always fold into the same shape. And that shape confers how they interact in the body. It's what makes your hair blonde or your eyes blue. And the trouble is that you can't always get a good sense of that structure by traditional methods. So I was interested in building computational methods which predicted how that would work. And I spent four and a half years really working that process, writing the apps, building the neural networks, scheduling things on a cluster, and so on. In C, written in C? It was, yeah... actually it was Fortran. Fortran, oh my God. Yeah, exactly, Fortran, and then Perl scripts for the data manipulation. So yeah, it was a process which yielded a result. But if I could go back and use something like Amazon Machine Learning, I believe that would have been just one of the things I'd have been able to get done. Rather than it taking four and a half years, I probably would have had my whole thesis done in a couple of weeks, and I would have been able to iterate on it and build on top of it for the next couple of years. I'd probably have advanced much further than I got in that time. So there's an order of magnitude shift in productivity. So the bottleneck was twofold, from what I heard you say: one, coding, and then the iteration feedback. Yeah, it took so long to actually run these things, it was so heavyweight, and there were such limitations on the resources available to me, that everything just took much, much longer. So the opportunity to iterate was very low, because the rate of iteration was so slow. But what we see time and again is that customers value fast iterations. They value being able to fail quickly. And if you're experimenting and truly innovating, you're going to fail a lot of times. So being able to understand when that's happening and get to the next iteration, and the next, to find the one that works, and to do that as quickly as possible, is extremely important. So back in those days, you had to develop your own algorithms, I'm sure. Yeah. And now today, does somebody still think about algorithm development? Yeah, I mean, it's a flywheel. There's more and more work being done on deep algorithm development, and there are more and more opportunities to apply those algorithms across everything from aerospace to retail to space exploration, you name it. 
And so there's this kind of beautiful flywheel of people doing the algorithm development and then people applying those algorithms in various different domains, and they kind of feed off each other, around and around. So there's still a huge amount of innovation to be done. And we see our role today as breaking down the biggest walls to adoption of machine learning, by making a service which is high scale and high performance, but beautifully easy to use. So would somebody go to GitHub, or where does somebody find the source of algorithms to start? Yeah, the computer science literature. I think even GitHub, I mean, there's some good stuff on GitHub, don't get me wrong, but even GitHub is not yet in the daily cycle of academic work. So you have to go back to the literature, you've got to read the papers, you've got to gen up on this stuff. And that is a huge barrier to people that just want to get started with the data that they've got. Time and again we hear that customers might find that of academic interest, but they're really keen to actually use this stuff in production, to make it useful inside their organization. That's the sweet spot for Amazon Machine Learning. It's been great to have you on theCUBE. We're getting the hook here, but I want to ask you one final question, throw it out there, see if you can wrestle this one down. Advice and commentary on the global landscape for a developer. I mean, you can be global in a second now, certainly on Amazon. What's the data challenge globally? The 10X startup, as Marc Andreessen calls it, 10 guys doing the work of 300, which means heavy duty, full stack developers. Amazon's been a great friend to the full stack developer, and now you've got more stuff in there with machine learning and whatnot. The whole world's their oyster. What is the geo challenge for developers? Data science means there's a lot of stuff going on. What's super interesting, John, is that the walls between individual geographies are melting away in the same way as the walls of the data center. In the keynote this morning, Andy welcomed on stage the CTO of AdRoll. AdRoll is an absolutely fantastic company. They do real-time ad bidding, and they have a large real-time bidding platform that they have to maintain the state of across the entire world. So they operate across all of our global regions, so that they have the latest bids and ad inventory available to customers all around the world, and they can deliver the ad retargeting recommendations to customers in a very low latency way. They're routinely shipping and distributing and sharing and synchronizing that entire complex platform, at thousands of records every second, across different regions. So to think in terms of geographic distribution is to limit the way that modern applications can be built. And things like DynamoDB Streams and Lambda all help in that synchronization task and in building these truly global, cloud native applications. 
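As a rough sketch of the synchronization pattern Matt points to, here's what a small Lambda function subscribed to a DynamoDB Stream could look like, replaying writes from one region's table into a replica table in another region. The table name and target region are made up, the stream is assumed to include new images, and a production replicator would also handle deletes, retries, and conflicts.

```python
# Minimal sketch of cross-region replication with DynamoDB Streams and Lambda.
# Assumes the stream is configured to include new images; table names and
# regions are placeholders.
import boto3

REPLICA_REGION = "eu-west-1"
REPLICA_TABLE = "bids-replica"

dynamodb = boto3.client("dynamodb", region_name=REPLICA_REGION)

def handler(event, context):
    """Invoked by a DynamoDB Stream; copies new and updated items to the replica."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is already in DynamoDB's attribute-value format,
            # so it can be written to the replica table as-is.
            dynamodb.put_item(
                TableName=REPLICA_TABLE,
                Item=record["dynamodb"]["NewImage"],
            )
```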
It's interesting when you think about what you have to store and manage and process: different geo codes, different countries, Germany, Ireland, different customers, different ads, different inventory, different bids, different websites. Yeah, it's exciting. It's a great time to be a CUBE host, because there's so much to talk about, Matt. Well, thanks for coming on theCUBE. I really appreciate your time, I know you're super busy. Thanks for sharing with the audience. Thanks for being a great guest. Getting all the data here at the Amazon Web Services Summit. This is theCUBE. We'll be right back with our next guest after this short break.