Thank you all for having me here. I consider this to be somewhat of an academic conference, if I may, and it has been a long time since I talked to academics. I hope I have a story for you that you can find some utility in, and that maybe gives you some ideas about the IT side of things. As Marco said, I have been the Chief Technology Officer of Amazon.com for the past 14 years. And you may ask yourself: what is a bookseller doing here at the Rice Congress? Well, about 10 or 12 years ago we pioneered a concept called cloud computing: basically removing the need for anyone to own their own IT infrastructure, so that you can get databases, storage, compute, networking, and security on demand. That has become a very big part of the Amazon business, and literally millions of businesses around the world are running on what is called Amazon Web Services, AWS. It's something you'll see coming back in this presentation a number of times.

I'm very fortunate that it is not only millions of businesses running on this platform; a lot of non-profit organizations and research organizations are making use of it as well. One of the ideas behind cloud computing is that it reduces cost significantly while giving you access to almost infinite capacity for doing computational research. And as a research organization, any dollar you can shave off can go toward doing research instead of, in effect, funding hardware companies by buying hardware. So there is a great match between the on-demand, infinitely scalable capacity that a cloud computing environment gives you and computationally heavy research. And most research these days is computationally heavy, because it revolves around data.

Jim Gray famously called this the fourth paradigm of research. The first paradigm, thousands of years ago, was all about observation: Newton saying, "I saw an apple come down." Hundreds of years ago, the second paradigm was all about theory. Then, tens of years ago, the third paradigm really focused on finding computational proofs for the theories we had. But today everything is shifting to fact-based research: everything is becoming data-driven, and with that, everything is becoming computation-driven. So there is a big synergy between organizations like ours, like AWS, who provide capacity around the world, and this new computational research, whether it is done at academic institutions or at commercial ones.

It's very fortunate that all these amazing organizations are running on AWS and making use of AWS, whether that is, on one hand, for the ERP systems they use to connect different organizations together, or for actually doing hardcore genomics research. One of the organizations we're really proud of, and that has been a great partner of ours for a long time, is IRRI. You probably all know IRRI much better than I do, but the support for the poorest farmers in the world, and building research programs around that, is very close to Amazon's heart as well. And we'll do whatever we can to support this kind of research.
One of the areas where we're working with researchers around the world is making sure that the data they produce can be made available to everyone. Many of these data sets are becoming so large that it's very hard for anyone to send them around; there's a great anecdote about the 1,000 Genomes data set that I'll come to in a bit. But more importantly, IRRI, together with two other research institutions, has made the 3K RGP data set available on AWS. This is well over 3,000 sequenced rice genomes from 89 different countries; in essence, about 30 million different variations that you have access to. And this is a data set that is available for everyone to access on AWS. I think these things are crucial to really accelerating research. We see this across the board, not only with the great data set that IRRI makes available on AWS, but across many other disciplines as well: as soon as you make data available on a platform like AWS, research truly explodes, because now everyone has access to it.

An interesting thing is that you also have to think about what format you put your data into, so that computational processing of it becomes very, very simple. For example, IRRI has built a web-based interface to the data sets that live in S3, which you can use to immediately search for all the different variations you're looking for. So it's not just that you have access to the data set; you already get web-based portals that give you access to it as well. It's great work being done by IRRI, and we're really proud at Amazon that we're able to support this through the Open Data program.

If I think about how these data sets are growing over time, it's interesting to look back at the Human Genome Project. If you remember, it started off with the sequencing of one human genome, which literally took 13 years to complete. The resulting data set was actually put on an iPod, which was then sent around the world to different researchers, who could take the data off it. Of course, that data set wasn't terribly big, though gigabytes in those days certainly felt big. But as the human genome effort evolved, it was clear that doing that one human genome had been way too much work; you couldn't spend 13 years on every genome. So everybody moved on to simpler organisms: mice, rats, things like that. And that data set got into the hundreds of gigabytes. Then the next generation happened, where new sequencing instruments became available and sequencing genomes became extremely cheap. The result of that is already getting into terabytes, and it's becoming harder and harder to download even a terabyte of data, especially once you start talking about hundreds of terabytes. If you look at the 1,000 Genomes Project, the resulting data set of 1,000 genomes is actually 200 terabytes. And again, this is one of those data sets that is available on AWS for everyone to access and operate on, and it drives an enormous amount of research, partly because everybody is now operating on the same data set. Research in the past often happened on your private data set.
And maybe your results would be published in a paper, but there were very few people actually publishing their data sets as well, so that research could be repeatable. Part of the whole open data effort that we run on AWS, and that other organizations run as well, is to drive repeatable research. If you look at the different types of organizations putting their data on a cloud provider such as AWS as open data, it can be research organizations like the Met Office in the UK, or companies like DigitalGlobe, who do all the satellite imagery and who also rely for a very large part on open data. We make all of this available for anyone to use. If you go to this website, opendata.aws, you'll find a whole collection of data sets with detailed descriptions of exactly how to access them, the kind of licensing around the data sets, and also what kind of formats they come in. We have quite a wide variety of data sets available here, from the US government, to NASA, to IRRI, to tax offices in the US and other countries, and this is just a very small sample of the data sets that are available. So if you're a researcher who is thinking about publishing, and also thinking about making your data available, I urge you to contact us and work with us to make your data available for every other researcher to access as well.

The thing is that most research these days is really collaborative. This is jokingly called Joy's Law, after Bill Joy, who was a co-founder of Sun Microsystems, a company that no longer exists. He was smart enough to say that no matter who you are, most of the smartest people work for someone else. That means you can always augment your research by working together with other people. However, traditional data acquisition was hard: it went from tape to boxes, with so much research funding being wasted as indirect funding for the computer hardware industry, that we really need to move away from it. One thing is, of course, that much of this data is already being generated in the cloud itself. DNAnexus, which worked with IRRI on the 3K RGP data set, actually did all the processing of that data in the cloud. They used 37,000 cores for two days to do all the processing necessary to make this data set available for everyone else. All of this is really happening in the cloud itself, and as such it's a great, centralized place for much of this research to take place.

I've been a researcher for a long time myself, and I know how much time I wasted on what I would now call undifferentiated heavy lifting. I was a computer science researcher at the time, and the amount of time I spent putting cables together, putting servers together, building clusters, all these kinds of things, was terribly interesting, but so terribly wasteful from a research perspective. I learned a lot about keeping hardware up and running, but that had nothing to do with the research I was doing. So there's a lot of undifferentiated heavy lifting in and around all this data research that I think we really need to move away from.
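To make the idea of "data already sitting in the cloud" concrete, here is a minimal sketch of reading from a public open-data bucket on S3 without any servers or credentials of your own. The bucket name and prefix below are assumptions for illustration only; the actual location of the 3K RGP files is documented on the data set's page at opendata.aws.

```python
# Minimal sketch: list and fetch objects from a public open-data bucket on S3.
# Bucket name "3kricegenome" and the prefix are assumptions for illustration.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public open-data buckets can be read anonymously, without AWS credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under a prefix to see what is there.
resp = s3.list_objects_v2(Bucket="3kricegenome", Prefix="9311/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Once you know a key, a single call downloads the file locally.
# s3.download_file("3kricegenome", "some/key.fasta.gz", "local_copy.fasta.gz")
```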
And so the cool thing is, of course, that sharing this data in one particular location, first of all, gives you access to a whole set of services and tools that you did not have available before, whether that is the more traditional analytics tools that Amazon provides you with, or all sorts of partner tools and open source tools. If you're a user of, for example, HTCondor, with one click of a button you can launch a whole HTCondor cluster on AWS without ever having to instantiate the machines yourself. So it significantly lowers the cost of research, but it also really accelerates research, because everybody now has access to the same data sets.

I want to point out one particular data set that I really like: Landsat. The Landsat satellite, number eight I think, is circling the Earth at the moment. What NASA did was put all the data from Landsat as an open data set on AWS, starting I believe with the 2015 data, and after that, every new image that the Landsat satellites produce immediately gets added to this data set. Landsat is the largest collection of imagery of the Earth's land seen from space. It's an amazing data set. What's interesting is that we worked with NASA to find a format for the data that contains enough metadata and other components with each image that it becomes very simple to build web-based systems directly on top of it, without needing to run any servers at all. You can just create a web page, point it at this data, and build, for example, a tiler like this one. There are no servers involved here at all. The data lives in what we call Amazon S3, the Simple Storage Service, which is directly internet accessible, and together with some serverless components, what we call Lambda, that allows you to build applications like this without ever having to run a single server yourself. That immediately brings the data in front of every researcher, or anyone who is interested in it. This particular tiler was built by the folks from Mapbox, who by the way are a great group of individuals.

Figuring out what formats to use is really important. This is the example of someone who normally had to access the data set from a traditional environment, moved it over to AWS, and was able to process all the imagery in a much more efficient way because the metadata now sits with the image itself. He shaved about 250 days off his research purely by using more IT-optimized imagery instead of the traditional imagery that comes from the satellite itself.

There's one whole group of open data sets that I would like to point you to, because it may be of interest to you: what we call Earth on AWS. It's a collection of public data sets of geographical imagery, whether that's climate models, elevation models, or satellite imagery, available for everyone to use. If you are interested in doing research with this data, go to this URL; we may have research credits available for you to actually work on this data.
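As a rough sketch of what "no servers involved" looks like from the client side, here is how a script (or a Lambda function) might browse Landsat scenes directly in S3. The bucket name "landsat-pds" and the scene prefix are assumptions for illustration; the current bucket and layout are described on the data set's registry page.

```python
# Rough sketch: browse Landsat 8 scene files straight from a public S3 bucket.
# Bucket "landsat-pds" and the path/row prefix are assumptions for illustration.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Each scene is a folder of GeoTIFF bands plus a small metadata file, so a
# web client or Lambda function can fetch exactly the tiles it needs.
prefix = "c1/L8/139/045/"  # hypothetical path/row prefix
resp = s3.list_objects_v2(Bucket="landsat-pds", Prefix=prefix, MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```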
Now, I said earlier that a large part of this is about the tools that are available on AWS, and I think we can help you get the most out of your data. We have all sorts of different storage options, and I talked about S3 already. We have mechanisms to hook up a whole set of different analytics tools to those storage options, whether your data actually lives in the cloud or on premises; we make all of these things available for you. Machine learning is becoming an increasingly important part of research, and I'll talk a bit more about that in a minute. Many of you have been running your own hardware, and there are quite a few researchers who are always interested in exactly what kind of machinery they need for their particular workloads. Well, we have a whole family of instances available for you: whether you want to do massive in-memory processing, whether you want to use GPUs or general-purpose GPUs, or whether you want to put your work onto FPGAs, all of it is available. Get in touch with us if you want research credits to make use of any of this machinery.

One thing I want to talk a bit about, because it is so hot at the moment, is machine learning. I'm not really sure how many of you in the room are aware of what machine learning is. It's part of what we call artificial intelligence, and that scares everybody immediately; you start thinking about robots and Skynet and things like that. It's not that. If I think about analytics, there are roughly three different types. There's analytics that looks at the past: the traditional data warehousing kind of analytics. There is analytics that looks at the current state of affairs: what is happening right now, in real time; you're not interested in yesterday's inventory level, you're interested in your inventory level right now. And then there's the third part of analytics, which is making predictions. It turns out we're really bad at making predictions. But what we can do is take data from the past and use it to make an educated guess about the future. That's what machine learning is all about. At some point, data sets become so large and so complex that it's no longer possible for a human to just write the algorithm. You actually want the computer to figure out what the most important aspects of this particular data are: what should we be looking for if we're looking for, say, objects in imagery, or doing inventory-level planning, things like that.

At Amazon, we have a very long history with machine learning. If you've ever been an Amazon retail customer, you've been exposed to machine learning for the past 25 years: all the things we pioneered around recommendations, for example, or all the robots running in our fulfillment centers. And we've done really big new innovations driven by machine learning, whether that's the drones, or Alexa, the voice assistant, or recently our Amazon Go stores, where we use video and machine vision to allow you to walk into a store, take things off the shelf, put them in your bag, walk out, and get charged automatically. All of those applications are machine learning in one way or another: operating on very large data sets and discovering what is in there for you to use.
Well, those were all big innovations, but we've been using machine learning all over the place for a long time, whether that is things like delivery lead-time estimation, visual search, or product matching; all of these are machine learning driven. However, even though there were many innovations in that world, it turned out these were pretty heavyweight processes, because you needed data scientists for them. These were data scientists writing new algorithms to process your data, using TensorFlow or MXNet or Caffe, any of the popular frameworks out there. But it was very, very heavyweight. So what we were looking for was a mechanism by which every developer could make use of machine learning, because it's such an important part of an organization like Amazon, where we have very large corpora of data and would like to make use of that data to serve our customers better.

For example, we literally sit on billions of orders, and we know which of those orders in the past were fraudulent and which ones were not. You can use machine learning to build a model out of that, so that when a new order comes in, you can give it a score for the likelihood that it is also a fraudulent order. It's all about discovering patterns in that large historical data set that you can then use to score new cases. It's not that you automatically reject that order; it still goes to a human investigator, but it augments the humans. The important thing is that we want every developer to be able to do this. So we started developing services around it, and with these new services, suddenly you no longer needed to be a data scientist to make use of machine learning. Everyone could do it. And we saw literally an explosion of hundreds of new applications within Amazon because of the use of machine learning, whether that's counterfeit goods detection, or display ads, or all sorts of other things that are driven by machine learning. And we have literally tens of thousands of customers on AWS doing machine learning as well.

So with all of that, what have we learned, both from the Amazon experience and from what we see all these customers on AWS doing with machine learning? Machine learning needs a new stack, because the old way of working is getting outdated. Nobody can really operate anymore purely at the level of the traditional data scientist, although we do need to keep supporting them; we need to start putting machine learning in the hands of everyone, including you as scientists and researchers. In essence, this is what machine learning is: you take your data, you run it through a number of algorithms repeatedly with a whole set of different parameters until the accuracy you get out of it is something you're satisfied with, then you create a model, you deploy that model somewhere, and you ask questions against it. New data comes in, you run it against the model, and you make predictions based on it. Just like I said with the order data set: you can start predicting whether new orders are fraudulent or not. Or it might be something else, like an abusive review detection mechanism. There's a whole range of things across the board that you can do with machine learning.
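To illustrate the fraud-scoring idea just described, here is a toy sketch of learning from labeled historical orders and scoring a new one. This is not Amazon's actual system; the features and numbers are made up, and scikit-learn is used purely as a stand-in for whichever framework you prefer.

```python
# Toy sketch: learn from labeled historical orders, then score new ones,
# instead of hand-writing fraud rules. Features and data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical orders: [order_value, account_age_days, shipping_distance_km]
X_history = np.array([
    [25.0, 900, 12],
    [1900.0, 2, 8500],
    [60.0, 450, 40],
    [2400.0, 1, 9100],
])
y_history = np.array([0, 1, 0, 1])  # 1 = known fraudulent, 0 = legitimate

model = LogisticRegression()
model.fit(X_history, y_history)

# A new order comes in; the model gives a fraud likelihood. Anything above a
# threshold goes to a human investigator - it augments people, it does not
# automatically reject orders.
new_order = np.array([[2100.0, 3, 7800]])
print("fraud probability:", model.predict_proba(new_order)[0, 1])
```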
So our mission for machine learning at AWS is really to put it in the hands of every developer and every scientist. For that, a few new pieces of the stack are necessary. At the bottom, there are still the traditional frameworks that the data scientists use. On top of that, you build a platform that makes it easy for everyone to use these machine learning algorithms. And then there's a whole set of services on top of that again: if you don't even want to run these algorithms yourself, but want to use pre-built models, we have those available for you as well.

So let's take a look a little deeper. The framework level is still where the data scientists normally do their work. If you've ever done machine learning, you know these very well: Caffe, MXNet, and TensorFlow are the well-known frameworks and interfaces out there that most data scientists use. We run them on infrastructure with the latest GPU boards you can get, which gives you roughly a petaflop of compute per board, with lots of cores and very fast access to memory, so you can build these graphs and networks very easily. Again, this is the level where data scientists want to work; they really want to tinker with creating new types of algorithms.

But in reality, if you look at what we really do in machine learning, what you do is create your data sets and work hard on data quality, which is very unique and specific to the particular data sets you have. That's something it is hard to help people with, because it depends on the kind of data you have. Then you lay out what kind of algorithms you want to use, and you start to train. And training, in my view, is extremely dumb: it's basically running the same algorithms over and over and over again with different sets of parameters. Then, with the resulting model, you pump some historical data in to see whether you can predict the past, and see how well your model does against your test set. And once you have that model, you have to deploy it somewhere, because you actually have to ask questions against it.

So this is what we see: in essence, 80% of machine learning has nothing to do with machine learning. It's just heavy lifting, undifferentiated heavy lifting. You have to get your data somewhere, you have to repeatedly execute the same algorithm over and over again, you have to go back to the drawing board, and then you have to run the result somewhere. All of these things are actually pretty hard to do. So we developed a platform for that, called Amazon SageMaker, that basically takes all the heavy lifting of machine learning away for you. What you do is start off with, in general, a Jupyter notebook, where in a few lines of Python you describe exactly what you want to do; you can use the notebook as your execution environment. You pick the algorithm you want to use. And this is where the difference with the data scientists is: they have the data science skills to create new algorithms, whereas here, as an engineer, you basically pick the right algorithm you want to use.
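As a minimal sketch of that "pick an algorithm, train, deploy" flow from a notebook, here is roughly what it looks like with the SageMaker Python SDK. The role ARN, bucket names, and data paths are placeholders, and the exact arguments can differ between SDK versions; treat this as an outline rather than a recipe.

```python
# Minimal sketch of training and deploying with a SageMaker built-in algorithm.
# Role ARN, bucket, and data paths are placeholders for illustration.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical role

# Pick one of the built-in algorithm containers, e.g. k-means clustering.
container = image_uris.retrieve("kmeans", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-output/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(k=10, feature_dim=30)

# One call starts a managed training job against data already sitting in S3.
estimator.fit({"train": "s3://my-bucket/training-data/"})

# One more call hosts the resulting model behind an endpoint you can query.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```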
And we built some very high-performance algorithms that were not available anywhere else; we call them streaming algorithms. Because, in essence, data sets are getting larger and larger, and when, for example, you get a new set of satellite images in, or we get a new set of orders in, you do not want to retrain on the whole data set. That would be very expensive, both in time and in money. So we've built these new algorithms for you to use, where you can basically stream your data through, checkpoint it, and then start adding new data to it. So you pick the algorithm you want to use, and then with one click of a button you start the training. You can keep optimizing with all the parameters you have until you get a model that you like. And then again, with one click of a button, you can deploy it. Deploying basically means putting the model somewhere so that you can ask questions against it, whether you do that in real time, like scoring one order as it comes in, or in batch; if you have massive amounts of data you want to push through your machine learning model, there are batch facilities for that as well.

As for the algorithms: if you want to detect objects or locations or faces in imagery, you may pick a convolutional neural network; for grouping data, k-means clustering; if sales prediction is what you want to do, maybe linear or logistic regression. You just pick the algorithm that works best for you. The whole point is that the hosting of the algorithms, the compute, the execution, and all of the model management are things you no longer have to worry about in a machine learning platform such as SageMaker. That's what SageMaker gives you: it takes away all the heavy lifting around machine learning so you can execute your research.

And, as I said, I want to put this in the hands of every developer, but not every developer even needs to train new models. There are many occasions where, for example, you just want to do image recognition. You may have 1,000 images, or 10,000, or maybe a million images that you just want to push through a standard model. For that, we've created a whole set of application services for you, whether that's image recognition or video recognition, all sorts of things you can do there. We have mechanisms for speech-to-text and text-to-speech, automatic speech recognition, text processing; all of these services are available for you without having to become a machine learning expert. If any of these are of interest to you, I urge you to go to the AWS machine learning pages, where you get all the details on how to get started.
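As a rough illustration of using one of those pre-built application services instead of training anything yourself, here is a short sketch calling Amazon Rekognition, AWS's image recognition service, through boto3. The bucket and object names are placeholders.

```python
# Rough sketch: label an image with a pre-built model (Amazon Rekognition)
# instead of training your own. Bucket and key names are placeholders.
import boto3

rekognition = boto3.client("rekognition")

# Ask the pre-trained model to label whatever it sees in an image stored in S3.
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-field-photos", "Name": "paddy/plot-42.jpg"}},
    MaxLabels=10,
    MinConfidence=70,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```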
I want to close off by talking about another customer that we have on the AWS platform, an organization that I met with last month in Jakarta. They're tackling one of the hardest problems there is, namely the generation of data: how can we generate data from, let's say, the poorest farmers in the world, who may not even have an identity, who may not have a computer, and who definitely do not have a cell phone? How can we start generating data there, such that these farmers get included in all sorts of opportunities? I'm very fortunate that I can talk about these guys, because I think it's relevant for your world as well. After all, they're targeting rice farmers, not only in Indonesia; there are projects like this going on in Colombia, in Uganda, and in other places as well.

They built most of their systems around blockchain technology. I'll skip the details; let's just say that it is an indestructible ledger, so anything you put in it can never be erased anymore, which is important for identity, and important for traceability of data over time and really establishing its lineage. Hara's mission is to try to revolutionize the agriculture sector with data, and to find mechanisms and processes by which this data can actually be generated. The challenges in Indonesia are huge. There is hardly any information available about the poorest farmers in the world, and they're not able to produce, or to get access to resources, the way farmers in many other countries can. The things I have on this slide are things you in the room probably all know about, but much of it is driven by the lack of information and the lack of data available to any of the other players that might be interested.

So they created a system that consists of four different stakeholders. One is the data providers. That is not necessarily the farmer themselves; it is most likely an agent who operates in the neighborhood or the village, who has the Hara app on their phone, and who collects all the data from the farmers: polygon tracking of the rice fields they have, tracking the yields of the crop, and things like that. Then there are data qualifiers, who look at the data and validate that it is correct and in line with other data you're seeing. And then there are data buyers, which in this particular case are, for example, banks and insurance companies. Many of these farmers really have no access to financial resources at all, and given that there is almost no identity, there is hardly any information about the effectiveness of a farmer, so it's very hard for them to get access to loans. Most of them have to go to loan sharks, who charge something like 2.5% a month. Now, with all of these mechanisms, their identity is created and their fields are being tracked, so they have the ability to offer access to this data through the data they provide, and that gives the data buyers the ability to start providing loans and microloans to individuals in ways they never could before. These are the typical numbers: the majority of these farmers do not have a bank account, mostly because they have no identity whatsoever, and many of them have to borrow money just to be able to do their job. And it turns out that, working with these financial institutions and this data that comes directly from the farmers, the repayment rate on the microloans that are now going to these farmers is almost 100%, mostly, I think, because the interest rate on them is something like 0.3% a month instead of the 2.5% that the loan sharks charge.
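To put those two monthly rates side by side, here is a tiny back-of-the-envelope sketch, using the rates quoted above compounded over a year; the principal is made up for illustration.

```python
# Back-of-the-envelope comparison of the two monthly rates quoted above,
# compounded over a year. The principal is a made-up illustration.
principal = 1_000_000  # e.g. one million rupiah

def owed_after(months: int, monthly_rate: float) -> float:
    """Amount owed after `months` of monthly compounding at `monthly_rate`."""
    return principal * (1 + monthly_rate) ** months

loan_shark = owed_after(12, 0.025)  # ~2.5% per month
micro_loan = owed_after(12, 0.003)  # ~0.3% per month

print(f"loan shark after a year: {loan_shark:,.0f}")  # about 1,345,000
print(f"microloan after a year:  {micro_loan:,.0f}")  # about 1,037,000
```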
Tracking all of this, together with the high repayment rates, means that these data sets are becoming increasingly important for everyone in Indonesia, because they really drive access for the farmers to things like insurance and microloans. Suddenly the whole issue of farm subsidies can be tracked as well, because data is now available from all these farmers in an immutable way, so you really know that this data was generated by the farmer and not by anybody else. These farmers, of course, cannot initially be incentivized just by the future promise of a loan. Instead, they get incentivized through a loyalty point system, which is really simple: basically a stamp card, where every time data gets collected from your farm you get a stamp, and once you collect enough stamps you can use them to buy fertilizer, farm equipment, and things like that. So there is an incentive to start contributing this data, and it immediately impacts farmers today. I think this is an amazing system: using out-of-band mechanisms to start creating data sets that can become increasingly important for our understanding of how the poorest farmers in the world are actually farming their rice. And the same goes for the future of it: once this technology is available, it becomes much easier for these farmers to participate in IoT projects that generate data automatically, instead of having agents walking around drawing polygons or looking at crop yields. With all of that, this new data, combined with all sorts of other research data and satellite imagery, creates a large data set that is unparalleled in this world, and hopefully it can drive a whole lot of future work on precision agriculture and the like. I'm really proud, again, that at AWS we're able to support an organization like this, one that truly gives data access to the poorest farmers in the world. So with that, thank you for your attention. I hope you picked up something from this, and I truly enjoyed being here. Thank you.