Thank you very much. I do indeed remember that was my first trip to Barcelona — a magical city. I'm going to try to cover a whole bunch of different things today that I think will be of interest to this sort of audience.

How many of you have noticed an increase in machine learning in the last, say, five or ten years? Yeah, okay — me too. Here's a bit of quantitative evidence about the output of machine learning researchers around the world. This is a graph showing how many machine learning research papers are posted on arXiv every year, in a few of the machine-learning-related subcategories. The thing you'll notice is that the rate of growth is faster than the Moore's-law exponential growth rate of computing that we got so nicely accustomed to for so many years, and which has now slowed down. So we've sort of replaced computational growth with growth in the output of research ideas, and hopefully that will be equivalent in utility.

The reason machine learning — and in particular the subfield called deep learning, which is really a rebranding of a lot of ideas about artificial neural networks invented 35 or 40 years ago — is making a big impact in the world is that these systems can learn from very raw forms of data, and can learn to do very powerful things. There are a lot of new ideas, but fundamentally a lot of the algorithms we're using today are ones that were developed 35 or 40 years ago; we just lacked the computational power to really make them sing on real-world problems. They were really interesting even then — in fact, I did an undergrad thesis on parallel training of neural nets, because I felt like we just needed a bit more compute for them. I was excited as an undergrad, and I tried to use the 64-processor machine in our department to train bigger neural nets than we otherwise could at the time. It turns out we needed about a million times as much compute, not 64 times. But we now have that, starting around 2008, 2009, 2010, and that has created a lot of really interesting applications of machine learning.

So one way to think about a neural net is that it can learn really interesting and complex functions — not like y = x², but functions where you put in raw pixel values and get out a prediction of which of ten thousand different objects the system knows about is in that image: "there's a leopard." You can put in raw audio waveforms and get out a transcript of what is being said — "how cold is it outside?" — purely from training on example pairs of audio waveform and transcription. You can do things with language: "Hello, how are you?" as input, and it can spit out a translation. And these are all trained end to end.

In the past we've often hand-engineered pieces of systems and stuck them together. Speech systems for many years had an acoustic model with a language model stuck on the end, and language translation systems had a complex mix of different statistical models, phrase tables, and dictionaries. Today you can just train end to end on lots of raw examples — with about 500 lines of TensorFlow code you can train a translation system that is higher quality than those older systems. You can even do more interesting things than just categorizing an image: you can, for example, take in an image and generate a simple sentence as a caption — "a cheetah lying on top of a car." This is actually one of my vacation photos, which was quite a remarkable sight.

In the field of computer vision, we've made tremendous strides thanks to the use of deep neural nets. Stanford runs a contest every year called the ImageNet challenge, where the goal is to take a million training images in a thousand categories, and then make predictions on a test set that is not in the training set.
You then need to predict, for a given image, what the category is. In 2011 the winner had a 26% error rate. We know what human performance looks like from a nice write-up by Andrej Karpathy, who was a Stanford grad student at the time: he subjected himself to the right kind of machine-learning training protocol, studying a set of training images for 120 hours and then labeling the test images himself. He got 5% error. He convinced one of his lab mates to do this as well; they didn't study quite as hard — only 10 hours of training — and got 12% error. So this is not a trivial task: you have to be able to distinguish 40 breeds of dogs, and so on. In 2016 the winner got 3% error. So we've gone from 26% error — computers couldn't really see very well — to 3% error — computers can now actually see and perform this task. That's not the same as general human vision, but it's still a pretty remarkable advance in the capability of computers to see the world and sense what's going on around them.

One of the ways I like to look at the advances in machine learning is through the list of grand engineering challenges that the National Academy of Engineering put out in 2008 for the 21st century — they sort of burned eight years of the century by waiting that long to put it out. They got together a panel of experts from all different fields and said: here are 14 things that we as the broader engineering community should really be working on. I think it's a pretty nice list — there are a lot of things about improving health care, improving education, and making the planet a better place to live. Within our research teams at Google we have projects tackling some of these challenges, listed in red, and I'm going to focus on the two listed in boldface today.

So advanced health informatics is one of them, and I think this is a real opportunity for the world to use machine learning to improve the quality of the decisions we collectively make in treating people and keeping them healthy and happy. I'm going to give you a few examples of particular applications of machine learning to health care that I think are indicative of where this field is headed.

One problem we've been working on in this space for a fairly long time — about the last three or four years — is the diagnosis of a disease called diabetic retinopathy, which is the fastest-growing cause of blindness in the world. There are about 400 million people in the world with diabetes, and those people should really be screened every year for signs of diabetic retinopathy. Screening is the key to preventing blindness: it's a degenerative eye disease, so if you catch it in time it's very treatable, but if you don't, you can suffer full or partial vision loss. Yearly screening is what catches the condition while it's still treatable. The retinal image that is examined to assess this looks like the picture on the right. In India, for example, there's an estimated shortage of more than 120,000 eye doctors, and therefore 45% of patients suffer vision loss before they're diagnosed. This is a completely preventable outcome — if we could screen everyone and get them treated, that number would be way lower.

The way this is diagnosed is that an ophthalmologist looks at these images and grades them on a five-point scale, from no DR all the way to proliferative DR, and it's a somewhat subjective assessment. In fact, consider what happens if you show one of these images to two different ophthalmologists.
They agree on the rating only 60% of the time. And if you show the same image to the same ophthalmologist a few hours apart, they agree with themselves only 65% of the time. This is kind of tragic, because the difference between a grade of two and a grade of three is "go away and come back in a year" versus "we'd better get you into the clinic next week."

It turns out this is a general image-classification problem, amenable to machine learning using computer-vision models. You can take an essentially off-the-shelf, general-purpose computer-vision model and, instead of training it on a thousand general categories, train it on these five labels — no DR, mild DR, and so on. You get the images labeled by board-certified ophthalmologists, and because of the variance in their opinions, you actually need each image labeled by several of them: if five of them say it's a two and two of them say it's a three, it's probably more like a two than a three. Then it turns out you can get a machine learning model that can run on a desktop machine or even a phone, and it's basically as accurate — or perhaps slightly more accurate — than the average US board-certified ophthalmologist at this task. And that's great.
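That label-aggregation step — several graders per image, collapsed into a single training label — can be sketched in a few lines. This is a hypothetical illustration, not the actual pipeline; taking the median of the ordinal grades is one simple way to realize "five say two, two say three, so probably a two":

```python
from statistics import median

def adjudicate(grades):
    """Collapse several ophthalmologists' grades (0 = no DR ... 4 =
    proliferative DR) into one training label via the median, which is
    robust to a minority of outlying opinions on an ordinal scale."""
    return int(median(grades))

# Five graders say 2, two say 3 -> the adjudicated label is 2.
label = adjudicate([2, 2, 2, 2, 2, 3, 3])
```

A mean would blur the ordinal grades into fractional values; the median keeps the label on the original five-point scale.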
That means that all of a sudden we could have the ability to screen people despite the tremendous shortage of ophthalmologists that exists in the world.

If you want to go further: there's a subspecialty of ophthalmology called retinal specialists, who have more training in retinal diseases. If, instead of getting seven independent assessments from ophthalmologists for each image, you get three retinal specialists in a room and say "you have to come up with a single number for this image" — that's called an adjudicated protocol — then you can train a model that is on par with retinal specialists, the gold standard of care, rather than with general ophthalmologists. That's the kind of care you would want for everyone in the world, and it's possible to get there with careful design of the machine-learning data set, careful modeling, and careful consultation with retinal specialists about how best to go about this. I think this is indicative of a lot of medical imaging problems: if you get the right kind of data, you can build something that really helps ophthalmologists make the right decisions, or, in places where there aren't enough ophthalmologists, enables others to make those decisions.

The other cool thing we've seen is that you can find things in retinal images that ophthalmologists and retinal specialists don't know how to see. This is a bit of a tale of scientific discovery. We had a new person join our ophthalmology machine-learning research team, and to get them up to speed with how to train models in our pipeline and use our software infrastructure, Lily Peng, who leads that team, said: why don't you try to predict age and reported biological sex from the data we have? We had a bit of data where, for each image, we knew the age and sex of the owner of the eye, and we would try to predict them. Lily thought that, since ophthalmologists don't know how to determine sex from a retinal image, the AUC should be 0.5 — no better than flipping a coin. So the person went off, and came back a few days later and said: okay, I've got everything working, and my AUC is 0.7. And Lily said: oh no, that can't be right — go check everything and come back; maybe you're testing on the training data or something. They came back and said: okay, I fiddled with some things, and now it's 0.8.

That was a sign that there's perhaps a lot of information in the eye that we didn't know existed. It turns out you can predict a whole bunch of things from a retinal image that are all predictive of your cardiovascular health. There's a test called a five-year MACE score where normally you would draw blood, send it to the lab, wait 24 hours, get the result back, and then have an assessment of your cardiovascular risk. It turns out we can get the same level of accuracy about cardiovascular risk from a single retinal image. So this might be something where you go to the doctor, they take retinal images, and they can assess interesting things about the rest of your health from them — particularly if you start to have a longitudinal series of retinal images. We think that could be quite interesting.

Similar stories are playing out in pathology. Here are some examples of detecting cancer metastases. Pathology images are very large — 100,000 by 100,000 pixels — and in this case we had actual pathologists label the data at the pixel level, circling the bits that are tumorous in the training images. With just a few hundred training images — admittedly very big ones, annotated at the pixel level — we were able to get a tumor-localization score that's better than a pathologist's at this task. We've also built a prototype of an augmented-reality microscope: it takes what a pathologist is looking at in the microscope, uses a mirror to capture that image, routes it through the model, and overlays the model's predictions on the image in real time as you move the microscope stage around or zoom in and out. It can say: hey, you should really pay attention over here, or: this looks okay. We think that's a pretty interesting direction, because a lot of microscopy — a lot of pathology in the world — is not yet digital; it's done on traditional optical microscopes.

Okay. Another kind of medical task is more about the abstract information in medical records. Given a patient's medical history and the symptoms they're reporting now, a doctor is trying to assess: how should I treat this patient? How is their disease or condition likely to evolve in the next year? Can we predict the future from the patient's current state? One of the things that's happened over the last few years is that we've developed more powerful machine-learning methods for sequential prediction: given some sequence of data, can we predict another sequence? In the case of translation, that's an English sentence in and a French sentence out; in the case of medical records, given some part of the record, can we predict the rest of it?
So this is a good example of the kind of work we try to do at Google: basic research that, if we make progress on it, we can then apply in lots of different ways. This was a paper published at the end of 2014 on sequence-to-sequence learning — so-called seq2seq models — by my colleagues Ilya Sutskever, Oriol Vinyals, and Quoc Le, and it turns out to be pretty useful. We've used it in a lot of places in Google products. In Gmail we now suggest responses when you get an email message, so that it's very convenient on your phone — you don't have to actually type. A message comes in saying "Can you join us for dinner on Thursday?" and we suggest "Sure, I'd love to! What can I bring?" or "Sorry, I can't make it" as possible responses. We've been using the same idea for machine translation, and if you focus it on healthcare-related tasks, you can use it in the way I was describing: predicting future aspects of medical records, either individual events or more abstract things like "will this patient develop diabetes in the next 12 months?" given the information we know about them so far.

So here are a whole bunch of applications of this kind of approach: given a corpus of de-identified medical records, can we predict interesting aspects of the future? Is this patient likely to be readmitted if I release them from the hospital now? How long are they likely to stay in the hospital? What are the most likely diagnoses? It could be that there's a 99% chance it's this thing but a 1% chance it's that thing — and the clinician treating the patient may never have even heard of that thing, because it's a fairly rare condition. Maybe that will trigger them to do more tests or learn more about that condition. This is a collaboration we've had with the UCSF, Stanford, and University of Chicago medical centers.

We published this paper at the end of last year, and I'll just highlight one aspect of it: mortality-risk prediction. One of the things doctors want is an assessment of how seriously ill a patient is, so they can focus their attention on the patients who need it most. This is already something where hospital systems use predictive models, so that's the baseline you see in dotted lines, at different times relative to admission to the hospital — and as more information comes in during the stay, the predictive accuracy goes up. What you see is that if, instead of the roughly 30 variables that go into that relatively simple predictive method, we use all the data in a patient's medical record to make the prediction — on average about 200,000 data points per patient — we get much more accurate assessments of mortality risk, and in fact 24 hours' earlier notice at the same level of accuracy. That really enables doctors to focus their attention earlier on patients who are critically ill, and we think that would be really useful in clinical care settings.

Okay. A lot of the advances in what we want computers to be able to do depend on understanding text at a better level than we can today, and there have been a bunch of recent encouraging improvements.
In the same way that the seq2seq model was a nice piece of basic research with a bunch of applications, some work in 2017 by my colleagues at Google developed a technique called the Transformer model. Prior to this, seq2seq models were recurrent: they consumed words or symbols sequentially, one at a time, updating some internal state before going on to the next symbol and updating the internal state again. That's inherently quite expensive, with all these sequential dependencies in the computation. The Transformer model is very nice because it consumes an entire sequence in parallel and then uses an attention mechanism to focus on different pieces of the sequence when making various kinds of predictions about it. There's a diagram of the rough model structure — you can stack these blocks one on top of the other to get a deeper Transformer model. I won't go into the details, but what you see on the right is that this model achieves significantly higher accuracy — in this case translation accuracy — with ten to a hundred times less compute than the best prior state-of-the-art models. That's a huge algorithmic leap forward, because all of a sudden you have a hundred times as much compute that you can apply elsewhere: training on more data, using a larger model, whatever makes sense.

Then, building on that, some other researchers at Google developed a technique called Bidirectional Encoder Representations from Transformers, which is a bit of a mouthful.
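The attention mechanism at the heart of the Transformer can be sketched in plain Python. This is a minimal, illustrative scaled dot-product attention for a single query, not the batched multi-head version used in practice, and the function names are my own:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector: weight each
    value vector by how well its key matches the query."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# A query that matches neither key attends to both equally,
# so the output is the average of the two value vectors.
out = attention([0.0, 0.0],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
# out == [2.0, 3.0]
```

Real Transformers apply this with learned projections of the input onto queries, keys, and values, many attention heads at a time — but the core idea is just this weighted lookup.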
So we call it BERT. Each one of those "Trm" things in the diagram is a Transformer module, and the key aspect of BERT is really how it's trained. In a traditional language-modeling task, you take the words to the left — "Obama was born in 1961 in Honolulu" — and try to predict the next word from everything you've seen so far. Because BERT is bidirectional, there's a pretty clever technique you can use instead: mask out ten to twenty percent of the words, consume the whole sequence, and then try to predict the missing words. If you ever played Mad Libs as a kid, it's sort of like that, and it's actually pretty hard. If you try it yourself — "Obama was ___ in 1961, ___ ___ after the territory was admitted to the ___ as the 50th ___" — you need a fair amount of language understanding and contextual knowledge of the surrounding words to fill in those blanks correctly.

That's basically how you train this model: take the original words, remove ten to twenty percent of them, feed the sequence through a large model, and try to predict the missing words. You can consume large amounts of text this way, self-supervised — any piece of text works; you drop out different pieces of it and predict what's missing. And what we found is that an incredibly powerful recipe for language tasks is to pre-train a model with this fill-in-the-blank objective on large amounts of text, and then fine-tune it on individual language tasks — say you have a bunch of reviews and you want to predict whether the sentiment of each review is positive or negative.
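The masking step of that pre-training objective can be sketched as follows. This is a simplified illustration — the real BERT recipe masks about 15% of tokens and sometimes substitutes random words instead of always writing a [MASK] token:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the
    originals as prediction targets, BERT-style."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model must recover this word
        else:
            masked.append(tok)
    return masked, targets

sentence = "Obama was born in 1961 in Honolulu".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
# The training loss is computed only at the masked positions.
```

Because the targets come from the text itself, any corpus becomes training data — that's what makes the objective self-supervised.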
You often don't have very much data for that kind of fine-grained task, but you can fine-tune this pre-trained model on individual language tasks with very small amounts of data and get really good results. Here's a graph for a collection of different language tasks — the General Language Understanding Evaluation (GLUE) benchmark — and what you see is that the BERT model made pretty significant improvements. The number of examples for each task is shown just under the headings, ranging from two and a half thousand to three hundred and sixty-three thousand, and these are pretty large improvements in language results. So that, I think, is pretty exciting.

One of the things we think is important is having better data sets for language evaluation, and so just this year we published an open-source data set called Natural Questions, by the folks listed here. It was actually quite a lot of work to put this data set together — it's about 300,000 training examples — and I'll give you just one example of the kind of thing that's in there. Question: "Can you make and receive calls in airplane mode?" Then you're given a piece of text that is supposed to help you answer the question as training data. The answer is no — but that "no" actually requires a pretty strong amount of understanding, because the passage doesn't say "yes you can" or "no you can't"; it says airplane mode "suspends radio-frequency signal transmission by the device," which is a relatively indirect way of saying no. So we think this is a pretty interesting data set for language understanding. If you look at the leaderboard, you can see the impact BERT is having, because most of the leading entries seem to involve BERT in various ways. So we think that's an interesting data set if you're doing question answering.

Okay. One of the other grand challenges was a kind of catch-all: engineer the tools of scientific discovery. I feel like they threw that in as the 14th thing because it sounded like a good catch-all bucket. One thing that's clear is that if machine learning is actually going to be a significant part of how we tackle some of these grand challenges, it's important that we have tools that make expressing machine-learning ideas easy. So one of the things we've been working on for quite a while is infrastructure that lets us express machine-learning research ideas and apply machine-learning models in our products in various ways. We've gone through a couple of generations of our deep-learning software infrastructure, and when we started to build the second generation, TensorFlow, we decided to open-source it, because we felt there were all kinds of uses of machine learning in the world that would benefit from an open standard for expressing machine-learning computations: a way to exchange open-sourced models, a community developing and improving the infrastructure, and a much broader community of people using
that infrastructure for all kinds of interesting things. And we've been pretty happy with how the community has developed — it's really taken off. When we first released it, there were a bunch of other machine-learning packages around, but TensorFlow has, I think, spoken to people because it's both a good way of expressing research ideas and good for getting production-style machine-learning models into lots of places: running on phones, in data centers, on GPU cards, on more customized hardware. It has a fairly vibrant community now. One thing I'm really proud of is that of the roughly 1,900 contributors to the core TensorFlow system, only about a third are from within Google; the rest are external.

And people have indeed used it for all kinds of interesting things. There's a company in the Netherlands building fitness sensors for cows: using machine learning, they can tell you which member of your hundred-strong dairy herd is not feeling so well today because it's walking a little strangely. There's a collaboration between Penn State and the International Institute of Tropical Agriculture that has built an on-device computer-vision application for detecting cassava disease and telling you how to treat that particular plant disease. I think that's a really good example of how machine learning wants to run everywhere in the world — not just in data centers, not just where you have network connectivity, but in all kinds of places, in the middle of a cassava field in Tanzania without network connectivity. So it's really important that we build tools that enable those kinds of uses.

In our own work at Google, we've been using machine learning for things like better flood forecasting. The monsoon season in India is coming on — it's actually already started — and we've been able to use machine learning to make more accurate predictions of exactly where flooding is going to occur, enabling people to get very focused, fine-grained alerts on their phones. In the past the alerts have been coarse enough that people tend to ignore them, because maybe they really aren't at risk; here, if you get an alert, you should probably pay attention.

Okay, let me touch on a few pieces of work and how they fit together. One thing we've seen is that we want really, really large models — to understand and absorb large textual corpora, for example — but we want that huge capacity not to be called upon in full for every example. We know that human and other real nervous systems only call on a tiny fraction of their capacity at any given moment. I have a visual center that's good at detecting garbage trucks — it wasn't activated a moment ago, although now it is — and another part that's good at Shakespearean poetry, and by enabling and disabling these different pieces, the brain gets to be so power-efficient. We mostly haven't done much of that in the machine-learning models we've built with artificial neural networks or other kinds of models.
So Here's a bit of work where we tried to sort of use this Basic inspiration of having a large model, but only partially activating it for any given example So the pink things there are normal neural net layers and what we've Developed is something called the mixture of experts layer Which you can think of as a whole bunch of different neural networks each expert has a whole bunch of parameters And there are a lot of them so there's two thousand of them So this has quite a lot of parameters eight billion parameters in them in a machine in part of a machine learning model is actually quite a lot and What we found and there's a gating network the green thing there It's going to learn which expert is good at which kinds of Context or which kinds of examples and it's going to learn to send it to the ones that are that are best for that particular example And that whole thing is going to be learned jointly simultaneously And what you see is that the experts actually do develop different kinds of expertise. 
So these are examples of the textual contexts that get routed to different experts: expert 381 is really good at language about research and innovation and science; expert 752 is really good at "playing the critical role" in something; expert 2004 is really good at fast, adverb-y things. And that's what you want — different pieces that get called upon at the right moment but mostly sit idle.

The results show that because you have all this capacity in the sparsely activated part, you can actually shrink the pink portion of the model, so it does less computation than the baseline. In this table the baseline is the bottom row and the mixture of experts is the top row. What you see is that we can use half the compute per word versus the baseline, but get an increase in accuracy of roughly one BLEU point — which is actually a pretty big deal; our translation team gets excited about one-BLEU-point improvements — at one tenth the training cost: one day on 64 GPUs versus six days on 96 slightly different GPUs. So we think this is a pretty interesting direction to be thinking about for really large models.

Another area we're particularly excited about is automating some aspects of solving machine-learning problems. The current way you typically solve a machine-learning problem is: you have some data set you care about — maybe it's a supervised problem, so you have input–output pairs; you have some computational devices — some GPU cards on your desktop, or some cloud resources spun up; and then you have an ML expert sit down and run 50 experiments. They look at the results and they say, okay,
Well these five work pretty well Let me run some more experiments kind of roughly in the space of where those five Ended up and they repeat and then hopefully you end up after a bunch of iterations with a solution to the problem We care about So what if we could turn this into? data and computation Give you an equivalent or perhaps even better solution That would be pretty great because ML expertise is in quite short supply in the world And there's millions and millions of problems around the world that should be using machine learning to tackle them You know if you think about the organizations and businesses around the world There's probably tens of millions of businesses that each have a machine learning problem But probably only a hundred thousand of them realize they have a machine learning problem And probably only 10,000 of them have the capacity to actually solve that machine learning problem with a machine learning expert in-house so One of the ideas we've been exploring is how can you Take some of the decisions a machine learning expert would make which is what model structure? Should I use to tackle this particular problem? You know how many should I train a nine-layer model or a 14-layer model should have three by three filters or five? 
by five filters at layer seven? There are lots and lots of decisions like this. The basic idea behind neural architecture search is that we have a model-generating model. It generates a bunch of candidate models, and we try those on the problem we actually care about: we generate ten models, train them for a few hours, see what their accuracies are, and then use those accuracies as a reinforcement learning signal for the model-generating model, steering it toward parts of the model space that are working well and away from parts that seem not to work well for this problem. Then you repeat, quite a lot. It comes up with kind of weird-looking models; a human would probably not sit down and wire up a model exactly that way. But it does show intuitions similar to what human researchers have come up with. Take the ResNet work, for example, a popular computer vision model.
There's a thing called a skip connection, which lets data either flow through a layer or skip it; those are very structured connections. Here, the search has come up with very long-range skip connections that mimic the same idea: you want more direct paths for information to flow from the image toward the output of the model. Now, let me take a moment to explain this graph. Accuracy is on the y-axis and the computational cost of the model is on the x-axis. Each black dot is a pretty significant advance in the state of the art for ImageNet models over the last five or so years, typically the product of a lab of computer vision or machine learning researchers, and generally you see that more computational cost buys you higher accuracy. Each of those dots represents years of collective effort by some of the top machine learning and computer vision researchers in the world. And this is what happens when you apply AutoML: the interesting thing is that you get better results both at the high end, where you care less about computational cost and want to maximize accuracy, and at the low end, where you want a very computationally cheap model that is as accurate as possible.
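Stepping back, the generate-evaluate-steer loop behind these results can be caricatured in a few lines. This toy sketch uses evolutionary-style random mutation over a made-up two-knob search space, with a fabricated stand-in for "train the candidate and measure validation accuracy"; the real controller is a learned model trained with reinforcement learning, and each real evaluation is hours of training:

```python
import random

random.seed(0)

def evaluate(arch):
    """Stand-in for 'train this candidate and report validation accuracy'.
    Entirely fabricated: it pretends ~12 layers with 3x3 filters is best."""
    return -abs(arch["layers"] - 12) - 2 * abs(arch["filter"] - 3)

def mutate(arch):
    """Propose a nearby architecture, i.e. search around a good run."""
    return {
        "layers": max(1, arch["layers"] + random.choice([-2, -1, 1, 2])),
        "filter": random.choice([3, 5, 7]),
    }

# Start from random candidates, then repeat: evaluate everything,
# keep the best five, and sample new candidates near the survivors.
population = [{"layers": random.randint(1, 30), "filter": random.choice([3, 5, 7])}
              for _ in range(10)]
for generation in range(20):
    population.sort(key=evaluate, reverse=True)
    survivors = population[:5]                 # "these five work pretty well"
    population = survivors + [mutate(random.choice(survivors)) for _ in range(5)]

best = max(population, key=evaluate)
print(best)  # the search concentrates near the fabricated optimum
```

Even this crude version shows the shape of the method: keep the best few, search near them, repeat; swapping the mutation step for a learned controller gives the reinforcement-learning flavor described in the talk.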
Maybe you're running on a cell phone in the middle of a cassava field. So these results are quite promising, and I think what they really show is that the AutoML reinforcement-learning-based system can simply run more experiments in a practical amount of time than human machine learning researchers can. And we've actually turned this into a real product, Cloud AutoML. It's not just for vision; we've been branching out, adding the ability to use AutoML for video, for language and translation, and we recently added support for relational tables. If you have a bunch of columns of data and you want to predict another column, say all this information about customers and which ones are going to spend a lot, you can use this kind of approach, and it generally works pretty well. This is a long-running research thread within Google. We've been exploring evolution rather than reinforcement learning for the search, with some interesting observations. You can learn not just the model structure but also the optimization update rules: things like SGD, SGD with momentum, and Adam are merely particular symbolic update rules that humans have come up with, and it turns out that if you explore the space of symbolic update rules, you can find ones that work better than some of the handcrafted ones. You can incorporate both inference latency and model accuracy into the reward function: say you're working on an autonomous vehicle or a cell phone with a limited compute budget, so the model has to run in seven milliseconds; you can then come up with the best possible model that runs in seven milliseconds or less. You can learn data augmentation policies. Typically in computer vision you have a hand-coded set of ways of transforming the raw data to stretch the amount of use you
get out of each labeled example. So you have a picture of a leopard: half the time you flip it horizontally and half the time you don't, sometimes you brighten or darken it, and sometimes you turn it into a black-and-white image instead of color. It turns out you can learn a distribution over sequences of such transformations that is more effective than the handcrafted ones people come up with. And to make all this more computationally tractable, you might want to explore many different architectures simultaneously rather than running one-off experiments. So there's a whole bunch of interesting work to be done here; there's still more to do, but the ability to solve machine learning problems in a more automated way is, I think, a good and interesting direction. Okay. One thing we've seen is that more computational power tends to help these models. Generally, if you're able to train on more data in the same amount of time, or train a larger model, accuracy goes up: make your model bigger and increase the amount of data, or just make your model bigger and keep the data the same, assuming you're not overfitting. And so deep learning is really transforming how we think about computational devices. General-purpose CPUs are good at running all kinds of code, twisty, branchy C++ or whatever; GPUs are good at rendering graphical images and happen to be pretty good at the kinds of computations you want in machine learning models. What we've found is that deep learning models, all the algorithms and models I've described today, have two really nice properties. One is that reduced precision is perfectly
okay. If you have one decimal digit of precision, most of these algorithms and models are perfectly happy; you don't need five or six or seven digits. That's important because, intuitively, it's easier to multiply two one-digit numbers than two five-digit numbers. The other is that they're all made up of different ways of composing a handful of very specific operations, typically matrix multiplies, vector dot products, and other linear algebra operations. So if you can build computers specialized to do low-precision linear algebra and nothing else, you have a great building block for machine learning computations. Part of our work in this area was motivated by early signs of success suggesting these things were going to be quite powerful and applicable to a wide variety of problems: we were seeing early success with speech recognition and image recognition, and starting to see success in various language problems. So I did a thought exercise in 2012. If speech works a lot better, people might use it a lot more. Do some quick back-of-the-envelope calculations with a hundred million users talking to their phones for three minutes a day. At that time the phones were not powerful enough to run the machine learning model on-device, so you needed to send the audio to a data center and run the model there. If we ran those speech models on CPUs, we would need to double the number of computers in Google's data centers just to launch the better speech model, which is a bit scary: even if it were economically reasonable, it still takes time to pour concrete and wait for it to dry. So, because of the restricted properties of the computations we actually wanted to run, we thought it made sense to explore custom ASICs that are really good at low-precision linear algebra. And so the TPU
(Tensor Processing Unit) is the family of chips and systems that we've designed over the last five or six years. The first one was aimed at inference: you have a trained model and you just want to apply it to a problem like speech recognition or image recognition. That needs even lower precision than training, and it turned out to be the most pressing problem for us initially. So we deployed these TPU v1 cards, and they've been in production use for about the last four years: on every search query, for machine translation, speech, and image recognition, and they were used in the AlphaGo matches that our DeepMind colleagues played against Lee Sedol in Korea and Ke Jie in China. There's an ISCA paper with more details about that machine. The second system we built was targeted at both training and inference. This is the TPU v2 device, which has four chips on it, all interconnected with a super-high-speed fabric. If we dive inside one of those chips, the nice thing about the design is that it's actually pretty simple. It has two cores, and each core has a giant matrix multiply unit: every clock cycle you can multiply a 128-by-128 matrix with another one. Then there are scalar and vector units for things that are not matrix multiplies, plus a very simple core that can do everything else, and HBM, high-bandwidth memory, for very fast storing and loading of data and parameters. Each chip is 45 teraflops, so the whole board is 180 teraflops. A follow-on version, Cloud TPU v3, came one year later; it's basically the same design with some tweaks. We added water cooling.
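Those v2 headline numbers can be sanity-checked with back-of-the-envelope arithmetic: a 128-by-128 systolic array performs 128 x 128 multiply-accumulates per clock, each counting as two floating-point operations. The clock rate below is my assumption, not a number from the talk:

```python
# Back-of-the-envelope check of the quoted TPU v2 numbers.
macs_per_cycle_per_core = 128 * 128                     # one 128x128 matrix unit per core
flops_per_cycle_per_core = 2 * macs_per_cycle_per_core  # a multiply-add is 2 flops
cores_per_chip = 2
clock_hz = 700e6                                        # assumed clock, not from the talk

chip_flops = cores_per_chip * flops_per_cycle_per_core * clock_hz
board_flops = 4 * chip_flops                            # four chips per board

print(f"per chip:  {chip_flops / 1e12:.1f} TFLOP/s")   # ~45.9 vs. the quoted 45
print(f"per board: {board_flops / 1e12:.1f} TFLOP/s")  # ~183.5 vs. the quoted 180
```

Landing within a few percent of the quoted 45 and 180 teraflops suggests the matrix units dominate the chip's arithmetic throughput; the exact figures depend on the actual clock frequency.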
Water and computers normally don't mix, but here they do. Well, they don't actually mix; they just touch gently. And that's 420 teraflops in that board. Unlike inference, training is the kind of thing where you need to think about more than a single chip, because you want to solve bigger and bigger problems. Essentially, you'll never have enough compute on a single chip, because if you did, you'd want to solve bigger problems, and then you'd need more chips connected together in some way. So these are designed to be connected into larger configurations that we call pods. A TPU v2 pod was two racks of computers plus two racks of TPUs, 256 chips; the v3 pod is a bigger-scale thing, more than a hundred petaflops of compute. We're offering these through our cloud products as well, so other people can use the same hardware we use. One of the key aspects of the pods is that they have a really high-speed but fairly low-cost 2D mesh interconnect between all the chips, and that makes the programs you run on these systems very simple to express: the same program runs on a single chip, on four chips, on 256 or 512 chips, whatever your problem demands. So we're able, for example, to train 27 times faster than with GPUs, at significantly lower cost. To give an example, we can train a ResNet classifier in less than two minutes, processing more than a million images a second, which is roughly one epoch of the ImageNet dataset every second. And this really makes you more productive as a researcher, because the speedup is roughly linear: the cost is about the same whether you use one chip for 256 minutes or 256 chips for one minute, and it just makes you much more productive if you can get an
experiment done in a few minutes versus a few days or weeks; it fundamentally changes how you think about problems and the kinds of approaches you try. There's a whole bunch of models that already run really well on this hardware, with open-source reference implementations, including BERT and lots of image recognition models. Okay. One thing I think is interesting (Natasha Noy, who is actually here in the audience, is one of the key people behind this system) is: how do you actually find data in the world? It turns out there are lots and lots of interesting datasets out there, and it's actually kind of hard to find data relevant to a problem you're working on. So how do you do that? Well, there's Dataset Search, which indexes something like ten million datasets from thousands of different providers. For example, if you want datasets on energy consumption from smart meters, there's a whole bunch of them there, with detailed descriptions, the format of the data, and so on. So I think this is a pretty interesting resource for people doing machine learning who want to bring together lots of different kinds of data for the problems they're tackling. Okay, let me talk briefly about where I think some of these ideas will connect together. I think one issue we have as a community is in how we solve machine learning problems.
We typically collect a dataset and train a model for that particular problem; then we have a different problem, so we collect another dataset and train another model. We start from scratch each time, or sometimes we do a bit of transfer learning from a very related task, but we typically don't do transfer learning at the scale of thousands or millions of tasks; we transfer from one task to another, or maybe do multitask learning across four or five very related things. I actually think we should be focusing on how to build systems that do very large-scale multitask learning, thousands or millions of tasks, and sparsely activate one very large model, so that for any given task you activate only a tiny fraction of it, but you learn pathways and representations that are usable across different tasks. And then, obviously, you want to map this onto different pieces of ML accelerator hardware. Here's a cartoonish diagram of how this might look. Say you have this model already trained on a bunch of tasks, and a new task arrives. If you squint at the neural architecture search work I presented, suppose the goal was not to find a completely new architecture but to find pathways through the existing model that get you into a reasonable state for the new task. Maybe some reinforcement learning algorithm tells you which pathway seemed to work pretty well for this task. And if you want the task to be more accurate, you might isolate some capacity for it: add some new components, maybe some new parameters, and extend the system's ability to create a really accurate pathway for that task. Now the blue thing that was just added becomes something that can be used by other tasks as you add to the system. I
think this is a very sketchy direction that we should be considering, and we're actively thinking about how to work on it. Each one of those components might itself run a little architecture-search process to adapt to the kind of data being sent to it, and all the routing decisions may learn to route to the components that seem particularly good at a given task or example. Okay, I'll take one or two minutes to talk about things that are relevant as we start to use more machine learning in society. One of the things we did as a company last year was to put out a set of principles by which we think about using machine learning in our own products: aspirational goals by which we hold ourselves accountable for uses of machine learning at Google. I think it's really important that we clarify these kinds of things; in many cases, having vague notions of goodness is not nearly as good as having definitive principles, particularly as more and more people across Google and across the world start to use machine learning. You want people to really think about the ethics behind the uses of machine learning that we could make in the world.
And I'll point out that many of these areas are not solved problems. There's a huge line of machine learning research into, for example, the second principle: avoid creating or reinforcing unfair bias. Often, when you train machine learning models on data about the world, you learn to perpetuate, or perhaps even accelerate, the biases that exist in that data and in the world; you're often training on the world as it is, not the world as you'd like it to be. So algorithmic techniques that can eliminate unfair or harmful forms of bias, while keeping the kinds of bias that give models their powerful abilities, are really important. This is an active area of research within Google on machine learning fairness. What we aspire to do is use the best known current techniques for evaluating and eliminating bias in our models, but also continue to push the frontier of how machine learning algorithms and techniques can improve on the state of the art. I think this is a really important issue for all of us as we think about using machine learning in more places. And with that, I'm going to conclude by saying that I think deep neural nets and machine learning are really making headway on some of these grand challenges in the world. These are really exciting times: the fact that computers can see, and can start to understand language better than they used to, has dramatic implications for all kinds of things: health care, self-driving vehicles, robotics, reconstructing the brain. Pretty exciting stuff; we live in great times. So we have time for some questions; please go to the microphones. Carlos? Hello, and thanks for the talk. You mentioned ethics in your talk, and I would also like
to ask something about regulation. Suppose one of these predictive models or algorithms for diagnosing a disease is categorized by some regulator as a medical device, and I think some regulators probably will categorize them that way if they haven't already. How could you help them make the certification process for an algorithm as a medical device more reliable? Yeah, I think that's a very good question. In a lot of cases, machine learning technology is starting to influence areas that are already regulated in various ways, and those existing regulations should inform how we think about regulating machine learning uses in that space; medicine and vehicle law are good examples. I think one way to do this is to run clear scientific studies of the benefits of these approaches and show regulators the evidence. The other is to first introduce these systems as an additional aid to a human clinician who makes the final judgment. We've published papers showing, for example, that human clinicians augmented with a diagnostic tool like this can actually make better decisions than human clinicians alone, and I think evidence like that is really how you show regulators that this sort of approach is appropriate. Alex? What are your thoughts on incorporating more explicit human knowledge, as in knowledge graphs and reasoning? What directions are you taking there? Yeah, I think this is the classic question of symbolic knowledge versus these learned representations.
It's pretty clear that for a lot of problems, if you have human knowledge you can put into these sorts of models, it often helps a lot, because the models then don't need to develop that understanding from very raw forms of data; you can instead put in higher-level features that are useful for the task. Then there's the question of how we build systems that do more than a feed-forward prediction, which is largely what these successes are based on. How do you build a system that makes a prediction and then examines the possible implications of that prediction along some search path, trying a thousand different ways of evaluating it: if I do this, if I do that, what does that mean? If I look at this part of the image at a higher resolution, does the prediction I made from the whole image still make sense? I think that's the kind of model we actually need to be driving toward, and in some sense that iterative process is somewhat akin to human reasoning: you're trying to put together a bunch of pieces and see whether they make consistent sense. I don't have the answer for how to do that, but I think those kinds of approaches are going to be what's required. As for hand-engineered features and data like knowledge graphs, people are certainly experimenting with how to incorporate those into these kinds of models, and that's generally pretty effective. The first woman in the line, please. Imagine going back to the time when you were a very junior PhD student, and you told your advisor that you really wanted to work on deep learning, and your advisor said, "No, you shouldn't work on that, because you'll spend days and hours just fine-tuning the model for different datasets." What would you say to convince your advisor?
Is this hypothetical? So, I think there is definitely work that is better left to machines. If you look at the AutoML work, a lot of it is running lots and lots of experiments, and computers are actually pretty good at that sort of thing. What's really valuable is human intuition and creativity in coming up with new ideas, because the AutoML kind of work is generally not powerful enough to come up with completely new, crazy ideas that are really impactful; it's able to help people run experiments more rapidly. So I would say: strive to come up with really amazing, creative new approaches to tackling important problems. The reasoning question is one I think is interesting too: how do we build models with better reasoning capabilities than what we have today? But it's not an either-or thing; if you're passionate about it, we're doing work in this space, and there's plenty of work to do. I'm sorry, but there's no more time for questions; please talk to Jeff in the coffee break, and let's thank him again. Thank you.