So my name is Michael Carbin. Today I'm going to talk about the work that we've been doing in my lab on the lottery ticket hypothesis. So it's this somewhat audacious term that we have there, the lottery ticket hypothesis, but below that we have the subtitle, which says we'd like to understand how to find sparse, trainable neural networks. As I was introduced, I'm very interested in systems, and in particular I'm interested in how we can reduce the overwhelming burden that we have in terms of costs, such as energy, certainly money, and of course carbon consumption, for training modern neural networks. So that's going to be the motivation, and we're going to work up to that. The background that I'd like to start with is pruning. This is a concept that's been in the deep learning community for some time, and the basic idea is that I can take a neural network and remove superfluous or just unwanted parts from it, yet still get a reasonably accurate network, right? This is actually an old idea that's been in the literature, going back to the early 90s with Yann LeCun and his collaborators, and more recently revisited by Song Han, who you'll hear from later today, with great success. And the basic idea is this. For pruning, we're going to start with a randomly initialized network. So this is our neural network: circles corresponding to neurons, connections corresponding to weights. First, we randomly initialize it. On the first step, we go ahead and train it, using our standard training process, to produce a trained network. Then in step two, for pruning, we remove superfluous structures. So what does structure mean? Well, we could be interested in a wide variety of things: it could be weights, it could be neurons, filters, channels, any number of things that we might be working with. Here we're talking primarily about weights. Now, for what counts as superfluous, this can be magnitudes, so the size of these weights, of these structures that we have; or gradients, the amount of gradient that's falling on some particular structure; as well as activations, so essentially how active these components are in the subnetworks that we're looking at, similar to what the previous speaker was talking about. Here we're primarily focusing on work around magnitudes. So we go ahead and remove that structure. And once you remove that structure, typically what happens is that you lose some of the accuracy of your network. So in this last step, we actually fine-tune: we take this smaller network and train it a little bit more, have a little bit more data go through it.
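To make that pipeline concrete, here is a minimal sketch of magnitude pruning in PyTorch. This is illustrative only, not the speaker's code: `train` is a hypothetical stand-in for an ordinary training loop, and the 90% pruning fraction is just an example.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, fraction: float) -> dict:
    """Zero out the lowest-magnitude weights in each layer; return binary masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases and norm parameters
            continue
        k = max(1, int(param.numel() * fraction))
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()
        param.data.mul_(masks[name])             # remove the "superfluous" weights
    return masks

def apply_masks(model: nn.Module, masks: dict) -> None:
    """Re-apply masks so pruned weights stay zero while fine-tuning."""
    for name, param in model.named_parameters():
        if name in masks:
            param.data.mul_(masks[name])

# train(model)                                   # step 1: train the random init
# masks = magnitude_prune(model, fraction=0.9)   # step 2: prune 90% of weights
# train(model, after_step=lambda: apply_masks(model, masks))  # step 3: fine-tune
```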
And as a result, we get a network that's smaller by some fraction but that actually matches the accuracy of our original, right? Now this has been fantastically successful in the literature. For a wide variety of different data sets, as well as for different architectures, what you've been able to do is reduce the size of these networks by an order of magnitude, right? And this is fantastic as it's been explored. When you're pruning in this way, after training, this is going to be great for deployment, right? For deployment, when I think about inference time, I can shrink my networks down by an order of magnitude and therefore reduce those costs. Now, as I said at the beginning, we're interested in training. How do we actually reduce the cost of training? Pruning after training only helps inference, but the way we're going to get there is by asking ourselves some questions, right? In particular, this pruned network, as we've shown just by pruning, can represent an equally accurate function: we've pruned it, we've done a little bit more training, and it actually matches the accuracy of the original network, right? So this raises a very good question: why didn't we just train this pruned network from the start? That's what we'd ideally like to do, and it gives us a very simple procedure we might try: why don't we just start with a randomly initialized pruned network and then train it to convergence? Start small, train. Well, unfortunately, this does not reach the same accuracy as your original network, right? And this is something that we've seen time and time again in the literature, where, for example, training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training networks with small capacity. And we've seen in other papers as well that during retraining it is better to retain the weights from the initial training phase for the connections that have survived than it is to re-initialize the pruned layers, right? So people have actually been looking at this problem and wondering whether this is possible for some time, without success. In the work that we've been doing, we've been trying to challenge these results, to understand what is going on here. Can we actually do better? Can we make some attempt at actually doing this during training? So that is the first question. And a corollary that we're going to have here is: do these networks just need to be so large to learn to begin with, right? The answers that we've been exploring in our work are, one, yes, we've actually been able to find that these small networks are out there; and then perhaps no. I would actually put a question mark here that says perhaps these networks don't need to be quite so overparameterized, even though the slide says strongly no. And to explain how we got here: the observation that we're finding is that the weights that are pruned after training could have actually been pruned before training. This is the basic hypothesis that we've been exploring in some of our work; this is work we had at ICLR this past year. And here, the key observation is that you need to use the same initialization.
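As a minimal sketch of that "same initialization" idea, here is iterative magnitude pruning with rewinding, reusing the hypothetical `train`, `magnitude_prune`, and `apply_masks` helpers from the sketch above; again, an illustration of the procedure described in the talk, not the authors' released code.

```python
import copy

def find_lottery_ticket(model, rounds: int = 5, prune_per_round: float = 0.2):
    init_state = copy.deepcopy(model.state_dict())   # weights at time zero
    masks = None
    for r in range(rounds):
        train(model)                                  # train the (masked) network
        # Prune 20% of the *remaining* weights each round (cumulative sparsity);
        # already-zeroed weights sit below the threshold and stay pruned.
        masks = magnitude_prune(model, fraction=1 - (1 - prune_per_round) ** (r + 1))
        model.load_state_dict(init_state)             # rewind to the SAME init...
        apply_masks(model, masks)                     # ...keeping only survivors
    return model, masks                               # a candidate "winning ticket"
```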
Everything that we've been seeing in the current literature says that if I try to take that small network and start from scratch, re-initializing it, I'm going to run into problems. But what we've been finding is that if you dig inside that large network, you can actually find these very small networks that can train from scratch, from the start. So let's dig in to see how we actually do this. Here's the twist in our technique. We're still going to start where we started before: randomly initialize the full network, right? In the next step, we train it just as we did before, and of course prune our superfluous structure, as we defined before, using our preferred metric. And now here's where we do something ever so slightly different, right? At this point, previously, we would have done some fine-tuning. But what we're going to do this time is backport the changes that we're seeing at the end of training to the beginning of training. So we're going to take those architectural discoveries, the things that turned out to be irrelevant, and apply them to the initialization that we had at time zero, right, back in step one. And we can do this iteratively: starting with this network, do another pass, prune, backport again. And what we're going to see is that we can actually get this smaller network. Now, what's been very interesting about these results is that we can take these very large networks, apply this process, and actually find some very small network that can train to the same accuracy, oftentimes more quickly, or at least as quickly, as that large, original, overparameterized network. As a summary of where we've been with this work: we've found that this works for fully connected networks, for MNIST, CIFAR, as well as ImageNet, and including the wide variety of different architecture tweaks that you might be interested in. And the key difference between our technique and what we've seen in the past, and why we get this to work, is that it's all about this initialization. If you re-initialize, this just does not work. So, in summary: these networks that we're finding are oftentimes an order of magnitude smaller than the original network. They also tend to learn faster, and I'm going to show some results around that, right? And as well, they reach the same accuracy, and oftentimes better accuracy. And there are some caveats, of course, right? These subnetworks are found retroactively: we're doing this after training, rewinding back to the beginning of training, and then discovering this network. So towards our goal there's still some room; we still need to make progress to figure out how we could do this during training, at some point during training. But let's go ahead and look at some of the results. The way I'm going to show this is we're going to look at the test accuracy of training some network as a function of our point in training, as we're adding more data, as we're moving through training. And this first curve is going to be the data point 100, which means this is our original network that we're starting with, with 100% of its weights.
And this is for MNIST with a fully connected network; I could show these results for all the different benchmarks that we looked at, but MNIST makes the easiest, clearest distinction. So we'll start out with 100 here: this is our full network. What I'm going to do is bring in new curves that show this behavior for smaller and smaller networks that we're finding. This first one is going to be 51%: I've thrown away 51% of the weights. What we can see is that this smaller, half-the-size network can actually still train to the same accuracy, and ever so slightly more quickly. If I go down to 21.6%, I get a little accuracy boost, and I get a little speed boost as well along the way. And I can continue going down to 3.6%. Obviously this can't continue forever, right? In the limit, I get no network, so obviously this has to break down somewhere, and below about 3.6% is when we start to fall back to the original behavior. So here we're finding a network that's 3.6% of the size of that original network that trains just as well, just as quickly. That's what we're able to find, and it shows the promise of potential new opportunities to find these small networks that can train from scratch. To make this a little more precise, let's think about the role of re-initialization. We'll bring back this line that we had before, for 51%, but now I'm going to add a dashed line. This dashed line shows what happens when we re-initialize. When we re-initialize, we actually open up a gap: we aren't seeing quite the same behavior as before. And this gap grows the smaller the network gets. In fact, at some point, for example when we get down to 3.6%, we see that if we were to re-initialize, we can't actually match the accuracy of our original network anymore. To make this a little clearer, because these graphs can get quite out of control: on the left side, we're going to plot, as a function of the percentage of weights remaining, how long it takes to reach the accuracy that I care about; and on the right side, we're going to plot that accuracy. So we've got our left, we've got our right; or actually, let's just take the left. We have the percentage of weights remaining, and on the y-axis, the number of iterations. And the pattern we're seeing is that as we remove weights, with our technique, we do better: we're learning more quickly up to that same accuracy. And over on the right, we're actually getting a little bit better accuracy as we remove weights. Now, where this becomes more dramatic than what we were seeing on the previous slides is when I re-initialize, right? With re-initialization, I do worse as I remove weights: it takes me longer to converge, and of course my accuracy drops off much earlier. And there's one last point that we make as well: we also do some experiments around rearrangement. So is it just the initialization, or is it also the structure that we have in the network?
Is this structure important in some way? What we're finding is, yes, that structure is very important. If we change those experiments ever so slightly, if we rearrange that structure, we run into this issue as well, where we get slower convergence as well as a loss in accuracy as we make the network smaller. So this all builds up to this idea, the lottery ticket hypothesis, that we've been working with, contending with, trying to understand, trying to validate even more. And that is that these dense, trainable networks that we have, that we're training in our everyday work, actually contain small subnetworks that train just as well. Here, equally capable means the subnetwork trains to the same accuracy, just as quickly, as we saw for the larger network. And really, the task is: how do we go about finding these? In our current work, the work that's coming out and the work going forward, we can find these networks, and we believe these networks have essentially won this initialization lottery: somehow they've gotten so lucky that they're able to train to high accuracy quickly, even though they're that small, much smaller than what we see in our normal training regime. Now, to conclude, what are the implications here? Clearly there are some interesting things we can do in trying to understand whether this informs how we initialize or structure these networks. We can figure out whether we can transfer these networks to new tasks, for example to learn in new domains where we have limited data; this might be a productive inductive bias that we're describing. And then finally, as I talked about before, our main goal: how do we get there? How can we identify these things with significantly less cost, much earlier in training, and therefore perhaps dramatically reduce the cost of training? So that's the talk; thank you. So this is an exciting time in deep reinforcement learning, because we now have systems that can do incredibly impressive things, like play games like Go at superhuman levels. But at the same time, I think there's also widespread understanding that the field is missing something fundamental, and that's essentially the problem of how we generalize what we learn in one situation to a new situation. A lot of people are working on this, and it's an extremely hard problem. I'm one of those people working on the problem, but I'm coming at it from a slightly different perspective. I'm a computational cognitive scientist, and I'm interested in seeing what we can take from the study of generalization of learning in biological systems, and what implications that might have for artificial intelligence and machine learning. If you're unfamiliar with reinforcement learning, this is the standard setup: we have an agent interacting with an environment by taking actions and getting some reward signal. The focus of my talk is the agent's control policy, which is a function that maps from the current state of the agent's environment to a probability distribution over the actions the agent should take in that state. As a simple example, you might take learning to play Pac-Man: you have an agent who takes actions, and maybe runs into the ghost and dies, and learns from that experience.
And so the challenge is: you've learned from this experience, but how do you generalize that to some new situation that you've never encountered before? For example, you've learned a policy for an orange ghost; how does that generalize to a policy for a light blue ghost? And the issue is that reinforcement learning has absolutely nothing to say about that. Reinforcement learning can tell you how to learn an optimal policy, but it tells you absolutely nothing about how to generalize that to new situations. In practice, the way the field gets around that is by throwing deep neural networks at it. So we represent our policies using deep neural networks, and sometimes those deep neural networks generalize and sometimes they don't. When it works, it's great, and when it doesn't work, your paper gets rejected and that's the end of it. So, like I said, I'm interested in trying to figure out what we can learn from generalization in biological intelligence. If we take an example from biology: a blue jay learns that eating a particular species of butterfly is toxic. What can the blue jay infer about other species of butterflies it's never encountered before? Maybe similar-looking species, or species that look different, or maybe butterflies in general. And the key to my research is the argument that information processing in biological intelligence faces severe capacity constraints, in a way that isn't adequately captured in modern machine learning systems. We can formalize that; like most things in life, it boils down to information theory. Taking the idea of an agent's policy as a mapping from states to actions, we can reframe that as an information channel. The input to the channel is the agent's current state, and on the output side of the channel is the action that the agent will take. And what we're going to do is deliberately introduce an information bottleneck into that channel, by restricting the flow of information from states to actions. If you saw Yoshua's talk this morning, he argued that a useful objective for intelligence is maximizing the mutual information of our representations. So, to be slightly controversial, I'll argue that maybe a better objective is minimizing mutual information. We want policies that are simple, in the sense that they have low mutual information between states and actions; but at the same time, we want agents whose policies have good utility. So the argument is that we should be training deep RL systems to optimize a two-part objective: on the one hand, we want to minimize mutual information, but on the other hand, we want to maximize utility. There's a natural tension between these two objectives, and it turns out that this corresponds to a well-known field in information theory known as rate-distortion theory. What we can do is apply that to deep reinforcement learning systems. To give you a concrete sense of what that looks like, we can take the fruit fly of the machine learning world, the grid world. We have an agent who starts in one corner of the maze and wants to navigate to the goal in as few steps as possible. Each step it takes costs it one unit of utility, and if it runs into a wall, that costs it more. So it's a very simple learning problem, and from an information theory perspective, it's also fairly simple to specify the information content of an optimal policy.
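Concretely, the two-part objective is to maximize expected utility minus a price on the policy's information rate, something like E[U] - beta * I(S;A). Below is a small illustrative sketch (my own toy code, not the speaker's) that computes I(S;A) in bits for a tabular policy, including the two-bits-per-state lookup-table case discussed next.

```python
import numpy as np

def policy_rate_bits(policy: np.ndarray, state_probs: np.ndarray) -> float:
    """I(S;A) in bits for a tabular policy p(a|s) under state distribution p(s)."""
    marginal = state_probs @ policy                       # p(a) = sum_s p(s) p(a|s)
    ratio = np.divide(policy, marginal,
                      out=np.zeros_like(policy), where=marginal > 0)
    log_term = np.log2(ratio, out=np.zeros_like(policy), where=policy > 0)
    return float(state_probs @ np.sum(policy * log_term, axis=1))

# A deterministic 4-action lookup-table policy needs up to log2(4) = 2 bits/state:
n_states, n_actions = 16, 4
det = np.eye(n_actions)[np.arange(n_states) % n_actions]  # one-hot rows
uniform = np.full(n_states, 1 / n_states)
print(policy_rate_bits(det, uniform))                     # -> 2.0 here

# A uniformly random policy carries no information about the state:
print(policy_rate_bits(np.full((n_states, n_actions), 0.25), uniform))  # -> 0.0
# Capacity-limited agents then optimize  E[U] - beta * I(S;A).
```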
In each state, you have four possible actions, so we can write down an optimal policy as a lookup table, where we need two bits per state in order to write down the optimal action for each state of the environment. And the question is: what happens if we don't give our agent two bits per state? What happens if we cut that in half and say, you're only allowed to use one bit on average to represent your policy in each state? We want to be able to learn an optimal policy subject to that kind of constraint. And that's what we're showing here. We have a graph where, on the x-axis, we're varying the tightness of the bottleneck; as we move towards zero, the bottleneck gets tighter and tighter. And on the y-axis, we're plotting the cost incurred by the agent as a consequence of having a capacity-limited policy. Out here at the tail, with two bits per state, there's zero cost relative to optimal, and what we show is that as you tighten the bottleneck, costs start to ramp up. And this isn't just some arbitrary relationship: rate-distortion theory gives us the optimal relationship between information channel capacity and loss of utility. And we can build reinforcement learning agents that find optima along this curve for any particular bottleneck that you impose on the agent. To give you a concrete idea of what these capacity-limited policies look like, we take three points along this optimal trade-off and just plot the policy. Here the colors essentially indicate the randomness of the agent's policy: red is essentially behaving randomly; darker, cooler colors are behaving more deterministically. What you can see is that at very low information rates, the agent can't represent a detailed policy; it essentially just wanders around randomly. But the interesting point is these intermediate stages along the curve. Here you have some information capacity, and it becomes a resource allocation problem. The agent learns detailed policies in states where it really matters. For example, we have these corridors where, if you take one wrong turn, it will cost you a lot in terms of utility; those are the states where you want to represent a detailed policy. But there are also regions of the environment where it doesn't matter what your policy is: any action you take isn't going to cost you a lot in terms of utility. And so we're building reinforcement learning systems that learn these types of capacity-limited policies. The goal for the rest of the talk is basically to illustrate, using these very simple examples, the kinds of computational advantages you get by deliberately imposing capacity constraints on your ability to learn or represent a policy. One simple thing we can do is take agents, train them in one maze environment, and then modify the environment by randomly plunking down additional walls. So we randomly add walls, shown here in red, and then examine how the agent's policy generalizes to the new environments. And what you can see is that there's a Goldilocks zone: at moderate information rates, the policy actually generalizes better. In one sense, this is essentially just adding a regularizer to learning, and from that perspective, maybe it's not too surprising that you're no longer overfitting your policy to a particular environment.
But in another sense, there's nothing wrong with saying this is just regularization, as long as it's the right kind of regularization. And the argument is that this is coming from a principled framework: we want to minimize the complexity of our policies while achieving the best utility subject to that trade-off. As another simple example, this is the mountain car environment, where you have an agent trying to drive a car up a steep hill. It's underpowered, so it can't drive straight up; the optimal policy is to learn to rock back and forth until it gets to the top of the hill. We can play the same game, looking at what reinforcement learning algorithms learn as we vary the tightness of the bottleneck. In this case, lower is better, because we want agents that get to the top of the hill as soon as possible. What we're plotting on the x-axis, again, is the tightness of the bottleneck; the y-axis is the average number of steps it takes. And what we're showing here is actually not better generalization but better online learning: this is the average steps per episode in the training environment, without modifying the environment. So by restricting the capacity of your policy, you not only get better generalization, you get better online learning as well. The advantage of working with such simple environments is that we can come to an understanding of why that's the case. With a simple environment like this, we can actually plot the policy. We have a two-dimensional state space for the agent: its position and its velocity. And here we're plotting its policy as a color map: red is a high probability of moving to the right, and blue is a high probability of moving to the left. We can plot the agent's policy for three different levels of our capacity trade-off. What you can see is that at high capacity, the agent learns a very deterministic policy, and at the other extreme it essentially learns a very noisy policy, which we can illustrate by overlaying sample trajectories. But again, there's this Goldilocks zone, where at intermediate capacities it learns the distinctions in the environment that matter, but for other regions of the state space it learns that you shouldn't attempt to encode a deterministic policy. In human motor control, there's something similar: there's a concept of an uncontrolled manifold, which says that the motor system tries to control the dimensions of variability that actually matter in terms of task performance, but there are a large number of dimensions that are irrelevant for task performance, and you should deliberately not attempt to encode a policy for those. And that's behavior that you only get by introducing this capacity constraint. One last example. We've talked about better generalization; we've talked about better online learning. I think another interesting aspect of this is that we can learn better representations of our task environment. And again, this is a very simple environment: we're talking about a multi-armed bandit problem, where you're facing a choice between 10 different slot machines, and through trial and error you have to learn which one to pick, which one has the highest average payout. This is a contextual multi-armed bandit, meaning that on every trial you're first given some state cue that tells you something about your environment. In this case, the state cues are digits from the MNIST dataset. And so we train our agent.
It's shown a state cue, say this digit three, and then it chooses which arm to pull. We repeat that using a sample of, say, 100 digits from the MNIST dataset. And then what we can do is look at how well it generalizes when we test its performance on digits it's never seen before: maybe we've trained it on this pixel pattern, but at generalization we test it on a held-out sample of MNIST digits. And again, the argument is consistent throughout: there's this Goldilocks zone. Here we're plotting performance on the training set as a function of the bottleneck on the policy channel, and for training, higher-capacity policies are better; you can preserve fine-grained distinctions in your environment. But then when we look at performance at generalization, we find that by forcing yourself to throw away information, you actually achieve better performance. And the claim is that what's going on behind the scenes is that this agent is actually learning digit categories. Instead of learning a mapping from pixel patterns to responses, it's learning a useful category structure for its environment. And the key point is that this is entirely driven by reinforcement learning; this is not supervised learning of digit categories. It learns digit categories not because it's been supervised or instructed to, but because it's a useful representation for this environment. So, to conclude, the general argument is that biology often faces strict limitations on information capacity. Encoding detailed policies takes metabolic expenditure, takes protein synthesis; it's expensive for biological systems. And so evolution places a constraint on minimizing the complexity of the representations that organisms have, but at the same time, evolution favors representations that have good utility. The argument is that by drawing inspiration from biology, and formalizing that in terms of the mathematics of rate-distortion theory, we can achieve better generalization of learning in reinforcement learning. And with that, thank you for your time and attention. My name is Julie Shah. I lead the Interactive Robotics Group in the Computer Science and Artificial Intelligence Laboratory, and I'm on the faculty in the AeroAstro department. My lab and I work on reverse engineering the human mind to make robots that are better teammates, and we work in the area surrounding the future of work. This is a time of growing anxiety about the role of AI, computing, and robots in our lives, and in particular about how they have the potential to supplant human work. The vision for our lab's work is to be intentional about developing computing that augments, not replaces, human capability. I've worked for many years in human-robot collaboration in manufacturing, so I'll just start with a stat here. Worldwide, there are 1.8 million industrial robots in operation in factories around the world. That's the same number as the human population of Boston, Pittsburgh, and San Francisco; the cities are chosen strategically. So, if you were to guess, how many robots are there in our homes in the United States? Are you ready for this? It's 30 million. 30 million robots in our homes in the United States today. And we don't often think of robots as being that integrated into our daily lives. Those 30 million are things that actually look like robots, like Roombas.
That's not counting the new set of security robots patrolling apartment complexes and parking lots, or delivery robots on the sidewalks of San Francisco. It's not counting Alexas, Google Nows, et cetera, or smart homes, which are essentially robots. As of this spring, 500 grocery stores around the country rolled out these strange robots that roll up and down the aisles of grocery stores to inspect for spills. So these robots are still coexistence systems: they live in our environment, but there are relatively few of them that we're interacting with on a daily basis. They're performing functionally narrow tasks in controlled environments, and essentially under near-constant, meaningful human supervision. So I want you to imagine a moment in the future where we have many, many more of these on our roads, on our sidewalks, in our workplaces. These are systems that don't understand us, right? So now imagine you're a student driver on a road with 1,000 other student drivers, which is essentially what these robots will be on our roads and on our sidewalks. And that's not some scary vision of the future; that's actually our current day. You may have noticed that in December 2017, San Francisco actually banned sidewalk delivery robots. There were so many companies testing these systems on the city streets in San Francisco, but the elderly and people with disabilities came together and said, these systems don't understand us, and we frankly don't feel safe around them. And so, as a result, the city relegated the testing to an industrial sector, a non-populated area of the city. The goal of my lab's work is to open up the opportunity for these systems to integrate more seamlessly and add value to our daily lives, and to do that, we need systems that understand us and understand our human context better. Now, I'm from aerospace. I come from safety-critical applications in industrial environments: pilots and co-pilots collaborating in cockpits, healthcare. There's a long tradition of work and study on what makes for effective teamwork, and there's also work in cognitive science. But we can actually put all that aside and just have a conversation about what makes Tom Brady so awesome. You know, they say he sees and he knows. An effective teammate is able to infer what their partner is thinking, is able to anticipate what they're going to do, and is able to make fast adjustments when things don't go according to plan. So my lab works on translating insights from cognitive science into computational models that enable robots to infer our cognitive state, to develop models of our behavior and our decision-making, and then to be able to jump in and play the game with us. And in order to make these systems effective, they need to work with us in three settings. They need to work with us to form a common understanding of our shared plan for how it is we're going to work together. They need to be able to learn from us and with us; this is training, and you can't imagine a football team coming together and playing well if they weren't training, though when I use the word training here, I mean it in a somewhat different sense. And once you've trained together, you want to go out on game day and play the game. So I'm going to tell you two stories of work in the lab, two sides of a coin, about how we're working on making these systems more effective at understanding us and augmenting us. Now, we've worked for a number of years with a local Boston hospital, Beth Israel.
And you may ask why it is that an aerospace engineer is working in healthcare. There's a big opportunity here. The Joint Commission reported that 80 to 90% of sentinel events, those are events that result in death or near death, are rooted in human factors: issues of cognitive overload and failures of communication and teamwork. So the question is, how can we develop AI that understands us better, that can reduce our cognitive burden and shore up our ability to communicate effectively? In studying the hospital environment, and this is a labor and delivery floor, I actually learned that there is an air traffic controller on the labor and delivery floor. This is the resource nurse, or the nurse manager. And she, it's basically always a she, is making decisions about which patients go to which rooms and which nurses are assigned to which patients; she controls aspects of the OR schedule, and many other decisions. She's doing a job that's computationally more complex than that of an air traffic controller, and without any decision support. And there's no codified training process: she learns this job through a period of apprenticeship over many years, in some cases decades. So there's a lot of implicit knowledge there that's not codified. The question was, could we develop a machine learning model that's able to learn from relatively limited data, in an environment where we don't have a simulator or emulator and we don't have a clear objective function? Can we learn, essentially, the strategies or heuristics that these nurses are using to make their resource allocation and scheduling decisions? Now, a medical student or a nurse apprentice is able to learn the context of a hospital workflow relatively quickly, in a few days, and to figure out limited ways in which that medical student or nurse in training can contribute to reduce the burden of the overall work. So what is it that makes people so capable of learning relatively efficiently, with little data? We know that in this context of task allocation and scheduling, essentially human multi-criteria decision-making, humans learn effectively through comparisons, and in particular pairwise comparisons. So we took this insight from cognitive science and developed a structured machine learning model to learn a pairwise ranking model. Essentially, the machine is watching the nurses' decisions, computing what makes the chosen option different from the alternatives, and automatically constructing positive and negative examples from that.
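To illustrate the flavor of that pairwise setup, here's a toy sketch (mine, not the hospital system's code): each observed decision is turned into difference-of-feature-vector comparisons between the chosen option and the alternatives, and a linear ranking model is fit to them. The features, dimensions, and weights are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each scheduling decision: the expert picked `chosen` out of `options`,
# where every option is a feature vector (room availability, nurse workload...).
def pairwise_examples(options: np.ndarray, chosen: int):
    X, y = [], []
    for j in range(len(options)):
        if j == chosen:
            continue
        X.append(options[chosen] - options[j]); y.append(1)  # chosen beats j
        X.append(options[j] - options[chosen]); y.append(0)  # and vice versa
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -0.5, 0.3, 0.0, 0.8, -1.0])   # hidden expert preferences
X, y = [], []
for _ in range(200):                        # 200 observed decisions
    opts = rng.normal(size=(5, 6))          # 5 candidate assignments, 6 features
    Xi, yi = pairwise_examples(opts, int(np.argmax(opts @ true_w)))
    X.append(Xi); y.append(yi)

ranker = LogisticRegression().fit(np.vstack(X), np.concatenate(y))
score = lambda opts: opts @ ranker.coef_.ravel()      # rank new options by score
```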
So we deployed this human-inspired learning model in the hospital, both as computer-aided decision support and in a robotic, embodied form, and I could have a whole separate conversation about the interesting aspects of the embodied decision support. The system learned, and we evaluated it in a high-fidelity simulation environment, but we also had the opportunity to test it on the live labor and delivery floor: the robot read the actual whiteboard on the hospital floor and made real recommendations, based on the current state, to the nurse or the doctor. I'll give you a sense of what this looks like. [In the demo video, the nurse, who is responsible for taking in all of the information on the LCD screens and on the handwritten whiteboard about the current state of the floor, asks the robot for a good decision and for a bad decision. The robot recommends which room an incoming scheduled cesarean section patient should be placed in and which nurse should take care of her, and the nurse agrees; asked for a bad decision, it answers that it would be bad to place that patient in room 14 and have nurse Kristen take care of her, and the nurse agrees again.] So that was the demonstration, but in our controlled experiments in the high-fidelity simulation environment, we found that nurses and physicians agreed with the suggestions of our system 90% of the time. This is not a system that could ever replace a person; the idea is to offload some of the easier decisions from these nurses, to free up their cognitive capacity to deal with the situations we need them for, the very complex or ambiguous situations. So this is one way in which we can employ AI to reduce the cognitive burden of people working in these domains. But I'll point out that cognitive burden was only one aspect of the challenges we face; the other is failures of communication and teamwork. In the cockpit, you might think that training is about the pilots learning to fly the plane. It is not. A study from a while ago now by NASA found that 60% of aviation accidents happened when pilots were working together in the cockpit for the first time. There are different patterns of communication, some of which increase cognitive overhead, and pilots actually train to communicate more effectively, to intermesh their activities, to anticipate the information needs of their co-pilot, and that contributes to safe systems. And so in our recent work with IBM Watson, we've been looking to take this insight, that humans learn very effectively through pairwise comparisons, and enable machines to present back to us what they learn about the differences between different sets of data. For example, in the cockpit, you can imagine different communication patterns, some of which contribute to high workload and some of which work effectively. You'd want the system to provide a contrastive explanation back to those pilots, to improve their ability to learn and train and effectively fly the aircraft together. Prior work, for example from the software verification community, focused on mining summary explanations. In this work, our aim is different: we want to generate contrastive explanations describing how multiple traces differ from each other. And this has many applications beyond the examples I've given you here. The prior work on this, taking a pair of traces and inferring a specification of time-evolved behavior, for example in linear temporal logic, used exact learning techniques, which provide one contrastive explanation and are susceptible to failure when you have noisy traces, as you do with any real-world data set.
And so we developed a Bayesian technique, a probabilistic Bayesian inference model, that infers a contrastive explanation, in the form of a linear temporal logic (LTL) specification, between two sets of execution traces. It offers substantial robustness to noisy inputs, it quickly generates multiple explanations, and it allows flexibility in incorporating preferences at various levels through the inference and search process. But the key to making this work is that you need to structure the hypothesis space. And we have a large body of work in cognitive science on the structure we design into real-world systems and into our software programs, and on the ways we explain system behavior to other people. So we draw from that literature a template library of more than two dozen patterns of behavior to structure the hypothesis space, which lets us infer LTL specifications very efficiently. We tested this work, presented in an HCI paper this past summer, beginning with domains of various sizes from the automated planning and scheduling community's International Planning Competition. Our model is able to infer multiple contrastive explanations, and it identified the ground-truth explanation with very, very high accuracy, with quite fast runtimes, on the order of seconds. The prior work, which generates one solution, often either fails to generate a solution in the specified time or outputs a single explanation, which may not correspond to a human's mental model for understanding the behavior of the system. And again, because it's a Bayesian model, it provides substantial robustness to noise: in these domains, with moderate noise, we're able to maintain very high accuracy in inferring contrastive explanations that hold across the data sets.
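As a toy illustration of what a contrastive temporal-logic explanation is (my own sketch, with a hand-rolled checker over a couple of hypothetical templates; the actual system performs Bayesian inference over a much richer template space):

```python
# Toy contrastive-explanation search: find templated properties that hold on
# every trace in set A but fail on some trace in set B. Traces are sequences
# of sets of propositions observed at each step.
def globally(p):            # G p : p holds at every step
    return lambda trace: all(p in step for step in trace)

def eventually(p):          # F p : p holds at some step
    return lambda trace: any(p in step for step in trace)

def response(p, q):         # G(p -> F q) : every p is eventually followed by q
    def check(trace):
        for i, step in enumerate(trace):
            if p in step and not any(q in later for later in trace[i:]):
                return False
        return True
    return check

def contrastive(templates, traces_a, traces_b):
    return [name for name, f in templates.items()
            if all(f(t) for t in traces_a) and not all(f(t) for t in traces_b)]

a = [[{"req"}, {"grant"}, set()], [{"req"}, set(), {"grant"}]]
b = [[{"req"}, set(), set()]]
templates = {"G(req -> F grant)": response("req", "grant"),
             "F grant": eventually("grant")}
print(contrastive(templates, a, b))   # both properties separate A from B
```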
So this is an example of taking insights about how humans learn effectively with very little data, and being able not only to reverse engineer a human's behavioral strategies, but then to explain back to a person, with those contrastive explanations, to facilitate the training, the co-learning, between human and machine. And so these are two small steps towards the ultimate vision of systems inferring our cognitive state and supporting us in both cognitive and physical work. Thank you. All right: it is important to anticipate. That's what Dario said this morning in the introduction. Yoshua later on reflected on models that allow us to envision future scenarios, and afterwards Roger talked about models that can complete sentences. Now I want to show you our view on this anticipation behavior of different models, and by our view, I mean the view of a set of talented graduate students at UIUC. I call them the magicians, because they're the ones that really deserve all the credit; they make all of this work. Now, I guess it comes as no surprise to everyone that computer vision and machine learning have had incredible success in the last couple of years. Models which analyze, and I deliberately use the word analyze here as opposed to predict, work very well. If we look at tasks like image classification, object detection, semantic segmentation, or human pose estimation, we achieve incredible accuracies at this point in time. So are we done? I would argue we're not done, nowhere close; what we actually need is models which anticipate. If you're driving on the road and you see two kids, like the two on the left-hand side, chasing a ball, you're constantly wondering: are they going to step aside and yield to oncoming traffic, or are they going to run into the middle of the road without paying attention to the traffic? We're constantly checking what is happening and looking forward to what could possibly happen. But I would argue that we don't have algorithms and systems that are very good at doing this at this point in time. And those types of systems require us to combine, on the one hand, the vision of David Marr, who proclaimed this 2D-to-3D reasoning, with the vision of the neuroscientists Rodolfo Llinás and Kenneth Craik, who said that a creature must anticipate the outcome of a movement in order to navigate safely. So how can we develop these systems that can anticipate? In our opinion, there are four steps to it. We need to have, first of all, holistic object understanding; we need to know how objects interact with each other, how to learn priors about our environment, and how to capture ambiguity. I don't have the time to talk about each and every one of those aspects; instead I just want to focus here on the first one. Now, holistically understanding objects is something that we as humans do almost flawlessly. If we see an image like the one on the left-hand side, we know that the headboard of the bed is supported somewhere from below: it doesn't stop in mid-air, it has to go all the way to the bottom, and similarly for other objects. Or if I look at the image here on the left-hand side, we see this stack of disks, and I know that there has to be a table somewhere, or it has to go all the way to the floor. Similarly, on the right-hand side, we know what elephants typically look like, and so we can complete an elephant despite a significant amount of occlusion. But I would argue that while humans can perform this task with reasonable accuracy, it is actually quite challenging for AI systems. And part of the reason it's challenging is that it's tough to get occlusion reasoning to work, because it's very expensive to get ground-truth data; we have very, very little ground-truth data. In fact, if you check what the existing data sets in this area are, you can see that there are roughly three data sets out there, with names like COCOA, D2S, and DYCE. And while COCOA has real-world data, it's actually a subset of the MS-COCO data set, barely 5,000 images, so not exactly large scale at all. Then there is this, I guess, groceries-on-the-table data set, which is not exactly the day-to-day environment that we encounter everywhere on the street. And then we have a synthetic data set which is composed of static indoor scenes. What I would argue is missing in all of those, or what is common to all of them, is the fact that those are image data sets. What is missing is the temporal information: none of those data sets allows you to extract and utilize temporal information. So what we wanted to look at is: can we come up with a method, or can we collect a data set, that allows us to exploit and learn temporal information in order to perform better at this occlusion-reasoning task? And so what I want to talk to you about is one of the works that we presented at CVPR this year, where we discussed a data set for amodal video segmentation. Now, in order to get this data, we utilized reasonably realistic computer games, in this case, GTA V.
So students had great fun playing computer games in my lab. We used that game because it provides reasonably realistic rendering; there's a variety of different scenarios that we can extract; we can actually change the weather conditions and the lighting conditions as well; and it's reasonably easy to get ground truth, and I'll tell you in a bit how we actually get that. You can see at the bottom of these slides the annotations that we can extract. Here is a video which shows the richness of these annotations; there's really a lot of detail, up to the level of the hairs, in fact. And there's another video that we see here; there's a whole range of different, diverse scenarios. Now, how can we get this data? Well, luckily, there's a library out there called Script Hook that allows us to operate on, or interact with, the game. It allows us to alter the weather conditions, the time of day, the clothing; it also allows us to pause the game and to toggle the visibility of objects. With that in mind, we can set up this very simple pipeline. First, we initialize the game randomly and choose a couple of those settings. Then we play for a little while and pause the game. And then we grab, first of all, the image that is rendered to the screen, as well as images with none of the objects rendered, and with one object at a time rendered. Now, a problem with that approach is that we can't compute amodal segmentation from it, because of effects like shadows, features of the game presumably, and specular highlights. So how can we fix this? Luckily, we can place ourselves between the game and the GPU, and so we can grab information from what is called the depth buffer and the stencil buffer. That allows us to get not only the RGB images, but also the depth information and semantic class information. By combining all those three cues, we can then, first of all, render the entire screen, render the background, get all this information, and also render one object at a time. With all this data collected, it's not that hard anymore to just compare depth buffers and stencil buffers and check where there is an object and where there is no object. And with that at hand, we can compute the amodal segmentation, in this case illustrated for the human here. If I do that for every single object, we get data like what I'm showing you here, respectively the video that I showed you earlier. We can obviously also get the visible masks, and in addition to that, we can also track the objects. We are not looking at every frame independently of the others; the game actually assigns a unique ID to every object, which means we can follow objects across different frames, and you can see here, by the colors, either in the visible part or the amodal part, that objects are consistently annotated, which is very important, particularly for a video setting. We are also able to get class label information by looking at the names of the models that are rendered; that allows us to collect 162 classes, 60% of which overlap with the classic COCO data set that people use in computer vision, and that allows us to test models trained on this data set on real-world data.
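Here's a minimal sketch of that buffer-comparison logic, with toy arrays standing in for the game's buffers (the actual pipeline works on real depth and stencil data grabbed between the game and the GPU): a pixel belongs to an object's amodal mask wherever the lone-object render covers it, and to its visible mask where the object is also the nearest surface in the full scene.

```python
import numpy as np

def object_masks(scene_depth: np.ndarray,
                 obj_depth: np.ndarray,
                 obj_drawn: np.ndarray,
                 eps: float = 1e-3):
    """scene_depth: depth buffer with everything rendered, shape (H, W).
    obj_depth: depth buffer with only this object rendered, shape (H, W).
    obj_drawn: boolean stencil of pixels the lone object covers, shape (H, W)."""
    amodal = obj_drawn                                      # full extent, occluded or not
    visible = obj_drawn & (obj_depth <= scene_depth + eps)  # object is nearest surface
    occluded = amodal & ~visible
    return amodal, visible, occluded

# Toy 1x4 "image": the object spans all pixels; a wall hides the last two.
scene = np.array([[5.0, 5.0, 2.0, 2.0]])   # wall at depth 2 in front
obj   = np.array([[5.0, 5.0, 5.0, 5.0]])   # object at depth 5 everywhere
drawn = np.array([[True, True, True, True]])
amodal, visible, occluded = object_masks(scene, obj, drawn)
print(visible)    # [[ True  True False False]]
print(occluded)   # [[False False  True  True]]
```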
Another thing which I'm personally very excited about, now looking a little bit forward to what else we can do, is that not only can we get this amodal segmentation information, but we can also get very rich pose information, pose information despite occlusions, like what I'm showing you here in this video. It's something like 60-something keypoints, and we get all this information despite occlusions. Here are some more videos that show the richness of the data set. Summarizing the statistics compared to the data sets that I mentioned earlier: the data set that we propose is the first video data set that allows you to take advantage of temporal information, and in terms of the number of instances, we have 1.8 million labeled instances, which in this case is very easy to get because we just have to run the game, whereas other data sets are at around 100,000. We also tried a couple of baselines on this data set, which I'm not going to go into in too much detail. We used standard baselines, Mask R-CNN for those who are familiar, which we slightly modified to predict not only the modal segmentation but also the amodal segmentation, and we wanted to understand how well those systems work in this case. We saw that they perform reasonably well; in fact, the numbers are on par with segmentation results on data sets like COCO, which is encouraging, at least for the modal part. The amodal segmentation is obviously significantly more challenging, and there we are nowhere close; so going forward, that's something where we want to understand how we can develop better models. We also showed, and I refer you to the paper for details, that video signals, or temporal data, actually allow us to improve the models; so the hypothesis that I initially made, that we can exploit temporal information, seems to hold here. Here are some more qualitative results, which show that in some cases it works, but there are also still a lot of cases where it fails, particularly for the amodal part. So there's a big challenge ahead, and I would encourage all of you: please take a look, and if you're interested, try it out; we'd be very excited, and let us know if you have any questions. We also tried to run the model on real-world data. As I mentioned before, we can apply those models and transfer what they learned to real-world data. It works to a reasonable degree; obviously, there's also still room for improvement here, as this last video highlights as well. In some cases, particularly if the person is easily visible and has hardly any occlusions, it's very easy to get it right, but in many cases the model comes up with predictions of things that are actually not there. All right, with that, let me wrap up. What I talked to you about is a data set for semantic, amodal, instance-level video object segmentation. There's a rich set of ground-truth information that we have made available, you can leverage temporal information with it, and because the classes overlap with data sets like COCO, you can also test your models on real-world data. Thank you. Hi, hello, I'm Hendrik. I'm working here at the IBM Research lab in Cambridge, and I try to make full use of this town, also in terms of my collaborators. So whenever I say we, I don't refer to myself in a majestic plural; I actually refer to these colleagues, who really did the work. So I apologize, Antonio, I didn't include your photo yet.
On the next slide, it will be there. So what we do in the lab is, I think, a little bit adopting the Ig Nobel model: we're trying to create visual tools that make you play and then think. On the other hand, it's perfectly fine if you do the inverse: if you think first, then play a little bit, and think again. Here are a couple of projects that we're working on: visual debugging for sequence-to-sequence models; finding the best VAE that separates according to some metadata (you already saw my little demo in Antonio's talk); or a text summarization tool that tries to work with you from input to output, but also from output to input, which I think is a pretty fun idea. Okay, but today it's going to be about GAN antidotes, and since we couldn't decide if it's GAN with an A or GEN with an E, we used, I think it's a Danish letter, æ, to speak about visual GÆNtidotes. Antonio already highlighted it in his talk: these are faces from 2014, these are faces from 2017, and this is from last year. So there is a certain fear in the general audience when they see these pictures; they're like, ooh, that's becoming a little bit too realistic. On the other hand, on the text side, at the beginning of this year, a very open company working on AI said that they didn't want to release a large language model because the model was so powerful that it would do harm to humanity. And the question is, do we need to panic? I cannot answer this question for you; my feeling is, mildly, not yet. And I will tell you why: because I think we still have a large toolbox that can help with this kind of identification of whether something is machine generated or human made. I don't want to refer to it as fake, because fake has so many different flavors; let's talk about machine generated versus human made. And if you think about two different options for how you could tackle this problem: you could have a large industry effort, or an effort in general, to sign everything that humans make. Whenever you take a photo, you somehow add a signature that says, this is a photo, and this signature is destroyed whenever the photo is altered, et cetera; there have been attempts at doing this before. Or, you try to detect what is machine made. And I want to talk about that red pill today. It is about the text model that I was mentioning at the beginning. Just to fill you in on the memes and why it's called catching unicorns with glitter: the debate earlier this year started around a text that was primed with one sentence, and then the model was supposed to write two paragraphs around this sentence. And the original sentence was: in a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. So the unicorn became kind of a meme, and then of course we had to figure out a way to catch unicorns; you can consult the internet, and three out of four web pages recommend that you build a trap with glitter in it to actually catch the unicorn. Hence the name of the tool: GLTR, glitter. But let's talk about how you actually sample a sentence from such a language model. This is a very strong simplification, but it illustrates how it actually works. You start with the start token in the bottom left; you have some word embedding, et cetera, and then you make a prediction. Then you decide, over all possible events, which event you take: you choose a word, you feed the word back in, you get the prediction for the second position, you choose a word again, et cetera, et cetera. The question is: which word do you choose? At each point you have potentially 50,000 events at hand; if you think about an English dictionary, that's quite some options. And there are different strategies for how you actually make that decision. A greedy strategy would be to take the most likely event; a top-k strategy says you randomly sample from the top k events; and in beam search, you essentially evaluate at the next step which step you should have taken the step before.
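As a small sketch of those decoding strategies (illustrative only; `model` here stands for any autoregressive language model that returns next-token logits, not a specific API from the talk):

```python
import torch

def sample_next(logits: torch.Tensor, strategy: str = "top-k", k: int = 40) -> int:
    """Pick the next token id from a vocabulary-sized logits vector."""
    if strategy == "greedy":                   # always take the most likely event
        return int(logits.argmax())
    probs = torch.softmax(logits, dim=-1)
    if strategy == "top-k":                    # sample among the k most likely
        top = torch.topk(probs, k)
        choice = torch.multinomial(top.values, 1)
        return int(top.indices[choice])
    raise ValueError(strategy)

# Hypothetical autoregressive loop:
# ids = [start_token]
# for _ in range(n_tokens):
#     ids.append(sample_next(model(ids).logits[0, -1]))
```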
And you decide, among all possible words, which one you take: you choose a word, you put the word in again, you get the prediction for the second position, you choose a word again, et cetera, et cetera. The question is: which word do you choose? At each point you have potentially 50,000 words at hand; if you think about an English dictionary, that's quite some options. And there are different strategies for how you would actually make that decision. A greedy strategy would be to take the most likely word; a top-k strategy says you randomly sample from the top k words; and in a beam search you essentially evaluate at the next step which step you should have taken before. But what they all have in common is what I highlight here in boldface: they somehow use the most likely, or top-k, words. And this is something that, if you have the model, you can actually compute statistics about: can we detect, word by word, how likely it is that a text was sampled from a model? And this is what GLTR is about; a minimal sketch of this decoding loop, and of the per-word rank check, appears below. And instead of showing it to you theoretically, I'm taking the weird risk of doing a live demo. What you can see here is a text that was actually sampled by the very model that we use for testing whether a text is generated by a model. The highlight behind each word indicates whether the model can predict this word within its top 10: then the highlight is green. If it's within the top 100, it's yellow. If it's within the top 1,000, it's red. And if it's beyond the top 1,000, it's violet. No worries, you're not colorblind: this is the sanity check, because this text was generated by exactly the model that we use for testing, using its top-k strategy, so everything you see in green here is perfectly correct; you're not missing a color. Okay, let's, on the other hand, look at a text that a human wrote. This is from a New York Times article. You see immediately that the pattern looks inherently different. Especially the violet-highlighted words or subwords appear very surprising to the model: it's very unlikely that the model would predict this word at this position. And if you, for example, hover over "preservation" in this case, you see that "preservation" is prediction number 4,681 at this position. So it's really not very likely that you would choose it when you sample from the model. Much more likely is the word "the", "it", "a", "shoppers", whatever. And if you look at the top options, they are often also the quite boring choices, because the model learns a language that is pretty average in that respect as well. And then you can actually take the unicorn text. What you see is that in the first sentence there are some violet terms, but beginning from the second sentence, and this is the generated part, all violet terms disappear. So nothing is super surprising anymore. There are some surprising things, because the model that generated this text is much more powerful than the one we use for testing, so it can actually surprise the smaller model, but not in the heavy way that humans do. So the humans are still like the larger model in that respect, the more surprising model.
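To make that decoding loop concrete, here is a minimal sketch of top-k sampling in Python, assuming the Hugging Face transformers package and GPT-2 as the model; the function name, the default k, and the step count are illustrative choices, not the exact setup from the demo.

```python
# A sketch of the autoregressive decoding loop described above,
# assuming the Hugging Face `transformers` package and GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sample_top_k(prompt: str, k: int = 40, steps: int = 30) -> str:
    """Predict a distribution over the vocabulary, keep only the k most
    likely next tokens, sample one, append it, and repeat."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]   # scores for the next token
        top = torch.topk(logits, k)             # restrict to the top-k words
        probs = top.values.softmax(-1)
        next_id = top.indices[torch.multinomial(probs, 1)]
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])
```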
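And, under the same assumptions (reusing the GPT-2 model and tokenizer from the sketch above), here is a sketch of the per-word check that GLTR performs: for each word, compute the rank the testing model assigns it given the preceding context. The bucket thresholds mirror the green/yellow/red/violet highlights from the demo.

```python
def token_ranks(text: str) -> list:
    """For each word piece, the rank the testing model assigns it given
    the prefix (rank 1 = the model's top prediction)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                  # (1, seq_len, vocab)
    preds, targets = logits[0, :-1], ids[0, 1:]     # position i-1 predicts token i
    probs = preds.softmax(-1)
    p_true = probs[torch.arange(targets.numel()), targets]
    # Rank = 1 + number of vocabulary entries the model finds more likely.
    ranks = (probs > p_true.unsqueeze(-1)).sum(-1) + 1
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()),
                    ranks.tolist()))

def bucket(rank: int) -> str:
    """Map a rank to the demo's color scheme."""
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "violet"
```

Machine-generated text tends to land almost entirely in the green bucket, while human text shows the occasional red and violet surprises described in the demo.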
Yeah, and then of course you can do things just for the fun of it. "I dropped my pen in the mashed potatoes": this is a sentence I got from one of my colleagues, and if you look at it, "mashed" in this respect is very, very unlikely. It's much more likely that you say: I dropped my pen in the sink, in the kitchen, in the air, whatever. But surprisingly, and funny enough, "potatoes" following "mashed" is then very, very likely. So the model adapts pretty quickly to its context. And if you look at it: imagine you have 50,000 words, and out of 50,000 words, one word gets 80% of the probability. That's like, yeah, you should choose it. There are also mashed avocados, if you didn't know about that. The other funny thing is that you can make these kinds of side discoveries just by playing around. "I dropped my" whatever: you see, "I dropped my daughter", "my kids", "my son" is much more likely than "my pen". But also "I dropped my phone". And I admit to my age, but in the 80s, nobody dropped phones, or it was at least not statistically relevant. So just by thinking about that, you might actually infer that there is a little bit of a bias in the text: probably it's text from past the 80s. So the discoveries we can make with these tools are not always super serious. On the serious side, we tested this: we took 50% generated texts and 50% human texts, gave these to students, and said: five texts, you have to decide machine or human, 90 seconds. And we came up with an accuracy of 54%. Chance is 50%. So this was kind of scary. Then we explained to them what the highlights in the back mean, repeated the experiment, and at least got a little bump up to 72%. Still a little bit on the scary side. Okay, so what we were able to do is go from this headline to adding this headline, which is nice in terms of outreach. Time is short, so: the good thing about living in Cambridge is that there are good students around everywhere. This is Max, a student at Northeastern. He trained an MNIST GAN and trained an inverter for the GAN. And when he enters digits, it more or less reproduces those digits, or at least doesn't completely deviate from them. If you put letters into the same inverter-GAN network, for example an N here, the reconstruction is not really the N. Or if you take an A, it actually, spoiler alert, becomes more like a four, et cetera. So if you calculate the difference between input and reconstruction, you can at least say that the input might not fit the model as well; a sketch of that check follows.
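A minimal sketch of that reconstruction check, assuming pre-trained generator and inverter (encoder) networks for MNIST; the module names and the threshold are hypothetical stand-ins, not the actual models from the demo.

```python
import torch

def reconstruction_error(x: torch.Tensor,
                         inverter: torch.nn.Module,
                         generator: torch.nn.Module) -> torch.Tensor:
    """Map an image to latent space, regenerate it, and measure the gap.
    Inputs far from the GAN's training manifold (e.g. letters fed to a
    digit GAN) tend to reconstruct poorly, so a large error is a hint
    that the input is out of distribution for this generator."""
    with torch.no_grad():
        z = inverter(x)          # image -> latent code
        x_hat = generator(z)     # latent code -> reconstructed image
    return ((x - x_hat) ** 2).mean(dim=(1, 2, 3))  # per-image MSE

# Usage (hypothetical threshold, to be calibrated on held-out digits):
# flags = reconstruction_error(batch, inverter, generator) > 0.05
```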
Okay, the last question I wanna quickly address: is human-AI collaboration really needed in the future? Because a lot of things we can automate. And the little point I want to make: think about the attack-defense loop. So here you have a plane going from a classification of "clearly dog" to a classification of "clearly flower". What adversarial attacks try to do is essentially push the dog onto the flower side, with as little energy as possible. And what we normally try to do with defense mechanisms is make this distance larger, so that you need much more energy to push the dog over the border; make these things separate better. But there's a sweet spot in between, the white spot, where you do not exactly know yea or nay. And this is where humans in the loop probably still help, also because there are some examples where you really cannot make a clear distinction whether it's a dog or a flower. And with that, thank you very much. Okay, great. So now we're at the Q&A section of the talk. If you haven't already, please do enter your questions on Slido or upvote other people's questions. At some point we'll show the questions on the screen and be able to ask them. Let's see, maybe... it looks like these might have been from some of the previous talks, are they? Maybe you can try. Yes. No? Okay, there, oh, they're both. Okay, all right. One is hard. All right, well then let's go ahead and get started on this. So: what type of hardware innovation will drive the improvement of AI systems in the next few years? Anyone like to answer that one? That's why I was thinking it may have been from a previous session. Does anyone want to answer that? No? I can certainly say things. Yeah, this may be from a previous one, but certainly from my perspective: where my work sits is, I do a lot of work in programming languages, programming language design, and compilation, so that means I sit between software and hardware and do a lot of interaction with hardware. And some of the emerging directions I've seen that I think have been very, let's say, promising, successful, and well-explored so far are around domain-specific accelerators for the variety of machine learning models that we're actually seeing these days. At that level, domain-specific design and specialization is the trend that you're seeing from many directions in the architecture community right now; you're also seeing people like John Hennessy and David Patterson giving talks on their perspective on how domain-specific design is going to drive new directions, both in machine learning, where we're getting better performance, and on the hardware side, in how future architectures should be handled generally, right? And then specifically within that, I think there is a wide variety of different techniques, things like quantization, pruning, approximation, looking at sparse structures. These are all types of optimizations that are just not well-covered in the existing architecture literature, and I think there are significant opportunities to look in those directions. Okay, great, thanks, Michael. Anyone else want to add to that? Otherwise, we'll move on. So I see there's a question for Hendrik, right? Specifically: can't any detector for GAN-generated content be used as a discriminator for a new GAN? You guys can't see it, that's the reason I'm reading it off. It's down towards the bottom, always. Oh, it's not from... it has a z in there, but okay. Okay, that's a good question. I don't want to reach out too far, but I think... I'll tell you what, shall we come back to you on that? Yeah. Okay, that's not a problem. Okay, I do actually also think this one on introspection is from a previous section, but would anyone like to answer the one on introspection? If not, I'm going to go down to the one for Alex. Basically, you know, computers don't know what they don't know, and is any of your work... I think that was from Yoshua's talk, so I'm going to keep going. Let's see, so one went away, okay, that's interesting. Okay, so this Slido app worked a lot better last time; we're having a little bit of trouble with it this time. So this is a general question: what are the most near-term, exciting, practical use cases coming in your field? Would anyone like to answer that? I think, Antonio, you and I are going to start asking the questions if we can't get volunteers. So you want to answer the first one? Oh, the first... oh yeah, I went to the bottom, I didn't see that one. How do you propose to target anticipation tasks based on your data?
So I guess there are two things that I have in mind there. On the one hand, given an observation, we can try to anticipate how that scene should look a few seconds from now; with video data that is easily doable. Now you might recall Yoshua Bengio's talk, right? He was saying we are not envisioning in pixel space, we are kind of envisioning in latent space. So these ideas apply there too; likely we're gonna come up with a system, and I'm kind of giving away some of the research that we're probably doing right now, that comes up with a latent-space object, and then from that latent-space object we try to reconstruct the scene. Now, I also briefly alluded to using pose data; I just briefly showed that on one slide. That's another approach that we are actually taking: you have the human pose at the current frame and you're trying to envision how that person is moving. What is that person doing in the next few frames, and how can that person move? So these are two directions that we're currently assessing, and we're seeing which one works better and why; these are the kinds of questions we are studying. I hope that answers the question. Okay, great, thanks. So another question here that we've got is: are the pruned networks somehow better for interpretability? That's probably for you, Mike. Yeah, that's actually a fantastic question. So my student Jonathan Frankle, who has been involved in that project, actually had a brief collaboration with David Bau, one of Antonio's students, on exactly this question. And what they've seen so far, in some preliminary results, is that if you apply pruning, you don't lose the interpretable structure; but whether or not it improves interpretability, we don't quite have the evidence there yet. It's definitely a question that we're interested in. One thing I would say, though: there's this whole other class of techniques where people are looking at, for example, how do I actually verify that my neural network is doing something reasonable? How do I verify, for example, that it's robust? And I think if you look at many of those techniques, a significant limiting factor is the size of modern networks. So I do think that things like pruning can actually make those techniques much more scalable going into the future, and perhaps these goals of interpretability and robustness can be much more in scope if we think about techniques like pruning. Maybe I'll ask a question that's just building on that a little bit. The work that you talked about is famous for maybe two reasons. One, it got best paper at ICLR; but the other reason it's somewhat famous is the enormous amount of compute that it took to run the experiments themselves, right? At least it was done on the IBM cloud, so that's why it's famous in IBM land, in terms of the amount of compute that it used. It was significantly supported by IBM, as many people know. So my question is: is any of your work, either in the approximation techniques or otherwise, aimed at helping with the experimentation part of research, as opposed to just the training of the networks? No, this is a fantastic question as well.
I mean, if you look at some of the work, or some of the processes, that are out there: Yann LeCun actually has this position paper, from a circuits conference last year, talking about the types of emerging hardware systems, software development systems, and just the methodological concerns that we might need to address for doing these experiments. And one thing that he separates out in that paper is the fact that while you're still doing model design, you perhaps don't wanna look at pruning, you perhaps don't wanna look at approximation, because you wanna make sure that for any given architectural decision you might be making, you are getting the fullest opportunity to see if that thing is successful. So I think there are real methodological questions about how people approach designing their networks, and whether or not approximation can actually be inserted from the get-go. I hope there are opportunities. I hope there are ways of thinking about making approximate decisions that are optimal subject to some notion of regret, or something along those lines. But these are very initial ideas. I think these methodological concerns are very much the questions that people like me, in programming languages and software engineering, really care about: how do you build software? And it just happens to be the case that that software now includes neural networks. So we are working on these questions, but we don't know the answer. Let's see, maybe another question to go across more of the panel. Earlier, Yoshua kind of said: hey, for the research that I'm doing in my lab, I'd really like to see some complementary research in the neuroscience field, for example. And this panel is interesting because we've got a mix of people who are doing the algorithms in the back end, as well as people who are really thinking about the human-interaction pieces of it. So I was curious to ask each of you: if you had the ability to influence some of the work that was happening on the other side of what you're doing, what would you ask the complementary research to be? I don't know, Julie, would you like to start? Let's see... so yeah, that's a great question. So much of the work in our lab has looked at how we can leverage insights about the cognitive science of how a person models another person, or learns relatively efficiently, to design structured learning models that improve the efficiency with which a system can learn about us. And from my perspective, I think there's a really interesting question around what it means to develop a predictive model, a behavioral model, of what is essentially an alien entity, right? We have millennia of experience of building up, sort of evolving, our ability to predict other people's behavior within a social context. And now we have systems that are alien to us in multiple ways. They don't look like people, which is one issue. But another really interesting issue is: when in our experience have we had to learn how to work with a system that's also learning online, but whose behavior can change overnight? Think about, for example, a Tesla and a software update overnight. How is it that we design these systems to bring us along a learning curve? And this is another real issue: there are studies by excellent people in the human-factors domain who are studying issues of mode confusion arising from software updates of autonomous cars.
And so, I think the studies we've done so far have been useful, but I think there are new questions about how we study these systems as, ultimately, a separate and very different entity than the systems we've had to model and understand how to work with in the past. Cool, thanks, Julie. Anyone else wanna comment on the complementary research you'd like to see? Sure. I think it's pretty widely appreciated that cognitive science and AI have had a tremendously fruitful relationship. Reinforcement learning grew out of essentially 1950s-era mathematical psychology, and we all appreciate the relationship between 1980s connectionism and deep learning today. But just as a piece of general advice, if you're looking for new inspiration: there's a century of research in cognitive science on mental representations, and that hasn't had an impact on AI in the way that reinforcement learning and connectionism have. So I think there are probably useful nuggets of gold hiding in that literature. And it takes some of the AI mindset to understand how to take these kind of vague, half-baked ideas from psychology and experimental research and formalize them in terms of deep and rational principles. So I think there's a huge opportunity out there for anyone working in AI to take these ideas that have been partially thought out and run with them. Yeah, that's excellent. Thanks, Chris. Anyone else would like to comment on that? Yeah, go ahead. Kind of from a different perspective: instead of trying to simulate what is going on in the human brain, the proposal here would be to say, some parts probably just need a human brain right now, a real human brain. And so the idea of visualization in the loop, or people in the loop, is not to try to simulate the brain, but to actually bring it in, by means of finding good interaction methods between models and humans. For tasks that are... again, if this recording is seen in 10 years, people will be like, yeah, we've achieved general intelligence; but right now I think there are plenty of tasks where you can say, okay, at least we can have a little bit of a corrective part in there. Imagine, for example, a GAN that is generating level maps for computer games. It might do a fine job, but it might just break at some point, suggesting a street where there's supposed to be no street, and having the ability to correct for these illogical things, I think, is still needed, again, in 2019. Yeah, thanks. Anyone else want to respond? If not, we can go to the next one. That's fine, okay. All right, I'm gonna mark off the first one just because we already asked that one. So now: humans learn through multimodal information, vision, depth, audio, continuous learning; what's the future of multimodal deep learning? Would someone like to talk about that, maybe Antonio? Yeah, I guess I'm the person to answer that one. Well, I think that is not really a question about deep learning; it's really a question of how you get this multimodal data, and, particularly when talking about sensory data, how it relates to how humans capture and learn from the world, because we don't actually have good multimodal data. There is a lot of effort that has gone into building cameras to capture the world. Microphones also, they are very good, but that's about it. There are no olfaction systems that are as good as the human olfaction system.
There are no tactile systems that are as good as the tactile system of the human. So we don't actually have rich sensory data out there that mirrors what humans actually do when they interact with the world. We have other types of multimodal data that relate to how data lives on the internet and so on, but not to how you interact with the physical world. So I think the future is about putting just as much effort into building those sensors as we have put into building cameras. That's one of the things. And the second thing is that once we have those sensors, there are actually no robots out there to carry them, because some of the robots that we saw in Julie's presentation were just wheels and a screen on top of them, and in the best case, you have a vacuum machine that just drives around. But the robots that we think of when we think of a robot, robots with arms that help us cook or do something useful at home, there are no such robots. This is why, when you say 30 million robots are in homes, everyone wonders: where are they? I haven't seen any of those 30 million robots. There's certainly nothing in my home, because you have to stretch the definition of robots to start counting things that we don't really think of as the robots that we really wanna interact with or that will help us. So I think the future of multimodal deep learning is not really on the deep learning side; it's on building the sensors and building the embodied devices that will actually carry those sensors and do interesting things that we like. Great, thanks, Antonio. We also asked the hardware innovation one, so if you could just mark that one off, and this next one as well. I think maybe what I'll do is take a modification of the one about getting physics into the simulation, physics in the real world. Maybe modify it a little bit, because I think that with a little bit of a modification it also applies not only to Antonio's work but to Julie's work as well as Alexander's. And that is: we've certainly seen, with GANs, that you can incorporate into simulators better representations of what the real world looks like visually, right? But physics, semantics of the real world, how people actually work: is there any work going on to try to bring that better into the simulations that could be used for training these systems? Do you want to say something? I think you can go ahead. Yeah, so we are doing some work in that direction, and there is a lot of work out there trying to incorporate more structure; for instance, GAN-based models, or neural networks where a few pieces make certain properties of the physical world very explicit. So for instance, you can have a dynamical system that tries to predict the future: there is a network that will encode or capture some properties of the stimuli, but then there is a dynamical system inside that looks very much like the traditional dynamical systems that people have been using, making predictions over this representation that has to be learned. You don't know what the appropriate representation is, but you know what the dynamics should look like. So you incorporate that, and through this bottleneck the system learns to discover a representation that is compatible with the dynamical system, in order to make predictions.
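As an illustration only, here is a minimal sketch of that pattern in PyTorch: a learned encoder and decoder wrapped around a hand-specified (here, linear) dynamical system in latent space. All module names and sizes are illustrative assumptions, not taken from any specific paper mentioned in the talk.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    def __init__(self, obs_dim: int = 64, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        # The hard-coded structural prior: the next state is a linear
        # function of the current state, as in a classical linear
        # dynamical system.
        self.dynamics = nn.Linear(latent_dim, latent_dim, bias=False)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, obs_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        """Predict the next observation from the current one."""
        z = self.encoder(obs)         # learned representation
        z_next = self.dynamics(z)     # fixed-form dynamics in latent space
        return self.decoder(z_next)   # back to observation space

# Training on (obs_t, obs_t+1) pairs forces the encoder to discover a
# representation that the simple dynamical system can actually propagate.
```

The hard-coded perspective-projection layer mentioned next plays the same role: a fixed, known transformation inside an otherwise learned model.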
There is also work on, for instance, using 3D representations: you know that the world is 3D, so the representation can generate images by having a layer inside that really knows about perspective projection, and that's just hard-coded into the system. People have shown that these systems can learn some really interesting representations of the world and discover 3D structure, when the only thing that you gave them was the fact that there is perspective projection in the rendering engine; you didn't specify the perspective of what, that has to be discovered. So yeah, there is a lot of interesting work coming up in this mixture of neural networks with a little bit of hard-coded physics. Anyone else like to talk about that? I guess another work to highlight in this direction is the simulators that people use; many of them now also incorporate physics. So the question then becomes, and I'm not aware of anyone having really studied this, but I could be wrong: what type of physics do those systems then learn, and how do they capture it? I think there is definitely a lot of exciting work going forward in this direction, like Antonio mentioned, but I would still say that we are at an early stage in this area. Okay, great, thanks. Yeah, I would say two terms to look out for, coming from the systems, software engineering, and programming languages communities: the concept of software 2.0, and also differentiable programming. What they're trying to do is this: I specify some part of my system using normal, traditional code that I can reason about, and then there's another part of my system which is a neural network. And I sort of fuse them together, and that allows me to do end-to-end learning for some given task, but there's some part of it that represents fundamental structure in the problem that I can ideally reason about. And I'd just point to Julie's LTL, for example: LTL is a very logical way to specify and think about that process, but you can think about richer expressions of how you specify these logical properties using just regular code. I mean, we take model-based approaches to engineering large, complex systems. The ability to bring that in, to use the right representations and the right models for a particular application, and to be able to augment that with learning from data, is just extraordinarily powerful. Okay, cool, thanks. You can see this question on use cases, right? I guess people are really asking, maybe a little bit more near-term, what might be the use cases for your work; what can we watch out for, as Michael was saying? Any comments, Hendrik? Yeah. I just want to highlight that there are also impractical use cases that come out. Okay. Which can still be fun. So, part of the lab is also trying to do some artwork, and I think these are probably considered impractical in a technical sense, but can still be interesting. We also already have some ideas, in collaboration, for how to use these tools not just to make decisions, but to help generate new ideas, or at least new visuals that might inspire; so to help inspire people, so to speak, to be more creative, or take them to places they are not familiar with, right? I've met a lot of people who say: I'm not creative, I'm an engineer. And I'd like to challenge that a little bit.
Sometimes you're not creative because you feel you don't have to be. Yeah, so these tools could change the game, that's what you're saying; suddenly people could be much more creative. That's an excellent point. Anyone else want to talk about use cases? I guess one thing I can briefly mention, thinking about my work and this anticipation line: I actually just got an email, maybe two days ago, from someone mentioning to me that they are looking into trying to prevent accidents in working environments. So they're trying to develop models that look forward and see: could there be an accident happening, like two seconds, five seconds from now, and then warn people about it. That's very cool. Which, admittedly, I hadn't thought about; when I first got the email, I was like, this is actually an interesting thought. It's a very interesting thought. I mean, how far away do you think we might be from starting to give people that? Because the nice thing is it doesn't have to be exact, right? It just needs to give them some insight. So do you feel like that's something that's a couple of years out, or much further? I'm not directly working in that field, so I don't feel like I'm qualified to answer that question. But I would argue that is one use case. I can also see similar work in the direction of, for example, the elderly: trying to detect whether they are tripping or falling down. I think there is work in this direction as well. So there's a variety of work that tries to go in this direction. How far away we are on this: yeah, it's hard to say; I wouldn't want to anticipate that. Okay, great. Is the project from a third-person perspective or from a first-person perspective? I don't know the details; I literally just got the email, and I need to get back to it. Okay. One could be like a tool for Schadenfreude, the other could be... Sure. Okay, great. So we are gonna go to a break now. We're having a shortened break, about 10 minutes, so if everyone could be back here at 3:45; there are refreshments outside. So let's thank the speakers, please.