Good morning. My name is Brandon Rohrer. I'm from Sandia National Laboratories in Albuquerque, New Mexico, and I'm going to be talking about self-programming in the context of an implemented architecture that does both feature creation and general reinforcement learning.

Most robots today have one thing in common: they work well in a very structured environment, that is, with a known model, or in environments and tasks that are easy to create models of on the fly. My goal is to enable robots to work in complex environments that are very difficult to model, pursuing arbitrary goals. More loosely speaking, I want them to do everything that I can do, including cleaning up, fighting my battles, building a robot, giving a talk, and interacting with other humans. The term that I put on this is natural world interaction. This is a specific subset of the goals within the broader AGI community. You'll notice I didn't say anything about biological plausibility. My bias is that the most likely way to do this is by mimicking how the brain does it and how people do it, but if I can figure out another way to do it, I'm perfectly satisfied with that.

The current state of my work in this area is captured in a Brain-Emulating Cognition and Control Architecture, BECCA, and you can see it in block diagram form here. There is an unsupervised feature creator in tandem with a reinforcement learner, and the whole thing is set up in a reinforcement learning paradigm: observations are taken in, actions go out, and there's also a reward signal. BECCA is set up to make very few assumptions about the tasks that it performs or the hardware that it's implemented on. One of the few constraints on this interface is that all the inputs and outputs are vectors of real values between zero and one. Each one represents the presence or absence of an attribute or of an action. You can think of them as binary, but they can be scalar valued. I just want to point you right now to the website at the bottom of the screen, sandia.gov/rohrer. All of this information, papers, code, videos, is there. Feel free to pull it up anytime.

What I'll do is walk through the pieces of BECCA and then give some examples of tasks that I've applied it to so far. The unsupervised feature creator and the reinforcement learner were both created in my lab. The reason is that I wasn't able to find anything that does what I needed them to do. I expect that someone, probably someone in this room, will create better solutions for both at some point, at which point I'll happily replace them, and I'll describe what I need them to do in just a second.

The feature creator takes in observations and separates the input vector into groups, into subspaces, based on the inputs' correlation with each other. Inputs that tend to be correlated are broken down into subgroups, and within each of those groups, commonly observed patterns are defined as features. Then at each time step, as the inputs are processed, the inputs vote on which feature is active at that particular time. There's a winner-take-all process, and within a given group, one feature comes out as the winner. Those winners are all combined and passed on to the reinforcement learner. These feature activities are also real values between zero and one; they look exactly like inputs.
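[Editor's note: to make the winner-take-all step concrete, here is a minimal sketch in Python, the language the project is porting to. The function name, the voting rule (projection onto stored unit vectors), and the toy usage are illustrative assumptions, not BECCA's actual code; the real implementation is at sandia.gov/rohrer.]

```python
import numpy as np

def winner_take_all_activities(inputs, groups, prototypes):
    """Within each group of correlated inputs, stored features compete
    and one wins; its activity is reported as a real value in [0, 1].

    inputs     : 1-D array of reals in [0, 1]
    groups     : list of index arrays, one subspace per group
    prototypes : per group, a list of unit-vector features
    """
    activities = []
    for members, protos in zip(groups, prototypes):
        sub = inputs[members]
        norm = np.linalg.norm(sub)
        if norm == 0.0 or len(protos) == 0:
            activities.append(0.0)
            continue
        # Each feature's "votes" come from projecting the group's
        # inputs onto it; the best match wins the group.
        votes = [float(p @ sub / norm) for p in protos]
        activities.append(min(1.0, max(0.0, max(votes))))
    # Feature activities look exactly like inputs: reals in [0, 1].
    return np.array(activities)

# Toy usage: two groups over a six-dimensional input vector.
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
prototypes = [[np.array([1.0, 0.0, 0.0])], [np.array([0.0, 1.0, 0.0])]]
x = np.array([0.9, 0.1, 0.0, 0.0, 0.8, 0.2])
print(winner_take_all_activities(x, groups, prototypes))
```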
These feature activities are fed back, concatenated with the input vector, and become inputs to the grouping and feature creation process again. This allows for the creation of hierarchical features, increasingly complex features over time. It's entirely unsupervised, driven only by frequency of observation.

The desirable characteristics of this are that it's driven by experience; it's all incremental and online, which makes it suitable for use in a behaving robot, a constraint on all of this work; it continues learning lifelong, with no separate training phase; and it can handle high-dimensional input spaces well, because it breaks them down into subspaces before it begins processing them, so it doesn't get bitten very hard by the curse of dimensionality. It's biologically motivated, which I won't bother to argue right now, but it's fun to talk about over beer. It doesn't assume very much about the task or hardware, and I want to emphasize that it works with multiple and mixed sensor modalities. BECCA doesn't know anything about what type of sensors produce these inputs: whether they're pixels representing whiteness, FFT-processed audio, a bump sensor, a radiation detector, or the presence of a certain concept as determined by some pre-processing step.

The reinforcement learner also has some unique attributes. The feature activities from the feature creator are passed in at each time step. The first thing that happens is that those are associated with the reward at that time step: features that are consistently present when a reward is received are noted. Also, there's an attention step, where one feature is selected to attend to based on its salience. That attended feature is combined with decayed versions of recently attended features, so you get a short recent history of what's been attended to. The currently attended feature and that recent working memory are combined in a model of the environment. What that allows you to do is note transitions: if you observed A, B, and C, and then D followed, that transition is an example of one entry in the model. You can think of it like a tabular form of a first-order Markov transition model, where you can extract the transition probabilities from the statistics of your entries. Using that model, you can look at your current working memory and make predictions about what's likely to happen next, conditional on taking various actions. Based on that, the action selection block looks at likely outcomes, the actions associated with them, and the rewards associated with them, and chooses either a greedy action or an exploratory action, which it then sends out to the world.

The attributes of this reinforcement learner that are desirable are that it is also incremental and online; it learns which features are rewarded and which are not; and it continues to learn over its lifetime, with no separate training phase. Its behavior is driven entirely by the task at hand, but it assumes nothing about that task to start with.

The status of BECCA development right now is that all of the code is written in MATLAB, and it's all downloadable. I think the most recent version is a couple of weeks old; new versions go up usually every couple of months. It is research code, I'll just leave it at that, but I do try to make sure that it works. You can download it, pull it up in MATLAB, and run some of the examples that I'm about to show you.
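[Editor's note: before moving to the code organization, here is a heavily simplified sketch of the learner loop just described: reward association, attention, decayed working memory, a tabular transition model, and greedy-or-exploratory action selection. The class name, the salience rule (largest activity), the working-memory discretization, and the epsilon-greedy stand-in for the explore/exploit choice are all assumptions for illustration, not BECCA's actual design.]

```python
import numpy as np
from collections import defaultdict

class SketchLearner:
    """Toy version of the learner described above."""

    def __init__(self, n_features, n_actions, decay=0.5, epsilon=0.1):
        self.n_actions = n_actions
        self.decay = decay            # working-memory decay per step
        self.epsilon = epsilon        # exploration probability
        self.memory = np.zeros(n_features)      # decayed attended history
        self.reward_map = np.zeros(n_features)  # running reward per feature
        self.model = defaultdict(lambda: np.zeros(n_features))
        self.prev_key = None

    def step(self, activities, reward):
        # Associate reward with the features active right now.
        self.reward_map += 0.01 * activities * (reward - self.reward_map)
        # Attention: here, salience is just the largest activity.
        attended = int(np.argmax(activities))
        # Log the observed transition as one entry in the model.
        if self.prev_key is not None:
            self.model[self.prev_key][attended] += 1.0
        # Working memory: a decayed history of attended features.
        self.memory *= self.decay
        self.memory[attended] = 1.0
        context = tuple(self.memory > 0.2)  # crude hashable context
        # Predict the expected reward of each action from the model,
        # then act greedily or exploratorily.
        if np.random.rand() < self.epsilon:
            action = np.random.randint(self.n_actions)
        else:
            values = np.zeros(self.n_actions)
            for a in range(self.n_actions):
                counts = self.model[(context, a)]
                if counts.sum() > 0:
                    values[a] = (counts / counts.sum()) @ self.reward_map
            action = int(np.argmax(values))
        self.prev_key = (context, action)
        return action
```

One consequence visible even in this toy: the transition counts in `self.model` survive any change to `self.reward_map`, which is the model-based advantage over Q-learning mentioned later in the talk.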
The code is separated into agent code, which determines the behavior of the feature extractor and the reinforcement learner, and task code. There are separate folders, each containing a different task that you can select between. The ultimate plan is that individual users can download this and then create their own tasks, modeled on the ones provided, using the same interface.

We're in the process of porting to Python. This is motivated by a desire to interface with the Robot Operating System, ROS, which is becoming the closest thing the robotics research community has to a standard. Because BECCA is intended to be a very broadly applicable, embodied approach, we want to be able to get it onto as many robots as simply as possible. And if you've ever worked with hardware, you know that 98% of your pain is just getting the software to talk to the stuff, so something like ROS is really appealing.

In the remaining time, I'm going to go quickly over some of the simple online tasks, or sorry, simulated tasks, that BECCA has behaved on. I'll also mention that our robotics work will be presented tomorrow night at the demo session. There are two students sitting in the front row here, Sean Hendricks and Pate Broger, who have been working in my lab this summer and have worked hard on integrating BECCA with some robots. We brought those robots, and we'll be showing them tomorrow night doing some simple behaviors, but with this cognitive architecture under the hood.

This is an example, a one-degree-of-freedom task. There is a white image with a black bar in the middle. BECCA has a field of view, this square, and it can move it up or down. Depending on the position of that field of view, there's an associated reward, and you can see here that it gets rewarded for centering the window on the black bar. Again, it doesn't know anything about the structure of the task; it just exchanges vectors of inputs, actions, and rewards at each time step. What you can see here is that over the course of time, there's quite a bit of exploration that goes on while the groups are being formed, the features are being created, the reward map is being populated, and the model is being built. In this case, after about 5,000 time steps, you can see that learning begins to pay off: the position tends to cluster much more tightly in the region where the maximum reward is received, and you can see the average reward jumping at that time.

One of the benefits of this approach, as opposed to a lot of model-free reinforcement learning approaches, is that the model stays the same even if I were to suddenly change the task and reward it for moving to one of the extremes; it would keep all of its modeling information and would only have to relearn the reward map. That's different from Q-learning, for instance.

This shows the features that are created in this task. You can see that they look like a black bar at various positions within the field of view; there are 12 of them. And this is a representation of the reward map: it shows the reward typically associated with each of these features, and you can see that the features where the bar is essentially in the middle are rewarded most highly. This plot looks a whole lot like the task's reward plot, showing that it learned the reward map pretty accurately, and that the features it's using are useful in accomplishing this task.
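[Editor's note: for a sense of what one of those downloadable task files has to provide, here is a toy Python version of the bar-centering task. The class name, sizes, and one-hot encoding are illustrative; the only constraint taken from the talk is the interface itself: vectors of reals in [0, 1] in, actions out, plus a reward.]

```python
import numpy as np

class BarCenteringTask:
    """Toy version of the one-degree-of-freedom task: a field of view
    slides over an image with a black bar, and reward comes from
    centering the window on the bar. The real task files ship with
    BECCA; this exists only to show the interface."""

    def __init__(self, size=9):
        self.size = size
        self.bar = size // 2      # the bar sits in the middle of the image
        self.window = 0           # current field-of-view position

    def step(self, action):
        """action 0 moves the window down, action 1 moves it up."""
        self.window += 1 if action == 1 else -1
        self.window = int(np.clip(self.window, 0, self.size - 1))
        # Observation: the bar's apparent position inside the field of
        # view, encoded as a vector of reals in [0, 1].
        obs = np.zeros(2 * self.size - 1)
        obs[self.bar - self.window + self.size - 1] = 1.0
        # Reward peaks when the window is centered on the bar.
        reward = 1.0 if self.window == self.bar else 0.0
        return obs, reward

# One exploratory step: move the window up, read observation and reward.
task = BarCenteringTask()
obs, reward = task.step(1)
```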
Here are some examples of transitions from the model. If BECCA has this input and moves its frame downward, the black bar appears to move upward. The model is rendered here so that we can interpret it in terms of the task. Internally, BECCA doesn't know what it is; it doesn't know this is an image, and it doesn't know what it means. But when rendered for us, this is what it is storing, based on its experience.

Here's a two-degree-of-freedom version of the same task. Now it can move vertically and horizontally. You can see here that after an initial period where it learns the groupings, creates the features, populates the reward map, and populates the model, it begins to bring the position in close to the center and stay there, where it's maximally rewarded. Here the groups and features are a little more complex. Down here on the bottom you can see the groups it created; if I remember right, there are a little more than 50 of them. You can see that the pixels that tend to be correlated with each other also tend to be located next to each other. Within one of these groups, you can see the various features that were extracted; these are commonly occurring patterns within this first group of pixels. It's interesting to note that if you look carefully at this group and this group, for instance, these are actually combinations of groups that tend to be correlated. So it's an example of hierarchical feature creation, similar to how in the human visual system we may create short line segments in V1, and moving to V2 and V4 and into the medial temporal area, those visual features become more complex and richer. This is designed to do the same thing. And here are examples of some excerpts from the model created in this task as well.

Here's a very simple one-degree-of-freedom task. Let's see if I can make it play. The dial can step right or step left. I promise it gets there every time. I'm going to skip that because it's not even the best one. Let's see if we can get this one in here. There we go. So here's an example of a two-degree-of-freedom task. Each of these robots can move the elbow and move the shoulder, and the goal is to get the hand on that little target. You can see that initially this is like a newborn: it has no idea what motions produce what changes in its sensory input. It wanders until it stumbles onto the target, then restarts and does it over again. In this case, it's been learning for a little while, and it moves haltingly but more or less directly to the target. In this case, after additional learning, it moves nearly optimally to the target.
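[Editor's note: the hierarchical feature creation shown a moment ago comes from one simple wiring decision, feeding feature activities back in as inputs. A minimal sketch follows; `read_sensors` and `feature_activities` are stand-ins (the latter just pools adjacent pairs so the example runs), where the real step is the grouping and winner-take-all process sketched earlier.]

```python
import numpy as np

rng = np.random.default_rng(0)

def read_sensors():
    """Stand-in for one time step of raw sensor values in [0, 1]."""
    return rng.random(8)

def feature_activities(x):
    """Stand-in for the grouping and winner-take-all step; it pools
    adjacent pairs of inputs so this example is runnable."""
    n = len(x) - len(x) % 2
    return x[:n].reshape(-1, 2).max(axis=1)

# The hierarchical loop: each step's feature activities are
# concatenated with the raw inputs, so the next pass can form
# features over features (groups of groups), fully unsupervised.
activities = np.zeros(0)
for _ in range(1000):
    x = np.concatenate([read_sensors(), activities])
    activities = feature_activities(x)
```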
Here's an example with a seven-degree-of-freedom robot, based on a PowerCube robot we had in the laboratory. We started working on just gripping a little block, so it was exploring that one degree of freedom and what changes it observed in its sensory information when it gripped the block. We expanded that to a planar two-degree-of-freedom, actually three-degree-of-freedom, gripping task, and it was able to learn to move to and grip the target pretty readily. Then it was expanded to a full seven-degree-of-freedom task. (This video's codec is unavailable, unfortunately.) Initially it wandered in an exploratory way and was not able to complete the task at all. Then, after starting it close to the goal so it could explore and stumble onto the block, it became successful at that. It was moved back and moved back until finally (this video is also unavailable) you could start it at a random position far from the goal and it would approach and achieve the goal. Occasionally it would miss, but by then it had a good enough model that it would correct and get it.

As I mentioned, these are all simulation examples. The real goal, what we're really excited about, is hardware, and you'll be able to see the initial implementations of that tomorrow evening. It looks like there are just a couple of minutes left; I'd love to take a question or two. Yes?

[Question inaudible.] Right now it is a big matrix. There's an incremental estimate of correlation that gets updated every time step. It's still far from the slowest part of the code, but it could become an issue, and it's the type of thing that would be well addressed by high-performance computational tools, GPUs, things like that: big matrix operations. Yes? Can you say that again? How does it compare to... So this feature extraction method, because it's constrained to be online and incremental, is very dumb. It looks in the subspace of that group at, essentially, the unit vector that represents each feature, and if a new input comes in that's sufficiently far away from all of them, it adopts it as a new feature. It's a version of imprinting that's commonly used in vision processing, but its goal is not to represent the inputs in some optimal way; it's to spread those features in some uniform way across the part of the input space that's actually used. There are lots of regions of that group's space that it never gets to, so it doesn't have to create features to cover them. [Audience: Is it the growing neural gas algorithm?] I'm not familiar with the neural gas algorithm, so I can't comment on that, sorry. Yes, I would be interested in learning about it.

[Session chair:] I think if we're going to have a coffee break at all, we're going to have to pretty much eliminate the questions right after the talks, so we'll turn things over for the next talk.
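[Editor's note: the two mechanisms described in those final answers, the incremental correlation estimate and the imprinting rule, might look something like the minimal Python sketch below. The running mean of the outer product, the novelty threshold, and all names are illustrative assumptions; only the general shape, a co-activation matrix updated every time step and a distance test against stored unit vectors, comes from the talk.]

```python
import numpy as np

def update_coactivation(coact, count, inputs):
    """Incremental estimate of input co-activation, updated every time
    step. It is one big matrix, which is why GPU-style dense linear
    algebra would address it well."""
    count += 1
    coact += (np.outer(inputs, inputs) - coact) / count
    return coact, count

def imprint(prototypes, group_inputs, novelty_threshold=0.7):
    """Imprinting: within a group's subspace, adopt the current pattern
    as a new feature if it lies sufficiently far from every stored
    feature (each kept as a unit vector)."""
    norm = np.linalg.norm(group_inputs)
    if norm > 0.0:
        direction = group_inputs / norm
        if all(p @ direction < novelty_threshold for p in prototypes):
            prototypes.append(direction)
    return prototypes

# Toy usage over random binary-ish inputs.
rng = np.random.default_rng(0)
coact, count, protos = np.zeros((4, 4)), 0, []
for _ in range(100):
    x = (rng.random(4) > 0.5).astype(float)
    coact, count = update_coactivation(coact, count, x)
    protos = imprint(protos, x)
print(len(protos), "features imprinted")
```

Because features are only imprinted where inputs actually occur, the stored set covers the used part of the space and ignores the rest, which is the behavior described in the answer.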