Hello everybody and welcome to another talk in the Sussex Vision series out of World Wide Neuro. Thanks for being with us again this week; today we're hosting Tobias Delbruck. Tobi received his PhD from Caltech in 1993 in the inaugural class of the Computation and Neural Systems program founded by John Hopfield. He then worked on CMOS imager technology in a government lab, and after that on electronic imaging at Arithmos, Synaptics, National Semiconductor, and Foveon. Over the years he has founded three spin-off companies which support basic R&D on neuromorphic sensory processing. He has received multiple awards from the IEEE, the Institute of Electrical and Electronics Engineers, including from its Circuits and Systems Society, and was named an IEEE Fellow for his work on neuromorphic sensors and processing. He is now a professor of physics and electrical engineering at the University of Zurich, in the Institute of Neuroinformatics. Tobias, hello and thanks for being with us today.

Thanks for the invite. I really enjoy this series of World Wide Neuro talks. I went to the one from Simon Laughlin a few weeks ago; I thought it was a fantastic hardcore biophysics talk which I enjoyed very much. I always enjoy his talks, and I'm happy to talk to you today. The talk I'm going to give today is directed at neuroscientists, to teach them a bit about the industrialization of silicon retina cameras, cameras that work a bit more like the eye. This is a growing field which has gotten a lot of industrial attention over the last few years, but I just want to give an introduction and do quite a lot of demonstrations with this camera that I have here, this event camera. It's a chip that we designed in our group, and the camera is now sold as an R&D prototype by one of our spin-off companies, iniVation. I hope this is directed at people that don't know much about this technology, to try to stimulate some ideas about how the spike events that come out of this camera can be used for visual inference, and how they can inspire ideas about computing and artificial vision. And so the title of this talk is Silicon Retinas that Emit Spike Events. It starts out with a brief summary of how the earliest silicon retina mimicked the three layers in the biological retina. I'll just start the pointer here: the photoreceptor, bipolar, and ganglion cell output. Then I'll speak quite a bit about the spiking dynamic vision sensor silicon retinas. Here's a recording of one of them. The black and white dots that are drawn on a gray background are the actual spike events that are emitted during a particular time period, so it's like a 2D histogram of the spike events. You can see in this video that the silicon retina sees right through the sunglasses; Patrick is holding some sunglasses in here. Because these spike events represent brightness change, they kind of see right through sunglasses, since the contrast with and without the sunglasses is the same. Then I want to say some words about how this is somehow related to cortical computation in pyramidal cells that are sensitive to very finely timed input, in other words, pyramidal cells that are sensitive to fine correlations in their input. And then I'll say some words about the connection to deep learning at the end. So to go back to the very beginning, this is Misha Mahowald here and Carver Mead.
Carver Mead is the physicist turned electrical engineer who coined the term Moore's Law for Gordon Moore. After he got digital circuit design going, you know, the synthesis of digital circuits for chips, he became interested in neuroscience. He teamed up with the young biology student Misha Mahowald and they designed the very first silicon retina as a chip, around 1990. This is a photo from Carverland, from Rodney Douglas, who was one of the people that started our institute here in Zurich. This silicon retina got quite well known because it appeared on the cover of Scientific American, in this famous picture, "silicon sees a cat". This took about three days of work from Misha to capture from her own cat at the right moment. It's the output of the silicon retina; you can see the hexagonal tiling of the pixels here. If you look at the architecture of the silicon retina, it consists of a set of pixels. In each pixel is a photoreceptor that runs in continuous time and implements a kind of logarithmic scaling from photocurrent to voltage. Then there are circuits that kind of mimic the bipolar cells, which take the difference between the photoreceptor and a spatiotemporal average that's computed on a horizontal cell network. You can see the hexagonal mesh of resistors here, actually constructed out of transistors. So this is a kind of direct neuromorphic mimicking of the Kuffler three-layer retina that I'm sure most of the audience is familiar with. And here's a snapshot of Misha herself. You could scan the output from this chip onto a video monitor, a multi-sync monitor, and here's a snapshot, I think a Polaroid shot, that Misha took of herself. If we look at the pixel circuit, you see it's actually quite a few transistors. At that time they were using a phototransistor. This circuit here mimics the logarithmic photoreceptor. This element right here is a kind of pseudo resistor that makes the photoreceptor adaptive, so that it has higher gain for AC than for DC signals. And these circuits here implement the computation where the photoreceptor output is driven, with a transconductance set by the bias current in the amplifier, onto the resistive network. The combination of this transconductance and the lateral conductance of the resistive network sets the space constant of this edge-detection computation. And the geometric mean of the time constants of this RC circuit sets the time constant of the temporal low-pass filtering. So it did quite a powerful computation right at the focal plane, computing the spatiotemporal high-pass filtering that the Kuffler three-layer retina does. There's no spiking output; this is directly scanned output. Later on she added spiking output that would turn these signals directly into spike events at the pixels and send them out as addresses. Now, if you look at this pixel, what's the problem with it? If you zoom up on this little piece of the output of the retina, you can see the problem: there's a lot of mismatch between the pixels. It's just kind of a lousy camera output. Even at that time, around 1990, when the first CMOS image sensors were being developed, this was not really considered a very good output. And the problem is that the pixels are really big, and dopant fluctuation mismatch in the transistors that make up the circuit resulted in a huge fixed pattern noise, FPN. And so that's where things sat for a while.
You know, you had gigantic pixels and lots of mismatch in the output, so that the salt-and-pepper noise would overwhelm the signal in most real situations. Now I want to fast forward to something that really fired me up. It actually came after the development of our event cameras, but it's data from Botond Roska's PV mouse retina cell lines. I'm sure a lot of the audience is familiar with Botond, who's sitting over in Basel. He discovered, while he was doing a postdoc with Frank Werblin at Berkeley, these amazing PV ganglion cells in the mouse retina. I think there are six of them, but here I just show three. This is showing spike raster responses to stimuli that look like this: black and white dots of increasing size on a gray background. What you can see here is this PV ON cell; this is actually an ON response right at the very beginning of the dot, and the response gets smaller as the dot gets bigger. This is an ON-OFF cell that seems to respond at both edges, but only for really tiny dots; it doesn't respond at all to big dots. This is an OFF cell, actually it's flipped around here, an OFF cell that has a response that seems kind of similar to the first one. But what struck me about this, and I don't know if you see it too, is that it's just completely dominated by the transient response. There's hardly any sustained response at all. It only cares about the changes in the visual input. And so that inspired, in a post hoc way, the DVS pixel that we came up with, the dynamic vision sensor pixel, which again mimics the three layers of the biological retina, photoreceptor, bipolar cell, ganglion cell, but it completely leaves out any lateral network of horizontal cells and amacrine cells. It still turns out to be useful; it's a useful abstraction. So you start with a photoreceptor that continuously converts the light intensity into a photocurrent. Then there's a set of little transistors here, maybe three or four, that continuously turn this into a logarithmic voltage. Now what we want is to output spike events that represent change in this log intensity voltage. The voltage by itself has tons and tons of pixel-to-pixel mismatch, so it's not useful to output it directly. But we can put a capacitor in series to block the DC signal and follow it with a change amplifier, which is a kind of abstraction of the bipolar cells. Every time this pixel sends out an event, we close the switch here for a moment, a microsecond or so, and that resets the voltage at this point. Now we open the switch, and any change in the photoreceptor output is reflected as a change on this floating input node. We amplify that change with a gain of about 20, and we send it into two continuous-time comparators. At the reset point, the voltage coming out of this change amplifier is sitting right here. If the light intensity increases, the voltage goes towards the positive; if it crosses the positive threshold, the pixel emits an ON event. Otherwise, if it crosses the negative threshold, it emits an OFF event, and then we memorize the new log intensity. So what we get from each one of these DVS pixels is a stream of these plus and minus delta-log-intensity brightness change events. The threshold here is just a global threshold that's nominally identical for every pixel.
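Since the behavior just described is easy to state in a few lines, here is a minimal behavioral sketch of one DVS pixel in Python. It models only the idealized log-and-threshold event generation, not the analog circuit, and the threshold value and variable names are purely illustrative.

```python
import numpy as np

def dvs_events(log_intensity, timestamps, x, y, threshold=0.2):
    """Minimal behavioral model of one DVS pixel (not the analog circuit itself).

    log_intensity : samples of log photocurrent for this pixel
    timestamps    : sample times in microseconds
    threshold     : nominal global contrast threshold (delta log I per event)
    Returns a list of (timestamp_us, x, y, polarity) tuples, polarity +1 (ON) or -1 (OFF).
    """
    events = []
    memorized = log_intensity[0]              # level stored at the last reset
    for t, value in zip(timestamps[1:], log_intensity[1:]):
        # emit events until the memorized level is within one threshold of the signal,
        # mimicking the reset of the change amplifier after each event
        while value - memorized >= threshold:   # brightness increased: ON event
            memorized += threshold
            events.append((t, x, y, +1))
        while memorized - value >= threshold:   # brightness decreased: OFF event
            memorized -= threshold
            events.append((t, x, y, -1))
    return events

# Example: a pixel seeing a step increase in brightness emits a burst of ON events
t = np.arange(0, 1000, 100)                                    # microseconds
logI = np.log(np.concatenate([np.full(5, 1.0), np.full(5, 3.0)]))
print(dvs_events(logI, t, x=10, y=20))
```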
And so that's our extremely simple abstraction of the biological retina, which leaves out all the complexities of the real retina, but which makes it affordable to build in a small pixel area. So now I want to do a demonstration of this, of the fact that the data is a sparse, asynchronous stream of digital brightness change events that mimics the transient output of the eye. So I have, I don't know if you can see it on camera, one of these DVS cameras right here. It's a USB camera that takes the output from the silicon retina and sends it over to the computer. And now we can play around with the data, we can look at it, visualize it, and run algorithms that process its output. So if I switch over to that output, you see here, right now it's not showing anything at all, right? In fact, I have to unplug it and re-plug it. Okay, so now it's connected. Do you guys see that? Is that fine? Okay, if I don't move, I don't get any output, right? Because it's only responding to brightness change. If I move around, what you see is a kind of picture of what the retina is seeing, me in this case, where the pixels that are gray produced no spike events, and where they're white or dark, it means they produced ON or OFF brightness change events. And maybe the best way to use it is to point it at a tangent screen. It's just a closet I have here, a dark wall. On that wall I'm going to take this bright bar and make a kind of Hubel-Wiesel edge stimulus for the retina, so you can clearly see it, right? I'm not sure if you can see that, but at the leading edge of this white bar on a dark screen, which you can see right here, you should get ON events. So you see the leading edge of the bar is white, and the trailing edge is dark, right? Because the pixels are getting brighter and then they're getting darker again. And the cool thing about a spiking silicon retina like this is that we can actually go into the cells in this retina and probe them. So we'll just select this pixel right now, and let's listen to it. Hear those spikes? We can go into any pixel and just listen to it. So that's the basic characteristic of the output. If there's no motion in the scene, there's no output from the pixel. What's coming out of the camera is just a list of the addresses of the pixels and the timestamps of those brightness change events. If we just aim the camera around at a scene like the inside of this office, you can hear that the spiking is quite sparse. If the pixel passes the white wall, there's no output at all. But as soon as it passes something that's moving, like my mouth or some feature or the monitor, we get lots of spike events. Okay, so what is that good for? Well, one thing it's really good for in machine vision is that, because of the way the pixels work, they're really quite fast; they can respond at kilohertz rates. So we can take something really fast moving like this. I'll show it here. We can take this SpongeBob disk, which is just a spinning dot. You see this dot? It's just a modified fan that Daniel found. If we spin it up like this, a normal camera just blurs it out completely; you can't see the dot. Now I'm not going to make the recording right now, but what I did was I took this and showed it to the retina just before we started. In fact, I can do it right here. It's not really in focus. But you see, if I share the screen again, yeah, thanks. I mean, if I now share the screen, I hold this up in front of the camera.
And you can see that it's responding to this edge. In fact, every time the edge goes by this pixel that we're listening to, we should hear a spike event. Is that clear? Can you see it? Okay, good. Now if I play back that recording, I think it's this one right here, it's just quicker because I can see it. Now, this is more or less real time; in fact, it's about a tenth of real time. If I play it back in real time, this is what it looks like. But now, because we've recorded the spike events at microsecond timing precision, we can slow it down to an equivalent frame time of like 180 microseconds. You can see the leading edge of this black disk is still making a nice sharp edge. The trailing edge is not so sharp, but that's a matter of adjusting the parameters of the pixel, the bias currents of the pixels, so that we remove this resonant response to the increase in brightness at the trailing edge. But now we have an equivalent frame rate of about 10 kiloframes per second, right? The frames themselves are 100 microseconds long. So we now have the freedom to look at this data at any time scale we like. In fact, we can look at it in space-time. At a certain time scale, if I pause it here, let me just go to another place here and make the recording a little bit longer, hold on a moment, I'm trying to freeze at the right point. Yeah, at a certain point you should see kind of a helix in space-time. Each one of these dots here in space-time is one of these spike events. So that's the basic demonstration of the sensor output. Now, you can use this thing in the real world. For example, if I play back this recording, taken at our institute, it's a recording of just walking through; there's some problem here with playing this back. So I can now slow this down at any moment, down to a few hundred microseconds, so you can see how the people here are rendered as we flow through the environment. It kind of makes you think that this output might be good for stuff like optical flow, things like that. Yeah, okay. That's the first demonstration. Okay, now I should point out that some of these event cameras also include a regular frame output. So if I turn that on, using the same photocurrent, you know, the same photodiode, I can now integrate that photocurrent over time and get a normal picture too. So here's the frame output from the camera. I can change the color scheme for the events, so they're red and green. And you can see the brightness change events, I'll slow down the frame rate a bit, you can see the brightness change events are going ahead of the frames, right? At the same time that the frame is being collected to be read out later on, you're still continuously getting these brightness change events. So when the disk in the normal frame is completely blurred out, you're still getting sharp output from the brightness change event stream. In fact, it's quite cool. If I stop down the lens here, right now you see the exposure time has been automatically set to about two milliseconds, because the room is quite brightly illuminated by the sun coming in through the window, even here in Zurich; it's sunny today. But if I stop down the lens, I can make it really dark. So I'll make it really, really dark by closing the aperture on the lens. And now, in order to get a decent picture, I have to increase the exposure to maybe 100 milliseconds or so.
Now if I move my hand, without the events you can also see the dirt on the lens there that's coming into focus. If I move my hand now, it's completely blurry because of the long exposure time of the frame that's necessary to collect enough charge to get a decent output voltage. But because of the design of the DVS pixel, the DVS event stream is still very sharp, even when the normal gray frame is very blurry. So normally I would just take questions here to help understand what's going on a bit better, but I hope that's helpful for understanding more about how this DVS pixel works. I'll come back to that in a minute. Okay. Let me go to normal rendering mode. Okay.

So here's the second demonstration, which I think is really quite cool. I still think it's one of the coolest aspects of the camera. Everybody in the audience is well aware that we have a vestibular system. In each ear, we have these canals that give us head rotation and head acceleration information. Now, because of the vast production of smartphones around the world, there's been the development of these MEMS, microelectromechanical systems, accelerometers and gyroscopes. This is a six degree of freedom inertial measurement unit, which detects rotation of the sensor and also acceleration. It has a sample rate of about a kilohertz, it burns a few milliwatts, and it costs less than $3. And we've included this kind of vestibular sensor on the back of each of the cameras, right behind the chip. So now that gives the camera a vestibular sense as well. What is that good for? Well, one thing we can do with it, which is really cool, is to electronically stabilize the output. In biology, the eye is constantly mechanically stabilizing the input. And in fact, if you lose your vestibular sense, you know, if you lose the hair cells in the vestibular system because of an overdose of antibiotics or something, you basically can't see while you're walking around. Your vision is just too blurred because the eyes are not compensating for the motion. Or so I've heard. But in the silicon retina, we have the advantage that electrons are very, very fast compared to biological ions. So the next demonstration I want to show you is how you can do electronic vestibular stabilization of the retina output using this IMU. I'm going to first demonstrate the inertial measurement unit gyro output. I've now turned on the inertial measurement unit, and this purple vector, which I'm going to zoom up on here, is showing you the rotation rate of the sensor in degrees per second. So if I rotate it left and right, you can see this purple vector here, the rate gyro, is telling you that I'm rotating this way. If I rotate it up and down, so I change the tilt, I get up and down. And this top one here is telling you the roll of the sensor, how much it's rolling around its lens axis. So what is that good for? Well, now imagine that I could take the spike events and, spike event by spike event, compensate for the rotation of the camera. In other words, I can take this rate gyro output, integrate it over time, and then compute from that a translation and rotation matrix that I can apply to each XY spike address. And that's what I'm going to demonstrate here with this thing called Steadicam.
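As a rough illustration of that per-event transform, here is a minimal sketch in Python. It is not the actual Steadicam filter; the simple pan/tilt-as-shift plus roll-as-rotation model, the sign conventions, and the names are all assumptions made for illustration.

```python
import numpy as np

class EventStabilizer:
    """Sketch of gyro-based electronic stabilization of DVS event addresses."""

    def __init__(self, focal_px, center_xy):
        self.f = focal_px                 # focal length in pixels (approx. pixels per radian)
        self.cx, self.cy = center_xy      # image center used as the rotation center
        self.pan = self.tilt = self.roll = 0.0   # integrated angles, radians
        self.last_t = None

    def update_gyro(self, t, pan_rate, tilt_rate, roll_rate):
        # integrate rate-gyro samples (rad/s) over time to track camera orientation
        if self.last_t is not None:
            dt = t - self.last_t
            self.pan += pan_rate * dt
            self.tilt += tilt_rate * dt
            self.roll += roll_rate * dt
        self.last_t = t

    def transform_event(self, x, y):
        # rotate about the image center to undo roll, then shift to undo pan and tilt
        c, s = np.cos(-self.roll), np.sin(-self.roll)
        dx, dy = x - self.cx, y - self.cy
        xr = self.cx + c * dx - s * dy
        yr = self.cy + s * dx + c * dy
        return xr - self.f * self.pan, yr - self.f * self.tilt
```

Each incoming event address would be passed through transform_event before rendering, so that events from a static background keep landing on the same screen location while the camera rotates.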
If I turn on the Steadicam here, what it does is exactly that computation. So first of all, I'll take the camera, it's looking outside; let me bring it into focus a little bit better. Basically it's looking at the trees and a chair sitting over there. This is with the Steadicam off. And now I'm going to enable the Steadicam. So this is Steadicam on. Do you see how it stabilized the output? Now Steadicam off. And now Steadicam on. What it's doing, on each spike event, is applying the transform. You can see that red rectangle there. If I rotate the camera or pan it or tilt it, it's basically trying to paint the addresses, the spike events, onto the same spot on the screen, by electronically stabilizing it. That's not very informative by itself; where it really helps is when you have motion parallax. And to demonstrate that, I'm going to create a motion parallax scene where I take this ruler and clamp it onto the edge of the table. So now we have an object that's very close to us, and in the background is something that's far away. If I turn off the Steadicam, you can see the motion parallax: as I translate the camera left and right, the ruler is moving in front of the background. Now what happens if I turn on the Steadicam? Now it really stands out. The background is just kept stable, and what stands out in terms of motion is just the foreground object, because as I move and translate the camera, the foreground moves with respect to the background. And this is without the Steadicam; I'm actually doing a pretty good job of stabilizing it myself. And this is with the Steadicam. What it does is turn this camera motion into a movement of the foreground with respect to the background. I hope that's clear. It doesn't remove the background; the events still come from the background, but now the foreground stands out in front of the stationary background.

Since we first developed these sensors, they've gotten industrial interest, and nowadays the production technology for these sensors has advanced a lot. In fact, here's a 2020 example of a stacked pixel from Sony, working in collaboration with Prophesee. In our original pixels, everything was integrated right at the focal plane, the photodiode and all the processing circuits. Since then, Sony has developed, they were the first to develop, a stacked image sensor technology where there's a separate wafer with the photodiode and maybe a couple of transistors, and all the other processing is on the bottom wafer. In this pixel circuit, they put only the photodiode and a couple of NMOS transistors on the top wafer. Then at each pixel they have a copper-to-copper connection, so they can bond the wafers together at the single-pixel level, and they put all the other transistors on the bottom wafer. With this approach they can achieve a pixel size of about five microns with almost 80% fill factor, so almost 80% of the pixel area is sensitive to light; it consists of the photodiode. There's that pixel-to-pixel copper-to-copper connection, only the photodiode and a few NMOS transistors on the top wafer, and all the other pixel circuitry, about 50 transistors for the amplifiers and all the digital circuits that you need to get the events out, on the bottom wafer. And here's another example, from 2019, from Samsung, with about the same pixel size and this nice performance here. It's a megapixel sensor, and it also has all the complexity of a real industrial chip. For example, it has a MIPI interface on it; that stands for Mobile Industry Processor Interface, so they can connect this DVS directly to a smartphone interface. And it gets to be pretty complicated, this chip design business.
But again, you have a pixel where the top wafer has only the photodiode and a couple of small transistors, and then all the complex digital stuff is on the bottom wafer. So it's becoming almost like a real biological retina, which has many layers. And the last thing I saw is that Sony has now gone to a three-level stacking technology. So the image sensors themselves are adding more and more of this kind of complexity, which brings the ability to put more and more function at the focal plane, where you can really save a lot of power and do things that are extremely expensive if you're just reading out frames and processing them redundantly.

Okay, so let me now turn to the spike synchrony stuff. Here's an example of tracking objects from DVS events using spatiotemporal coherence. People that think about cortex like to think about spike synchrony and coherence and such, and here's a practical thing you can do with these spike events to track objects. This is a recording made over the 210 freeway in Pasadena over the Christmas holiday, when our daughter Dee Dee was very small. And you might ask, what if you want to count cars on the highway? How many cars are going by? How fast are they going? Are they changing lanes? Because one sign of an impending traffic jam is that cars start changing lanes a lot; that means the traffic jam is about to form. So here's a recording, with the original 128 by 128 DVS, of cars going by. How can you now track these cars so you can get these statistics about the speed of the cars and so on? It turns out, if you think about it, you can come up with very simple cluster-based tracking algorithms that are able to track these cars even at very high frame rate with very little computational power. How would you do it if you wanted to track the cars? Well, imagine you were already tracking a car and you got a spike event right here inside this box. How should you interpret this particular spike event? Well, the simplest interpretation is just that this cluster has moved a little bit in the direction of the spike event. So it's a very simple kind of tracker where the spike events are considered to come from cars, and the model of a car is just a little box which is moving along with a particular velocity. All the spike events do is drag the box along; they update the state of the box. So the algorithm is: for each event, you find the nearest cluster to the event, say it's this cluster right here. If the event is within a cluster, you nudge the cluster a little bit in the direction of the event. If the event is not within any cluster, it's most likely noise, but you should potentially seed a new cluster; you don't show it until it's collected enough evidence, say 20 events or so. And then periodically you have to do lifetime management by pruning starved clusters, merging clusters, and so on. This kind of algorithm has a lot of practical advantages because the computational cost is very low. There's no frame memory, you just need memory for these clusters. And there's no frame correspondence problem, because there are no frames; these boxes just get dragged along continuously by the spike events. And we built things like this robot goalie here to demonstrate this concept. It's just an extremely simple robot where you take one of these DVS cameras, and the job of the robot is just to block balls shot at the goal. So somebody tries to score balls on the robot and it just tries to block them.
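A minimal sketch of that kind of event-driven cluster tracker is given below. It only illustrates the idea described above; the mixing factor, seeding threshold, and pruning timeout are illustrative values, not the parameters of the actual implementation used in these demos.

```python
import math

class Cluster:
    def __init__(self, x, y, t):
        self.x, self.y = x, y      # center of the box model for one object
        self.n_events = 1          # evidence collected so far
        self.last_t = t            # time of the most recent supporting event

class ClusterTracker:
    """Event-driven box tracker: each event nudges the nearest cluster toward it."""

    def __init__(self, radius=15.0, mix=0.05, seed_events=20, starve_us=100_000):
        self.radius = radius            # half-size of the box, in pixels
        self.mix = mix                  # how far a single event drags the cluster
        self.seed_events = seed_events  # events needed before a cluster is shown
        self.starve_us = starve_us      # prune clusters that stop getting events
        self.clusters = []

    def add_event(self, t, x, y):
        # find the nearest cluster and check whether the event falls inside its box
        best = min(self.clusters, default=None,
                   key=lambda c: math.hypot(c.x - x, c.y - y))
        if best is not None and math.hypot(best.x - x, best.y - y) < self.radius:
            best.x += self.mix * (x - best.x)        # nudge the box toward the event
            best.y += self.mix * (y - best.y)
            best.n_events += 1
            best.last_t = t
        else:
            self.clusters.append(Cluster(x, y, t))   # possibly noise: seed a candidate

    def prune(self, t):
        # periodic lifetime management: drop clusters that have starved
        self.clusters = [c for c in self.clusters if t - c.last_t < self.starve_us]

    def visible(self):
        # only clusters that have collected enough evidence are treated as objects
        return [c for c in self.clusters if c.n_events >= self.seed_events]
```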
So you can perhaps imagine, now that you've seen this car example, how you might solve this problem. From the viewpoint of the camera, the balls are just like the cars on the highway: they're constrained to a 2D plane. So as soon as you see where the ball is in the scene, you know where it is in 3D, right, just by geometry. Then by measuring the position and velocity of the ball, you can put the arm in the right place. And that's what we do here. We take the output from the camera, send it into a laptop, and then we have a little USB microcontroller board that controls this hobby servo motor. And that lets us build this robot goalie. You can see the balls here are white balls on a yellow table, or yellow balls; it doesn't matter, and it really doesn't matter what the lighting is. It's extremely robust to all this stuff. And the program is really quite simple; the first prototype was built in a few days at one of the neuromorphic workshops. Here you can see what it looks like to the retina. The balls look just like the cars coming along the highway. The bottom cluster is tracking the arm so that the system can calibrate itself; it can learn how to interpret its own servo output to get to a particular place in the picture. Okay, so that lets us build simple robots. The fundamental basis of this tracking algorithm is to exploit the spatiotemporal synchrony of these events to update these clusters.

But you might ask, what does that have to do with visual cortex? And this is something that I got quite excited about when I saw pictures like this from John Anderson. This is a fill of a pyramidal cell. You have an electrically compact dendrite; you have the axon with all these boutons going out to the same area or out to other areas. One thing that really struck me was when I was at Capo Caccia and I saw a talk by Romain Brette, and he pointed out that, based on biophysical simulations, pyramidal neurons can be sensitive to very tightly timed input, like five milliseconds. If the inputs arrive on a particular branch of the apical dendritic tree at the same time, they can cause the pyramidal cell to spike. If they're spread out much more than that, the neuron won't spike, because the EPSP dies away and it can't be amplified by any nonlinear stuff going on in the dendrite or at the soma. So you might ask now, how is that related to, for example, extracting features from the DVS output? Here is probably one of the most famous models in all of neuroscience: the Hubel-Wiesel model of a simple cell, an orientation-selective simple cell, where you simply wire up some on-center LGN cells in this way to a simple cell, and then you get a cell that's sensitive to a bright edge, a bright bar like this ruler, of a particular orientation. So how can we now do that with this DVS output, to make an output that is somehow orientation selective? Well, I'll show you the demonstration first. Here's a demonstration of that; I'm again using the retina output. I will turn off the Steadicam here, I think it's off, and I'm going to turn on the simple orientation filter. But before I do that, I'll aim the camera up at our tangent screen, and I'll turn off the inertial measurement unit, we don't care about it. And now you see just the spike events. Now I'm going to turn on this orientation filter. Okay, you see those red events? Each one of those is an orientation event that says that those spike events at the pixels are somehow correlated horizontally.
If I turn it 90 degrees, now they're blue, I don't know if you can see them; they're individual blue vertical orientation events. They're saying that the spike events are correlated vertically. At 45 degrees, you get green orientation events that are oriented at 45 degrees, and if I turn it this way, you get purple events. If I show it something more complicated, like a hand, then you get a distribution of these different orientation events. This thing has a certain receptive field size and so on; you can adjust all those parameters. But the way it works is by using, again, the synchrony of these DVS brightness change events. You can imagine doing it with integrate-and-fire neurons, but there's a simpler way to do it. You can take those input events and just write them into a timestamp image. This timestamp image isn't a picture of brightness; it's a picture of the times of the last events collected at each pixel. So now you have an image of times. When you move an edge across the sensor, what you get is a kind of ridge sloping up to a cliff. And now, every time you get an event, you can look at the neighborhood of that event and see in which direction the event times are most closely correlated: horizontally, vertically, or along one of the 45s. If the events are sufficiently correlated in a little neighborhood in a particular direction, then you output an orientation event that tells you your edge orientation. I hope that's clear. So it's just a digital algorithm; it's activity-driven by the brightness change events and it operates on this timestamp image. And many of the algorithms that have been built for optical flow, corner detection, and so on operate on this timestamp image now. Instead of on accumulated event counts, they actually operate on this surface of time.

Okay, now that is all handcrafted. We basically handcrafted that feature to mimic something like a horizontally tuned simple cell in cortex. You might ask now, what's happening with deep learning, right? And one nice result that came out recently is quite beautiful, I think. You can ask the question, how much information is in the DVS brightness change event stream? So in this work from Henri Rebecq, from Davide Scaramuzza's lab, at his PhD exam he took this little bit of data. This is a few milliseconds of data from the DVS, with Henri in the foreground. I'm sitting right here, Davide's here, and Andy Davison is here from Imperial. And from this bit of data, plus the milliseconds before it, he was able to reconstruct, live, this image. It looks like a photo, like the image sensor output from the DAVIS, but it's actually reconstructed from the stream of events, these events plus maybe 20 milliseconds before that. How is that possible? Well, that's the power of deep learning now. What Henri did was he trained a convolutional U-Net whose input is a constant-event-count DVS volume of 20,000 events, split into constant-time DVS frames. So the input to this network is a number of channels like this that represent the past 20 or 100 milliseconds of time. Then it gets processed through a fully convolutional U-Net-like architecture with ConvLSTM layers. ConvLSTM layers are convolutional layers that have memory in them, so they can remember what was presented the last time there was an input. And finally, the output is just a grayscale frame.
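As a rough sketch of what such an input volume could look like, here is one way to bin a chunk of events into a fixed number of time slices. The bin count, resolution, and accumulation scheme are illustrative assumptions, not the exact tensor layout used in that work.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins=5, height=180, width=240):
    """Turn a chunk of events into a (num_bins, H, W) tensor for a CNN.

    events: array of shape (N, 4) with columns (timestamp, x, y, polarity),
            e.g. the most recent 20,000 events. Each event's signed polarity
            is accumulated into the time bin its timestamp falls into.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    # map the timestamps of this chunk onto bin indices 0 .. num_bins-1
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    bins = t_norm.astype(int)
    xs = events[:, 1].astype(int)
    ys = events[:, 2].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(grid, (bins, ys, xs), pol)     # accumulate signed event counts
    return grid

# usage sketch: grid = events_to_voxel_grid(latest_20k_events)
# the grid (plus the network's recurrent state) is then fed to the reconstruction net
```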
How did he train this network? He took images from the COCO image recognition dataset and affine transformed them: rotated them, scaled them, zoomed them in and out. From those images he generated synthetic DVS events using a DVS pixel model, fed those in, and trained the network to reconstruct the original image. That's all. So after a lot of fiddling around with losses, trying to figure out the right loss function, he was able to get this thing to work extremely well. It's not very practical; it runs in real time, but only on a $250,000 GPU. But it shows that you can infer what was there just from the DVS event stream. And they had a prominent paper at CVPR, I think last year, which showed that, by using overlapping segments of DVS events, they could make the equivalent of something like a 5,000 frame per second image sensor just from the DVS events. To demonstrate that, they blew up this Swiss statue with a bullet; there's another one where they're blowing up a coffee cup. I think it's in slow motion here at almost 5,000 frames a second, and there you can see the bullet actually flying through. So it's really quite amazing. The most recent development skips this reconstruction entirely. You don't have to do a reconstruction, right? If you want to just solve a problem, why do you bother reconstructing a picture and then sending it into a standard algorithm that's already been pre-trained? So at NeurIPS, just a month or so ago, there was the release of an automotive object detection dataset from Prophesee, where they labeled by hand more than 200,000 cars and 30,000 pedestrians in 40 hours of data. And they showed that they could beat the gray frame-based approach for recognizing cars and pedestrians by using a network which had, at the front, again an event volume, a volume of the spike events, the brightness change events, coming in, then a set of convolutional layers, and then, very critically, a set of convolutional LSTM layers. These are CNN layers with memory in them that can remember the past input. This whole network has 24 million parameters and runs at 17 milliseconds per frame on a GTX 980 GPU. So again, not very practical in some ways, sort of getting practical, but one big problem with this approach is that it doesn't exploit sparsity. You have sparsity in the input; this input volume is extremely sparse. Even in a scene like this, at most about 10% of the pixels are active during any normal frame period. But in the network itself, there's no exploitation of sparsity. Basically, it just does all the operations whether there's data there or not. So it's not much like a spiking neural network.

So I just want to finish now by considering sparsity in the brain. Let's estimate energy use and spike rate in the human brain and see what we can understand from that. This is for the neuroscientists here; it's numerology in the Enrico Fermi style. You have 10 to the 11 neurons, and what we're going to do is multiply a bunch of numbers together and then try to infer another number from them. So you have 10 to the 11 neurons. Each neuron has about 10 to the 4 connections, so the fan-out is 10 to the 4, a 100 by 100 fan-out. Then every time a neuron spikes, it's using a power supply of about 100 millivolts, the synaptic activation current is about a nanoamp, and the synapse is activated for about a millisecond. So if you multiply those three numbers together, you get the electrical energy cost of one synaptic activation.
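Written out, the arithmetic being set up here is the following (these are all round order-of-magnitude numbers, so the result should only be read as an order of magnitude):

```latex
E_{\text{syn}} \approx 100\,\text{mV} \times 1\,\text{nA} \times 1\,\text{ms}
             = 10^{-1} \times 10^{-9} \times 10^{-3}\,\text{J} = 10^{-13}\,\text{J}

P_{\text{brain}} \approx N_{\text{neurons}} \times \text{fan-out} \times E_{\text{syn}} \times f_{\text{avg}}
                = 10^{11} \times 10^{4} \times 10^{-13}\,\text{J} \times f_{\text{avg}}
```

With a brain power budget of roughly 10 watts, these round numbers put the average rate on the order of 0.1 to 1 Hz, which is the "about one hertz" figure quoted next.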
So that turns out to be, is that right, about 10 to the minus 13 joules. And so you multiply all these numbers together: the number of neurons, times the number of synapses per neuron, times the energy cost per activation, times the average spike rate across the whole brain. And what you should come out with is what? The brain's electrical power consumption, which is about 10 watts. So what does that say about the average spike rate across the entire brain, ignoring the fact that it's much faster at the periphery and much slower in the middle? It turns out to be about one hertz. So on average, the neurons in the brain are spiking at about one hertz, pop, pop, like that, very slowly. Does that mean that the brain is not doing a lot of computation? No, because even if those spikes are not synchronous, because of the big fan-in and fan-out of the neurons, the neurons are still getting tickled at a rate of 10 kilohertz, right? The pyramidal neurons are still getting tickled at this very high rate, and once in a while a synchronous input can cause them to spike. But it does mean that the average spike rate is very low. And that's very different from conventional deep neural networks, where every neuron sends its message to all recipients at a fixed sample rate. So it's clear that exploiting sparsity in connections and activations is a key direction of current hardware AI developments. These hardware AI developments are going to look more and more like spiking neural networks in that they're not going to compute when they don't need to. And that's the direction we're working in in our own work right now.

So yeah, that's it. That's silicon retinas in a nutshell. I still think that the opportunity to exploit this spike timing for online learning has not really been pushed through yet. It's not clear whether it's useful, but probably it's useful in some way. In industry, the main aim for industrial development of these kinds of sensors is still to shrink the pixel size, to win the megapixel race. And it's clear that the kinds of sparsity ideas that you see in these event cameras are also appearing in hardware AI accelerators at a furious rate; it's probably the hottest area in silicon development now. So with that, I conclude with our sponsors. On behalf of the sensors group, which is led by Shih-Chii Liu and myself, and all of our great students, I've put here a few papers for people that are interested in looking into it. And I'm happy to share these slides if Maxime sees a way to do that easily.

Yeah, we'll put that in the description for future notes and for the podcast. Thanks a lot.

Do you see any questions at all? I don't see anybody now, I see no video or anything. You're welcome.

Thanks a lot, Tobi. Before the questions, I will ask everybody in the chat: if you want to join us in a Zoom room to ask your questions yourselves or for the discussion, please do so now. Before we jump to the questions, I will ask you a stupid one that I keep hearing at your workshops. Okay, this is all very interesting and very impressive, especially after the deep learning training. What would be a proper, let's say, marketable application of such devices?

For example, visual prosthetics. Now you have a sensor that is intrinsically like the eye; it produces this very precise timing, and it has very high dynamic range because of the logarithmic transform. That's the classic problem with conventional cameras: they don't deal with bad lighting situations where you have lots of contrast in the scene.
When the light gets low, they produce very blurry output. Now you have something that only produces output when something is moving, output that seems ideal for either a retinal or a cortical prosthetic implant, where you don't care about lots of pixels in the first place. Another application is automotive. The automotive camera business is huge. You know, just one single product, one single camera intended for automotive, has a market of several billion dollars per year. So there's a lot of money that goes into automotive cameras. That's why Prophesee is spending almost all of its, you know, 50 million in funding trying to get a penetration into the automotive camera market, because cars have the same problem: the lighting conditions are frequently terrible. You have lots of glare, you have oncoming headlights; lighting conditions are just terrible for cars in general. It's not like you taking a picture, where you're in control and usually you control the lighting to make a good picture. Cars have these extremely tough constraints: they have to see quickly under bad lighting conditions all the time. Stuff like that. You can also imagine machine vision applications, assembly lines that have to quickly sort parts, drones that need to fly around and avoid obstacles, things like that. We'll see. Nobody's captured a mass production target in the same way that Microsoft did when they developed the first structured lighting sensor for the Xbox. That really made that take off. You remember, I think it was the Xbox that had the first structured lighting sensor. A structured lighting sensor paints the scene with structured lighting and then sees how that structured lighting is distorted by objects in it, and from that you can infer the depth in the scene. And it was the development of that first structured lighting sensor by an Israeli company that got structured lighting sensors into mass production, because Microsoft was willing to invest $100 million into that development. It's the same for the DVS. As soon as there's some mass production customer that's actually buying these by the millions per year, that will make it take off. So far it hasn't. If it does, then it'll go, right? And I think that the big industry players, Samsung, Sony and so on, have seen some potential in this, but they haven't yet penetrated some mass production product like a smartphone. You know, these cameras on smartphones are just incredible; now you can buy an 18-megapixel camera for about $2. A camera with over 10 million pixels, in mass production, as they call it, is like corn, a commodity; the commodity price for such cameras is about $2. But imagine you had such a camera and, out of every 100 pixels, there was one DVS pixel, right? So it's a little bit like the parvo and magno streams, right? You have a parvo pathway, which is the standard camera output; you only turn it on when you really need to burn that power, because it takes a lot of energy to get the picture out. The rest of the time, you just have your magno pathway running. It's doing stuff like waking up the phone when you look at it, detecting gestures, things like that, helping you do augmented reality, you know, helping you localize the camera, the phone, in space relative to the surroundings. So that's the kind of place.
It hasn't happened yet, but you can imagine that it's potentially useful.

Very interesting, but I will move on with the questions. I have one from Tom Baden: can you compute online flow fields as the camera moves about, for example to estimate and compensate for such motion, a bit like we think the fly might do?

Yeah, the question is, can you estimate optical flow from this output? Is that the question? Flow fields, yeah. There's been a lot of work on that. In fact, I have a student right now who's developing a camera which estimates the flow within the camera itself, in FPGA logic circuits, using a block-based optical flow estimation that avoids the aperture problem. Other people have developed gradient-based methods for optical flow, and the most accurate methods for optical flow use, again, some kind of deep network, right? They're trained on lots of data and they can estimate flow from this series of snapshots of DVS input, which has been termed a voxel grid, or event volume. They call it an event volume, but really it's a set of frames that go into the CNN, and they can estimate flow very accurately from that. But again, flow is just an intermediate result. It's not an output by itself; it has to be used for something, and it depends what it's used for. But yeah, it's an interesting area to work on, but tough to beat what's out there now.

I have a couple of questions from Philip, but I see he's on Zoom with us. So Philip, if you want to turn off the mute, you can try and ask them yourself. Go ahead. We'd like to hear your voice. Just ask your question, please.

Right. Thanks a lot for, yeah, thanks a lot for the talk. It was fascinating. So first I want to ask you, how does it react to reverse-phi stimuli? Is it sort of the way you would imagine, or do you have to do some sort of tricks to predict behavior with the output of this guy?

So your question was, how do I infer the spikes? I didn't quite follow it.

So if I show this guy reverse-phi motion.

Reverse-phi motion, reverse-phi motion. Yeah. I can't remember what that is anymore.

Yeah, it's some kind of illusory motion, right? It causes illusory, yeah.

I can't remember what that is. Listen, don't get too sophisticated, right? The pixels are just outputting these brightness change events. They're not doing anything fancy; you're not going to get Mach band effects and stuff like that, like Misha demonstrated with her silicon retina a long time ago. Just imagine that each pixel independently is outputting this stream of brightness change events, and yet magically that stream seems to take on kind of spatial filtering properties, just because of the statistics of the world. By the way, I didn't show you the quite cool histograms of the spike ISI distributions that you get from moving around. You can measure online the ISI distribution of all the pixels in the array, and you get this very interesting distribution where you have ISIs at all different timescales, with a roughly flat distribution. I don't know what it means.

So I have a ton more questions, so maybe someone else should ask first.

I mean, there's not a long queue, so keep going.

Yeah, thanks for that. So the other thing that I was interested in is, right, as far as I understood it, the dark edges actually show the silent bipolar cells in a way. So are these dark edges, right, these silent ones, the ones with less variance in terms of spiking, are they necessarily more laggish, right?
So are they just...

Yeah, you're asking about the details of the circuits. Are they more laggish on purpose, or is it just, is that what you're asking?

No, does it emerge naturally that they're lagging more?

It's not a desirable property, but it's a property of this particular pixel circuit that, for some reason, the trailing edge of this white bar against the dark background, the trailing edge, where the pixels get dark, let's see, it's confusing. Actually, sorry, it was a dark thing on a white background. So at the trailing edge the pixels get white again, and you saw this long trail of them. Yeah, that's a property of the dynamics of the pixel circuit, which is a second-order or third-order system that depends on the bias currents. There's nothing biological about it. I would call it just a property of that particular analog circuit that we would prefer not to have, right? It's an inevitable property of the dynamics. We would like a nice sharp edge on both sides, and it's possible to get that by cranking the bias currents way up, burning more power and introducing more noise. We can get rid of that trailing edge, but it costs us energy to do that.

But as far as I'm aware, that actually mirrors what we observe in biological retinae, if I understood you correctly.

Don't think it's some fancy property of the horizontal cell network that's predicting ahead. You know, people make all these beautiful claims that the horizontal cell and amacrine cell network is predicting ahead, doing motion estimation. Nothing so fancy here. I would love to have that stuff on the focal plane, but you know, the problem is that it makes the pixel bigger, and bigger pixels cost more money. So you have to first win the megapixel race. If you can't get into mass production, it doesn't matter how fancy your features are; first somebody has to buy you, and they're going to buy the cheapest solution first, right? So it's a pure market; it's natural evolution, survival of the fittest in its most raw form, right? First you have to find your way into some ecological niche in this marketplace, which is extremely competitive. You have no idea how hard it is to get a new technology into production. Then you can start getting fancy, right? Because then the money will pour in that can fund this development. I'm not saying that research groups can't do that. Research groups can build much fancier pixels, with more fancy functionality. It's a good time to do that right now. Then you can file the IP, you know, the basic IP, and then later on the companies have to pay you, hopefully, right? If they even bother to license the technology; some companies don't even bother to license.

So thanks. Could I ask one more question? Is that fine? I'm just very excited.

Yeah, go ahead.

So the last thing, right, when you started talking about sparsity, right? One thing that springs to my mind is, how do you store all of these data, right? So you have to store them in very sparse structures. And then if you do, right, and I'm guessing that you do, can you learn some kind of an optimal timer, some kind of an optimal clock that you could maybe store the data with?

That's a really good idea.
You're asking, first of all, how is the data stored on the chip and how is it stored in computer files? Is that what you're asking?

Yeah, that's one thing. And the other thing is, basically this guy is analog, right? But then you need to digitize and store. So can you get like an...

Yeah, I probably didn't explain enough about how the pixel really works. I mean, when I show the chip here, a picture of the chip, I'll just show the Samsung picture here, the Samsung DVS, just so you have some basis for discussion. So, you know, within each pixel you have this pixel circuit. And the fact that this pixel wants to send out a brightness change event is registered in a one-bit memory within the pixel. So maybe an edge passes over the pixel; it wants to send out its brightness change event, and that's stored within the memory of the pixel. Eventually the pixel gets access to the shared output buses and is able to send its own address out as an XY address onto the shared output bus. Now, along the way there are various FIFOs, first-in, first-out memories, that act like an information capacitance and buffer the data on the way to the computer. There are various FIFOs within the chip and outside on the USB chip. Eventually all the data gets over to the computer, and then it's simply stored as a list, not of pixels, but a list of addresses and timestamps. So each event is just literally the XY address, whether it's ON or OFF, and its time in microseconds. And you can imagine various ways to compress that. If you just do Huffman coding or something, it will compress the file tremendously, but you can imagine other things like delta encoding and so on. It hasn't been worked on that much. But, you know, finally, for processing, what you want is to process a stream of events. Nowadays, to make the pixel smaller, people are actually outputting frames again, but frames at a kilohertz; they get rid of the microsecond timestamps and just go at a kilohertz. But then the frame is extremely compressed, because they only send out the active pixel addresses. You can send a binary map of only the active pixels, so you only use one bit per pixel, or, if that doesn't save you anything, you send out just a list of the addresses of the active pixels. This is all for actual processing.

But then again, if you're just moving around naturally, right, you have an optimal timer at which you typically sort of produce frames. So people can imagine that you...

That's exactly correct. That's a really good point. So the question is, as you're moving around, you're experiencing a range of dynamics. Sometimes it's very, very slow, like in a surveillance scene where nothing's happening, and then suddenly it gets really quick. You know, the guy comes in and starts beating the store owner; you want to capture that dynamics. And naturally you're going to get that, because the output is activity driven, as long as you're good at filtering out flickering lighting sources. If you have a flickering lighting source in the scene, it's going to constantly make events, so you have to get rid of that with some filter that filters out repetitive pixels that are just doing boring spiking. Yeah. So there are all kinds of cool things you could work on.
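To make the event-list representation described above concrete, here is a small illustrative sketch. The field layout is an assumption made for illustration only, not the actual recording file format.

```python
import numpy as np

# Illustrative in-memory layout for a recorded event stream: one row per event,
# exactly as described above (address, polarity, microsecond timestamp).
event_dtype = np.dtype([
    ("t_us", np.uint64),    # timestamp in microseconds
    ("x", np.uint16),       # pixel column address
    ("y", np.uint16),       # pixel row address
    ("polarity", np.int8),  # +1 for ON, -1 for OFF
])

events = np.zeros(3, dtype=event_dtype)
events[0] = (1_000, 120, 45, +1)
events[1] = (1_250, 121, 45, -1)
events[2] = (2_300, 60, 10, +1)

# such a list compresses well, for example by delta-encoding the timestamps
dt = np.diff(events["t_us"].astype(np.int64))
print(dt)   # [250, 1050]
```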
For example, I made a thing called a depressing synapse filter. It's inspired by the fact that some synapses depress: the more you feed them input, the weaker they get. So this depressing synapse filter basically has a memory for each pixel, and every time a spike comes in, the synaptic weight gets a little bit smaller, right? So if the pixel is continually getting stimulated, its weight gets very, very small, and it basically filters out those repetitively spiking pixels. It's just a software object that does that, but it's directly inspired by hearing lots of talks about LTP and LTD and stuff like that.

I guess we can continue to talk about that just after. There's a very interesting question from Enrique, which I find very... Can you paste it here, perhaps? Yeah, try to read it to me. I'm just going to tell the audience: after this question I will end the live stream, so we're going to keep this conversation between ourselves. So if you want to join us, do that now; the link is in the chat. Please join us now. I will end the stream after this question. So Enrique is asking: the biological retina, so I guess the primate retina, has about 50 different amacrine cells, but your amazing retina has zero. So can we reverse engineer it eventually and learn about biology from your electronic retina? So you can read it.

Yeah, that's a really good question. I mean, I was very intrigued by the idea that perhaps from this DVS spiking output we could synthesize more realistic biological cells. And do you remember, Gemma, when we were considering that work in the switchBoard project, and before that, we were trying to see whether we could somehow synthesize a more realistic, amacrine-cell-like modulated DVS output. We weren't successful. Now, with deep learning, you might be able to learn such stuff; it's possible. Can you learn more about the biological retina? I'm not convinced, but the jury is still out on that. Is it a useful sensor for some applications? Yeah, definitely. Does it teach us more about the biological retina? I'm not entirely convinced about it. Maybe it makes us look a bit more at the time domain, the advantage of local adaptation, what is really going on there with the amacrine cells. Because as far as I can tell, the function of many of those amacrine cells is really difficult to understand. You can't interpret their output, just like you can't interpret a deep network's output; even the first layer is hard to interpret. Here's something even more uninterpretable. So I was hoping to make some progress in that direction, but we really didn't. Yeah, sorry about the disappointing answer, but maybe you'll have a good idea.

Thanks for that. All right, thanks everyone. We're ending the stream now, so I'll see you next week with another talk in the vision series. Thank you.

Can I go forward with a couple of questions? Sure.