Good afternoon everybody, and welcome to the Purdue Engineering Distinguished Lecture Series. My name is Vijay Raghunathan. I'm a professor and associate head of the Elmore Family School of Electrical and Computer Engineering, and it's my great pleasure to kick things off this afternoon by introducing Dr. Wayne Chen. Wayne, as many of you know, is the Reilly Professor of Aeronautics and Astronautics at Purdue, and, even more importantly, he is the College of Engineering's Associate Dean for Research and Innovation. And so of course I'm going to say nice things about him, because he's the one we all go to for our cost share when we write large proposals. Wayne has been a longtime leader for AAE, for the college, and for Purdue. He is highly technically accomplished, a fellow of ASME and other societies, and has won a number of research awards. So I'll turn it over to Wayne to introduce our distinguished speaker and kick things off. Thank you.

And again, welcome to Purdue Engineering's Distinguished Lecture Series. It is my honor to introduce today's distinguished lecturer, Professor Al Bovik, who is the Cockrell Family Regents Endowed Chair Professor and the Director of the Laboratory for Image and Video Engineering at the University of Texas at Austin. His research interests lie at the nexus of visual neuroscience and digital pictures and videos. His recent interests include immersive virtual and augmented visual experiences and how they can be perceptually optimized. An elected member of the U.S. National Academy of Engineering, the National Academy of Inventors, and Academia Europaea, his many honors include the IEEE Edison Medal, a Primetime Emmy Award, a Technology and Engineering Emmy Award, the RPS Progress Medal, and the Edwin H. Land Medal. Now please join me in welcoming Professor Al Bovik.

Okay. Looks like I'm connected. Thanks for having me here. You know, after 40 years in the field, with many of my friends here at Purdue, I've somehow never been here before, even though I went to the University of Illinois for eight years. So I'm really glad to be here, and it's a beautiful, wonderful rainy day, something I'm not used to anymore. Down in Texas, it just hasn't rained for six months or something, and the fall colors are, you know, just wonderful as well. We don't see that either; we see brown these days. So, happy to be here. Like I said, I know a lot of people here, like Ed and Charlie and Melba and Amy somewhere, and I've known you guys for so long. Ed and I have a very long history. We sort of came up academically together and we've shared many adventures previously, not only, you know, society activities and stuff like that, but also doing things like having encounters with drug dealers in the jungle in Belize. We did that too. So we go way back, and I'm just happy to be here.

As for today's talk, I want to point out first of all, and I'm going to come down here so I can point, plus I tend to be a mobile speaker, that there is an S here. The announcement just said "Secret of Image Quality." Maybe they were trying to say, well, there's one secret: come and find the one secret. Well, there are a lot of secrets in image quality, and we're still trying to find more of them as we go forward. A lot of research is yet to be done in this amazing field. Isn't this a great picture? Even though it's loaded with what we call technical distortions, I wouldn't change it in any way at all.
It was actually created as part of a homework set for one of my classes, my digital image processing class, where I asked students to go out and collect pictures of the world. I didn't tell them why; later they did a project on predicting picture quality. This is a good example. Now I use this picture in my class and I ask them: how many distortions can you find in this picture? Golly, there's a lot. I mean, immediately you see there are different kinds of... well, my pointer doesn't work here, as you can see. Oversaturation, underexposure, red eye, blur, motion blur. There's just an infinite number, well, really a very large number. In just a few seconds I can come up with a list of about ten without even trying, ones we can put names on. Some of them you can't even put names on. Yes? No, I mean, more. Lots more. Yeah, yeah. And combinations of these. They interact with each other to create more distortions. There's blur with compression artifacts and that sort of thing. So really a lot going on.

Here's another. This is a video talk, so I have a video too. You can tell me if this is high quality. You know, you don't have to tell me, but you can imagine; it's from 1967. Oh, look at that guy. You know, you'd say there are so many things wrong with this video. I mean, look how jerky it was just at the very beginning. Yet it is the greatest video ever taken of Bigfoot. Okay, so to somebody, at least, it's important.

So video quality is really a big issue these days. It's a concern everywhere in industry, as well as for casual photographers. You know, you take a bad picture, you're concerned, right? It's especially important today because, just in the last twenty years, video has overtaken the internet. Eighty percent of all moving bits these days are pictures and videos, because we're doing things like watching Netflix and YouTube and all these things constantly, as well as uploading our own pictures to Facebook and all that. And these pictures and videos can generally suffer from many different kinds of distortions, as I've just exemplified; a huge number. In the early days of trying to solve this problem, people would say, well, let's model, say, the blur. If we can model the blur, then we can figure out how people respond. Unfortunately, there's an infinite number of blurs that are possible out there in pictures of the real world. It doesn't really work unless you're analyzing one microscope or something. Otherwise, forget it. You cannot model distortion. It's impossible. After working in this field for twenty years, I concluded that long ago. So if we can't model distortion, how could you ever predict the quality that people see when they look at a distorted picture? The answer lies inside the brain, as we will see. And furthermore, distortions combine, and it just gets worse. You can't even name the distortions, much less quantify them in a simple model that you could then turn around and use.

Now, there are two different categories of content that we address a little differently, though I'll really only talk about one today. One is very high-quality original content.
So if you're watching movies, cinema, high-quality television and that sort of thing, what we might call studio-generated content, like you'd see on these channels: I'm going to talk about this category today. It's the best place to start in this field anyway. The other category is pictures and videos generated by people like you and the guy on the street with their cameras, who upload them to social media and so on. Those tend to be quite distorted even as captured. Studio content, by contrast, tends to be very high quality as captured, but then things happen to it, so it becomes distorted. And that's where our story today is going to go. This one.

Okay. Here's a diagram of a video communication system. Now, there are a lot of double E's in here who studied communication theory. In a classical communication system, there are three parts: a transmitter, a channel, and a receiver. And the better you can model each one of these, the better you can design a communication system. Well, measuring picture quality is substantially similar to analyzing this communication system, only this time it's specifically a visual communication system. The best job we can do in modeling the transmitter is to realize it's not an antenna, it's not an internet protocol. It is the real world of photons striking objects, scattering around, and being directed to cameras or eyeballs, which capture them. And we can actually model this process. It's amazing that every picture and video ever taken that's of good quality obeys certain statistical laws. You don't expect that, because there's so much diversity in the world to photograph, right, or videograph. But every one you take obeys certain laws. We can exploit that, and your brain expects it. If it's distorted, your brain is ultra-sensitive to that change in these models. And that's one of the secrets of picture quality.

The channel is what people tried to model for many years: additive white Gaussian noise, blur, whatever, and so on. But really it's complicated, because there are so many places where distortion occurs: the capture device, actually everywhere, even in the receiver. Well, what about the receiver? The receiver is the human visual system. Because in my work, when I talk about quality, I am talking about perceptual quality as viewed by a human observer. Not a machine; this isn't computer vision. It's human vision. These images and pictures are intended for the seven or eight billion human beings in this world.

Distortion can occur everywhere. You capture with your iPhone, it compresses videos, it processes them in various ways, sometimes not all that well. And the providers recompress and reformat and do other things that distort the picture. Going over the internet, obviously, distortion can occur, and your display device will be imperfect. Even your visual system is imperfect; these guys were just talking about getting cataract surgery. And within your brain, in the neural apparatus, there's neural noise, which we aren't really affected by visually, because your brain has learned to basically denoise those neural signals as it interprets what you're seeing. But we have to account for it, and we will.

So, video quality. These are very incomplete lists of distortions for which there are names that people have identified. It doesn't matter what they are. Blurs and exposures, this sort of thing.
A lot of them are mostly spatial, so if you look at a single frame, you'll say, oh, that's blurry or blocky or noisy or whatever. Others occur because this is video, over time. So you'll see that it's jerky, or you'll see something I'll show you later called stutter, which happens when you have too low a frame rate. Think of old movies that are just sort of discontinuous. Smearing, all kinds of things. And these lists are very incomplete. And again, these can all combine together to create distortions you cannot hope to model. So it's really hard.

Can we actually address this problem? I remember my advisor, who I was talking with others about earlier, Thomas Huang, one of the greatest inventors of video compression. Not the only one, but one of them. When I was working with him, we were doing some work on making pictures look better. We had just published in a journal this sort of grainy reproduction; nobody could tell we'd actually done something better. How can you prove you did something better? We asked: is there a way of measuring the quality? And there was nothing. There was something called mean squared error, which is horrible, as I'll talk about. It's a horrible predictor of picture quality; it doesn't correlate with human vision at all. We very much wanted to move on from that. But there wasn't anything for a long time. It was regarded as essentially impossible, really until about the year 2000, to make good predictions of picture quality.

So can we? Yes, and the reason we can is that videos are special. They have secrets hidden in them. They're special in certain ways that we can exploit. And because they have that specialness, and our brains have been processing videos, in some sense visual information over time, our brains expect that specialness to be maintained when we see. And if that specialness is somehow broken, degraded, changed, and so on, our brains are extremely sensitive to that. So we perceive distortion instantaneously, what we call pre-attentively. If you see a blurry picture, you don't ask yourself, is that blurry? You immediately have a perception of it. You see blockiness or noise, it's instantaneous, it's pre-attentive. It happens in a very short period of time, measured in the low hundreds of milliseconds. So our brains expect this specialness. And this is not the same as a model making this observation that videos are special.

How are they special? Well, here's one way. I'm going to list a few, related to what we're going to do. So I'm not just listing special properties; they're all relevant to our task at the end, and also to each other. First of all, the power spectra of videos obey reciprocal power laws. By power spectra, what do I mean? This is the Fourier transform, in either space or time or both, it doesn't matter. If we take the expected value of that over videos, we get something that looks like what we call 1-over-f noise, a reciprocal type of noise. Now these days, if I say Fourier transform to a data science student, it's like... it's sad, okay? But, you know, it's just a fact, and I encounter that all the time. So I'm just going to have to move on. I mean, learn what a Fourier transform is if you don't know what it is, whether you're in EE or even in CS. I'm just kidding, I don't mean to poke fun too much. Basically, though, every signal is embodied by frequency components in space and in time.
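To pin down the law just stated, here is a minimal formalization in my own notation (the symbols are mine, not taken from the slides):

```latex
% Reciprocal (1/f) power law: the expected power spectrum of natural videos
% falls off as a reciprocal power of spatial frequency f_s and of temporal
% frequency f_t, with exponents alpha and beta discussed next.
E\left[\,|F(f_s)|^{2}\,\right] \propto \frac{1}{f_s^{\alpha}},
\qquad
E\left[\,|F(f_t)|^{2}\,\right] \propto \frac{1}{f_t^{\beta}}

% Any spectrum of this form is self-similar: rescaling frequency by s > 0
% only rescales the spectrum multiplicatively (the scaling property below).
P(f) \propto f^{-\alpha} \;\Longrightarrow\; P(sf) = s^{-\alpha}\,P(f)
```

And as a sketch of how such a law is checked empirically, assuming NumPy and a real photograph in place of the random stand-in, one can radially average the power spectrum and fit a log-log slope (the numeric value depends on whether amplitude or power is being quoted):

```python
import numpy as np

def radial_power_spectrum(img):
    """Return (radial frequencies, radially averaged power spectrum)."""
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)  # radial frequency bin
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    radial = sums / np.maximum(counts, 1)  # mean power at each radius
    return np.arange(1, len(radial)), radial[1:]  # drop the DC term

img = np.random.rand(256, 256)  # stand-in: use a real photograph in practice
freqs, spec = radial_power_spectrum(img)
slope = np.polyfit(np.log(freqs), np.log(spec), 1)[0]
print(f"log-log spectral slope: {slope:.2f}")  # clearly negative on real photos
```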
Okay, so if we do actual studies, as have been done many times, the first by the famous vision scientist Tolhurst, we find that these exponents are around 1. Alpha is in spatial frequency; beta, if you measure this, is in temporal frequency. Empirically they're both around 1.2, but we usually talk about them as being around 1. Now, interestingly, if a function satisfies this, any function, forget about pictures or videos, then it also satisfies a self-similarity property of the Fourier transform. If the Fourier transform looks like this, then it obeys this self-similarity, meaning if I scale it in frequency, then it scales multiplicatively as well. This was first observed by none other than Benoit Mandelbrot in the study of fractals, which was the first big breakthrough in the self-similarity of visual information.

Here's a quick example. This is a video of a sled; you see it's bouncing around and so on, it goes around the curves so the picture changes. This is, on a log scale, the spatial power spectrum, and this is the temporal power spectrum. Now, keep in mind it's on a log scale, so these variations look fairly big, but actually this is almost perfectly linear, both here and here, except for this little tailing off at very high frequencies. Just an example of that. And it doesn't matter where you're looking in this video, where it goes and that sort of thing: the same sort of 1-over-f behavior.

Second special property. The Fourier transform, in its discrete form, says that if I have a function like a picture, or a frame, or a video, then I can express it as a linear combination of basis functions. For the Fourier transform, the basis functions are sinusoids, or complex sinusoids, right? So what if we ask a question: let's not use that basis, but instead find basis functions such that, when I minimize this energy function over the sum of these basis functions, this becomes approximately true. Try to force that to zero, so this becomes true for some basis functions. There are infinite possibilities for that. But we impose another constraint, called a sparsity constraint: some function s on these coefficients so that they will be sparse in some sense. Sparse in the sense that most of them will be zero is what is intended.

We studied this. As often happens in science, people in multiple places had the same sort of idea. In vision science, it was Bruno Olshausen and David Field who tried this, and the kinds of functions they tried were simple, absolute value or something like that, which, if you plot them, have these creases in them. And the idea is that if you optimize this iteratively, the solution falls into a crease and stays there. That's a zero coefficient. There are other zeros, of course, in a higher-dimensional space. That was their intuition. At the same time, at Stanford, Robert Tibshirani invented the lasso. Pretty much the same thing, if you're familiar with sparsity theory and the lasso at all, but I'd better not go down that path.

Okay, so the point, in red: band-pass processing. Oh, I need to show you something first. When they optimized, they found basis functions like this. These look very much like, for those who have seen these things, two-dimensional band-pass filters. You might think of Gabor filters or wavelets of just about any variety: band-pass filters. You notice that they're like little waves.
They're constrained in their size, but you see this sort of sinusoidal characteristic. Some are roughly circularly symmetric; others are elongated and stretched out. A whole variety of these. What this means is that natural pictures of the world are band-pass sparse. If we represent them with band-pass basis functions, we can have sparse representations, which are very efficient.

Why is this important in terms of perception? After all, that is not a perceptual observation; it's a secret about pictures of the world. But as I said earlier, our brains have evolved to exploit this, and our brains have to be efficient, because the amount of visual information striking our eyes and retinas is titanic, far more than our brain can handle. Long ago, in the 1950s, a scientist named Horace Barlow put forth the so-called efficient coding hypothesis. He said, we don't understand how the visual brain works yet, but one thing I know is that the early stages of processing are all devoted to some kind of compression, some sort of coding of the signal so it becomes much, much smaller and manageable. And in fact, when we look at the brain, and that includes the eyeball itself, the ganglion cells in the retina can be modeled in their responses as linear band-pass filters. They tend to be center-surround, like this, and they actually look like difference-of-Gaussian filters. For those doing computer vision, you say, oh, the good old DoG filter; it's like the LoG filter. In the midbrain, in the thalamus, in the lateral geniculate nucleus, temporal processing happens. That's also band-pass; in their closest modeling, those filters look more like the one-pole filters used in circuit theory. And then in primary visual cortex there are filters that become even more selective, because they're frequency-selective. This is a Gabor filter. It looks an awful lot like what I just showed you, the sparsity filters: band-pass filters. This shows them in frequency, and this shows a set of those band-pass filters on just one side of the spectrum.

Now, about band-pass filtering: there's a theorem in mathematics that says any band-pass filter can be written as a difference of two low-pass filters, having the same exact mathematical form, in fact. So suppose I have a low-pass filter and I blur an image with it. If I take another low-pass filter, blur the image again, and subtract the two, remember I said a difference of low-pass filters, I'm going to get zeros in most places, because the two blurred images are going to be very similar to each other. You blur an image, it still looks like the image, right? Blurry in most places. That's another way of viewing sparsity: we're forming a lot of zero responses, or near-zero responses. That's what the midbrain and visual cortex are doing.

Third property, even more amazing. But first, let's review the first two properties. The first property is that images are naturally self-similar, which means they have a multi-scale interpretation. You can look at them at different scales, and they have the same properties. You can look at them in different band-passes, and they have the same properties. Multi-scale and multi-band-pass are the same thing, really. Secondly, band-pass processing is a way to get efficient representations. Our brain does it, and the real world is composed of pictures and videos that are naturally that way. Third property: suppose I use one of those band-pass filters. I take my video and I filter it with that band-pass filter to get a band-pass-filtered picture.
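As an aside on the difference-of-low-pass point above, here is a minimal sketch, assuming NumPy and SciPy (the helper name is mine, not from the talk), showing that such a band-pass response clusters around zero, which is exactly the sparsity being described:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_bandpass(img, sigma_fine=1.0, sigma_coarse=2.0):
    """Band-pass an image as a difference of two low-pass (Gaussian) filters."""
    return gaussian_filter(img, sigma_fine) - gaussian_filter(img, sigma_coarse)

img = np.random.rand(256, 256)  # stand-in: use a real photograph in practice
bp = dog_bandpass(img)
# Most band-pass responses are near zero (sparse), unlike raw pixel values.
print("fraction of |response| < 0.02:", np.mean(np.abs(bp) < 0.02))
```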
Then it turns out I can model that band-pass-filtered picture very accurately using what's called a Gaussian scale mixture model. A Gaussian scale mixture is a product of what's called a variance field and, basically, Gaussian noise. Gaussian noise times a variance field, which is really a matrix that details the structure in the picture. Now, that variance field is used by the brain, certainly, to do things like recognize objects, find edges, whatever the brain does to navigate and see the world. But there has always been a question mark in vision science: the Gaussian noise, how is that useful? Every picture and video you take will obey this model if you band-pass filter it. It's a universal model for pictures and videos. Can this be used? Nobody really had an idea, but it's this Gaussian noise part that is a secret to understanding how to predict picture and video quality, it turns out.

Band-pass processing also decorrelates: taking differences, you're reducing correlation. It's basically what's called predictive coding in communication theory. If I estimate this variance field by processing my picture or video, and I divide it out, then I just have the Gaussian noise. Or I could estimate it and then condition on it, and essentially that's the same thing, removing the effect of the variance field. And we'll talk about doing both of those things. So, regarding the variance field: if I band-pass filter, then you can show that the multivariate density of the band-pass-filtered image can be expressed in this way, where this is a correlation matrix and this is the variance field. We're not going to use this except one time; I'm not going to go into the math or the proof or anything like that. And using this, you can form a maximum likelihood estimator of the variance field, which is why I'm showing it to you. If I can estimate the variance field, I can condition on it or divide it out.

So what we find here is that every picture or video of the world has this remarkable underlying Gaussian property. Now, when I say every, I mean every high-quality picture or video of the world. We're going to get to the point where, if there's distortion or degradation of the video, then this property is broken or degraded. And that's how we can predict picture and video quality. Here are a couple of pictures, which we band-pass filtered, it doesn't matter how, just a wavelet, and then divisively normalized by the variance field, which we estimated using that equation I just showed you. And on a log scale, this shows the histograms. If it were not on a log scale, you'd see a bell-shaped curve, but this way you can see things with more precision, and underlying it, whoops, jumping ahead, you can see the Gaussian in dotted lines. It's almost a perfect fit, almost a perfect fit, empirically. So again, we took pictures, band-pass filtered them, estimated the variance field and divided it out, and you get something that is basically Gaussian noise.

Now, we haven't looked at the brain yet in this context. When we go and look at the brain, we see that the brain does the same steps we were just describing. As I said, in the retina, in the midbrain, in cortex: band-pass filtering of the visual signal in space and in time and in both. We also find, when we make actual electrophysiological measurements, that the outputs of the neurons in these places, in retina, midbrain, and cortex, are divided by the responses of neighboring neurons.
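A minimal sketch of that band-pass-and-divide recipe, in the spirit of the divisive normalization just described (assuming SciPy; the Gaussian window scale and the stabilizing constant are illustrative choices, not the talk's exact values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def divisive_normalize(img, sigma=7/6, C=1e-3):
    """Subtract a local mean and divide by a local standard deviation."""
    mu = gaussian_filter(img, sigma)                   # local mean
    var = gaussian_filter(img * img, sigma) - mu * mu  # local variance estimate
    sd = np.sqrt(np.maximum(var, 0.0))                 # variance-field stand-in
    return (img - mu) / (sd + C)                       # band-pass + normalize

img = np.random.rand(256, 256)  # stand-in: use a real photograph in practice
coeffs = divisive_normalize(img)
# Near-Gaussianity check: the excess kurtosis of a Gaussian is 0.
z = (coeffs - coeffs.mean()) / coeffs.std()
print("excess kurtosis:", (z ** 4).mean() - 3.0)
```

On high-quality photographs, the histogram of these normalized coefficients comes out nearly Gaussian, which is the empirical fit shown in the dotted lines.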
In neuroscience, this division by neighboring responses is called adaptive gain control. What it does is this: neurons can have very wide ranges of responses, but if you divide by neighboring values, you bring those responses into a much narrower range, so they're much more manageable. It's also efficient. If you divide by something similar, well, on a log scale that's subtracting; you're again removing redundancy. That's a form of predictive coding. So we can do the same thing with algorithms, mimicking the neurons with algorithms. We can band-pass using an algorithm. We can divide by neighboring values. And after we do that, we get what we call divisive normalization coefficients, which invariably look Gaussian on pictures or videos that are of high quality. So another one of the secrets: the band-pass-processed world has this underlying Gaussian property. And the visual system does the same type of processing I described, to somehow exploit that underlying Gaussianity.

Here's an example, a visualization of how you can see this in action yourself. Here's a picture of a little girl. I can't point at it, but what you'll notice is that on the smooth parts here, you see all this compression distortion. This has been JPEG compressed, and it's very obvious on smoother parts. The same amount of compression and distortion occurs in all of these areas where there's a lot of texture and roughness and flowers and that sort of thing, but you can hardly see it there. You can see a little of it, but it's much reduced. It's really a form of visual illusion. What's happening, though, is very explainable; we can model why this happens. It's because of the divisive normalization by the content. The content is high energy, and it reduces the visual impact of distortion everywhere except in this region. So obviously, if you're trying to predict picture quality with an algorithm, you had better account for the fact that in some places you can't see the distortion, there's an illusion, while in other places, the smooth parts of the picture, it will be very obvious.

So how do we formulate algorithms for producing predictions of picture quality using everything I've talked about? I realize there's a lot there, talking about the brain and the statistics of the world and how they're duals of each other and that sort of thing. So here's an idea. Suppose we have a picture of good quality, and then we distort it somehow, say by compression. We do this perceptual processing, which is very simple: some sort of band-pass filtering, and then divisive normalization, or conditioning on an estimated variance field. The result is that our output lands somewhere near Gaussian. Maybe not a perfect Gaussian; it's rarely perfectly Gaussian, but it's close. However, if I have a distorted picture, it will fall farther from Gaussian. So if I statistically compare this band-passed, perceptually processed picture with this one, I can try to compute a distance between them. That's what we call a reference quality prediction problem: I'm comparing my processed picture with the reference picture in this statistical space. Now, in some problems I don't have a reference. All I have is this, like when you take a picture with your camera: no reference, just the picture. In that case, you would just compare against a perfect Gaussian. So this is really a no-reference distance, and this is what we call a reference distance, referring to some original pristine picture. Here, we're talking about the reference case.
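For the no-reference case just mentioned, one simple way to score distance from Gaussianity, sketched here under my own simplifications rather than as the talk's exact algorithm, is to fit a generalized Gaussian shape parameter to the normalized coefficients by moment matching; shape 2 is exactly Gaussian, and distortion typically pushes the shape away from 2:

```python
import numpy as np
from scipy.special import gamma

def ggd_shape(coeffs, shapes=np.linspace(0.2, 6.0, 5801)):
    """Moment-matching estimate of the generalized Gaussian shape parameter."""
    rho = np.mean(np.abs(coeffs)) ** 2 / np.mean(coeffs ** 2)  # (E|x|)^2 / E[x^2]
    candidates = gamma(2 / shapes) ** 2 / (gamma(1 / shapes) * gamma(3 / shapes))
    return shapes[np.argmin(np.abs(candidates - rho))]

gaussian_like = np.random.randn(100_000)        # pristine-statistics stand-in
heavy_tailed = np.random.laplace(size=100_000)  # distorted-statistics stand-in
print(ggd_shape(gaussian_like))  # about 2.0: close to Gaussian
print(ggd_shape(heavy_tailed))   # about 1.0: far from Gaussian
```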
Remember, we're talking about this sort of cinematic space, where it might be Netflix or something like that. High quality: you have that reference available when you compress. Okay, so we need accurate models of the transmitter. I've been talking about that. They're simple statistical models, but they're very powerful. And models of the brain; again, these are limited. We understand at a low level how the ganglion cells and the visual cortex and some other brain centers operate, but those are simple models. They're dual models, though, and we're really applying them both simultaneously.

So our first approach to this, trying to use these models of our natural world, which had only been around for about a dozen or fifteen years, was this. Suppose we have the video: here's a reference frame, here's a test frame. I do the perceptual transformation, band-pass filtering using a wavelet transform. Then we estimate the variance field. We condition on it, in this case after adding some noise, to account for the uncertainty I mentioned in the visual brain: there's neural noise and other imperfections in our perception of video. So we have these local, conditional measurements of entropy everywhere. We scale them in a way I'll talk about soon. We pool them; you can think of just averaging over frames. Then we do what's called entropy differencing. Differencing is, of course, comparison: subtract the reference from the test, the original from the distorted, and then maybe pool that over time. It's a pretty simple processing protocol, computationally easy, that sort of thing. And that's for spatial. Then we apply the exact same diagram, but to frame differences. Take two frames in sequence: remember, video is a series of frames. We take differences between them. They're similar, so the differences have reduced entropy already. And we do the same processing to capture temporal information, the same series of steps throughout. So we have a bunch of spatial quality-aware features from here and a bunch of temporal quality-aware features from here, all of them exploiting the special properties I talked about. Multiscale: images are naturally multiscale, and so is our perception of them. Images obey these statistical laws, which we can exploit, comparing before and after distortion. We form these features.

So, just a little bit of math. Gaussian scale mixture model plus noise: that's our neural noise model. We apply this model to frames of a video and to frame differences, like I just showed you, of both the original and the distorted. So this is a cross product: original frames, distorted frames, original frame differences, distorted frame differences. The same model for all of these. We find the maximum likelihood estimates. Simple. We compute the conditional entropies; this is the actual closed-form solution, computed for all four of these combinations. Then, I mentioned there was scaling. The scaling we do is by the log of the variance field. This helps a lot, and it's very perceptually motivated as well. Basically, where the variance is large, there's a lot of activity and interest, so I give it a larger weight. If things are smooth, the logarithm goes to zero. It also tends to numerically stabilize these computations.

A number of algorithms, in fact, many algorithms today, have been derived from this model. The first one we called visual information fidelity. It was only for pictures.
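Here is a heavily simplified, scalar sketch of that pipeline; the real ST-RRED computation uses wavelet subbands and block covariance matrices, and the filter scales and noise variance below are illustrative, not the published values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

NEURAL_NOISE_VAR = 0.1  # sigma_n^2: models neural noise / visual uncertainty

def scaled_entropies(img, sigma=1.5):
    """Per-pixel conditional entropies of band-pass coefficients, log-scaled."""
    bp = img - gaussian_filter(img, sigma)       # crude band-pass filter
    local_var = gaussian_filter(bp * bp, sigma)  # variance-field estimate s^2
    h = 0.5 * np.log(2 * np.pi * np.e * (local_var + NEURAL_NOISE_VAR))
    return np.log(1.0 + local_var) * h           # perceptual log-variance scaling

def entropic_difference(ref, dist):
    """Pooled entropy difference: 0 when ref == dist, grows with distortion."""
    return np.mean(np.abs(scaled_entropies(ref) - scaled_entropies(dist)))

ref = np.random.rand(256, 256)    # stand-in for a pristine frame
dist = gaussian_filter(ref, 2.0)  # blur as a toy distortion
print(entropic_difference(ref, ref))   # 0.0
print(entropic_difference(ref, dist))  # > 0
```

The same two functions applied to frame differences would give the temporal features.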
Later, we extended visual information fidelity to the time domain. Later on, Netflix picked up on this and created something called VMAF, which takes these features, along with a frame-difference motion feature, and feeds them to a support vector machine. In other words, they decided to use machine learning. All of these are very powerful quality prediction algorithms: Netflix's VMAF, and VIF. Both have been used at global scale, throughout the world, to stream videos.

Can we turn off the speaker? Because I've got to be over here. I think I'm loud enough anyway. I'm going to show you a quality map being computed. That's great, much better. Here's an original. You can see this one has been distorted by compression; it's at a lower bit rate. This is a quality map using these features. The algorithm is called ST-RRED, spatio-temporal reduced-reference entropic differencing. It shows where the distortions are occurring, highlighted by brighter colors. That one has very simple content, just to exemplify things. Here's another one. Whether you can see the distortions also depends on your viewing distance. There are a lot of very severe blocking artifacts occurring here. Some are static; others change over time. This is the high-quality one. It has some distortions; no video is perfect, and this is not a great video. This other one has much stronger distortions and compression artifacts, which are being detected in this map.

Where is this used? Mostly, these days, these features are controlling the compression of videos throughout the streaming space, throughout the social media space, throughout Zoom calls, throughout all the videos and pictures going over the Internet. When you upload a picture to Facebook, it is quality controlled. When you stream it again, it's quality controlled again. The basic idea is that when you're doing video encoding, you control the encoding parameters using, for example, VMAF or some other quality algorithm like it. Then you transmit it and you receive it at your home. That's important because, again, 80% of bits these days are pictures and videos. Also, the Internet accounts for close to 10% of the carbon footprint. By using VMAF, the bandwidth is reduced by 25%. That's a huge impact on your bandwidth use when you're at home trying to get content of any kind. Because Netflix is so efficient, other companies are now following, like Meta Platforms and so on, who are starting to use VMAF more and more. It's also a big dent in the carbon footprint of planet Earth. We won a Technology and Engineering Emmy Award for this. This here. This was our second Emmy, actually. Well, it was during COVID, so I stayed home. That's an Emmy award, from the National Academy of Television Arts and Sciences. Academics don't usually win Emmy awards; that's true. We also won a Primetime Emmy Award previously, a different one, earlier.

Okay. I want to talk about some newer stuff. What I've told you so far is older, and this is a general audience, so I wanted to let you know what's happening in this world. Some newer work we've done is on high and variable frame rate videos. This is becoming important now. This great student was the lead on this; he gets credit for all this work. High frame rate videos are very important today, because even today's 60 frames per second isn't really enough for really fast sports. When there's a running back charging down the field and the camera's not following him, you'll see this effect called stutter, which I'll show you an example of in a minute.
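Before the frame-rate examples, here is a minimal sketch of the fusion step that VMAF-style models use, assuming scikit-learn, with synthetic placeholders standing in for the real quality-aware features and human opinion scores:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 4))  # rows: videos; columns: quality-aware features
y = X @ [2.0, 1.0, 0.5, 1.5] + rng.normal(0, 0.1, 200)  # stand-in opinion scores

# Fuse the features into a single quality prediction, as VMAF does.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
print(model.predict(X[:3]))  # fused quality predictions for three videos
```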
Today's 60 frames per second is not really adequate when you have high object motion or fast camera motion. So there's great desire by companies like Amazon and YouTube, those two in particular, who are getting into live sports streaming. Higher frame rates are coming: 90, even 120.

Just to show you an example, here are two videos. One will be at 60; I can't show you higher than that here today. This one will be at 24. If we look at 24, hopefully you'll see, when the camera moves quickly, that it's kind of discontinuous, what we call stutter. You're experiencing a temporal distortion called stutter, where it's jumpy. You can see this jumpiness happening all over the place. Whereas if it's at a higher frame rate, and I hope this TV is able to play it, it probably should, it's much smoother, even when the camera's moving much faster. It's not perfect; maybe even this isn't fast enough, because I can see a little bit of stutter with my eye. But it's much, much better, much more visually appealing. The other one is kind of annoying. So look at that when I... Again, I see a few people wondering, well, I didn't mind the first one, so I'll show it again. Maybe now you can see, when the motion is happening, that it bounces around, where there's more stutter. It depends on your viewing distance as well.

So we have extended the model. I'm not even going to talk about the spatial part, which is the same as what I've already described. The idea now is: can we do something in the time dimension that is more comprehensive, using larger windows in time, primarily, and doing some other type of reasoning? The temporal band-pass decomposition will be more sophisticated, and I'll also introduce the need for a third video: the reference, the distorted, and a third one, which we call the pseudo-reference.

Okay, so. Here's the video, a space-time function, and we're going to band-pass filter it with a bank of one-dimensional wavelet functions. They happen to be Daubechies biorthogonal 2.2 wavelets. Those who work in image processing or signal processing know exactly what I'm talking about: very famous symmetric wavelets with very fine properties for approximation and so on. But just put the words "band-pass filters" here, a bank of filters, and know that there's magic there. We reveal the secret properties of videos by band-pass filtering: that underlying Gaussianity is going to be exploited. We're modeling this part of the brain again, the middle of the brain, the lateral geniculate nucleus, with these filters.

This shows a band-pass-filtered video; I'm applying one of those filters to 24-frames-per-second video. This is a band-pass signal. Why is it grayed out? Well, mid-gray is like the zero level. Remember, a difference of low-pass filters, so everything is clustered around zero. You can see the band-pass-filtered version is very chaotic; it won't satisfy our Gaussian models very well. Whereas at a higher frame rate it's much better behaved, much more regular, and the statistics are regular too. Even though it's not perfect, you can still see some; I'd like a higher frame rate, but it's still much better.

So if I do the computations already described, band-pass filter, like I just showed you, and compute the conditional entropies, notice that the separation in those entropies is a function of frame rate. These are different bit rates, how much I compress the video. We see very consistent shapes, as we expect, but these are different frame rates; color is frame rate. This is 120, 98, 82, 60, etc. Big gaps.
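A minimal sketch of that temporal decomposition, assuming PyWavelets, with a random array standing in for a real video:

```python
import numpy as np
import pywt

video = np.random.rand(64, 128, 128)  # stand-in video: (time, height, width)

# Multi-level 1-D biorthogonal 2.2 wavelet decomposition along the time axis.
coeffs = pywt.wavedec(video, "bior2.2", level=3, axis=0)
approx, details = coeffs[0], coeffs[1:]
for k, d in enumerate(details, 1):
    # Each detail subband is a temporally band-pass-filtered video, clustered
    # around zero; on real content its Gaussianity degrades at low frame rates.
    print(f"band {k}: shape {d.shape}, std {d.std():.4f}")
```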
Already, then, we see how much this perceptual entropy is going to tell us about different frame rates. Clearly very predictive. However, we're also interested in the fact that when we compress, there's going to be more and more distortion, like blocking artifacts. We need to measure that too, and to capture it, we need to remove this bias of the frame-rate entropy. So we want to separate the measurements of compression artifacts from frame-rate artifacts. We'll remove the frame-rate artifacts, and that's where the pseudo-reference video comes in. So we have an original reference at 120; we have a bunch of those. We have distorted versions, compressed and with changed frame rates. We want to subtract out the frame-rate part. And the pseudo-reference is defined as not compressed: it's just like the reference, no compression, but it does have reduced frame rate. So the pseudo-reference is just a reduced-frame-rate version of the reference.

How am I going to use that? Well, for all of these, I compute the same sort of scaled entropies as before. So here's the conditional entropy; I scale it by the variance field, the same as before. But then I form what I call the generalized space-time features. The feature has two terms, one here and one here. This one is sensitive to compression distortions; this one to frame-rate distortions. And you can start to understand it, because if the distorted video has the same frame rate as the reference, meaning the only distortions are spatial compression artifacts, then this term collapses. (There actually should be a plus one here, which I forgot to put in: one plus this.) It then only depends on the distorted video and the pseudo-reference. Whereas if there is no compression, but only frame-rate distortion, then it only depends on this term. So they become isolated in those instances. And these features are zero only when all three videos are identical: no distortion, no frame-rate change, or anything like that.

So I'm not really defining a picture quality model here. I'm defining a set of features that we can add to any existing video quality model. Like VMAF, which I've already described, but there are other famous ones, like multi-scale SSIM, or literally anything. Even mean squared error can be improved if we put mean squared error here, compute features from it, and feed all these features to some simple learning mechanism like a support vector regressor, which is all we use here. No deep learning. I'm going to talk about deep learning a little bit; I know we're having a panel this afternoon. So: any model. And it turns out that when we train in this way, we get tremendous improvements in every model when we apply it to high and variable frame rate videos.

Now, these are just numbers; this is what we show when we're doing picture quality work: correlations with human judgments. Using a large database where we have tens of thousands of human judgments of videos that have been distorted by frame-rate changes and compression, these are algorithms that predict those human judgments, and this is the correlation between those human judgments and the algorithm predictions. What you see is that for every algorithm, this one, this one, this one, all of which are used pretty much globally to control your content, the correlations leap by a very large amount, by 0.1 to 0.2, very large numbers. A jump in VMAF from a correlation of about 0.79 to 0.87 represents hundreds of billions of bits every day.
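Since the formula itself is on the slide, here is only a schematic of the two-term structure as described, in my own notation; an illustration consistent with the behavior above, not the exact published feature:

```latex
% E_R, E_P, E_D: scaled entropies of the reference, pseudo-reference, and
% distorted videos. The first term vanishes when there is no compression
% (E_D = E_P), the second when there is no frame-rate change (E_P = E_R),
% and the feature is zero only when all three agree.
F \;=\; \left|\log\frac{1 + E_D}{1 + E_P}\right|
   \;+\; \left|\log\frac{1 + E_P}{1 + E_R}\right|
```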
That jump represents bandwidth changes, because they want to push compression to the very edge where the quality remains good. And if you can use this model to help you do that, and modify the frame rate too, then even better. Enormous bandwidth savings are coming from doing this kind of modeling. This will be implemented in these kinds of systems going forward. Even further, if we look at different frame rates, we see the improvement is gigantic at the lowest frame rates. That's stutter. In other words, VMAF and SSIM and so on are pretty much blind to frame-rate distortions at low frame rates. Stutter, they're blind to it. You leap from a correlation with human judgments of quality of about 0.25 to 0.35 up to about 0.75. That is really gigantic.

Use case 2 is only like two slides. This is high dynamic range, where we go from 8 bits per color to 10. You can see that when they're streaming HDR to you, right off the top, there's 25% more bandwidth required. That's very substantial when our infrastructures are all extremely stressed by the streaming of pictures and videos. But it's great. We love HDR because, you know, old TV is 100 nits, that's a brightness measure, and now we can have monitors at 1,000 or more. So we have all these standards for HDR, where you get whiter whites, darker blacks, and you can see detail in the brighter and darker regions, and more vivid colors and so on. But the problem is, since you have more bits, you have to compress more, and so there are more distortions. And existing models, unfortunately, are not very good at capturing these distortions in high dynamic range content. Things like VMAF don't do very well.

So we developed some new perceptually motivated features, called HDRMAX features, which again augment any previous algorithm. They're similar features to what we've described before, except we do some very simple processing first to highlight the HDR attributes of pictures and videos. So what is HDRMAX? It's a simple nonlinearity like this. That's all it is. At the darker darks down here, and at the brighter brights, values get emphasized by the nonlinearity. Mid-range values of brightness or color get compressed, squeezed down. It's emphasizing the very regions of the video that HDR is meant to improve visually, and crushing down everything else. Oh, we still allow regular VQA, but here we're highlighting these areas as well. So if we have a region, and this is not HDR, I can't display HDR here, it's not even an HDR monitor, but the point is, here's a dark region. A model like VMAF will be dominated by all the distortions in this sort of mid-range area, but these may be very visible artifacts that affect the appearance of the picture. By crushing the mid-range down like this, those are de-emphasized and these become more emphasized. And we can apply the same kinds of things, band-pass filtering, divisive normalization, and so on, to bring these out. Once again, we get huge jumps in performance on HDR content. VMAF jumps from about 0.65 to about 0.85. This is huge. In this case, it's Amazon who supported this work, and they are deploying it in their systems and so on. So it'll have more effect on your experiences.
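A toy version of such an expansive nonlinearity, my own illustration rather than the published HDRMAX definition: after rescaling a local patch to [-1, 1], an odd exponential curve expands values near the extremes and compresses the mid-range:

```python
import numpy as np

def expansive_nonlinearity(patch, a=4.0):
    """Expand the extremes and compress the mid-range of a local patch."""
    lo, hi = patch.min(), patch.max()
    x = 2.0 * (patch - lo) / (hi - lo + 1e-12) - 1.0  # rescale to [-1, 1]
    return np.sign(x) * np.expm1(a * np.abs(x)) / np.expm1(a)

patch = np.linspace(0.0, 1.0, 5)      # dark ... mid ... bright
print(expansive_nonlinearity(patch))  # mid-range squeezed, extremes at +/- 1
```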
So, just to summarize: video quality, which was considered unsolved from way back in Edison's day until this century, has become possible, predictable, usable, and now impacts all of visual communications. And it all comes about not by trying to model distortions specifically, but instead by modeling the statistical responses of neurons in your brain, which is what we've done. Model the statistical responses of neurons in your brain to distortion.

Now, what about deep learning? Remember, before, I said there are two categories of videos. We looked at the cinematic, high-quality ones, and then there are the user-generated ones. For the category we've been talking about, we don't need deep learning. Really, not yet, because people have tried. There are all sorts of deep learning models that do just what we're doing, but what are those deep learning models doing? They're learning these statistical laws. No matter how many layers, how many parameters, the performance is only incrementally better than these simple, so-called handcrafted models, and certainly the computational cost, the architectural complexity, all that stuff, is nowhere near worth it. So instead, all the big streamers are building these models in ASICs to stream everything everywhere. For the other problem, where we don't have a reference, well, the distortions are infinitely manifold. There are so many that deep learning is needed, and there are huge databases, and that's a story for another day. Thank you. These are the people who support our research today in doing this: gaming and streaming and user-generated content and that sort of thing, and also augmented reality, which is a big place for perceptual optimization.

And by the way, I'm about to show a visual illusion, one of my favorites, but if you have a sensitivity to flashing objects, don't look at it, or rather, I won't put it up very long. It's a little thing. I forget the name of it suddenly, I've been talking too long, but it looks like the people are always walking, trudging home, but they never get anywhere. It's a visual illusion caused by the flashing breaking up the motion computation. You think they're moving, it flashes, and so on; it's a wonderful illusion. But I'm not going to show it too long, because some people are disturbed by it.

Five minutes. We can maybe take a few questions. I know some people have to run, and there is going to be a panel at three-thirty. So, are there any difficult questions anybody would like to ask? Okay, go ahead, Meg.

Thank you. Yeah, very nice talk. I just have a question. You didn't talk a lot about them, but you did mention these subjective studies, these evaluations, and I know they're a very big part of your work. So just very briefly, maybe you can tell us a little bit about their importance and some of the challenges in your research.

Oh, my goodness. I mean, every semester we're doing two or three large-scale human studies of some aspect of picture or video quality. In fact, every one of these algorithms has behind it some study or studies of subjective quality. We either do it in the laboratory, where we'll bring in, say, 50 to 100 people, who will each view videos with varying degrees of distortion and content over three days, and we'll gather all of those and get, you know, 20 or 30 thousand measurements of video quality. Or we sometimes go online. It's very challenging to get people to do this properly online; I could go into the problems for a very long stretch. But in that way, we can put out even millions of contents, or millions of human opinion scores, and do deep learning.
So that's really for the blind problem, the no-reference problem, the UGC problem, where we have to do that. We do that as well.

So before we go to Charlie's question, for the students, one of the things I'd like to emphasize is that, you see, Professor Bovik here did some very, very nice mathematics where he understood what was going on. And that led to the development of these techniques, which every one of you is seeing when you watch Netflix, or with PlayStation, any of those. So all of those companies you see up there have his technology in them. And it's not slash-and-burn deep learning; it's based on some real understanding of what's going on in the system. So, about this idea, in some cases, where students feel "I don't need to learn the mathematics anymore, I'll just collect a bunch of images and train the hell out of it": the problem is that that doesn't work, and companies don't want that. They want somebody who can explain it. And you can see he explained his method and its mathematical basis. And for those of you here at Purdue, the underlying material he talked about here is ECE 600, or if you're an undergraduate student, ECE 302 and ECE 301. Those are our signals and systems course and our probability and stochastic processes courses. So I think that's one of the things that should be emphasized. Go ahead, Charlie.

Is this on? Oh, yeah. I'd first like to thank Al for a really wonderful talk, and Ed for his words of wisdom, which the students should really listen to because it's so important. My question is a little hyper-technical, but I can't help myself. And it's also a little unfair, because it speaks to the last slide, which you didn't get into, on the deep neural networks, which Ed just kind of trashed a little bit. So there's this emerging literature in this area of generative diffusion models. And it's turning out, and it's kind of a surprising outcome to me, that you can model the distributions of all these things with one of the most fundamental things, a denoiser. If you have a perfect denoiser, you have a perfect description of the distribution of anything. And when people look at images, they immediately know, okay, this image was processed. Like, Jake used to have an example where you were looking at a picture from the side and you could tell that it had been distorted. So your brain just immediately knows that it's not in the distribution of possible natural images. Do you think there's hope in connecting these kinds of basic theories to some of those kinds of outcomes?

Yes, we're already doing some of that, trying to detect images that have been generated by diffusion models. Diffusion models leave a different type of fingerprint than a natural scene has in it, and you can exploit that. One of the members of my team uses a noiseprint, which you can actually use to detect different types of diffusion models. So yeah, we've been doing that, and I think it dovetails a lot with some of the things you're doing; you and I have talked about this a lot. The goal of a lot of this generative modeling, for things like generative compression, generatively creating something out of nothing, generative content, all that kind of stuff, is really to create some sort of reality, even if it's a cartoon reality or whatever; I suppose it's meant to be like photographic reality. Well, if you're using a diffusion engine or something, you can do two things with it.
One, you can just train it on a huge corpus of pictures and that sort of thing, and it will learn to generate things that are realistic. Or, maybe with not so much of that, you can use science. Truth. You can use features like I just described to guide that generative model. That's exactly right; it's another way of saying "to guide that generation." So there's a natural fit, I think, in picture processing, where there's denoising; deblurring is still terrible. To do any deep task, you can inject known things to make your models better, and that will be a coming thing. Because after, what was it, 2012? Here we are 11 years later, and what have we got? CNNs and transformers. Everything else is just some variation of self-attention networks, ResNets, or whatever. How much creativity? I mean, it was the 1980s when we already had CNNs and backpropagation. So it's just another thing that Tom Huang said, my advisor, greatest image processor ever kind of guy, an old friend of Ed's as well. He said the history of the advancement of computer vision is the history of the advancement of the computer. And it was a joke of his, but he was right. We just have these GPUs that can crunch vast amounts of data to train these well-known architectures, which are universal approximators; Cybenko's theorem already told us back then that eventually we'd get here.

Yeah, I think the issue is, could you put a physics-based model into a diffusion system? People are looking at doing that. And then you can also help with the explainability. This idea that says "I put something into one of these neural networks, a miracle occurs, and I get an image out," to me that's analogous to solving the 10,000 monkey problem. You know the 10,000 monkey problem, right? You give 10,000 monkeys typewriters, you ask them to type away, and eventually they type a Shakespeare sonnet. And I think the other thing you didn't bring up is that, in some ways, people are willing to accept really crappy video if it's free. You can't account for that. I'm not trying to. I know, but that's the truth.

I think we probably have to stop here. The panel starts at 3:30. So why don't we give Professor Bovik another round of applause, and we'll be right back. Thank you.