Welcome to this course on data compression with and without deep probabilistic models. In this video series, you'll learn about a fascinating young research field that sits at the intersection of information theory and machine learning. To watch the entire video series, click on the link to the playlist down in the video description. The second link in the video description points you to a dedicated course website where you can find more course materials like problem sets and solutions. It's all free and you don't need to sign up or log into anything. Now, there are obviously already a lot of other educational resources out there, both for the subject of data compression and for machine learning. So if you want to find more resources then you may want to pause the video at this slide. But there are two things that I believe set this course apart from existing resources. First, we'll cover the entire spectrum from theory to applications. So we'll go through mathematical proofs of fundamental information-theoretical bounds and relations, but we'll also implement several highly effective compression methods and probabilistic machine learning models in real code. Second, there's been a new development in the field of data compression: a new class of compression methods now uses probabilistic machine learning models and deep neural networks. And these new methods are now starting to outperform many of the classical compression methods for images, videos and other data types. So in addition to the information theory, this course will also teach you how to design and train probabilistic machine learning models and how you can use such a trained machine learning model for data compression. Here are some topics that we will cover over the next videos, and you can see that they really cover the entire spectrum from theory to applications and from data compression to probabilistic machine learning. I don't expect you to read all the text on this slide now, but you might want to refer back to the slide once you've completed the course to remind yourself of how all the things that you've learned interact with each other. That's enough administrative talk for now. Let's get started. We'll begin by thinking about the general goal of data compression and how it fits into the wider problem of communication over some channel. In the rest of this video I'll give you the 30,000-foot view of a lot of the concepts that we'll discuss in the course so that you can see how they interact with each other. I'll deliberately not yet show any mathematical equations in this video, but starting with the next video we'll introduce precise mathematical definitions, prove fundamental theorems and also implement some models and algorithms in real code. And by the way, in case you want to read along in the lecture notes for any of these videos, you can always download them from this website. The link is also in the video description. So let's assume that we have two parties, a sender and a receiver, and the sender has some message that it wants to communicate to the receiver over some channel like, for example, the internet. Since you can't just talk directly into your Ethernet socket, the sender will have to encode the message in some way so that it can then be transmitted over the channel, and the receiver then has to decode whatever it receives on the other side of the channel in order to obtain a reconstruction of the message.
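To fix ideas, here is a minimal sketch of this pipeline in Python. It is purely illustrative: the function names are placeholders I chose for this example, the "channel" is an idealized one that passes bytes through unchanged, and no actual compression or error correction happens yet.

```python
# Purely illustrative sketch of the sender -> encoder -> channel -> decoder -> receiver
# pipeline. All names are placeholders; the "channel" here is an idealized,
# noise-free one, and no compression or error correction happens yet.

def encode(message: str) -> bytes:
    # In a real system, this is where compression (and, as we'll see later,
    # error correction) would happen. Here we only turn text into bytes.
    return message.encode("utf-8")

def channel(data: bytes) -> bytes:
    # Idealized channel: whatever goes in comes out unchanged.
    return data

def decode(received: bytes) -> str:
    # The decoder inverts whatever the encoder did.
    return received.decode("utf-8")

message = "hello, receiver"
reconstruction = decode(channel(encode(message)))
assert reconstruction == message  # a lossless, perfect reconstruction
```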
And depending on your use case you might either want to get a perfect reconstruction of the message, that would be a lossless reconstruction, or for many things like audio or video signals you might be okay with a lossy reconstruction that differs slightly from the original message as long as these differences are hardly perceptible. We'll cover both the lossless and the lossy case in this course. But there are more nuances to this process than the simple picture might make you think. First of all, we'll see that it's important to distinguish what kind of message you want to communicate. For example, the message could be an image file or a text file, or it could also be some real-time media that has to be communicated with low latency, like in video conferencing. More generally, the whole process of communication doesn't even have to be digital. The sender and receiver could also be two people who talk to each other while being in the same physical room, so the message would then be the utterances that they exchange. What's important is that all these types of messages have certain properties. For one, they could be analog or digital messages, but more importantly, the raw messages will almost always contain some sort of redundancies. For example, if the message that you want to communicate is an English language text file then you know that characters and words in these files follow certain patterns, like the letter Q is almost always followed by the letter U. So we could say that any U that follows a Q in English text is in some sense redundant. If this sounds a bit vague to you then don't worry. We'll formalize what I mean by redundancies in later videos and we'll actually introduce several information theoretical quantities that even put a number on how redundant or non-redundant some piece of data is. The reason why we'll talk so much about redundancies is because it turns out that data compression, and in particular lossless data compression, is all about removing redundancies from a message so that the amount of data shrinks and we have to transmit less data over the channel. So if the encoder's job is to remove redundancy from the message then it has to be able to distinguish what's redundant from what's genuine information. And for this task, the encoder needs a probabilistic model of the data source. What do I mean by that? Well, the problem is that the distinction between genuine and redundant information really depends on the type of message that you want to communicate. For example, in the Albanian language apparently the letter Q does not need to be followed by a letter U. Therefore, if you want to encode some Albanian text then any U that does follow a Q carries genuine information and is not redundant, so the encoder must not remove it. More generally, if you want to detect redundancies in a message then you need to know what kind of message you're dealing with, so that you can build a probabilistic model of common patterns that you'd expect in this kind of message. In fact, it's in this probabilistic modeling of the data source where there's been a lot of recent progress, because we can now use machine learning models inside a compression algorithm. And these machine learning models can capture more complicated statistical patterns than what any traditional compression method can do.
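To make the idea of a statistical pattern concrete before we get to learned models, here is a minimal sketch that estimates, from a tiny made-up sample text, how often a Q is followed by a U. The sample string is only an illustrative placeholder; a real data-source model would be estimated from a large corpus.

```python
# Minimal sketch: estimate the conditional probability P(next letter = 'u' | 'q')
# from a tiny sample text. The sample is only a placeholder; a real data-source
# model would be estimated from a large corpus.

sample = (
    "the queen quietly questioned the quality of the quartet "
    "and quickly required a quote for the quiz"
)

q_count = 0
qu_count = 0
for current, following in zip(sample, sample[1:]):
    if current == "q":
        q_count += 1
        if following == "u":
            qu_count += 1

print(f"P(u | q) is roughly {qu_count / q_count:.2f}, from {q_count} occurrences of 'q'")
# In English text this estimate is close to 1, which is exactly what makes a 'u'
# after a 'q' (almost) redundant; estimated from Albanian text it would be lower.
```

A machine learning model generalizes exactly this kind of statistic: instead of conditioning on just the single preceding character, it conditions on a much longer context and can therefore capture far richer patterns.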
For example, already on problem set number 3 you will be guided to implement a fully functioning compression method for English texts that models English orthography and grammar with a so-called recurrent neural network. And then later on in this course we'll talk about probabilistic machine learning models for images and other data types. We'll talk much more about machine learning models for data compression in later videos of this series. But these models have one important property that I already want to tease now, and that is that they have to be probabilistic models. That's because we want to encode real-world data and reality is messy. For most patterns that you'll see in real-world data you'll also find some exceptions. For example, even in English texts the letter Q in the proper nouns Iraq or Qatar is not followed by a U. Now it's one thing if these exceptions lead to long discussions during a game of Scrabble, but it would be much worse if such an exception meant that your compression method for English texts could not encode words like Iraq because they violate some of its assumptions. Fortunately, that's not the case. You'll already see in the next video how you can still make use of statistical patterns even if they are sometimes broken, and we'll see that it's all about probabilities. So one job of the encoder is to remove redundancies from the message, but it turns out that the encoder also has to introduce new redundancies. And to understand this we have to think more about the channel and its properties. If you're a computer scientist then you're probably picturing some digital communication channel like the internet, or more precisely some internet protocol like TCP or UDP. But if you're a physicist or an electrical engineer then you might be interested in more physical channels like the cables or fiber optics that actually make up the internet. Or you might be thinking about the propagation of sound waves when two people talk to each other. Finally, when you think about communication over a channel, let's also include in this picture the case of data storage. Thinking of data storage as a form of communication may seem a bit odd at first, but you can think of data storage as sending a message through time rather than through space. So when you save some file on your computer then you're essentially sending it to your future self, and the communication channel is your hard drive or your SSD. So similar to what we said about the message, the communication channel can also be either analog, like sound waves or fiber optic cables, or it can be digital, like an internet protocol or a file system. But one thing that we haven't discussed yet is that the communication channel might introduce errors in the transmission, which we'll call noise here. Of course, you typically don't want to use a channel that introduces noise, but if you go down to the physical level of abstraction of any communication channel then some amount of noise is just unavoidable because of thermal fluctuations and at some point also because of quantum mechanical effects. So if you can't avoid noise in the channel, does this mean that we have to completely give up on the idea of lossless communication? Of course not, otherwise the internet and pretty much every piece of information technology that is all around us could not exist. The solution is called error correction. We change the encoder so that it adds some redundancies to the message.
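As a minimal, purely illustrative example of this idea (a stand-in, not one of the error-correcting codes used in practice), here is a sketch of a three-fold repetition code: every bit gets repeated three times before transmission, and a majority vote on the receiving end undoes any single flip within a triple.

```python
import random

# Minimal illustrative sketch of adding redundancy for error correction:
# a 3-fold repetition code. Real systems use far more efficient codes.

def add_redundancy(bits):
    # The encoder repeats every bit three times.
    return [b for b in bits for _ in range(3)]

def noisy_channel(bits, flip_prob=0.05):
    # Each transmitted bit is flipped independently with probability flip_prob.
    return [b ^ 1 if random.random() < flip_prob else b for b in bits]

def correct_errors(received):
    # The decoder takes a majority vote within each triple, which corrects
    # any single bit flip per triple.
    return [1 if sum(received[i:i + 3]) >= 2 else 0
            for i in range(0, len(received), 3)]

message_bits = [1, 0, 1, 1, 0, 0, 1, 0]
received = noisy_channel(add_redundancy(message_bits))
print(correct_errors(received) == message_bits)  # True unless some triple got two flips
```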
The decoder can then use these redundancies to detect and correct any errors that might have occurred due to channel noise. Now it turns out that error correction is in some sense orthogonal to data compression, and therefore we'll touch on error correction in this course only very briefly. We'll mainly learn two things about error correction. First, we'll prove that, at least in principle, one can design an error correction scheme that makes it astronomically unlikely for channel noise to affect the reconstructed message. And second, we'll see that in order to design such an error correction code one needs to know which kinds of noise the channel can introduce and with which probabilities. So in other words, in order to prepare a message for error correction the encoder needs to have a probabilistic model of the communication channel. So the encoder needs two probabilistic models, one of the data source and one of the communication channel. And since the decoder essentially inverts everything that the encoder did, it too needs to have both of these probabilistic models. You should think of the models as being an integral part of the encoder and the decoder, and they have to match precisely between encoder and decoder, otherwise the receiver might reconstruct a completely wrong message. So before the sender and the receiver can exchange any message, we assume that they had some way of agreeing on the probabilistic models to use and also on the algorithms that turn these probabilistic models into coders. These are the ingredients that we need to design an encoder-decoder pair for lossless communication. But if you think about lossy communication, we also have to define what kind of distortion between the original message and its reconstruction on the receiver side is acceptable. So we'll need to define a distortion measure, and we'll talk more about popular distortion measures when we get to lossy compression later in this course. Like with the probabilistic models for the data source and the channel, the distortion measure also typically informs the design of both the encoder and the decoder, and you'll get the best performance if encoder and decoder agree on the distortion measure. But interestingly, the agreement on the distortion measure is actually not such a strict requirement, and there are sometimes good reasons to tweak the distortion measure that the encoder uses even after sender and receiver have agreed on a compression method. For example, a lot of photographs on the web are still encoded in the JPEG format even though much better formats have been introduced since. But with JPEG you can be sure that basically every web browser has a compliant decoder. And while we can't change how JPEG decoders operate for backward compatibility reasons, developers of image editing software can still improve their JPEG encoders by using better distortion measures based on improved models of human perception. So let's summarize what we've understood so far. Our goal is to communicate some message from a sender to a receiver, and we want this communication to be fast, i.e., we want to use the communication channel as little as possible, but we also want the communication to be reliable, i.e., we want to be able to decode the message either without any errors (that would be lossless communication) or, if we're dealing with lossy communication, with some amount of distortion that we accept, but not too much. And to achieve this goal we need three things.
A probabilistic model of the data source, a probabilistic model of the communication channel, and in the case of lossy communication a distortion measure. Now let me briefly come back to one point that I mentioned earlier. I said that in this course we'll focus mostly on data compression and we'll touch on error correction only very briefly. But can we actually separate these two tasks? This turns out to be a surprisingly difficult question to answer, and it's the subject of the so-called source channel separation theorem, which we'll prove later in this course. The short answer is yes, we can separate data compression from error correction, at least in principle. So here's again our communication pipeline, and so far we've said that the encoder essentially has to replace redundancies that are intrinsic to the data source by different redundancies that are tailored to allow for error correction over the noisy communication channel. The source channel separation theorem now states that these two tasks can indeed be separated, at least in principle. So you can take the same pipeline but now split the encoder into two parts, called source coding and channel coding, and do the same for the decoder, just in reverse order. And the important part here is that the two source coding modules on the encoder and the decoder side only use the probabilistic model of the data source, so they are oblivious to details of the communication channel, whereas the channel coding modules only use the probabilistic model of the channel, so they are oblivious to what kind of message you want to communicate. Now this might look like a trivial statement at first, and in fact this more modular pipeline might be what you've actually had in your head from the start, but the non-trivial statement that the source channel separation theorem makes is that enforcing such a separation of tasks does not degrade the performance of the whole pipeline. So for any encoder-decoder pair in the upper picture, where you are allowed to optimize the encoder and decoder for both the data source and the channel, you can, at least in principle, find a pair of source encoder and decoder that may only depend on the model of the data source and a pair of channel encoder and decoder that may only depend on the model of the channel, such that the resulting bit rate and distortion of this more modular setup are at least as good as in the upper setup. By the way, I should say that the phrase source coding here has nothing to do with the source code of a program that is written in some programming language. Data compression is just called source coding because it uses a model of the data source to encode and decode messages. Now I keep using the phrase "in principle" when I talk about the source channel separation theorem, and that is because the theorem only states that such an optimal separation into source coding and channel coding exists; actually finding this perfect separation is typically prohibitively expensive. So if you really have control over the entire pipeline, that is, if you know the data source and if you have access to the physical noisy channel, then you might indeed be better off not splitting your encoder and decoder into source coding and channel coding parts. And that can indeed be a good idea if your problem involves very specific hardware setups with computational limitations, like maybe drones or RFID tags.
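To make the modular structure concrete, here's a minimal sketch of such a split pipeline, under illustrative stand-in choices: off-the-shelf zlib compression plays the role of source coding (in place of a model-based compressor), and the toy three-fold repetition code from before plays the role of channel coding. The point is purely structural: the source coding stage never looks at the channel, and the channel coding stage never looks at what kind of message is being sent.

```python
import random
import zlib

# Minimal sketch of the split pipeline: source coding (here: zlib, as a stand-in
# for a model-based compressor) is oblivious to the channel, and channel coding
# (here: the toy 3-fold repetition code) is oblivious to the kind of message.

def source_encode(message):           # only needs a model of the data source
    return zlib.compress(message.encode("utf-8"))

def source_decode(data):
    return zlib.decompress(data).decode("utf-8")

def channel_encode(data):              # only needs a model of the channel
    bits = [(byte >> i) & 1 for byte in data for i in range(8)]
    return [b for b in bits for _ in range(3)]   # add redundancy

def channel_decode(received):
    bits = [1 if sum(received[i:i + 3]) >= 2 else 0   # majority vote per triple
            for i in range(0, len(received), 3)]
    return bytes(sum(bits[j + i] << i for i in range(8))
                 for j in range(0, len(bits), 8))

def noisy_channel(bits, flip_prob=0.001):
    return [b ^ 1 if random.random() < flip_prob else b for b in bits]

message = "a message from the sender to the receiver"
received = channel_decode(noisy_channel(channel_encode(source_encode(message))))
print(source_decode(received) == message)
# Almost always True at this low noise level; if a corrupted triple slipped
# through the repetition code, zlib.decompress would typically raise an error.
```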
But most of the time, when we communicate over the internet or when we store files on a file system, we don't even have access to the underlying physical channel because it is already abstracted away by existing channel coding infrastructure. So to some degree, the separation between source coding and channel coding is just good engineering practice: keeping complex systems modular. To wrap up this introductory video, let's recap what we've learned so far in the form of three questions so that you can check your comprehension. I'll first go through all three questions, then you can pause the video, and then I'll show you how I would have answered the questions. The first question is deliberately a bit vague because I want to encourage you to think again about the communication setup. So here it goes. Assume you want to transmit some message over some channel. The message is given as a string of n bits. Okay, so maybe you have a text file or an image and in its raw format it takes up n bits. Now the question goes on and says that the channel transmits one bit at a time but it occasionally introduces a random bit flip. Okay, so you have a digital channel and the channel is noisy. Now here comes the question. How many bits do you have to transmit over the channel if you want to be certain beyond reasonable doubt that the message can be decoded without errors? And the options for the answer are A, you have to transmit n bits, so that would be precisely the length of the message in its raw form. Or B, you have to transmit more than n bits. Or C, you have to transmit fewer than n bits. And the answer D is a bit ominous and it just says "it depends". Again, this question is deliberately a bit vague, but the next two questions might actually help you answer it. The second question reads as follows. Assume you have a noise free channel. And you have a message that contains some redundancies. As an example, think of English text that follows some orthography and grammar. Now here's the question. What should the encoder and decoder do with these redundancies if our goal is efficient communication, so if we want to use the channel as little as possible? And now finally, question 3 builds on question 2 and it reintroduces channel noise. So it reads: now let's assume that the channel introduces noise. Okay, so we again have maybe an occasional random bit flip along the channel. What additional tasks do encoder and decoder have to do? And in particular, you're asked to think again about redundancies. So pause the video now and think about your answer, then I'll tell you what I would have answered. Okay, so here's what I would have answered. For the first question, whenever there's an option that says "it depends", it's probably the right option. And indeed, that's what I would have answered here, because we aren't told much about the data source or the communication channel. All we know is that we're tasked with lossless communication. But on the one hand, the data source could be something that can be strongly compressed, like maybe a text file, and this would allow us to reduce the number of bits that we have to transmit. On the other hand, the communication channel could be very noisy, and then we would have to add a lot of redundancies for channel coding, or error correction. So it really depends on the specifics. Now in question 2 we're told a bit more, because we now know that the message contains redundancies, so it can be compressed, and the channel is noise free, so we don't have to worry about error correction.
The question was what the encoder and decoder need to do with the redundancies, and my answer here would be that the encoder should remove the redundancies from the message so that fewer bits have to be transmitted, and the decoder should reintroduce them when it reconstructs the original message. This is called source coding, or also data compression, and this is the main focus of this course. Finally, in question 3 we now have again a noisy channel, and the question is what additional tasks encoder and decoder have to do, in particular regarding, again, redundancies. Here my answer would have been that the encoder now has to add some redundancies, but it has to add different redundancies than the ones that it removed from the message, because these added redundancies have to be tailored to the channel so that the decoder can use them to detect and correct errors. This is called channel coding, or error correction. If you found these questions to be a bit abstract, then I encourage you to have a look at the first problem set, which I labeled problem set number 0 because we've only finished the introduction and we haven't really introduced any formal definitions and theorems yet. In the first problem of this problem set you'll get several examples of both source coding and channel coding from everyday life. Some of these examples are obvious, but some of them will likely be things that you might not expect when you think about communication. The links to the problem set and to its solutions are in the video description. That's all for this introductory video. In the next video we'll get more concrete and we'll introduce our first class of compression codes, so-called symbol codes. See you there!