First of all, thanks for inviting me here. I'm quite humbled to speak in such a prestigious place, in particular in between two well-known professors who address much bigger challenges than what I'm trying to do. The goal of my work is simply to make communication systems better, and one way of doing this is using machine learning. I would like to speak a little bit about the idea of learning entire communication systems from scratch. To set the stage for what I'm trying to do, let's go back to Shannon's original paper, where he describes the problem of communication as that of reproducing at one point, either exactly or approximately, a message which has been selected at another point. This is a problem that has driven many generations of engineers for the last 70 years, and the way we typically deal with it is to formulate it mathematically. We have a transmitter, a channel, and a receiver, and the goal of the transmitter is to transmit one out of M messages reliably to the receiver. M can be extremely large: if you had 1,000 bits of information, that would be 2 to the power of 1,000 possible messages. Since you cannot transmit bits or messages, which are just integers, directly through the air, you need to modulate them onto an electromagnetic wave. In this example, we represent the transmitted waveform by a complex number x, or a complex vector x; these would be the modulated IQ samples. So essentially what you transmit is x. This goes through a channel, which can be described by a conditional distribution p(y | x), where y is the vector you receive. The goal of the receiver is to make a guess ŝ about which message has been transmitted, and the goal of communication is to minimize the probability of making an error. 
Now, the rate of such a system, just for the definition, would be: I have k bits, where k is log2 of M, and I have n channel uses, so in total I transmit k over n bits per channel use. The way we have essentially solved this problem until now is through two steps. First, you start by building a physical channel model. You do measurements, or maybe sometimes you just look at the Schrödinger equation, and you come up with a channel model. In essence, you need a mathematical model that you can work with. Once you have your mathematical model, you can try to solve the communication problem. And since this problem is complicated, we cut it into many small pieces that we then try to solve individually: source coding, channel coding, and modulation on the transmitter side, then synchronization, channel equalization, demodulation, decoding, et cetera. Now, I argue that the main driver for using machine learning in this setting is two kinds of insufficiencies. First, any model we can come up with is mismatched to the real world. There's always this trade-off between mathematical complexity and convenience for analysis, so we never actually work with the real system; there's always some approximation error. Secondly, this modular block structure, where we cut a complex problem into many pieces, is suboptimal. For example, in many cases separating source and channel coding is suboptimal, and separating demodulation and decoding is suboptimal. So this is the reason why, four years ago, some colleagues and I asked the question: would it be possible to build a system that learns to communicate over any type of channel from scratch? Now I just want to show a small diagram of why this might be worthwhile. What are the reasons why we expect to gain with respect to existing technology? 
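Just to make the definition concrete, here is a minimal sketch of the rate computation (the function name and the 500-channel-use example are mine):

```python
import math

def rate(M: int, n: int) -> float:
    """Rate in bits per channel use: k = log2(M) bits sent over n channel uses."""
    k = math.log2(M)
    return k / n

# 1,000 information bits means M = 2**1000 possible messages;
# spread over 500 channel uses this gives 2 bits per channel use.
r = rate(2 ** 1000, 500)
```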
Take your phone: it does this job incredibly well. So one could argue the problem is solved; why do we expect to gain? This is just a very simple decision tree that shows the areas where we could hope to make improvements. First of all, we have the problem of modeling something. Suppose you have a model that is sufficiently good. For example, in our case you could say: I make an observation y, x is what has been transmitted, it goes through a channel matrix H, and I add some Gaussian noise. You say this model is good enough; it describes fairly well what's happening in my system. Then you can go to the next step and try to come up with an algorithm to solve the problem. For example, if the goal is to recover x from y, knowing H, I could say: maybe I just invert the channel matrix. This is called a zero-forcing receiver. If this is good enough for you, then there's a last question you need to ask, and that is complexity: is the algorithm you have come up with for your model something you can implement in practice? For Nokia, this is extremely important. It might be less relevant for writing a good paper, but we often come up with algorithms and prove that they converge to the optimal solution, yet we would never be able to run them in practice. So if you say complexity is not an issue — here it would be maybe just a three-by-three matrix, I can easily implement it — then there's no hope for machine learning anyway. What would be the point? Essentially the problem is solved. But now consider all the other cases. For example, if you have no model — you can make many observations y for many inputs x, but you have no clue how they relate to each other — then there's no way we can even come up with an algorithm. 
That's what we typically see in image classification: there is no algorithm telling you directly how a cat or a dog looks. Then you need to ask: can I generate a lot of data, a lot of these pairs of observations? If the answer is no, because it's too costly, too time-consuming, too complex, then machine learning cannot help you either. But if you have access to a huge data set, you can essentially use machine learning to deal with the lack of a good model. And the same happens in all the other cases. If you have a good model, but the model is so complex that you cannot come up with a good algorithm, that's typically what happens in optical communications: you know that the nonlinear Schrödinger equation tells you how the channel behaves, but no one really knows what the optimal way of communicating over such a channel is. In this case, machine learning might help you come up with a better algorithm. And lastly, if you have a model and a good algorithm, but the algorithm is just too complex, there's hope that a machine learning solution would be almost as good as the optimal one, just with lower complexity. Now, the interesting thing about this diagram is that if you're on the left-hand side, machine learning allows you to do things you could not do before. I think this is the disruption we currently see, for example, in computer vision: there are many applications you simply couldn't do before. If you're on the right-hand side — you have a good algorithm but it's not perfect, or complexity is an issue — then machine learning might allow you to do things you could already do, but better. And unfortunately, I think in communications we are more on the right side in most cases, because we already know almost perfectly how to do the job. So the margin for improvement is small. But it's still worthwhile to look into it. Good. 
I'm very happy that Stefan mentioned autoencoders, because to build a communication system from scratch, there's an analogy to be made with an autoencoder. But my use of the autoencoder is far more simplistic. For me, an autoencoder is a neural network that gets a vector x as input, and the only purpose of the autoencoder is to reproduce x at the output. So ideally you would learn the identity function. This would be totally useless, except that we care about a specific intermediate representation the autoencoder learns. Essentially, what I care about is some representation of the input, call it y, which has certain characteristics. In this example, the input is four-dimensional and y is just three-dimensional, so if I'm able to reproduce x perfectly from y, I've learned a dimensionality reduction, some kind of compression mechanism. Now, in communications it's slightly different. We typically do not care about reducing dimensionality, because the information we want to transmit is usually not compressible. We rather have an autoencoder that increases the dimensionality, because we would like to make what we transmit robust. So you can see a communication system as an autoencoder where the chain of transmitter, channel, and receiver is one large neural network, and the goal is simply to reconstruct the input message s reliably at the output. What we actually care about is what comes out of the transmitter: the goal is to learn a representation of your message that is robust with respect to the stochastic transformation this message undergoes through the communication process, i.e., the channel. The nice thing is that, if you interpret it like this, you can train the system end to end: you optimize transmitter and receiver jointly for a given channel model. 
And it's a universal concept which applies essentially to any channel p(y | x). This can be acoustic, molecular, optical — essentially anything. Now I'll give you a very simple example with an AWGN channel, a channel where you just add Gaussian noise, so y is equal to x, what you transmit, plus noise. Here's a simple architecture of how such a neural network could look. You inject an integer; this goes through an embedding and a couple of dense layers. Now, one thing to keep in mind is that we always need complex numbers, because what we send over a channel is modulated onto a cosine and a sine wave — two degrees of freedom you can use independently — so we deal with complex vectors. We essentially always interpret some part of the outputs as the real part and some part as the imaginary part; the choice is totally arbitrary. And we need to ensure that we have the right power normalization. I went over this too quickly: of course there's a constraint on what the transmitter can output. You typically have an average power constraint on your messages, simply because you have a limited amount of energy. After this normalization, you feed the output into any channel you want, any mathematical model you could have. I will come back to this — I just said we don't want to have a model, but actually here we need one. So you can feed this into any stochastic transformation you want. At the receiver, you essentially do the inverse: you reinterpret some parts as the real part of a complex vector and some parts as the imaginary part, send this through a couple of dense layers, and do a classification task. Just for future reference: the transmitter is called f of theta t, where theta t are its trainable parameters, and g of theta r is what the receiver does, with trainable parameters theta r. 
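As a rough numpy sketch of the two ingredients just described — interpreting half of the real outputs as real parts and half as imaginary parts, normalizing to the average power constraint, and passing through the AWGN model. All function names are mine, and the "dense layer output" is just random numbers standing in for a network:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_power(x: np.ndarray) -> np.ndarray:
    """Scale a batch of complex symbol vectors to unit average power."""
    power = np.mean(np.abs(x) ** 2)   # E[|x|^2] over batch and channel uses
    return x / np.sqrt(power)

def awgn_channel(x: np.ndarray, snr_db: float) -> np.ndarray:
    """y = x + n with complex Gaussian noise at the given SNR (unit signal power)."""
    noise_var = 10 ** (-snr_db / 10)
    n = np.sqrt(noise_var / 2) * (rng.standard_normal(x.shape)
                                  + 1j * rng.standard_normal(x.shape))
    return x + n

# A dense layer outputs 2n real values; interpret the first n as the real
# part and the last n as the imaginary part (the choice is arbitrary).
out = rng.standard_normal((4, 6))            # batch of 4 messages, n = 3 channel uses
x = normalize_power(out[:, :3] + 1j * out[:, 3:])
y = awgn_channel(x, snr_db=10.0)
```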
Now, what does the system learn? Here is a simple example with eight messages and one channel use. This means the transmitter will transform a number between one and eight into a complex number. At the beginning, the transmitter represents these eight messages as these eight points on the complex plane. On the right-hand side, you see the probability of error over an AWGN channel as a function of the signal-to-noise ratio; you simply scale the variance of the noise you add. For comparison, I show a scheme called 8-PSK, where the eight messages are uniformly distributed on the unit circle, and the red line is the maximum likelihood performance you get. Now, with the untrained system, when you initialize everything randomly, you get these eight points here, and your receiver essentially does random guessing because it hasn't learned anything. Then I train the system, and I show you a video of how this progresses. The transmitter learns to separate these points better and better, and the receiver learns to come up with the right decision regions. At some point, this converges to a stable solution. What's kind of nice in this example is that the system has come up with a scheme where one message has zero energy. When I tried this four years ago, I was surprised — I thought, well, this is a nice result — but it turns out that an ex-colleague of mine had done this in 1974 and proven that this is actually the optimal constellation, the one that maximizes the mutual information between X and Y. The thing is, this is not a convex problem, and the fact that I could easily learn it in a couple of minutes motivated me to look into something more complex. 
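To make the 8-PSK baseline concrete, here is a small Monte Carlo sketch of its symbol error rate under maximum likelihood (nearest-neighbor) detection over AWGN — a stand-in for the red curve, not the trained system:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8
constellation = np.exp(2j * np.pi * np.arange(M) / M)   # 8-PSK on the unit circle

def simulate_ser(snr_db: float, num: int = 200_000) -> float:
    msgs = rng.integers(0, M, num)
    x = constellation[msgs]
    noise_var = 10 ** (-snr_db / 10)
    y = x + np.sqrt(noise_var / 2) * (rng.standard_normal(num)
                                      + 1j * rng.standard_normal(num))
    # ML detection over AWGN = pick the nearest constellation point
    guesses = np.argmin(np.abs(y[:, None] - constellation[None, :]), axis=1)
    return float(np.mean(guesses != msgs))

ser_low, ser_high = simulate_ser(5.0), simulate_ser(15.0)
```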
Can you apply the same idea to more complex channels? And by the way, I've put this on Google Colaboratory if you want to play around with it. Now, the biggest problem is the following: to train the system end to end, we need to model our channel in order to backpropagate gradients through it. But the whole point was that I don't want to train on a simulated channel; I would like to do it on a real system, to compensate for this mismatch between the model and the real world. So the question is: how can you learn to communicate through a black box? I struggled with this problem for quite some time and tried various things. The first, obvious idea was: let us model this black box as well as we can, train our system over the analytic channel model, then deploy it. This might somehow work, but you don't get optimal performance. The only thing you can do then is fine-tune the receiver, because there you can gather many observations for which you know what has been transmitted. But the problem is that you never optimize the transmitter for the actual channel, and this is what limits performance. The second idea — and this goes towards the generative modeling that Stefan also talked about — is to say: let me first learn a model of this black box from data. For example, you could use a conditional generative adversarial network. Once you have learned this model, you can replace the black box by a neural network which is fully differentiable, and then learn to communicate over this learned channel model. This kind of works, but the problem is that wireless channels are hard to model, and I think optical channels are even harder. 
And it turns out that I never managed to make good progress on this. There has not been any in-depth study, but I think it would be a very worthwhile topic to look into how well we can model communication channels through generative models; I think no one has really done this. So what we then came up with was a method where you avoid channel modeling altogether: don't try to have an explicit model, but just use trial and error to figure out what works best. This is inspired by what people do in reinforcement learning. The key idea is simply to come up with an estimate of your gradient from samples, and we do this with something called a policy gradient. Rather than showing the equations, I'll give a hand-waving explanation of what's going on. You see the transmitter as an agent which has a state s — the state is the message you want to transmit — and the action is what you actually transmit over the air. The environment is the combination of the channel and the receiver, and it computes a reward: this is just the cross entropy between the message which has actually been transmitted and what the receiver thinks it is. At this point, you only care about making this agent better, i.e., increasing the reward. It's only about making the transmitter better, because, as I've shown on the last slide, making the receiver good is easy — that's a supervised learning problem — but making the transmitter better for a given receiver is hard because of this black box. So how does this work? You randomly initialize the transmitter and receiver neural networks. The first thing you do is try to make the receiver better: you let the transmitter send a lot of messages, which are probably total gibberish since nothing has been optimized yet. Think about those eight points. 
They are not well separated; you just transmit something. But you can still improve the receiver, finding the right decision regions through supervised learning. That's the easy part. Now you keep the updated receiver and try to make the transmitter better for the current receiver, and this is where you use what is called a policy gradient. In order to explore better ways to communicate, the transmitter creates its output with the neural network, but before transmitting it over the air you add a small perturbation — typically some Gaussian noise, called exploration noise. Then the receiver computes losses and sends them back to you, and based on these losses and the knowledge of the perturbations you have added, you can compute a gradient estimate. So essentially, you transmit small perturbations around what you actually want to transmit, and this helps you to estimate a gradient. Then you update the transmitter based on this and iterate: you improve the transmitter for the current receiver, then the receiver for the current transmitter, and essentially you keep your fingers crossed that this converges. We have no proof that it converges to some local minimum, but it works extremely well. I made a demo of this — it's already almost two years old — with two software-defined radios, and the goal was simply to implement this algorithm in practice. Here I show the probability of error, the block error rate, as a function of the number of training iterations, compared to a good baseline, a coherent QPSK system. What's kind of nice is that after around 100 iterations, you start to outperform it. Now, this is not a massive gain or anything. 
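A toy sketch of the transmitter-side policy gradient described above: Gaussian exploration noise is added to the output, the environment returns only scalar losses, and a gradient is estimated from losses and perturbations alone — no gradients flow through the black box. The quadratic loss here is a stand-in of mine for the real channel-plus-receiver feedback:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for "channel + receiver": returns only scalar losses.
optimum = np.array([1.0, -2.0])   # unknown best transmit point (toy)
def black_box_loss(x):
    return np.sum((x - optimum) ** 2, axis=-1)

theta = np.zeros(2)   # the transmitter output we want to optimize
sigma = 0.1           # std of the Gaussian exploration noise
lr = 0.05

for _ in range(2000):
    w = sigma * rng.standard_normal((256, 2))   # exploration perturbations
    losses = black_box_loss(theta + w)          # scalar feedback only
    baseline = losses.mean()                    # variance reduction
    # Policy-gradient estimate of d E[loss] / d theta from losses + perturbations.
    grad = np.mean((losses - baseline)[:, None] * w, axis=0) / sigma ** 2
    theta -= lr * grad
```

The estimator is unbiased for a Gaussian exploration policy; subtracting the mean loss as a baseline is a standard variance-reduction trick.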
No one would care about these small improvements. But what's nice is that you learn to do this in one hour, while it took engineers 50 years to come up with such a solution. That's why I think it's kind of nice. Now, there is a mathematical, or information-theoretic, interpretation of what's going on, and I want to speak a little bit about this. In particular, we do not want a pure black-box approach; we try to inject as much knowledge about the problem as we have, to come up with a principled approach. This is, I think, very similar to what has been shown in the first talk: we have multiple processing blocks about which we know that they are the right thing to do, but there are some parts we are unsure about, parts that depend on the channel. This is where we try to augment existing algorithms with deep learning. We inject as much structure and expert knowledge as we can into the problem, and improve those parts where we think there might be a mismatch between our model and our algorithm. Okay, how do we interpret what the autoencoder actually learns? On the left-hand side is what we are training on. If you remember the notation, g of theta r is the output of the receiver; this is a probability vector over all messages. And this is the categorical cross entropy, in expectation over all messages, which are uniformly distributed, and over all channel outputs p(y | x), where x is what the transmitter sends given a message s, and the receiver outputs its belief. We would like to minimize this with respect to the parameters of the transmitter and the receiver. 
Now, if we had infinite training data and a sufficiently large receiver, you can be certain that this will converge to the maximum likelihood solution; it will learn the maximum likelihood detector. In essence, the best this can be is the true posterior p(s | y); you cannot do better than that. So we are optimizing an upper bound on the cross entropy you would get for the maximum likelihood estimator, and minimizing this is nothing else than minimizing the conditional entropy of the transmitted message s given y. In practice, we don't have access to this expectation; we can't compute it numerically, so we do Monte Carlo sampling: we send many messages over the air and do gradient descent on the loss function. And since we effectively optimize the conditional entropy, this is the same as maximizing the mutual information between s and y. This is where I find it comforting: we actually maximize the mutual information — that's what Shannon wanted us to do — so it's good that we get this back. And what this does, in the simple example I've shown you, is learn geometric shaping: how do you arrange the points that represent messages? You don't need a neural network for this at run time; once you've learned it, you throw the network away and save it as a small lookup table. And as I said, it's important that the receiver is sufficiently rich that you can get close to the maximum likelihood estimator, because otherwise you would be optimizing something which does not correspond to the mutual information. It would be something else — you would be constrained by the allowable complexity of your receiver. 
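A tiny numerical illustration of this chain of identities, using a binary symmetric "channel" of my own choosing where everything is computable in closed form: the Monte Carlo cross entropy of the true posterior matches the conditional entropy H(S|Y), and I(S;Y) = H(S) − H(S|Y):

```python
import numpy as np

# S uniform over {0, 1}; the channel flips the symbol with probability p.
# The ideal receiver outputs the true posterior, and its cross entropy
# equals the conditional entropy H(S|Y) = h(p) (binary entropy of p).
p = 0.1
H_S = 1.0                                        # uniform binary source, in bits
H_S_given_Y = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
mutual_information = H_S - H_S_given_Y           # I(S;Y) = H(S) - H(S|Y)

# Monte Carlo estimate of the same cross entropy by sampling the channel.
rng = np.random.default_rng(3)
s = rng.integers(0, 2, 1_000_000)
flip = (rng.random(s.size) < p).astype(np.int64)
y = s ^ flip
posterior_correct = np.where(y == s, 1 - p, p)   # true posterior prob. of the sent s
ce_mc = float(np.mean(-np.log2(posterior_correct)))
```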
You would essentially optimize the constellation for a bad receiver, which is not what we want. Now, there's one more problem: this whole idea does not scale well to a very large number of messages. If I would like to communicate a hundred bits of information, I would have 2 to the power of 100 messages — an extremely difficult classification problem. In ImageNet, which you've seen, there are 1,000 classes; we have way, way more. You cannot train such a system for this huge classification task; it simply doesn't scale. So what you need to do is couple this with an outer channel code. You say: I have a stream of information bits, I encode them into a vector of coded bits, then I chop this vector of coded bits into small pieces — say, six bits each — and I sequentially learn to communicate only vectors of six bits. At the receiver, I accumulate all of the outputs, and once I have the full code word, I combine everything and try to decode what has been transmitted. But now, since we no longer communicate messages but bits, I have the problem of bit labeling. I've shown you these constellations in the video; typically, the output of the training process is a point cloud like this, with 64 different points. The question is: how do I assign bit vectors to each of these points? There's an optimal way of doing it, but it depends a lot on the goal you want to accomplish, on your metric, and doing it by hand is a very complicated problem which doesn't scale well to large dimensions. I want to show you that you can also learn this and get it essentially for free. So, from an information-theoretic perspective: we have a vector of m bits that we feed into the transmitter. 
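The bit-chopping step above can be sketched as follows (array names and sizes are mine; the integer labels would index the learned constellation):

```python
import numpy as np

rng = np.random.default_rng(4)

m = 6                                  # bits per channel symbol
coded_bits = rng.integers(0, 2, 120)   # e.g. the output of an outer channel code

# Chop the coded bit stream into vectors of m bits; each vector is one
# "message" for the learned transmitter (here 2**6 = 64 possible symbols).
bit_vectors = coded_bits.reshape(-1, m)

# Integer label of each bit vector (MSB first), e.g. to index a constellation.
labels = bit_vectors @ (1 << np.arange(m)[::-1])
```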
It computes an output symbol x, and we observe y. Now, if we optimize this system to maximize the mutual information between x and y — that's what we've done so far with the categorical cross entropy — this is the same as maximizing the mutual information between the bit vector b and y, because there's a one-to-one mapping. The problem is that in practice we cannot achieve this with a low-complexity method, because it requires multi-level coding at the transmitter — you apply an outer channel code on each of the bit levels — and multi-stage decoding at the receiver: you decode the first bit level, subtract it, decode the next one, et cetera. This is not practical. What we do in practice is something else: we do not target the mutual information but a lower bound on it. Here I've rewritten the mutual information between x and y, which is the same as the mutual information between the bit vector b and y, using the chain rule of mutual information. The problematic part is this term here: it corresponds to multi-stage decoding, where you first decode bit level b1 knowing y, then decode the second bit level b2 assuming you know the first, and so on. If you throw this conditioning away — if you decode all of these bits without knowledge of the others — you get this expression here, which is called the bit-metric decoding rate; many people call it the generalized mutual information, and I think there are many names for it. The nice thing is that this rate can actually be achieved with bit-interleaved coded modulation at the transmitter and bit-metric decoding at the receiver, which is the practical scheme used in almost every communication system I'm aware of. The bottom line is that this is the metric we care about, 
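In symbols — my notation, consistent with the talk — the chain rule and the bound look like this; the inequality holds because the bits are i.i.d., so conditioning on other bits can only reduce the uncertainty about b_i given y:

```latex
I(\mathbf{b};\, y) \;=\; \sum_{i=1}^{m} I\bigl(b_i;\, y \,\big|\, b_1, \dots, b_{i-1}\bigr)
\;\;\ge\;\; \sum_{i=1}^{m} I(b_i;\, y) \;=:\; R
```

The left-hand sum is what multi-stage decoding achieves; the right-hand sum R is the bit-metric decoding rate (generalized mutual information), achievable with bit-interleaved coded modulation and bit-metric decoding.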
It's not the mutual information itself. Now, how do we learn this? How does it fit into our autoencoder objective? Let's look at the total binary cross entropy. Imagine we change our receiver: rather than one output vector of M dimensions containing a probability distribution over all messages, we have m outputs, one per bit. For example, with three bits, I could either have eight possible messages and learn a probability distribution over these eight messages, or I could have three outputs corresponding to the three bits. Essentially, my neural network now outputs posterior distributions over the bits — that's what I want to learn. I sum all of these binary cross entropies together; that's the total binary cross entropy. I can rewrite this in the following way: I have a term that corresponds to the entropy of my bit vector — this is just the number of bits, because they are i.i.d. Then I have this bit-metric decoding rate R, which is the thing we care about. And then I have a sum of Kullback-Leibler divergences between the posteriors my neural network gives me and the true ones. So if we minimize this objective function, what we effectively do is maximize the rate R — the entropy term is constant — and minimize the KL divergence to the true posterior. If you have a good model, this KL term will very soon be close to zero, and we effectively implement a maximum likelihood detector. Now I just want to show you the difference when you train on this metric R compared to the mutual information. On the left-hand side, you see what you get with the bitwise mutual information loss, 
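Written out (again in my notation, with P-tilde the network's posterior over bit i), the decomposition of the total binary cross entropy is:

```latex
\mathcal{L} \;=\; \sum_{i=1}^{m} \mathbb{E}_{\mathbf{b},\,y}\!\left[-\log_2 \tilde{P}_{\theta_r}(b_i \mid y)\right]
\;=\; \underbrace{H(\mathbf{b})}_{=\,m} \;-\; R
\;+\; \sum_{i=1}^{m} \mathbb{E}_{y}\!\left[\mathrm{D}_{\mathrm{KL}}\!\left(P(b_i \mid y)\,\big\|\,\tilde{P}_{\theta_r}(b_i \mid y)\right)\right]
```

Since H(b) = m is constant, minimizing the loss simultaneously maximizes the rate R and drives the learned posteriors towards the true ones.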
And on the right-hand side, what you get with the mutual information loss. What I change here is the SNR, so for each signal-to-noise ratio we get a different constellation; let me play this again. You can see that if you train on the mutual information, the constellation doesn't change a lot — it stays rather constant with SNR, you always have this weird point in the middle with zero energy, and you get no labeling. Here, on the other side, you get a perfectly labeled constellation which changes a lot with SNR. I spent some time looking into this, and it actually always learns the optimal labeling of whatever the constellation is. Even at points where clusters are grouped together, you always get what is called a Gray labeling scheme, meaning that two neighboring points differ in only one bit. And this is always the case: even if you go to low SNR, where you would have far more information than you can communicate, it groups points together in such a way that only the one or two bits you can actually communicate are distinguishable — the others you couldn't communicate anyway. The nice thing is that our autoencoder now learns to output posterior probabilities for the bits, which are essentially log-likelihood ratios, and this means we can couple it directly with an outer channel code. The setup now works as follows: I have a bit string, for example a thousand bits. I encode them using, for example, a low-density parity-check encoder, which gives me 2,000 bits. I chop this into many small bit vectors, for example of size four bits, and feed them into my neural network — these all share the same weights, so it's one neural network that maps bit vectors to symbols. I feed this into a channel. 
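The one-bit-difference property these constellations exhibit is exactly that of a Gray code. For a ring of 2^m points, the classic binary-reflected construction can be checked in a few lines (this is the standard construction, not the talk's learned labeling):

```python
# Binary-reflected Gray code: label i -> i ^ (i >> 1). On a PSK-like ring,
# consecutive (neighboring) points then differ in exactly one bit.
def gray(i: int) -> int:
    return i ^ (i >> 1)

labels = [gray(i) for i in range(8)]
# Hamming distance between each pair of ring neighbors (with wrap-around).
hamming = [bin(labels[i] ^ labels[(i + 1) % 8]).count("1") for i in range(8)]
```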
Then I have the same demapper operating in parallel on all of the received symbols, and it outputs log-likelihood ratios which are then fed into a BP decoder. What turned out to be important is to give the demapper access to the signal-to-noise ratio, because computing these LLRs requires the SNR, so it's helpful to feed the SNR as an additional input. We tried this with a standard LDPC code from the Wi-Fi standard — rate 1/2, length 1,296 bits — and these are the coded bit error results we got. In blue, you see a phase-shift keying or QAM modulation scheme; for QAM, the points form a regular grid, uniformly spaced along the real and imaginary axes. In green is what our system learns. First of all, you can see that the higher the constellation order, the more you gain. This is well known: these are essentially shaping gains, because we have an optimized geometric shaping, and that's where the gain comes from. The reason you gain is that the geometry maximizes the bitwise mutual information — that's our objective — and the nice thing is that this translates into real gains. Now you can go one step further, to something called iterative demapping and decoding. The key idea is to apply the turbo principle of soft demapping and channel decoding to our setup. It works as follows: you make an observation y from the channel, and the demapper computes log-likelihood ratios for every bit in the symbol. You do this for all of the symbols that make up a code word and feed the LLRs into a channel decoder, for example belief propagation, for one iteration. This outputs improved LLRs for all of the bits — improved because you now incorporate the knowledge from the parity checks of your code. 
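Why the demapper needs the SNR is easiest to see for BPSK over real AWGN, where the exact LLR has a closed form that scales with the noise variance (a sketch of mine; the talk's system learns this mapping instead of computing it):

```python
import numpy as np

rng = np.random.default_rng(5)

# BPSK over real AWGN: x = 1 - 2b, y = x + n, n ~ N(0, sigma^2).
# The exact LLR is log p(y|b=0)/p(y|b=1) = 2*y/sigma^2 -- it depends on the
# noise variance, which is why the learned demapper gets the SNR as input.
def bpsk_llr(y: np.ndarray, snr_db: float) -> np.ndarray:
    sigma2 = 10 ** (-snr_db / 10)
    return 2.0 * y / sigma2

snr_db = 6.0
b = rng.integers(0, 2, 100_000)
x = 1.0 - 2.0 * b
sigma2 = 10 ** (-snr_db / 10)
y = x + np.sqrt(sigma2) * rng.standard_normal(b.size)
llr = bpsk_llr(y, snr_db)
hard = (llr < 0).astype(int)     # negative LLR -> bit 1 more likely
ber = float(np.mean(hard != b))  # uncoded bit error rate
```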
In the next step, you say, okay, I feed the output back to the demapper. But before doing this, I subtract the information that I've already given to the decoder, so that the only thing that remains is the extrinsic information. Extrinsic means I only care about the information that the channel code has given me, which is information that does not come from the observation. So now I run the demapper again, but with a priori information from the channel code. This information is independent of y; it just comes from the channel code, and it helps me to make a better demapping estimate of the LLRs. Then, before I feed this into the decoder, I again subtract the extrinsic information that came from the channel code, and I do this iteratively. So the demapper and decoder iterate together; this is well known. Now, the problem is that you can only compute these LLRs with a priori information explicitly for quite simple channel models, like an AWGN channel, and even then it is very complex, so the computational cost is high. On top of this, the optimal constellation you want to learn depends on the number of iterations you do. For example, the constellation you would learn for three iterations is different from the one for 15, okay? So what we did is essentially say, okay, let's just replace this part here by a neural network. Then you get a structure that looks like this. This is the transmitter, as before; that's what you learn. Here you have the receiver, and it's now just a neural network which gets the a priori information, that is, the extrinsic information from the decoder, as an additional input. Since it is a neural network, you use the same one everywhere. Now you have this complicated loop with all of these things.
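The extrinsic-information bookkeeping described here can be sketched in a few lines. This is a hedged toy, not the talk's system: a BPSK demapper and a single-parity-check min-sum update stand in for the neural demapper and the LDPC belief-propagation iteration.

```python
import numpy as np

# Toy stand-ins for one round of iterative demapping and decoding.

def demap(y, apriori, noise_var):
    # BPSK demapper (0 -> +1, 1 -> -1): channel LLR plus a priori information.
    return 2.0 * y / noise_var + apriori

def spc_extrinsic(llr):
    """Min-sum extrinsic update of one parity check over all bits."""
    ext = np.empty_like(llr)
    for i in range(len(llr)):
        others = np.delete(llr, i)
        ext[i] = np.prod(np.sign(others)) * np.min(np.abs(others))
    return ext

y = np.array([1.1, -1.2, -0.9, 1.3])       # noisy BPSK, bits [0, 1, 1, 0]
noise_var = 0.5
apriori = np.zeros_like(y)                  # first pass: no code knowledge yet

llr_dem = demap(y, apriori, noise_var)
to_decoder = llr_dem - apriori              # remove what the code already knows
apriori = spc_extrinsic(to_decoder)         # feed back only the code's news
llr_dem = demap(y, apriori, noise_var)      # second pass, now with a priori
# After subtraction, the demapper's extrinsic output is again just the channel
# information, so the loop never double-counts the same evidence:
extrinsic_dem = llr_dem - apriori
```

The point of the two subtractions is visible in the last line: what circulates between the blocks is only the new information each block produced.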
Now, in order to train this, you need to unfold this algorithm, okay? For example, here we have an example with three iterations. In the first one, I just get the channel outputs, and the a priori information is zero. I compute the LLRs and feed them into the belief propagation decoder; I just do one iteration here. Then I subtract what I fed into the BP decoder to compute the extrinsic information. This is what I now feed as a priori information to the demapper, together with the channel outputs. Then I have this loop here which again subtracts what I fed into the demapper from its output, so that only the extrinsic information from the demapper remains, and I do this again and again. The nice thing about unfolding this algorithm is, first of all, that these are all shared weights, so the memory requirement is super low; there are just a few weights you actually train. And you get something which is called a residual neural network architecture: you see all of these shortcut connections here. Normally, when you train a very deep neural network, you have the problem of vanishing gradients; if it's very deep, the lower layers essentially never get trained. But thanks to these shortcut connections, which appear here totally naturally, you can train this super deep. We've trained this with up to 100 iterations and it still converges nicely. Another nice property is that since I reuse the same components, I can train it with, for example, three iterations and then let it run for 15. It's totally scalable to any number of iterations I want. Now, the performance results we get are here. In blue, this is our baseline without iterative detection and decoding; these are just different modulation orders.
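The unfolding with shared weights can be sketched as follows. Again a hedged toy under the same stand-in assumptions (a one-parameter "demapper" and a single-parity-check "decoder" instead of the talk's neural network and LDPC BP iteration), showing the property claimed above: because every unrolled step reuses the same parameters, a receiver trained at three iterations can simply be run for fifteen.

```python
import numpy as np

def spc_extrinsic(llr):
    """Min-sum extrinsic update of one parity check over all bits."""
    ext = np.empty_like(llr)
    for i in range(len(llr)):
        others = np.delete(llr, i)
        ext[i] = np.prod(np.sign(others)) * np.min(np.abs(others))
    return ext

def unrolled_receiver(y, theta, n_iters):
    """Unfold n_iters demap/decode steps; `theta` is shared by every step."""
    apriori = np.zeros_like(y)
    for _ in range(n_iters):                # same theta at every iteration
        llr = theta * y + apriori           # toy demapper with a priori input
        to_dec = llr - apriori              # shortcut: keep extrinsic part only
        apriori = spc_extrinsic(to_dec)     # decoder's extrinsic feedback
    return to_dec + apriori                 # final a posteriori LLRs

# The same shared-weight component runs for any depth:
y = np.array([1.1, -1.2, -0.9, 1.3])        # noisy BPSK, bits [0, 1, 1, 0]
llr3 = unrolled_receiver(y, theta=4.0, n_iters=3)
llr15 = unrolled_receiver(y, theta=4.0, n_iters=15)
```

The subtractions that carry `y`'s information forward unchanged are exactly the shortcut connections that make the unrolled graph residual.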
When you apply iterative detection and decoding to an existing modulation scheme like PSK or QAM, you of course gain a tiny bit. But what you can see is that training the autoencoder, so optimizing the geometry jointly with the iterative demapping and decoding, gives you very substantial gains; here we are even more than a dB better. These are the results we got on an AWGN channel. Now the interesting part is of course to do this not on AWGN models but on a real system. So we did this; I showed this video where you have these two software-defined radios. We essentially do the same thing again, but now we train the system on this new metric, using iterative demapping and decoding, through a gradient estimate of the channel. We didn't attempt to learn channel equalization; we used a hand-crafted linear minimum mean squared error equalizer. We used OFDM with 64 subcarriers. And there's one more thing we did: once the system was trained, we also optimized the code for the learned demapper. Because when you take a code from a standard, the Wi-Fi code for example, it is typically optimized for an AWGN channel, so it is not an optimal code to use with our learned system. So on top of everything, once training was done, we also optimized the code. And that's what we got. In green, this would be Wi-Fi, what you get with the state of the art; we put love into this baseline. Now, when you start optimizing just the constellation geometry of the system, you get this purple line here, so we gain roughly one dB. And when we then optimize the code on top of this, we get another half dB of gain, so in total we're roughly 1.5 dB better. We tried to understand where these gains actually come from, and it turns out that the biggest gain comes from just the learning of the demapper. Because if you apply a demapper which thinks that the system is an AWGN channel, which is what you typically do, then you get this curve here.
This would be the autoencoder where we optimize the constellation but assume that the channel is Gaussian; then you lose roughly half a dB. So the gain comes from having the demapper properly matched to the actual channel, and on top of this from the imperfect equalization: because we have imperfect channel state information, we actually see a channel which is very different from an AWGN channel. This is where I think the biggest part of the gain comes from. Just before finishing, I wanted to say that we can go even one step further. Until now, we have only looked into optimizing where to place these points, optimizing constellations. In other words, we have looked at the mutual information and maximized it with respect to the parameters of the transmitter, okay? This is what is called geometric shaping: where to place the constellation points. But you have one more degree of freedom: you can also play around with the distribution over these points. If you want to maximize your mutual information, it turns out that it's suboptimal to have all of these constellation points appear with the same probability; you would like to have a non-uniform distribution. I call this p(x), and this is what is called probabilistic shaping: with which probability should each symbol be sent? And now the question is, can you jointly optimize probabilistic and geometric shaping, and also the bit labeling, everything together? The answer is yes, you can do this, but I won't go into detail, so let me show you one video. Here you can see the constellation as a function of the SNR for 256 points. The diameter of each point corresponds to its probability, and you see that at very low SNR your constellation essentially breathes. You have many points with a large energy which appear very rarely.
This means they carry a lot of information, and that's why you'd like to give them a lot of energy, while there's a huge bulk in the middle which appears often but carries only very little information. At very high SNR, you essentially have a uniform distribution over messages; that's what you would expect. Now, on an AWGN channel it is actually known that the optimal distribution is a Boltzmann distribution, and we converge exactly to this. But of course our goal is not to do this for the AWGN channel, but for essentially anything you want. So now we have a system that can learn probabilistic and geometric shaping out of the box over any channel. For this reason, I believe that we are now actually able to learn a full physical layer which is optimized for a specific link or channel model. This is totally automated and doesn't require any human intervention: you do code design, all of the shaping, everything. People are now also looking into learning new codes. So one could make the argument that rather than doing a one-size-fits-all approach, where we do measurements, come up with a channel model, and then a bunch of people agree that this is roughly how we should standardize a system, one could maybe argue that 6G could be a system that allows itself to be optimized by machine learning. What you would need to standardize is actually a protocol for message exchange, together with a fallback solution. You essentially say, okay, I have a system where in one frequency band I use, for example, the Wi-Fi standard, just to communicate. But then you define a protocol that allows message exchanges of gradients, of loss functions, et cetera. And once you've optimized whatever is going on in a different frequency band, you just take over. By doing this, you would have a fully bespoke physical layer.
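The Boltzmann-family distribution mentioned here can be sketched numerically: p(x) ∝ exp(−λ|x|²), with λ chosen to meet an average-energy constraint. The 8-ASK points, the target energy, and the bisection bracket below are assumptions for illustration; the talk's system learns p(x) end-to-end rather than computing it this way.

```python
import numpy as np

def boltzmann(points, lam):
    """p(x) proportional to exp(-lam * |x|^2)."""
    w = np.exp(-lam * np.abs(points) ** 2)
    return w / w.sum()

def lam_for_energy(points, target_energy, lo=0.0, hi=10.0, iters=60):
    """Bisect lam so that the average symbol energy hits the target."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        p = boltzmann(points, mid)
        energy = float((p * np.abs(points) ** 2).sum())
        if energy > target_energy:   # too much energy -> need larger lam
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

points = np.arange(-7.0, 8.0, 2.0)     # 8-ASK amplitudes; energy 21 if uniform
lam = lam_for_energy(points, target_energy=10.0)
p = boltzmann(points, lam)             # frequent low-energy, rare high-energy points
```

This reproduces the qualitative picture from the video: under an energy constraint, outer high-energy points are sent rarely and the bulk in the middle often, and as the constraint relaxes (λ → 0) the distribution flattens toward uniform.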
I find this exciting, and I honestly think this is where we are heading. The next frontier would be to extend this idea to higher-layer protocols: rather than just learning a bespoke physical layer, can you learn a bespoke MAC layer, for example? That's what I'm currently interested in. That's the end of my talk, so thanks.

If you learn on one particular channel, I mean one physical cable for example, and you change the cable for something of the same kind but not exactly the same, is it going to work?

So the thing is, of course, if you train on one specific realization, if your channel is a fixed complex function, and you then apply it to something else, you will lose performance; you would need to retrain. In wireless, I don't think it's super practical to do this, unless you do fixed wireless: think about backhaul or fronthaul connections where you have a line-of-sight link which won't change; there you can do it. But what we typically do is optimize over a distribution of channels. During training you have random realizations of your channel, and by doing this your system is actually able to generalize; you don't want it to learn to be too specific. In optics, I think, and many people working in optics know this better than me, there might be an interest in optimizing for one specific fiber: this is the one thing I'm going to optimize for. But in wireless, I think it's better to optimize over a distribution. And then what's nice, for example, and I haven't shown this, is that your system will actually learn to transmit whatever it needs to equalize the channel.
Essentially, if I train this on a Rayleigh fading channel model, where you multiply x by a random complex number, it will actually learn to superimpose a mean value, which is constant, on what you transmit. And if you transmit something which has a constant mean, your receiver can learn it; this is like a superimposed pilot tone. So your system will learn automatically to equalize and adapt to any type of channel realization. But there's a fine trade-off that really depends on the use case.

Hi, my question is about the choice of the number of layers. When you talked about the decoder, you showed three and you said you...

No, that was just so that it fits on a slide.

Well, you said you extended it to 100 and it's converging nicely. So obviously fewer layers is faster, more is more accurate. Can you comment on that?

So the thing is, it all depends on the complexity. I asked our business division: what do we do in practice? They maybe use 15 iterations. I used a flooding schedule here, but they essentially just update one variable node at a time. And the nice thing, where I think machine learning really makes sense, is this: in optimization or in traditional signal processing, you would write an algorithm and say, iterate until something is smaller than epsilon. You don't care about the number of iterations; you have an algorithm that provably converges, but in practice, of course, you just stop after 15 iterations and cross your fingers that it is not too bad. What you do with machine learning, in some sense, is fix the complexity upfront, say 15 iterations or so many multiplication operations, and then try to do the best you can within this allowable complexity.
So I think from a hardware implementation perspective, this makes much more sense: you start from a compute-constrained model and try to do the best you can. That is typically what decides the model complexity. And most of the time, and I think this is often overlooked, to be competitive with an expert baseline, you need to put a lot of expert knowledge into your model. If you try to just learn everything, you will end up with a model which is multiple orders of magnitude more complex than a human-designed baseline into which everything you knew has already been injected.

In a certain setup, this transmission problem relates to the sphere packing problem that has been studied by mathematicians, and I was wondering if deep neural networks helped to get some better configurations of balls in high dimensions?

Yeah, so we tried this, for example, learning 256 points in eight dimensions. I thought no one knew what the optimal sphere packing is in that dimension; apparently someone proved it quite recently, but at the time I looked into it, it wasn't known to me. The thing is, you come close to something like this, but not exactly. That's all I can say.

In higher dimensions?

We tried this in 12 dimensions; this is where we're currently looking. Apparently it's interesting in optics. It seems that we are better than heuristic or alternative methods, but we just started to look at this. The really interesting part is in a very large number of dimensions.
But there's this other thing: I'm not 100% certain that the optimal sphere packing is also the best for the bitwise mutual information. I think in many cases it's not; it depends on the channel. For the channels I'm interested in, it's not. And I think it's not the dimensionality that is the problem, it's the number of points. We tried once with 16,000 points, I think in four dimensions, and this was a disaster. It then really depends on the initialization: we tried thousands of seeds, and each time you get stuck in some other local minimum. For the small problems, I've never witnessed that we got stuck; we never needed to run multiple seeds, and we always seemed to converge close to at least a local minimum which is not too bad. But in these very high dimensions, it seems to be really dependent on the seed. I'm not a theorist, I just apply things, so I don't have a mathematical interpretation of what's going on.

Thank you. Okay, then I think everyone is hungry, right? I typically eat at 11:45, so I'm sorry. You are hungry; I'm hungry.