So let's just start, because I heard we have a tight schedule. My name is Jan, and I'm here to present some research I did for the German Aerospace Center (DLR) on generic polyphase filter banks on a GPU with CUDA. I work at DLR in the Institute of Communications and Navigation, in the Satellite Networks department, and I mostly do software defined radio.

A short outline of the talk: we start with the mandatory motivation, then I want to give a short introduction to CUDA, a short introduction to polyphase filter banks, what they are and why they are so cool (or not so cool), then the translation of the polyphase filter bank from its DSP description to CUDA, then some results, and finally the release plans for an open source library that actually does all of this.

So, motivation. Once upon a time there was a space project I was working on. It was a multi-frequency random access scheme, so we had some number of carriers, in this project 15, 30 or 45 of them, and on the receiving end we had to separate them somehow. I remembered Tom's talk in Karlsruhe from years ago, like, hey, let's do the PFB thing, that sounded cool. So I did some calculations, and the problem with 45 carriers is that if you have all of them at once, you have 45 times the channel bandwidth, so that can be a lot. The restrictions on the actual channels were also quite tight: we only had 12 to 15 percent guard band between the information signal and the next channel, so the filters get quite steep. We also needed oversampling; the slide says at least three times oversampling is needed, which is where we are right now, but at that time we had to do at least eight times oversampling. So it's quite a lot, and when we did the filter design we came up with up to 1500-tap filters. If anybody knows anything about FIR filters, that's a huge filter.

So I did some tests and wrote a generic CPU implementation of it. (Yeah, that was me when I figured out what I had to do.) I used a 1000-tap filter with 35 dB rejection, which might be okay, might not be okay, and the original nine times oversampling, and I came up with two megasamples per second. Well, I needed four, so it was not fast enough. I looked into optimizing it even more for x86 processors, but the filtering was already maxed out optimization-wise. I could have optimized the FFT somehow, because I could not use FFTW, which kind of sucked, but that would have taken way too much time. So I thought, let's try it on CUDA, I had some experience with it.

So CUDA, what is it? CUDA is an NVIDIA framework for general purpose programming on graphics processing units. It is mostly used, I think, for scientific computing, where you have a huge problem that you can parallelize a lot and you don't want your simulations to take months or years. It uses the massive number of available compute cores inside the GPU; for example, the GPU we used has about 2,000 CUDA cores, which is way more than any x86 CPU has.

So how is a GPU built? A GPU, for NVIDIA in this case, has several streaming multiprocessors, which you can think of as something like the CPU in your normal PC, but you have several of them.
In our case it was seven. I'm not sure how far they go up, but you get one streaming multiprocessor in a low-end graphics card, and the high-end ones have seven or eight or something like that. Each of these streaming multiprocessors has lots of cores: instead of the four cores of your normal PC CPU, each multiprocessor of the GPU we used had 192 cores, if I'm correct, so also a lot more than your normal x86 CPU.

The execution model they use is a single instruction, multiple threads (SIMT) structure: basically, the hardware takes one instruction, broadcasts it over several threads, and all those threads execute the same instruction at the same time.

You also have different kinds of memory on your GPU, and it is really important to know what they do and how they behave. You have the global memory, which is what they sell you on the box: you get the new GTX 970 with four gigabytes of RAM, which is actually just 3.5 gigabytes. This is really slow; it's way faster than DDR3 or something like that, but for a GPU it is slow. Then you have the on-chip shared memory, which is tied to a specific streaming multiprocessor, so all the cores in that multiprocessor share it; it is way faster than the GDDR5 RAM. And then of course you have registers, which, like in any CPU architecture, are blazingly fast and way faster than anything else.

CUDA tries to use that architecture as well as possible. The first thing it does is build up a grid, and I'll just go to the next slide: this is how CUDA structures the threads it will run to match the graphics card architecture. You have this grid, which you define in your program. It can be up to three dimensional, or two or one dimensional, that's up to you; for visualization I took the two dimensional case, so we have an X dimension and a Y dimension. Each position in this matrix, you could call it, consists of a thread block. A thread block is just a collection of threads that will run on one streaming multiprocessor, and inside each of these thread blocks you have the individual threads that execute the instructions you want them to execute. Each block has a unique ID inside the grid, so you can address block 22 or whatever two dimensional ID you give it, and each thread also has a unique ID within its thread block. So if you know the ID of your thread block and the local ID of your thread inside the thread block, you can identify a thread inside your program.
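To make that indexing concrete, here is a minimal sketch (my own example, not from the talk) of how a kernel combines the built-in block and thread IDs into one global index; the kernel name and sizes are made up:

```cuda
#include <cstdio>

// Each thread computes its own global index from the built-in IDs
// described above and scales one element of the array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // block ID + local thread ID
    if (i < n)               // guard: the grid may be larger than the data
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // One-dimensional grid: enough 256-thread blocks to cover n elements.
    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    scale<<<blocks, threads_per_block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```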
What the thread scheduler now does when it executes those threads is collect the thread blocks, assign each one to a streaming multiprocessor, and then further group the threads inside into bundles of 32 threads, called a warp. These 32 threads are the ones that are actually executed simultaneously, and you have to be a bit careful, we will come to that later, about how you do calculations inside those warps, because the interaction between them is kind of special.

There are some performance bottlenecks you can run into very easily: you put your program onto the GPU and you think, well, it's not really fast. These are the main problems.

The first one is global memory. It is not really fast, so you want to minimize its use in any case. But you do have to use it at some point, and when you do, you have to make sure that all your reads from and writes to that memory are coalesced. That means consecutive threads access consecutive memory, because then the architecture can load just one cache line for all the threads that need data and distribute the contents of that cache line to all of them: one load serves several threads. If you cross a cache line boundary, it has to load two cache lines, which might be far more loads than your instruction actually needs, and then you are just wasting bandwidth. So be careful about that (there is a small sketch of this after this list).

If you are using shared memory, you have a similar problem. Shared memory can be accessed in parallel if you access several banks in parallel, and the way the graphics card structures this memory is that consecutive 32-bit words live in different banks. So if consecutive threads each access a consecutive 32-bit word, the hardware can again serve all of them in one go. If two threads access the same memory bank, you get two loads and you are wasting memory bandwidth.

Then branching is also a difficult topic, because this is a single instruction, multiple thread architecture: all the threads inside a warp have to execute the same instruction. If you have a branch, one thread inside a warp may need a different instruction than the others, and then the branches have to be executed serially, because otherwise which instruction would the warp execute? So these are the three main performance killers that come up when you do GPU programming, and if you are wondering why your code does not run as fast as you think it should, go through this list.
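Here is the promised sketch (my illustration, not from the talk) of the coalescing rule: in the first kernel, consecutive threads touch consecutive words, so a warp's 32 accesses collapse into very few memory transactions; in the second, a stride spreads those same 32 accesses over many cache lines.

```cuda
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];        // thread k touches word k: coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];        // thread k touches word k*stride: each
                               // access may need its own cache line
}
```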
So now, polyphase filter banks: what are they? Basically, polyphase filter banks are used, for example, to reduce the computational complexity of resampling filters in general, whether you decimate or interpolate. With a scheme like the one in our project, where you have separate channels, you can separate one channel with a polyphase filter bank, but you can also separate all channels at once in one go, which is the cool thing about the polyphase filter bank channelizer. You can also take several separated information signals and distribute all of them into a wider spectrum; that is what the synthesizer does, which we have also implemented. But in this talk I am going to concentrate on the channelizer, because there is already enough to cover there.

So what would you normally do to extract a channel that has one Nth of the total bandwidth of the signal you recorded? You could first mix your signal to baseband and then run a low-pass FIR filter over it, because you have to get rid of all the aliasing before you downsample. You can of course also switch these around: first do band-pass filtering, then the down-conversion. And then you downsample the signal. The problem, as I said, is that if you have a small channel compared to the overall bandwidth, your filters get very steep and very computationally heavy. Polyphase filter banks help in that respect.

What the polyphase filter bank does is take your filter taps, your filter impulse response, and split them into the N different phase parts of the filter, which I will show you in a second. This is a tap representation of your filter: you have 16 taps, from tap 0 to tap 15, and normally you would just shift your signal through them serially in one go and get the output. The PFB now splits these taps: the first phase part is the red taps, the blue ones are the second phase part, and so on. You can then reorder those taps and basically build new filters, which are now four-tap filters. In total you still have 16 taps, but now as four filters of four taps each. This helps because of how the computational complexity of such a resampling filter grows: it goes with n squared, so now we basically have (n/4) squared times 4, which is way better than the usual filter complexity; we have divided the computational cost by four.

What we do now is take samples from the source: the first sample goes to the first filter, the second sample goes to the second filter, and so on, switching through all the filters; when we are done, we start at the top again. A simple way to think about it, or how I always visualize it: if you downsample a signal, what you actually do is compute all 16 output samples and then throw 12 of them away. Mapped to the filter operation, it is like taking a snapshot of your filter every four computations, and that is actually what it does: in the end everything gets added together and you already have your resampling filter.

So if you now want to extract a channel, you basically have to know where your channel is and then down-convert it by multiplying with a complex sinusoid. If you do an FFT after the filtering instead, you get all the channels at once: you do just one filtering, one FFT, and you get all the channels you want. You can also do some different things, like oversampling the output of the channelizer, which basically means manipulating the commutator: if you want to oversample by a factor of two, for example, you don't shift with every input; the first sample goes to t0 and then gets shifted to t2 instead of t4. And if you want to synthesize instead of channelize, you basically do the same operations in reverse: first the FFT, then the filtering, and so on.
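To make the tap reordering concrete, here is a minimal host-side sketch (my own illustration, not the talk's library code) that splits a 16-tap prototype filter into four polyphase branches; branch p simply takes every fourth tap, starting at tap p:

```cuda
#include <cstdio>

int main()
{
    const int n_taps = 16, n_branches = 4, taps_per_branch = n_taps / n_branches;
    float h[n_taps];
    for (int i = 0; i < n_taps; i++)
        h[i] = (float)i;             // stand-in values for real coefficients

    // branch 0 gets h[0], h[4], h[8],  h[12]  (the "red" taps),
    // branch 1 gets h[1], h[5], h[9],  h[13]  (the "blue" taps), and so on.
    float branch[n_branches][taps_per_branch];
    for (int p = 0; p < n_branches; p++)
        for (int k = 0; k < taps_per_branch; k++)
            branch[p][k] = h[k * n_branches + p];

    for (int p = 0; p < n_branches; p++) {
        printf("branch %d taps:", p);
        for (int k = 0; k < taps_per_branch; k++)
            printf(" h[%2d]", k * n_branches + p);
        printf("\n");
    }
    return 0;
}
```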
So how can you translate that into CUDA? If you are familiar with how CUDA works, you want basically all your memory accesses to be consecutive, so what you mostly end up doing is shuffling. The channelizer, as we saw, consists of four operations: we have to shuffle the input stream from a serial stream into this parallel stream, we have to do the polyphase filtering, we have to do the FFT, and after the FFT we have to parallel-to-serialize the signal again.

Input shuffling: actually, I don't like this slide, and I don't like how I implemented it in my library, so I took the easy road. When you shuffle data, either your reads are coalesced or your writes are coalesced; one of the memory accesses will be uncoalesced, there is nothing you can do. So I decided to coalesce my reads, because that was somehow more logical to me, and to let the writes scatter. The problem with this is that if you want the reads to be coalesced, the X dimension of your thread blocks is the number of channels, and if you remember, the scheduler tries to bundle 32 threads into a warp, starting along the X, that is, the first dimension of the thread block. If you have fewer than 32 channels, you will not fill a whole warp. In my case, with 45 channels, it probably didn't make that much of a difference, but with fewer than 32 channels it might. So I am going to revisit that and see whether it stays like this.

The filter operation: we have a two-dimensional grid and two-dimensional thread blocks. The first dimension of a block processes several input samples at once, because we want to max out the GPU; if we only calculated one output sample, we would still have a lot of threads idle. The second dimension, the Y dimension, takes care of oversampling the output. The first dimension of the grid basically represents how many channels I want to have, and the second dimension of the grid is just there so that, if we don't have enough threads executing already, it can provide additional concurrency.

One thing I want to show you is how the code for that looks. The actual filtering is, wait, yes, the actual filtering is just these lines; this part here, sorry, no laser pointer, is the shuffling of the memory, and all the rest is just finding out in which thread I am actually running. That is basically what you do inside your CUDA code: you try to find out where the hell you are.

A couple of other things about the filter operation: we go through shared memory, and we try to avoid bank conflicts, which is not that easy, because we have complex samples, which are 64 bits, not the 32-bit words that the shared memory loads would want. As far as the compiler output is concerned, we have no register or shared memory spills, which is always good. The FFT is just cuFFT, which is provided by CUDA; it is fast, it is convenient, so why not use it. The output shuffling, because I was actually lazy, is implemented on the host CPU for now. This might change as well; I don't expect much of a performance boost, maybe one percent or some single-digit percentage, but it would be nice to have everything on the GPU. For me, at the time, it was plenty fast enough. (A much-simplified sketch of such a filtering kernel follows below.)
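Here is that sketch: a heavily simplified polyphase filtering kernel, my own illustration rather than the actual library kernel; the names, data layout, and block mapping are all assumptions. One thread computes one branch output for one output time index.

```cuda
#include <cuComplex.h>

__global__ void pfb_filter(const cuFloatComplex *in,  // shuffled input, one row per branch
                           const float *taps,         // reordered taps, one row per branch
                           cuFloatComplex *out,       // FFT input: n_branches values per output sample
                           int n_branches, int taps_per_branch, int n_out)
{
    int branch = blockIdx.x;                          // grid x: which branch/channel row
    int t = blockIdx.y * blockDim.x + threadIdx.x;    // which output time index
    if (branch >= n_branches || t >= n_out)
        return;

    // Dot product of this branch's taps with its (decimated) input history.
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    const cuFloatComplex *x = in + branch * (n_out + taps_per_branch - 1);
    for (int k = 0; k < taps_per_branch; k++) {
        float h = taps[branch * taps_per_branch + k];
        acc.x += x[t + k].x * h;
        acc.y += x[t + k].y * h;
    }

    // Store results sample-major so an FFT (e.g. cuFFT) can transform each
    // n_branches-long row into the channel outputs.
    out[t * n_branches + branch] = acc;
}
```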
So, some results. This is a 32-channel separation with no oversampling; the prototype filter was a 437-tap filter, so rather small. Here we benchmarked a GTX 970 against the trusty old GNU Radio polyphase filter bank, and actually the GNU Radio polyphase filter bank is doing very well; for a GPP implementation it is really remarkable. We got to 160 megasamples per second with the CUDA implementation, which is roughly four times as fast as the CPU version.

This is the actual filter we use in the project, the 45-channel case: three times oversampling, 45 channels, 1501 taps. This time we have three GPUs; there is also the GT 720M, which is a really low-end laptop GPU with just one streaming multiprocessor. You see we still get above 100 megasamples per second with the CUDA implementation on the GTX 970. The GT 720M is far worse, at 14 megasamples per second, which would still be okay in our use case, and the CPU version is actually not that much slower than the GT 720M.

Martin also said I should tell you something about the release plans and open source strategies at DLR. Well, to be sure, there is no open source release strategy at DLR, at least no global one. If you work at DLR and you think, I want to release something, you go to your superior and ask if you can do it, then you go to whoever is supervising the project and ask if you can do it, and then you just put it on GitHub or wherever. Maybe, if it is something for the military, you have to go through export control, but in principle it is your own personal choice.

So it is still not released; we still have some bureaucratic hurdles inside DLR, because we do not have a formal way to do this, and it still depends on some project code that I want to get rid of. But I can tell you the license: we decided it will be LGPLv3, and it will be on GitHub, probably in a group named after our department (Communications and Navigation, Satellite Networks). If you want news on when it is released, the best thing is probably to check my GitHub, because I don't think there will be a German Aerospace Center announcement about it. So that's it, thank you for listening. We have time for questions.

[Audience question about the longest filter tested.] Sorry, the biggest? The biggest I tested was roughly 1700 taps, and it ran at, I think, 80 megasamples per second. The limitation for the filter length, I think, is the memory where you store constant stuff like the taps; if that memory runs out, you might be in performance trouble, but I have not come to that point yet, so I think there is plenty of room for even longer filters.

[Audience:] There have obviously been several efforts and example implementations of GPU offloading from within GNU Radio, something we honestly want to get into mainline as soon as possible. The major point of concern is always that GPUs are super high bandwidth but also high latency. Based on the work you have done, what is your opinion on how feasible it would be to basically drop a GPU block into a GNU Radio flowgraph, and how would that impact the way your flowgraph operates in terms of streaming performance?

Streaming performance, whew, that's a good question. There is also another problem with most of these DSP problems on a GPU: you first have to buffer a buttload of samples to get to the parallelism that you can actually exploit on a GPU.
So I would guess that the latency from buffering those samples would be higher than the latency of getting them to the GPU. Also, this 180 megasamples, okay, that is throughput, not latency, and it is measured including the data copy from host to GPU. [Moderator: we have to stop.] Yeah, okay then, sorry, but I'll be hanging around. [Moderator: so this is the most important thing you're gonna hear all day.]