Okay, so shall we start? Yes. So, hey, I'm Jan, and I'm quickly going to tell you about stuff I worked on at the German Aerospace Center (DLR) in Oberpfaffenhofen, which is CAFE: the co-processor accelerated filter bank extension library. Maybe some of you know me from last year and from three years before, where I also presented about doing software defined radio stuff on the GPU.

So, CAFE, what is it? A bit of history. Last year I presented a polyphase filter bank channelizer that runs with CUDA on NVIDIA GPUs. The reason I did that is that I was in a project doing random access over DVB-RCS2, which is a multi-frequency random access system. We have 45 channels, and at the receiver we have to separate them somehow, which turned out to be a pretty big task. As you see, we need a 1550-tap filter, because we have four megasamples per second, that is, 4 MHz of bandwidth, 45 channels, and the passband between the filters is about 13 kHz, which is nothing in 4 MHz. So a 1550-tap filter was needed, and we had to run it at four megasamples per second; otherwise it wouldn't be real time.

I first tried what's in GNU Radio, because I knew they have a super-optimized polyphase filter bank channelizer for the CPU. But unfortunately, with the 1550-tap filter, and since we also needed some extra oversampling, it didn't work: we were able to run one to two megasamples per second, but not more. So we went to the GPU. Fortunately we have CUDA and OpenCL, and fortunately polyphase filter banks are really suitable to run on the GPU. I presented that last year. The filter bank can do some other things too: it can oversample the output by factors that are integer divisors of the number of filters. For example, if we have 45 channels, we can oversample each channel by a factor of three, which would be 45 divided by 15.
With this, we were able to achieve 100 megasamples per second, which is a serious improvement over one to two megasamples per second. In addition, back when we did it last year, I was super lazy and we did the FFT, including the reshuffling of its output, on the CPU. Now we do it with cuFFT, because that's actually much faster and simpler. It's just one line of code, but back then I was too lazy to even research that. So maybe we're even faster now; I don't know, but we should be, because cuFFT should be faster than what I hacked up on the CPU.

But then we had another problem. We now had all the channels, and we had an external company write some pretty awesome DVB-RCS2 blocks for us, and their timing synchronization needed a four-times oversampling factor. Okay, so we can do some oversampling or resampling with the filter bank channelizer, but as you heard before, we are pretty much limited to oversampling factors tied to the number of filters in the bank. So we could get to 4.2666 times oversampling, which is close, but not close enough. We needed an arbitrary resampler, and since we had to resample 45 channels in parallel, we just stuck it on the GPU again. I didn't even bother measuring what a CPU resampler would do, because it probably would have been super slow.

So, how does a resampler work? Basically, you have your polyphase filter bank, which kind of upsamples the signal: if you have, say, 32 filters in the bank, it upsamples the signal by 32. Then you downsample again by an appropriate step length that gets you close enough to the needed sample rate. The cool thing about that is that you can skip most of the upsampling work, because you simply never compute the output of the filters you skip, so that's pretty neat.
But you don't want to just skip filters and call that close enough to the target rate. So you also do interpolation, which I think is minimum mean-square-error interpolation; I'm not 100% sure, I just looked into the book by harris and did what he did. You have two filters that filter the signal, and then you interpolate between the outputs of both filters to hit the resampling time more accurately.

So, how do you do this? Again, you work with the polyphase filter bank; if you want to know more about that, check my previous talk, because otherwise this gets too lengthy. You have the taps of the filter, designed as you would design one general FIR filter. Then you add the second filter, the differential filter: for each tap you take the difference between the next tap and the current tap, and you zero out the last tap because there is no following tap to compute the difference with. That's pretty much it for how to design the differential filter. Then you do your normal polyphase filter bank partitioning: the first filter gets the zeroth phase, the next one the first phase, and so on.

Good. But how do you then actually do the resampling? You have the interpolation rate, which is how much you upsample; let's keep it like that. You have the decimation rate, which is basically how many filters you skip. Then you have the floating rate, which is the difference between your integer resampling, the one you would get by just upsampling and downsampling by integer factors, and the actual rate you want to achieve. And then you have something like an accumulated rate, or, maybe more appropriately, an accumulated step size, which makes finding the right filter more accurate. So, how are these calculated?
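As a sketch of the tap construction just described (pure Python; the function name and the bank layout are my own, not taken from the CAFE code):

```python
def polyphase_resampler_taps(proto_taps, n_filters):
    """Split a prototype FIR filter into a polyphase bank plus the
    differential bank used to interpolate between adjacent phases."""
    taps = list(proto_taps)
    # Differential filter: next tap minus current tap; the last tap
    # is zeroed because it has no successor.
    diff = [taps[i + 1] - taps[i] for i in range(len(taps) - 1)] + [0.0]
    # Pad to a multiple of n_filters, then phase p gets taps
    # p, p + n_filters, p + 2*n_filters, ...
    pad = (-len(taps)) % n_filters
    taps += [0.0] * pad
    diff += [0.0] * pad
    bank = [taps[p::n_filters] for p in range(n_filters)]
    dbank = [diff[p::n_filters] for p in range(n_filters)]
    return bank, dbank

bank, dbank = polyphase_resampler_taps([1.0, 2.0, 4.0, 7.0, 11.0, 16.0], 3)
# bank  == [[1.0, 7.0], [2.0, 11.0], [4.0, 16.0]]
# dbank == [[1.0, 4.0], [2.0, 5.0], [3.0, 0.0]]
```

The toy prototype makes the partitioning easy to see: phase 0 gets taps 1 and 7, and the differential bank holds the tap-to-tap differences with the final tap zeroed.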
The interpolation rate is just the number of filters you have in your bank; pretty easy. The decimation rate you calculate by dividing the interpolation rate by the rate you want, and then taking the floor, because you need an integer. The floating rate is then, of course, just the difference between the real decimation rate you would need and the integer decimation rate you can actually achieve. The accumulated rate is handled in two steps; we'll see later why. First, you update the accumulated rate by adding the floating rate to it: you start at zero, and every time you filter your signal with a filter, you add the floating rate. This determines the exact filter you are using. After you have filtered the signal, you do a modulo one, so the accumulated rate always stays between zero and one. Because all you really want to know is: is the next filter closer to the actual sampling rate than the integer step, or not? That's how you determine it.

So, how do you do the filter skips? You calculate the output of your filter at some index. Usually you start at filter zero, or you can also start in the middle of the filter bank, which changes the phase a bit, but usually you can get away with zero. Then you do the interpolation: you take the output of the normal filter and add the output of the differential filter multiplied by the accumulated rate. Then you update the accumulated rate accordingly, by adding the floating rate. Then you update the filter index for the next output you want to calculate. And then you do the mod one on the accumulated rate, to always keep it between zero and one.
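Putting these update rules together, including the input advance that comes next, here is a pure-Python sketch in the style of GNU Radio's arbitrary resampler. The structure follows the talk; the variable names and the exact indexing convention are my assumptions:

```python
import math

def resample(x, bank, dbank, rate):
    """Arbitrary-rate polyphase resampler, host-side reference sketch.

    x     : input samples (enough history assumed)
    bank  : polyphase filter bank, bank[p] = taps of phase p
    dbank : differential bank, for interpolating between phases
    rate  : desired output/input sample-rate ratio
    """
    n_filters = len(bank)                     # interpolation rate
    dec_rate = int(math.floor(n_filters / rate))
    flt_rate = n_filters / rate - dec_rate    # fractional part of the step
    taps_per_arm = len(bank[0])

    acc = 0.0        # accumulated rate, kept in [0, 1)
    filt = 0         # current filter (phase) index
    pos = 0          # position of the oldest input sample in the window
    out = []
    while pos + taps_per_arm <= len(x):
        window = x[pos:pos + taps_per_arm]
        y = sum(t * s for t, s in zip(bank[filt], window))
        dy = sum(t * s for t, s in zip(dbank[filt], window))
        out.append(y + acc * dy)              # interpolate between phases
        acc += flt_rate                       # advance the fractional step
        filt += dec_rate + int(math.floor(acc))
        acc %= 1.0                            # keep it between 0 and 1
        pos += filt // n_filters              # input samples consumed
        filt %= n_filters
    return out
```

With a trivial bank of four single-tap unity filters and rate 2, each input sample comes out twice, which is a quick sanity check that the index bookkeeping is consistent.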
Yeah, and then you update the input, because when resampling to a lower sampling rate, for example, you might just skip an input sample: if you imagine the filter bank, you basically run past all the filters and go to the next round.

So, how do we get that onto the GPU? Well, first of all, a short introduction to CUDA. What is CUDA? CUDA is basically NVIDIA's framework for doing general purpose computation on the GPU, which is, I think, mostly used for scientific calculations, and by now also heavily in SDR. The architecture of CUDA, or how you program it, is closely related to how NVIDIA actually builds their GPUs, which makes a lot of sense. A GPU usually has some multiprocessors on the chip, which all come with their own cache and their own ALUs and so on. They have this on-chip memory and cache, which is super fast, like a normal CPU cache would be, and this is pretty vital, as we'll see later. Every multiprocessor has tons of ALUs, called CUDA cores if you're talking in CUDA language, which execute your kernels. And then there is the global memory, which is what you usually see advertised, like six gigabytes of RAM. That global memory is super slow, and we usually try to avoid using it at all.

So, how do you map this architecture to how you program in CUDA? In CUDA you basically have three very important terms: the grid, the block, and the thread, which actually executes the computations you need. The grid, I always imagine, is the GPU with all the stuff on it. The block you could map to a multiprocessor; it's not really the case.
Every block does get executed on one distinct multiprocessor, but I think it's close enough if you imagine it like that. And then you have the thread, which runs on a CUDA core. Another important thing is the so-called warp, which is how the GPU scheduler schedules your computations: it sticks 32 threads together into a warp, and they are executed concurrently, and within a block they can all do things like use the shared memory and the fast on-chip cache.

Some rules of thumb for programming CUDA. Usually, spawn more threads than you have cores. For example, I think the GPU we used had something like two-thousand-something cores, but we spawned something like 32,000 threads. Probably we were overdoing it a bit there, but GPUs have a huge pipeline for scheduling threads, so if you keep the pipeline busy by having more threads than cores, you gain something. As already said, the global memory is super slow. So the first thing you want to do, after you have transferred your data from the host to the GPU, where it lands in global memory, is to move all the hot data that you use frequently to the on-chip memory, which you can configure either as cache or as CUDA shared memory.

Something else you need to take care of, both when moving data from global memory to shared memory and when accessing shared memory, is that GPUs, like CPUs, have a cache line. The cache line is a bit bigger; if I recall correctly, it is just big enough to serve all 32 threads in a warp with values in a single memory load, provided adjacent threads access adjacent memory. So thread zero accesses memory at position zero, thread one accesses memory at position one, always with a four-byte offset.
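To illustrate the coalescing rule with toy numbers (the 128-byte line and 32-thread warp are the usual NVIDIA figures, but check the documentation for your particular GPU):

```python
CACHE_LINE = 128   # bytes fetched per memory transaction (typical NVIDIA value)
WARP = 32          # threads scheduled together as one warp
ITEM = 4           # bytes per 32-bit value

def loads_per_warp(byte_addresses):
    """How many distinct cache lines one warp touches in a single access."""
    return len({addr // CACHE_LINE for addr in byte_addresses})

# Coalesced: thread t reads element t, so the warp covers one 128-byte line.
coalesced = [t * ITEM for t in range(WARP)]
# Strided: thread t reads element 32 * t, so every thread hits its own line.
strided = [t * 32 * ITEM for t in range(WARP)]

print(loads_per_warp(coalesced))  # 1 load serves the whole warp
print(loads_per_warp(strided))    # 32 separate loads
```

32 threads times 4 bytes is exactly one 128-byte line, which is why the adjacent-access pattern needs only a single transaction.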
That way, one load serves all the threads. If you don't do that, you get multiple memory loads per warp, which slows down the computation a lot.

So, how do we do this with the resampler? We basically do the same as we did before with the channelizer. The channelizer outputs 45 channels, and we resample all 45 channels at once, with one channel mapped to one CUDA block. So each CUDA block takes care of resampling one channel, and each thread inside that block computes one resampler output. We only get concurrency by calculating several filter outputs at once; we don't do anything like computing the multiplications of the dot product in parallel and then somehow summing them up. We just do the whole dot product in one kernel, and concurrency comes from processing several samples at once. This also minimizes synchronization; we really don't need any synchronization, because the warps basically synchronize themselves.

So, yeah, filter calculations. Most of these filter calculations, and by filter calculations I mean deciding which filter to use, are done on the GPU, except for one thing. Since I have output from the channelizer, which I have to transfer back to the CPU anyway, and since I had pretty good CPU host code for streaming but nothing like that for the GPU, I decided I wanted to process all input samples at once. With an arbitrary resampler, that means a variable number of output samples, which means I don't know how many threads to start before I have to start them. So I precalculate on the CPU how many samples I will produce in this round, transfer that to the GPU, and I basically prayed for two things. First, that my calculation on the CPU wasn't somehow compromised by doing one huge floating point multiplication.
And second, that frequent transfers from the CPU to the GPU wouldn't slow down my system too much. Then we just start our kernels. Because we always start kernels in multiples of 32, I might be doing some wasteful calculations, but I don't care; I just throw the unnecessary results away when copying the data back from the GPU, so we are fine there.

I actually don't have that many benchmarks, because I was in a bit of a hurry, and all I cared about was that the chain of polyphase filter bank channelizer plus polyphase filter bank resampler was faster than four megasamples per second. So I basically just put them together with the same setup: 45 channels, a 1550-tap prototype filter for the channelizer; I think the filter for the resampler was smaller, 350 taps or something like that. We processed 768 samples per channel in parallel, and the result was 25 megasamples per second on average, which is pretty good, because it means I can probably sustain four megasamples per second for a long time. Yeah, so that's pretty much it.
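The CPU-side precalculation mentioned above, walking the filter-index updates without doing any filtering, might look roughly like this. This is a pure-Python sketch; the function name and the resume-state convention are my assumptions, not the actual CAFE host code:

```python
import math

def count_outputs(n_input, n_filters, rate, start_filter=0, start_acc=0.0):
    """Count how many output samples a block of n_input samples will
    produce, and return the filter index and accumulated rate to
    resume from at the next call to the GPU."""
    dec_rate = int(math.floor(n_filters / rate))
    flt_rate = n_filters / rate - dec_rate
    filt, acc = start_filter, start_acc
    pos, n_out = 0, 0
    while pos < n_input:
        n_out += 1                           # one thread / one output sample
        acc += flt_rate                      # advance the fractional step
        filt += dec_rate + int(math.floor(acc))
        acc %= 1.0
        pos += filt // n_filters             # input samples consumed
        filt %= n_filters
    return n_out, filt, acc

# 4 filters, resampling by a factor of 2: every input yields two outputs.
print(count_outputs(3, 4, 2.0))  # (6, 0, 0.0)
```

Knowing `n_out` up front fixes the thread count for the kernel launch, and carrying `filt` and `acc` across calls is what keeps consecutive blocks of samples consistent.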
So last year I stood here at FOSDEM and said, yeah, we're going to open source it. Until September I was really optimistic that I could, until management got wind that I wanted to open source the stuff, and then the bureaucratic nightmare started. I had to go through a lot of bureaucracy that wouldn't even guarantee that I could open source it; it was just that I had to put something on the table for our institute's management so they could then decide the fate of the software.

I had to check licenses; of course, everybody has to do that. I chose the LGPLv3, if anybody is interested. Then export control, which is a real nightmare, because the German Aerospace Center doesn't really have someone who can help you with export control, so I manually went through the German export control list. If you're doing communications, basically everything is regulated; even a tiny FIR filter, I think, qualifies for having to be checked if you export it. I just said it's fine; the technology has been out there for ages, and there are several resamplers already available, in GNU Radio for example, so I think we should be fine there. I had to check with every project partner and coordinator and get their written approval that I could open source it, even though it was our code; but yeah, you have to do that. And I had to establish a contributor license agreement. Again, there's nobody to help you with that at DLR, so you have to do it yourself, which takes a lot of time: researching what Apache does, what the FSF does, and then coming up with something similar but tailored to your needs. The cool thing is that I'm leaving the German Aerospace Center in two weeks, so in two weeks I have to sign the CLA for my own code, which is pretty neat. Or not.

So I still had to convince management. I wrote thousands of documents and I gave presentations; I had other people give presentations for me, because I'm not
invited to the management meetings. And yeah, the whole project was still in jeopardy; I was in contact with Martin, I think until last week, saying I might not be able to make it. Then last Monday I got the green light, one hour before I went on vacation. So by Thursday I had pretty much nothing done in terms of the presentation; the whole presentation was made on Thursday afternoon, and that's it.

There was lots of talk about management, but I also have to thank a lot of people at my institute. Mostly, probably, Jan-Louis Gilever, my group leader, who fought relentlessly for me in all these management meetings. Also Hati, our senior lead developer, who also represented me in countless management meetings and made sure that people understood: hey, it's good for us, it's good for everyone, that we open source this stuff. Also some special thanks from me to Gerard for the Kung Fury images that you saw and, I hope, enjoyed. So yeah, thanks for your time. Do we have time for questions?
Yeah, we skip the break now, so we can do one question. [Question: does each computation depend on the previous filter output?] No. So, the question was whether a filter output depends on the previous filter output; it does not. The thing with the precalculation is that as you skip filters, you sometimes overflow your filter bank a step earlier or later. So in our example, the filters could have produced 1440 or 1441 samples, and I need to know that in advance, because I need to arrange the buffers when copying. So what I did was precalculate those, say, 1440 filter skips to see at which filter I end up, because I also need the last filter index to start from again at the next call to the GPU. That's where the floating point worry came from: I wasn't sure whether, after a million calls, doing all these tiny additions all the time would still be in line with doing one big multiplication instead. But everything runs concurrently. That's it. So, thank you.