Hello everyone. I'm going to talk to you about a topic that's dear to me as a musician: hardware-aware neural architecture search for embedded audio effects simulation. I'll present the approach of hardware-aware neural architecture search through this application case, but I also hope it becomes clear that you can use it for other applications.

Let's jump right into the application case: audio effects. Audio effects, according to Wikipedia, are electronic devices that alter the sound of a musical instrument or audio source through audio signal processing. So far so good. That means effects can alter the loudness, the timing, the pitch, the timbre (that is, the harmonic content of the signal), the spatial hearing perception, and so on. On the right-hand side we see an example. The top shows an audio signal as a waveform, and we now apply low-frequency amplitude modulation with a frequency of 20 Hertz. The modulation is applied by multiplication: at its peak it multiplies the amplitude of the signal by two, and at its valley by zero, so we get an on-and-off swelling effect on the audio signal.
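As a minimal sketch of that amplitude modulation: the 20 Hz modulator and the 0-to-2 gain swing match what I just described, while the sample rate, the 440 Hz test tone, and the function name are only illustrative assumptions, not the actual demo signal.

```python
import numpy as np

def amplitude_modulate(x: np.ndarray, sample_rate: int, mod_freq: float = 20.0) -> np.ndarray:
    """Multiply the signal by a low-frequency oscillator so the gain
    swings between 0 (valley) and 2 (peak), as in the waveform example."""
    t = np.arange(len(x)) / sample_rate
    gain = 1.0 + np.sin(2.0 * np.pi * mod_freq * t)  # oscillates in [0, 2]
    return x * gain

# Illustrative usage on a one-second 440 Hz test tone:
sr = 44100
tone = np.sin(2.0 * np.pi * 440.0 * np.arange(sr) / sr)
swelling = amplitude_modulate(tone, sr)
```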
Effects can be implemented in analog or digital form. Analog implementations usually work electromagnetically, for example with tape (you can speed tape up, slow it down, flange it, and so on) or electronically, that is, with filters and non-linear processing elements like diodes and transistors that generate harmonic overtone content.

Let's look at a few examples. There are many different types of audio effects, and you can categorize them in very different ways; here is one categorization from one of the reference books on digital audio effects, and it's a lot. I myself like to distinguish three characteristics. Effects are either linear or non-linear. They are time-invariant or time-variant, meaning they depend directly on time, for example when the effect contains a low-frequency oscillator. And they have short or long memory, where short usually means around a few milliseconds and long means something in the realm of seconds.

Let's listen to some examples; I thought that might give you a better understanding. The first example is a clean guitar, meaning no effects, just for reference, and the next one is equalized. You couldn't hear a lot; maybe I can turn up the volume a little bit. That should be sufficient, I hope. With the equalizer, at least here, the spectrum becomes different, a bit more bass, a bit more treble, which is linear, time-invariant, and short memory. Now let's listen to distortion; that should be significantly louder. Then echo: you can't really hear the repetitions here, but it adds an echo, and sorry for the volume. And a flanger, which is a time-variant effect; you might hear the swooshing in there.

This is how those effects may look. This is my personal effects board; as a recording musician or audio engineer you might have a lot of these boxes, or none at all, but usually you have some. These are all analog. On the one hand, of course, they sound good: we've been accustomed to this kind of sound processing for decades, and we love the non-linear, warm, imperfect sound characteristics. But this comes with downsides. The board is quite big, it has to be transported, and it takes up space. Acquiring all these devices can be very expensive, and they tend to be unreliable: they are susceptible to noise, interference, impedance mismatches, all kinds of things, sometimes even temperature changes when germanium diodes or transistors are built in.

So what can we do about it? Instead of analog technology, people use digital technology, meaning DSP: you process the signal in the digital domain, with analog-to-digital converters in front, maybe a Fourier transform, or FIR or IIR filters, that is, finite or infinite impulse response filters. This gives you a lot more possibilities, also because you're not bound to causality in the same way: an analog, physically implementable system must be causal, it cannot look into the future, and it can only work with the signal at the current time and all past signals.

The issue, though, is that traditional effects like distortion sound different from their analog counterparts when you implement them digitally without meticulously modeling the characteristics of the device. They tend to sound too pristine, too clean, too sterile, not really pleasant to the ears. Attempting to model these characteristics is called virtual analog modeling, which has been a research topic for some decades now. On the right-hand side you can see a tape echo simulation machine: it emulates the mechanical and electromagnetic characteristics of a tape machine, only in a much smaller form factor. They do this by meticulously implementing the physics and the circuit behavior they measured. This is called a white-box approach, and it works.

So what are white-box, gray-box, and black-box modeling? White-box modeling requires you to know the complete circuit you want to model, that is, every voltage and current relationship in it. You need to trace circuits, run test signals through them, measure them, and know the circuit diagram. You also need to know that the circuit diagram is actually correct, because some vintage or historical effects and synthesizers are not implemented the way their diagrams say. This takes time, is costly, and you need to pay very good expert engineers to actually implement it.

In the middle there is gray-box modeling, which only requires you to know the circuit topology; you have some free parameters that you fit on data, either by running a grid search over them or by learning them via machine learning. Say you have a compressor or a distortion: you roughly know how those devices are usually built, but some parameters are specific to that one effect, and you can train them given input and output samples of the specific device you're trying to model.
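As a toy illustration of the gray-box idea, here is a sketch that fits a single free parameter, a pre-gain in front of a tanh soft clipper, on input/output pairs with plain gradient descent. The tanh topology and the "measured" data are assumptions made for this example, not a real device.

```python
import numpy as np

# Toy gray-box model: we assume the device topology is a tanh soft clipper
# with one unknown pre-gain g, and we fit g on measured input/output pairs.
# The "device" below is simulated; real data would come from hardware.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 4096)      # input test signal
y = np.tanh(3.5 * x)                  # stand-in for the measured device output

g, lr = 1.0, 0.1                      # free parameter and learning rate
for _ in range(500):
    y_hat = np.tanh(g * x)
    # gradient of the mean squared error, using d tanh(u)/du = 1 - tanh(u)^2
    grad = np.mean(2.0 * (y_hat - y) * (1.0 - y_hat**2) * x)
    g -= lr * grad

print(f"fitted pre-gain: {g:.3f}")    # converges toward the true value 3.5
```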
The black-box approach on the right-hand side takes this to the extreme. Ideally you know nothing about the circuit: you view it as a black box and learn the model completely by feeding input signals into the device, looking at the output signals, and trying to replicate the behavior. So you're imitating a complex non-linear system just by looking at its output.

This is a very simple diagram of how that can look. On the left-hand side we have our clean audio signal; let's call it x. We run it through an effects box that implements a function we call g, parameterized for example by knobs and switches, and it produces an output signal y. At the same time, we also feed x into a neural network and try to replicate the behavior of g. So we have g-hat, an approximator of g, which produces y-hat, a prediction of the output of the actual hardware device on top. We try to minimize the distance between the two, so we get as close as possible to the actual output. Again I have some sound samples; these are some actual devices I worked with. First the clean guitar, a bit quiet again. Then the actual effect output, which doesn't sound pleasant, but that's how the effect sounds. And then my prediction, which is close but not perfect, as you might have heard and can definitely see in the waveform.
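A minimal sketch of that training objective in PyTorch. The two-layer stand-in network, the plain MSE loss, and the dummy data loader are all assumptions for illustration; the actual model and loss are the ones discussed next.

```python
import torch
from torch import nn

# Hypothetical stand-in for g_hat; the real model is the TCN discussed below.
g_hat = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, padding=8),  # padded so we can trim causally
    nn.PReLU(),
    nn.Conv1d(8, 1, kernel_size=1),
)
optimizer = torch.optim.Adam(g_hat.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # audio work often adds spectral terms on top

# Dummy stand-in for a dataset of time-aligned (clean x, device output y) pairs.
loader = [(torch.randn(4, 1, 1024), torch.randn(4, 1, 1024))]

for x, y in loader:                          # x, y: (batch, 1, num_samples)
    y_hat = g_hat(x)[..., : x.shape[-1]]     # keep the first L outputs -> causal
    loss = loss_fn(y_hat, y)                 # distance between y_hat and y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```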
Okay, so what are my goals, and how do we get there? I want to design black-box models automatically for each specific effect. I cannot have a one-size-fits-all solution; people have attempted that, and it doesn't really work, at least not for my requirements, which I'll come to in a minute. And I want to do this automatically, because otherwise you'd need to design a neural network by hand for each and every effect, and I want a systematic approach; doing it manually every time defeats the black-box idea, even though people have called this black-box modeling for a decade or so. I'm also considering resource efficiency, because I work at the embedded systems department and want to implement this on actual hardware, so I'm concerned with latency on the target hardware. I want it to run with minimal latency so I can actually use it during a performance, without it producing the sound one second after I hit a chord on the guitar, which is basically unplayable. And eventually (I'm not there yet, but working on it) I want to deploy models on a hardware platform that can instantiate arbitrary black-box models with minimal latency, so I can have some kind of one-box-does-it-all approach.

Okay, let's look at how we try to accomplish this. The baseline architecture I'm using as a reference is WaveNet-based. WaveNet is a fully probabilistic, autoregressive model for generating raw audio, and it is causal, meaning it only uses present and past samples to generate the current output. TCN-300-C means temporal convolutional network, where the C stands for causal and 300 refers to the so-called receptive field of approximately 300 milliseconds: for every output sample produced, the model looks at 300 milliseconds of input audio. How do we achieve this? If we just chained plain 1D convolutions, we would either need very long kernels or a lot of layers. Instead, we use so-called dilated convolutions, which some of you may know as convolution à trous, convolution with holes. I had a figure but needed to skip that slide; the idea is that in subsequent layers, with a dilation factor of two for example, only every second output of the previous layer is considered for one calculation. But since the kernel is shifted sample by sample, you still consider all samples and all outputs, while reducing the computational complexity significantly and still increasing the receptive field. The architecture also uses so-called parametric ReLU activations, which are like a ReLU except that you learn the slope for inputs smaller than zero. Here you can see the basic configuration from the paper, which modeled analog dynamic range compression this way; the results were very good, and it was also the first model able to perform this computation in real time, so I'm using it as a baseline. What I did, though, is remove the conditioning block, which is faded out here; it conditions the effect on the parameter settings and switches, but the dataset I'm using doesn't currently contain that parameter information.
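Here is a minimal sketch of such a causal dilated convolution block with a PReLU, plus the usual receptive-field calculation for a stack of such layers. It follows the idea just described, not the exact TCN-300-C implementation from the paper; channel counts and dilations in the usage lines are arbitrary.

```python
import torch
from torch import nn

class CausalDilatedBlock(nn.Module):
    """One TCN-style block: a dilated 1-D convolution made causal by
    left-padding only, followed by a parametric ReLU whose negative
    slope is learned."""
    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # no future samples leak in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = nn.functional.pad(x, (self.left_pad, 0))  # pad the past side only
        return self.act(self.conv(x))

def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field in samples of a stack of such blocks."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations grow the receptive field exponentially with depth:
# 10 layers with kernel 3 cover 2047 samples (~46 ms at 44.1 kHz), instead of
# one 2047-tap kernel or about a thousand undilated layers.
print(receptive_field(3, [2**i for i in range(10)]))
print(CausalDilatedBlock(8, 3, 4)(torch.randn(1, 8, 64)).shape)  # length preserved
```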
So we could now try to use that model for every effect and be done with it. Unfortunately, that doesn't work. It is also slow: it's real-time capable, but the buffer with those 300 milliseconds has to be filled in the beginning, because otherwise the model cannot produce output (you would need a completely different implementation, which I will consider, but it's currently not possible). That increases the initial time delay gap, which is what you hear as latency when you strum a chord, so we want to minimize it. I therefore want one architecture per effect that satisfies these criteria: if an effect needs a larger receptive field, we give it a larger receptive field; if we can work with a shorter one, we work with a shorter one; if we need more layers, we take more layers; and if it's an easy effect that needs fewer layers, we take fewer. I don't want to do this manually, because I usually don't know how the effect works internally (sometimes I do, but usually I don't), so I want to determine it automatically.

How do we get there automatically? This is what I'm using hardware-aware neural architecture search for. This is a very simple schematic of how it works. I have an architecture search space that defines building blocks and rules for constructing a neural network, and a search strategy that tries to find good architectures in that search space. The search strategy maintains a pool of candidate architectures to be evaluated by the evaluation strategy, and eventually we get a best architecture according to the evaluation strategy, of which we might then make a final performance assessment. The important point is that the evaluation strategy cannot consider only accuracy metrics; it must also consider hardware-dependent and implementation-dependent metrics, because the inference latency depends on how fast the hardware platform can actually compute the results. In the future we might also want to consider energy consumption, which is likewise very platform- and implementation-dependent.

As I said, we use that architecture as the baseline, and the search strategy modifies it by changing the number of layers or TCN blocks, the kernel sizes (or lengths), the number of channels per convolutional layer, the dilation factor, and whether to use standard or depthwise separable convolutions. This is what the actual design looks like; I tried to colorize it the same way as before so you can match the components. The search algorithm I'm using is regularized evolution, so an evolutionary algorithm, and I initialize the pool of candidate architectures with a random search over the architecture search space, which you can see here with its building blocks and rules: the set of options per layer, with the WaveNet or TCN structure built in. After this random initialization I start with a population, in evolutionary-algorithm terms, of candidate architectures, from which in each iteration I sample a random portion of 25%. For those I do latency measurements on my actual platform (the target platform in this scenario is also the development platform) and a loss estimation, which means I train each candidate up to a maximum epoch count or do early stopping, at least for now. The fitness evaluation considers both of these, the estimated loss and the latency measured on the platform. The worst model of that portion we discard, and the best model we mutate: with a certain mutation probability we change some of its layer options, trying to make the best model even better, and add the mutant to the population again. We iterate for as many trials as we have, so for as many models as we want to evaluate, and in the end we take the best architecture, train it completely, and yield a final model.
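Here is a compact sketch of that loop. The list-of-integers architecture encoding, the stand-in training/latency functions, and the fitness weighting are all assumptions made for illustration; a real run trains each candidate and measures latency on the target platform.

```python
import copy
import random

# Hypothetical stand-ins for the loss estimation and on-device measurement.
def train_and_measure(arch):                # -> (estimated_loss, latency_ms)
    return sum(arch) * 0.01, sum(arch) * 0.5

def fitness(loss, latency_ms):              # lower is better; weights are made up
    return loss + 0.01 * latency_ms

def mutate(arch, p=0.3):                    # change layer options with probability p
    return [random.randint(1, 8) if random.random() < p else gene
            for gene in arch]

def evolve(population, trials=100, sample_frac=0.25):
    """Sample 25% of the population, discard the worst of that portion,
    and re-insert a mutated copy of the best, as described above."""
    population = copy.deepcopy(population)
    for _ in range(trials):
        subset = random.sample(population, max(2, int(sample_frac * len(population))))
        subset.sort(key=lambda a: fitness(*train_and_measure(a)))
        population.remove(subset[-1])         # worst of the sampled portion
        population.append(mutate(subset[0]))  # mutated best rejoins the pool
    return min(population, key=lambda a: fitness(*train_and_measure(a)))

# Random-search initialization: 20 architectures with 4 option "genes" each.
best = evolve([[random.randint(1, 8) for _ in range(4)] for _ in range(20)])
```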
Okay, I have some preliminary results. They are preliminary because I've changed a lot in the code since then and haven't yet been able to rerun such comprehensive experiments, but I think you can still see the potential of this approach, and at least some interesting things to look at. The thing I want you to focus on most: on the left-hand side we have the baseline model, which I trained on each effect type, so a cabinet simulation (a speaker simulation), distortion, fuzz and overdrive, chorus, flanger, phaser, and tremolo, with the last four being time-variant effects as opposed to the others. I measured the loss, the loudness difference, the latency on that platform, and so on. On the right-hand side you can see the best model found by the hardware-aware NAS, and most important, I think, is the relative comparison between the two. We can see that almost across the board we got a lot faster relatively, with latencies now at 32 milliseconds or 60 milliseconds; in one case we got a bit slower, but only by about two milliseconds, which is not much. The loss is usually in the same ballpark, except unfortunately for the distortion model, where it increased. Still, this is the effect you heard in the sample before: it sounds quite alike but is not completely there. It still sounds like the distortion device; just the loss is increased compared to the baseline. The time-variant effects look very good, with very short latencies, but that is deceiving, and I'll show you why on the next slide, where we look at the architectures the hardware-aware NAS found.

Here, again, I want you to focus most on the right-hand side, where you can see the receptive field of each architecture and the measured latency. The receptive field is currently included in the latency measurement; I haven't done sample-by-sample latency measurement yet. And I should have mentioned: the audio samples I am processing are two seconds long, so I need, for example here, 352 milliseconds to process two seconds of audio, which is real-time capable. I can run that on a device (my laptop, in this case) and it will produce the output before the next sample arrives. The thing that matters for decreasing the initial time delay gap, though, is the receptive field. The baseline has a receptive field of 272 milliseconds, but for the cabinet simulation, the distortion, and the fuzz I got significantly lower values, so these will be much more usable to actually play with, while giving very similar quality. The overdrive took a bit longer, and as you know, the distortion didn't have the best quality. I attribute that to the search strategy not being explorative enough, so I have now increased the mutation probability and also the trial count, to get more exploration going in the search space, because it contains something like 1.2 billion possible architectures and we cannot evaluate them all. Still, I think you can see a pattern. The chorus, flanger, phaser, and tremolo architectures, though, are nonsense: the search was not able to replicate that behavior. When you listen to the predicted samples, it's basically the clean sound; the models didn't manage to apply any processing at all, because the architectural building blocks are not fit for this task. We would need to introduce skip connections between the front and back ends of the model, and probably also make it directly dependent on time, to be able to model these. I'm working on that; it's fine for now, since I know why this architecture cannot simulate them. It makes sense: the black-box model approximates the device by feeding linear (convolutional) filters into static nonlinearities, and that is not the execution model these time-variant effects are based on, so of course it doesn't work.

This I essentially already told you. To take one step into the embedded realm, I also used another evaluation platform with a less performant processor, together with an optimized C++ implementation of the TCN layers, and was able to get real-time performance on that device as well: with the three models I tested, it worked when I compiled them and put them on the Raspberry Pi, though I haven't done any elaborate experimentation yet. What I'm currently using this platform for, and this is where it gets tricky, is measuring the latency of a lot of architectures, because during the NAS I cannot measure the latency of each and every model on the device: that would mean compiling it, putting it onto the platform, getting the measurement, and only then continuing, which is not really feasible. There are different methods you could use, such as analytical models or lookup-table models, but since I want to apply this to different hardware platforms, I'm going with predictive models: I'm building a dataset of latency measurements and then training a predictor.
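A minimal sketch of such a latency predictor. The choice of four architecture features, the network size, and the random stand-in batch are assumptions, since the real dataset comes from on-device measurements.

```python
import torch
from torch import nn

# Map architecture descriptors (e.g. number of blocks, kernel size, channels,
# dilation) to a latency estimate, replacing compile-and-deploy round trips
# inside the search loop.
predictor = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def train_step(features: torch.Tensor, latency_ms: torch.Tensor) -> float:
    """One step on a batch of (architecture features, measured latency)."""
    pred = predictor(features).squeeze(-1)
    loss = nn.functional.mse_loss(pred, latency_ms)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch of 8 architectures; real data would be measured on the device.
print(train_step(torch.rand(8, 4), torch.rand(8) * 100.0))
```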
Okay, let's conclude with current and future work. On latency estimation, as I said, Lilith Berlayan, from AOA, has done some preliminary work for me here: she did her capstone project, which I supervised, on the development and training of a deep learning approach to estimate the latency of fully connected neural network inference, and I can now build on that. I also want to revisit the dataset design, because it's suboptimal at the moment; I won't go into detail here. And, and this is where it gets interesting and ties in with the work of the rest of our group, I want to add a hardware search space with layer templates optimized for our hardware platform, for example optimized depthwise separable convolutions or an optimized LSTM if we want to work with that. Those templates I can parameterize, and those parameters I can include in my search; this is what Lucas, who isn't here today, and I are working on. That then leads to the integration with the ElasticAI Creator, which offers these templates to a developer, where in this case the developer is the NAS. Via the ElasticAI Creator I want to deploy the models found by the hardware-aware NAS and validate them, together with the latency predictions, on our target platform, which is FPGA-based, as the professor already said, to validate my approach. Okay, thanks for listening. Questions?

Hello, first of all, thanks for the presentation, it was great. I'm trying to remember all of the questions I had in mind, but I didn't have time to take notes; I remember mainly three, since this is very familiar to stuff I also work with. First, about the DSPs: I personally use them, and I moved from analog pedalboards to DSPs. They're pretty good nowadays, but their latency differs from one venue to another, and the quality is not at all the same compared to pedalboards, especially with tube amplifiers. So first of all I wanted to know: are neural networks being used in DSPs, or not yet?

Okay, I think that's interesting. What we're trying to do with the FPGAs is build a dedicated, optimized ASIC, an application-specific circuit that is reconfigurable. So we're trying to implement a custom DSP for that specific effect, if you like, by reconfiguring the field-programmable gate array to implement that circuit, just digitally. I don't know which direction your question is aiming at, but there's another thing I find very interesting that people are working on now, called differentiable DSP: you take actual DSP functions or processing techniques and implement them in a way that is differentiable, so you can optimize them with gradient descent. You then use tried-and-tested, well-known DSP building blocks in your neural network, which I could include in the search space and optimize their parameters to fit the effect I want to simulate. Does either of these answer your question?
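To make the differentiable-DSP idea from that answer concrete, here is a toy sketch: a one-pole low-pass filter, a classic DSP building block, written so its coefficient can be fitted with gradient descent. Everything here is illustrative and not part of the presented system.

```python
import torch
from torch import nn

class OnePoleLowpass(nn.Module):
    """y[n] = a * x[n] + (1 - a) * y[n-1], with the coefficient a learnable.
    The sample-by-sample loop keeps it readable but slow for long signals."""
    def __init__(self):
        super().__init__()
        self.logit_a = nn.Parameter(torch.zeros(()))  # sigmoid keeps a in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.logit_a)
        y, out = x.new_zeros(()), []
        for sample in x:                     # x: 1-D signal (num_samples,)
            y = a * sample + (1.0 - a) * y
            out.append(y)
        return torch.stack(out)

# Fit the coefficient so the block matches a target filter's input/output data:
model = OnePoleLowpass()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
x = torch.randn(256)
with torch.no_grad():                        # simulate a "device" with a = 0.3
    target = OnePoleLowpass()
    target.logit_a.copy_(torch.logit(torch.tensor(0.3)))
    y_target = target(x)
for _ in range(200):
    loss = nn.functional.mse_loss(model(x), y_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```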
Kind of; I'm somewhat familiar with it, but this leads me to the second question. You said you use these models on effects and try to estimate both their quality and latency. Other than DSPs, can we use them in audio applications, for recording with an audio interface, in applications like Logic and Cubase and the others? Not only for live music, but also for live sound in general, or while recording. Is it also possible to use DSPs for these kinds of neural network architectures?

No, the other way around: to use these kinds of models in software, not on DSPs. In software applications, I mean.

Okay. I think one would need to investigate that. Currently we're working with FPGAs because that's interesting to our group, and they have DSP slices on them which we can use for some calculations like filtering. So far I've only considered DSP chips for very simple filtering in front of the signal, for example a low-pass filter or something like that, but it would be interesting to see whether we could implement something like this on them. The neural network architecture that comes out of this doesn't need to be trained on the DSP; we only need to deploy it, and it's basically cascading linear filters fed into nonlinearities, so I guess a DSP could implement that too. But currently I don't have a clue how to implement it in terms of, for example, the programming languages used for these kinds of systems, such as Faust or whatever you use to run software on DSPs. One could investigate that, but we're currently more interested in reconfigurable hardware, which we can customize to our own will with regard to circuits.

Okay, makes sense, thanks a lot, appreciate it.

You're welcome. No more questions? Okay, thank you. Thank you.