Welcome to another edition of RCE. This is Brock Palen. You can subscribe to the show; there's an RSS feed and an iTunes link, as well as all the old back episodes, at rce-cast.com. I also have here Jeff Squyres from Cisco Systems, one of the esteemed authors of Open MPI. Jeff, thanks again for lending your help.

Hey Brock. A couple of things worth mentioning here. We're coming up on July, and the end of July is when Supercomputing BoFs and posters are due, so make sure you get working on that. I've got to get working on my abstract for the Open MPI state-of-the-union BoF. Also, I'm always accepting questions for my blog; Brock and I have blogs and Twitter and things like that. I just got a couple of user-submitted questions, which were really great. So if you have any questions about MPI, or the inner workings or outer workings of MPI, please be sure to let me know and I'll address them on my blog. Also, I'm going to be at the XSEDE 12 conference in Chicago the third week of July, so if you're going to be there, be sure to look me up. But other than that, I think we can go ahead and roll into our guests today.

I've actually been trying to get these guys on, and they finally agreed to do this; I don't know if I just pestered them enough over the last few years. This software library is probably one of the most popular and most useful libraries I've come across in scientific computing. What we have today is FFTW, the Fastest Fourier Transform in the West, and we have with us the creators: Steven Johnson, who is faculty at MIT in the applied mathematics department, as well as Matteo Frigo, who is currently at Quanta Research Cambridge. Steven, Matteo, can you take a moment and introduce yourselves?

Hi, I'm Matteo Frigo.
I'm originally from Italy, but I have lived in the United States for almost 20 years at this point, mostly in the Boston area and in Austin, Texas. I got my PhD in computer science from MIT in 1999. I'm mostly an expert in parallel computing, and my main research topic was a programming system called Cilk, which is now part of the Intel compiler and also part of GCC 4.7, so it's having its impact on the world through that route. I was one of the authors of the paper on cache-oblivious algorithms that some of you may have heard about. I have worked on several different things, including medical devices, software radios, and compilers for exotic architectures. Most recently I am at Quanta Research Cambridge, which is a research lab next door to MIT, and I'm working on a form of error correction called network coding. And of course I work on FFTW with Steven.

Hi, I'm Steven Johnson. I'm one of the co-authors of FFTW. I'm currently a professor of applied mathematics at MIT. I got my PhD in 2001 in physics from MIT, and a lot of my work centers on nanophotonics, so basically electromagnetism in media that are structured on the scale of the wavelength. I do a lot of analytical stuff, but also a lot of computational stuff. We have a free textbook on a lot of my research, called Photonic Crystals: Molding the Flow of Light; if you Google "photonic crystal book" you'll find it as the first link, and you can download it as a free, sort of undergraduate-level textbook. In addition to that I work on things like solar cells, optical fibers, radiative heat transfer, micro-mechanical devices, and a lot of different kinds of projects. So in addition to FFTW, I've written a bunch of fairly popular free-software packages for simulating electromagnetism, Meep and MPB, our two EM simulation packages. I also have a package called NLopt, which is a free nonlinear optimization package.

Can you give us a basic rundown of what FFTW is and what it aims to solve?

So FFTW is a software library.
It's callable from C and many other languages, and it performs fast Fourier transforms and related transforms, like discrete cosine transforms and discrete sine transforms, which are widely used in a lot of different areas of scientific computation.

Now, what does the name stand for? I mean, you said the FFT part; what is the W part?

So FFTW stands for the Fastest Fourier Transform in the West, which is, you know, kind of a whimsical title; there's no single program that's the fastest everywhere. I think the name actually comes from Matteo. If you remind me, I think the name even predates FFTW; it started out with you giving me one of your old programs and calling it the fastest in the West.

Yeah, the story went that I had written a program to compute Fourier transforms on the Connection Machine CM-5, which was a supercomputer of the early '90s. At some point Steven asked me whether I had an efficient FFT code, and I gave him that program, telling him, look, this is the fastest in the West, you cannot do any better than this. Which wasn't true, but that's how the story started.

That's how all good names start, with a story. So, for the benefit of our listeners, can you say what exactly a Fourier transform is, how that is different from a fast Fourier transform, and where such things are useful?
So Fourier transforms basically decompose a signal or a function into a set of frequencies. You know, if you look at the graphic equalizer on your stereo or whatever, where the little bars go up and down, it takes the music you hear and decomposes it into how much of it is bass, how much of it is treble. A Fourier transform just does that in much more detail.

There are a lot of varieties of Fourier transforms. Mathematically, on a computer you deal with discrete signals that have, you know, a finite number of data points, and the way you transform those is something called a discrete Fourier transform. A fast Fourier transform is an algorithm to compute a discrete Fourier transform quickly: if you have n points, it famously can do it in order n log n operations. These are used for a huge number of applications. The obvious ones are things like audio processing, where you directly think of filtering a signal in terms of taking out certain frequency components or enhancing other frequency components. But there are a lot of non-obvious applications that don't seem to have anything to do with frequencies. For example, if you just want to multiply two very large numbers with a million digits, it turns out there's a fast way to do that by performing an FFT of the digits, then doing a simple multiplication of each individual digit, and then fast-Fourier-transforming back. And they're used for solving partial differential equations, and a lot of other problems.

So you mentioned that this was originally a Connection Machine effort. What's a little bit more of the history of FFTW, and why do you still maintain it and keep working on it?
So, as Matteo said, he had actually started programming FFTs before me. Around 1997 I was a graduate student, Matteo was actually a visiting scholar at MIT at that time, and I was working on solving Maxwell's equations for my research. We were using a spectral method, which requires you to do FFTs, and I needed an FFT that was fast. I was using my own machine, we were using Linux machines, we were logging in to Crays and a whole bunch of different supercomputers, and I wanted one that worked on all of them: one that was parallel so I could take advantage of all of them, that was multithreaded because we had multiprocessor machines. I looked around at what was available, and it didn't seem like there was one that did everything I wanted, and there was not much selection of parallel FFTs at the time, mostly vendor-specific ones.

So, you know, I was telling Matteo about this, because I knew him at MIT, and he said, oh, I have a fast FFT, a super fast Fourier transform, the fastest in the West, as he said, and you should use that. And it was parallel with Cilk; it wasn't distributed-memory, but it was multithreaded. So I took it, and I also downloaded half a dozen other free codes from the internet. You know, there's one by Singleton from 1968, and there's a whole bunch of things you could download even then. I benchmarked them on a couple of different machines and plotted the results as a function of size. And, you know, Matteo's was pretty good; it was sometimes the fastest, but not always. I posted these graphs as a link on my web page and sent Matteo an email about it, and his girlfriend, who is now his wife, said he came home that day and said, Steven put up a web page that says my code isn't the fastest; this has to change. And so we got involved in this project to try to make one code that was the
fastest, or near the fastest, all the time.

Yeah, you have to understand that in those days we had many different machines. There was UltraSPARC, there was the Digital Alpha, there were various forms of PowerPC, there was MIPS, you know, the MIPS R10000 that had just come out, a massively out-of-order machine for the time. Intel was transitioning from the in-order Pentium-style processors to the Pentium Pro. So it was very difficult to write a single piece of code that would work efficiently everywhere, and it was based on that background of machines that we had that we came up with this idea of writing some code that would try out, by itself, how to run fast on your machine and try different possibilities. So the fact that FFTW has the particular structure it has is sort of implied by the computing environment that we had at the time.

I see, so that's interesting. So are you saying that when you fire up FFTW, it runs a little micro-benchmark to determine what's going to run best on your machine, to pick among many different algorithms?

Yes. So FFTW is structured in two phases: first you call a planning phase, in which FFTW learns by itself how to compute the FFT on your machine efficiently, and then you actually compute as many transforms as you want, given the information that you have gathered in the first phase. And that was the idea that made FFTW different from other routines doing the same thing. Now, this was not a unique idea; there were some other efforts at Berkeley at the time doing the same thing for matrix multiplication in particular.
There was a project called PHiPAC that did a similar thing for matrix multiplication. That project, I think, has died, but the ATLAS project picked up a similar idea in that particular domain, and it is still alive and kicking today.

So can you give us a little bit of a rundown of some of the things FFTW does when it makes this plan, to try to extract performance? Like, what's your go-to performance issue that you have to tune?

Well, unfortunately, it's a lot of little things. So, first you have to understand how FFTs work. The basic algorithm is: suppose you have a transform of size n. What it does is break it up into smaller FFTs whose sizes are given by the factors of n. So, for example, if you have a size-1000 FFT, it could break it into ten FFTs of size 100, or it could break it into 20 FFTs of size 50, or 50 FFTs of size 20, and so forth. So the first choice you have to make is what factorization to use, right? The classic strategy, you know, from 1965, is that if you have a power-of-two size, you just divide it in two at each stage. So if you have 1024, you divide it into two transforms of size 512, and each of those into two transforms of size 256, and so forth. That actually turns out not to be a good idea on today's processors. But there's a whole bunch of additional choices the algorithm has to make, because in addition to deciding what factorization, how to decompose the transform,
it has to decide what order to do those subtransforms in, and where to store things in memory. So it has a lot of choices about different memory rearrangements it can make in the course of the transform, and those have a huge impact on performance nowadays, because the memory architecture is so important in determining the performance of computational code.

So does this happen under any user's code, or does this happen at build time? Because we had ATLAS on here, and they kind of do all their probing when you build it, and then it's kind of set in stone.

So this happens at runtime. Well, it is possible to do it ahead of time, save the plan, and then reuse it later on. But, you know, there's a key difference here from matrix multiplication, which is what ATLAS is doing. The difficulty is that because the FFT algorithms depend on the factorization of the size, if you have an algorithm for size n, the algorithm for size n plus one is completely different, because the factorization is completely different. Whereas if you're doing matrix multiplication and you have an algorithm for an n-by-n matrix, the algorithm for the n-plus-one-by-n-plus-one matrix is actually pretty similar. So for matrix multiplications you just have to divide the sizes into various different scales and figure out what block sizes you need at each scale of sizes, but for FFTs the algorithm choices are much, much more difficult. So, you know, we could at build time create plans for, for example, all the power-of-two sizes, and there is an option to do that and save those. But, you know, it's not practical to create plans ahead of time for every possible size the user could be interested in, especially once you include multi-dimensional transforms.

So I've noticed there's this save-plan feature, and something called wisdom. Is this what you're talking about? Yes.
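In code, the two-phase plan-then-execute pattern just described looks roughly like this; a minimal sketch, assuming libfftw3 is installed (compile with -lfftw3 -lm):

```c
/* Sketch of FFTW's two-phase API: plan once, possibly expensively,
 * then execute that plan as many times as you like. */
#include <fftw3.h>

int main(void)
{
    int n = 1024;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning phase: FFTW_MEASURE actually runs and times candidate
     * algorithms on this machine; FFTW_ESTIMATE would just guess. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    for (int iter = 0; iter < 100; ++iter) {
        /* ... fill in[] with this iteration's data ... */
        fftw_execute(p);  /* cheap: reuses the plan found above */
    }

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```

Note that FFTW_MEASURE may overwrite the arrays during planning, which is why you fill the input after creating the plan.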
So basically, at the most basic level, we have the ability to create a plan, which takes a little time; it runs some experiments. You can do that at runtime, or we can also do it once, save it to a file or whatever, read that in later on, and reuse it. It turns out that in the course of creating a plan, FFTW generates more information than just how to compute that specific transform, and that information is stored in a little database in FFTW that we call the wisdom database. It's wisdom about the machine you're running on. So technically you don't just save one plan; you save all of the wisdom that has accumulated, and you can read that in, and potentially some of that information could be useful for other transforms that you might be doing.

So does it ever make sense for an admin or service provider, you know, like myself, who's got a cluster built out of a single architecture, to provide wisdom or plans for people to use, so they don't have to spend time computing this? Or does that not make sense?

No, that makes perfect sense.
In fact, we built a feature into FFTW specifically to support that. As I said, you can import the wisdom from a file or from a string or whatever, but there's a special call, fftw_import_system_wisdom, that's designed to import it from some location that was set by the sysadmin, which defaults to /etc/fftw/wisdom, but you can put it wherever you want. So you could generate the wisdom once for a bunch of power-of-two and power-of-ten sizes, which are the most common sizes, save it to a file someplace on your system, and when you build FFTW you can set that location as the import location for the system wisdom. Then users can just call this one routine, and it will import the system-wide wisdom.

Now, how sensitive is that wisdom to system jitter and other system-type effects, like NUMA locality and other types of locality, and whether the algorithm is MPI- or thread-parallel, things like that?

Well, you bring up a very interesting question. I think that the basic FFTW mechanism really works well only in the case of a sequential program running on a single core, and as you start adding parallelism, this wisdom and this planning mechanism become less and less reliable, because of all the other activities that are going on inside the system.
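The sysadmin-oriented wisdom workflow described a moment ago might be sketched like this; again assuming libfftw3 is installed, and with my_wisdom.dat as a purely hypothetical per-user file name:

```c
/* Sketch of FFTW's wisdom import/export. fftw_import_system_wisdom()
 * reads the sysadmin-provided file (/etc/fftw/wisdom by default);
 * "my_wisdom.dat" is a hypothetical per-user wisdom file. */
#include <fftw3.h>

int main(void)
{
    /* Try the cluster-wide wisdom first, then any per-user file;
     * both calls simply return 0 if the file is absent. */
    fftw_import_system_wisdom();
    fftw_import_wisdom_from_filename("my_wisdom.dat");

    /* Planning is now fast for any size already covered by wisdom,
     * even at high planner effort levels like FFTW_PATIENT. */
    fftw_complex *a = fftw_malloc(sizeof(fftw_complex) * 1024);
    fftw_plan p = fftw_plan_dft_1d(1024, a, a, FFTW_FORWARD, FFTW_PATIENT);
    fftw_execute(p);

    /* Save everything learned so far for next time. */
    fftw_export_wisdom_to_filename("my_wisdom.dat");

    fftw_destroy_plan(p);
    fftw_free(a);
    return 0;
}
```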
So, you know, from what I've seen, with just basic FFTW plus threads, if you have nothing else running on the machine, the mechanism works reasonably well. But when you start adding other jobs running on the same cores, and NUMA effects, and all this other stuff, this becomes a very hard problem. I don't think we are doing the right thing yet, and I don't know what the right thing is. So this is a basic limitation, I think, of this technology. You know, in FFTW's defense, I think we do better than other systems, because at least we have some ability to adapt to the environment, but I don't think there is a final answer to that particular problem.

Yeah, I mean, it's very difficult to get the absolute optimum in the face of all this jitter, as you say. On the other hand, at least you can avoid algorithms that are positively bad, you know, for your memory architecture or your system.

Sure, and I should probably disclaim my question: jitter is an enormously unsolved problem in the world of HPC anyway. I was probably more asking in terms of NUMA effects, and thread versus distributed memory, you know, shared-memory versus distributed-memory parallelism. I think it's pretty safe to assume that in an HPC world, things run by themselves on whatever resources they're running on, and things like that. So thanks for going into that; it probably wasn't quite a fair question, my mistake. But along those lines, do you have algorithms that are tuned for NUMA types of locality?

Well, yes, in the sense that it's a completely different algorithm for parallelizing on threads and for parallelizing on distributed memory. So we have MPI plans and algorithms built on top of FFTW;
we also have shared-memory threads built on top of FFTW, and those use very different types of algorithms, as you can imagine. We don't have specifically NUMA algorithms.

Okay, right, so MPI is very different from threads. What I was really going for, though, is: in a threaded environment, particularly on today's modern server architectures that are commonly used in HPC clusters, are your algorithms aware of, say, cache sizes, and socket and memory locality, and things like that?

Well, we don't do anything specific about NUMA, in the sense that if you have a machine with multiple sockets and memory connected to the different sockets, we don't know whether the memory is local or not. So in that sense there is a possible improvement that we are not doing, and it is actually very hard to do from user mode, because then we must make assumptions about the placement of threads on cores and so on. That being said, FFTW has some built-in robustness against the memory hierarchy, because it tends to use recursive algorithms: it tends to chop a problem into smaller problems until eventually the problem is sufficiently small that it fits into cache, and so it doesn't go to memory that often anymore. So just because of this, you know, very silly strategy, we tend to do a good job regardless of what the memory hierarchy looks like. So there is some built-in robustness in the system. FFTW does not contain any tuning parameter like the cache size, or the size of the L1 or L2 or L3 or whatever cache; these parameters are learned automatically as part of the planning process.

So, moving down this line of parallelism: there's a serial version of FFTW, and then there's an MPI-based version. And even in most MPI codes I've helped users build, when they use FFTW
they actually use the serial version. But I also noticed that besides the threaded version, there's also an OpenMP version. Why provide two different threading models, and what are the challenges of both?

So, the different threading models, primarily, are just to support different kinds of code. From our perspective it was almost equally easy; basically, the pthreads, the POSIX threads, version and the OpenMP version actually share most of their code. It's just a different mechanism for spawning threads, and the same code also supports Windows threads. At one point in the past we supported Mach cthreads, which nobody uses anymore, so we dropped that. And we supported, I think there was a Sun threads, wasn't there, Matteo? Yeah, Solaris threads, which was a precursor of POSIX threads and later became POSIX threads. So, you know, from my perspective, it's just as easy to support one kind of threads as another kind of threads; all we need is a mechanism to spawn threads and some way to synchronize after they're done.

The reason it was important to provide an OpenMP version specifically, well, there are two reasons. One is that a lot of the OpenMP implementations we find are actually a bit better than just using raw POSIX threads; they put in a lot of tricks to keep the threads busy and to keep them pinned to different processors, so they interact better with the scheduler than if you just launch raw POSIX threads. The other main reason is that if the user's code is using OpenMP, which a lot of users' codes are, then it's better if we use OpenMP as well, so that our threads don't conflict with the user's threads, so that they share the same thread pool, and we allow the OpenMP mechanism to coordinate the threading in that way.

So what kind of tricks do you do? What kind of algorithms are there that give you the speed, right?
So I assume you have a lot of different types of algorithms in your different classifications. What are some typical tricks that you do?

Oh boy, there's a lot. It depends, as you can imagine, on the size. So first of all, there are a couple of different things. At the lowest level, you have to understand that in order to take advantage of the way CPUs work these days, you know, they have a lot of registers, they have fairly long pipelines, you want to give them a large block of highly optimized straight-line code. You don't want to call subroutines that have a loop with, you know, two instructions in it; you want to give them hundreds of lines of code to chomp on. And so at the basic level, at the leaves of this recursive tree, we have hard-coded routines we call codelets, which are just hard-coded FFTs of small sizes that are super highly optimized. Each one might be a thousand or two thousand lines of straight-line code, and as you can imagine, it's really hard to write highly optimized code that's two thousand lines long by hand. So we don't do that. What we do is we have, effectively, a special-purpose compiler that generates these base cases. It generates tens of thousands of lines of, you know, different hard-coded transforms of different sizes. And that's really, really important for getting good performance at the leaves, and good performance for any sort of moderate-size transform.

Now, as the sizes start to go out of cache, you get much larger transforms, multi-dimensional transforms.
There are a bunch of additional tricks that have more to do with the memory hierarchy. You know, as Matteo said, just using explicit recursion already gives you some memory-hierarchy benefits; it gives you something called cache-obliviousness, which was the subject of part of Matteo's thesis. But in addition to that, there's a bunch of little tricks you can do. It's a little hard to describe over the phone, but basically it's a lot of little memory rearrangements. So you can take the data that's spread out in memory, copy it to a little buffer that's contiguous, do the transform there, and copy it back. You do little transpositions of the data, actually little matrix transposes, again to make things more contiguous, that are interleaved with the transform. Matteo, can you think of anything else that would be easy to explain?

Yes, there is an interesting trick that we do in the codelets, you know, at the leaves of this recursion. So Steven mentioned that we produce straight-line code, which is like a thousand or two thousand lines long, and if you give such code to a compiler and you don't do anything about it, most likely the compiler will generate very slow code from it. The real problem is register allocation: how do you fit a thousand variables into the 16 or 32 or 64 registers that your machine has? It turns out that you can write the code in such a way that it is possible to do good register allocation no matter how many registers your machine has, and this is another consequence of this cache-obliviousness theory that we have mentioned already a couple of times. And if you look at it, this really makes a big difference: the difference between naive code and code that is generated according to this principle is about a factor of three, you know, that you get just by scheduling the code right.
And, you know, this is really the thing that makes the leaves of our recursion almost as efficient as anything you can do by hand.

Cool. We should mention that that scheduling is specific to FFT algorithms, so it's something the compiler can't do for you, because it's not a generic scheduling algorithm.

So with all these high-level tricks and things that you're doing in the back end, are there knobs that the user can tune? Because, like, oh, I'm going to be doing this particular type of FFT, so I want to at least nudge your decision-making or algorithm process in this direction, or something like that?

Not very easily, in the sense of just nudging the algorithm choices. I mean, of course it's possible to hack the code to do whatever you want, but it's not set up to make it very easy for the user to just go in and do that. The one thing the user can do very easily is generate new kinds of codelets. So if there's a particular transform size you really care about: if you really care about doing discrete cosine transforms of size 120, and we don't have a hard-coded transform of size 120, we would use more generic code to break that into smaller sizes instead. But it's fairly easy, in the build system, to insert one line and say, I want discrete cosine transforms of size 120, and it will generate this very efficient code for that size automatically. Or, more generally, for any small factor or small size that you really care about, it actually benefits you a lot to generate codelets for that size. Because it comes built in with codelets that handle, basically, sizes 1 to 16, 20, 25, 32, and 64; I think those are the built-in sizes
I think those are the the the built-in sizes If for for ffts And so if you have some other oddball size then it's it's good to to generate code for that Yeah, there is another knob that the user has which is how much effort do you want to invest in the planning phase And there we have several levels. Uh, you can get in An estimate of what the best plan is and that is very quick Or you can take some measurements. You can take more measurements if you are patient Or you can do an exhaustive search over all possible plans Which takes forever and normally does doesn't run to completion So that that's the other knob that the user has So what about some things that users do have control over like maybe the transform size a little bit because There's Funny performance things that happen when you try to stay do a number of samples. That's a large prime number yeah, yeah, so FHW is as unusual is Less so now, but you know so As I said fft algorithms depend most of them depend on the factorization Of the size they work by breaking a size down into its factors So if you have that algorithm and you have a prime size, you can't do anything But actually there there are fft algorithms and order and login algorithms that work for prime sizes as well And I actually initially implemented in fhw for fun Just because these were kind of cool algorithms and they turned out to be really popular because that way You know, whatever size you give it It will still do an n long n algorithm won't suddenly be a million times slower if you add one to your size Now that being said The the algorithms for composite sizes are still faster So and in general you're best off if you if you have some freedom in choosing the length Of the transform you're best off if it has small factors say 2 3 5 and maybe 7 Now the conventional wisdom was that powers of two sizes are the best and if you look at most code out there A lot of the fft codes actually historically only handle power of two sizes And even the 
code that handles non-power-of-two sizes is often really only optimized for powers of two. But FFTW is actually optimized pretty well for a variety of small factors. So if you have a size-600 transform, which factors perfectly well into small factors, it's not going to be worth it to pad that up to the next power of two, to size 1024; that will actually probably be slower than size 600. But if you do have a size with large prime factors, and you have the ability to tweak it and make it a size that has many small factors, that's probably a good idea.

So let's change direction a little bit here. What language is FFTW written in? And you mentioned earlier in the podcast that you have bindings for a lot of languages, so you can call FFTW from many different types of applications. What bindings do you support as well?

Okay, those are two separate questions. So, to begin with, which language is a little bit complicated, in the sense that there's a two-step process. As we said, these codelets, these leaves, these base cases of the recursion, are all automatically generated. Those codelets are spit out by another program that we wrote, and the generator program is written in a functional language called OCaml, which is a nice language for writing compiler-like programs in. This OCaml code spits out C code, and the rest of FFTW is written in C. So basically, the end result is that the user doesn't even need OCaml on their machine unless they want to generate new codelets.
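Coming back to the earlier point about transform lengths: a helper along these lines (hypothetical, not part of FFTW's API) picks the next length whose prime factors are all small, which is the kind of size FFTW handles most efficiently:

```c
/* Hypothetical helper (not part of FFTW): find the smallest n >= target
 * whose only prime factors are 2, 3, 5, and 7. If you are free to choose
 * or pad your transform length, such sizes are usually the fast ones. */
int next_fast_size(int target)
{
    for (int n = target; ; ++n) {
        int m = n;
        while (m % 2 == 0) m /= 2;
        while (m % 3 == 0) m /= 3;
        while (m % 5 == 0) m /= 5;
        while (m % 7 == 0) m /= 7;
        if (m == 1)
            return n;
    }
}
```

Consistent with the discussion above, a size like 600 (2³·3·5²) already comes back unchanged, so there is no point padding it to 1024.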
All they need is a C compiler. And that generated C code, of course, we support calling from C++, and we also support calling from Fortran. We originally had Fortran 77 bindings, and in the most recent version we added new bindings for Fortran 2003, or whatever the latest Fortran standard is; the new Fortran standard has explicit support for calling C code, which makes it much easier to add bindings. In addition to C, C++, and Fortran, there are a bunch of bindings that users have added for different languages. You know, we're not involved in those directly, so I wouldn't say that we support them, but they exist out there. There are bindings for C# and Python and Eiffel and Ada and Java, and Guile, which is a Scheme implementation. There are Pascal bindings, Modula-3, Ruby, Perl, Lisp, and, you know, probably other ones that we're not aware of.

So with FFTs you're dealing with complex numbers a lot, and there have been some changes there with C, and, I forget the newest C update. C99, yeah. And, you know, Fortran has a native complex type. How does FFTW play nicely with this, moving back and forth between the types?

Yes, so when we wrote FFTW, you know, in the initial version in 1997, complex numbers were not part of the C standard. They were added in 1999, and even after they were added, it still took some time for them to be widely supported. They're still not supported in, for example, the Microsoft Visual C++ compilers, because those don't really compile C; they compile C++, which has its own complex number type. So internally to FFTW,
we don't use these; we don't take advantage of the C complex number support. We do our own complex arithmetic, but we store complex numbers in a format that's binary-compatible with the C, C++, and Fortran complex numbers. All the C99, C++, and Fortran complex numbers are required by the standard to be stored internally as just the real part followed by the imaginary part. Since we store it that way, you can call FFTW with C99 complex numbers, or C++ complex arrays, or Fortran complex arrays, with no translation overhead. All it is is a pointer typecast. And we even have some support, some hackery in the header files, so that in the case of C99 you don't even have to do the pointer typecast. Now, maybe Matteo should also comment that there are some reasons why, at the code level, I think it's actually not desirable for us to use the C99 complex numbers: we can actually do more by having access to the real and imaginary parts separately and doing our arithmetic on those things separately. Well, this gets very technical, but it has to do with the fact that it is sometimes convenient to view a complex number as an array of two real numbers and do vector operations on this array.
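The binary compatibility being described can be demonstrated with Python's standard `struct` module (an illustration of the storage convention, not FFTW's actual API): pack complex values in the standard real-then-imaginary layout, then reinterpret the same bytes as plain doubles, which is the moral equivalent of the pointer typecast in C.

```python
import struct

# Two complex numbers stored in the standard layout: each element is the
# real part followed by the imaginary part, as C99/C++/Fortran require.
z = [1.0 + 2.0j, 3.0 + 4.0j]
raw = b"".join(struct.pack("dd", c.real, c.imag) for c in z)

# Reinterpreting the same bytes as plain doubles (the analogue of the
# pointer typecast) recovers the interleaved values with no translation.
flat = struct.unpack("dddd", raw)
print(flat)  # (1.0, 2.0, 3.0, 4.0)
```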
So really, treat them as if they were independent real numbers, and only combine them at the end. This is useful if you're using SIMD instructions like SSE and SSE2 and AVX and AltiVec and all of that, but it gets very technical to discuss exactly why that's the point. To go back to the original question, I also want to say that we support the alternative format of complex numbers, where you have a separate array of real parts and an array of imaginary parts, which is the format that was very popular a while ago because MATLAB was using it. I don't know if they're still using it, but that's definitely another popular format, and FFTW will work well with that format as well. This has nothing to do with the C99 complex types; it's a much older numerical-libraries convention. So you mentioned MATLAB in there; I noticed MATLAB has FFTW in it, and we talked about bindings. Let's move up a level: what popular, well-known full user applications use FFTW? So it's a little hard to keep track of which things are using FFTW. MATLAB, as you mentioned, uses FFTW internally for its FFTs. Also several free MATLAB-like programs: GNU Octave uses FFTW, and Scilab, I believe, uses FFTW. And there are a bunch of other free software packages that use FFTW. I just went to the Debian GNU/Linux package database, looked at which things use FFTW, and then did a Google search just to see. So, for example, Cinelerra, which is a free video editing package, apparently uses it. Krita, which is an image editing program. There are a whole bunch of scientific programs, for example several molecular dynamics codes: GROMACS and Amber and ESPResSo all use it. There are a bunch of quantum chemistry codes that use it, because the FFT is pretty central in how people usually solve density functional theory for finding electron distributions in solids.
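The "treat the real and imaginary parts as independent real numbers, combining them only at the end" idea mentioned above can be sketched like this (plain Python lists standing in for SIMD registers; the values are made up for illustration):

```python
# Split format: separate arrays of real and imaginary parts.
ar, ai = [1.0, 3.0], [2.0, 4.0]
br, bi = [5.0, 7.0], [6.0, 8.0]

# Complex multiplication carried out entirely as elementwise operations
# on the real and imaginary arrays, combining them only at the end.
# This style of arithmetic maps naturally onto SIMD instructions.
cr = [x * u - y * v for x, y, u, v in zip(ar, ai, br, bi)]
ci = [x * v + y * u for x, y, u, v in zip(ar, ai, br, bi)]

# Same results as native complex arithmetic:
# (1+2j)*(5+6j) = -7+16j and (3+4j)*(7+8j) = -11+52j
print(cr, ci)  # [-7.0, -11.0] [16.0, 52.0]
```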
So ABINIT and VASP, and Quantum ESPRESSO is another package that uses it. There's GNU Radio, which is a software radio program, that uses it. Apparently the MythTV package, a DVR, a digital video recorder (basically a box that sits and controls your television), uses it for some audio processing. There's a 3D game engine called Panda3D that uses FFTs; I have no idea what for. And Matteo, I think you... Yeah, I saw the other day that PulseAudio, which is an audio daemon for GNU/Linux, uses FFTW to implement an equalizer. I was surprised when I tried to install it on Debian and somehow it dragged in the FFTW library. Yeah, and there are probably a couple hundred commercial software packages that have licensed FFTW for use in their code as well. I'm not sure whether I'm at liberty to say what they are, but there are quite a few out there, apparently. So then, what I'm really curious about: it seems like FFTW is currently the leader in market share for FFT stuff. Why do you think FFTW has been such a success?
Well, you know, there are a couple of reasons. The most obvious selling point is performance: it can achieve good performance on a bunch of different platforms. But I think the real underlying reason is that even though scientists say they care about performance, in practice people care about generality even more. They don't want what they can do to be limited by their software, and FFTW, especially when it started out, and even now, is pretty unusual in its degree of generality. It's not only portable, supporting different platforms, threads, MPI, and so forth, but it also supports transforms of any size, including prime sizes, of any dimensionality, of real data, of complex data, all four popular types of discrete cosine transform and four types of discrete sine transform, and mixes: you can do a multi-dimensional transform that has discrete cosine transforms along some directions and discrete sine transforms along others. That kind of generality is pretty unusual. The only thing that comes even close right now, I think, is the Intel Math Kernel Library, which even supports the FFTW interface. But even the Math Kernel Library doesn't support quite the full functionality of FFTW: it doesn't support the variety of data types that we support, or the variety of sine and cosine transforms, and so forth. Basically, we just wanted to solve the problem once and for all, and have a library that will work in all cases, and I think we are very close to that goal. So another question that I like to ask fellow software developers: what version control system do you use, and why?
So, we were using CVS initially, for a long time, because that was just what people used in 1997. At some point we got tired of CVS, as most people do, and we decided to switch to a distributed version control system. It turns out that David Roundy, the guy who wrote the Darcs version control system, actually had an office just a couple of doors down from mine, and I was already using it on several other projects, and I liked the interface, even if I didn't like some other aspects of it. I liked it better than Git at the time; especially early on, Git's interface was very low-level. It has gotten better since then. So we started using Darcs, and we've continued to use Darcs since then. Wow, I think you are the first RCE guest that uses Darcs. Interesting, fascinating. I love this stuff. All right, another question: what license do you guys use? Since you're used in a lot of other software packages, it would be good to hear what you did and why. So, I think in the very first release of FFTW we didn't really think about it. We just put in some boilerplate thing that they were using at the Laboratory for Computer Science, which is free for non-commercial use, or what Richard Stallman calls "semi-free".
And then we quickly found that this just caused a bunch of headaches that we hadn't anticipated. People wanted to include it on a CD that they sold for a couple of bucks, but since that's commercial, they wanted to know if they could do that. People working at a company just wanted to use it internally, within their company, not to sell it or anything, but of course a company is commercial, so they wanted to know if that was allowed. This was just a pain, and so we switched to using the GNU GPL, version 2 or later, and we've pretty much stuck with that. But at the same time, when companies like MathWorks want to use FFTW, they're not willing to use it under the GPL, which would require them to make all of MATLAB free software. So they basically contact MIT and purchase an alternative license: companies can buy an unlimited-use license from MIT for some amount of money. And the GPL worked nicely for that. MIT owns part of the copyright to FFTW, so we had to convince MIT to allow us to switch to the GPL from this original non-commercial-use license that we had adopted without thinking. It was pretty easy to convince them, because the GPL isn't going to cut into their licensing revenue: effectively it's non-commercial use with respect to companies like MathWorks that want to incorporate it into their software, so they still have to buy licenses from MIT, and it doesn't hurt MIT's licensing business. At the same time, the GPL is better for us. It's clear about the kinds of use we want to allow: it's perfectly fine if you use it inside your company, and it's perfectly fine
if you include it on your Linux CD that you sell for a few dollars. And it's compatible with a much larger universe of free software than a semi-free license is. The other possibility was to use something like the LGPL, which we chose not to do, for a couple of reasons. One, from MIT's perspective: if we had used the LGPL, they wouldn't have been able to sell licenses, because MathWorks would have been able to use it for free in MATLAB by just linking it as a shared library. And from our perspective, we don't see any particular reason that we want to subsidize MathWorks. If you want to use FFTW in a product like MATLAB, or some other product that you're selling, then we feel you shouldn't be able to do that for free; you should at least give us a little bit of money. It is a little unfortunate in the sense that, as you know, there's a variety of free software licenses out there, and not all of them are GPL-compatible. We'd love to allow FFTW to be used in software that happens to use the MPL or some other free software license, but just from a legal standpoint, it's too difficult to add exceptions for those things without opening the door to uses that we don't want to allow. Yes, and from this perspective, I want to point out that we're still using GPL version 2, and not version 3, precisely because we want to be compatible with the most software possible that is out there, which is not necessarily compatible with GPLv3. Well, it's GPL version 2 or later, so if you want to distribute FFTW under version 3, you can; if you want to link it with GPLv3 software, you can. I mean, I don't have any objection to version 3 for its own sake; it's just a question of compatibility. Okay, so let's talk a little bit about the future of FFTW. What things do you want to add or see changed?
So, from my own perspective, one of the things I've wanted to do for a while, though it's not clear what the time frame is, is to add direct support for convolutions to FFTW. One of the most common operations you use FFTs for is this thing called a convolution, which is used for filtering, multiplying large numbers, and solving partial differential equations. That involves taking two arrays, FFTing them, multiplying them point by point, and then FFTing them back. And I'm pretty sure that you can do a much better job if you do that as one operation than if you do it as separate FFTs. So that's one thing that I would very much like to work on, once we find time, or maybe get a graduate student to work on it. Of course, there's always performance tweaking, and with the MPI support it would be nice to support more data distribution models. People keep asking us about GPU support; if those continue to be popular, we may eventually have to do something there. From my own standpoint, I'm kind of hoping that GPUs turn into more conventional shared-memory systems, at least from a software development standpoint. They're currently kind of exotic systems to program, which doesn't seem to me to be a good way forward for the computer industry. But we may eventually have to support those. People bugged us for Cell support for ages, you remember there was a big buzz about the Cell processor, and we finally added it, and then I think nobody uses it, as far as I can tell, which was a bit frustrating. Yeah, the issue of supporting new instruction sets is always an important one. For example, the FMA instructions are already implemented in AMD machines, I think, and they are either implemented or will come out soon on Intel processors.
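To go back to the convolution idea from the start of this answer: the FFT, pointwise multiply, inverse-FFT pipeline that direct convolution support would fuse into one operation looks roughly like this (a textbook radix-2 FFT in plain Python for illustration only; FFTW's actual algorithms are far more sophisticated, and the function names here are made up):

```python
import cmath

def fft(x, sign=-1):
    """Minimal recursive radix-2 Cooley-Tukey FFT (length must be a power
    of two). sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], sign)
    odd = fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_circular_convolution(x, y):
    """Circular convolution via the convolution theorem: transform both
    inputs, multiply pointwise, inverse-transform, and rescale by 1/n."""
    n = len(x)
    fx, fy = fft(x), fft(y)
    return [c.real / n for c in fft([a * b for a, b in zip(fx, fy)], sign=+1)]

x = [1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, 0.0, 0.0]  # a one-sample circular delay
print(fft_circular_convolution(x, y))  # approximately [4, 1, 2, 3]
```

Convolving with a delayed impulse just rotates the input, which makes the result easy to check by hand; a fused implementation could skip bit-reversals and share twiddle factors between the forward and inverse passes, which is the kind of saving a single combined operation enables.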
So we'll need to decide what to do about that. There will be new instructions. There's now this MIC processor, which is a derivative of Larrabee, which has its own SIMD instructions that are 512 bits wide, so we'll need to do something about that at some point. Yeah, that's another direction in which we will continue to work. So initially, when we first started working on this in 1997, we got the initial version together, and it was spring break, and we decided to work really hard and get the release out by the end of spring break. And I remember Matteo telling me: you have to be careful, we should just release this and then be done with it. We'll release it, we'll solve the problem, and we can walk away, because otherwise you can spend your entire life doing FFTs. And there are certainly people out there for whom that has become their entire career. It hasn't quite become our entire careers, but it's certainly something that's continuing into the indefinite future. Okay, well, Steven, Matteo, thank you very much for your time. What's the website for FFTW, where people can get information? That's fftw.org. Okay, thank you very much for your time. Thank you guys. Thank you. Thanks, guys.