We got kicked out of the last room. I won't get into the details; let's just say somebody was trying to do fighting and we got in trouble for that, so the university is super pissed, but we're here in the Gates building. We're super excited to have Cartstein Rohnert today; he's our last visitor for the seminar series on hardware-accelerated databases. Cartstein has a master's degree in physics and a PhD in computer science from Leibniz University, and he's a serial startup founder: he is the CEO of Swarm64, which he's going to talk about today, but this is now his sixth startup. All based in Germany? No, some in Norway. So he's here to talk about Swarm64, which is an FPGA accelerator for PostgreSQL. All right, go for it.

Good. Thank you very much for coming, and in such huge numbers; much appreciated. Let me start with some background on Swarm64. We are actually also a Norwegian company, although all of us on the R&D side are in Berlin, sales and marketing are here in the US, and we are building out worldwide: we are building an Asian presence and so on right now. We are 45 people and hiring, so if any one of you is looking for a job, let me know and we can talk; we are also hiring in the US. So we are growing very rapidly. Our theme is database acceleration. We think that's a market ripe for disruption. The reason is that increasingly people are moving off proprietary databases onto open source. Once they are on open source, what they find is that they have trouble with performance. I mean, open source database software has been around a long time. It's really great.
It's stable, it's field proven; only, the performance level is nowhere near the proprietary solutions, and what we're trying to do is bring a completely new angle to it, so that over time (and it really is over time: we continue to improve and improve and improve) we get to the level where the commercial solutions are. Then we want to be a real threat to the established players, because that's what startups are all about, right? Being a threat to the established players. Now, when we started, we set ourselves a few foundational principles. We asked what makes sense and what does not. The first observation we made was: if you put an FPGA into the processing pipeline, then for a pure transactional system the additional latency is going to kill you. So forget about transactions; focus on analytical processing. That was number one. We are focused on analytical processing; we're not accelerating any transactional workloads. We might in the future, we think, but not today, not at all. The second principle we are following is that we do not want to build yet another appliance. FPGAs have been used in databases in appliance form before. We think appliances are not a good idea anymore, because in the cloud age you have to be scalable, and with an appliance your scalability is limited: you can scale your appliance, but at some point it stops, and it's not a homogeneous computing architecture anymore, which is what customers want to get away from. So we don't want an appliance. And we also don't want to fork our host database. The reason is simple: if you start building an appliance, or you fork, you get a big jump in performance for a short time, because you can pull all the tricks, right?
It's kind of like taking a shortcut: you pull all the tricks and you get a performance jump, but thereafter you have to sustain the stream of innovation yourself, and the angle at which you improve your performance is relatively shallow. If instead you stay with the established technologies, you are on the path of inheriting all their performance increases, so the performance curve you are on is much steeper. Over time, appliances always disappear, because they are caught by the evolution of the standard technologies. So: no forking. We are using standard FPGAs, and we are accelerating open source databases, not forking them; we go through their established interfaces and provide our services to the open source database. That's very important for us. As part of that, it's kind of self-evident that we want to stay completely compatible with the host database. We don't want to force code changes on our customers, because that's another very important point for them: they don't want to change their code base. In finance, for example, this is a huge issue. They are regulated, and if you are regulated and you make code changes to a part of your system that falls under a regulation, then you have to get it re-authorized, and that is a big pain. So we can't do that. Also, we have to be compatible with all the tools that are out there in the market.
So we maintain complete tool and other compatibility; that's really important for customers. I already mentioned that we need to scale. On a single machine we scale vertically, but this vertical scaling also needs to be complemented by horizontal scaling, because we are in cloud environments and loads are increasing tremendously. The customers we have: since we started working with them a bit more than a year ago, they have meanwhile tripled the size of their analytical database, and this is how it keeps going, so a year from now they will probably have tripled it again. If we don't offer horizontal scalability, at some point we will lose them, so we provide horizontal scalability as well. Then, how do customers decide whether to use this or that? At the end of the day it's money: the total cost of ownership and all these things play a fundamental role, and this is where FPGAs actually have a huge advantage. Just as an example, compared to GPUs: on our present platform (and this is platform dependent) it takes us 40 watts, with the FPGA fully loaded, to deliver the acceleration. The acceleration is not only the FPGA, there's also a software component, as I will explain, but still: 40 watts, compared to what GPUs draw, which is several times that. And 40 is really the peak, so the average is even lower. A low power consumption is very substantial in terms of the TCO implications. So, I've spoken a long time about this background.
I think it's very important that you understand where we are coming from, so that you understand where we are going. This, in very broad strokes, is how the architecture looks. You have your application, and as I said, you don't change it. Application also means tools, like BI tools such as Tableau or Qlik, and these tools talk to the database. We are sitting below that database. If it is Postgres, then we are pretty much directly integrated, but if it is any other database, then, as I will explain later, we can still help it with analytical processing, through a process I'm going to explain. Within our layer we have the standard Postgres tables, which are just the standard Postgres tables, and then we have the swarm tables. The swarm tables are provided through the foreign data wrapper, which is an interface Postgres provides. The foreign data wrapper lets such tables look to Postgres, and thus to the application, just like any Postgres table, if you do it properly; it's a bit of effort, but you can do it. Once the DBA has created one with the CREATE TABLE command, or CREATE FOREIGN TABLE in this case, it's a table, and the application uses it as it uses any table. That is how we're doing it. But it's a special type of table, and of course it works on any storage architecture: direct attached, network storage, you name it. That's very important, because we need to support enterprise IT environments, we need to support the cloud, and we need to be very flexible. So now, how do we support all kinds of transactional databases?
This is actually done through live replication. Even if you have a Postgres system, you don't want your analytical, OLAP processing to be on the same server as your transaction processing, because as soon as your analytics kicks in, your transaction performance goes down, and you want stable transactional performance. Otherwise your system is in trouble, very obviously, particularly if you are customer facing or you need to meet certain response times; then you can't have that. So regardless of which database we're talking about, the analytical part will always be separated from the transactional part, and then you need to get the data over from the transactional system, if you're working with transactional data. You might instead just have another type of data, like IoT data that simply streams in, but for now I want to focus on this one use case; the other comes later. So if you have a use case with a transactional system and you want to analyze the data that's being created in it, then you need a separate analytical system, and the way we provide that data is through live replication. We can live replicate from all the open source databases. We have not yet set up the live replication capability from other databases, but we are pretty confident that this is also possible; there are extensions available for Postgres, for example, to link up to Oracle, but we haven't tested it yet, so I wouldn't claim it.
It's just a roadmap item. For now: MySQL, MariaDB, Postgres; we can live replicate from all of those. Then we have the data in the analytical system: we replicate into the native Postgres tables, and then you join that, or copy it, in any way you want, to get the data into the analytical tables, and from there you analyze it. Very straightforward. This is how the system architecture looks. We have an FPGA in the system, and as with any such hardware, it's a PCI Express card, so it goes into a PCI Express slot, just like a GPU or whatever else you want to put into your system. Then you have a driver; obviously you must have a driver to control that piece of hardware. The driver needs to get instructions about what to tell the hardware to do, at least in our case, and that's what we call the runtime. The runtime is a piece of software from us. I'm separating it logically from the database here; in terms of software architecture it's actually part of our database extension. Through it we tell the FPGA what to do for certain queries, which is obviously necessary. So you have the runtime element, and then you have the software stack, which plugs into the host database. These are the components. And then, of course, you need what is called an image, in FPGA terms:
the piece of configuration that programs the FPGA and tells it what to do. From the outset an FPGA is like a white piece of paper: you can start scribbling on it and tell it what to do, and what tells it is called an image. This is what we provide. The way we deliver this to customers for trials (later, in production, it's different) is that we just give them a Docker container. They start up the Docker container, and during its startup the FPGA is programmed and we load the extension into the database, and there it goes: you have a fully configured system, it's just running. So what we deliver is only software. The hardware comes from Intel; beginning of next year also Xilinx. We are currently porting to Xilinx and will support it too beginning of next year, but right now it's Intel. And you can buy fully configured servers from a number of server makers; the big ones all have servers which are completely configured, so the OS is right, this particular driver is integrated, and so on. You can just buy the server, our software boots up, and it runs; nothing special to do. Okay, so what's happening on that FPGA? Now it's getting interesting, right? This is what you've been waiting for: what happens on that FPGA. And the level to which I go will be disappointing, for which I apologize in advance, but we are filing patents for some of this stuff (or we have filed them, but they are not yet granted), so I have to be a little bit restrictive; forgive me for that. But in general terms: let me see if this works. Okay, cool.
So, through the plug-in API to the host database we're getting data. Data that comes in, if it is fresh data, goes through this path into the compression/decompression engine on the FPGA, and from there we store it. Involved in the storing of it is what we call the optimized-columns algorithm. That's a special data structure we are using that minimizes the amount of data traffic we need to generate on the SSD in order to grab data later. With this particular algorithm, we kind of know where we are putting the data. We don't know it precisely, but we know it to just the right degree, because knowing it precisely would mean that whenever new data arrives, we have to reorganize everything from scratch. That's of course nonsense; you don't want that, because then your insertion speed just goes down, and you want a very fast insertion speed. The analogy I use (students might know this) is your closet: instead of putting everything away neatly, you take your underwear and throw it up there, you take your shirts and throw them down there, and you take your trousers and throw them over there. So you have all these piles, and within the piles, and across the piles themselves, things are roughly sorted. But it might happen that a sock ends up in the underwear department. And that's okay, because when we fetch the data, on the FPGA we will find it: oh, what is this sock doing here? We are actually looking for underwear. So we just cut out the sock and deliver only the underwear data. And you can also say: I only want the red ones, not the black ones. Okay, fine; then we cut out the black ones and deliver only the red ones.

Audience: Is it a table, is it a column, what is it?

It's a data structure.
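The closet analogy can be put into a small sketch. This is a toy model, not Swarm64's actual on-disk format (which the talk deliberately keeps vague): inserts are dropped cheaply into roughly keyed "piles" without re-sorting anything, and a range read fetches only the candidate piles and then trims out the strays, the "socks", the way the speaker says the FPGA does after the fetch. All names here are invented.

```python
# Toy sketch of the "piles in a closet" idea: data is only roughly placed
# at insert time (cheap), and strays are filtered out at read time.
# Illustrative only; not Swarm64's actual data structure.

class RoughStore:
    def __init__(self, pile_width):
        self.pile_width = pile_width
        self.piles = {}          # pile id -> list of (key, value) rows

    def insert(self, key, value):
        # Cheap, append-only placement: no re-sorting of existing data.
        pile_id = key // self.pile_width
        self.piles.setdefault(pile_id, []).append((key, value))

    def range_query(self, lo, hi):
        # Read only the piles whose key range can overlap [lo, hi] ...
        candidates = [p for p in range(lo // self.pile_width,
                                       hi // self.pile_width + 1)
                      if p in self.piles]
        # ... then cut out the strays (the "socks"), which the talk says
        # happens on the FPGA right after the data is fetched.
        return [(k, v) for p in candidates
                for (k, v) in self.piles[p] if lo <= k <= hi]
```

The point of the roughness is the insert path: placement is a single arithmetic step, so ingestion never pays for a global reorganization.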
You can compare it to a log-structured merge tree, but it's not a log-structured merge tree; the analogy is simply a little bit like that. Okay, so this is how we sort the data when we put it on storage, and I just explained what happens when we get it back: we look it up fairly precisely, and then we decompress it. Once we've decompressed it, the data rate is obviously much higher, right? The compressed data comes in at, hopefully, full PCI Express speed (not always, but if it does), and after decompression the data rate is three to five times higher. So you want to do stuff with it, and you anyway want to do that on the FPGA. What can we do? Well, one thing, of course, is the cutting out I just explained: we make sure that we only get the data we look for. And we also have a hybrid row/column data structure. Analytical databases are mostly columnar databases, and columnar databases have a lot of advantages for data analytics, but they have a huge disadvantage: the insertion speed is relatively low. We wanted high insertion speed, because we believe the whole connected-devices market is an important one, and there you need a lot of insertion speed. And interestingly (this is how we started), what we later found is that, for example, in financial services, where people run nightly jobs for compliance, to make sure that they comply with, say, capital coverage rules, they need to load all the data from a lot of different databases to prepare a nightly run, and then they run it. This collecting and setting up and transforming of data takes a huge amount of time; it often takes more time than the analytics itself, or at least a substantial part of it.
Even if it's only a third, that's huge. With our technology we can not only ingest very fast, we can also transform very fast; that's based on this data structure we're using. So it helps even in cases which we initially didn't have in mind, but where we found it was a good idea. So we have this hybrid structure, and through the bit masks we can then decide which columns we take further up for processing to the CPU. If you have a huge table with, say, 100 columns, and your query only addresses 15 out of the 100, then all the columns that are not addressed we kick out here. In reality we only load maybe 20, because this algorithm also covers the distribution of the columns, so we will load a few more than 15, but not all 100 and then kick out 85. In fact we might load around 20, and then through the bit masking we kick out the columns that we loaded accidentally. So we almost have columnar-database performance, combined with the advantages of a row data structure. Then we have the filter conditions, and that, in simple terms (it does a bit more, but in simple terms), is the hardware implementation of a WHERE clause, of ranges. To get back to my picture: I want all shirts, but yesterday I ate a little too much, so only size L; leave out M and S. Give me all the shirts, but only size L. This is where the other sizes are filtered out and only size L comes through. So, WHERE size equals L, or some range around it; I'm making it up here. That is what the filter does. Selective materialization is what you would think it is. Data conditioning I'm not going to explain too much.
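The 100-column example can be sketched as a bit-mask pass. This is a hypothetical illustration: the real bit masks live in the FPGA pipeline, not in Python, and the block layout that drags in a few extra columns is simulated by hand here.

```python
# Illustrative bit-mask column pruning: blocks come off storage carrying a
# few more columns than the query asked for, and a per-column bit mask
# drops the extras before rows are handed to the CPU. Invented names.

def build_mask(all_columns, wanted):
    """One bit per column: 1 = keep, 0 = drop."""
    return [1 if c in wanted else 0 for c in all_columns]

def apply_mask(loaded_columns, rows, mask_by_name):
    """Keep only the masked-in columns of each loaded row."""
    keep = [i for i, c in enumerate(loaded_columns) if mask_by_name.get(c)]
    names = [loaded_columns[i] for i in keep]
    return names, [[row[i] for i in keep] for row in rows]
```

In the talk's example, the query wants 15 of 100 columns, the layout loads about 20, and the mask discards the roughly 5 loaded by accident.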
Basically, what data conditioning does is arrange the data such that the algorithm with which you analyze it upstairs in the CPU can run faster. For example, one effect we can generate with it is changing the runtime order of an algorithm: if it is order n squared, then in some cases we can change it to, for example, order n log n. If you have a lot of data, you can imagine that this alone gives quite a nice speedup. So things like that: we rearrange the data in a way that the algorithm on the CPU can deal with much more efficiently. To the question of a direct connection between the FPGA and the SSD: we have not implemented that. It's theoretically possible, but for that you need what's called a PCI Express root complex on the FPGA, so that you can establish a direct PCI Express connection to the SSD. We don't have that yet; it's a technical detail, and it can be done. Right now we have some other relatively smart algorithms, and we transfer the data by using the DMA engine on the CPU, so there's no software involved: the DMA engine of the CPU just gets us that data. We also reduce the copying quite a bit here, but there's a little bit of copying left, and that's a negative, so you're right: there's still a lot to be worked on. Okay, so all these things we do on the FPGA after decompression, and then we pass the data on to the software stack, where almost all of the further analytical processing is performed. Then we just give the result; sometimes the complete result, sometimes there's some post-processing happening in Postgres, not much, and then you have the result of your query. So this is how this works. This is a bit more on the optimized columns. I've explained the principle in broad terms, and this is an example: say you have a standard multi-part index.
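The n squared to n log n claim can be illustrated with a generic toy: matching keys between two unordered inputs needs a nested loop, but if the inputs arrive pre-sorted ("conditioned"), a single merge pass suffices. Nothing here is Swarm64-specific; it only shows the kind of complexity change the speaker means.

```python
# Toy instance of "data conditioning": handing the CPU pre-sorted inputs
# lets it use an O(n + m) merge pass (O(n log n) including the sort)
# instead of the O(n * m) nested loop it needs on unordered data.

def nested_loop_match(left, right):
    # O(len(left) * len(right)) comparisons on unconditioned data.
    return [(a, b) for a in left for b in right if a == b]

def merge_match(left_sorted, right_sorted):
    # Single merge pass over pre-sorted inputs; assumes unique keys
    # on each side for simplicity.
    out, i, j = [], 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        if left_sorted[i] == right_sorted[j]:
            out.append((left_sorted[i], right_sorted[j]))
            i += 1
            j += 1
        elif left_sorted[i] < right_sorted[j]:
            i += 1
        else:
            j += 1
    return out
```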
This is kind of an index, if you wish; a special index. That's another way to look at it: you can say it's a log-structured merge tree, but it's also a type of index. We can have up to three dimensions here, so you can think of another dimension added; I've drawn it in two dimensions because I can't draw that well, but you can have a third. So how does this compare, for this very simple task? You're looking for data where the order number is between 150 and 15,000 and where the sold date is between two dates. This is how Postgres encodes dates; don't ask me what those dates are, I don't know, but these are dates. In standard Postgres, the DBA would put an index on it, and with that index you find that data relatively quickly. We ran it, and this is a real example, so to say. What happens is that, first of all, Postgres, or any other relational database, needs to load the leaves of the tree: you have a multi-part index, and you need to load the leaves in order to look up where the data is. In this example, Postgres would have to load (one, two, three, four...) I think it's nine pieces of data. So it loads nine, and then it can continue to process. In the case of the optimized columns, it loads this instead. We've indicated the real data as the highlighted area, and then there is some redundant data; I explained it with the underwear, where a sock might come along with it. So the real data is here, and we load this entire thing, but we only load three blocks. We load three, the FPGA cuts out from those three only the relevant part, and only the relevant part is passed up to the CPU. Obviously this reduces I/O a lot, and I have a little demo, a video, which I will show you later, where you can see it very nicely.
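The block-count comparison on the slide can be mimicked with min/max metadata per block: a block is fetched only if its ranges overlap the query's ranges in both dimensions. The blocks and ranges below are made up for illustration; only the pruning mechanism is the point, not the slide's exact counts.

```python
# Sketch of two-dimensional block pruning: each storage block carries
# (min, max) metadata per indexed dimension, and only blocks whose
# ranges overlap the query ranges are loaded. Block contents invented.

def overlaps(block_range, query_range):
    lo, hi = block_range
    qlo, qhi = query_range
    return lo <= qhi and qlo <= hi

def blocks_to_load(blocks, order_range, date_range):
    """blocks: list of dicts with 'order' and 'date' (min, max) ranges."""
    return [b for b in blocks
            if overlaps(b["order"], order_range)
            and overlaps(b["date"], date_range)]
```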
Yeah, please. [Audience question about whether compression is a bottleneck.] No, it's not yet the bottleneck. Right now we compress between a factor of three and five, depending; compression is always data dependent, right? But three to five is what we typically observe. It is basically gzip, in a somewhat hardware-implemented way. However, as we are progressing (we are working now on a release which will come out in January), we will see for the first time, for several queries, that this becomes the bottleneck. So we are currently working on improved compression schemes; we are improving it because we must. I'll get back to why that is important, because I have a slide which speaks to it; thanks for the question, I will explain in just a second. So this reduces the I/O load, and again, the video will show that. And this is the slide I was referring to; I didn't know it comes up so soon. I should know, but I forgot. Here we go. If you want to accelerate, what you really find is that you need to accelerate consistently along the entire pipeline. Say the data source provides you something at 1x throughput. Then your compression increases the effective throughput (not the raw data rate, but the effective throughput, what you really get in terms of data) by the compression factor. Then we have our preprocessing on the FPGA, and there, depending on the query and so on, we observe between 1x and 5x acceleration of the effective throughput. Then we have the optimized columns, and they have a huge effect: between 3 and 20 times. I will show you examples of how well we accelerate, and then I can show you where it's 20x and where it's 3x; it's very obvious, you will see it. We have some algorithmic optimizations; this is an area where we are just starting.
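In numbers, the slide's argument is that per-stage gains on effective throughput multiply. A minimal sketch, with sample factors chosen from the ranges quoted in the talk (compression 3 to 5x, FPGA preprocessing 1 to 5x, optimized columns 3 to 20x):

```python
# Per-stage gain factors along the pipeline multiply onto the source
# throughput: e.g. 1x source * 3x compression * 2x preprocessing
# * 5x optimized columns = 30x effective throughput, as long as no
# stage starves the ones after it.

from functools import reduce

def effective_throughput(base, stage_factors):
    """Multiply per-stage gain factors onto the base throughput."""
    return reduce(lambda t, f: t * f, stage_factors, base)
```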
So I would say currently we are more at the lower end there, but we are working on getting further. And then another very important area: unless you can load the CPUs fully, so that the CPUs get enough data that the FPGA can process... unless you achieve this, it breaks down, and this is simply Amdahl's law. Look: if a certain amount of processing happens in the database stack and takes time x, and you can't reduce that, while you are already accelerating the rest of the pipeline tremendously, then the acceleration factor is limited by this part. That's simply Amdahl's law. So if you want to overcome this hurdle and make the entire processing faster, Amdahl's law tells you that you need to make everything faster, so this part also needs to come down. That's why we cannot only care about the FPGA. The FPGA, and the scanning part of the database, we can make tremendously fast; but if we do not also help the CPUs, by means of the FPGA, to be more efficient, so that the load on the CPUs goes down, then Amdahl's law bites us from behind and we don't have enough acceleration. Therefore you need the entire pipeline.

Audience: [Question about the speedup figures.] Oh no, this is all throughput; I'm not talking about latency. With latency we are actually very wasteful. The speedup really depends on the type of query; I have a slide on that. And yes, you multiply them; their effect is multiplied. Of course, it's all a little bit artificial; the purpose of this slide is simply to explain that you need all the elements, that you have to go through the entire processing pipeline.
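Amdahl's law, as the speaker invokes it, in formula form: if a fraction p of the runtime is accelerated by a factor s and the rest is untouched, the overall speedup is 1 / ((1 - p) + p / s), which can never exceed 1 / (1 - p) no matter how large s gets. A generic helper, not tied to Swarm64's numbers:

```python
# Amdahl's law: overall speedup when only a fraction p of the work
# is accelerated by factor s. The unaccelerated (1 - p) part caps
# the total at 1 / (1 - p).

def amdahl_speedup(p, s):
    """p: fraction of runtime accelerated (0..1); s: speedup of that part."""
    return 1.0 / ((1.0 - p) + p / s)
```

This is the speaker's point about the software stack: if the scan is nearly free but the rest of query processing is untouched, the total speedup stalls at the cap.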
Otherwise you're not going to get proper acceleration; that's basically what I want to say here. And then I've broken down the elements of what we're doing, what they are today, and where we hope them to go.

Audience: [Question about doing this on the CPU.] Um, actually, no. But look: in order to achieve a greater degree of parallelism on the CPU, what we do, for example, is partition the data for the CPUs when they go parallel on, say, a join. When you do a join in parallel, what limits the degree of parallelism is how often the CPUs need to exchange data. If you partition the data properly, they don't need to exchange data so much, and then your effective degree of parallelism goes up. That's what the FPGA does. Could you do that on the CPU? Yes, of course; you could do everything we do on the FPGA on the CPU.

Audience: So what is the benefit of the FPGA over the CPU? What is the speedup?

Well, if your CPUs are fully loaded, then you're paying for whatever you do on the CPU by reducing your total performance, right? It's like asking why you use a GPU when you could also use the CPU: for some things it's faster, for others it's slower. Let me be specific, just about the compression. Just the compression algorithm we're using today, which is relatively straightforward, consumes at least one complete core on an Intel processor, which has, I don't know, four, six, up to 16 cores. You can say, well, one core; do you really have such a degree of parallelism?
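The partitioning argument from a moment ago can be sketched in plain Python: hash-partition both join inputs on the key first, and each partition pair can then be joined by a separate worker with no cross-worker data exchange, because equal keys always land in the same partition. The talk attributes the partitioning step to the FPGA; this sketch only models the logic, sequentially.

```python
# Sketch of partition-wise parallel join: equal keys hash to the same
# partition, so each (left, right) partition pair can be joined
# independently, with no data exchange between workers.

def hash_partition(rows, key_index, n_parts):
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key_index]) % n_parts].append(row)
    return parts

def partitioned_join(left, right, n_parts=4):
    lparts = hash_partition(left, 0, n_parts)
    rparts = hash_partition(right, 0, n_parts)
    out = []
    # Each (lp, rp) pair could run on its own core.
    for lp, rp in zip(lparts, rparts):
        rindex = {}
        for row in rp:
            rindex.setdefault(row[0], []).append(row)
        for row in lp:
            for match in rindex.get(row[0], []):
                out.append(row + match)
    return out
```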
Maybe you can afford that one core. But then the preprocessing takes another few cores, because after decompression your data rate goes up, so you need many more cores to deal with that data rate. All of a sudden it adds up, and then you're slowing down, because those cores are not available for processing anymore.

Audience: [Question, partly inaudible, about latency over PCI Express from the CPU.] Well, we don't care much about latency, because we care about throughput; we are in data analytics. If we were doing transaction processing, your argument would be absolutely true: for transactions, latency is really, really important. For us, you can't overdo it, but to a degree it doesn't matter; you want throughput. And this is useful not for, say, a hundred-gigabyte database. It starts to have a significant advantage from 300 gigabytes upwards, a terabyte. If you have a very small database, sure, then you wouldn't do this; but with a small database you don't have performance problems anyway, so there's nothing there we can improve.

Audience: You burn cores, but instead of putting in an FPGA you could put in another whole processor.

It is a total-cost question, because the FPGAs are of course much cheaper, so that adds up; but no, even in raw performance. It's very simple: you could implement all of this on the CPU, but look, the FPGA is actually used here, and here, and here, and here: in four of the five steps. This one is a simple software strategy, but we need it in order to have a data structure that is efficiently processable on the FPGA. So the FPGA is actually working on four out of the five steps; that's where it's supporting us. And arguing about the cores: okay, yeah, you're right.
It's an oversimplification. But at the end of the day, if you want to measure whether this is useful or not, you simply look at: is it faster or not? If it's faster, it's good; if it's not faster, it's not good. Faster than running everything on the CPU? Well, first of all it is cheaper; secondly, it's also faster. So there you go. Exactly. Okay, now; I think I've explained it all, so let me quickly go through it. What are we using? We're using this row/column hybrid structure. Basically, this is our tradeoff between the speed at which we can insert and the write amplification, and it is particularly useful in cloud environments, because if you have storage and compute separate, then it is very useful to lower the network traffic requirements, and that you do through compression and other means.

Audience: For this row/column hybrid structure, I'm curious what specifically this means. You have multiple columns; do you store whole columns, the first one, the second one, all the way down to the end? Or, as with something like Parquet, do you have a block, and within the block a pure column store?

Within the block we are storing parts: either a whole column, if it's a small one, or part of a column. So when we load, we're loading all the blocks that belong to that column, and we are allocating... well, it can be... The block has a fixed size.
So if the column is not a multiple of the block size, then you might already have the beginning of the next column within one of the blocks. This is what I meant with the sock that is part of the underwear: if you only want that column, you're also getting part of the next column, and you just cut it out on the FPGA. So we are distributing the columns onto many blocks; maybe one block if it's a small column, many blocks if it's a large one.

[Audience] So this way you build the entire table by distributing the columns, indexed by the rows, and use this optimized column algorithm to store them?

Yes, very well observed. Every data structure like this is going to help you read less data when you do the scan.

[Audience] But how does it improve the ingestion speed? You still have to write all the data you get.

Yes, the ingestion-speed improvement is actually not from that particular technique. It comes again from the compression, because we need to push less data across, and it also comes because we don't have an indexing overhead. We have built-in implicit indexing, and indexing data before you commit it in a database is a very expensive workload: it takes a long time to compute the index and then store it. If you compute the index later, you still have that overhead, you've just shifted it. And in many cases, without an index the database performance is just super poor. Therefore the net insertion rate we achieve, compared to a database that at some point needs to index, is so much higher.

[Audience] Right, so that's basically what I'm saying. At the end of the day you also need to write the same amount of data, but the amount of processing you need to do before you write it is much less.
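The talk doesn't spell out how the implicit indexing works, but a common way to get index-like skipping without building an explicit index is to record per-block min/max metadata (zone maps) as a cheap side effect of writing each block. A minimal sketch under that assumption (my illustration, not necessarily Swarm64's actual scheme):

```python
# Zone-map sketch: per-block min/max collected while writing, consulted at
# query time to skip blocks. Hypothetical illustration of implicit indexing.

def write_blocks(values, block_size):
    """Split values into blocks, recording min/max per block as we go."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append({"data": chunk, "min": min(chunk), "max": max(chunk)})
    return blocks

def scan(blocks, lo, hi):
    """Return values in [lo, hi], skipping blocks whose range can't overlap."""
    out, blocks_read = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # skipped without touching the block's data
        blocks_read += 1
        out.extend(v for v in b["data"] if lo <= v <= hi)
    return out, blocks_read

blocks = write_blocks(list(range(1000)), block_size=100)  # clustered data
hits, read = scan(blocks, 250, 260)
print(read, len(hits))  # 1 of 10 blocks touched, 11 matching values
```

The writer pays only a running min/max per block, which is why ingest stays fast compared to maintaining a B-tree up front.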
That's basically what's happening.

[Audience] But the indexing is working on rows; don't you still have to do multiple I/Os to access a single logical row?

Well, you do as many I/Os as you need in order to fetch all that data. We have the algorithm that distributes the data column-wise and row-wise onto those storage blocks, and it basically makes sure that we only load roughly as much data as is necessary. There is a little bit of overhead, so typically we're loading a little bit more, and that's what I explained the FPGA then cuts out. So there is some overhead, but it's not gigantic. This is where I would now have to explain what I don't want to explain, so pardon me. But your question is very good.

So: we eliminate the indexing overhead, which is exactly what I just explained, and we have the compression. We've also implemented it such that we have a very high latency tolerance, and that's important when you work in a network-storage environment. If your analytics process has a step that depends on data which you only pull as a condition of what you found earlier, an if-statement of sorts, and if you implement it in a way where you really have to wait until that data arrives, then you are very latency-sensitive, and that slows down this type of analytics. We've built it in a smarter way that makes sure we have the data earlier; again, I will not explain how we do that, I anticipate your question. This makes us very latency-tolerant, and therefore we work extremely well in network-storage environments. It's also a little bit frustrating to disk makers.
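The packing he describes, columns laid end-to-end across fixed-size blocks so that fetching one column may drag in the start of the next, which then gets trimmed off, can be sketched roughly like this (my own reconstruction from the description, with a tiny block size for demonstration):

```python
# Fixed-size blocks with columns packed back-to-back. Loading a column means
# loading every block it touches, then trimming the overhang (the "sock that
# is part of the underwear"). Reconstruction for illustration only.
BLOCK = 4  # bytes per block; real systems use far larger blocks

def pack(columns):
    """columns: dict name -> bytes. Returns (blocks, extents name -> (start, end))."""
    stream, extents, pos = b"", {}, 0
    for name, data in columns.items():
        extents[name] = (pos, pos + len(data))
        stream += data
        pos += len(data)
    blocks = [stream[i:i + BLOCK] for i in range(0, len(stream), BLOCK)]
    return blocks, extents

def load_column(blocks, extents, name):
    """Fetch every block the column touches, then cut out just the column."""
    start, end = extents[name]
    first, last = start // BLOCK, (end - 1) // BLOCK
    raw = b"".join(blocks[first:last + 1])              # blocks actually read
    col = raw[start - first * BLOCK : end - first * BLOCK]  # trim overhang
    return col, (last - first + 1)

blocks, ext = pack({"a": b"AAAAAA", "b": b"BBBBB"})
col, nblocks = load_column(blocks, ext, "a")
print(col, nblocks)  # column "a" back intact, at the cost of 2 block reads
```

The overhead he concedes is visible here: reading column `a` touches a block that also holds the first bytes of `b`, which get discarded after the read.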
So we ran a benchmark on a very fast SSD, on a slower one, and then on a very cheap one behind a hardware RAID, which has really high latency and poor performance. Our performance between the very fast SSD and the cheap hardware-RAID SSDs is only minimally different, because we just work around the latency. The only difference is that you need enough of the cheap SSDs to match the bandwidth of the very fast storage; but if you match the bandwidth, we can tolerate the additional latency. So it's bad for SSD makers, because where we run, you don't need the low-latency SSDs. Fortunately for them, there's still the transactional side, and that needs them. But we don't.

Okay, then, on the software side: we have the row-column hybrid structure, I spoke about that, and the optimized columns. Another element of the compression scheme is that we keep all data compressed everywhere, including in main memory, and that of course means we have much better caching. If you keep your data compressed in main memory, the amount of data available in your caches is x times higher, x being the compression factor, and we exploit that. This helps the system-level caches, like the I/O caches Linux keeps, and the caches the database keeps. Throughout the entire caching hierarchy, even the cache hits of the CPU are improved, because we keep the data compressed everywhere and only decompress it when it's needed. Okay, so, I think I've explained all that already; I just have it here as backup material in case you later want to read it and have forgotten what I explained on all those graphs. Pardon me? No, this is what I just explained; I've explained all of this on the prior slides.
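The caching claim is straightforward arithmetic: if every layer of the hierarchy holds compressed pages, each layer effectively caches the compression factor times more logical data. A back-of-the-envelope sketch (the compression factor of 4 and the sizes are my own example numbers, not Swarm64 figures):

```python
# Effective cache capacity when data is kept compressed end-to-end.
def effective_capacity_gb(cache_gb, compression_factor):
    return cache_gb * compression_factor

def cache_hit_fraction(cache_gb, working_set_gb, compression_factor=1):
    """Fraction of a uniformly accessed working set that fits in cache."""
    return min(1.0, effective_capacity_gb(cache_gb, compression_factor) / working_set_gb)

# 64 GB of RAM caching a 512 GB working set:
print(cache_hit_fraction(64, 512))      # 0.125 when cached uncompressed
print(cache_hit_fraction(64, 512, 4))   # 0.5 when cached compressed (4x)
```

The same multiplier applies at each level (CPU caches, database buffers, OS page cache), which is why he stresses that decompression happens only at the point of use.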
So I just wrote it down in case you later want to watch it on YouTube and then understand what the guy meant; here it's all written down again. Okay, so now: this is my first video, and I really want to show you this. Here you see the effect of the FPGA and of the optimized columns. The black line is disk I/O, the yellow line is the FPGA, and the bluish color is the CPU load. This is query 12 of the TPC-H benchmark on a 300-gigabyte data set. As I said, the larger the data set, the better the performance, but then it also takes longer, so we've chosen something meaningful. What you see is that the FPGA kicks in at up to almost two gigabytes per second, and later runs at approximately one gigabyte per second. On our side there's very little disk I/O, very little happening; whereas over here you have a much higher I/O load the whole time, because Postgres is reading, reading, reading. And this is already Postgres 11, which is a lot faster than prior versions. We are done after 28 seconds. I don't want to make you wait until the other side finishes, because it takes forever, so let me fast-forward. And here you are: we are 20 times faster, and this is a relatively complex query; I will show you later other queries where we are even a lot faster than this. So this is the combination of the FPGA, the optimized column structure, and everything I explained. Now, this is what I meant: we can discuss whether you could also do it on a CPU. Maybe someone can build something like that and add more CPUs, I don't know. We chose to do it on the FPGA and we achieve fantastic results, so, you know, that's the proof of the pudding. And I have more performance examples. Actually, why am I advancing it by hand there? I don't know.
So this is what I was announcing when I said performance also depends on certain factors, on the type of the query. Here you see the throughput drawn by query type. This is a very join-heavy query; we have not yet done a really good job of accelerating joins. It's on the roadmap for Q1/Q2 next year, and this number will go up a lot, because then we will have more support for joins on the FPGA. There we are getting an effective throughput of two gigabytes per second, which is roughly three to four times faster than Postgres, but this can grow. Here, you've just seen it: this was 20 times faster. Query 6 of TPC-H is a query where you're basically doing a very deep scan; it's a very simple query, and we accelerate it by approximately 100 times, literally 100 times faster. This is getting into the region where you can of course also run things very nicely on GPUs, because this is simple stuff; they do it well, and if they keep it in main memory I would think they are still a little bit faster. But they would completely fail over here; this part they have to push back to the CPU, and there we get very close. And if you have a very simple drill-down, like the famous New York taxi-ride data set that people often hold up: you're looking for a taxi that left between one and two p.m., that was going in a certain direction, and had some other characteristic; now give me all the taxis to which this applies, on a terabyte database. That kind of thing we can do at an effective throughput rate of 256 gigabytes per second: you have a terabyte of data, and in four seconds you get the result back. Again, a GPU might do that in less than one second, but who cares at that point?
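"Effective throughput" here means logical data covered per wall-clock second, not physical I/O; with compression and block skipping the physical rate can be much lower. The taxi figure is then simple division (the 256 GB/s number is his; the arithmetic below just checks it, taking 1 TB as 1024 GB):

```python
# Effective scan throughput = logical bytes covered / elapsed seconds.
def effective_throughput_gb_s(logical_gb, seconds):
    return logical_gb / seconds

# A 1 TB (1024 GB) drill-down answered in four seconds:
print(effective_throughput_gb_s(1024, 4))  # 256.0 GB/s effective
```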
And this is really because, at the end of the day, a customer runs a whole set of queries: they have lots of these and lots of these, and if you look at the complete workload, then it's really important that you accelerate everything. That's what we do, and our entire development effort is going here. We're not doing much over here anymore, because we think it's fast enough for the time being; here is where we are working, and this will go up, and as a side effect these will also go up a little bit, but not so much. So here is where we are working, because we believe that what customers care about at the end of the day is the whole workload. Thanks, James. Good, so: questions? Yeah.

[Audience] Okay, one question while you have a sip of water. Is it possible to keep all your algorithms and software the same but use a GPU instead of the FPGA? Another way to ask this: what do you think is the main advantage of FPGAs over GPUs? Is it that it's cheaper, that you can program it, that you have more memory, or what's the main advantage?

Now James has left, but he would nod heavily if I say that the main advantage is that we have a much stronger roadmap with the FPGA. We've chosen a path for our product development which is a little bit more difficult than if you just put everything into the main memory of a GPU, where you have HBM and you push everything through the GPU; that kind of stuff you can do really well, and for the time being, maybe you could do what we do also on a GPU. But there are a lot of new technologies coming up for the FPGA.
Partial reconfigurability, which means you can load new functionality onto the FPGA at runtime; that means you can react to what's happening and make different functionality available. FPGAs are getting memory coherency, so protocols like QPI are being made available. HBM is coming to FPGAs; we don't have HBM today, and this is limiting. If HBM were not coming it would limit us, but it's coming. So from a roadmap perspective, and from where FPGAs can go, there's a lot of room for more improvement. As I said, I would dare to claim that on an overall complex workload, a typical customer's complex workload, not just simple stuff, we are better than the GPUs already today; and from a roadmap perspective we have a lot more capabilities. That's why we've chosen FPGAs. And you get all of that for 40 watts instead of 220, and this is a real consideration. Data centers are limited by the amount of power they have available; at the end of the day you can only do so much in one data center, because you only have a certain power supply. So if you can do something with 40 watts instead of 200-something watts, that is so much more you can do in that specific data center. It's money, and that's what counts. So we believe this is the better roadmap, and we are willing to invest in the pain of delivering it step by step. Yeah, everything we do today is RTL-built. High-level synthesis is getting better and better, so we can then start synthesizing things; partial reconfiguration, so you see the possibilities, and all of this is stuff we are working on. We think this gives us a really, really strong roadmap. However, it's difficult stuff, so you have to have people who understand it; but we've built a team, and with that we can do it.

Okay, so: a few more examples. Oops, that went too fast; it's a very sensitive little thing.
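The power argument is also just division: under a fixed power envelope, a 40 W card lets you deploy several times more accelerators than a 220 W one. The 40 W and 220 W figures are from the talk; the 10 kW budget below is an assumed example:

```python
# Accelerator cards per fixed power budget (ignoring host CPU and storage
# power, which would be common to both configurations).
def cards_per_budget(budget_w, card_w):
    return budget_w // card_w

BUDGET_W = 10_000  # 10 kW of rack power, illustrative assumption
print(cards_per_budget(BUDGET_W, 40))   # 250 FPGA cards
print(cards_per_budget(BUDGET_W, 220))  # 45 GPU cards
```

Whether 250 FPGAs outperform 45 GPUs depends on per-card throughput, of course; the point he's making is only that power, not rack space, is the binding constraint.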
So: we can monitor two Ethernet lines in terms of insertion speed. We can monitor two 10-gigabit Ethernet lines at full speed, i.e. with the smallest IP packets possible, and insert that data into the database in real time. This translates into 30 million rows per second. The data rate is not very high, but the number of rows, the number of tuples inserted per second, is extremely high, and nobody else can do that; even in-memory databases find this super difficult. I think we are beating in-memory databases here, I would claim; they can comment on YouTube if that's not the case.

And then, for this particular use case, this is for network security. Say I had a break-in into my system, a hack, and now I want to know who did it, which protocols were used, which IP addresses were involved, and so on. If you want to look for that, then very quickly, and this is the drill-down type of analytics I mentioned earlier, and these are real benchmark data, we can drill down through that data, with the type of analytics you use to get a first grasp of what happened and which actors were involved, at more than 3 billion, 3.4 billion rows per second. So you get an answer extremely fast in this kind of environment.

And this is the insertion performance over the size of the tuple. Here the orange line is the number of eight-byte columns we are inserting, starting with two, then four, and so on up to 40, and this is the resulting rate. It's a rough measurement, so I've smoothed the curve just to show the trend better.
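The 30-million-rows-per-second figure matches the line rate of two 10 GbE links at minimum frame size: a minimum Ethernet frame occupies 84 bytes on the wire (64-byte frame plus 8-byte preamble and 12-byte inter-frame gap), which caps each link at about 14.88 million packets per second. Assuming one row per packet, as the monitoring scenario implies:

```python
# Line-rate packet arithmetic for 10 Gb Ethernet at minimum frame size.
LINK_BPS = 10_000_000_000      # 10 Gbit/s
WIRE_BYTES = 64 + 8 + 12       # min frame + preamble + inter-frame gap

pps_per_link = LINK_BPS / (WIRE_BYTES * 8)
rows_per_second = 2 * pps_per_link  # two monitored links, one row per packet

print(round(pps_per_link))     # ~14,880,952 packets/s per link
print(round(rows_per_second))  # ~29.8 million rows/s, i.e. the quoted ~30M
```

So "monitoring two 10 GbE lines at full speed" and "30 million rows per second" are the same claim stated two ways.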
This starts at something like 700 megabytes per second for the very high rate of 38 million tuples per second, and then it goes up to close to 1.2 gigabytes per second when you have larger tuples. So this is basically the dependency between the number of columns and the insertion speed in tuples per second.

[Audience] Is that a single box?

Yeah, single box, one FPGA only, and so on. If you have more, then this goes up accordingly; it scales very nicely.

Okay, this is basically the proof, so that you don't feel I'm telling you a lot of things without proving them; I like to prove stuff, and this shows it. So this is again a video, of insertion: starting up, there's always some latency involved, and here you go. This is 18 million rows per second; this is an older measurement, but the important thing is that Postgres is here, down here, you see the little line. Of course this is super-high insertion performance, but this is what we do, and this is what plain Postgres can do. So: very, very fast.

[Audience] A tuple shows up, you shove it to the FPGA, it compresses it, and then you write it back out to the SSD?

Yeah, and all of this is committed to the SSD at the end of the day; it's fully committed. Yeah, ACID-compliant.

Okay, and another example is IoT machine control. Since we are using all the capability of Postgres, we can even work with things like JSON data structures. If you have an IoT network where your sensor data is delivered in JSON format, no problem: we can do it because Postgres can do it, and we are relying on Postgres. This compares to Mongo: we had a customer who was using Mongo and wanted this accelerated, and we could. This was actually on Postgres 9.6; meanwhile we're a lot faster, but even that was quite impressive. Another example is this one; let me go back.
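The two ends of that curve are consistent once you convert tuples per second into bytes per second (which also confirms the low end must be in megabytes, not gigabytes, per second as transcribed). The widths are in eight-byte columns as on the slide; the figures are read off a smoothed curve, so treat them as approximate:

```python
# Insertion bandwidth for a given tuple rate and tuple width.
COL_BYTES = 8  # eight-byte columns, per the slide

def ingest_bytes_per_s(tuples_per_s, n_columns):
    return tuples_per_s * n_columns * COL_BYTES

# Narrow end: 38M tuples/s of two 8-byte columns.
print(ingest_bytes_per_s(38_000_000, 2) / 1e6)  # 608.0 MB/s, the "~700 MB/s" end
# Wide end: 1.2 GB/s over 40-column tuples implies the tuple rate drops.
print(1.2e9 / (40 * COL_BYTES) / 1e6)           # 3.75 (million tuples/s)
```

So as tuples get wider, bytes per second rises while tuples per second falls, which is exactly the trade-off the curve shows.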
Another example: a customer that does financial analysis. They extract, for many marketplaces, the order books, the trades, and the market data. This is not in real time, because they do high-frequency trading: they collect all the data in a file inside the stock exchange (for high-frequency trading you have your servers very close to the stock exchange's servers). They write this file, they call it a tick file, throughout the day; then at night they download it, put it into a staging database, and transform the data so that they have a combination of orders, trades, and the market information that was happening, Twitter feeds and a lot of different market data. They combine all of that, and in the production database they then have a data structure where they can ask: okay, after this tweet, what happened to the order books? What happened to the orders? Were orders cancelled? That kind of thing. The reason they do this, obviously, is to better understand trading strategies, so that the next day they can adapt, and the day after they can adapt again, and get better over time. This is pretty complex; in particular the transform step is highly non-trivial, because these different data sets also don't arrive at precisely the same points in time. It's very complex, but we accelerate it quite a bit. Here is again a video that shows this, and the videos I need to start by hand, as you've meanwhile learned. So here we go; you will see what's happening. It's starting. Okay.
So this is filling up: the staging database, a Postgres database, is filling up, and then from staging we transform immediately into the analytics DB. As soon as data arrives it's being transformed, and as you see we keep a constant lag of one second. This fills up and this fills up; and here's Postgres, and its lag is growing and growing, because the transform process just can't keep up. When this is full, the business intelligence will run: we've taken 10 queries from TPC-H, and we run them on this data now that it is available. So here the business intelligence is starting, this is still filling up, and this emulates the process I've described for this customer. Okay, so this keeps going; there's some fast-forward built in, otherwise you would be standing here a long time. Let's see... ah, yeah, okay. So, ingestion: we needed four minutes and 24 seconds, versus 14 minutes for Postgres. Transform: again four minutes for us; for Postgres this will run a little bit longer. Business intelligence: we are already running while Postgres is still waiting for the data to become available. It keeps going and going; we will be done very soon here. It keeps moving forward, but it just takes long, so let me show you the end result, so I'm not wasting too much time. Oops, no, I haven't done a good job here in grabbing it... and here we go. The end result: Swarm64 needed 66 minutes for this data set, and Postgres needed 310. That's, I believe, four-and-a-half-times-or-so acceleration, and this is very complex.
This is really complex stuff: you're inserting at high speed, you have a complex transform, and then you're running your business intelligence queries, and this entire process we accelerate by four and a half times. And this is really what we're targeting, to get back to the GPU question. We believe this is what customers need, and we're seeing it at many, many customers: they have a complex workload made up of a lot of things, and what they want is everything fast. It comes back to Amdahl's law at a higher level: if you only accelerate some queries and make them super fast, but there are a lot of other queries where you don't accelerate at all, or even slow down, that's not the business value. At least that's how we look at it. Okay, so. Yeah?

[Audience] Does this integrate into existing systems?

Yes, very well, because we are Postgres-compatible. This is totally application-compatible: you can plug it into your system, no code changes, nothing. You need to create the tables and schemas, your DBA needs to do that, but thereafter everything is completely compatible. The FPGA is virtualizable, so you can run this entire thing in a fully virtualized environment. We are supported by all the common tools, and we are compatible with all the front-end and back-end tools. Here's an example of a typical environment: Kafka, RabbitMQ, all these ingest tools we can link up to, and they can insert into a Swarm64 cluster; you can have storage separate or not, and so on. All the analytics tools, we support all of that. So you can rip out your existing Postgres or other database, put this into your existing environment, and it will just work. And there's more possible; for example, I like this one, the spatial extension. There are so many extensions.
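His workload-level Amdahl's-law point can be made concrete: since total time is the sum of the stage times, the end-to-end speedup is bounded by whatever you fail to accelerate. The 310-minute and 66-minute totals are from the demo; the per-stage split and per-stage factors below are hypothetical numbers chosen only to illustrate the bound:

```python
# Workload-level Amdahl's law: overall speedup over summed stage times.
def total_time(stages, speedups):
    """stages: baseline minutes per stage; speedups: per-stage factors."""
    return sum(t / s for t, s in zip(stages, speedups))

baseline = [14, 100, 196]  # hypothetical ingest/transform/BI split, 310 min total

everything = total_time(baseline, [3.2, 4.7, 4.8])   # accelerate all stages
one_stage = total_time(baseline, [3.2, 1.0, 1.0])    # accelerate ingest only

print(round(310 / everything, 1))  # 4.7x overall
print(round(310 / one_stage, 1))   # 1.0x: barely any end-to-end gain
```

Even a 100x ingest speedup could never beat roughly 1.05x overall here, which is exactly why he argues against accelerating only the easy queries.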
You can have Twitter extensions, you have whatever; it all works. So, in summary, I started with what our founding principles were. We're extending Postgres, accelerating OLAP, and it co-exists with transactional systems; that's very important. You can have your transactional system and your OLAP system separate from it, but they can be linked, so that you have more or less real-time OLAP queries running while you're running your transactions. We fork nothing, it's not an appliance, and there's no lock-in: if someone doesn't like us anymore, they can always replace us with pure Postgres. It's slower, but at least it works. We retain full SQL compatibility with the host DB, including all the features, all the interfaces, and so on. It scales vertically, and it scales horizontally; for that we are working with Postgres-XL, so we also have horizontal scalability. And it integrates easily into all the common IT infrastructures and tools. Thank you very much, it was a pleasure presenting this to you.

[Host] One last question, come on.

[Audience] What's the hardest thing about building an FPGA accelerator?

The long, long time it takes to implement things on FPGAs today, because high-level synthesis is not yet good enough. It's coming, and we very much look forward to reducing the implementation time, because software development is so much faster than FPGA development, and it's all interlinked; that is really, really difficult. But it's getting better. Okay.

[Host] All right guys, thank you for coming. I thought we had some amazing talks this semester. Before we go, we want to thank [inaudible] for arranging all of this. Thank you, Kevin. All right, so that's it for the semester. We do have a [inaudible] talk next Monday if you're interested, and then we'll pick up in the spring semester. All right guys, thank you.