in our seminar series on time series databases. So we're really excited to have Fintan Quill from Kx come give a lecture. KDB is probably one of the oldest time series databases, having been around since the 1990s. And for me personally, their architecture heavily influenced the design of the H-Store system, even though H-Store was not a time series database. So again, part of what the seminar series is about is trying to understand what time series databases are, and why KDB didn't solve this problem 20 years ago. Fintan has been with Kx for 12 years now. Prior to that, he did his undergrad at Trinity College in Ireland, and has a first-class degree in computer science and computer engineering. So I'm really happy to have Fintan here today, and take it away. Thank you very much, Andy. It's nice to meet you all. Thanks to everybody at CMU, on behalf of KX and myself; it's a great honor to speak here, and thanks for the hospitality you've all shown. So hopefully I won't bore you all too much today. Let me know if you can't hear me, by the way; I usually try to speak as loud as I can, but if you can't hear me down the back, just let me know. So these are just a couple of boring company slides at the beginning, just to get them out of the way. A little bit about KX: we were founded in 1993, and we're a subsidiary of an Irish company by the name of First Derivatives, so we're publicly traded on the London Stock Exchange. We have 2,000-plus employees worldwide and a big global presence. Here are some of our customers and partners. We've mostly been involved in financial services, and in recent years we've branched out into other industries as well: telco, IoT, you name it, anything that needs time series. There's nothing specific to finance in our technology; it just so happens that that was the industry we grew out of that had the need for big data back then. So here's just a sample of some of the companies. First of all, I wanted to go back and do a bit of a history lesson and show the ideas and methodologies that fed into KX and KDB+. And this is just a funny slide showing where we go back in time. It all goes back to this gentleman here, Kenneth Iverson. He was a Canadian, he worked at IBM, he was an IBM Fellow, and he was at Harvard as well. And he started a language by the name of APL. The idea being that it was a vector programming language: everything was vectors, you thought of everything in lists. What he wanted to do was basically create a programming language that was as near to scientific notation as possible, something that scientists and mathematicians could write out close to what they had learned. And he wrote his Turing Award lecture on this, a paper called Notation as a Tool of Thought. If you haven't read it, I'd highly recommend it; it's a very nice and elegant way of expressing ideas, and it all harks back to this vector programming language. One of the things with APL was that it had its own keyboard, so it had this weird, kind of hieroglyphic-style character set, and for some people it was quite difficult to understand. So here's an example of APL code showing how to find prime numbers; here's what it looked like. Obviously, if you're not familiar with the APL language, it can be a little bit difficult to understand.
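I won't try to decode the APL on the slide, but just to give a flavour of how that vector style carries through to our own language, here is a rough q analogue of a prime-number one-liner; this is my own toy sketch, not the code on the slide:

    {x where {(x>1) and all 0<x mod 2_til x} each x} til 20    / 2 3 5 7 11 13 17 19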
But the idea is that it achieved a lot within one relatively short line of code. And then along came the founder of KX, a gentleman by the name of Arthur Whitney. He was effectively Ken's protege; he was also a fellow Canadian, and he grew up with Ken very much as a father figure in his life. He learned from Ken, and he created a prototype language called A, and A+, and then a language called J, which he worked on with Ken Iverson. And then he came up with a language called K, which is the predecessor to our current language, Q. So it was an APL-style language, but using the ASCII character set rather than APL's weird hieroglyphic character set. He was trying to bring it slowly into a more modern setting, onto the QWERTY-style keyboard. So this was the K language. At the start it was purely the K language; that's what KX started out as, basically this functional programming language company. And then eventually we got into building a columnar database. So this is a sample from the Book of Kells here; that's from my alma mater, Trinity College. And the idea here is that a lot of these lists and vector notations that we came across in the APL language translated very nicely to the columnar structure. And of course, in recent years columnar databases have become very sexy, and people doing row-based databases have been left behind. Row-based databases are somewhat of a thing of the past; there's still a place for them somewhere, certainly, but everything has moved more towards columnar now with the increase in big data. So the K language then came to KDB, which was the first iteration of the database. And then we have the current iteration of the database, which came about in 2003, called KDB+. But people use the names kind of interchangeably; a lot of the data models are, in essence, the same. So what is KDB+, then? KDB+ is a unified columnar database and programming system. So it's not just a database; it's not like your traditional database server and application server, where you push data back and forth between the two. It's a fully Turing-complete programming system, and we'll get into the Q language that sits on top of it a little later. So I really just want to emphasize that point: it's not a traditional database where we extract the data from it and do our analysis elsewhere. With this, we can do the analysis directly on the data. And it follows the Lambda architecture. Obviously that's become quite prevalent in recent years, Twitter and stuff like that, but finance has been using the Lambda architecture since the late 70s and the 80s with the advent of high-frequency trading. The idea being that you had your streaming data, your real-time or in-memory data, and your historical data being used in the one platform with the one programming language. Compare that to, say, the Hadoop style of Frankenstein monster, where you use about 17 different technologies for the different pieces, and they don't necessarily integrate together very well. Whereas with us, we have the one database paradigm and the one programming language that knits everything together. So in some ways it's a one-stop shop, somewhat like a Primanti Brothers sandwich, where you get the fries, the meat, and the bread.
So what it's all about, basically, is in-database analytics, the idea being that you bring the analytics to the database. As I mentioned, it's not like your traditional analytics server plus database server. It also supports joins: all the standard SQL-like joins, and then we'll go on to some time series joins that it supports later as well. And the nice thing as well is that it's a 500 kilobyte binary, so it fits almost entirely inside the L1 or L2 cache of a modern machine. So it's very short code paths, and it just leaves more space for the data, essentially. It's not like a lot of the other big data solutions nowadays that have gigabyte installs; this is a very small code size. Wait, the entire database system is 500 kilobytes? Yep. Everything? Yep. Exactly, yeah. So we have a free version that you can download; you go to our website, I'll have a slide about that later on, and you can download it right there and then. You can install it on your Raspberry Pi or whatever it might be. So it's quite nice. So this is a very typical KDB+ architecture diagram. This is by no means the be-all-and-end-all solution, but it's a reference architecture, as it were. You've got your data coming in on the left. That could be coming from any streaming source: a stock exchange, an internal trading execution engine, a Fitbit device, a sensor, a smart meter, whatever it may be. And that's typically parsed using a feed handler, which is usually written in a compiled language for performance purposes. And then all of these blue boxes here that you see residing in memory are typically written in our Q programming language that I'll talk about later. So there are no specialist libraries here written in C++; this is all written in the one language. So first of all, the data comes into our events engine, or, as it's sometimes called in finance, a ticker plant. What that does is distribute data downstream; it's a data distribution engine. But the first thing it does is log its messages down to this transaction log file here on disk. This is for recovery purposes: if any of your downstream processes go down for whatever reason, you can get full replication and full recovery from this transaction log file. Then the simplest example of a subscriber to this events engine would be what we call the real-time database, or the RDB. What that does is basically just subscribe to everything. It opens a fire hose, gets all of the data, and inserts it into in-memory tables. So it's a pretty simple process in many ways: not terribly CPU intensive, but it can be quite memory intensive as the tables grow and grow throughout the day. But then if you've got one very specific calculation that you want to run, or dedicated engines, like: I want to get the volume-weighted average price for these particular stocks, or I want to get the average kilowatt-hour usage for this particular zip code of smart meters, then you can set up what we call a streaming query engine. Once again, these can be written in the Q language; they can also be written in other languages that can subscribe to Q. So you can spin up multiples of these, say for the VWAP pricing engine in finance, for example.
And then you will only subscribe for a subset of the data, maybe trade data, and maybe only for the ticker symbols that we're trading in our portfolio. So it typically gets a lot less data. But when it gets the data, it starts running calculations on the fly; it's doing the calculations as the data comes in. So it's much more CPU intensive, but it's typically a lot less memory intensive, as it might just store maybe one record per subscription. So it's way less memory, but way more CPU; kind of the reverse profile from the real-time database engine. Then what typically happens at end of day, or whatever the preconfigured interval might be, is that the events engine will send a message to its subscribers saying it's now end of day or end of session or whatever it might be. In the case of the real-time database, what it does is purge its contents down to what we call the historical database, and the historical database creates a new shard, or a new partition, on disk for that day. So basically all of your current data is sitting in memory, so it's very fast to access, and your older data is sitting on disk in your historical database process. And if we go on to the next slide, you'll see that essentially everything is just column arrays, and the way the data persists down to disk is essentially the same structure, albeit on disk. So it's essentially just an image on disk of the data in memory. So the historical database can take advantage of the operating system file cache: when I go back and query yesterday's data, it will get page-faulted into memory, and any subsequent queries will get near in-memory performance. And that's all done at the OS level; you don't have to care about it as the end user. So the data model is essentially the same both in memory and on disk; it follows this columnar, or vector-style, structure. So the data model holds up quite well. It's a very simplified data model, and whether you're querying a streaming engine, the in-memory engine, or the on-disk engine, the queries look almost exactly the same; it's a select-style syntax. We'll get on to that in a little bit, though. And we support many different data types; you want to be flexible in the database itself. Here are just some of them. We've got Boolean types, numeric types, and character types. We've got a symbol data type, which is quite important, and we've got a GUID, which is basically a UUID data type, so if you've got highly unique data, you can get very fast search performance on that. We support enumerated data types, and we also support dictionaries, tables, keyed tables, and functions as first-class data types. And there are other data types as well. We also support time series data types, seeing as we're talking about time series today. We have a data type for date, time, minute, second, month, datetime, timespan, and timestamp; those last two are nanosecond data types. And it can convert between these time series data types very quickly on the fly as well. So it means you can store your time series data in its most granular form, but then if you want to do bucketing or something like that, you can convert between those data types on the fly very easily. So you don't need to store your data down in different resolutions; you can just store it once in its most granular form.
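Just to put a rough shape on that, here's a small sketch of those casts in q; the values are made up for illustration:

    ts:2013.01.02D09:30:00.123456789   / nanosecond timestamp
    `date$ts                           / 2013.01.02
    `minute$ts                         / 09:30
    `second$ts                         / 09:30:00
    `time$ts                           / 09:30:00.123
    2013.01.02+7                       / simple temporal arithmetic: 2013.01.09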
And then you can do roll-ups very fast. So on top of this platform, you want to put a nice, elegant language, something very expressive. And with that, what we came up with is the current language, which is called the Q language. Here are just some of the aspects of the Q language. First of all, it's a functional language. It comes with 200-plus functions out of the box. It has all of the standard database functions, like your aggregate functions: first, maximum, last, et cetera. It also has matrix functions, so matrix multiplication and inversion. It also has trigonometric functions, geometric functions, moving average, weighted average, moving sum, weighted sum; a ton of different functions. But you can also create your own functions and plug them into the database, so you're not just restricted to those 200 or so functions. It's also an array, or vector, programming language. What that means is that most of the functions you run are array- or vector-based, so you get rid of a lot of the clunky control structures you have in other programming languages. You typically don't have for loops, while loops, do loops. That reduces your code base pretty significantly, and it lends itself to neater, more elegant code. But it also has a query language in it, and it's all in the one language; it's not like PL/SQL and SQL, where it's two completely separate paradigms. We have select statements, update statements, delete statements, so it's kind of like a superset of SQL. And you can merge functions into your queries as well: you can create your own function and then plug that into the query. It's also an interpreted environment, as we'll see in a second, so you can type the query and get the result back straight away. It changes the methodology compared to compiled environments, where you have to compile some piece of code, go off, get a coffee, come back, and then it will be ready to run. With KDB, you can start prototyping straight away. So it changes the way developers think when they're running queries and building this type of thing. It's also a language for time series, so it has a lot of time series functions and joins. There are some functions we'll see later: xbar, which is for bucketing or binning of data, so you can create uniform buckets of data, or non-uniform buckets as well. And we also have two specific bi-temporal joins, one called an as-of join and another called a window join, and I'll get on to them a little later. You can also do temporal arithmetic: add to a date, subtract from a date, add seconds, subtract seconds. So you can do a lot of very nice analysis on time series data, and it fits very nicely with the KDB+ model and the Q language. Then there are also several different attributes that come with KDB+. Attributes is the word we have for the concept of the indices you'd have in other traditional databases. We support four different types of attributes, and they're denoted by a backtick, a letter, and then a hash sign after it. So backtick-s-hash, `s#, is the sorted attribute.
So that means your search algorithm will change if a field, a table, or a column is sorted. And you can have sorted on an entire table, or on a dictionary, or on any of these data types. Then there's parted, `p#. What parted basically means is that, within a given partition, you have a step function: your data within a given shard or partition on disk is held contiguously. Then we have unique, `u#; so obviously, if you're doing that needle-in-a-haystack style query on an ID column or something like that, you get very fast performance. And then we have grouped, `g#, which is like a more traditional index, a hash-map style index like you'd build in a traditional database. So we have these four different types of attributes, and they can change the performance of a query very drastically; you can get way faster performance where you can take advantage of them. Some of them are better placed for in-memory data structures, and some are better placed for on-disk data structures, so you've got some flexibility there. But it's not like other databases, where a lot of the time they'll just index every single column and all of a sudden there's this huge memory overhead. You typically only tend to put these attributes on maybe one or two columns of your table. So you use them quite sparingly, but you'll find the performance is very good. But these are attributes that you're defining when you create the table? Exactly, yeah. And when data is coming in, say with the grouped attribute on an in-memory table, it will maintain that attribute. Yeah. So then, of course, we can take advantage of parallelization. Obviously that's a big thing nowadays as we're getting into multi-core architectures and multi-machine architectures; you want to be able to take advantage of these things. From a vertical scaling perspective, we have the ability to do multi-threading, and all of the multi-threading is done under the hood, so it's abstracted away from the end user; you just start your process with multiple threads. And one of the nice things that we put in more recently is that we've gotten rid of serialization between these threads, so you get much better performance with these multi-threaded queries. And then we automatically distribute the queries across the CPU cores. So there's no heavy, clunky, multi-threaded API on top of the database that you have to learn; it's very simple, and it's abstracted away from the end user. If there are certain query patterns that are automatically parallelizable, KDB+ will parallelize them. So that's how we scale up. And that's quite popular in finance, where they've got data coming in throughout the day and they're trading on that data, so they want really fast access. So they usually tend to scale up by adding more memory and more cores, rather than more machines; you get into that scale-up versus scale-out, east coast versus west coast debate. It's not quite as cool as the rap battles in the 90s, but we get the idea. And then we've got horizontal scaling as well.
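Going back to those attributes for a second, applying one is just a one-liner; here's a hedged sketch with a throwaway table:

    t:([] sym:`a`b`a`c; price:1 2 3 4f)
    update `g#sym from `t      / grouped: hash-style index on the sym column
    `s#1 2 3 5                 / sorted attribute on a vector that is already in order
    / `u# (unique) and `p# (parted) are applied the same way

Anyway, back to scaling out.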
So that's for more historical types of workloads, like big historical research queries or back-testing and stuff like that, where you're going through petabytes of historical data; or maybe for regulatory reports, where you might need to keep five-plus years' worth of trading data for reporting to the authorities. For that, we can do a multi-processing model; we're not just restricted to multi-threading. So we've got a master-slave setup, and you can set up nice failover and stuff like that as well. And once again, the queries are automatically distributed over the multiple hosts, so they don't even have to be within the same machine. And even within one database, you can take advantage of tiered storage as well, so that's nice too. We also exploit some of the Intel vector instruction sets, so if you're running on some of the latest Intel machines, you can take advantage of those wider register widths. It obviously lends itself quite well to our data model of supporting vectors, so we can automatically vectorize some of those calculations for some of the primitive functions. Other than that, KDB+ itself is written almost entirely in ANSI C; hence the 500 kilobyte executable. We also support native MapReduce. I mean, MapReduce has been around forever; even though some of these West Coast companies like to think they invented it, APL basically had MapReduce-style semantics way back in the 50s, in the mainframe days. So we have MapReduce style built in, whether you're using the multi-threading or the multi-processing model, and we can take advantage of that. And then we get on to compression. Let's take a quick sip of water and let this slide sink in. So we support compression from a few different viewpoints. Built inside the executable isn't just a database; we also have a WebSocket server and client, and an HTTP server, all inside that 500 kilobyte executable. It's a fully compliant WebSocket server and it supports the WebSocket compression standard. The second type of compression is in-flight compression between hosts: if you're sending large amounts of data, say from machine A to machine B, you can compress that data before you send it over the wire, so you get better bang for your buck in terms of bandwidth. And then we also support on-disk compression, which of course is very important when you're storing years and petabytes worth of financial market data or streaming sensor data, et cetera. We support four different compression algorithms. The first is our own internal algorithm, which is the same algorithm we use for the in-flight compression. The second is gzip. Then we support Google's Snappy. And then we support the LZ4 compression algorithm; we have that in our latest beta version, and it will hopefully go into production in the next version later next year. And you can obviously offload some of this to hardware compression boards as well. KDB+ has very few library dependencies; the only library dependencies we have are actually on these compression libraries. Everything else just uses standard OS system calls. So it's very close to bare metal, which is why you get the very fast performance. Can you say a little bit about what KDB's compression algorithm actually does compared to Snappy or LZ4?
So basically, it doesn't compress quite as well, but it was all built for speed, really. It was built more towards speed, and specifically towards doing a delta on time series data. That was one of the very important things: obviously, when you're doing nanosecond time stamping, you're getting very granular timestamps, and they don't compress as well. So speed on time series is what it was optimized towards. With gzip, for example, you get way better compression, but then the decompression is going to be a lot slower. And so then we get on to security. Obviously, this is one of my favorite little Twitter memes that I've seen. So we support encryption for in-flight data: we support SSL and TLS encryption. You can do row-level and column-level security as well. We don't support on-disk encryption at this point, but all of our data that's on disk is typically just in file and folder format, so we can rely on the OS file system to provide encryption there. But it's something that we may take a look at. Then in terms of APIs, we want to be able to connect to a lot of these cool different tools that are out there nowadays. Not everybody necessarily needs to use the Q language; you can use other libraries as well, but still take advantage of the power of KDB+. Here are just a few examples; there are many, many more. If you're interfacing with other database solutions, or maybe with a Tableau front end, we have ODBC and JDBC libraries. And then we've got native libraries for C, C++, Java, Python, R, Kafka, et cetera. These libraries typically work in one of two ways. They either just open TCP/IP socket connections and push data back and forth, or, another nice thing you can do with KDB+, you can actually embed shared objects: you can load shared objects into the same memory space. That's in fact how the Python and R APIs work, where you can have the Python interpreter and the Q interpreter running in the same memory space. So you could run, say, a NumPy function or routine on KDB+ data, or conversely run maybe a Pandas routine, or a KDB function on a Pandas data frame. That's quite nice, because then you don't need to copy data back and forth, and you don't need to go through the TCP/IP stack; you can avoid that. So then we can move on and actually see it in action. We'll have a look at our demo environment, so you'll see it for the very first time, hopefully. Or has anybody actually played with KDB+ here before? No? OK, cool. So the demo environment that I have here is 31 terabytes in total. It's real New York Stock Exchange data, from January of 2013 onwards up until yesterday; in fact, we load it on a nightly basis. It has about 1.3 trillion records. It has the raw trade and quote data, the raw message data that comes in, so very large tables, and it has some derived data as well: for each date and each stock, it has the open, high, low, and close pre-calculated. And the machine that we're running on has about half a terabyte of RAM, but to be honest, most of the queries that we're running here are on disk, so we actually use very little memory; we wouldn't even need that much in practice for the queries we're going to run. In terms of cores, we typically turn off hyperthreading, and they're clocked at 3.5 gigahertz, so pretty fast.
And the queries are run fresh from disk, and it's all spinning disk as well; there are no SSDs or NVMes backing this data, albeit they're 15K RPM drives, so pretty fast. Why do you turn off the hyperthreading? Sure, why do we turn off hyperthreading? Well, it's one of those things that traditionally we used to do, because the hyperthreads used to get in the way: we'd be able to drive the CPU so fast that we found the hyperthreads would actually hurt performance somewhat. But to be honest, more and more nowadays we find in some cases we turn it on and it has negligible effect, and sometimes it might even increase performance. So the lines there have become somewhat blurred. So I'll turn off the PowerPoint, I guess. OK, so here we are in the Q environment. As you can see, I can do basic math: we do 1 plus 2, it returns 3, so we're doing well. And I can create lists very easily: I can do a equals 1, 2, 3, 4, and then I can do, say, sum of a. And I can do sums of a, which gives me the running sum. So you've got a lot of these different types of functions. And I can create a table very easily. I'm going to create a simple table, t, which has two columns: a equals 1, 2, 3, and b equals, say, a, b, and c. And then I'll create a third column, c, which is going to be 4, 5, and 6. So here I go, I have my simple table here. I can do select sum of a from t. So you can start creating tables and these data structures very quickly and easily. And I can start creating functions. I'll create my own function, just using these curly braces, and I can do x plus y for my two input parameters. Now I'll just clear the screen. So now we've got our function, and I can apply it directly inside a query: I can do, say, select func of the columns a and c from t, and you see we get the result back here; we've summed the values from those columns. Likewise, I can create a second function. So you can have functions inside of queries, but you can also have queries inside of functions. I can do, say, select from t where a equals x, where x is our input parameter. And then I can do func2 of 2, just pass in the value 2, and it will return the corresponding row from the database. Pardon? Variable x is the input; how did the system know that x was the input? Oh, so yeah, we have the concept of implicit parameters. You can pass up to three implicit parameters: x for the first, y for the second, and z for the third. And then you can also put in explicit parameters, and you can do that very easily. You can see I'll create func3 here, and I'll put in my parameter, what we just called param, and I can explicitly declare it here inside the definition. So if you want to write more readable code, rather than having x's and y's and z's everywhere, I can do this. Then I can just change this around, do func3 and pass in 2, and you see the result comes back. So you see the idea here, where you can have queries inside functions and functions inside queries; it all melds together very nicely. So now we'll go into the database; we're actually in the database process that we're currently running. You see that there are very large tables here. You can do a simple count, and the count returns very fast, whereas doing a count on a table in another database typically takes a long time. So you see there's about 771 billion records in this table, so it's quite a large table.
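For reference, the little snippets from that walkthrough look roughly like this; I'm re-typing them from memory, so treat it as a sketch:

    a:1 2 3 4
    sum a                              / 10
    sums a                             / 1 3 6 10, the running sum
    t:([] a:1 2 3; b:`a`b`c; c:4 5 6)
    select sum a from t
    func:{x+y}                         / implicit parameters x and y
    select func[a;c] from t            / a function inside a query
    func2:{select from t where a=x}    / a query inside a function
    func2 2
    func3:{[param] select from t where a=param}
    func3 2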
So how many dates are in the table? It has 1,213 days. The last date, just to show you it has real data, is yesterday's date; the first date is from, I think, January 2nd of 2013. So you can see you can start doing counts. So the first query I'm going to run is against this relatively small table, a daily table. For each date and each ticker symbol, it has the open, high, low, and close. So I'll just copy the query across first; hopefully my copy-paste skills will be good. This is a pretty simple SQL-like statement. First of all, I'm going to put a backslash-t in front of the query, and what that does is time the query; it returns the amount of time taken in milliseconds. As you see, this statement is doing a select of the date, the open, the high, the low, and the close from this daily table where the symbol equals Apple. So if you're used to an SQL database, there isn't too much of a jump in terms of the syntax; it's relatively straightforward. So I'll run this, and it takes about 530 milliseconds; that's the number of milliseconds taken to execute the query. I'll run it a second time, and now you see this effect of caching: it's cached the data in memory, so it's about 10 to 15 times faster. Excuse me, what does that back-tick mean? So this back-tick denotes what we call our symbol data type in the Q language, which is like a varchar in standard SQL databases. We denote it by the back-tick because we like to use as little code as possible; if we can drop a character rather than having enclosing quotes, we use the back-tick. So this is the symbol data type, which is essentially an interned character string. It basically means it's one object, rather than, say, four characters as a vector. So yeah, it's a special data type in KDB+. So now if I just run that query, you'll see what it looks like; it just looks like a table that's returned. So we return the data, and obviously it returns in a pretty fast time, but it's not a huge amount of data. What was it, about 9 million-plus records? So large, but not very, very large. We want to do queries over billions of records. So the next query I'm going to run gets the open, the high, the low, and the close, but I'm going to do it from the raw data; I'm going to run these aggregate functions on our trade table. And we're going to bucket it up: we're going to group the data by date, and then by five-minute bucket, so two levels of bucketing. First of all, I'm going to time it again just so you see it. What we're doing here is selecting the open, which we're going to define as the first price; this is just the same as doing first price as open. And then the high is going to be the maximum price, the low is going to be the minimum price, and the close is going to be the last price. And because KDB+ is a time series, columnar database, it's already ordered by definition; it's not like an SQL row-based database, which doesn't infer order. So that means we merge the group-by and the order-by inside the select statement, and we call it the by clause. And so we aggregate the data by date and then by five-minute interval. And the time column in our database is stored to millisecond resolution.
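Written out, that query is roughly the following; the table and column names (trade, sym, price, time) are as on the demo box, so take the exact spelling as an assumption:

    select open:first price, high:max price, low:min price, close:last price
      by date, minute:5 xbar time.minute from trade
      where date within (.z.d-7;.z.d), sym=`AAPL, time within 09:30:00.000 16:00:00.000
    / prefix the line with \t in the console to get the elapsed milliseconds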
And so by doing time.minute, we automatically do the conversion from millisecond timestamps to minute timestamps. And then this xbar function that I mentioned earlier does the bucketing, or binning, of the data, and the number on the left-hand side gives the resolution of the bucket; we want five-minute buckets. So we aggregate the data by date and then by five minutes. And it's from our trade data, which is the raw message data for each and every trade. And we're doing that where the date is within the last seven days. This .z.d variable is just the current day, so it just means today; today minus seven, so you see that temporal arithmetic on today. And then the symbol equals Apple, and then the time is within; the within statement is like doing between x and y. So it's between 9:30 and 4 o'clock, which are the opening hours of the New York Stock Exchange. OK, so I'll quickly run that query. So we're aggregating over seven days' worth of data and bucketing it, and it takes 98 milliseconds to run. We'll run it again, and now it's about eight times faster the second time, due to this caching effect. So I'll quickly show you the data. You see that what this returns is a keyed table, because remember, we've aggregated the data by date and then by five-minute bucket. So you see we've got nice five-minute uniform buckets, and then we've got our open, our high, our low, and our close. So basically this row means that between 9:30:00 and 9:34:59.999, this was the open price, et cetera, et cetera. And we do this bucketing and binning very quickly. And then we can change it: if we want to do, say, a 30-second window instead of a five-minute window, we can go ahead and change that. So just to show you I'm not making anything up, you can do that; it's overflowing on the console here, but you see the idea, and now the bucket column is a second data type instead. So you can do this bucketing and binning of the data very easily, and you see the queries return very fast. So if you wanted to, does anybody want to give me a stock that they want to see? MDB? Is that on the NYSE? I'm trying to think what stock that is. There we go, see, the result comes back pretty much straight away. That's Mongo. Oh yeah, they just IPO'd. Excellent. The enemy! All right, NVDA, that's Nvidia, you said Nvidia. Yeah, exactly. So you see, this is the real tick-by-tick data. So now what I want to do is show you one of these, what we call bi-temporal queries. What I mean by that is joining two different time series. If you think about it in practice, in the world of finance (but this could obviously be any other industry), you've got your quote data: your broker saying, I want to buy and sell this stock, giving you the two prices. So they give you that, and you store it on your quote table at, say, 10 a.m. And then maybe you want to trade on that, but you might not trade until a few seconds later or something like that. So then you've got a trade record and you've got a quote record, but they don't necessarily have the same time.
So if you try to join the data using one of the standard SQL joins, you'll get a mismatch, because they're not exactly time-aligned. So what we've built is a few different types of time series joins, one of which I'll show you today, called the as-of join. What the as-of join does is take that trade record from your trade table at, say, 10:02, and do a look-back into your quote table; and if it can't find a quote record at exactly 10:02, it looks back in time and finds the last available value, the value that's prevailing as of that time. So that's a function built directly into the Q language. Let me just go back to the last query and show you the number of records we were actually aggregating over when we ran that aggregation: we aggregated over 745,000 records in just over, I think, 18 milliseconds. So you can see the sort of query speed you can get there. But yeah, I'll go back to the as-of join; sorry for the digression. And I'll run this as-of join. I'll just do the backslash-t. So this as-of join is just a function; you see this functional syntax with the square brackets. And the as-of join takes three parameters. The first parameter is the field, or the column, that we're joining the data on; that's this back-tick time. And then we have the semicolon separating the arguments. The second argument is the table that we're going to do the lookup from: we're selecting the time and the price from our trade table where the date is the last date, the symbol equals Apple, and the time is within 9:30 and 4 p.m. And then the last parameter is the table that we're going to do the lookup into, which is the quote table, or the NBBO, which just means national best bid and offer: the best quote available as of that time. So we're selecting the time, the bid, and the ask from our NBBO table where the date is the last date, the symbol equals Apple, and the time is within that window. So I run that query and it takes 87 milliseconds. We'll run it again, and you see it's about four times faster. Obviously the caching effect isn't as noticeable this time, because we already cached some of that trade data in the prior query, so the caching isn't quite as accentuated on the second run. So what I'll do now is show you what the data looks like. Yep? And you write the join as two separate queries? Yep, so it's two queries, and then it's joining based on the time column. The time that it shows here is actually the time off of our trade record, the second argument, or the left-hand-side table in the result, as it were. And then the price: this is the price that was traded, and then this is the bid and the offer. At the beginning of the day there's no bid and offer, because that trade probably came in just at the 9:30 open, before any quote arrived. So this is the prevailing bid and the prevailing ask. And then you might ask, well, why don't you show the quote time as well? The as-of join actually suppresses the secondary time, but you can get around that easily by duplicating the time column and creating a column called qtime; we'll just duplicate that column, and now it'll show it.
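Spelled out, that as-of join is roughly as follows; again, the table and column names, and the quote-time trick at the end, are my shorthand for what's on screen:

    aj[`time;
      select time, price from trade
        where date=max date, sym=`AAPL, time within 09:30:00.000 16:00:00.000;
      select time, bid, ask from nbbo
        where date=max date, sym=`AAPL, time within 09:30:00.000 16:00:00.000]
    / to also see the quote's own time, duplicate it in the second table:
    / select time, qtime:time, bid, ask from nbbo where ...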
So if we look down through the records, and we look at this record here, you'll notice that the time on the trade is actually one millisecond after that actual quote came in. So you see the ability to do this: it does a binary search back in time to find the last available value, so you get very fast performance. So I'll just show you... Is there a way to show the query plan for that query? Like, can you see it from the terminal? Oh, yeah, yeah. So there we go, 20 milliseconds to run that. No, sorry, like, you do EXPLAIN in SQL to see the query plan tree that was actually used to execute it. No, we don't have anything like that. Well, we do have a profiler, but it's a separate library that you load; I don't have it loaded in here, but I could certainly send you information on that. Is it doing like a hash join? What's the join algorithm for this? Yeah, so it does a binary search lookup, kind of like a hash type of join, back in time on the table. It cuts off, basically, as of that time and searches back. And because of the way we order the data on disk, with the data for a given symbol held contiguously but ordered by time, it can do the join back in time quite fast. So as you see, it does about 130,000-plus joins, or searches rather, in 20 milliseconds. And so that's the as-of join concept. Then there's a more generalized version of the as-of join, which is called a window join. There's an extra parameter where you specify a window of time: so you could have something like two seconds before an event and one second after an event, and within that three-second window I'm going to get, say, the average price, or the first price, or the last price. So that's the more generalized version; the as-of join is the specific version where it just gets the last value as of that time. So I'll go back to my slides now. Question? Yeah? So based on what you said, it seems like it's optimized for joining on time. That particular join is optimized for joining on time, but we do support left join, inner join; we also have a join called a plus join, which adds the matching numeric columns. So we support the standard SQL joins as well; we just have these specific time series joins too. And there are also variants of the as-of join and the window join for whether you include or exclude the ends of the window, whether it's closed intervals or open intervals; so there's aj0 and wj1, for example. So then, just in closing, a few more resources. We've done quite a few benchmarks over the years, and there are two particular benchmarks that we show here. One is the STAC benchmark; STAC is the Securities Technology Analysis Center, and they do a lot of financial, FinTech benchmarks. The one we're involved with there is called the M3 benchmark. What's nice about this is that it runs on a variety of different hardware stacks, with databases of three different sizes, and they run 17 different analytic tests. And they actually run them in their lab, so it's all independently audited and verified. It's not somebody cooking their own books and making something nice for the marketing department; these are real audited and verified figures.
And we've been chosen as the database of choice for, I think, about 95 or 98 percent of those particular benchmarks, and we hold basically all of the records there. So it's really nice to show that, because you know benchmarks: you always have to take them with a pinch of salt. But another nice thing about the STAC benchmarks is that the financial institutions themselves come together and come up with the benchmarks, so it's not one of these pointless database benchmarks. The other benchmark we do is the New York City taxi benchmark. This started off with a blogger a few years ago; he started creating different benchmarks. We recently ran one with the Xeon Phi, and we've done some others that we'll be releasing shortly as well. If you go to that particular blog, you'll see that he's done it on a variety of other solutions, like Redshift, you name it, and other database solutions, and he's benchmarked them all. So once again, he's run them himself; it's not like we're cooking the books to make ourselves look good, he's been able to verify it. So it's a good way of showing the kind of performance we can get. As I mentioned, we do have a free version of KDB+ for non-commercial use; you can go to kx.com, which is our website, pretty short, pretty easy to remember. You just give an email address; it doesn't even have to be a real one, but please put in a real one for our marketing team. We've got builds available for Windows and Linux, both on ARM and regular x86, and then we've got an OSX build. So KDB+, as I mentioned, is a pure software solution. We also have an academic license as well; and indeed, I believe they're actually using it at Carnegie Mellon's New York campus, in the financial engineering class. So we do give free, full 64-bit versions to academic institutions. If you're interested in that, come speak to me afterwards, or we also have an academic page on our website. For developers, there's a pretty cool website called code.kx.com. kx.com is kind of the commercial website, if you want to see some of our other offerings on top of the database; but we've also got code.kx.com if you want to start learning the language, look up white papers, or learn all of these different functions and see examples of how they work with edge cases, like nulls and infinities and things like that. It's really useful; we've put a lot of work into it, so we're quite proud of it. And there's also a Google group you can join; it has, I think, 1,400 or 1,500 members. So you can go there and ask a question: I have this query, I want to make it a bit faster; I have this library, I'm having trouble. It's a very responsive community, people answer questions very quickly, and we at KX monitor it ourselves, so if a question goes unanswered, we'll make sure it gets answered for you. And there's a GitHub presence as well; there are two GitHub repositories. One is a community repository, where we basically scour GitHub once a month, take all of the libraries that have been written to integrate with KDB+, and gather them together in an index. And then we have our own GitHub repository at KX, which has the KX-approved libraries, say our C# API or C++ API, et cetera. So these libraries are pretty cool.
There are visual tools built on top of KDB, some of which we have ourselves; we have our own visual tools. And we have a variety of meetups throughout the world as well; if you go to kx.meetup.com, we've got meetups everywhere. In New York we have one every month, with two or three speakers, so it's pretty cool. Then there are two books available. We actually recently purchased the rights to Q for Mortals, and that's now being fully put onto the code.kx.com site that I mentioned, so you can get an electronic version, but you can still get the print version as well. That's a real nuts-and-bolts guide to learning the Q language from start to finish; it's very well written. The second book is called Q Tips. It's written by a guy called Nick Psaris, who works for a customer of ours, and that's much more from the implementation side of things: he shows you how to build an actual application or system using KDB+. So the two books complement each other quite well: if you want to learn the language, you can use Q for Mortals, and if you want to get into the implementation side of things, like how do I build some complex event processing trading engine, you can use the Q Tips book. And that's about it from me; I think I've finished before time, actually, and I thought I was going to go over. Obviously, follow us on Twitter, or me on Twitter, or drop me an email if you've got any questions. We've also built some tools on top of the database in recent years: we have a visualization tool called Dashboards, and we've got an integrated development environment called the Analyst tool, which was actually designed by the guys who created the Eclipse IDE; they're based in our research labs up in Ottawa, in Canada. So we'll be bringing out more information about that over the next few months. The Analyst tool in particular is really cool for data scientists. We have a built-in ggplot library, written entirely in our Q language, completely separate from the R implementation, and that's pretty cool. We have syntax highlighting, and maybe we can come back and do a demo on that someday; we also have videos available for it. So if you don't want this command-line environment and you want a more traditional, user-friendly IDE environment, that's pretty cool, and we recommend taking a look at it. And if there are any other questions, feel free to ask away. All right, let's thank them. Thank you very much. Time for some questions. So as Andy mentioned, this is kind of a series of talks we've been hearing about time series databases. Sure. How does KDB fit into this broader scheme of things? Maybe you don't want to say this on video, but are you seeing yourselves winning out over certain things, and maybe being challenged in certain segments by other things? Yeah, I'd say, because for years finance was our bread-and-butter environment, as it were, and finance, I think, has been ahead of the curve, especially from a streaming data perspective, because they've always wanted prices in straight away. And that's why in the last few years we've started to get out into other industries; we're doing lots of smart meter implementations now. And everything is kind of time series now, which is kind of interesting.
I see people in the last few years talking about unstructured data as the next big thing, but to my mind, structured data is going to make a massive comeback with all the IoT, industrial IoT, because computers understand structure, computers understand binary data. So the speed that we've needed for finance is now helping us in these other industries. I think one thing as well is the simplified data model, as I say, between the historical and the real-time and the streaming, having that one ecosystem. In the last year, if you look at the Gartner reports where they show the Gartner hype curve, Hadoop is kind of in that trough of disillusionment at the moment, and I think people are kicking back a bit against putting in these very expensive Hadoop installations with multiple machines, when all they needed was one machine with maybe 256 gigs of RAM or something like that. So we see that a lot of the things we've solved in finance, these other industries are now approaching, because they're finally getting streaming data and they finally need those low-latency response times. So in some ways we feel like we've been a bit ahead of the curve, and they're nice problems to solve. There are things like, say, geospatial problems that we're now having to solve, so we're changing things around, maybe adding different data types or putting in different libraries to that effect. So we almost see the industry coming towards us in some ways, but we're still moving towards the industry in other aspects as well. We're doing stuff around unstructured data, like we're putting together a natural language processing library that's going into that Analyst IDE tool I mentioned, and then stuff like the ggplot. So there are elements of that where there's time series, but we're not just a pure time series database; we can also do relational structures, we support foreign key joins. So we're not exclusively a time series database; it is fully relational as well, and we feel we can mix both of those worlds together. But we are very schema-oriented, so we're very different from, say, the schema-less folks, where everything is done ad hoc on the fly; we're very defined from day one, but you can convert between the data types very easily, as we've seen. So being schema-driven doesn't mean we're completely restricted; we can do the conversions pretty quickly. So I don't know if that quite answers your question, yes and no; stream of consciousness, I am from Dublin. Who are your biggest competitors? Who do you see in your space? Yeah, so it depends on the industry. If we go outside of finance, we'll be seeing a lot of the traditional databases, your Oracles and your IBMs, just because that's what those industries are used to, like the utilities industry and so on. They didn't have huge data volumes, and they're only now getting into the streaming game; relational was perfectly fine for their use cases, it was batch loads from devices at end of day, but now they're getting more streaming and they need that time series capability. And within finance, probably initially our biggest competitor back in the day would have been the relational databases. But now what we find is, oftentimes if you go to, say, hedge funds and prop shops, it will actually be home-baked solutions built on flat-file systems.
So, something just written purely on HDF5 or something like that, where they just want a flat-file solution. But where we find we're winning out there is the fact that we have a fully complete programming system. What happens with those other systems, the flat-file systems, is that they're fine if you've got one very strict use case that you're adhering to. But if you want to look at your data in a different way, like we want to run back-testing and not just stream real-time data, then they find they have a bit of a problem. So by being a general programming system and having this fully Turing-complete programming language on top, we offer that flexibility within the language; it's a programming and analytics environment as well. Whereas the flat-file systems are great, but maybe for one very strict use case. And that's what we find we come up against. And we hear people talking about other time series databases as well. What we've got, I think, by just being in finance, is that speed and performance; that's really been our modus operandi from day one: speed, performance, and elegance, really. Like the notation of the language, which isn't for everybody necessarily, but that's why we have these open-source APIs built on top of the database as well. So what a lot of people will do is do the heavy lifting, these large aggregate queries, in KDB, and then push that subset of data into their machine learning libraries or their Python routines, because obviously those languages aren't necessarily built for big data, like real, real big data per se. So it's kind of a mixture; it really depends on the industry. We come across the traditional relational databases, and then we come across hacked-together flat-file systems. So yeah, it's very interesting; you go to different industries and get completely different competition and so on. Can you talk a little bit about the system architecture? How are you guys managing memory? What does the runtime look like? Sure, yeah. So we support garbage collection, and it's configurable. You can have immediate mode, where it does garbage collection automatically, or you can turn that off, or you can do it where it's manually garbage collected. So that's quite nice; you have that flexibility. Obviously, if people are doing very low-latency stuff, the last thing they want is garbage collection kicking in at the wrong moment. And the memory allocation is done via buddy memory allocation, so it's powers of two. It's a relatively simplified memory model, and not necessarily the most sophisticated, but it's all basically about speed: it was the fastest and most elegant memory model we could put together for the performance we were trying to get. So certainly there probably are better memory allocation models out there. And for the on-disk data, we use memory-mapping techniques: we memory-map the data and then read only the subsection of those files, paging it into memory, so we don't necessarily have to read the entire column. And the compression works at the individual file level, so you could compress maybe an individual column for an individual day.
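To put rough syntax on those two knobs, the garbage collection modes and the per-file compression, here's a hedged sketch from memory; the paths are made up:

    / start the process with "q -g 1" for immediate (automatic) garbage collection
    .Q.gc[]            / or trigger a collection manually from inside a session
    .z.zd:17 2 6       / default compression for new files: 128kB blocks, gzip, level 6
    / compress one existing column file for one day:
    -19!(`:db/2013.01.02/trade/price; `:dbz/2013.01.02/trade/price; 17; 2; 6)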
And so you can get very granular with the compression as well. And as I mentioned, you can use tiered storage. What a lot of our customers will do is, maybe for the last week's worth of data, they'll leave it uncompressed on very fast local storage, say SSD or NVMe, or maybe sometime down the line even one of these Optane or 3D NAND devices coming out from Intel. But the older data they might compress and push down onto a SAN or NAS type of file system. To the end user, though, it just looks like one monolithic database. The compression and decompression is taken care of by KDB+ itself; the select statement looks exactly the same whether the data is uncompressed or compressed. So that's one nice thing: the architecture can be a multi-tiered architecture, and that's how you can get these huge database installations, you know, multi-petabyte database instances. We can also map the entire database in advance. You can map upwards of, I think, 128 terabytes or 256 terabytes, give or take, I can't remember exactly which. So you can create all of your maps in advance, which means your virtual memory space is very, very large rather than your real memory space, and we can take advantage of the memory mapping. So queries are always executed in memory: we page fault the data in and execute on it in memory. And as I mentioned, the data in memory and the data on disk is essentially the same format. Typically, though, when we spill out to disk, we reorder the data. When the data is in memory, it's usually time-ordered; think of that Lambda architecture style diagram we had earlier, I'll just go back to it. When the data is coming in, like in a real-time database, it's usually ordered by time, because obviously if you tried to reorder the data by, say, ticker symbol after every insert, you'd just kill your performance. So the data in memory is usually ordered by time in this streaming type of system. And then on disk, what we typically do is reorder the data, because most queries from a historical database perspective are research-style queries, maybe one particular symbol or one particular range of stock symbols that we're looking at. That lends itself much better to sequential disk seeks. If you think about a traditional spindle: if all of your data for Microsoft is held contiguously and you're only doing an average price for Microsoft for yesterday, it will memory map only the columns that you're selecting, leave every other column untouched, and then only do sequential disk seeks over the area of the file that corresponds to Microsoft, so it basically reads as little data as possible; there's a quick sketch of this just below. And so that's one of the nice things. You can also look at things like read-ahead and so on, to page fault and buffer the data into memory ahead of time. So it's really nice for sequential-style workloads, and that model lends itself even better when you start working with flash-based devices as well. It's kind of a natural fit to our model. So you're letting the OS manage what gets paged in and out? Yeah, exactly, effectively, yeah.
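As a rough illustration of that on-disk reordering, a typical end-of-day write and the kind of research query it serves might look something like this in q (the paths, date and column names are just examples, and the historical database is assumed to be loaded in the usual way):

t:`sym`time xasc t                                   / t is the day's trade table; make each symbol contiguous
t:.Q.en[`:hdb;t]                                     / enumerate symbol columns against the sym file
`:hdb/2020.01.02/trade/ set update `p#sym from t     / splay into the date partition, parted attribute on sym
select avg price by 5 xbar time.minute from trade where date=2020.01.02,sym=`MSFT   / maps only the needed columns, reads only the MSFT region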
Yeah, so the caching is all actually managed by the OS. As I say, it's all just simple system calls, you know, open, map, close, et cetera. Sort of related to that, my next question is: what do you do to get it down to 500 kilobytes? We have our own stylized C code, but yeah, we've had a relatively small bunch of developers working on it, so everybody knows it very well; it's just several files. How many people work on KDB+ now? It's probably about five to seven developers actively. So it's always been a relatively small team, and that just means everybody can collaborate very easily. I mean, they always say that once you go beyond seven or eight team members, productivity drops. And so everybody knows the code base very, very intimately, so everything is very, very finely tuned. For example, in the latest version we released last year, or earlier this year, version 3.5, we put in a fully fledged debugger; we had a quasi-debugger before. Putting that in involved going very forensically through the entire code base, but by virtue of the fact that the code base is so small, we can actually make these changes very flexibly and very easily. And it's also very nice from our customers' perspective, because if there's a bug found, we can usually fix it within a matter of minutes and have a bug fix out later that day. So from a customer support kind of perspective, everybody talks about how our customer support is fantastic, and the small code base actually lends itself to that, rather than having this massive code base and millions and millions of different types of tests. We still have quite a lot of tests that we run on the database, obviously, but having the code base so small also keeps your instructions near the cache, which helps with pipelining and stuff like that too. So, yeah. Any other questions? Do you do unusual types of things like, if I've got multiple streaming queries and there's some part of what they're computing that can be factored out, can you share that work between them? Yeah, so you can have multiple attributes, you know, those are like the indices we can have on different columns, but as I say, we typically only take advantage of one attribute in a given query, because you're going to reduce your result set so much by using the first attribute; so secondary indices aren't really supported as such. But with the streaming query engines, you can support multiples of those, because each one is basically just a file descriptor to the OS; you can support like a thousand, I think it's 1,022 or something like that, within the operating system. So you can spin up multiple of these streaming query engines, and that's why I put multiple of them in here, but they don't necessarily need to be q code either. What we often find is people might have a C++ trading engine that will just be subscribing for price data from our ticker plant, or the events engine as it were. So the streaming query engine can be written in any type of language.
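For instance, a minimal q subscriber might look like this; the .u.sub entry point is the one from the open-source kdb+tick scripts, and the port and table names are just examples. The same handshake can be done over IPC from C++, Java, Python and so on:

h:hopen `::5010                                      / connect to the ticker plant / events engine
h(".u.sub";`trade;`MSFT)                             / subscribe to MSFT rows of the trade table
upd:{[t;x] -1 string[count x]," new rows for ",string t;}   / called asynchronously with each batch of rows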
And there are optimizations there; one is that you can potentially bulk publish. The events engine can do single-message publishing or it can do bulk publishing, and because we're a vector-based database and we use one memcpy per column, whether it's a thousand records or a million records, because of the way we buffer the data, it's a throughput versus latency thing: we can get really, really good throughput by bulking up the data. So instead of publishing with instant latency, we'll maybe publish out every 100 milliseconds or every one second, and you get way, way better throughput then. We have examples that we show in some of our workshops where, even from Java, you get like a 15 to 20x performance improvement by bulking the data together. So there are things like that you can do which can help with the streaming query performance. Once again, it's a question of whether you're throughput-sensitive or latency-sensitive. Oftentimes as well, what we find is you can have what we call a chained events engine, and what the chained events engine will do is subscribe to the main events engine but buffer the data. So the main events engine can publish the data as it gets it, but the chained events engine can buffer it and then maybe push it down to, say, a screen, because obviously you don't need a screen updating a million times per second, but maybe twice a second or something like that. So you can start chaining all of these together. The real-time database could even be a subscriber. It's all just logic written in the q language, so you can have much more complex infrastructures. We also have the idea of gateway processes, and what a gateway process might do is sit in between the historical database and the real-time database. So if you don't care whether the data arrived today or yesterday, it will just do the query re-routing for you and so on. You can also do it over multiple machines or multiple regions, which can be the case with a lot of our financial institutions, where they'll have data held locally but want to give it one kind of data warehouse feel. So they can have these gateways and load balancers, and once again they're all written in the q language as well. Because it's a full programming language and not just SQL, you can kind of write whatever you want. All right, let's thank him again. Thank you very much.