Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Stephen Moy Foundation for Keeping It Real; find out how best to keep it real at stephenmoyfoundation.org. Okay, thanks for coming. It's time for another day of the Vaccination Database Talks seminar. We're excited today to have Paul Dix returning for his second talk at Carnegie Mellon. Paul is the creator and co-founder of InfluxData, the company backing InfluxDB. He always insists we mention he has an undergraduate degree from Columbia University in computer science, which is a big deal. It's just, it was a while ago. His work at Influx is more important. Throwing in a little university shade right now. No, I didn't say that. It's just, it was a while ago, right? Okay, so Paul is a great speaker, so we're super happy that he's here with us. As always, if you have any questions for Paul as he's giving the talk, please unmute yourself, say who you are and where you're coming from, and feel free to interrupt at any time. That way he's not talking to himself. And with that, Paul, thank you so much for being here. Go for it. Great. Thank you, Andy. Just as a warning, I can't really see the chat part of Zoom from where I am, so interrupt me verbally. Yeah, yeah. Okay. So, as Andy mentioned, I'm Paul Dix, the co-founder and CTO of InfluxData. The truth is, I struggled with what exactly to present in this talk, because when I first picked the date, which was some point in the fall of last year, I'd assumed that this project would be much farther along. There'd be more functionality, we'd already be running in production, and I'd have all sorts of crazy real numbers to show. So, as a lesson I've learned, and somehow it never sticks, I learn this lesson all the time: there are lies, damn lies, and software delivery estimates.
And of course, as an entrepreneur, I'm an optimist in many facets of my life, so this bleeds into when I actually think software projects will be done. That being said, I think there's some interesting work and design decisions that we made along the way. So this talk won't be about the details of, like, a query planner or execution engine. It's more a high-level overview of the problem space, the specific design of InfluxDB IOx, the importance of the data lifecycle to this design, the underlying technologies used, and some of the optimization trade-offs that we're trying to make as we're building this thing out. I'll also mention some of the techniques that we're using in our execution engine, and some of the things we're thinking about with data organization and query planning. So this talk is about InfluxDB IOx, which is the new open source core of InfluxDB. It's written in Rust, which is why it's IOx, short for iron oxide, and it makes heavy use of Apache Arrow. But before I get into all that, I need to open with a little bit of background information to give some context for our design and problem space. For those of you who aren't familiar, InfluxDB is an open source time series database that we created in late 2013. It's written in Go, and it's currently at version 2.0. It has two query languages. The first one, called InfluxQL, looks kind of like SQL, but it's not really SQL, which is fine when you're first getting into it and can be frustrating in other ways as you get farther along. Here's an example InfluxQL query. And the other query language, which we built as part of the effort for 2.0, is called Flux. It's not really just a query language; it's actually a fully functional scripting language. So it's basically a query planner and executor and a scripting engine all in one go. Here's an example Flux query. So first let me talk about what I mean when I say time series.
I find this frequently needs a little more clarification, as different people have different ideas about what that entails. Here's what we think of at Influx when we think about time series. Obviously, it's all about data that has a time associated with it. When talking about an individual series, you have what are called regular and irregular time series. Regular series are frequently called metrics in the server monitoring domain, or they could be called measurements if you're working with sensor data. Within Influx, we usually call them measurements. Irregular time series are basically just events that happen in time. These could be individual requests to an API, a user visiting a web page, a fan turning on, a boat docking, or any kind of event that you can think of that could be tracked, both in the physical world and in virtual worlds and all over the place. This comes up obviously in user analytics, business intelligence, distributed tracing data, logs, and many other places. I think of irregular time series data as basically the highest precision data that you have about some history. You can actually create regular time series from those irregular events, like, say, the average daily stock price of Apple for the past year, or the 95th percentile response times in one-minute intervals for the last four hours in your API. The important thing is that the time at which a measurement was recorded or an event occurred is part of the primary key of any data that you store. And time is a part of every question that you ask of this data, even if the question is just: what was the last recorded value of this thing? So what's interesting about the time series use case? Why does time series deserve its own database category? Well, it has a number of properties that I think make it very different, obviously, from traditional OLTP workloads.
In particular, you have an append-only workload for real-time ingest that's very, very fast at scale. You don't have updates, and particularly, you don't have multiple writers trying to update a record. You have a bunch of writers appending new data as it comes in. You frequently have needs for bulk ingress and egress, whether that's doing historical backfill of your data or doing large-scale analytics. And more frequently, what we see people asking about is training machine learning models. If you're going to make real-time predictions, you need to be able to ingest that data as it comes in and have your models output new predictions. But to train the model to begin with, you need access to the data in bulk, which is a very different access pattern from, say, showing a dashboard or doing monitoring and alerting. You have range queries over large blocks of data to do summarization, or you have last-value queries. Almost all of the data in our use case is at rest, right? In our infrastructure, we use Influx to monitor Influx, our cloud environment, and all this other stuff, and the vast majority of the queries that hit our production infrastructure are for data that's been written within the last few hours. All of the historical data is almost never accessed, which is kind of a strange workload. And then lastly, you have bulk deletes, right? People aren't deleting individual records. They're deleting either the high precision data, to free up space and not have to spend money on it, or, now with GDPR, people have to be able to delete a specific user's data mixed in with all the other stuff. So obviously, this is an OLAP workload, if ever there was one. Data is either historical or predictions generated by a model. Even if you think about updating a record, you never really do that. If it's a historical observation, you may have a correction for it, but that's a correction associated with the record that already exists.
You'd still record the time of the correction as part of the primary key. Or if you're updating a model and making new predictions, you're still going to tag those with the model version, which is also, again, part of the primary key. You're always appending new data. So let's get back to what this data actually looks like. With time series data, and historical data of all kinds, really, there are multiple dimensions on which you'd like to slice and dice that data. For example, from the server monitoring realm, you may be measuring network bytes into an API. You could want to know that number over time by server, by service, by region, by user making the individual requests, by the software version that's processing those API requests, or any combination of the above. Because of the way people query time series data, a lot of time series databases have a structure that builds this in. In InfluxDB, we have a measurement, tags, fields, and timestamp structure. We created a protocol that we call line protocol, which is basically a text-based protocol for representing this data when you write data in. Here, obviously, we have the cpu measurement. We have tags, which are key-value pairs where the values are strings, and again, you can think of those as the dimensions on which you want to slice and dice that data later on. You have the individual fields, which are the values that you have. And then finally, you have a nanosecond-scale epoch timestamp. When I created this structure, I actually took inspiration from OpenTSDB, a project which I knew about at the time. It had a structure of measurement, tags, a value, and a timestamp. I added fields because I wanted to be able to store other value types like bool, int, float, and string, and have them be associated together as basically a series.
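To make that structure concrete, here's a minimal sketch of parsing a single line protocol point into its measurement, tags, fields, and timestamp. This is an illustration only, not InfluxDB's actual parser: it ignores line protocol's escaping rules and field type suffixes, and the function name is made up.

```rust
// Hypothetical sketch: splitting one line protocol point of the form
//   measurement,tag1=v1,tag2=v2 field1=val1,field2=val2 timestamp_ns
// into its parts. Real line protocol also handles escaped commas/spaces
// and typed field values, which this sketch skips.
fn parse_line(line: &str) -> (String, Vec<(String, String)>, Vec<(String, String)>, i64) {
    let mut parts = line.splitn(3, ' ');
    let head = parts.next().unwrap();          // measurement + tags
    let fields_part = parts.next().unwrap();   // comma-separated fields
    let ts: i64 = parts.next().unwrap().trim().parse().unwrap();

    let mut head_iter = head.split(',');
    let measurement = head_iter.next().unwrap().to_string();
    let tags = head_iter
        .map(|kv| {
            let mut it = kv.splitn(2, '=');
            (it.next().unwrap().to_string(), it.next().unwrap().to_string())
        })
        .collect();
    let fields = fields_part
        .split(',')
        .map(|kv| {
            let mut it = kv.splitn(2, '=');
            (it.next().unwrap().to_string(), it.next().unwrap().to_string())
        })
        .collect();
    (measurement, tags, fields, ts)
}

fn main() {
    let (m, tags, fields, ts) =
        parse_line("cpu,host=serverA,region=west usage_user=23.2,usage_system=5.0 1612374000000000000");
    println!("{m} {tags:?} {fields:?} {ts}");
}
```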
Prometheus, from the server monitoring realm, basically has the same structure that OpenTSDB did, but names these elements metric, labels, and value, which you can see in the Prometheus exposition format on the lower half of the slide. Times in Prometheus are actually assigned by the Prometheus server when it reaches out and collects the data, a process that it refers to as scraping. Now, InfluxDB and Prometheus and some others bake this data structure deep into the database by splitting it into two different parts. The first is an inverted index that maps this metadata, like the measurement or the labels, to the underlying time series. So we map a tag key-value pair like host=serverA to the underlying series IDs associated with that combination. Going back to the InfluxDB line protocol example, you can look at it like this. A series is the measurement name, tags, and field. We assign unique IDs for those, and then we create posting lists. So we can say what time series are in the west region, or what time series are associated with server A. Now, what this leads to is something called the cardinality problem. The second part of the structure, other than the inverted index, is the actual time series data itself. This is all indexed by the measurement, all tags, and the field, and then the data is basically just an ordered list of time-value pairs, going in time order. So as you can see, this is a highly indexed structure. If you're querying for a single series, this index structure is great. If you're trying to do analytics or aggregation over a great number of series, it's problematic, to say the least. If you have a query that hits a million individual series, it results in a pretty horrific join or WHERE clause where you're trying to filter down that list. Now, this problem of having many series, which are created by every unique tag value combination, is, like I said, known as the cardinality problem.
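The posting-list idea described here can be sketched in a few lines. This is a toy illustration of the inverted-index structure, not InfluxDB's implementation; the type and method names are made up.

```rust
use std::collections::HashMap;

// Toy inverted index: each tag key=value pair maps to a posting list of the
// series IDs that contain that pair.
struct InvertedIndex {
    postings: HashMap<(String, String), Vec<u64>>,
}

impl InvertedIndex {
    fn new() -> Self {
        Self { postings: HashMap::new() }
    }

    // Register a series under every one of its tag pairs.
    fn add_series(&mut self, series_id: u64, tags: &[(&str, &str)]) {
        for (k, v) in tags {
            self.postings
                .entry((k.to_string(), v.to_string()))
                .or_insert_with(Vec::new)
                .push(series_id);
        }
    }

    // Answer questions like "what time series are in the west region?"
    fn series_for(&self, key: &str, value: &str) -> &[u64] {
        self.postings
            .get(&(key.to_string(), value.to_string()))
            .map(|v| v.as_slice())
            .unwrap_or(&[])
    }
}

fn main() {
    let mut idx = InvertedIndex::new();
    idx.add_series(1, &[("host", "serverA"), ("region", "west")]);
    idx.add_series(2, &[("host", "serverB"), ("region", "west")]);
    println!("{:?}", idx.series_for("region", "west")); // [1, 2]
}
```

The cardinality problem falls out of this directly: every unique tag value combination adds entries to these maps, so the index grows with the number of unique series, not with the volume of time-value data.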
Many time series databases guide users to avoid using tags or labels with very high cardinality, and it's certainly something that we've done with InfluxDB. Because as you create more unique series, the index gets larger, particularly relative to the underlying time series data that you're tracking. And while there are ways to get around some of this, for example partitioning the index on time, which InfluxDB does, it still gets expensive as the number of unique series increases. And splitting out the index, or doing tricks with it, still doesn't solve the problem of how to query that data together. The distributed tracing use case, I think, is probably the most degenerate example. This example is from the OpenTracing website. If you wanted to capture this information, span ID amounts to a unique value for every single record. So in that sense, that's not a time series in and of itself. And let's not even talk about the raw log data. Basically, your index becomes larger than your actual data by a huge factor. This creates a suboptimal user experience where people need to think ahead of time about what they're writing in and avoid this cardinality issue, or they need to use one data store for metrics and another for high cardinality event data. But the thing is, frequently this kind of data represents just two different views of the very same thing that you want to get at, right? Like, the holy trinity of monitoring and observability is metrics, logs, and tracing. All three of those are basically different views on the same underlying distributed production system. You want to be able to ask questions about this data on the fly, in high precision or summarized for graphs and stuff like that. So within InfluxDB and our previous design, we had a number of specific pain points that we wanted to resolve. One, it's highly indexed, right? We over-index, which means writes get very, very expensive over time.
It's ephemeral data, so a lot of these indexes are completely unused. Queries across many series are expensive. InfluxDB also uses mmap pretty extensively, and what we found is that it's difficult to operate within a containerized environment the way we want to. So essentially, once you get to the idea that we have to change the underlying structure, that we can't use this inverted index paired with the time series data, and that we want to get rid of mmap, it basically points us towards creating a new core of the database. We needed something that was just, yeah. I don't mean to gloat, but I seem to remember telling you mmap was a bad idea three or four years ago. You did. You did. You aren't the only one. But the thing is, if we hadn't used mmap, would we have gotten as far as we did to get to this stage? Right? The previous work is what funded the work we're doing today. Mongo could make the same argument, yes. Indeed. I would take their market cap today. That is, how do I say this, that's Andy sitting in the ivory tower saying, do it this way, while we've got to pay the bills. So, right. Well, got to pay the bills, and also, the Paul of 2021 knows things that the Paul of 2013 didn't know. And hopefully that continues to be true, as future Paul is going to know things that current Paul has no clue about. That's fair. Okay. All right. So, putting together some requirements for this new thing: we want it to still be good for metrics, but we also want it to be great for analytics. We want to have unbounded cardinality. Importantly, we want to be able to separate compute from storage and scale those independently of each other. We need to support bulk data import and export in a much more first-class way. We want to support real-time subscriptions to the data as it gets written in, operator-defined replication and partitioning, and embeddable scripting. So, I want to embed a JavaScript scripting engine inside the database.
And ultimately, we want greater ecosystem compatibility, which is definitely a departure from how we built InfluxDB 2.0, right? We had a custom query language, a custom underlying file format, all this other stuff. So, there are some insights that we had along the way that basically point to the idea that a columnar database would actually work really, really well for this use case. One, 98% of the queries are on recent data. So part of what becomes important is managing the lifecycle of when data is in memory versus when it's persisted elsewhere, and how we query that. But the other thing we realized is that essentially partitioning the data, doing aggressive pruning at query plan time, plus brute-force columnar performance, is probably good enough for our use case, which is basically real-time monitoring, alerting, and dashboarding. And then, separately, large-scale analytics can be done in different ways. The other thing that was actually surprising to us as we did some of the research work here was that the columnar compression was frequently better than our custom time series storage engine. A few other things I think are important to mention before we get to the other bits, which is the operational environment that we're in. We operate in all three public clouds, AWS, GCP, and Azure, in a bunch of Kubernetes clusters. We also need to be able to operate in a customer data center, or out at the edge on a single device, or on a developer laptop, on x86-64 or ARM64. So, at one end we have a highly dynamic environment like Kubernetes and the cloud providers, and at the other, more traditional environments where you have an actual machine and a locally attached disk. So, let's get into the technology choices that we made for IOx.
First, and I think most importantly, one of the more important things that was a difficult decision to make at the time, but in retrospect I think has been excellent, is that it's written in Rust. I've written about my excitement for this language a number of times over the last few years. I think Rust is essentially the future of systems software. It gives us the fine-grained control over memory that we're looking for, with the safety of a higher-level language. Even better, its model for programming concurrent applications, which most server software, including this project, is, eliminates data races. Within our Go code base, there have been a number of very difficult to track down bugs over the years that were the result of data races. Rust just makes it impossible to do that. Its error handling also helps developers write correct software and reduces the number of runtime bugs that you end up creating. I think a nice bonus is that it's embeddable into other languages and systems, which was something that we didn't have with our Go code base. This means that we can embed it into InfluxDB, or other parts of our stack, or other analytics systems. We can embed our query engine into basically any language. We could even compile it down to WebAssembly and run it in the browser if we wanted to. So, there's tons to love about Rust, but obviously this talk isn't about that. Ultimately, I want this project to form the basis of future analytics systems for the next few decades and beyond, so choosing Rust as the implementation language was a big part of the bet that we're making. From a hiring standpoint, do you find it hard to find people that know databases plus know Rust, or do you assume people coming in without Rust can pick it up very quickly, in your experience? Yeah, I think the intersection of people who know Rust and know databases is, like, there's almost none at this stage, but it's getting larger.
And I think anybody who's been writing C++ database code can pick up Rust. There are things about the language that make it a pain in the ass to learn. Definitely for me, when I learned it, it reminded me of what it was like when I was first trying to learn how to write code, just the level of frustration. I just couldn't wrap my head around it. And still, there are things I absolutely need help with. But I think one of the things that actually helped is, we announced IOx as a project in November of last year, and the fact that it's written in Rust, I think, has been a benefit to us. We've been able to hire people that I don't think we otherwise would have had access to if it weren't for the fact that it was written in Rust. So the other big piece of it is that it's built around Apache Arrow. Arrow was started in 2016 by Wes McKinney as an in-memory columnar data specification to help data scientists do fast, zero-serialization, zero-copy data interchange between different data platforms and languages. It then expanded into persistence formats by adopting Apache Parquet, which is a compressed columnar and nested data structure format. Next, it moved into RPC by defining Apache Arrow Flight, which is basically a gRPC layer that uses FlatBuffers to describe the metadata, like the schema, plus raw Arrow byte arrays, to define highly efficient, fast data interchange across the network for many individual records. Now, in the specific language implementations of Apache Arrow, you have compute kernels for common vectorized computations. Essentially, you have the Java implementation, you have C++, and a lot of the higher-level languages basically just wrap the C++ one. So Node, Ruby, Python, they just wrap the C++. Rust has its own first-class implementation. Go does as well. We contributed the Go one, and obviously we're now contributing very heavily to the Rust one.
Now, the other thing about the Rust part of the Arrow ecosystem is there's a project called DataFusion, which is a SQL parser, planner, and executor that follows the Postgres SQL dialect. This was contributed to Arrow by Andy Grove in 2019. So as you can see, all of this is fairly new. But despite Arrow's relative youth, it's starting to get significant adoption in data science and, I think importantly for this audience, in OLAP databases. Things like Snowflake, BigQuery, Redshift, and others in the data warehousing space either have support for querying data via Apache Arrow Flight or for doing bulk data ingress and egress via the Parquet file format. So I think it really points to a potentially promising future where data persistence and network interchange are standardized for OLAP workloads. Essentially, this could do for data warehouses and data lakes what ODBC and JDBC did for OLTP relational databases. Okay. So within Rust specifically, you have the components in arrow-rs. It was actually recently pulled out of the Arrow monorepo into its own repo under the Apache organization so that it could have more frequent releases, particularly on crates.io, which is Rust's packaging system. That's actually in the process of getting finalized right now, but Andrew Lamb from our team is spearheading an effort to get it releasing hopefully once every two weeks, so that developers can use the Arrow Rust library without importing some specific git SHA. You have RecordBatch, which is basically the memory format, and the compute kernels. You have a Parquet reader and writer. You have the Arrow Flight RPC layer. And then you have DataFusion, which is the SQL parser, planner, and executor. All this stuff has been maturing very rapidly over the past year. Flight didn't really work in Rust up until some work that we helped drive forward, and there wasn't a Parquet writer at all when we first started working with this.
A lot of the stuff in DataFusion initially wasn't supported: you couldn't do joins, you couldn't group by multiple columns, and a bunch of other stuff is basically just getting added in as we go. And the nice thing is, we're contributing all of this directly, but there are other people around the world who are also contributing to it. So it's nice to be interacting in the open source world with a larger community that's driving these things forward. Obviously, one important thing about IOx is that, because of the tools that we're using to build it, it speaks Postgres-flavored SQL natively, thanks to the DataFusion piece. That means it's standards compliant, and we can share data with other systems via the Flight or Parquet standards. We also have essentially an API that the Go implementations of InfluxQL and Flux use to give us compatibility. So essentially, we have those language and query processors, written in Go, as separate sidecar processes, and then natively within IOx, it supports SQL. All right. So let's map the InfluxDB data model onto IOx and SQL. In InfluxDB 2.0, we have this concept of a bucket, which is basically just like a database. A measurement maps to a table. A tag is basically just a column where we do dictionary encoding on it. A field is also a column, and time is also a column. That's kind of how we mapped it. So one of the biggest design choices for IOx is that it operates with object storage as its persistence layer. We wanted to decouple compute from storage, and we needed a system designed to run in a highly elastic and ephemeral environment like Kubernetes. So we use object storage as the disk, albeit one with limited capabilities compared to a local file system. Obviously there's a lot of prior art on this. Snowflake has written pretty extensively about the work they've done, and a bunch of others.
We read through all of that literature as we were picking this up, and we're using the same kinds of tricks. I mentioned object storage, but the truth is we still have to be able to run without object storage. So we have a common API in our code base with concrete implementations backed by the local file system, by memory, by S3, Google Cloud Storage, Azure Blob Storage, and other services with compatible APIs like MinIO and Ceph. That means that IOx can run without a locally attached disk; it can run entirely in memory. And this is kind of a design goal for some of the other features that IOx will support, like replication, real-time data subscriptions, and the embedded JavaScript engine. We want it to be able to act as a processor without actually having to worry about any of the persistence concerns, if that's what you want. So let's look at how data is organized in the object store by IOx. Here's an example directory structure. At the very top of the tree you have a writer ID: each IOx server pointed at the same object store bucket is required to have a unique u32 identifier. We take advantage of this and assume that we have essentially single-writer, many-reader semantics. So anything below that subtree, the writer knows it can safely write modifications and changes to, but any reader can pick up that data and process query requests off of it. Next, you have a database name, which is pretty self-explanatory, and then we break data out into partitions. This partitioning structure is actually defined by the user when they create the database. They can partition data based on a table, based on some value in the data, or any number of ways. In this example, we're partitioning data by year and month. The truth is, in our production environment we'll probably be breaking it down even farther. Actually, we do right now: we break it down by day and even hour. So essentially we create hourly partitions.
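Putting the pieces of that layout together, here's a small sketch of building an object store key from a writer ID, database, partition key, chunk, and table. The exact key format is an assumption for illustration; the real layout may differ in naming and ordering.

```rust
// Hypothetical key layout, loosely following the directory structure described:
//   <writer_id>/<database>/<partition_key>/<chunk_id>/<table>.parquet
// Both function names are made up for this sketch.
fn partition_key(year: u32, month: u32) -> String {
    // e.g. partitioning by year and month, as in the example on the slide
    format!("{year:04}-{month:02}")
}

fn object_store_path(writer_id: u32, db: &str, partition: &str, chunk_id: u64, table: &str) -> String {
    format!("{writer_id}/{db}/{partition}/{chunk_id}/{table}.parquet")
}

fn main() {
    let key = object_store_path(1, "mydb", &partition_key(2021, 3), 0, "cpu");
    println!("{key}"); // 1/mydb/2021-03/0/cpu.parquet
}
```

Because the writer ID prefixes everything, a writer only ever touches its own subtree, which is what makes the single-writer, many-reader assumption safe.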
Beneath all of that you have what's called a chunk. These are immutable blocks of data. We create new chunks as data is written in, and use this as a way to buffer up as much data as we can before we persist it to a Parquet file. And then finally, you have the table itself. We're not totally sure yet how this is going to shake out, but essentially I think we're shooting for individual Parquet file sizes in the tens of megabytes each. Not too big, so that we can quickly pull them down from object storage if we have to, but not so small that we have an insane number of tiny files lying around that we have to access. The other thing about IOx is that, just like InfluxDB, it's schema on write. So tables are like you'd think of in a regular database: columns with their types. The user will create the database, and then as they write the data in, those tables get defined, the column definitions get created, and the data gets organized into partitions. We'll have a feature to specify the schema upfront and lock it down, if you want to, at some point in the future. IOx supports all the same data types that InfluxDB does, string, bool, float64, int64, uint64, and datetime, but we'll also add support for raw byte types. I think one of the other important things is that the float64 support in InfluxDB did not include support for NaNs or positive and negative infinity, whereas IOx will support those. So if you want to write those into your float field, you can. Why? Is that because you were using the negative infinity as null? Why didn't InfluxDB support that? It was an artifact of basically the storage engine, which used those as markers or something. I can't remember the specifics, since it's been years since I looked at that. But yeah, essentially, they were used as markers. We've used that; like, MonetDB uses the smallest negative value to represent a null.
I was just curious why, like, I understand not supporting one of them, but why you couldn't support the other ones. So I was just wondering what you were marking. Yeah, I know. We basically said if we're not going to support one, we're not going to support any of them, so we could only have actual float values. Okay. So obviously we haven't talked about indexing. Where does that happen in IOx? Well, other than partitioning the data and organizing it into immutable chunks, we don't really index. When data is written out to Parquet, it's sorted, right? So we care about the sort order of that data. But the strategy really is to organize data into partitions that we can prune most of out at query planning time, and basically just brute-force the rest of it. So from a high level, here are the different components of an IOx server, and we'll touch on each one of these items. We have the write buffer, which buffers data; it's an append-only log, basically. Then the mutable buffer, which is where data lands; we can query it while it's still hot. We have the read buffer, which is essentially an immutable, memory-based area that's a bit more optimized for compression and query performance and stuff like that. And then we have the catalog, which is basically a catalog of all the different files that we have and are keeping track of. So, the write buffer. As I mentioned, it's essentially an append-only log for real-time ingest. We need to be able to view it on a per-database basis, and within that database, data has to be ordered either by a logical clock value, which is essentially just an increasing u64 ID, or by table and then clock value. We want to be able to query data out of this buffer and say, give me the data for this table from this clock value. Honestly, we're still working out some of the details here. This is kind of changing as we go.
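The write buffer semantics described here, an append-only log with an increasing clock value and a "give me this table's data from this clock value" query, can be sketched as follows. This is a toy model under the stated assumptions, not IOx's actual write buffer; all the names are illustrative.

```rust
// Toy append-only write buffer: every entry gets the next u64 clock value,
// and readers can ask for all rows of a table at or after a given clock.
struct WriteBuffer {
    next_clock: u64,
    entries: Vec<(u64, String, String)>, // (clock, table, row payload)
}

impl WriteBuffer {
    fn new() -> Self {
        Self { next_clock: 0, entries: Vec::new() }
    }

    // Append a row and return the clock value it was assigned.
    fn append(&mut self, table: &str, row: &str) -> u64 {
        let clock = self.next_clock;
        self.next_clock += 1;
        self.entries.push((clock, table.to_string(), row.to_string()));
        clock
    }

    // "Give me the data for this table from this clock value."
    fn since(&self, table: &str, clock: u64) -> Vec<&str> {
        self.entries
            .iter()
            .filter(|(c, t, _)| *c >= clock && t == table)
            .map(|(_, _, r)| r.as_str())
            .collect()
    }
}

fn main() {
    let mut wb = WriteBuffer::new();
    wb.append("cpu", "host=a usage=1");
    wb.append("mem", "host=a free=10");
    wb.append("cpu", "host=b usage=2");
    println!("{:?}", wb.since("cpu", 1)); // ["host=b usage=2"]
}
```

The clock value is what gives downstream consumers, like replicas or subscribers, a resumable cursor into the log.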
We're spending more of our effort right now focused on buffering data in memory, persisting it out to Parquet, and then querying it back out. So the read path looks like this. On the front-end side of things, we have the SQL front-end, which is actually embedded into IOx. The Flux and InfluxQL front-ends are a Go process that runs separately and communicates via a gRPC API with the IOx server. Then there's DataFusion, which is not just the SQL parser, but the planner and the executor, and that can hook into different underlying backends. The mutable buffer is one backend. The read buffer is another. And then, finally, there's object storage, or really just Parquet files, as the other. So as ingest happens, data lands in the mutable buffer, and then it gets converted over to the read buffer, where we organize it into larger and larger blocks. Our goal here is essentially to land that hot data in a place where we can query it immediately if users want to, but collect chunks of it over time so that we can get, hopefully, better and better compression, depending on what the data looks like and how it's being written in. And as we gather it together, we're sorting it so that we can get better compression within the read buffer. DataFusion will basically push specific operations down to these different backends. Predicate pushdown is the most obvious, but over time we'll be adding to this list, and we'll have to implement each one separately for each backend. The mutable buffer probably won't have any pushdowns; essentially, it'll just convert the data to an Arrow record batch, and that's what DataFusion uses to process data from the different backends. So the read buffer has a number of techniques that we're using. The most obvious is dictionary encoding.
A lot of the data that gets written in these time series use cases is tag data, so we dictionary encode that. Then we use RLE encoding. If we sort the data in the right way, RLE actually gets us really far. We also do byte trimming on the ints. So the logical type is an int64 or a uint64, for example, but if the data within a block of the read buffer can be represented as single-byte integers, then that's what we'll use under the covers, while the logical type remains an int64. And we'll be doing the same thing for floats. Again, this is kind of an artifact of the time series use case: a lot of people have float data coming in where the floats are actually all whole numbers, but they're floats only because the back-end system only supports writing floats. Frame-of-reference encoding writes values as an offset from a specific base value. And then better handling of nulls, which I'll get into in a bit, and why that's a problem for us. I pulled this number out of thin air, but basically our goal for this data is that most of the query workload should be hot in memory, right? We want to manage that process so that some window of data (two hours, 24 hours, two days, whatever it is) stays in memory, and we want query response times that are less than 50 milliseconds. I picked 50 milliseconds because a user can perceive about 100 milliseconds, and you need a little bit of time for the network call. So, you know, something lower than that for other queries; I'm just saying that's the upper bound. But honestly, you never know; some of the queries are going to be accessing large blocks of data. For the queries that actually have to hit object storage, it's obviously going to be way worse than that, right? So it kind of depends on the whole situation.
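Two of the encodings just mentioned can be sketched in a few lines. This is not the read buffer's actual implementation, just toy versions of the ideas: run-length encoding over sorted tag values, and frame-of-reference with byte trimming, where i64 values are stored as a base plus one-byte offsets when the range of the block allows it.

```rust
// Illustrative run-length encoding: collapse consecutive equal tag
// values into (value, run length) pairs. Works well on sorted columns.
fn rle_encode(values: &[&str]) -> Vec<(String, usize)> {
    let mut runs: Vec<(String, usize)> = Vec::new();
    for v in values {
        match runs.last_mut() {
            Some((rv, n)) if rv.as_str() == *v => *n += 1,
            _ => runs.push((v.to_string(), 1)),
        }
    }
    runs
}

// Illustrative frame-of-reference with byte trimming: subtract the
// block minimum, then store each offset in a single byte if every
// offset fits in a u8; otherwise signal that this encoding won't work.
fn for_encode(values: &[i64]) -> Option<(i64, Vec<u8>)> {
    let base = *values.iter().min()?;
    values
        .iter()
        .map(|v| u8::try_from(v - base).ok())
        .collect::<Option<Vec<u8>>>()
        .map(|offsets| (base, offsets))
}
```

A real encoder would pick offset widths of 1, 2, or 4 bytes per block instead of giving up beyond one byte, but the space saving comes from the same subtraction trick.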
Well, for in-memory query response, a lot of it will depend on how sophisticated the DataFusion optimizer is, right? Yeah. I mean, some of the queries are going to be in microseconds. Many of them are, but many of them won't be, and it really depends on the query. But our intention is to optimize that over time and also, you know, start to build in caching and all the standard kinds of tricks. And any of the work that we do on the planning and optimization will actually be pushed into DataFusion in Apache Arrow, separate from the IOx project itself. So handling the data lifecycle is an important part of this whole thing. We need to quickly transition data from the mutable buffer to the read buffer. We need to constantly merge that data together within the read buffer to make sure we're getting the most compact representation while still having fast query times. And finally, as data is gathered together, we need to persist it to Parquet for recovery and long-term storage. All of these parameters are actually tunable on a per-database basis. So we'll have some sensible defaults, but one of the design philosophies of IOx is that operators should be able to tune whatever they want, specific to their use case. Because the problem is, depending on what the query workload is, it's going to be wildly different. So we want them to be able to define it on a per-database basis. So the catalog is basically just the summary data of the Parquet files that are in object storage. It happens that there's a catalog on a per-writer basis. The summary data is, you know, for this writer and this database: what tables exist, what columns exist, the data type, the min, the max, and the count. Each writer has its own catalog, but readers can combine the data from multiple catalogs if they want to.
But building cross-server catalogs is kind of out of scope of what we're trying to do. Essentially, I want to be able to start an IOx server and say, read the data from these catalogs, and have it just do that on the fly. But how the IOx server gets that list of catalogs, that's something that's built separately. So then you have the object store reader. Basically, the way I'm thinking about the local file system is that it's just a cache of files from object storage. And we have the catalog for very quick metadata lookup, and essentially for pruning. So the thing about IOx is that it's an open source data plane. It's meant to be a building block for more complex architectures. It's a data plane that can manage the lifecycle of the data in and out of itself, to other servers, and to object storage. And it'll be able to query and combine data from many IOx servers based on configuration or an individual request. It'll be able to read the catalogs of any number of other IOx servers from object storage and query that data on the fly. But how you control many IOx servers is actually totally up to you, the user of IOx or the creator of a software system that uses it as a component. So the control plane is highly dependent on your operating environment. We chose to separate the data plane from the control plane for technical reasons, because we want to be able to iterate on the control plane independently of the actual data plane, but also for business reasons. I think it's pretty important to be upfront about how we plan to commercialize all this, given the recent changes in the open source infrastructure software world. It was important for us to keep IOx permissively licensed and open source, which means we need a plausible plan for how to actually build a business around it. So IOx is dual licensed under MIT or Apache 2, take your pick. And our business is basically in the operating and management of many IOx servers.
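To make that pruning idea concrete, here's a toy version (with assumed names, not the real catalog schema) of how the min/max summary data in the catalog lets the planner skip Parquet chunks: any chunk whose time range cannot overlap the query's range is dropped at plan time without ever touching object storage.

```rust
// Illustrative catalog summary for one Parquet chunk. A real catalog
// would carry per-column type, min, max, and count; we keep just the
// time column's min/max here.
#[derive(Debug)]
struct ChunkSummary {
    path: &'static str, // object-store key of the Parquet file
    time_min: i64,      // min of the time column, from the catalog
    time_max: i64,      // max of the time column, from the catalog
}

// Keep only chunks whose [time_min, time_max] range overlaps the
// query's [q_min, q_max] range; everything else is pruned.
fn prune<'a>(chunks: &'a [ChunkSummary], q_min: i64, q_max: i64) -> Vec<&'a ChunkSummary> {
    chunks
        .iter()
        .filter(|c| c.time_max >= q_min && c.time_min <= q_max)
        .collect()
}
```

The same overlap test generalizes to any column with min/max stats, which is why "partition well, prune hard, brute-force the rest" can go a long way without a traditional index.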
We're building a control plane to operate within our own cloud environment. So our cloud is where we'll essentially be commercializing at first. But the goal here is that we run the exact same open source IOx builds that we put out in the community in our production environment. That's part of the strategy: we wanted a commercial offering that is complementary to the open source, not a replacement for it, right? With InfluxDB 1.x, there was the open source build of InfluxDB, and our commercial product was literally a replacement for it. You'd have to install new bits and all this other stuff. So when we make our control plane available on premise, or managed by you, it won't be an IOx replacement. It's essentially a new piece of software that you run in addition to the same open source IOx bits you're already running. So this means that everything a control plane would need to do to tune the operation of an individual IOx server, or a collection of them, needs to be accessible via a network API. So we have a management API, a gRPC API, that our own control plane uses to get visibility into the operation of an individual IOx server and to make adjustments to its operation on the fly. And of course we have a bunch of other scripting stuff within the Kubernetes realm to make all this work. So for example, say we wanted an infrastructure that held a certain time period of data in memory based on what a customer is paying. They could choose to have, say, two hours' worth of data in memory, with a separate worker pool to process queries that go further back in time. And we'd have a collection of separate IOx servers that process those queries, which are obviously going to have to do either query processing against Parquet files on a locally attached disk or, even worse, go out to object storage to retrieve them, and do all of that.
But basically we want to be able to support query workloads that are, one, for real-time monitoring, alerting, and dashboarding, which has a certain set of needs, or, two, analytical queries that do larger things. And how you design your control plane is, I think, highly dependent on what your infrastructure looks like, what specific use case you're building for, and honestly how you plan to make money on the thing. So here are some of the other problems that we're looking at and dealing with as we go. The first is sparse tables. In these kinds of tracing data workloads, and in other things, you essentially have tables with hundreds or even thousands of columns where many of the values are actually completely null. So we need to have a compact representation of tons of rows where many, many values are null, and organize it in such a way that it's not going to blow up the size of memory. Conversely, on the other side of this, we also need to be able to support many thousands of tables, which is another way that people frequently use InfluxDB. We've seen both. As I mentioned, with the write buffer, how we deal with buffering the data as it's coming in for real-time ingest, and how we deal with recovery, those bits are still things we're figuring out, because we don't want to write a ton of tiny little Parquet files and then have to compact those together later. Compaction is one of the huge sources of pain within our current storage system. So really what we want to do is buffer enough data in memory that we don't need to compact later on. The only reason we should need to compact is if people are deleting data, or if they're doing weird updates where they want to change the schema or something like that. And then lastly, bulk ingest is something we're thinking about.
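Going back to that sparse-tables point for a second: the Arrow-style answer is to keep a validity bitmap alongside the values rather than a sentinel per null. The sketch below is illustrative, not IOx code, and uses a `Vec<bool>` where a real implementation would pack one bit per row.

```rust
// Illustrative nullable column: a validity flag per row plus a value
// slot per row. A real columnar format packs the validity flags into
// a bitmap, so a mostly-null column costs roughly one bit per row.
struct NullableColumn {
    validity: Vec<bool>, // true where the row has a real value
    values: Vec<i64>,    // slot contents are meaningless where invalid
}

impl NullableColumn {
    fn from_options(rows: &[Option<i64>]) -> Self {
        NullableColumn {
            validity: rows.iter().map(|r| r.is_some()).collect(),
            values: rows.iter().map(|r| r.unwrap_or(0)).collect(),
        }
    }

    fn null_count(&self) -> usize {
        self.validity.iter().filter(|v| !**v).count()
    }

    fn get(&self, i: usize) -> Option<i64> {
        if self.validity[i] { Some(self.values[i]) } else { None }
    }
}
```

For the hundreds-of-mostly-null-columns case, the next step beyond a bitmap is to not materialize value slots for null rows at all, which is one of the trade-offs being weighed here.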
I think one of the interesting bits is that you could have bulk ingest where I have an IOx server on my laptop, or wherever, and it makes a request to a remote IOx server to find out how that data should be structured. It can do all that data structuring, write those files into object storage with a writer ID that it got assigned from the remote service, write the catalog, and then trigger the readers in the remote environment to read those catalogs and make that data available for query. So essentially we can completely decouple ingest from the actual operational workload of the different query processors. So that is actually all I have for you today. Happy, happy to do Q&A. Oh, I'm sorry, I have a dog here. Very unprofessional. So let me open the floor. If you want to ask Paul a question, please unmute yourself and go for it. Paul, this is Matt, one of Andy's PhD students. I just had a quick question on something you said pretty early. Andy's well known for getting on a soapbox and ranting about mmap. But you had a new complaint that I hadn't heard before, which was the difficulty of using mmap within a container. Could you speak a little bit about what that was? I mean, this is actually above my pay grade, because I haven't done the operational work on the environment. As I mentioned to Andy, I actually didn't get to write code for a number of years, so there's this whole generation of InfluxDB technology I'm not intimately familiar with. But essentially, the problem is the memory reported by the system is whatever; we don't know. We need more control over how much memory is being used and how, and not being able to control when that gets paged in and out is essentially a source of pain. And honestly, we don't get visibility into what's really going on. So, no mmap in our current system.
I mean, the truth is, we're not even building a storage engine per se. We're building a query engine that works on in-memory data or Parquet files. And everything else is essentially trying to wrangle all the data into large enough blocks to have an efficient representation within Parquet. Anybody else? So if I understand the vision of IOx... oh, sorry. Anybody else? I mean, go for it. Yeah, that was me. So, any view on using, instead of just a raw object store, things such as Iceberg or Delta or Hudi? Because they do a lot of management of the data compaction and catalogs for you, so it makes your life a lot easier. Yeah. So we did actually look into it. We have additional materials and stuff like that; actually, on the second Wednesday of every month, we do a monthly tech talk where somebody from the IOx team will give a talk. So this Wednesday at 8 a.m. Pacific time, one of our engineers, Marco Neumann, will be giving a talk about the catalog specifically, and he talks about some of the considerations that we had there. So we looked at Hive, we looked at Delta Lake and Iceberg. The main thing is, well, one, we're obviously just dealing with Parquet files and stuff like that, so we didn't have the same set of requirements. And one of the things Delta Lake had to hack around: there are two important limitations that the Delta Lake paper talks about with S3 specifically. One is you don't have a put-if-not-exists kind of operation, which means you have to do some weird stuff there. And the other was that the list operation was eventually consistent in S3. That actually changed last fall, so that's how recent that was. But basically, we didn't have those same problems. And the other thing is, because we have single-writer semantics, we didn't have to worry about multiple writers writing into the same place.
It's also another dependency too, right? Whereas the goal with IOx is that it's a single download. Yeah, yeah, I mean, the goal is essentially, yeah, you can just download and run the one thing, and it's all basically self-contained. Or, you know, at some level an object store is required. But basically, any environment that I can think of has object storage; even people running in their own data centers at this point have object storage as an API. So you do have the lowest common denominator in terms of the functionality, which is why I called it, you know, the big file system, or the big disk in the sky, but with an API that isn't as fully featured. But again, I'm not sure I really have a direct question; it's more of a clarification. Your vision, it sort of sounds like with IOx you want to be a federated database, but not in the sense of federating MySQL and Postgres and Influx. It's IOx talking to IOx. But the idea is that it's more like a peer-to-peer system: if you have permission to go connect to something in the cloud, then IOx knows how to talk to that IOx, and then push some of the queries to it or pull down some of the data. That's the vision, right? Yes. The idea that, like, a data scientist could in theory, on their laptop, be banging out something against maybe a portion of the data, and then it's just the same environment to go shove things back to the full data set. Yeah, that's the vision. Now, one of my other hopes for the project is that DataFusion gets more and more functionality over time that would allow it to federate with other data sources. And my hope is that we'll be able to essentially pull that functionality in for free. The joke I made with people on my team is that the best features are free features that you don't have to write code for yourself.
But database federation is like the Afghanistan of databases: you go in and, you know, you're going to get caught there for a long time. Yeah. I mean, we're, yeah. What's that? Right, exactly, it's a quagmire. Database federation, like people were trying to do in the 90s, is super hard, right? Right. So, all right. And then my other question is, and maybe it's more of a DataFusion question: okay, so now you speak the Postgres query protocol and the SQL dialect, but it's still getting transformed into the physical plan that DataFusion executes. Have you thought about what it would take to start doing more Postgres things, like in particular supporting PL/pgSQL and UDFs? How much could you push DataFusion to make it look more like Postgres? And maybe that puts you closer to Timescale, and maybe you don't want to do that. But you see what I'm getting at: you support the SQL dialect, but then people come and say, oh, what about this other feature I want from SQL and Postgres? What about that? I mean, again, first and foremost, we need it to be good for our use cases, for the way people are currently using InfluxDB. But the goal is that, with this, it becomes more applicable or useful for a broader set of use cases, right, the ones you'd use a columnar database for: a bunch of analytical-style queries that essentially don't work that well in Influx today. So I know the DataFusion people want to do user-defined functions; they want to do all this other stuff. And again, we keep up to date on that. Andrew, who I actually see on the call, is part of the PMC now for Apache Arrow, so he's driving a lot of the work within DataFusion, at least from our perspective. Okay. Andrew, I don't know if you want to speak to that. But that's all upstream from you.
You get it, not exactly for free, but you get it when it comes into DataFusion, and it doesn't change the architecture or what IOx is trying to do. No. Okay. Okay. Anybody else? So, when's it going to be ready? Yeah, good question. What was it I said at the beginning? Lies, damn lies? Exactly. Yeah. I don't know. Soon? I mean, the thing is, we're currently running it within our own environment. As I mentioned, we use our cloud product to monitor a bunch of clusters of our cloud product; we're one of the larger users of our own cloud product. So we're on the order of a million values a second written into this thing, and we're ramping up IOx to be able to support that. But again, there's a ton of work that we have to do to basically build the compatibility layer to support Flux and InfluxQL. So we have that operating right now. Our first goal is to operate a collection of IOx servers in our cloud environment for that entire write workload, on a footprint that's dramatically smaller than what we currently require to do it. The second is to support all the queries there. And then after that, our next goal is essentially an invite-only alpha within our cloud product. So essentially, when users come into InfluxDB Cloud, they can create a bucket, and we'll give specific people the option to create a bucket that's backed by this new system. And then somewhere along this pathway, we're going to start creating, you know, official IOx builds, and what that's likely going to be is a crate on crates.io and a Docker image.
But we're not focused on that right now, because I don't want to start putting out official IOx builds until we're far enough along; there are still so many things changing as we're playing around with this thing and figuring out how to operate it. And then the other piece is, we need time to write at least some documentation, right? If you don't write any documentation on the thing, it's not really a project that you've released out into the world. The code, by the way, is up there today, and some brave souls are actually poking their heads into it. But it's difficult to tell people where to look right now. So, right now there's InfluxDB Cloud, and you're ingesting a ton of data, and it's all in your, I'm going to call it legacy, because it's not that old, your existing proprietary format. At some point, you want to convert everything over to a single platform that's all using Parquet and the new IOx storage layer. That's not an easy undertaking. So have you thought about how you're planning to do that? Or is that just so far out? Yeah, I mean, well, okay, so the interesting thing is, with our cloud products, there's the version 1.x of our cloud product. What that is, is our enterprise software product deployed on demand: a user comes in, they say, I want, you know, four instances of this size, this many CPUs, this much memory, and this much disk; the system provisions it in AWS with EBS, installs the enterprise software product on it, and adds monitoring and backups and stuff like that. Right, that was 1.x. 2.0 is a completely different design. 2.0 the open source project and 2.0 the cloud offering are wildly different from each other. So 2.0, the cloud offering, looks more like a SaaS product, right? It's a collection of services that are operated within Kubernetes.
We actually use Kafka as essentially the distributed write-ahead log. So when data comes in, it lands in Kafka, and then we have a bunch of storage servers pull off of that. Right. So that's how we're going to tie into that system. Our goal with migrating people over is that we need to provide zero-downtime, no-touch migrations and upgrades. So essentially what it should look like to the user initially is, hopefully, some queries that were slow before actually become fast. They'll be able to start writing data in without worrying about cardinality. And as we make more and more of the IOx API available in our cloud offering, that will become available. But they won't have to do a migration or think about that, really. But you do have to do it, right? Yes. Yes. So, part of it is, earlier on last year, as part of our effort, when we were testing out some of these different things, Ed, one of the other storage database engineers, wrote essentially a bunch of Rust code to read our old TSM format, which is the storage format that we had before, and to actually write it into Parquet files. That's what we used to validate that, okay, we can get good enough compression out of Parquet that it's not going to be a horrible experience. So essentially, we'll use that, and we'll be able to convert individual customers on a per-customer basis, ideally without any downtime. But again, how that looks is, if you've ever operated a SaaS services-based product, you build all this machinery to be able to do zero-downtime deploys and give visibility into things. And usually what happens with a data product is you run two things side by side at the same time, where you have the old production infrastructure, and then you have the new infrastructure that's operating hot but not actually processing requests.
And then you can either start to transition requests over gradually, or you can at some point flip a switch while keeping the old infrastructure online, just in case something goes wrong and you could switch back. But we won't be doing it all at once; we would do it on a per-customer basis and also per region, right? Because the new cloud product is in all three cloud providers, in multiple regions, and when users create a bucket, they create it within a specific region on a specific cloud provider. So there's no upgrading everybody all in one go; that's just not how it would happen. We have to stop here, but I would say, as an academic, the migrating from one data format to another data format, I think it's super fascinating. And it's not really covered in research, because it doesn't happen that often; people just say, our database has its own proprietary format. I find that super fascinating. So Paul, thank you so much for being here. Great talk as always.