This is the third lecture in our seminar series on time series databases. We're very happy to have Supra Goya from Two Sigma here today. Supra is actually a licensed cage fighter in the city of New York, and he's a little banged up from his last bout two weeks ago, so he's sitting. What's that? He's supposed to stand? No, he has to sit the whole time. Right? You can still see the bruise on his hand. He's here to talk about a system they're building internally at Two Sigma for time series data called Smooth Storage. The way to think about Two Sigma is that they're like Google: they make a lot of money, but instead of making money on ads, they make money on the stock market and other investments. So they build a lot of cool tech in-house, and he's here to talk about the stuff they're doing. Okay? Great.

Thank you, Andy, and thanks, everyone, for coming to the talk. I'm going to talk about Smooth Storage. It's a storage system for managing time series data at Two Sigma. It's not the only system we use for storing time series data, but it is one of them. Like Andy mentioned, Two Sigma is really a technology company in the domain of finance; that's how we look at ourselves. This is the standard disclaimer: don't take me or my talk too seriously, and don't make investments based on it.

All right, so this is the outline of the talk. I'll start with the motivation: why we built Smooth Storage and what its design goals are, the things we try to optimize. Then I'll go into the data model and the API. The bulk of the talk is going to be about the implementation of the data model and the system architecture, and at the end I'll spend a little time on some of the challenges we're facing, the work in progress, and those kinds of things.

Smooth has been around for about five years now, and I've been working on it for the last two years. We have a dedicated team of five engineers, including me, who are responsible both for developing it and for running it in production as a service.

So the basic question is: why have specialized storage for time series data? I think there are a few reasons. First, time series data is extremely common at Two Sigma and in the financial world in general. And it's not just raw data like prices from a stock exchange or a series of news articles; a big chunk of it is generated downstream, derived from other processes. Second, time is invariably one of the primary dimensions along which applications want to filter data and partition their workflows. Third, scale, both in bytes stored and bytes accessed: Smooth right now stores multiple petabytes of compressed data, and peak read traffic exceeds 100 gigabytes per second, so there's a lot of storage and a lot of data transfer. And finally, we want to optimize for the target workload. Our applications care about certain operations, so we optimize for those particular operations. They have specific needs, and it's hard to find something out of the box in the open source or commercial world that just works for these requirements, at least not without spending 100 times more money. So this problem is important for us, and that's why we built a custom system. The next question, then, is what is it optimizing for?
So Smooth is a pretty simple system, at least at this point. There are really two data operations it supports. The only way to write to Smooth is to do range updates. Actually, let me back up a little. Smooth's data model is like a database: data is organized as tables with schemas, time is a mandatory column, and the rows are ordered and indexed by time. The two operations we support are range updates on the write side and range queries over time ranges on the read side. These operations are similar to file system read/write APIs: in a file system you do range updates or range accesses over byte offsets, and instead of byte offsets we do it on the time axis, over sets of rows instead of bytes. So it looks like a file system in terms of operations, but we also attach database-like semantics to those operations, for example atomicity and an isolation model for concurrent access. Smooth sits somewhere between a file system and a database. More like a database, we like to think, but it's not a pure database either. Other than these access patterns, Smooth is run as a centrally managed service at Two Sigma; it's not like people download Smooth and run their own clusters. I'm not going to go... yeah, sure?

So is this for offline time series analysis? Because online, it's difficult to make strong guarantees without risking dropping data. Well, it's tuned for bulk access, but you can do parallel updates. But arriving data can come too fast for a system with strong semantics. I'll get into that; I don't think that's a problem for Smooth. Does the system guarantee that, or do you just make sure the users behave? Is it more than a capability? Well, the thing is, if you have a very aggressive user coming at you... No, but do you think of the ingest as coming from a user? Yes, from a user. Okay. All right. What we're doing here is definitely not targeted at high transaction loads, for sure. I was thinking that time series usually come from some kind of abstract sensor, and those things are driven by some source other than you; they don't slow down for you. So this is an excellent question: does this database keep up with IoT-type sources, which you essentially can't stop? There are metrics and IoT data coming at you; can you keep up with that? I think I misunderstood your question earlier. Smooth is really for offline use; it's not for ingesting data from online sources in a streaming fashion. What happens in Smooth is, you could have, say, prices from a stock exchange, and we get those prices at the end of the day, bulk upload them, and then run queries on them later.

I talked to BNP Paribas once, and they had a model where during the night they reorganized the database and during the day they did analysis on it. Yeah. It's not quite that offline, but a lot of the data in Smooth is not just raw data. You have these bulk processes that happen at the end of the day, but then there are processes that run throughout the day which take those input data sets, transform them, produce new sets, and constantly write back into Smooth as new data sets, and so on.
But yes, this is not something that's tuned for metrics; it's not a metrics database. In a metrics database you have a stream of events coming at you mostly in time order, and the tables are mostly very narrow, like CPU utilization or whatever. This is not meant for that. The primary use case for Smooth at Two Sigma is supporting the modeling research workflow. We have modelers who are building predictive models on top of some base data, for example models that predict future prices. It could involve things like, "you know what, I have this idea, and it's going to use satellite images," or whatever. So the process is getting the data into the system, doing lots of iterative analysis on top of it, and then, once the idea looks promising, it may go through rigorous simulations under real-market-like conditions to see whether the idea actually works. That's the basic use case for Smooth. People also build data pipelines with it, and they store raw data in it, and those kinds of things.

All right, coming back to the slide: Smooth is run as a centrally managed service at Two Sigma. I'm not going to go into the details of why that's a good idea, but what it means for us is much higher expectations around the availability, durability, and reliability of the service, and these practical concerns are where we spend a lot of our time. Running a shared service also means dealing with multi-tenancy: lots of users coming at you, so how do you enforce fair sharing of resources across them, what's your access control model, what about security, and all of that. And last but not least, since Smooth stores a lot of data, it's important for us to store it efficiently, so we employ things like storage tiering and compression to help with that.

At a high level, the applications that use Smooth tend to be parallel, time-partitioned jobs that do big IO. This is not a system for doing lots of small IO; it's there to scale bandwidth rather than transaction rate, if you will, and its users care more about throughput than about the latency of any particular operation. That said, we are getting new use cases where people do care about latency; they're running more interactive workloads and smaller queries, so Smooth could evolve in the future.

All right. I already discussed the high-level data model: tables with a schema and a time column, rows ordered and indexed by time. Smooth tables are non-relational in the sense that you can have two rows with the same timestamp; in fact you can have two identical rows, and Smooth doesn't care. That's the file-system-style semantics it has. It's also easy to update the schema: you can easily add new columns or drop existing columns.
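To make the data model concrete, here is a minimal sketch in Python. Everything here is illustrative, not Smooth's actual client API: a table is just rows kept ordered by a mandatory time column, with no primary key constraint, so duplicate timestamps, and even fully duplicate rows, are allowed.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Row:
    time: int                  # mandatory time column (say, epoch nanoseconds)
    values: dict[str, Any]     # remaining columns; may be wide and sparse

# Rows are ordered and indexed by time; no primary-key constraint is enforced:
rows = [
    Row(1000, {"ticker": "GOOG", "price": 710.5}),
    Row(1000, {"ticker": "GOOG", "price": 710.5}),  # exact duplicate: fine
    Row(1001, {"ticker": "AAPL", "price": 97.3}),
]
assert rows == sorted(rows, key=lambda r: r.time)   # already in time order
```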
Smooth also supports wide, sparse schemas efficiently. This is one of our special cases: a lot of metrics databases don't care about this because their tables are narrow, but we get very broad tables with tens of thousands of columns, a lot of them containing nothing, so the data is pretty sparse, and we have an encoding that handles that efficiently.

So this is the write API. As I mentioned, the only way to write to Smooth is to update a given time range, and that operation is atomic. It overwrites the existing set of rows belonging to the given time range and replaces them with a new set of rows. You can think of this operation as a bulk delete followed by a bunch of inserts, executed atomically. This is the operation a lot of our workflows needed, so we optimized directly for it. Here's a little pseudocode: you start a write by supplying the time range you want to overwrite, you get back a write session object, and you start adding rows. Rows have to be added in non-decreasing time order, because as you add rows they're streamed directly to some storage backend, some object store. So you can stream lots of data with very little buffering on the client side; the restrictions are that rows must be in non-decreasing time order and must not fall outside the time range you originally supplied. At the end, if everything goes well, you commit, and the whole operation is atomic; or you can abort if something goes wrong.

Internally, we assign a strictly monotonically increasing logical timestamp to each write, so the metadata is totally linear: over the set of all writes to a particular Smooth table there is a total order, and the latest write always wins. There's a more general version of this API where, for example, you can update multiple ranges of the table atomically. In fact, if you're updating lots of data, you can have multiple processes updating different time ranges of the same table and publish the whole change atomically. One thing to point out is that delete is just a special case of this API: to delete a time range, you write no new rows and commit, and that range of the table is deleted.

How common is it that people know the ranges ahead of time? You have to know them, right? Yes. How limiting is that? Well, a lot of the time they have an input data set, they partition it, and they know which ranges they're processing, so they know ahead of time that the output is going to land in a given range. What if I were doing an analysis and said: I'm interested in a pattern where some measured data column went through a particular shape, plus or minus some percentage? That would be a search on a non-indexed, non-time column; would it be a complete scan? Yes. Do you do secondary indices? We don't currently, though we do have some challenging workloads that I'll talk about. So right now, yes, in those cases all you can do is filter by time and then scan.
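To make that write path concrete, here is a runnable toy model of the session flow he describes: declare the range up front, stream rows in non-decreasing time order, then commit atomically. The names and the in-memory table are mine, purely for illustration; the real client library streams to an object store instead.

```python
class WriteSession:
    """Toy model of Smooth's range update: an atomic bulk delete plus inserts."""
    def __init__(self, table, t_start, t_end):
        self.table, self.t_start, self.t_end = table, t_start, t_end
        self.pending, self.last_time = [], None

    def add_row(self, time, values):
        # Rows are streamed out as they arrive, hence the ordering restriction,
        # and they must fall inside the declared range.
        if self.last_time is not None and time < self.last_time:
            raise ValueError("rows must arrive in non-decreasing time order")
        if not (self.t_start <= time < self.t_end):
            raise ValueError("row outside the declared time range")
        self.last_time = time
        self.pending.append((time, values))

    def abort(self):
        self.pending.clear()          # nothing published, nothing to undo

    def commit(self):
        # Atomically drop every existing row in [t_start, t_end) and replace
        # it with the new rows. Committing with no rows added is a pure delete.
        kept = [r for r in self.table if not (self.t_start <= r[0] < self.t_end)]
        self.table[:] = sorted(kept + self.pending, key=lambda r: r[0])

table = [(1, {"px": 10.0}), (2, {"px": 11.0}), (3, {"px": 12.0})]
s = WriteSession(table, 2, 3)
s.add_row(2, {"px": 99.0})
s.commit()    # only the [2, 3) range is overwritten
print(table)  # [(1, {'px': 10.0}), (2, {'px': 99.0}), (3, {'px': 12.0})]
```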
So this is essentially doing application-side partitioning? Yes. Okay. You said the user doesn't get a primary index, but isn't the time span the primary index in your setup? In the previous slide you said the database has no primary index. No, no. What I said was that the database does not enforce a primary key constraint, which means that if users want to add two rows with the same timestamp, that's okay. But we do support indexing on this non-unique value internally. So we have indexing; we just don't enforce primary key constraints. I don't want to go into the details, but there are ways applications can get that behavior if they really want it; they have to be a little disciplined. It's just not the primary objective.

So, reads. We support snapshot reads, which means the rows returned are based on the latest committed view of the table at the start of the read operation. When you start a read, it latches onto a view of the table based on the latest commit that has happened, and any concurrent writes that happen while the read is going on do not affect the set of rows returned. This is fairly easy for us because the metadata is linear and there's a total order on the writes: we just cut out any new writes with higher timestamps. There is a little bit of interference with the compaction process, which I'll talk about, but that's how it works. Here's a little pseudocode: to read, you supply a time range and you get back an iterator of rows. At that point you're going directly to the object store that contains the rows and streaming from there. Analogous to distributed writes, you can also share the same committed view of a table across multiple processes if you want.

There are some other operations that aren't officially supported but that we're thinking about, and they're a natural fit for Smooth: things like snapshots, or time travel, that is, bi-temporal access. Say you want to go back and ask how your table looked yesterday at 2 p.m., excluding all the writes that have happened since. That's relatively easy to support if you bound the time window; we don't support it yet, but it's definitely doable. Unbounded edit history for a table, though, would require a lot of changes. Really? Isn't that just a question of how much you store? Yeah, but then our metadata blows up, and we haven't built the system to scale the metadata to those levels. One thing to mention is that we have an older system at Two Sigma, an archival file system, which supports that out of the box, but it has a file system API rather than a database-like API: every edit you make is maintained as a version, so you can go back and trace the edit history if you want to. That's something we might add to Smooth as well. And one last thing: if someone really, really wants atomic read-modify-writes, I think it would be relatively easy to support them with optimistic concurrency control on the commit time. So, yeah.
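A tiny runnable sketch of that snapshot-read idea, under the linearized-metadata model he just described: a read pins the latest commit timestamp when it starts and simply ignores anything committed after it. All names are illustrative.

```python
class Table:
    """Toy model: a table as a totally ordered log of committed writes."""
    def __init__(self):
        self.commits = []      # (commit_ts, rows), commit_ts strictly increasing
        self.next_ts = 1

    def commit(self, rows):
        self.commits.append((self.next_ts, rows))
        self.next_ts += 1

    def snapshot_read(self):
        cutoff = self.next_ts - 1          # pin the view at read start
        def rows():
            for ts, rs in self.commits:
                if ts <= cutoff:           # later, concurrent commits: invisible
                    yield from rs
        return rows()

t = Table()
t.commit(["a", "b"])
view = t.snapshot_read()   # latched onto commit 1
t.commit(["c"])            # a concurrent write; the pinned view never sees it
print(list(view))          # ['a', 'b']
```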
Just to make sure: to read a range of data, do you have to scan from the last data added back to the earlier data? It's a little confusing, because the last talk here also had this kind of snapshot read. A snapshot read, what does that mean? Yeah, a snapshot read just means that when the read operation starts, it locks onto a static view of the table as it existed, and any new operations that happen while the read is in progress don't interfere. Yeah, this is a classic problem: if you read a database one row at a time while data is actively being inserted and deleted, you can get a picture of the universe that never existed. It's like copy-on-write? Copy-on-write is one implementation strategy; what you're really talking about is consistency. Yes, it's a consistent point in time, so all the data values actually existed in the database together. Exactly: it always returns a consistent view of the table. It will not return rows half from this commit and half from another. 15-721, next semester. Yeah, yeah. By the way, we use SQL Server for storing the metadata, and we don't use full ACID, fully serializable transactions, for performance and all those reasons. Exactly, that's a good way to think about it: you say, I'm not going to read any writes that happened after this logical timestamp, and as long as all the old data is static, I'm fine. It's actually not static, but we manage it in a way where it's semantically static. Why is it not static? Because we're also doing compactions, which can merge existing data that was written in the past into a new set of files, and while that's happening during your read, you have to make sure you don't miss anything. Okay, but that's a physical change, not a logical change. It actually is a logical change too. I'll get into the details of how it works, and it will make sense then.

Okay, so now we're getting into the details of how we actually implement this API and scale it. At a very high level, if I had to give you two bullet points about the implementation: first, it looks like a log-structured merge tree, but specialized for bulk deletes, because every write operation carries a bulk delete operation within it, and we structure the whole design around that. Second, and this may be obvious since databases are all about abstracting away physical details, the implementation keeps a clean separation between the physical properties of the data and its semantic properties. It will make more sense when I discuss it; the reason we did it is to keep the data layer flexible, so we can choose a different object store, federate multiple object stores, use different file formats, better indices, better compression, and so on, without touching most of the code.

So this is one of the most important pictures: how a table looks internally. A range update is represented internally by a metadata object called a shard. A useful way to visualize shards is to place them on a two-dimensional plane, where the y-axis is the commit time, the logical timestamp you get every time you write to the table, and the x-axis is the values in the time column.
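Before the slide's example, here is a sketch of what a shard might look like as a metadata record, matching that two-dimensional picture. The field names and the object-store paths are my guesses, purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)         # shards are semantically immutable
class Shard:
    commit_ts: int              # logical commit time (the y-axis)
    t_start: int                # overwritten time range (the x-axis), inclusive
    t_end: int                  # exclusive end of the overwritten range
    data_file: str              # immutable, write-once file in some object store

# A table's metadata is just its totally ordered set of shards:
shards = [
    Shard(commit_ts=1, t_start=0,  t_end=100, data_file="store://tbl/f1"),
    Shard(commit_ts=2, t_start=50, t_end=150, data_file="store://tbl/f2"),
]
```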
So, for example, shard 2 has c2 as its commit time and an associated time range that was overwritten. Each shard also points to an immutable data file containing the new set of rows written as part of that operation. Note that the time range here is not an envelope of the new set of rows; it is the range that was overwritten. The new rows go into the data file, which sits in some object store. The data file is immutable, write-once; it has the rows ordered by time and indexed; and it can potentially be replicated, so if there's some table we want replicated across data centers, you can have multiple replicas. A key property of shards is that they're semantically immutable: a shard never changes its commit time, never changes its time range, and always returns the same set of rows. So any process working at the level of shards doesn't care about the physical details. We could move the files to some other object store, rewrite them in a different format, use a different compression, even keep two replicas in two different file formats and pick one or the other depending on the workload. Those choices determine performance and locality, but the semantics stay the same as long as you're dealing with the same set of shards.

Typically, how big is a shard in your world? We like them to be 100 MB, 150 MB, even larger. These data files live on HDFS for the warm data, and HDFS internally chunks the files into 64 MB pieces. So for a large table, all the data ends up spread across the whole fleet of storage nodes, which lets us scale read bandwidth on hot tables very nicely.

Another way to look at shards is as carriers of bulk-delete tombstones. Each write is essentially a bulk delete plus a bunch of inserts, but we don't actively carry out the bulk delete; we just tag it, this range was deleted, and keep that in the metadata.

So a shard is an internal representation of a range update? Yes. So if I update only one row, you generate a separate shard? That's a great question. Smooth is really not built for lots of small writes; if you throw that kind of workload at it, it's going to have a hard time. And yes, excellent question: if two shards overlap, the overlapping portion is owned by the most recent shard, because it overwrote it. It's hiding that time range in shard 1, for example, and in theory the hidden portion can be garbage collected, unless you want to do time travel and things like that.
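That ownership rule, the most recent shard owns any contested timestamp, is simple to state in code. A minimal sketch (my own simplified shard tuples):

```python
from collections import namedtuple

Shard = namedtuple("Shard", "commit_ts t_start t_end")

def owner_of(shards, t):
    """The shard owning timestamp t is the one with the highest commit time
    among the shards whose overwritten range covers t (None if none does)."""
    covering = [s for s in shards if s.t_start <= t < s.t_end]
    return max(covering, key=lambda s: s.commit_ts, default=None)

shards = [Shard(1, 0, 100), Shard(2, 50, 150)]
print(owner_of(shards, 75))   # Shard(commit_ts=2, ...): the later write wins
print(owner_of(shards, 10))   # Shard(commit_ts=1, ...): untouched by shard 2
```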
This becomes very clear on the next slide, which is about how reads actually work on this kind of metadata. We support snapshot reads, and what that means is: when we start the read operation, we look at all the shards that intersect the given time range, and then, looking at those shards from the top, we find all the visible sub-ranges. For example, in this case, if you did a read after the third write, the sub-ranges in dark gray are the ones that are visible, and the rows corresponding to those sub-ranges are streamed back. We call this a read plan: we concatenate the visible sub-ranges together and stream them. The underlying data files are ordered by time and indexed, so extracting the rows for those sub-ranges is relatively efficient. And of course, any concurrent writes that happen while we're doing this don't affect the read plan. In a way, if you compare it to an LSM, this is log-structured metadata, and the shards are kind of like SSTables. But unlike an LSM, because we're optimizing for bulk deletes, it's always possible to know which shard owns a particular timestamp, so you never have to merge at the row level to get that answer.

This is the underlying format of the data files. It's pretty simple; it looks like a static two-level B-tree. There's a single index block which points to a sequence of contiguous data blocks. Each data block is individually compressed and is on the order of a few megabytes, and that's the unit of read: even to read a single row, you fetch and decompress a multi-megabyte block. For most of the data we use LZ4, which gives us about 2x compression with very low overhead; we've also used gzip for some of the cold data. There's other work on laying data out in column-major formats for better compression that I'll talk about later.

Now, on to compaction. Why do we need it? Exactly for this reason: if you have a read plan like this and the sub-ranges have become too small, because of overlapping writes or lots of small writes, then we're doing lots of small IOs. Lots of small IOs make reads slow and can cause more disk seeks on our backing stores, which is bad for scaling the system. So we want to defragment our read plans: rewrite the portions of the read plan that have lots of small sub-ranges. The second reason is that lots of small writes means large metadata: lots of shards in the Smooth metadata, lots of inodes on HDFS, and eventually lots of inodes on ext4, where everything gets dominated by inode overhead. That's not good for scalability either; most of the systems we build on don't like lots of tiny files, so we want to maintain a high ratio of data to metadata. And the third reason is garbage collection: you can have files and shards that are completely hidden, just taking up space, and we can get rid of them.
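Both reads and the compaction process he describes next hinge on this read-plan computation, so here is a runnable sketch of one way to compute it: walk the shards from the newest commit down and keep whatever part of each shard's range is not already claimed by a newer shard. Plain interval subtraction; no row-level merging. (This is my reconstruction of the idea, not Smooth's actual code.)

```python
def read_plan(shards, q_start, q_end):
    """shards: (commit_ts, t_start, t_end) tuples. Returns the visible
    sub-ranges as (t_start, t_end, commit_ts), ordered by time."""
    visible = []                                            # claimed sub-ranges
    for ts, s, e in sorted(shards, key=lambda x: -x[0]):    # newest first
        s, e = max(s, q_start), min(e, q_end)               # clip to the query
        pieces = [(s, e)] if s < e else []
        for vs, ve, _ in visible:        # subtract every already-claimed range
            pieces = [p for (ps, pe) in pieces
                      for p in ((ps, min(pe, vs)), (max(ps, ve), pe))
                      if p[0] < p[1]]
        visible += [(ps, pe, ts) for ps, pe in pieces]
    return sorted(visible)       # concatenated in time order, then streamed

shards = [(1, 0, 100), (2, 50, 150), (3, 120, 130)]
print(read_plan(shards, 0, 150))
# [(0, 50, 1), (50, 120, 2), (120, 130, 3), (130, 150, 2)]
```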
The compaction process itself is pretty simple: it does what a read does. It creates a read plan for the entire time range, essentially asking: if I had to read the whole table, what are all the visible sub-ranges? Then it walks the plan, looks at portions with more fragmentation, and rewrites them as a single large shard. Now, there's a big limitation in this algorithm: if you have a small shard here, a large shard in the middle, and another small shard there, you can't just take the two small shards and combine them, because what time range would you pick for the new shard? Right, write amplification. In theory, for workloads with lots of small writes, this can be horrible in terms of write amplification, but what we've seen in practice is that it stays below 10, mostly because the workload tends to be big writes. If the write workload becomes more challenging in the future, we'll probably have to do something more, but for now we're okay with a simple algorithm like this.

One more detail about the new shard: we want to make sure the write order is still maintained. When we combine, say, these three shards and replace them with the big shard on top, we reuse the commit time of shard 3. That way the write order is preserved and the compaction doesn't interfere with any ongoing writes, like shard 4. So compaction essentially rewrites shards in the past in a way that maintains the order. Another small detail: when we delete the replaced shards, we don't delete the underlying data files right away, because ongoing reads may still reference them. We actually delete the files a week later, to make sure reads are not affected. You're just guessing that nothing's still reading a week later? You're right; a week is an upper bound. If you start a read, you'd better finish within a week.

Okay, so some of this I already talked about: there are similarities with LSMs. Both are log-structured; immutable shards with indexed data files are similar to SSTables; both have a compaction process aimed at similar objectives. But the details differ. We optimize for bulk deletes: the whole design is built around handling them efficiently, deferring them until compaction time, with the read algorithm adapting around them. I think key-value stores that need to optimize bulk deletes could do something similar, embedding something like this into the SSTables instead of carrying the delete out by enumerating keys. And on write amplification: as I said, in theory this compaction algorithm can produce very high write amplification, but we monitor it and it hasn't been bad in practice. If it does get more challenging, we have a path forward: multi-level compaction. We could do something like that, but it changes the metadata, because we'd have to push the shard abstraction down into the data files; you'd then have disjoint ranges inside the same data file, since you're combining two non-adjacent shards. The read algorithm also becomes more complex; it starts looking like an LSM, where you have to merge the read plans of multiple levels. That's complexity we're avoiding right now. If it becomes necessary, we'd probably do something like that in the future.
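A sketch of that compaction step, with the same limitation he points out: only time-adjacent visible sub-ranges can be merged (otherwise there is no single time range the replacement shard could claim), and the new shard reuses the newest commit time among its inputs so the write order is undisturbed. Again a reconstruction, not the production algorithm.

```python
def compact_run(sub_ranges):
    """sub_ranges: one fragmented, time-adjacent stretch of a read plan, as
    (t_start, t_end, commit_ts) pieces. Returns the single replacement shard."""
    starts = [s for s, _, _ in sub_ranges]
    ends   = [e for _, e, _ in sub_ranges]
    assert all(e == s for e, s in zip(ends, starts[1:])), "pieces must be adjacent"
    # Reuse the newest input commit time: ongoing, newer writes still win,
    # and the total order of writes is preserved.
    new_commit_ts = max(ts for _, _, ts in sub_ranges)
    return (new_commit_ts, min(starts), max(ends))

fragmented = [(0, 50, 1), (50, 120, 2), (120, 130, 3)]
print(compact_run(fragmented))   # (3, 0, 130): one shard, shard 3's commit time
# The replaced shards' files become garbage, but are deleted only after a
# grace period (about a week) so in-flight reads are unaffected.
```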
But LSM trees are considered to have high write amplification too. Compared to B-trees? Well... I thought that was sort of their selling point: writes are initially very fast, and then the compaction cost depends on the key range being rewritten. Well, it depends; it lets you make a trade-off. Depending on the size ratio between tiers and the number of tiers, you can tune write amplification against read performance. Because of the size ratio, assuming a random distribution of keys, whenever you want to compact a megabyte into the next level down, you're going to read 10 megabytes and write back 11. Yes. So you've compacted one megabyte at the cost of 11. Yes, but if you have a B-tree with, say, 10,000 keys per data block and you're doing random writes to it, the amplification could be 10,000 or whatever. Obviously that's worse. That problem exists, yeah. But I would emphasize that this structure of ours is not really about optimizing write amplification; we don't have that problem. The real reason for this structure is that it's simple to reason about, because things are immutable. I can replicate the data files and cache them without worrying about consistency problems, and I can have all this concurrency, reads and writes from users, the compaction process, something else moving the data around, and because things are immutable, it's very easy to reason about the concurrency and correctness of the system. That is the driving goal.

Okay, so this is a zoomed-out, somewhat messy picture of the runtime components of Smooth. We store the Smooth metadata in Microsoft SQL Server; that's the top right of the diagram. It gets replicated to backup servers in other, remote data centers. We have stateless servers that mediate access to this database. At the bottom you have the object stores: we use HDFS for the warm data, and CELFS, an internal Two Sigma archival file system, for the cold data. Because of the shard abstraction, Smooth can easily integrate multiple kinds of object stores, and it can federate several of them, for example for scaling, replication, and tiering. Smooth has a client library that applications link against in order to access it; that's at the top left. The client library orchestrates the whole API between the metadata and data layers. And I mentioned the data movers: we have a policy-based data-movement framework. For each shard you can say, I want a copy of this shard in some other data center; or, this shard has become too cold, move it to CELFS, and while you're moving it maybe rewrite it and compress it with gzip, and so on. As long as it's done so that there is always at least one copy present for each shard, nobody above cares; there can be performance implications, but it's seamless to the layers above. I already talked about most of this.
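The data movers can be thought of as rules evaluated per shard. Here is a hypothetical flavor of what such a policy might look like; the predicates, thresholds, and actions are all invented for illustration. The only invariant the layers above need is that at least one replica of each shard's data file exists at every moment.

```python
from collections import namedtuple
from datetime import timedelta

ShardStats = namedtuple("ShardStats", "age reads_per_day")

# Hypothetical per-shard rules for a policy-based data-movement framework.
POLICIES = [
    (lambda s: s.age > timedelta(days=365),
     "move to cold archival store, recompress with gzip"),
    (lambda s: s.reads_per_day > 1000,
     "add a replica in another data center"),
]

def plan_moves(shard):
    """Return every action whose predicate matches this shard."""
    return [action for pred, action in POLICIES if pred(shard)]

print(plan_moves(ShardStats(age=timedelta(days=400), reads_per_day=5)))
# ['move to cold archival store, recompress with gzip']
```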
As I said, immutability has been one of our design principles, and it has been really useful, because it makes the system much easier to reason about. Very little coordination is needed between the different processes, internal or external, to reason about the correctness and consistency of the tables. That has been a big win for us.

These are some high-level statistics. I already mentioned multiple petabytes of unique compressed data and read peaks in excess of 100 gigabytes per second. We have hundreds of millions of files and shards, and tens of millions of tables. Tens of thousands of concurrent requests; that doesn't sound like a big number, but the whole system is about scaling bandwidth more than concurrency.

A lot of the people here are researchers looking for problems. Starting from here, what would you recommend, what are the hard problems where people might want to innovate? Not because Two Sigma needs them, but things you think would be interesting in this space. So, first, Smooth addresses only part of Two Sigma's time series workload. It's not a system used by trading systems, for example; those might care about much more complex queries and about low latencies, so there could be lots of server-side computation happening. I'm not going to talk about those systems in this talk, but I am going to talk about some of the interesting queries people are asking us to solve. I'll give you the problem, and if it already has a good solution, great for us. I'll get to that in the next section.

Looking forward, what are we working on right now? Two Sigma's infrastructure is spread across multiple data centers, and it also uses the public cloud for some of its compute. One common request is: I want to access this table from any data center, and I shouldn't have to care where it lives. Replicating data to all the data centers is not a good solution; it's not cost-effective. What we're looking for is a distributed, CDN-style caching layer that can span multiple data centers without requiring the data to be stored at rest everywhere. There's a new object store we're investing in which has such a CDN-like caching layer, and we may use it for scaling reads across regions.

Improving storage efficiency is another important goal: we store a lot of data and we're growing fast. This summer we had an intern try open source columnar formats like Apache Parquet and ORC on a big set of Smooth data, and, not surprisingly, we got another 25-30% better compression over gzip, so that's something we may adopt in the future. We also have an interesting file format developed within Two Sigma which is row-oriented but compresses individual columns separately. For people who don't need column filtering, that can be much faster while still giving pretty good compression; it's something we may look into, and I'll be happy to talk about the details of that format after the talk. We're also looking at making replication and tiering more cost-efficient. Right now we don't have an object store that properly understands cross-data-center semantics.
Handling a disk failure or a node failure is different from handling a data-center-level failure; you want different semantics. What we do right now is sometimes replicate data across data centers, and that's not cost-effective at all, because HDFS internally makes three copies of anything we put in, so we end up with six copies. Not great. We're looking at solutions where the object store itself has the right semantics for cross-data-center replication and uses things like erasure coding, which is a lot more cost-effective. There's also a bullet I didn't mention: performance. Although we're throughput-oriented, HDFS tends to have very bad tail latencies, for lots of reasons I won't get into, and we're not happy about that. We want some kind of performance consistency, and that's going to be a challenging problem to solve.

Other than that, as I mentioned, people are demanding more sophisticated read-side filtering and querying. Right now all they get is range filtering on time. Some people want column filtering, which may not be that hard to support, but there are more complex queries too. Let me give you an example of a query we don't know how to solve well. A lot of our time series data has sub-series within it. For example, if I'm getting prices from a stock exchange, I'm getting prices for Google and Apple and the other companies all together, ordered by time. So the data set looks like one series ordered by time, but there's another column, the stock ticker, the company. The kind of query people want is: I'm going to give you an arbitrary subset of that other column, whose cardinality is around 10K or 20K, the number of companies listed on some exchange, and you have to pick just those rows and return them ordered by time. It's a subset query, and it's actually pretty common at Two Sigma. If you try to solve it with traditional database indices, a clustered index on time plus a secondary index on the other column, the problem is that each data block contains almost all the tickers, maybe 80% of all the companies. So you're essentially doing a table scan even with the secondary index, because you have to pull every data block and filter the rows out anyway. The other approach is to separate the data out by company ID, making the company ID the first component of your clustered index. Then I can select exactly the right sets of rows and merge them. The problem is that this induces lots of random IO when you select a large subset: you're doing IO from 10,000 different pieces and then trying to merge 10,000 streams.
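For that second layout, clustered by company and then time, the read side is a k-way merge of up to ~10,000 per-company streams back into time order. A small runnable sketch with heapq, on toy data; it also hints at the two costs he goes on to discuss: one stream of random IO per selected company, and one data-dependent comparison per output row.

```python
import heapq

# Toy data clustered by company, then time: one time-sorted run per ticker.
runs = {
    "AAPL": [(1, "AAPL", 97.3), (5, "AAPL", 97.9)],
    "GOOG": [(2, "GOOG", 710.5), (4, "GOOG", 712.0)],
    "MSFT": [(3, "MSFT", 52.1)],
}

def subset_query(runs, tickers):
    """Rows for an arbitrary subset of tickers, merged back into time order.
    Each selected ticker contributes one stream; heapq.merge does the k-way
    merge, comparing on the leading time field of each row tuple."""
    selected = (runs[t] for t in tickers if t in runs)
    return heapq.merge(*selected)

print(list(subset_query(runs, {"AAPL", "MSFT"})))
# [(1, 'AAPL', 97.3), (3, 'MSFT', 52.1), (5, 'AAPL', 97.9)]
```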
When you think about sequential IO here, are you thinking about memory? About the same node? Or is there no disk IO anymore and everything's in memory? Well, I think our data sets don't fit in memory at all, and the data is out on HDFS; we don't even know which disk it's on. When you talk about sequentiality, you mean: when you do a seek, how much useful data do you get? The data is spread across a bunch of disks, in 64 MB pieces, who knows where.

Is the data really that big? It's just time series; what else do you store with these ticker files? Well, it's not just raw data. What people do is say: I have these rows coming from this market data, I'm going to do something with them and derive another data set. So, copies of the data? Different formats? It's not really a copy: you read data out, transform it, shove it back in, and that's a new data set. It's derived, yeah. But you don't have video, you don't have really big data; you have what others would call metadata. Well, there are people in the company looking at satellite images and at news articles; people look at any kind of data they can use to correlate with some future event. So you could afford a thousand-node cluster, and with all the memory of a thousand nodes you could keep a lot of time series data in memory. The thing is, there are lots of data sets; it's not like there's one data set we can just load. Every researcher has their own favorite idea they're trying out, their own prepared data set they're banging on. So keeping the whole thing in memory is hard.

So would you consider it the same problem on flash? I think flash definitely helps: you can reduce the page size, which by itself reduces read amplification, so indices become more effective. You could also take those 10K sub-series and shard them, even with a hash, so you'd be reading a subset. And you don't care about seeks as much, because you have so much parallelism available. There's still the cost of a seek. Yeah, but you can do those in parallel; flash drives are very parallel. You can do that with disks in parallel too. Not with a single disk. But there's still a problem: the merge itself can be non-trivial. If you're merging on time, each comparison is a conditional branch that cannot be predicted, and CPUs are bad when they can't predict branches. So general merges, where you have to compare each row and you don't know which way it will go, can be bad for CPU-bound workloads. It's not just the disk, actually. Thank you.

We have time for one or two questions. Is there a retention policy for the old shards, or are they eventually lost after compaction? You mean the shards that can no longer be read? Yeah. There have been people who say never delete my data, they want it for audit purposes and so on, but right now we're not built for that, because the metadata would grow without bound. So generally we don't support use cases where you can never get rid of data. All right, thank you again.