Thank you everyone for coming today. We'll be talking about how Netflix delivers key value and time series storage at any scale, which sounds pretty cool, I think. I'm Joey Lynch, I'm an engineer on the platform team at Netflix, and this is Vidhya Arvind, who works on data abstractions. Today we're going to tell you a little bit about how, over the past couple of years, we've really been able to solve a lot of our core data problems by introducing these key value and time series abstractions that can act at any scale our users need.

So let's start with our problems, because I think that motivates why we had to do this. To put it succinctly: we have all the data, at all the scale, in all of the places. Specifically, we handle tens of millions of requests per second to our databases and billions of requests per second in caching layers, and these are all open source storage engines. These are varied — you'll notice there are different kinds of latency SLOs, different amounts of data, and different amounts of traffic. And it turns out that when you have all those different scale requirements, and then you spread your data across four Amazon regions with three availability zones each — so you have 12 copies of data to keep consistent at all times, serving the whole planet — this is a pretty challenging problem.

It gets harder when you realize that your different solutions to your problems overlap. This is our universe of database requirements at Netflix, and a given use case might be satisfied by multiple different solutions. Sometimes we have use cases that live off in no-solution land, and that's where data platform comes in and tries to build a new solution — or tries to convince the user that actually they needed something different, and their requirement can maybe fit somewhere else. But it's not static, right? Use cases evolve over time. Maybe they have more traffic because the feature is more useful — or, excuse me, more used — or there's more storage as more time passes. And on top of that you have another variable, which is price. At a certain scale a cloud managed offering might be better, but as you scale up that gets real expensive, and maybe you need to migrate to some kind of self-managed storage. For all these reasons we realized that simple key value storage wasn't so simple: depending on the use case we had to use different storage engines, and we had to be able to react rapidly to these changes in context.

So that sounds like some problems — what's the solution? The solution we've deployed at Netflix is what we call data abstraction layers, which are essentially a layer of indirection — all things are solved by a good layer of indirection, right? — between the abstraction clients, or the application servers, and our storage engines. We abstract our clients from the storage engines.

Today we'll be talking about two key abstractions we've rolled out over the past two years. The first is a key value abstraction, which exposes a multi-item map API on the left there, and on the right we're able to translate it to different storage engines, including different versions of those storage engines. For example, this abstraction was key in our ability at Netflix to get off Cassandra Thrift: we basically had two storage engines on the right there, one for Thrift and one for CQL. But it's not just limited to key value storage. We've also done this for time series storage — this is an immutable event store handling potentially millions of requests per second and petabytes of state, and we're able to combine things like Elasticsearch for full time series search with Cassandra for the actual storage.

So those sound like some interesting techniques, but how does that actually work? For that, Vidhya is going to show you the key concepts here.
Thanks, Joey. I'm going to talk about some key concepts, and after that Joey will introduce the APIs and the storage layer — how we formatted the storage layer for these use cases.

The first concept I'm going to introduce is the idempotency token. Systems fail, and when systems fail you need to rewrite the data. Can you rewrite safely? For example, if you take a bank transaction and you write the same withdrawal amount twice, is that safe? It's not, right? How do we deal with it? We first write with an idempotency token. The token can be any token that you generate out of the system, and you rewrite the same data with the same token — when you rewrite with the same token, it should dedupe. We generate the idempotency token from a timestamp, and the token can come from anywhere, from any system.

For client-generated tokens, you generate the token on the client side using the timestamp: you monotonically increase the timestamp, adding random numbers — or random time — to the last bits of it, and you also use a random nonce. You mix them together to form the idempotency token. What if you want a regional token? You can use systems like ZooKeeper and take a lock from ZooKeeper, or you can use a sequence generator, where the sequence generator produces a monotonically increasing batch of sequences and you use that to write with the token. And what if you need a globally generated token? You can use transaction IDs — the transaction IDs have to be mixed into your mutations, and then the writes have to be performed.

So if we lay out everything we talked about, from client-generated to globally generated tokens: client-generated is more reliable, whereas globally generated is more consistent — but because of the network hops it has to do, it's less reliable. At Netflix we use the client-generated one — we recommend client-generated tokens in most cases — and in some cases we use regionally generated ZooKeeper locks or sequence generators. We stay away from globally generated tokens.

We talk so much about clocks — have we measured them? Across our Cassandra fleet of 25,000 VMs, we ran a script which measured our time. Clocks have drift, clocks can skew — Joey has written a good memo about it, you can go read it — and most of the time we saw less than one millisecond of clock drift in EC2.

The next concept we're talking about is chunking. When we have a small payload, like one MB of data, we don't have to do anything — just write it to the database, and we're fine. What if you have a large payload, like 30, 40, a hundred MB of data? We need to chunk. So during the write path, we first take the payload, chunk it into 64 kilobyte chunks, and then write it to the server — we stage it, we don't commit it. After you write all the chunks to the server, you create a chunk zero with all the information you need to perform the commit, and then you commit chunk zero. Chunk zero validates your commit. Here I'm showing the same thing: take the payload, chunk it, create an item out of it which has the chunk number in it — and here you can see that we are using the idempotency token. You perform writes per page; we do 2 MB pages, so 2 MB of chunks per page.
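The client-generated token scheme described above — a monotonically increasing timestamp mixed with a random nonce — can be sketched roughly like this. This is a hedged illustration, not Netflix's actual client code; all names here are made up:

```python
import os
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class IdempotencyToken:
    """A timestamp plus a random nonce; retries reuse the same token so the
    server can deduplicate the write."""
    timestamp_micros: int
    nonce: bytes

_last_ts_micros = 0

def generate_token() -> IdempotencyToken:
    """Client-side generation: take the wall clock, force it to be
    monotonically increasing even if the clock steps backwards, and mix in
    a random nonce to disambiguate tokens minted in the same microsecond."""
    global _last_ts_micros
    now = time.time_ns() // 1_000  # microseconds
    _last_ts_micros = max(_last_ts_micros + 1, now)
    return IdempotencyToken(timestamp_micros=_last_ts_micros, nonce=os.urandom(8))
```

The important property is that a retry of a failed write resends the payload with the original token, not a freshly generated one — that is what makes the rewrite safe.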
Why a page? Because we can retry a page. If you use a stream, you have to retry the whole stream, whereas a page can be retried individually. After you write all the chunks, you write chunk zero — the commit. The commit looks something like this: it doesn't have a value, it is chunk zero, and we use the same idempotency token to commit it. We store the information about how to retrieve the chunks by adding metadata about them to chunk zero — for example, here you have the chunk count, 43, the chunk size in bytes, and a hash. While reading, we first read chunk zero, which has all the information about how to go and retrieve the chunks, and after that we go and retrieve all the chunks. To put it in perspective: on the server side, per page, 4 MB of data is read; after we read 4 MB we construct a page token and send that page's information back to the client. The chunks and the page token are sent back to the client, and when we exhaust the data, we return null as the page token.
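As a rough sketch of that write path — split the payload into 64 KB chunks numbered from one, then build a chunk zero carrying the metadata (count, total size, hash) that the commit validates against. Illustrative code under those assumptions, not Netflix's actual implementation:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64 KB chunks, as in the talk

def stage_chunks(payload: bytes) -> dict:
    """Stage the payload as chunks 1..N; nothing is visible to readers yet."""
    return {
        i // CHUNK_SIZE + 1: payload[i:i + CHUNK_SIZE]
        for i in range(0, len(payload), CHUNK_SIZE)
    }

def build_chunk_zero(chunks: dict) -> dict:
    """Chunk zero commits the write: it stores how to find and validate the
    staged chunks rather than any value bytes of its own."""
    data = b"".join(chunks[n] for n in sorted(chunks))
    return {
        "chunk_count": len(chunks),
        "total_size_bytes": len(data),
        "content_hash": hashlib.sha256(data).hexdigest(),
    }
```

Until chunk zero lands, a reader never sees the staged chunks, which is what makes the multi-write upload safe to retry.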
On the client side, we put all of the data together and reassemble the payload, and after we reassemble the payload we return it to the client.

The next concept I'm going to talk about is compression. Cassandra compresses — you all know LZ4 compression. When you write the data into Cassandra, Cassandra compresses; when you read the data back, it decompresses. When we replicate, that compression and decompression happens again. For a 64 KB payload at a 0.5 compression ratio, there is a lot of compression and decompression happening here. Instead, what we can do is compress on the client side. If you compress on the client side, we save on commit logs, GC allocations are lower, and we save on disk and network I/O — overall we save up to 3x of compression and decompression work. Have we measured it? Yes, we've enabled compression in one of our use cases. If you've ever used Netflix search: Napa powers Netflix search, and they store JSON payloads. We compressed them from 175 kilobytes to 44 kilobytes — it uses LZ4 compression — and that's overall about a 75 percent reduction.

The next concept I'm going to introduce is pagination. We've talked about how to store the data; now, how do we retrieve the data from Cassandra?
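The client-side compression idea can be sketched like this. The talk uses LZ4; zlib stands in here because it ships with Python, so the exact ratios are only illustrative:

```python
import zlib

def compress_for_storage(payload: bytes) -> bytes:
    """Compress once on the client, so the commit log, disk, replication,
    and network all carry the smaller bytes -- and the server never has to
    compress or decompress again."""
    return zlib.compress(payload, level=6)

def decompress_from_storage(blob: bytes) -> bytes:
    return zlib.decompress(blob)

# A repetitive JSON-ish payload, like the search documents mentioned in the talk.
payload = b'{"title": "example", "rank": 1, "tags": ["a", "b"]}' * 2000
compressed = compress_for_storage(payload)
```

The design point is where the work happens, not the algorithm: compressing once at the edge replaces compress/decompress cycles on every replica and every read.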
When you think about storage engines, storage engines come with record counts, right? We retrieve records from the storage engine — any DB, I mean. But we have to think about accumulating a page of fixed size on the server side. So, as I said, we read 4 MB of data and return 4 MB of data as the paginated value: we sit there, read from the database, accumulate the page, and then return the page with a page token. So there is some kind of translation we have to do, from row count to page size.

The next concept is adaptability. When we sit and read those 4 MB pages from Cassandra, or any other database, and accumulate them, it is possible that we are doing multiple round trips where a single round trip could have filled up the 4 MB of data. If that's the case, are we doing too much work? For large payloads, you might be retrieving a lot of data and throwing much of it away before sending it to the client. Instead, if we can adapt the fetch size while retrieving the data itself, we fetch less data and do less work to send it to the client. So here we are adapting to the payload size and the data we are reading by manipulating the fetch size and page tokens.

The next one: say, for example, we did all the work and accumulated a 4 MB page, and it took around 500 milliseconds — but the client already gave up, because the client's SLO was 10 milliseconds. What do we do? We did all the work for nothing, and on retry we'd do the same work again for nothing, right? Instead, if you can send an SLO with your request, the server can understand when you've reached 80% of the SLO — and once you've reached 80% of the SLO, you return whatever values you've accumulated back to the client, and then you're all good.
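A sketch of that server-side page accumulation with the SLO cutoff — stop at a full page or at roughly 80% of the client's SLO budget, whichever comes first. Parameter names and defaults here are illustrative, not the production values:

```python
import time

def accumulate_page(rows, target_bytes=4 * 1024 * 1024,
                    slo_seconds=0.010, slo_fraction=0.8):
    """Pull rows from the storage engine into a fixed-size page. Returns the
    page plus an 'exhausted' flag; a non-exhausted result would be sent back
    with a page token so the client can resume where the server stopped."""
    deadline = time.monotonic() + slo_seconds * slo_fraction
    page, size, exhausted = [], 0, True
    for row in rows:
        page.append(row)
        size += len(row)
        if size >= target_bytes or time.monotonic() >= deadline:
            exhausted = False  # stop early; more rows may remain
            break
    return page, exhausted
```

Returning a partial page within budget beats returning a full page after the client has hung up — the client simply comes back with the page token.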
Then you can go again to read more.

So we've talked about so many concepts. How do we do all this tuning? All of this tuning has to be done by the client, because the client knows their data well — their payload size, what they're retrieving. But all those client tunings are error prone, so for that we have introduced signaling. Signaling is a mechanism where the client says "I'm here," handshakes with the server, and the handshake returns a signal. The client keeps doing that every 30 seconds to get fresh data in the signals. Here is one of the examples: on top there's a key value service doing handshakes, and on the bottom you have a time series service doing handshakes and getting back a signal. In the signal you're returning: do I have to chunk? If I have to chunk, what is the chunk size, and at what point should I start chunking? Using those signals, the client can adapt, learn, and change how it chunks the data on the client side — large payloads must be broken down into chunks, and you can enable that dynamically. Compression, which we talked about: you have the algorithm here, LZ4, and you can customize the algorithm as well, on the fly, and tell the client to use a different algorithm altogether, all dynamically — fewer bytes are more reliable. And here's the SLO we talked about a little earlier. Here I have a 10 millisecond SLO; some clients need a 10 millisecond SLO, some others need 50. Say, for example, you have a consistency scope of eventual — you can return the data in 10 milliseconds — whereas if you have global consistency, you go up to 50 milliseconds.

With all that, we'll have Joey talk about the APIs.

All right, that was fantastic. Those are a lot of concepts that we put together, and for the rest of the talk we're going to show you how we put them into practice to actually build the key value and time series APIs. Let's start with key value. The key value API — sort of like the Thrift API in Cassandra — hopefully looks pretty familiar. It turns out that at Netflix, this is what most people want from a key value store: they want some kind of hash map, a partition key and then a sorted map of bytes to bytes. This covers almost all of our key value use cases at Netflix. Now, we do have some type libraries on top of this to help people store longs and such, but at a high level it's this API.

Let's start looking into the API and seeing those concepts that Vidhya talked about. Right off the bat, when we look at the mutation endpoints like PutItems, we see that we're doing an operation with an idempotency token — every operation that changes state in an abstraction requires some form of idempotency token. We also start seeing these lists of key value pairs which carry that chunk number. That allows us to dynamically handle both large data and small data: large data shows up as chunks that are non-zero, small data shows up as chunk zero. We see the same thing for the multi-item MutateItems — this is kind of like a batch API. And the really nice thing here, from a user's perspective, is that they don't have to think about ordering: as long as they create a list of operations, the operations apply in the logical order they expect. For example, if I delete a record and then insert a bunch of items, that happens in the logical order, and the abstraction translates that to Cassandra's timestamps — last write wins — using the idempotency token and ordering after that.

On the GetItems side we have pretty straightforward predicates — what matches, and selections. But the thing I want to call out here is the pagination. The GetItems response is always a page of results. Like Vidhya said, if you have a streaming API you don't know if the other side is slow or gone, so instead we use pagination, which allows us to speculate and hedge on every single page. That allows us to maintain those single-digit-millisecond SLOs. And finally, if that next-page token is set, that means the client has to keep consuming.

Scan is very similar — you'll notice the only difference is that we don't have an ID. This is kind of a full table scan API, and the main thing here is that there are actually multiple concurrent pages returned. What we found was that users expect a common SLO for full table scans: they want to be able to scan one terabyte just as quickly as they can scan ten terabytes of data, and the way we achieved that was with parallel range scans. The nice thing is that the key value abstraction makes the determination about how many concurrent cursors to generate and how many token ranges to scan, and that allows us to meet that full-table-scan SLO. Specifically, we're going to give them as few cursors as possible to meet the SLO, because we don't want to put too much pressure on the cluster.

So, putting it all together, we can see all those concepts Vidhya talked about: we can see the idempotency tokens, we can see large values being chunked, we can see the pagination on the read API, and we can see that real focus on fixed-size work that we can establish an SLO on — as opposed to counts. We can't establish an SLO on counts: returning one 1-gigabyte item is going to be slower than returning a hundred 100-kilobyte items.

All right, so that sounds pretty cool — we can see the API, but how does it actually work under the hood? How do we make it work with Cassandra?
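Before getting into the Cassandra details, the "as few scan cursors as possible to meet the SLO" decision described above might look roughly like this. The throughput and cap numbers are invented for illustration, not Netflix's actual values:

```python
import math

def cursors_for_scan(total_bytes: int, scan_slo_seconds: float,
                     cursor_bytes_per_second: float,
                     max_cursors: int = 64) -> int:
    """Pick the fewest concurrent range-scan cursors that can cover the whole
    table within the scan SLO, so a big table scans about as fast as a small
    one without hammering the cluster more than necessary."""
    needed = total_bytes / (cursor_bytes_per_second * scan_slo_seconds)
    return max(1, min(max_cursors, math.ceil(needed)))
```

A 10 TB table simply gets roughly ten times the cursors of a 1 TB table, which is how the scan SLO stays flat as data grows.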
Well, we do it using a pretty straightforward schema. You'll notice that this is exactly what you would expect for a key value abstraction — the only difference is that value metadata column. What's special about the value metadata column is that it allows us to tell the system that the data is actually located somewhere else — in particular, off in this versioned chunk table. There's a lot of stuff going on here, but I'll walk through it at a high level. We have a partition key, which is a combination of that ID and a bucket — a numeric bucket — and then we have the sort key, the inverse-sorted version, and then the chunk number. You can see the mapping to the idempotency token: the version is going to be the nonce, and the write time is going to be the timestamp. When we want to store large values — multi-megabyte or even gigabyte values — into key value, the value metadata in the simple base table now points off into this chunk table. And what's really key is that it points off into a locatable part of that table, so that if we need to, for example, move things around or reshard them later, we can do that transparently without the user ever knowing.

Just to show you the power of this, I want to walk through a couple of examples where we have three large values. If we were to handle megabyte values all the time in just that normal base table, Cassandra would explode instantly. So instead, what we're able to do is spread the value out over multiple buckets. We take that offset, which is some kind of consistent hash, and every eight chunks we bump the bucket, which has the effect of spreading that data out. With a two-megabyte value we're spreading it out over more buckets, and then finally with the ten-megabyte value we're spreading it out even more — buckets 12 through 32 in this case. How does that actually look on the storage cluster?
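The every-eight-chunks bucket bump might be sketched like so. The exact constants and hash function used in production aren't public, so treat these as illustrative:

```python
CHUNKS_PER_BUCKET = 8
CHUNK_SIZE = 64 * 1024

def bucket_for_chunk(offset: int, chunk_number: int) -> int:
    """Chunks 0..7 of a value land in bucket `offset`, chunks 8..15 in
    `offset + 1`, and so on, so a large value is spread across partitions
    and no single node serves more than 8 * 64 KB of it at once."""
    return offset + chunk_number // CHUNKS_PER_BUCKET

def buckets_for_value(offset: int, value_size_bytes: int) -> list:
    """All the buckets a value of the given size touches."""
    chunk_count = -(-value_size_bytes // CHUNK_SIZE)  # ceiling division
    return sorted({bucket_for_chunk(offset, c) for c in range(chunk_count)})
```

The point of deriving the bucket from the chunk number is that it's locatable: a reader can compute where every chunk lives from chunk zero's metadata alone.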
Well, it looks like this: the first eight chunks go to these three nodes, the next eight go to those three, and with the rest you're spreading this large value out over Cassandra. You're never asking any one Cassandra node to do all that work at once. For example, if it's eight chunks of 64 kilobytes, you're only ever asking a Cassandra node to do 512 KB of work at a time. And then, because we have that pagination system, we're only ever asking four replica sets to do work at a time. So we spread that work out over a longer period of time, which maintains our SLO — and, key, doesn't wake Jordan up.

This also allows us to implement concurrency control, because the versioning is what allows us to stage those large values. You can't write multiple megabytes or gigabytes of data quickly — that takes some time — so you have to handle concurrency: people might be touching that same key at the same time. The versioning is what allows us to do that. That random nonce in the idempotency token, in addition to the timestamp, is what allows us to deduplicate two concurrent writes and stage them, and the thing that ultimately arbitrates which is the last write that wins is the base table.

All right, so we saw how we could use novel storage layouts and a two-layer system to store data of any size. This also allows us, if we have a partition that has lots of keys, to potentially do summarization as well, where we group those keys up and put them off in the chunk table — but that's future work. What about time series?

All right, time series is pretty similar. The only real difference is the key in the sorted map. Specifically, we found at Netflix that most event data stores wanted an inverse sort on timestamp — they want the latest events most of the time. And the API looks pretty similar to that KV API, right?
We've got WriteEventRecords — event records contain a namespace and a time series ID. This looks kind of familiar, right? If you're a developer, one of the key benefits at Netflix is that this looks very similar to the key value abstraction. Yes, you're storing a different kind of data, but the API and the way you interact with it are the same — you don't have to learn a whole new database. The one key thing I want to call out here is that the ReadEventRecords API in time series always takes a time interval, and that's what allows us to locate the data. And you might say, "But Joey, you just said that we always have idempotency tokens — where is the idempotency token?" Well, it turns out it's hiding. Right there — that's the idempotency token. The nice thing about time series data is that it has built-in idempotency tokens, because it's event data: there's a timestamp, and then there's an event ID, and that event ID is some kind of nonce. And again, on the ReadEventRecords response, we see that next-page token and that pagination. So we're seeing all those concepts applied exactly the same way, giving us a really robust setup.

What's different in time series? One thing is that we found time series users wanted additional modes of acknowledgment from us — especially our tracing use cases. One of the biggest use cases for time series at Netflix is the tracing data set, which stores a record every time any service calls any other service at Netflix. We have a lot of microservices, and they talk to each other a lot.
This data set is ridiculously big, and they're writing between one and twenty million writes per second into this event store. They don't care if we drop one out of a hundred billion events, so they really just wanted to fire and forget — send that data as fast as possible into the event store, and if something went wrong we could deal with it offline. Some users want to know that the write has at least reached the abstraction and been queued into an in-memory queue, and then most users probably want to know that it's actually in storage. There's a trade-off here between durability and reliability: the stronger the durability, the less reliable the write — the higher the probability it fails. We found at Netflix that, especially with time series, people want to be able to pick between these, and that's fine — the abstraction gives them this choice as one of those namespace configuration options that Vidhya talked about.

All right, so just wrapping it back: we see all those concepts again. We see the idempotency tokens, the different modes of acknowledgment, a fixed restriction on size in this case — although maybe we've got dynamic chunking in our future — and then, finally, fixed-size work again.

All right, so that again looks like a pretty cool API. How does it actually work? How did we make it work? I'm going to preface this with: there's some complexity ahead. When we gave this talk internally, people asked why there's all this complexity, so I just want to lead with the impact. Who here had a year of efficiency? We had a year of efficiency. This project saved millions of dollars for Netflix, allowed our operators to sleep better at night, and massively simplified massive-scale datasets. To put it bluntly, we got compute efficiency, operator efficiency, and developer efficiency at Netflix — the complexity was worth it. What was that complexity, though?
Well, we start off with a metadata table that describes how we're going to lay out this time series. You can see here we have a namespace, the start time and end time of that namespace — the time interval that slice is dealing with — some metadata, and a status. Let's dive into that: what's in that metadata bucket?

Before I get to that, this is a real example from production. This is a "blank" history service at Netflix — there are a lot of "blank" history services: there's viewing history, impression history, all kinds of history stuff, because it's really important to us to personalize your experience. It's really important for us to understand your historical behavior so that we can better tune the product for you. So these history service data sets are some of the most valuable to Netflix. They're also some of the most massive — you can see that this one spans multiple years of data. And you can see on the right there that we have this little status column: is it active, is it deleted, is it closed? What do those statuses mean? Well, we have a state machine of these different time slices entering these different states. If you're familiar with Elasticsearch, this hopefully looks familiar — we basically took it straight from Elasticsearch; this is how Elasticsearch does time series data, and it scales really well. It also enables some really nice operator techniques. For example, in this case we can see very clearly how large each of these time slices is — in this particular case, there's less time series data over time.
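That time-slice state machine can be sketched as follows: reads are only allowed against active slices, and an operator can reopen a closed slice instead of losing the data. An illustration of the idea, not the production control plane:

```python
from enum import Enum

class SliceStatus(Enum):
    ACTIVE = "active"    # accepting reads and writes
    CLOSED = "closed"    # retained on disk but outside the read window
    DELETED = "deleted"  # actually gone

def check_readable(status: SliceStatus) -> None:
    """Reading a closed interval fails loudly instead of silently returning
    nothing, which is what lets users notice *before* data is truly deleted."""
    if status is SliceStatus.CLOSED:
        raise PermissionError("time interval is closed; it can be reopened")
    if status is SliceStatus.DELETED:
        raise LookupError("time interval has been deleted")

def reopen(status: SliceStatus) -> SliceStatus:
    """The 'oh no, I needed that' recovery: a closed slice flips back to
    active with a control plane operation -- no data movement required."""
    if status is not SliceStatus.CLOSED:
        raise ValueError("only closed slices can be reopened")
    return SliceStatus.ACTIVE
```

Contrast this with per-row TTLs, where by the time anyone notices, the data is unrecoverable.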
These are fixed time slices. We can also use the state of those different slices: if they're active they can accept writes, and if they're closed, they're closed — and that's a really important thing for us. That is what allows us to catch the case where somebody didn't realize their data was about to get deleted. In time series, if someone tries to access a time interval that's closed, the API throws errors saying you can't read that data, it's outside of your window. How many times do you think people have gone, "Oh no, I didn't mean for that data to get deleted"? All the time. Constantly. We get, "Oh, I didn't realize there was a one-hour TTL on that data." Now, if it was written with TTLs, we'd have to say, "Well, you're out of luck, your data's gone." With this, we just go, "Oh, it's back in active, and we've changed the life cycle policy."

So this has some pretty significant operability advantages for us. Number one, we can push our clusters a lot further, because we're doing retention with these tables instead of with compaction. We can finally tune the retention policy and drive our clusters up to 80 or 90% disk full — millions of dollars, super high impact. Being able to decouple that retention from TTLs meant that instead of us having to say, "Sorry customers, we lost your data," we can say, "Yep, we fixed it" — a control plane operation — "your data is going to be around for another year." And finally, it allows us to decouple background operations like compression and compaction. For example, we can use really aggressive Zstandard compression on those older tables that aren't being read a lot. It's extremely impressive work, and I'm really proud of the team.

Pro tips — some things we ran into with this. Well, it turns out that when you add DDL into the hot path of an abstraction, Jordan gets sad.
So the way we fixed this was we created runway into the future: because we know the time partitioning, we can pre-create those tables before they enter the critical path, and we can make sure that if there are schema issues, we deal with them ahead of time. But we also learned that you can idempotently create tables using a WITH ID statement, even in old versions of Cassandra — as far as we can tell this is supported back to Cassandra 2. When you create a table, you can say WITH ID = some UUID, and that replaces the silly timeuuid implementation in the server. I'm sorry, I'm calling it silly — it is silly. It really is frustrating when you have schema disagreement that leads to data loss because people created the exact same table twice. This solves all of those problems, because the create is idempotent. Also, transactional cluster metadata just landed in Cassandra 5.1 trunk — that will hopefully also solve this problem, so we can remove that hack.

All right, but what does the actual time series data look like? Well, again, we've got that data, but with a couple more buckets. Hold on, I'll walk through all the buckets — I know there are a lot of buckets. There are time buckets, there are event buckets, and then, as you'd expect, the event time, inverse sorted, pointing to the data. This looks pretty similar to key value, again, just with a little bit of difference in the sort key.

All right, let's go through that bucketing, because this is actually pretty key, and this is where that metadata associated with the slice comes in. This is a snapshot of the metadata associated with a given time slice, and we can see that we have a configuration of how long the time buckets are, how many random buckets we want per time slice by default, and how we split up events between the buckets.
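Those knobs — time bucket width and event bucket count — might be sketched as a routing function. The hash choice and the default values here are mine, for illustration only:

```python
import hashlib

def route_event(event_time_secs: int, event_id: bytes,
                slice_start_secs: int,
                time_bucket_secs: int = 3600,
                event_buckets: int = 4):
    """Route an event to a (time bucket, event bucket) pair: time buckets
    bound read amplification (a query for one month touches only that
    month's buckets), while event buckets shard a single hot time series."""
    time_bucket = (event_time_secs - slice_start_secs) // time_bucket_secs
    digest = hashlib.sha256(event_id).digest()
    event_bucket = int.from_bytes(digest[:4], "big") % event_buckets
    return time_bucket, event_bucket
```

Because the event bucket is derived from the event ID, a retried write lands in exactly the same place — the built-in idempotency token at work again.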
Let's walk this through with some real data. On the x-axis there we've got time. As we write new data, that data gets routed to these different time buckets and these different event buckets. You might be asking, "Joey, why so many buckets?" It is a lot of buckets — we've got three kinds: we've got tables, we've got time buckets, and we've got event buckets. Well, the key is that when you start doing time series at scale, you realize you need the ability to tune these different parameters. Specifically, the tables specify your retention period. The time buckets control read amplification at a lower level — a common use case at Netflix is "I want to store a year of data, but a typical query will only look at a month," and that's what time buckets are for. And finally, sometimes even with those first two levels of sharding you end up with really large time series — we like to think there are people who just sit there and constantly watch Netflix; either that or they're sharing their account, one of the two — and for those users we need to be able to dynamically shard out that wide partition using random event buckets, kind of like the key value chunking design.

With that, I want to hand it back to Vidhya to close us out with a couple of success stories — things that have gone well.

Thanks, Joey. That was a great overview of the APIs and the storage engine. Now we're going to talk about some of the use cases that are powered by — that are stored in — these storage engines. Key value is deployed across 400-plus shards, with around 3,000 use cases live on key value. For time series, we have tens of petabytes of events stored in the storage engine, and tens of use cases in time series as well.

CloudSave is one of our gaming use cases. All the game data — when you play games and move from one device to another and you want your game to resume on the other device — the metadata about the game and how you are playing, all of that is stored in CloudSave. It has data payloads ranging from 30 to 300 megabytes in size; the user reads the metadata and updates the game progress. This use case is primarily enabled by the chunking and compression we talked about earlier.

Tracing: Netflix tracing, as Joey mentioned, is distributed tracing — every microservice pushes some data about what it's doing into the tracing shard. It's time series data: hundreds of thousands of service instances send their IPC traces to the time series abstraction. It's around one to two petabytes of uncompressed data, it's all immutable data, it's aged out with TWCS, and all the strategies that Joey talked about are applied to the tracing data.

The last one I'm going to talk about is profile-level user interaction — it's called Fluid. Per video, per profile, all the plays, pauses, and stops — all the events you generate interacting with Netflix — it stores that data. There are two kinds of data in Fluid. One is key value: it stores all your live sessions — when you stop and resume playback, that's coming from Fluid. It also has session history data: all the things you did in a particular session, stored in time series. It handles 800K writes per second for key value and 110K writes for time series, with 150K reads in key value and 80K reads in time series, with an 8-hour TTL for key value and a one-year TTL for time series. It's tuned heavily for writes.

So what are we going to do in the future?
We have some very innovative things coming up. One is summarization. Somebody talked about wide rows, or wide partitions, earlier — we have the same problem with wide partitions. Some of the queue use cases we're dealing with have wide, ever-growing partitions, and the device keys and device IDs per user are ever-growing as well. So for that we need summarization: we have to chunk the data and store the wide partition as summarized data. And that is not enough — we also need to reshard the data. If it's ever-growing, some buckets, or a fixed number of buckets, are not enough; we need to reshard dynamically as well.

And compare-and-swap: everybody needs put-if-absent, compute-if-absent, compute-if-present. How do we do all of these operations using Cassandra? Compare-and-set is the way to go. There are other things as well — we talked about compression; we can do dictionary compression, using signaling to teach the dictionary about the data.

With all that, that's all we have. If you have any questions, we can take them now. The future is bright, and we're right on time. Okay, we'll take questions in the back. Thank you.