The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Stephen Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Thanks, everyone, for coming. Welcome to the next session of the Vaccination Database Talk Series. Today we're pleased to host Kishore Gopalakrishna. He was the lead developer of Apache Pinot, built at LinkedIn while he was there. He's also the co-author of several other notable projects, including Apache Helix, and Espresso, which isn't an Apache project but is an alternative database built at LinkedIn. As always, we want this to be interactive, so if you have questions for Kishore as he's speaking, please unmute yourself, say who you are and where you're coming from, and ask your question. Feel free to interrupt at any time, because otherwise he's just talking by himself, okay? All right, Kishore, the floor is yours. You have an hour, go for it.

Thank you. Thanks, Andy, for giving me this opportunity to talk here. I have seen all the talks in this series, and this is one of the best places for me to talk about Pinot. I'll divide this into three main chunks. First, why we built Pinot at LinkedIn. The second part gives you a bit of an overview of Pinot, and in the third part I'll dive deep into the star-tree index, which is the key part of this talk.

One of the key reasons we built Pinot at LinkedIn is user-facing analytics. Most of you might be familiar with Who Viewed My Profile on LinkedIn. Any time you see a chart like this on LinkedIn, it is basically coming from Pinot as the backend. We have 70-plus products like this built just at LinkedIn. Across all of them we are seeing somewhere around 170,000 queries per second, and just last year this was around 100,000, so the usage of Pinot is continuously growing at LinkedIn. The most challenging part is that since this is all user-facing, the key thing for us to provide is latency: even at this scale, we have a requirement of millisecond to sub-second latency for the users.

Here is another one: whenever you post an article, you get real-time analytics on top of it. Similarly, if you go to any company page, you see the analytics for that company, which is again powered by Pinot. And there is talent analytics: if you're a recruiter, or even as a company, you want to see where your folks moved and how your talent pool looks. All of these are powered by Pinot. So these are some of the sample apps built on top of Pinot.

In the next section I'll briefly cover what goes on behind the scenes in building and powering such applications. Pretty much everything you do on LinkedIn, be it liking an article, connecting to someone, or creating your profile, is recorded and modeled as an actor-verb-object model. This is the classic graph model for describing what's happening in every event: a member liking an article, a recruiter emailing someone, or a marketer posting an ad. And typically everyone wants to know more about this in terms of analytics. So what exactly happens with this data?
Every activity that you do can be classified as either event data or entity data. Whenever you create an entity, it goes into a database called Espresso, which is the NoSQL data store at LinkedIn, very similar to Cassandra or MongoDB but with a timeline-consistent model, which means consistency is pretty critical in that database. On the other side, every activity that you do goes into Kafka, which is the key messaging platform we use at LinkedIn. Once the data gets into Kafka and Espresso, Pinot has this Lambda-style architecture where you can simply point it at Kafka and start querying in real time. At the same time, the dumps are landing in Hadoop, either from Espresso or from Kafka, and you can still push a batch load: every day you can compute it and then update Pinot. So you have this hybrid mode in Pinot: you can consume directly from real time and, at the same time, continuously keep pushing from batch.

Once you get the data into Pinot, as you already saw, the data products are one of the main reasons we built Pinot at LinkedIn. Apart from that, it's also used for visualization internally, to power all of our exec dashboards and business metrics. We also have anomaly detection, which is ThirdEye, which constantly queries Pinot to detect anomalies in the data. This is our internal dashboard as well, where we have 10,000-plus business metrics and 50,000 dimensions across all of them, and people are able to query this and slice and dice; this is mainly used by internal users at LinkedIn. This is the same set of data, but we use ThirdEye to query it periodically and figure out anomalies. And it not only figures out the anomalies, it also helps you find out why. In this particular case, you can see it's even highlighting the fact that it was Halloween, and that's why the page views actually went down. So it helps you not only detect an anomaly but also identify the root cause.

To take a step back on the key difference between Pinot and other systems that claim to provide analytics: most of the traffic that comes to Pinot is coming via apps. It's not a user sitting and writing SQL queries against Pinot, which is typically the case with a data warehouse or other analytics tools. So the QPS is very, very high, and the SLA requirement is very stringent, because you're showing the results to your end users, be it members of LinkedIn or customers of LinkedIn. It's important to have a very predictable latency SLA even at very high throughput. So in a way, Pinot is really made to power apps that provide analytics to their end users.

Before we built this, and this was something I was discussing with Andy, in 2014 we actually had a different version of Pinot, which was something like Elasticsearch but an internal version built on top of Lucene. One of the problems we had in 2014 was that we had to grow to 1,000 nodes just for the Who Viewed My Profile use case, and we had around 200-plus million members at that point.
And this was really, really expensive for us, not only to manage and operate; it was a nightmare because the system was not scalable. We had to break the cluster into multiple vertical shards, and we did all sorts of things to keep it up and running. Once we identified all the reasons why it was not scaling, we took on this journey of rewriting Pinot from scratch, and today it is running on about 75 nodes, serving 5,000 queries per second at peak, while serving around 700 million members. So this is the change we saw after we did all the performance improvements. In the rest of the talk I'll cover what enabled us to achieve this, but this is just to give you an idea of how important it is for us to address the throughput bottlenecks and still maintain the latency.

In the second part, I'm going to take a small use case to drive home why latency is important, what it does to throughput, and what techniques people generally use to solve something like this. We took the airline on-time dataset, which has data from 1987 all the way to 2008, close to 100 million rows. You can see the cardinality of all the columns. We stripped some of the metrics that were not really interesting; the only metric we kept was the air time. Let's say someone wants to build an application on top of this, and these are some of the queries they want to run: tell me all the details about a particular trip ID; what is the total flying air time across all the flights; what is the air time for a particular flight number, or across a whole region, or the unique counts. You can think of any slicing and dicing combination of these. I have left out time just for clarity, but you can add time as part of your predicate, you can add group-bys, and all of those things come into the picture.

Now let's look at how one would go about solving this. The classic approach is to store every flight record as one row, and whenever the query comes, you do the aggregation on the fly. That's well known, and you might end up doing it as a first approach; if the data is small, it's actually a good enough solution in most cases. The second solution that most people end up using is pre-aggregation. You remove some of the things that are not of much interest: say no one is asking questions on the trip ID, so you can drop the trip ID, and if flights are between two common destinations, you can aggregate them. Then the amount of data collapses and you might get much better performance. And then there is the other extreme, which was traditionally done a lot, which is cubing: pre-compute every answer. If someone asks how many flights traveled from Atlanta, you pre-compute that answer; how many flights traveled from Atlanta to San Francisco, you pre-compute that answer. It's the classic OLAP cube: you compute every possible combination up front. Now let's look at how the latency actually translates if we did this on this dataset.
I have one more column here, raw data with an index, because indexing is a common technique that most people end up using. As you can see, the first query, which took 15 milliseconds on the raw data with 107 million rows, becomes two milliseconds with an index. With the pre-aggregate and the pre-cube, we can't really answer that question because we've lost that record, so you don't get the flexibility. But as you go down, if you ask for something like flight number 8617, which has the maximum number of rows, you can still get two milliseconds with the index. With pre-aggregation you get 40 milliseconds, and with the pre-cube you can answer this question in two milliseconds. So that's the benefit of the pre-cube. Now look at the third one, where there is not much filtering, where you're accessing the entire set of columns and the entire set of rows. Here the inverted index is not really helping us a lot; it's still one-plus seconds. With pre-aggregation you can bring it to 240 milliseconds, and with the pre-cube you can get to two milliseconds.

So the pre-cube is fast across pretty much all these cases. Apart from the trip ID query, which it cannot solve, you get really, really good latency when you pre-cube. But what is the downside? It cannot be all that good. The problem is that the number of records adds up. The throughput is, sorry, the throughput is very, very good, but with the pre-cube you're actually adding a lot more data. If you compare the pre-aggregated data with the pre-cube data, it's a 6x increase in data size, and it's typically a lot more if you add more dimensions. For example, if we add date and trip ID, it just explodes: we couldn't even compute all the combinations, it gets into billions at that point, so we stopped it. With pre-aggregation you can still get reasonably good latency, but if you look at the QPS, it is just 4.5. The other interesting thing here is the throughput of the raw data with an index: while we saw that the individual query latencies were good, you can't really handle a lot of throughput with just the inverted index. The reason is that the inverted index brings a lot of data into memory, into the system page cache, and it just thrashes things, so you are not able to get the QPS that you want. Whereas on the right-hand side, with the pre-cube, you can get close to 1,800-plus QPS, and that's what we were looking for, given the number of people using these apps on LinkedIn; we wanted to make sure we can serve them.

To step back and see this in a summary view: you have three options, raw, pre-aggregate, and pre-cube. On one axis you have performance, where the pre-cube is the best, but you start losing flexibility. If you look at storage, pre-aggregation is pretty good, the raw events are high, and the pre-cube is exponential. As you keep adding more dimensions you have to re-bootstrap the data completely, because it's not easy to add more dimensions to a pre-cube. So there are a lot of other issues with the pre-cube.
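To make that storage trade-off concrete, here is a small back-of-the-envelope sketch in Python, not tied to Pinot at all. The cardinalities are invented for illustration; the point is that a full cube's cell count is combinatorial in the dimensions, which is why the talk notes it "gets into billions" once trip ID is added, while pre-aggregation can never produce more rows than the input.

```python
# Back-of-the-envelope sketch (not Pinot code). Cardinalities are invented.
from collections import defaultdict
from math import prod

cardinalities = {"origin": 300, "dest": 300, "carrier": 20, "trip_id": 100_000_000}

# A full OLAP cube materializes every combination of (value or ALL) per
# dimension, so the number of possible cells is bounded by prod(card + 1).
print(f"cube cells (upper bound): {prod(c + 1 for c in cardinalities.values()):,}")

# Pre-aggregation just drops uninteresting columns (e.g. trip ID) and sums the
# metric over the remaining key, so it can never exceed the input row count.
rows = [  # (origin, dest, carrier, air_time), toy data
    ("ATL", "SFO", "DL", 280),
    ("ATL", "SFO", "DL", 290),
    ("ATL", "JFK", "DL", 110),
]
pre_agg = defaultdict(int)
for origin, dest, carrier, air_time in rows:
    pre_agg[(origin, dest, carrier)] += air_time
print(pre_agg)  # three raw rows collapse to two pre-aggregated rows
```

The real numbers depend on how correlated the dimensions are and how many cells are actually non-empty, but the shape of the blow-up is the point.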
So the challenge for us is: is there one system that can actually give you all three in one? It gives you the flexibility that you get from raw data and the speed that you get from the pre-cube, but with a single dataset and a single structure. You don't want users to say, "I'm going to keep one raw dataset, one pre-aggregate, one pre-cube, and on the application layer, depending on the query, I'll figure out which view to access." Think about all of these things baked into one system with a single table abstraction. That's what Pinot is: it gives you all these features in one system, and you can trade off the cost of computing something up front versus computing everything on the fly. There is also the last part, which is cost-effectiveness: you don't really want to do the pre-cube, because it's highly exponential in data size. That's where Pinot has this concept of the star-tree index, which lets you move along this spectrum and get the best across all three axes.

Maybe you'll get to this, but your idea, just to repeat: your notion of flexibility is that you push this to the application developer, and they have to decide whether they want to generate the data cube or whatever ahead of time, right? And what does that look like in terms of the API? Is it a knob? Do you say, "hey, I'm actually querying this all the time, could you create a bunch of stuff for me," or "this is a one-off ad hoc query, don't do that"? Or is the expectation that the system figures this out automatically?

So it's more of the former at this point. You can say, "I'm going to query these dimensions." You don't have to give the values of those dimensions; you just say, "here are all the dimensions that I query often," and then we can make that faster. You basically start off with no indexes, nothing configured in Pinot, and as you see that something is slow, you can almost think of it as "hey, make this query faster," and we generate the indexes on the fly. You don't have to redo the rest of the data.

This is more of a side comment, and if you can't say what happens at LinkedIn anymore, that's okay: how common were data cubes at LinkedIn? Like, in the previous Lucene system, was it blowing up in size because people were generating data cubes?

No, there was another completely different system which was generating data cubes all the time, and we used to store that in Voldemort; that's how it used to be solved before this. So we would use a key-value store, and you saw all the other problems that end up happening when you use a key-value store for that.

That's not what Voldemort was first made for, right?

Yeah, it was not made for that, but we were just using it for doing analytics. Okay, all right, keep going, sorry.

So at a high level, what is Apache Pinot? It's a real-time data store which can ingest data from streaming sources as well as batch data sources, and at the same time power use cases for internal real-time applications or the data products, which was the key use case for which it was built.
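As a concrete illustration of that hybrid ingestion, here is a rough sketch of a REALTIME table definition posted to the Pinot controller, written as a Python dict. The field names follow my reading of the Pinot docs and may differ across versions, and the topic, broker, column, and controller host names are all placeholders; a companion OFFLINE table with the same name is what completes the hybrid batch-plus-real-time setup.

```python
# Rough sketch only: config keys are approximate and version-dependent, and all
# names (topic, brokers, columns, controller host) are placeholders.
import json
import urllib.request

realtime_table = {
    "tableName": "flights",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "flightDate",   # placeholder time column
        "schemaName": "flights",
        "replication": "1",
    },
    "tableIndexConfig": {
        "loadMode": "MMAP",               # segments are memory-mapped
        "invertedIndexColumns": ["origin"],
        "streamConfigs": {                # Kafka consumption settings
            "streamType": "kafka",
            "stream.kafka.topic.name": "flight-events",
            "stream.kafka.broker.list": "localhost:9092",
            "stream.kafka.consumer.type": "lowlevel",
        },
    },
}

req = urllib.request.Request(
    "http://localhost:9000/tables",       # hypothetical controller address
    data=json.dumps(realtime_table).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment to actually create the table
```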
We also have anomaly detection on top of it, which is constantly generating queries on this data to find the root causes of anomalies.

In terms of the architecture, we have two main components. We have the real-time servers, which ingest the data from Kafka in real time and periodically generate segments; once the segments get generated, they are backed up automatically in the segment store. And then we have the offline servers, where you can push data from outside of Pinot as well: you periodically generate segments and keep pushing them to the offline servers. The broker is smart enough to figure out which parts are offline and which are real-time, so there is no overlap in the data. And we use Helix and ZooKeeper for our entire metadata store.

In terms of the query path, the query execution is very simple. You get the query, which is SQL, and the broker knows from Helix which segments are where and how to route to them, and it has all sorts of algorithms for minimizing the number of machines it needs to query, or maximizing it, or using partition awareness or time pruning. It has all these enhancements built in to prune the segments as much as possible. Then it does a classic scatter-gather, collects the responses, and sends the response back to the client. It's the very classic scatter-gather request model used in most other systems.

In terms of the actual execution, there is a subtle difference compared to traditional databases. We don't have any optimization at the query level, so we don't really have a logical planner as such. What we end up doing is planning per segment: every segment decides the right plan for executing the query. It can look at the segment metadata, look at what kind of indexes it has, and come up with the actual plan. You can think of one segment having an inverted index and another not having one, so each can figure out the right way to structure the execution and how the predicates should be reordered. All of this happens at a physical level, not at a logical level. There is a little overhead that we pay for this, but it helps us a lot in terms of query performance.

In terms of the anatomy of a segment, we pretty much memory-map everything; we don't hold anything in memory ourselves. We have the concept of the Pinot data buffer, and around that we have all the indexing techniques: the dictionary for encoding, the forward index, the inverted index, the range index. A lot of indexes are built into Pinot. Then we have the column concept around that, which is each individual column's storage, and the segment just wraps all these things up. A segment can be mutable or immutable. A mutable segment is one that can keep accepting more records on the fly, with all the indexes getting updated on the fly as well, whereas an immutable one is created up front and then pushed into Pinot, and it doesn't change over time. That's a lot more efficient than the mutable one, and the mutable segments periodically keep getting converted into immutable ones.

So, you said you're using mmap. Are you using madvise or anything like that, or how hard are you pushing the OS to make mmap work for you?
We pretty much rely on the OS for everything; we don't do anything on top of it. We don't manage the buffer pool or any of those things.

I know, but are you using those hints or the other tricks you can do to make mmap work?

Yeah, we do touch some files when they're getting loaded, so we do some tricks there, but not the exhaustive tricks that other databases do.

Yeah, okay. Thanks.

So as I mentioned earlier, one of the key things for us was the need for all these various kinds of indexing. The philosophy of Pinot, to a certain extent, was this: most databases do vectorization, SIMD, and all those things to make things faster. For us, it was not sufficient to just do that; we had to go beyond it. You can always scan and filter, which is what traditional databases do; for us, it was more about how we can eliminate the work that is needed. That's why we added all sorts of indexing: sorted, inverted, range, text, geo, star-tree, Bloom filter, JSON, and we keep adding more. We have built the system in a way that adding these indexes is very, very easy for us. And as I mentioned earlier, we also have this concept of storing the raw events, the pre-aggregate, and the pre-cube under the same abstraction, and all of them are stored in a columnar fashion. The key thing for us is not only to have all these techniques, but also to have an optimizer or planner which can pick the right technique based on the query, so that the user doesn't have to say "use this index" or "use that index." You just run the query and we pick the right index that is there.

For all of the indexes, did you write these from scratch yourselves, or are there off-the-shelf ones you're using?

Most of them are things we built from scratch. The only one we use off the shelf is the Roaring Bitmap.

Yeah, Daniel Lemire's, yeah.

Yes, exactly. Everything else, I mean, we do use underlying libraries; for example, for the geo index we use H3 from Uber, things like that. But the others we ended up building ourselves from scratch.

Yeah, that was my next question. The spatial index, the geo index, is the one where I would be surprised if you built that yourself.

Oh, yeah. No, we don't want to do that.

So, just to illustrate: each of these indexes is a talk by itself, so unfortunately I won't be able to deep-dive into the individual topics; the only one I will focus on is the star-tree index, in the next section. The key thing here is just to demonstrate the power of all these indexes. We took five years of GitHub data to demonstrate this. It has about 2.6 billion rows, it's just a single node that we used, and it's close to 400 gigs of data; you can see the schema on the left. Just to give an idea of what happens with and without an index: the first query is, tell me how many events happened between these two timestamps. Without an index it takes about 32 seconds, and with just time partitioning, without any of the other indexing, we bring it down to about one second.
And if you want to ask how many commits or events a particular user has performed, with just the inverted index you can get it from two seconds down to 15 milliseconds. The others are similar: the range index takes a query from 40 seconds down to three seconds. You use it whenever you want a range filter on a numeric value, say, tell me all the commits that had 100-plus lines changed; an inverted index doesn't really work well for that. You can look at the text index, which answers questions like how many commits had this message. So you can combine all these different mechanisms, and JSON indexing is not shown here, but that is another one we added. Overall, you can look at these techniques, think of them in terms of metrics, logs, and text, and use them to address each of the specific use cases you have in your system. It's very, very powerful. You just point it at the data, ingest it into Pinot, and after that you add and remove indexes on the fly. It's not stop-the-world like ALTER TABLE in typical databases; you just add it, and the indexes keep getting generated in the background.

Do you support joins?

No. One of the key decisions for us was to not go beyond one table. We want to be the fastest single-table query engine, and for joins we end up using Presto on the outside. We have a very powerful connector with Presto: we look at the query plan, see how much we can push down to Pinot, push that down, and the rest of the join is performed in Presto. We also have a Spark connector. So we focus on the performance of a single table with all the indexing and things like that. And we have a streaming connector for Presto, which means it's not waiting for the results to be completely processed and computed on the Pinot side; as they get computed in blocks and chunks, we stream them back to Presto, so you get the benefits of the streaming connector as well.

This is a new thing that we also added, which is JSON indexing. Until now we only supported structured data in Pinot; we have recently added the ability to ingest semi-structured data as well. For example, you can ask, "tell me everyone who lived on Fun Street." Without an index it takes about 18 seconds, and with every field in the JSON being indexed, you can ask pretty much any sort of question, at any depth. For the syntax we use JSON path: addresses[0] refers to the first entry, and addresses[*] means anyone who lived on Fun Street at any of their addresses will match. The syntax is very flexible; you can use JSON path expressions to find the records that match your query.

All right, so that brings us to our star-tree index.

So, are you doing any vectorized processing as you're scanning, say, numeric columns?

Yes. We try to keep things in blocks as much as possible; when we scan, we convert them into block arrays and then try to do vectorized processing.
We don't have SIMD yet, but we try to do the vectorization as much as possible. Everything is in blocks of 10K or 5K.

Okay. And it's written in Java, so actually I have no idea whether you can use SIMD instructions in Java, or whether that affects the portability of it.

Yes, definitely. Okay, all right, keep going.

All right. So let's quickly go back and recap the solutions we talked about: we could use the raw data without an index or with an index, or pre-aggregate, or pre-cube. What we really want is the best of all these worlds. We want the throughput of pre-cubing, but we also want the flexibility of raw data so we can answer any question, even ones that aren't solved by the pre-cube. But we also don't want to pay the high storage cost; we don't want to keep computing all possible cubes and have this explosion in the data. That's where the star-tree really comes into the picture: can we move along this spectrum from one side to the other without paying a lot of cost?

Let me double-click a little on why we actually need a star-tree, because the inverted index is quite powerful for the majority of use cases. If you look at this particular example, with the inverted index you have latencies of two, six, and 120 milliseconds, and as you see, latency is really driven by the number of rows, the selectivity of the query. Some queries can be fast and some queries can be very slow. If you ask, "tell me everything about the flights that came into and went out of Atlanta," that's almost 5 million rows that need to be scanned on the fly; there is no way you can make that fast. Another classic example: you have ads data and you want to know how many impressions there were in the US. An inverted index is not going to help much there, because you have to scan 50% of your data anyway. Your latency is completely decided by the amount of scanning you need to do once the filtering is done. And this high variance is really, really bad for user-facing analytics, because most of the time the values that are very popular are also the ones that get a lot of queries, and you end up seeing that those are the slowest, which is not good for user-facing analytics.

If you want to take one key thing away from this talk, I think this is the core concept of the star-tree index. Instead of saying that every query can take as many milliseconds as it needs, what if we could completely bound any query of this form, select an aggregation of X from the table with any combination of predicates, and say it should never scan more than T records, irrespective of the combination? If you look at these examples, one is scanning one record, another is scanning 91K, and another is scanning about 5 million. What if we said that all of these queries should scan only 10,000 records? That's where the star-tree concept comes in. It's based on the traditional iceberg cubing model, where you compute only partial aggregates; you don't compute the full cube.
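To make that guarantee concrete, here is a tiny illustration using the row counts just quoted (one record, about 91K, about 5 million, plus the 107-million-row full table) and the 10,000-record threshold from the talk: with a leaf threshold T, the star-tree caps the scan for any predicate combination at T, while raw data or an inverted index scans whatever the predicate selects.

```python
# Illustration of the bound, not Pinot code. Row counts are the ones quoted in
# the talk for the airline data; T is the example threshold of 10,000.
T = 10_000

matching_rows = {
    "very selective predicate (1 row)": 1,
    "popular flight number (~91K rows)": 91_000,
    "flights touching ATL (~5M rows)": 5_000_000,
    "no filter at all (full table)": 107_000_000,
}

for predicate, n in matching_rows.items():
    print(f"{predicate:<36} raw/inverted scans {n:>11,}   star-tree scans <= {min(n, T):,}")
```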
And that's the model we used, but it was built for a single node. We took that concept, kept the columnar store, and built it in a distributed model.

This is another way to look at what this threshold really means. On the leftmost extreme, that's basically your raw data: you're not doing any materialization at all, you're keeping the data as it is, and all your queries go there. The problem with that, as we saw, is that the latency varies widely: it can be anywhere from two milliseconds to 100-plus milliseconds or even seconds, depending on the data. Whereas on the other extreme, where the threshold is basically one, you're pre-cubing and computing everything. The beauty of this is that if you look at the curve of storage requirements versus latency, even if you compute only some of the partial aggregates, you can achieve much better latency without having to go all the way to the right extreme. That's where the star-tree comes into the picture. When you configure a star-tree, you specify the threshold you want: for any query of this form, I don't want to scan more than 10,000 records, or 100,000 records, or even one, in which case it does complete cubing. That's the beauty of it: it's a knob that you control, and you can decide how fast you want the response time to be and how much you're willing to give up in terms of additional storage. Any questions here?

But the knob you're exposing: is it "I want my SLA for this query to be 50 milliseconds," or is it just this threshold? How does somebody figure that out?

So typically, for every query we report how many records were actually scanned. Let's say you're getting 100 milliseconds and you see that the number of records being scanned is one million. You can do the math: if you want to get it down by 10x, you set the threshold to one-tenth of the records it's actually scanning.

I think you could do that, but most of your users aren't going to do that, right? I think you're the one who should be doing it.

Yeah, so the idea is to eventually get into the mode of auto-tuning. The first approach is to give the control to the users: if someone knows the workload, they can definitely set it and figure out the right threshold. But the idea for us is, once we understand the data, to have other APIs on top of it which can just "make this query faster." It's really about tuning the system, since everything is dynamic: you're putting something outside the system that is constantly monitoring it and making it faster. You can think, "oh, these types of queries are running very often now, we want to optimize those," and set it up only for those.

Okay, thanks.

So let's look at why this threshold is really important.
This is the graph going from a threshold of almost one, which is the pre-cube on the rightmost side, to no cubing at all on the leftmost side. Earlier we saw that the raw data could only handle about one query per second, whereas the pre-cube can handle about 1,880 queries per second. If you look at the star-tree with a threshold of just a thousand, and look at the data size increase from the pre-aggregate, it's adding only about 50 million records, whereas the pre-cube adds 185 million records, which is about six times the pre-aggregated data. So you get this curve, and you can see that if you want, say, just 100 QPS, you can set your star-tree threshold to 100K and you should be able to get to 235 queries per second. The curve is definitely not linear, but the nice thing is that you don't have to compute all possible combinations; you just set the right threshold, and with that you get the QPS that you really want. At the same time, you are not losing the flexibility: you can add more dimensions, you can evolve the schema, and you can still query the trip ID, you can query on time; you can do pretty much everything. The query engine is smart enough to figure out whether it can use the star-tree or whether it should fall back to the raw data. The idea is to give that control to the user, and it's one system: you can start on the leftmost side, using just the raw data, with one QPS, and as your load increases you add more indexes, or go all the way to the star-tree index to get the performance you need. You don't have to change systems, you don't have to re-bootstrap the data; it's all dynamic, and you continue to use the same system for all your use cases.

One of the benefits we have seen at LinkedIn is that most products start on the leftmost side: when they launch, their QPS is not very high, but as they gain popularity the QPS increases quite a bit. We always had this problem of starting with one system and moving to another system over time as a product became popular. With all these indexing techniques and the star-tree index, we can stay with one system; it's just a configuration change for us.

A question here. When you talk about QPS, is the system running one query at a time, or are they concurrent?

For this setup it is sequential, because that's how we wanted to get the max out of it. It's a single-core setup running sequential queries as fast as possible, so in one second it can answer about 1,800 queries.

Okay, thanks.

Yeah, I'll put that in a footnote on one of the slides; thanks for asking.

All right, so the next part is how we actually generate this index. There are a lot of methods: top-down and bottom-up computation, parallelism, and a bunch of things like that. But at a very high level, the way to think about it is that we take the raw data first and start from a root node.
What you see here is the actual index, and on the left side is the actual columnar data. We take the raw data and first figure out the sort order, and typically the first dimension we pick is the one with the highest cardinality. The reason is to avoid explosion. Think about it in the reverse order: if you remove gender from one of your datasets and compress it, the most you can compress is 50%. Whereas if you remove a very high-cardinality dimension, say trip ID, you get the maximum compression. So the idea is to pick the dimension with the highest cardinality to compress first. Once we sort on that, we create the star node, the green node on the right side, where we remove the country dimension from the data and compress everything. And we continue this process until each node drops below the threshold we specified. That's how we compute this tree.

As we compute the pre-aggregated records, we keep appending them to the actual columnar store, and in each node we maintain what it points to in the columnar data, the start and end of its range. At the same time, we also store some pre-aggregates on the node itself. So we have these two data structures: the star-tree, which keeps getting updated as we scan over the data and build the tree, and every time we create a star node, we add more records to the columnar data on the other side. It's a hybrid structure: you have this B+-tree-like thing on one side and the columnar store on the other, so you're getting the best of both worlds in some way. I'll talk about how we use this index during query processing.

This is how it looks. It has three colors. The blue means the node has more records and will divide further into sub-children. And the star node, if you look at it, points to another set of records which has lost the country dimension. The idea is that this is almost like another materialized view: each star node automatically creates a materialized view, but it will not create all the pre-cubes; it only creates one if the number of records is greater than T. That's where the threshold comes into the picture. In the end, the tree is organized so you have a top-level tree with all the dimensions of your data, with a bunch of star nodes, and they point at the right section within the columnar data, where each node starts and ends.

It's very hard to show this on a slide, but this is how the tree actually looks for the airline data. You can see all the individual nodes created under the first dimension, which was the flight number, and here is the star node; we call the star node the "all" node, so everything under it has lost the flight number dimension. The next dimension that gets picked is the tail number, because that's the next-highest cardinality.
Once we divide further, we get into the destinations here. One of the key things to note is that there are two different kinds of nodes: the blue ones, which open up further, meaning they have more than the threshold number of records, and ones like ELM, which have fewer than the threshold, so they don't divide further; the cubing stops at that point. We have built a viewer for this, so for any index you can just point at it and see how the tree is built, and each of these leaf nodes has a start and end that point at where the actual data starts and ends.

That brings us to query execution. There are many different types of queries, so which node is used to answer each one? The first one: say we want the number of page views where country = EL. It starts at the root, looks at the filter predicates, and takes the first branch because there is a predicate on country. It gets to that country's node, sees that it has reached a leaf and there is nothing more to do, and uses that subset of records for query processing. If it can answer from that and there is no other filtering, it just uses that node to answer the query, so you get the answer very quickly. Take the second query, where country = CA: in that case the CA node doesn't have the answer directly, so it goes into the next node, which is the star node, and picks up the answer from there. Similarly, the Chrome query doesn't need any country, so it takes the star node first and then goes into the Chrome node, which has the answer either pre-computed or available in a small number of records, and it answers the query from that. The idea is that you use this tree to figure out which node to use, walk down to the end, and pick the right node to answer your query. And as you see on the right side, any query of this pattern will always scan fewer records than the threshold we defined, because it always goes to a leaf, and every leaf has fewer than the threshold number of records. That's the high-level idea of how the star-tree is structured and how we use it for query processing.

One of the cool things we added recently is that you can generate multiple trees. You can say, "I want this tree for only this set of dimensions, and that tree for another set of dimensions," and you can dynamically add and remove these trees as well. For example, you can have one star-tree that covers country, browser, and time, and any query with a combination of country and browser will hit star-tree A; and another one with OS, device, and time, and anything that comes with OS and device will go there. If something comes with country and OS together, it might end up going to the raw events, or you can create another tree for that, or a default tree with all the dimensions. So you have a lot of flexibility in picking and choosing the common dimensions that you query often.
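Here is a toy sketch of the traversal just described, in Python rather than Pinot's Java: at each level we follow the child for the predicate value if the query filters on that dimension, and otherwise the star child, which has that dimension aggregated away. The tree shape, dimension names, and document ranges are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)   # value -> Node, plus a "*" child
    doc_range: tuple | None = None                  # (start, end) into the columnar segment

def route(node: Node, split_order: list, predicates: dict) -> Node:
    """Walk the tree: use the predicate's child when the dimension is filtered,
    otherwise the star child, which has that dimension aggregated away."""
    for dim in split_order:
        if not node.children:
            break                                   # reached a leaf early
        node = node.children[predicates[dim]] if dim in predicates else node.children["*"]
    return node

# Toy tree split on country then browser (values and ranges are made up).
leaf = lambda lo, hi: Node(doc_range=(lo, hi))
tree = Node(children={
    "CA": Node(children={"Chrome": leaf(0, 40), "*": leaf(40, 90)}),
    "US": Node(children={"Chrome": leaf(90, 200), "*": leaf(200, 350)}),
    "*":  Node(children={"Chrome": leaf(350, 500), "*": leaf(500, 700)}),
})
order = ["country", "browser"]

print(route(tree, order, {"country": "CA"}).doc_range)      # (40, 90): CA, all browsers
print(route(tree, order, {"browser": "Chrome"}).doc_range)  # (350, 500): Chrome, country removed
print(route(tree, order, {}).doc_range)                     # (500, 700): the grand-total records
```

Each leaf either holds at most T raw records or a pre-aggregate, which is what gives the bounded-scan property described earlier.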
And you always have the raw events to answer anything that doesn't fit in the tree. The planner is smart enough to know the best tree that needs to be picked to answer a query; it does some evaluation to pick the best one.

So how does it pick?

The way it picks is based on the query itself: it looks at what columns are needed, what the group-by ordering is, and which columns appear in the query.

I'm assuming it's heuristics, right? Like, "if it needs this and I have it, use it." It's not a cost-based search?

If there are multiple trees that can solve the query, it uses the smallest tree. Not yet, but the idea is that it's not doing dynamic programming; the cost is too high to do that.

How often are the queries extremely similar? Like in your case, looking at the airlines by country code: imagine you just replace the value of the country-code parameter over and over again, so you could do heavy caching and prepared statements.

Yeah, we haven't done that. It's definitely possible to push it beyond this.

Are you doing any kind of cost-less rewriting? Like, A = 1 and B = A, you can rewrite that as A = 1 and B = 1, something like that. Do you do any of those tricks?

No, not yet.

So that's a high-level overview. There is a lot more we can get into on the star-tree; it's a talk by itself, but I had to spend some time getting all of you up to speed on Pinot itself. In terms of what's coming up next: we do have an upsert feature that we added to Pinot recently, so the data can be mutated. One of the new things we are adding is partial updates: the incoming updates don't have to carry all the columns; we look up the previous value and update it. The second, and the most often asked for in the community, is a native object store. Right now a Pinot segment is local to the node serving it, accessed through the local file system APIs; what we will do is store the segments directly in S3 and access them there, so a segment doesn't have to be local, it can be remote. That will be one of the cool things we will be working on, and if anyone is interested in contributing to this project, we would love for you to get in touch. Geo indexing is the next one; the Uber folks really wanted it for the Uber Eats application, so you can start doing geo queries like "tell me how many riders are in this area" or "how many Uber Eats orders are available," things like that. And we will continue to optimize the Presto/Trino and Spark connectors. As I mentioned, our goal is to focus heavily on the single-table implementation of Pinot, providing all sorts of indexing and optimizations, and for joins, window functions, and nested queries we will rely on Presto to do the job.

And of those three, Pinot, Presto, and Spark, which one is used the most, besides Pinot obviously?

PrestoDB is the one that's used most. In terms of the connectors, Pinot is also getting a lot more adoption right now, so I would say Presto and Pinot together are probably the most often used.

Right, thank you.
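For completeness, here is a toy sketch of the build side described earlier, again in Python and not Pinot's actual implementation: split on the highest-cardinality remaining dimension, add a star child with that dimension aggregated out, and stop splitting once a node holds at most T records. The dimension names, records, and the tiny threshold are made up for the demo.

```python
# Toy sketch of star-tree construction (not Pinot's implementation).
from collections import defaultdict

T = 2  # leaf threshold, tiny just for the demo

def aggregate_out(rows, dim):
    """Collapse rows by removing `dim` and summing the metric."""
    agg = defaultdict(int)
    for dims, metric in rows:
        agg[tuple((d, v) for d, v in dims if d != dim)] += metric
    return list(agg.items())

def build(rows, dims):
    node = {"count": len(rows), "children": {}}
    if len(rows) <= T or not dims:
        node["records"] = rows        # leaf: at most T records, or no dimensions left
        return node
    # split on the remaining dimension with the highest cardinality
    dim = max(dims, key=lambda d: len({dict(r[0])[d] for r in rows}))
    rest = [d for d in dims if d != dim]
    groups = defaultdict(list)
    for r in rows:
        groups[dict(r[0])[dim]].append(r)
    for value, group in groups.items():
        node["children"][(dim, value)] = build(group, rest)
    # the star child answers any query that does not filter on `dim`
    node["children"][(dim, "*")] = build(aggregate_out(rows, dim), rest)
    return node

rows = [((("country", c), ("browser", b)), views) for c, b, views in
        [("US", "Chrome", 3), ("US", "Firefox", 1), ("CA", "Chrome", 2),
         ("CA", "Chrome", 4), ("MX", "Safari", 5)]]
tree = build(rows, ["country", "browser"])
print(tree["children"][("country", "*")]["count"])  # 3 pre-aggregated rows, country removed
```

In the real index these pre-aggregated rows are appended to the same columnar segment, and each tree node just records the start and end document IDs of its slice, which is the hybrid tree-plus-columnar layout described above.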
Hi, I'm Rohan, I'm a student in Andy's lab. I had a small question. You mentioned that you use this with real-time data also, right? So when you update a raw data node, there will be multiple upward paths in the star-tree that you need to update, because all of the different aggregates that the raw data node contributes to change. In that case, how do you achieve concurrent operations in the star-tree? Most B+-tree concurrency schemes work when the operations go down one path, but if they propagate upward as well, there can be a lot of conflicts with locking and so on.

Yeah, so two things. One: the star-tree is never built on a mutable segment. Remember, I had the concept of mutable and immutable segments. On the mutable segment we don't generate the star-tree; it's just too expensive, as you mentioned, and not worth the cost. It's only the most recent segment, and only when it gets flushed, when it gets converted into an immutable segment, do we generate the star-tree. The star-tree itself is scoped to a segment; it's not built across segments. And the second part: on the upsert side, we haven't supported the star-tree yet when there are upserts; it's only for immutable data that the star-tree is available. Hope that answers your question. Also, each segment's write path is single-threaded, so there is no concurrency in the write path; there are not multiple writers going to the same segment. We simplified that at the application layer instead of trying to handle all the concurrency within the star-tree.

Thank you very much.

And the JDBC adapter is available right now. So, it looks like someone had a question. From the chat, or? Someone raised a hand. Oh, Cecilia, go for it. Maybe the hand shouldn't have been raised; I'll lower her hand, and if she wants to come back, she can come back. Okay.

The other part is that we want to go beyond Kafka. Kafka is the only stream source supported right now, so we are planning to add Kinesis and Pub/Sub connectors as well. There are some changes in how we plan to handle Kinesis; the PR is in progress, and Pub/Sub will be the next one we take on. That brings us to the question-and-answer session. I do want to call out that there is another meetup coming tomorrow where we will be talking more about upserts and JSON indexing and how they are getting used at Uber for the Uber Eats app. Feel free to reach out to us on Twitter or on the Slack channel; most of our community is on Slack, and we'll be happy to help you with your use case. As you've seen, with so many different indexes, knowing when to use what is definitely a challenge. It's not trivial, but we are happy to help there.

All right. Awesome. I will clap on behalf of everyone else. Great talk. We have time for a few questions; if you have a question, please unmute yourself and go ahead and ask it. Okay. Good. Very selfish of me, but I'm going to ask one: what is the main bottleneck for you guys now? You have a bunch of indexes.
But, assuming everything's in memory, what's the critical path for you? What's the thing that's slowing you down? Is it the JVM itself, the APIs, reading the data from disk? Sorry, S3 is a totally different thing; that's always going to be slow, that's not your problem. What's the main bottleneck for you right now?

So for us there are two things. One is the forward-index scan itself: since we use indexes, most of our access patterns are random, not sequential.

Got it.

So our key challenge is how we trade off compression and decompression at runtime against the random lookups, and how we organize the data in a way that minimizes the number of random lookups. Is sorting going to help? Should we reorder the data, since ordering doesn't really matter from the user's point of view? We do some of these things: we have the concept of a partition column and a sorted column, so you can specify the columns you want the data sorted on, and when we flush the segment we re-sort and reorganize. We play a lot of tricks with data locality, but we are not in a state where we can do that automatically; the user has to know a little bit about the use case to tell us how to organize the data. For example, for Who Viewed My Profile, we sort the data on the viewee ID, so that when your Who Viewed My Profile page opens up, we don't even use the inverted index; it's just the sorted index. We go to that location directly, and we see that filtering by scanning a sorted range is actually much faster than filtering with the inverted index. But for that to happen, we have to sort the data, and we do that dynamically. So the data-locality part is one thing.

The second one is the trade-off on compression: do we compress more and pay the cost at runtime, or compress less and make the queries faster, where the storage is more expensive, which leads to more thrashing of the page cache? Those are some of the trade-offs we are working on. And the third is on the CPU itself: we are seeing some cache misses in the L1 and L2 caches, so we want to figure out, when the cardinality is very high, how to make sure there aren't a lot of misses in the L1 and L2 caches. But that's the next level that we want to get into.

Okay, all right. There's a question in the chat from Steve Moy. He asks: is the internal column format based on Apache Arrow, considering you're trying to integrate with Presto?

So the internal format is not Arrow, unfortunately, because we built this well before Arrow existed. In fact, our internal format has a lot of the things that Arrow has, but it's memory-mapped; it's not an in-memory format like Arrow, and we can't really bring that many segments into memory. Arrow is still a very good format for the wire, but for this we make sure that we can mmap and don't have to deserialize anything; we can read the data directly. So we have our own Pinot segment format, but we support all sorts of access patterns there: multi-value, single-value, lists, keys and maps, all those things.
What was the second part of the question?

Whether you'd use Arrow on the wire, because, as you mentioned, you're trying to integrate with Presto/Trino and Spark.

Yes. If they're using Arrow on the wire protocol, then you can do the same thing. Okay. Okay, any last questions?