So, today I will be presenting the evolution and scaling of the feature store at Uber. I'm Dhupya Nagar, I work in the feature engineering team at Uber, and we manage features: how to scale them, how to serve them, and the feature store architecture in detail. Before going into all the details, I will go over the different terminologies I'm going to use in this talk. So, this is my team. We basically manage everything around feature infrastructure at Uber.

This is the agenda for today's talk. We will cover what machine learning is, what a feature is, how features get generated, how we store them, how we share them across different models, and some feature store operations like offline serving, online serving, model training, and feature transformations. We'll go into the details of all of these.

So, this is a machine learning model, a very high-level overview. We have data, we have a model. We use that data to train the model, and then we serve predictions from this model. Whatever data we get back in return goes into the loop, and the model gets retrained with the new data. Online and offline here refer to when the data is used. Offline is the data used at training time, so if I say offline serving or serving at training time, they are the same thing. Online serving is serving at prediction time.

So, this is event data and features. This is basically what event data looks like. You have certain orders coming into the system. We do some aggregations and transformations, and then we generate features based on that. For example, in the left table we have two orders received on the 9th, and the right table denotes how many orders the store received on that particular date. That's the difference between the raw data and what features eventually look like.

At Uber, we have two primary types of features: batch features and real-time features. Batch features have very low granularity. For example, if you want to know how many orders a store received in the last one day, the last 30 days, or the last 15 days, that's a batch feature, which involves large data processing. We process all this data on a daily basis and then we store it. Real-time features, as the name suggests, carry real-time information: for example, what activity happened in the last one minute, two minutes, five minutes, one hour, anything less than a day, down to a granularity of a minute. It's basically a continuous stream of data which we ingest into our stores and use to serve predictions.

Feature structure: this is how we basically store features. The reason to do it this way is that we have tens of thousands of features, and managing each feature individually is difficult. So we group them at an entity level. An entity can be anything, basically an Uber entity, for example stores, eaters, orders, different things.

Palette feature store: this is what we call Uber's feature store. The feature store is not just limited to serving and collecting the data. There are a lot of other things we do as part of the feature store, like making sure features are shared across different models, automatic feature selection, managing the streaming and ETL pipelines, data monitoring, data quality, and feature transformations. All of these are also part of this huge Palette store, or feature store, infrastructure.
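As a concrete illustration of the event-data-to-feature aggregation from the example above, here is a minimal PySpark sketch that turns raw order events into a "orders received per store per day" batch feature. The table and column names are hypothetical, not Uber's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_order_feature").getOrCreate()

# Raw event data: one row per order (hypothetical schema).
orders = spark.createDataFrame(
    [("store_1", "2023-09-09"), ("store_1", "2023-09-09"), ("store_1", "2023-09-10")],
    ["store_id", "order_date"],
)

# Batch feature: how many orders each store received on each date.
daily_orders = (
    orders.groupBy("store_id", "order_date")
          .agg(F.count("*").alias("n_orders_1d"))
)

daily_orders.show()  # e.g. store_1 / 2023-09-09 -> 2, store_1 / 2023-09-10 -> 1
```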
These are just some numbers for what we do and what we manage. We have tens of thousands of features. We serve hundreds of millions of QPS, with a P99 of less than 10 milliseconds, so the SLAs are pretty tight there. Thousands of feature pipelines, and hundreds of terabytes of data processed to train a single model. To manage this scale, moving forward I'll go through the different scalability problems we have seen and how we overcame those challenges.

So, this is just a very high-level architecture. I have abstracted out more details here, but this is basically what it looks like. We have streaming and aggregation jobs. Those jobs get real-time data and perform the aggregations, then the data is ingested into Kafka. From there, we have an ingestion service which writes this Kafka data into Cassandra. We also have a Hive ingestion, which writes the same data to a Hive table. And for batch features, we have large aggregation queries: those compute jobs run, they write the data to Hive, and from Hive we disperse that data back to Cassandra. This way, we make sure that both batch and streaming data remain available in both the offline and the online stores, from where we serve them for training and inference.

Problems at scale: we divide these problems into four parts. Feature onboarding, offline serving, online serving, and online-offline consistency. Basically, how do we onboard hundreds of features with such a small team? How do we perform these large joins? How do we serve hundreds of millions of QPS and manage the infrastructure around that? What is online-offline consistency, and how do we resolve these problems?

So, let's talk about feature onboarding. In order to onboard hundreds of features, the first thing we realized is that we need to reduce the intervention from our team and make the platform as self-service as possible. What we have done is automate everything in this architecture. Whenever users want to get a new feature or a feature group, they simply write a JSON config. They update the JSON config, and then we run some validations on these configs to make sure everything is correct and that it is not breaking the existing stuff. Once all those checks are done, the JSON config goes to production, and this is what it does: it triggers a set of jobs. It triggers the ingestion jobs, the dispersal jobs, the data quality tests, and it also sets up monitoring and alerting. All of these things are automated. The green ones in the diagram are the ones that are fully automated, and the yellow ones are mostly self-serve. For example, once your feature is ready and you have trained your model, you can create your inference server on your own and just start sending prediction requests. So you don't really need much from us at the time of batch feature onboarding. This automation basically allowed us to reduce the overhead on our team and give a better experience to our users.

The second problem I'll go into is offline serving. So, what is offline serving? This is a very high-level overview of what it looks like. We have certain features, and we have base data. Base data is basically your labels, plus compute configs, because we need to run a Spark job to perform all these large joins.
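Going back to the onboarding flow for a moment: to make the self-service idea concrete, here is a sketch of what such a feature-group config might look like. The field names and values here are hypothetical illustrations, not Uber's actual config schema.

```python
import json

# Hypothetical feature-group config a user would submit. A validated version of
# this single file is what triggers the ingestion jobs, dispersal jobs,
# data quality tests, and monitoring/alerting for the group.
feature_group_config = {
    "entity": "store",
    "feature_group": "store_order_stats",
    "features": ["n_orders_1d", "n_orders_7d", "n_orders_30d"],
    "source": {"type": "batch", "hive_table": "orders.daily_agg"},
    "stores": {"offline": "hive", "online": "cassandra"},
    "owner": "feature-eng-oncall",
    "tier": 1,
}

print(json.dumps(feature_group_config, indent=2))
```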
So, we take all this information as input and pass it to the orchestrator, which eventually starts a Spark job. This Spark job basically runs a set of joins: it joins the base data with your features, and then it generates a training data set. That training data set is used by the ML models. So, this is what the structure looks like.

Now, the fundamental problems with offline serving are these: how do we speed up the joins? How do you do a cascading join over hundreds of terabytes of data? How do we reduce out-of-memory Spark errors? How do we make it easy for users to write optimal Spark configs? And how do we reuse computations? Even when we do all these optimizations, some jobs do fail. How do we make sure that we reuse all the computations which were already done in the past and not do them again?

There are four fundamental optimizations we have done for the offline serving infrastructure. The first is batching. To understand batching, we need to go into the details of the Spark sort-merge join. When we join two tables, Spark will do a lot of data shuffling to make sure all the matching data is available on the same executors, then it performs the join between them and returns the output. When you perform a join over a very large data set, these kinds of jobs become more failure-prone; your job will easily run out of memory or out of disk. So what we have done is a batching optimization. Let's say you want to perform a join over 30 days of data and it's too large to handle. We can divide this data into smaller batches and process these batches one by one. We can test on these batches and fine-tune the configs for them, and the jobs become more manageable.

The next thing we have done is auto-config. What we have seen in the past is that it's not very easy to configure a Spark job correctly. The reason is that you basically need three things to run a join operation: you need the features, you need the labels, and you need the compute configs. In order to write correct compute configs, you first need to know how Spark operates and which Spark configs can actually impact your job, for example how many shuffle partitions to have, how many cores to assign per task, how Spark handles shuffle data and cleans it up, all these things, and only then can you write optimal compute configs for the job. We have seen that a lot of our customers were not able to write these configs correctly, and hence a lot of their jobs failed. So we have written an auto-config module. It takes all these input parameters, looks at your features and your labels, figures out how large the data sets you are joining are, and it also knows what kind of operations you are performing. Based on that, it modifies the compute config on the fly. If it thinks your compute configs will not work and you need more resources, it automatically updates the config to give you more resources so the job runs successfully. And if it thinks the compute configs you have provided are more than you need, it reduces the resources to make sure they are used efficiently.
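As a rough illustration of the auto-config idea, here is a sketch that derives a few Spark settings from the estimated input size. The heuristics (200 MB per shuffle partition, 4 GB executors, the 10-500 executor range) and the function itself are hypothetical, not Uber's actual tuning rules.

```python
def auto_spark_config(feature_bytes, label_bytes, user_config=None):
    """Sketch: scale Spark settings with the estimated size of the join inputs,
    overriding only what the user did not set themselves."""
    total_bytes = feature_bytes + label_bytes
    config = dict(user_config or {})

    # Aim for roughly 200 MB per shuffle partition so no single task is too large.
    partitions = max(200, total_bytes // (200 * 1024 * 1024))
    config.setdefault("spark.sql.shuffle.partitions", str(partitions))

    # Scale the executor count with data volume, within a sane range.
    executors = min(max(total_bytes // (8 * 1024 ** 3), 10), 500)
    config.setdefault("spark.executor.instances", str(executors))
    config.setdefault("spark.executor.memory", "4g")
    return config

# Example: joining ~2 TB of features with ~100 GB of labels.
print(auto_spark_config(2 * 1024 ** 4, 100 * 1024 ** 3))
```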
The next thing we will go into is the details of the incremental join. We have seen batching before. When we perform a cascading join, this is what the join looks like: a model can be trained on multiple features, and each feature can belong to a different table. So we get all these tables and join them with your base, or label, data set. What happens is that every time Spark joins these tables, it shuffles both tables and produces an output; it then reshuffles that output together with feature table 2, does the join again, and returns the final output. This happens for every batch. Once it has done all these operations for every batch, we do a union at the end and return the joined data set.

Now, what is the problem with this? The way Spark works, it will not delete shuffle data until the whole job lineage has completed, which means that until the union is finished, it keeps all the shuffled data around in either memory or disk, and that requires a lot of data to stay in memory or on disk. So you need a lot of resources to perform these joins. What we have done to overcome this is remove the union. This is what we call incremental join, or incremental materialization. Instead of keeping all this shuffled data around, we simply persist the data per batch and explicitly go and delete the shuffles. So the only resources you need to run a large join are as much as a single batch needs. The downside is that your job can be slower, because we are adding this extra step of writing to HDFS, but it does make large data set jobs more reliable. We write this data to HDFS anyway, and I'll go into why in a moment, but this basically allowed us to run our jobs more reliably. You can literally perform a join over one year's worth of data and it will still succeed.

The next optimization we have done around the join operation is the resume job. What is a resume job? Let's say you are performing a join over a large data set worth a month or three months, your job went halfway through, and then it failed. It can fail for multiple reasons: external dependencies, bad configuration, someone else took the disk you were using, hundreds of reasons. So we provide a functionality to resume the job. You divide the work into batches, you process some of the batches, and then it fails. You can retry the same job, and since, as I showed, we persist the data in HDFS, you can just resume your job from where it actually failed. We keep these checkpoints. This allows us to not recompute a lot of data when it is not required. So these are further optimizations we have done to run our feature preparation more reliably.

Next we will look into online serving. This is what online serving looks like. Whenever you open the Uber Eats app, you will see some restaurants, and for each restaurant or recommendation there are tens of models being used, and each of those models uses tens of features. So this fan-out grows very quickly for us, and that is what causes hundreds of millions of QPS on the feature store. This is a high-level overview of our online serving; it is basically the standard serving diagram. You ingest the data, and the data is served.
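Before moving on to the online serving details, here is a minimal sketch of the incremental-materialization-plus-resume idea described above: each batch is joined, persisted to HDFS, and skipped on retry if its output already exists. All paths, table names, and the batch scheme are hypothetical, not Uber's actual orchestrator.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental_join").getOrCreate()

labels = spark.table("ml.training_labels")                 # store_id, ds, label (hypothetical)
feature_tables = ["features.store_order_stats", "features.store_ratings"]
batches = ["2023-09-01", "2023-09-02", "2023-09-03"]        # one batch per day here

def already_done(path):
    """Checkpoint test: was this batch already materialized by a previous run?"""
    try:
        spark.read.parquet(path).limit(1).collect()
        return True
    except Exception:
        return False

out_paths = []
for ds in batches:
    out = f"hdfs:///tmp/training_set/batch_{ds}"
    out_paths.append(out)
    if already_done(out):
        continue  # resume: skip batches persisted before an earlier failure
    batch = labels.filter(F.col("ds") == ds)
    for t in feature_tables:
        batch = batch.join(spark.table(t), on=["store_id", "ds"], how="left")
    # Materialize each batch to HDFS so Spark can drop its shuffle data,
    # instead of holding every batch's shuffles until a final union.
    batch.write.mode("overwrite").parquet(out)

# The training set is simply the per-batch outputs read back from disk.
training_set = spark.read.parquet(*out_paths)
```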
So we will look at optimizations on two fronts: how we optimize the ingestion and how we optimize the serving. Let's first go into feature ingestion. We have listed four things here; three of them are dispersals, and dispersal is basically reading data from Hive and writing it to Cassandra or any other online store. Caching is caching the data during online serving. We'll go through the dispersals one by one.

The first is efficient dispersal. We again utilize Spark to read this data from Hive and write it to SSTables. Now, if you just run a Spark job and write this data to SSTables, what ends up happening is that every Spark executor writes data to every Cassandra node. This causes a lot of small files, data fragmentation issues, and more. So what we do is mirror the partitioning strategy between Spark and Cassandra, which allows us to write data to specific SSTables and avoid a lot of fragmentation within our system.

The second part is delta dispersal. Whenever new events come in, what our users do is take a snapshot of the data, and for every day they disperse that full snapshot from Hive to Cassandra. With delta dispersal, instead of writing everything, the full snapshot, to Cassandra, you only write the data which actually changed. In these two tables, you can see that we don't really need to disperse the data from day two and part of the data from day three; we can write only what actually changed. This reduces the size of the dispersal significantly. Right now we do this on a row basis: if there is no change in your row, we skip dispersing it. Another optimization along the same lines, which is in the pipeline and we haven't really started on yet, is to do this at the level of column changes. What ends up happening is that when you aggregate the data over a long period and there are many different types of features in the same group, every row will more or less change every day, so detecting and dispersing column-level changes is more optimal than just row-level differences.

The next thing is selective dispersal. We have an offline store, which is Hive, and we have an online store, which is Cassandra. Not all models want all the features available in every store. Sometimes we disperse the same feature group into multiple clusters due to isolation or different SLA requirements, and at that point we don't want to disperse everything everywhere. So we allow selective dispersals: instead of dispersing everything, you can disperse only the features you are actually using in your model. This reduces the dispersal size significantly, and it also optimizes storage cost.

The last part of the online serving optimizations is caching. Caching is more or less straightforward. We have a local cache, a Redis cache, and Cassandra. If your data set cardinality is low, for example if you are using city-level features where we only have a few thousand cities, you can just cache it locally. The local cache mostly comes for free because we use the memory of the inference servers.
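Returning to delta dispersal for a moment, here is a rough sketch of the row-level delta described above: compare today's snapshot against yesterday's and hand only the changed rows to the online store. Table names are hypothetical, and the write goes through the open-source Spark Cassandra connector for illustration; Uber's actual dispersal writes SSTables aligned with Cassandra's partitioning, as described earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta_dispersal").getOrCreate()

# Today's and yesterday's full feature snapshots in the offline (Hive) store.
today = spark.table("features.store_order_stats").filter("ds = '2023-09-09'").drop("ds")
yesterday = spark.table("features.store_order_stats").filter("ds = '2023-09-08'").drop("ds")

# Row-level delta: keep only rows whose contents changed since yesterday,
# so the online store only receives what actually needs to be rewritten.
delta = today.exceptAll(yesterday)

# Disperse the (much smaller) delta instead of the full snapshot.
# Assumes the Spark Cassandra connector is on the classpath.
(delta.write.format("org.apache.spark.sql.cassandra")
      .options(keyspace="features", table="store_order_stats")
      .mode("append")
      .save())
```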
If you have hundreds of GB of data, probably something like a store-level feature, then we use Redis, and if it is an eater-level or a driver-level feature, where we have millions or billions of rows, we use Cassandra. The local cache and Redis are much, much cheaper than Cassandra, so this helps us optimize the Cassandra cost.

Online-offline consistency. Let's first understand what online-offline consistency is. In this diagram there are batch features and real-time features. When you train a model, let's say you train it based on yesterday's data. Then when we serve the model, we also need to provide yesterday's data; only then will the data match what the model expects. The same thing happens for real-time features: if you train the model on the last minute's data, then at serving time you also need to serve the last minute's data.

Let's understand this with an example. Say you trained a model to predict how many orders a particular store will get today, and you created the model using data from yesterday, so September 2 to September 3. Now your model is trained and ready for serving, and this is what happens: you are serving on September 9, and the data the model expects is from September 7 to September 8. But we cannot serve the September 7 to September 8 data because we don't have it. It takes some time for data to get ingested in the pipeline; there are ingestion delays, and large data sets can take a few hours to get fully ingested into Cassandra. So instead we serve what we have available, which is the previous day, September 6 to September 7. This creates a consistency issue between online and offline.

Now, how do you resolve this? One way is to just train your model with somewhat older data: instead of training with yesterday's data, you train with the day before yesterday's data. But for some models that's not acceptable, and they want fresher data at serving time. So it's a trade-off between higher consistency and more freshness. For example, if your model is doing fraud detection, then you want fresher data and you can potentially compromise on consistency. To support this, we allow users to get time-shifted features. When we serve the features, we take the feature name as an input, and you can also tell us how far shifted in time you want the data. For example, time shift zero basically means "give me the previous day's data"; the day before that would be time shift one, then time shift two, and so on. You can request any of these, and you can actually request all of these time-shifted features. It works the same way for both batch and real-time features, except that for real-time features the time shift is in minutes instead of days. So you can just say, hey, give me the activity of x, y, z for the last minute, or the last five minutes, or the last ten minutes, and you can keep shifting, and based on all those parameters you can make predictions. That's how we resolve online-offline consistency, and the model owners can decide what they want to prioritize.
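To illustrate the time-shift convention, here is a hypothetical client-side sketch; the function and the in-memory "store" are made up for illustration and are not Uber's actual serving interface.

```python
from datetime import date, timedelta

def resolve_time_shift(feature_values_by_day, as_of, shift_days):
    """Return the feature value for `as_of` shifted back in time.

    `feature_values_by_day` stands in for the online store: a mapping of
    date -> feature value. shift_days=0 means "the previous day's data",
    shift_days=1 the day before that, and so on, matching the talk's convention.
    """
    target = as_of - timedelta(days=shift_days + 1)
    return feature_values_by_day.get(target)

# Example: the store only has data up to September 7, so a model that was
# trained on "yesterday's" data can request time shift 1 to stay consistent.
store = {date(2023, 9, 6): 41, date(2023, 9, 7): 37}
print(resolve_time_shift(store, date(2023, 9, 9), 0))  # None: Sep 8 not ingested yet
print(resolve_time_shift(store, date(2023, 9, 9), 1))  # 37: Sep 7 data is available
```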
So, feature metadata and data quality. As I said, apart from online and offline serving, we also utilize Uber's data quality framework to make sure that all the feature pipelines are running on time, that the data being generated is correct, that failures are detected early in the process, and that accountability is enforced: who owns this data, and whom to alert when something fails.

There are four components of feature metadata and data quality. The first is unified metadata. We have a metadata store where you can find every piece of information about a feature: who owns it, how it is generated, what type of feature it is, where it is stored, where it is available online and offline, which pipelines generate it, and its lineage, that is, what those feature pipelines depend on. If it's an offline feature, you will see a lineage graph showing all the tables it depends on; if it's a real-time feature, you will see the set of jobs generating that data.

The tiering of a feature is based on its use cases. For example, if a particular feature is being used in very critical use cases, it will be tier one. We have tiers one through five, where tier five is the lowest criticality, and based on what tier your feature is, we register different data quality tests. Again, this framework is automated; the tiers get updated automatically. For example, if you have a feature group being used in a tier-one model, the tiers will get updated automatically based on what type of model is using those features. They get downgraded automatically, they get upgraded automatically, and whenever this upgrade or downgrade happens, the tests also change. For tier-one features we have stricter SLA requirements: the data has to be fresh, it has to be complete, data loss should be zero, duplication should be zero, and all those kinds of SLAs need to be in place. We don't enforce the same SLAs or tests for low-tier feature groups, say tier four or tier five.

Apart from metadata and data quality, we also have feature store search. With so many features, it becomes difficult to find them or manually navigate to them. So we have a search store: you can search by entity, feature group, feature, specific name, and also by databases, sources, and a lot of other keys.

These are some of the areas we are exploring. Embedding vector data type support: until now we did not have any vector data type support in the feature store, so we are adding that. Better storage: in general, in a feature store the data doesn't change very frequently; once the data is ingested into the online store there are no updates, so it's mostly immutable. We are looking into immutable key-value stores which could bring down the cost of Cassandra, or replace it with something else. Feature versioning is part of the same effort. What ends up happening is that when you store data in a key-value store, you need to assign some sort of TTL, and if your feature pipeline fails for any reason, the data basically times out and then you have nothing to serve. Instead, what we would prefer is a versioning system where one version of the feature remains available until the new version is ready, so we always have something to serve.
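A minimal sketch of that fallback behaviour, under the assumption of a simple key-value layout with an explicit "latest version" pointer; all names here are hypothetical, not an existing Uber system.

```python
class VersionedFeatureStore:
    """Toy model of versioned feature storage: readers always see the latest
    fully written version, so a failed pipeline run never leaves nothing to
    serve (unlike a TTL, where stale data simply expires)."""

    def __init__(self):
        self._versions = {}   # (feature_group, version) -> {entity_key: value}
        self._latest = {}     # feature_group -> latest complete version

    def write_version(self, feature_group, version, rows):
        # Write the full version first, then flip the pointer at the end.
        self._versions[(feature_group, version)] = dict(rows)
        self._latest[feature_group] = version

    def read(self, feature_group, entity_key):
        version = self._latest.get(feature_group)
        if version is None:
            return None
        return self._versions[(feature_group, version)].get(entity_key)

store = VersionedFeatureStore()
store.write_version("store_order_stats", 1, {"store_1": 41})
# Even if version 2's pipeline fails before write_version completes,
# reads keep serving version 1 instead of timing out.
print(store.read("store_order_stats", "store_1"))  # -> 41
```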
Then feature intelligence: we do have automatic feature search, and we want to enhance this further. And near-real-time features, which covers aggregation infra and backfills. By aggregation infra we mean the following: right now we rely on external frameworks, like Flink and Athena, to do the aggregations for us for real-time features. You can write any custom job in Flink and generate the data to Kafka. We want to provide an aggregation framework as part of the feature store which lets you do basic aggregations with JSON configurations. And backfills: we do have backfills for batch features, but it's not very straightforward to backfill real-time features, so that is another piece of work we have in the pipeline. That's all, thank you. Any questions?

Could you talk more about how the recent advances in vector spaces and embeddings tie into your legacy feature store systems?

Sorry, I did not hear the last part of the question.

I just want to know how recent vector databases and vector embeddings, given their popularity, tie in with a legacy system like you demoed today.

Okay, so I think the way they tie in with the legacy systems is that when we added embeddings, we also had to add support for embedding types. Until now, the stores we are using don't really have support for array data types, and most embeddings are arrays. So what we were doing until now is taking them as a string input, and at serving time we would transform the string back into a list of values. That is not very optimal, and it is also what causes a lot of latency issues at runtime. Now, as we see more and more use cases wanting to use embeddings, the shift we have seen is more effort moving in the direction of some sort of embedding store where people can reuse embeddings and store them in their native format. So we are exploring multiple different vector databases; we already have an in-house solution called CIA to serve embeddings, so we are exploring all those options to create a new storage system for that.

I think that's it, no more questions. Okay, thanks everyone.