The Databases for Machine Learning and Machine Learning for Databases seminar series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google, and by contributions from viewers like you. Thank you.

Hi everyone, welcome to another seminar from the Carnegie Mellon Database Group. Today we're excited to have Simba Khadder. He's the CEO and co-founder of Featureform, the virtual feature store company. He's here to tell us all about what a feature store actually is and how to build one that can scale. As always, if you have any questions for Simba as he gives his talk, please unmute yourself and fire away at any time; we'd rather this be a conversation than him talking by himself for an hour. And with that, Simba, the floor is yours. Thank you so much for being here with us.

Of course, yeah, thanks for having me. Hey, everyone. Today I'm going to be talking about feature stores. It's one of those concepts that gets hyped up a lot, especially in MLOps and ML infrastructure, but I think there's a fundamental "what is this thing?" question to be answered, and then "how do you build one?" So I'm going to really dive into that. I'll go through the different types of feature stores and try to define what a feature store is. I'm going to keep it more technical, given the audience, and I'm specifically going to highlight a few technical challenges that we had to overcome and how we overcame them. The way Featureform is built, it looks more like an orchestrator, and the thing with orchestrators is that it's typically not one super hard problem to solve so much as death by a thousand paper cuts: there are many, many problems to solve, and the question is how you get all of them to work together and build an architecture that scales to all the different use cases people would expect from an orchestrator. And I want to keep this interactive, so I'll leave space between the sections, and I'd love to answer questions on anything. I'll try to keep it in-depth enough to give everyone good information and context if they want to keep going, but I've also decided to stay one step above how deep I could get, so I can let you all choose where you want me to zoom in.

A quick intro on myself: I'm Simba, founder and CEO here at Featureform. This is my second company. I was at Google before; my background is in software engineering. At my last company I built a recommender system that powered about 100 million monthly active users, and a lot of the ML infrastructure we built there actually became the foundation of what is now Featureform.

The agenda today: we'll start with what a feature store is. The basics: what is this thing, why is it useful, why does it exist? Whenever you're building a data system, people get really caught up in the technical details, and sometimes you can forget what problem you're solving in the first place. Then the three types of architectures: there are three different approaches the ecosystem has come to, and I'll break all three of them down, specifically the architecture we took and why we took it. Then I'll deep-dive into four specific technical challenges. One is streaming and backfill. Two, materialization.
Three, job-state orchestration, which is more about the architecture of what Featureform is. And finally, monitoring and concept drift. And I'd be remiss if I didn't add a small section on LLMs and RAG, given the general hype in the space. I know there have been a lot of talks on vector DBs already, with more to come, so I'll keep that part brief and focus on the things that are unique to our system.

Great. So, what is a feature store? When I say feature, think of it not as a product feature but as an input to a model, a signal. Some examples: the user's favorite song in the last 30 days, the store's top-selling item in the winter, the average price of all items in the catalog. These are all features. You can think of a model almost like a black box that takes inputs, or signals, and generates an output, a prediction. A lot of what data scientists do in practice, and data scientists are our end users, is feature engineering: the constant iteration on signals to come up with better signals to build better models. The way they do that is in their notebooks, usually kind of a mess, working with lots of different infrastructure and creating lots of data transformations, which eventually become these signals.

These are common problems. I'm not going to dive too deep into them because they're a little less technical, but I like to throw them in. They're common workflow problems that we see, and in building a system like this, it's interesting that a lot of companies got really caught up in the technical parts of how to make the thing performant, while there's this higher-level workflow problem that's often overlooked. We spent a lot of time thinking about what the correct API is, the correct workflow, to really get the system to work seamlessly for a data scientist. Some other things here may or may not look familiar to you.

Great. So, the final bit of context before I really start diving into the technical details. It's actually funny: with data scientists, there's sometimes a bit of a fight to get them to agree that, hey, maybe you shouldn't be deploying notebooks into production, especially if you're a giant company and your whole recommender system depends on this notebook you put together. For this audience, I think it'll be very clear that that's not how you should deploy things to production. So there's typically this fence between production and, call it, offline iteration. What happens in practice at a lot of companies is that they'll take these notebooks and essentially rebuild them from scratch into what becomes the production workflow. From the data scientist's perspective, they come up with all of these features and then hit this huge blocker of "how do I actually get this thing into production?"

The problem of getting things into production has a few parts. One is that when you're experimenting, it's common to use a sample, to use really unscalable patterns, to use pandas on your laptop. But when you actually go to move features into production, you have to start working with actual data systems.
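To make that concrete, here's a minimal notebook-style sketch of the "favorite song in the last 30 days" feature from earlier, in pandas; the file name and schema are made up:

```python
import pandas as pd

# Hypothetical raw events sampled into a notebook: (user_id, ts, song, genre).
listens = pd.read_parquet("listens_sample.parquet")

# "User's favorite song in the last 30 days" as a notebook transformation.
recent = listens[listens["ts"] >= listens["ts"].max() - pd.Timedelta(days=30)]
favorite_song = (
    recent.groupby(["user_id", "song"]).size()   # play counts per (user, song)
          .reset_index(name="plays")
          .sort_values("plays")
          .groupby("user_id").tail(1)            # top song per user
)
```

This is exactly the kind of transformation that works fine on a sample but has to be rebuilt against real data systems for production.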
You start having to deal with streaming data, with batch data, with on-demand features, which are kind of like stored procedures. And all of this has to fit together to build the signals at production time in a way that's very low latency and, let's call it, production grade.

Then there are things that are not our problems, things we don't solve. One: we're not trying to build a better Spark. We don't really view that as the problem to be solved. Two: we're not trying to make everything streaming; stream processing is a specifically hard piece of this problem, which I'll get into. And finally, I view the name "feature store" as a misnomer. There's a whole category now, and every single cloud has a feature store. If you use "feature" in the way I've defined it, you'd think of a feature store as just a cool place to store features. But in practice, in how a feature store ends up looking and working, or how we think of it anyway, the problem to be solved isn't a new type of database to store features. The problem, the way we see it and the way we solve it, looks much more like an orchestrator: taking your data infrastructure, your compute like Spark, your storage like S3, whatever other pieces you use, and applying an application layer above it. That layer provides a single source of truth for resources so you can define these things, an easy way for data scientists to collaborate, monitoring and alerting, and governance. And every single feature I create, I need to be able to use in both training and inference, which is a very hard problem that I'll again break down soon. Finally, a nice declarative API to work with, and some sort of dashboard to really understand what's happening, do monitoring, and so on.

On the name: when we built Featureform, what we really wanted was Terraform for features. The name Featureform is literally a nod to Terraform, as in "I wish I had Terraform for features." So a lot of the architecture choices we made may or may not remind you of Terraform, because in practice a lot of what we're doing is very similar; the difference is that Terraform brings up infrastructure and we bring up, essentially, data pipelines.

Cool, any questions so far? That was more the high-level overview. I think we're good.

So, there are three types of feature stores, and I'm going to talk about the three kinds and the three architectures that companies have come up with. One type is what I'd call the literal feature store: literal because it's literally where you store features. This is the kind that, if you've ever used a feature store, you're most likely familiar with. Probably the most widely used feature store that exists, though this is changing pretty quickly, is a product called Feast, which is open source. AWS SageMaker has a feature store, Vertex has a feature store, Azure has its own feature store now, Databricks has a feature store. The cloud providers more or less copied Feast, because Feast at the time was one of the first feature stores and the largest open-source player, and they viewed it as the most commonly used, so they mimicked it and built their own versions around it.
The approach they took was to have the user build their own signals, iterate on them, and then finally store them in the feature store. The value here is that all the features are unified for training and inference. For inference, you need the most recent value of a feature with very low latency serving. For example, if you're building a recommender system for Spotify and making a music recommendation, you might want to know the top song a user listened to, or their favorite genre in the last 30 days. So you want to maintain an up-to-date cache of that value, and that's what's stored in the inference store. On the other side is the training side, the offline store as it's sometimes referred to, which is where the training data lives. What makes that hard is that you need to maintain a historical log of feature values, because the feature values are constantly changing, but when you're training, you're rewinding time and walking through things that happened. So maybe, for example, you'll go through all of my Spotify history. You'll say: Simba listened to this song by the Red Hot Chili Peppers, and what were these X feature values at that point in time? You actually have to rewind time, build the features as they would have appeared at that point, and then zip those together with the label to train the model.

What does that look like? You're database people, so the idea of CDC will be very familiar to you. In practice, you have a CDC stream and a materialized, up-to-date version: the materialized version is the inference store, and the CDC stream is the offline store. On the CDC side you're much more focused on throughput and correctness; on the inference store you're much more focused on latency.

The problem here is iteration. This works really well if your features don't change, but in practice data science and machine learning are very iterative processes: you're constantly changing your transformations, and because the features are treated as artifacts that come out of the transformation pipeline, as opposed to being tied to it, it creates this disjointed feeling between both sides.

In practice, if you look at Uber or Airbnb: Airbnb has a product called Zipline, Uber has a product called Michelangelo. There are multiple others; Pinterest has Galaxy, Facebook has parts of FBLearner. Almost every internal feature store you see at a large company actually looks like what we'd call a physical feature store. People also call this a feature platform. Here you actually tie together the transformations and the storage. The benefit is that as you're iterating on transformations, the transformation is deeply tied to the storage. Rather than "I have this artifact, store it back so it can be really easily served for training and inference," as I iterate, that artifact is automatically being updated. It also solves other sets of problems, streaming among them, which I promise I'll get into momentarily. But the main takeaway is that a lot of these companies solved this problem of allowing data scientists to define features in a way that works in production and can be used in training, by having them work in this new type of system, which is typically backed by more generic providers like Spark.
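A toy sketch of those two sides, assuming Redis as the inference store and S3 as the offline log; the bucket, key scheme, and update hook are all made up:

```python
import json
import time

import boto3
import redis

r = redis.Redis()        # inference store: latest value only, low-latency lookups
s3 = boto3.client("s3")  # offline side: append-only, CDC-style log of every change


def on_feature_update(user_id: str, feature: str, value) -> None:
    # Inference store: overwrite. Only "now" matters; latency is what counts.
    r.hset(feature, user_id, json.dumps(value))

    # Offline store: append, never overwrite, so training can rewind time.
    record = {"user": user_id, "feature": feature,
              "value": value, "ts": time.time()}
    s3.put_object(Bucket="feature-log",
                  Key=f"{feature}/{time.time_ns()}.json",
                  Body=json.dumps(record).encode())
```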
They let data scientists work in their own framework, and the framework is smart enough to automatically fill the inference store and the offline store.

Now, the path Featureform took is what we call the virtual feature store architecture. Where the "virtual" comes from: when we would go to our users, the idea of them actually rehousing their data onto a new platform that we owned was essentially insurmountable. When you go to a company like, say, JPMorgan and tell them, hey, all of this data needs to go through our physical feature store, it's going to be transformed there and stored there, that's a hard sell. It's a hard thing to convince a large company to do. What we realized is that, one, most of these companies have something that looks more like a data mesh, in the sense that it's heterogeneous infrastructure spread across many different teams. And two, the main problems to be solved are much more orchestration and application-layer problems, having a single view over this whole set of infrastructure, as opposed to actually building a better Spark, which is what a lot of people set out to do. That's why we came to this approach: why don't we act more like an orchestrator and perform a lot of the operations that are unique to feature stores, but apply them across whatever infrastructure they have?

The con here is that we need to build interfaces that are generic enough to work everywhere. If you use Kafka or Pulsar, if you use Spark or Snowflake or Postgres, or all three, you should be able to work in a unified way across them and maintain a level of performance at least in the same magnitude of what you'd expect without Featureform. We kind of have to act like a zero-cost abstraction. And because we don't own the infrastructure, we can't really take full advantage of every single tweak we could make if we were just running on Kafka, Spark, whatever.

Maybe you'll get into this, but if you're not hosting the database, the data itself, what does the orchestrator look like? Is it a little Docker container they run on their network that phones home to get instructions, or is it air-gapped enough that it just runs by itself?

It runs in Kubernetes; we're Kubernetes-native, and I'll get into the architecture we've taken in more depth in a few slides. This will be my last overview slide of, let's call it, the ecosystem before I start getting into the technical problems; I just want to make sure you all have context on what this thing is, so that when I start talking about the problems we solve, it doesn't sound random. That's a great question, and I'm definitely going to dive into the architecture. Are there any other questions? None so far? Cool. If you do, please feel free to interrupt and ask.

Okay. Let's talk about what is probably one of the most challenging problems feature store companies face, which comes from streaming and backfill. This slide covers our production features, and I'm really going to dive into the top left here, streaming features. Streaming features have, let's say, a few aspects that make them unique. One: they can be preprocessed.
What that means is this: you might take, say, a user's comment and chop it up in some way, remove the stop words or something. That happens at inference time, at the time of the request, so we'd call that an on-demand feature, not a streaming feature. Streaming features are typically things like a user's favorite song, or the last song they listened to: you have a stream of data, you're constantly generating the value of the feature, that value is constantly changing, and it's changing fast enough that computing it in batch, on an hourly or daily cadence, wouldn't really capture the value of the feature for things like last-30-days windows. You could do those in streaming, and you can obviously argue that streaming is a superset of batch. In practice it isn't that simple: the tools make it much, much harder to actually do streaming than batch. Let's talk about why.

First, it's important to understand the idea of point-in-time correctness. It works like this. Say I have a feature which is the last five items a user clicked on, and it's an input to my Spotify recommendation. Think of how many recommendations Spotify makes just for you when you open the app. If they were to go query your historical values, doing a query at that point in time, chances are it would be far too slow, especially if you're computing some sort of aggregate. It would be really expensive and very slow, and in practice, if it takes a long time for your recommended songs to pop up, that has a very direct effect on stickiness, which has a direct effect on revenue. So there's a whole business case for making this really fast. In practice, all of this stuff gets preprocessed. That's one piece: the point in time of the feature at inference time needs to be now.

The other point-in-time correctness that's really important is historical correctness, which comes from the way a model is trained. You have a set of labels. Let's use an example: fraud. I have a transaction which is fraudulent, and I'll make that the label; say the user is user A. So label Y is true: this transaction is fraudulent. I might want a feature X, which is maybe the average value of items the user has bought in the past 30 days, or over the last five items, whatever. That feature is going to be constantly changing in value, but what I want is to be able to generate the feature as it would have appeared at that point in time. So again, it's that idea of building almost like a CDC stream: capturing every change to a feature value at the point in time it changed, and building a log of all those feature values over time, so that I can zip them together with labels and build training rows as they would have appeared. From a training perspective, you want the model to be almost unaware of whether it's in training or in a production forward pass, so you want to give it features exactly as they would have appeared in production to really do training correctly.
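A minimal sketch of that rewind-and-zip step, using pandas merge_asof as the point-in-time join; the data is made up:

```python
import pandas as pd

# Label events: (user, ts, is_fraud).
labels = pd.DataFrame({
    "user": ["a", "a", "b"],
    "ts": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-01-20"]),
    "is_fraud": [True, False, False],
}).sort_values("ts")

# Feature log: every historical value of avg_purchase_30d, CDC-style.
feature_log = pd.DataFrame({
    "user": ["a", "a", "b", "a"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-15", "2024-01-25"]),
    "avg_purchase_30d": [120.0, 80.0, 45.0, 95.0],
}).sort_values("ts")

# For each label, take the feature value as it stood at that moment.
training_set = pd.merge_asof(labels, feature_log,
                             on="ts", by="user", direction="backward")
```

The direction="backward" match means each label only ever sees the feature value that existed at or before its timestamp, never a later one.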
Now, the way people used to do this historically is they would build two separate pipelines, which is really, really error-prone, as you can imagine. They'd build a whole pipeline in batch, say using Spark, and build these features, and then hand it over the wall to an ML engineer who would take that batch job and essentially convert it into a streaming job. A lot of problems come up with that. One example: those jobs might have, say, a seven-day aggregation window. In the stream, if you're using something like Kafka, you might not have a retention period that long, so even after rewriting the feature you still have to wait N days before you actually have enough data to start outputting features. Two: you might come up with this idea and train your model on it, but your training set is frozen in time while new values are constantly being generated and possibly changing. Maintaining that correctness and doing that monitoring is a whole set of problems that comes up, which I'll actually talk about later.

But in general, what would be amazing, and what is one of the promises of feature stores, is unifying the streaming and batch pipelines, so that as a data scientist I can build a feature and know that feature is deployable, getting back to my earlier point of "how do I deploy a feature?" When I define my feature in this framework, it will work in batch, and it will automatically remain updated as new data comes in.

The final bit I should probably add is that in practice, you're iterating constantly on features: you might come up with 10 to 15 different features over just a few weeks, and you want to be able to test all of those in that time period. No joke, Twitter a few years ago revealed pretty much how they used to do it. They would come up with 15 features, deploy them or send them to another team to deploy, and then wait something like 60 days to receive the training set back so they could actually train their model. That was the iteration cycle: they'd keep trying things over and over, and 60 days later they'd get back the experiment they tried 60 days ago and could finally test it. As you can imagine, this has a huge effect on your ability to iterate, if you find out that, hey, this one was a really good idea and all these other things I've been spending time on for the last 60 days are now irrelevant because the first thing worked.

So how do we solve this problem? Let's get into the technical details. Like most things in distributed systems, the solution is a log. Let me define what a log is: you can think of it almost like a stack, where there's a first event and everything has a time attached. It's append-only: appending is the only operation on it, and everything historical is immutable. A ledger is another way to think of it. The cool thing about a log is that you can never change historical values. What that means is you can freeze a log at a point in time.
If I cut the log before the next events come in, everything before the cut will never change. And if you only read from the top as events come in and ignore the earlier entries, it looks like a stream. So it works like a batch data set and a stream at the same time, depending on how you look at it. That's the trick. It sounds simple, but how you actually do this, which I'm going to talk about, is a much harder problem. The idea is more or less: when I create a new feature, especially a stateful one, I'll run a batch job. I'll freeze the log and batch-process the whole thing. I'll then stop my batch job, save the state, start the streaming job at the last event that came in, and continue to update my inference cache while maintaining a log, again like a CDC, of all the feature values that get created.

Historically, I've built a few feature stores in my career, and when I was able to own the infrastructure, which is not a luxury I have at Featureform because we're virtual, we actually used Pulsar. The reason we used Pulsar is that it has this neat characteristic of separating long-term storage from the message broker. I don't want to get too deep into how Pulsar works, unless you want me to, in which case I'm happy to. The way it works is that if you set infinite retention, it will build exactly the log I described. And because it knows that the historical values are immutable, over time it will chop them off into segments and offload them into S3 or GCS or HDFS or whatever you decide to back it with. What that means is that from Flink's perspective, if I run a Flink job, it looks like this perfect stream that has infinite retention.

So when possible, we use that. My understanding is that Confluent's Kafka, the paid offering from the creators of Kafka, has infinite retention now, but open-source Kafka does not. The reason it doesn't work there is that in theory you could set the retention parameter so long that it's essentially infinite, but in practice Kafka's architecture keeps the messages on the same nodes as the brokers, and because the stream size just grows indefinitely, you'd end up having to add a massive number of nodes even if you don't actually have that many events going through the stream. So what you end up doing in practice, which I'll show, is reverse-engineering what Pulsar does natively to make this work.

The trick is more or less: treat the stream as a log, maintain infinite retention, and define transformations in a way that they can run both in batch and in streaming. If you do that, then you can freeze the log and run a batch job. In Pulsar, I'll get the last message ID that I see, and I'll run a big batch job, because in practice there's way more historical data than live data coming in, and I don't want a massive batch cluster running all the time when the steady state is just processing streaming events as they arrive. So I'll spin up this large cluster just for the backfill.
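A minimal sketch of that freeze-then-handoff dance with the Pulsar Python client; the broker URL and topic are made up, and backfill and update_state are hypothetical stand-ins for the real batch and streaming jobs:

```python
import pulsar


def backfill(events):            # hypothetical batch job over the frozen prefix
    return {"count": len(events)}


def update_state(state, event):  # hypothetical incremental streaming update
    state["count"] += 1
    return state


client = pulsar.Client("pulsar://localhost:6650")

# Phase 1: "freeze" the log. Everything already in it is immutable,
# so we can batch-process it like a static data set.
reader = client.create_reader("listen-events", pulsar.MessageId.earliest)
frozen, last_id = [], None
while reader.has_message_available():
    msg = reader.read_next()
    frozen.append(msg.data())
    last_id = msg.message_id()

state = backfill(frozen)

# Phase 2: hand off to streaming at the exact cut point. Events that arrived
# during the batch job get replayed first (the reader resumes just past
# last_id), then we keep consuming live events and updating the caches.
start = last_id if last_id is not None else pulsar.MessageId.earliest
stream = client.create_reader("listen-events", start)
while True:
    msg = stream.read_next()                 # blocks until the next event
    state = update_state(state, msg.data())  # inference cache + CDC log updates
```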
I'll process the frozen part of the log, and then do the handoff to streaming. In practice, while I was running that batch job, new events will have come in that weren't processed and are sitting in the log. So I maintain a message ID and pass it to my streaming job; the streaming job goes back in time, finds that event, the first purple event in the diagram, and starts stream processing from there. That's the trick, and it's really this dance we have to coordinate across batch and stream.

It's easier to see what we have to do with Kafka, because there we can't pretend Kafka is building an infinite log. What we do is this: as events go into Kafka, we write them to S3, so that we have a historical log of events. We also process all incoming events in Flink. It could also be Spark; I'm just using Flink and Spark in this example to differentiate the streaming job from the batch job. That's the steady state: this is a feature that has already been created, and all I have to do is keep it up to date as new events come in. Now, say as a data scientist I'm like, hey, I have an idea for a new feature. Again, say we're Spotify: I want to know the average beats per minute of the songs a user listened to. Well, now I have to run this big Spark job to go through all the events for every user to generate this feature's value, and not just the current value, but what it would have been after every listen. Then I do that handoff we talked about, coordinating with a streaming job, because if I do deploy this thing to production, I need to maintain that feature value in the inference store.

A lot of the reason why the feature store problem exists, and why it's so hard, is actually this bimodal nature of training versus inference and the requirement of having historical feature values. This would all be resolved if, theoretically, Spark could natively do streaming-batch unification, where I could build one job and run it across both modalities; then we'd be done and wouldn't have to do this ourselves. There have been many approaches to this. For example, Apache Beam has something called a splittable DoFn, which is one approach. But even though this is a known problem, and it's been an open problem for a long time, it remains unsolved, and the approach that most companies go with is actually just doing something that looks like this. I built this, and even to me it's kind of a hack: it works, but what would be really nice is for something like Spark or Flink to do this natively, and they don't. Questions?

Simba, I have a quick question; this is Jignesh Patel. How big does a feature store get at the larger end of the deployments you see? Do they tend to look more like OLTP systems, relatively small amounts of data with lots and lots of updates and queries on version numbers over time? Or do they tend to get much bigger?

They get huge. This is actually maybe part of what makes it strange: if we go back to the architecture, that inference store is an OLTP system, but the offline store is an OLAP system.
But in practice, I have to maintain consistency across both of those systems, and make it so that I can define one transformation that's used in an offline context for training, which is an inherently asynchronous, long-running job, but can then also be used transactionally in a production system. So it's both. Everything I talked about with batch and streaming: the batch job is more or less running in an OLAP-style context over a massive data set, while the streaming job, depending on the size of the company, looks much more like a typical production-grade transactional system.

So at a big bank, say, on something important like fraud, are we talking terabytes or gigabytes in the feature store?

Larger than terabytes.

Larger than terabytes, got it. Because they're keeping a lot of versions going all the way back, to be able to log and see what inferencing they might have done in the past, and they can't age it out until whatever their policy says, right? That's where the size of the data goes up, because you have to travel so far back in time.

So in practice, what companies at that scale will often do is sample. We still have to process it all, but we might not put all of it into storage. Or they might do a hard cutoff on the window size, where they'll never train on data older than, say, two years, and then we can just let that older data go; it's almost like a TTL. So there are many techniques that exist. And even when they do process it all, the actual training step is very expensive, oftentimes much more expensive than actually generating the features. So even if you generate two years' worth of features, you might do some sort of smart sampling and data curation, so you only actually train on the 5% of that data that is most likely to have an effect on the model itself.

Great. And one more question related to that, and you can defer it. What's the definition of a feature? Because it could mean anything from something as simple as a measure or a metric, to the entire code pipeline that generated it, and if it's generated in Python, the hundred libraries that came with it. How far does the definition of a feature go? I'm guessing different people carry it different ways, from it being just a value to it actually being a reproducible code entity that produces that value.

It's a really, really great question, and I think that's actually the fundamental difference between how we think of the problem and how other feature stores think of it. Other feature stores think of the feature as inherently an artifact: the feature is the final row of data. And for every feature, it wouldn't just be one row; it would be, again like that CDC, every value of the feature historically. If you do it that way, I don't think that's the correct way to view it. I think in practice the better abstraction is to define it, like you were describing, as an actual pipeline.
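A minimal sketch of that distinction, with hypothetical names: the artifact view is just rows, while the pipeline view makes the feature a versioned, pure transformation that can always regenerate its rows:

```python
from dataclasses import dataclass
from typing import Callable

# Feature-as-artifact: you keep the rows and lose the recipe.
fav_genre_rows = {"user_a": "rock", "user_b": "jazz"}


# Feature-as-pipeline: the feature *is* the transformation.
@dataclass(frozen=True)
class Feature:
    name: str
    variant: str          # versioning: iterate by adding variants, not mutating
    transform: Callable   # a pure function from raw data to feature values

    def build(self, raw_data):
        # Because transform is pure, the artifact can always be regenerated,
        # so it can be dropped from storage and recomputed later.
        return self.transform(raw_data)


avg_purchase = Feature("avg_purchase", "v2",
                       lambda df: df.groupby("user")["amount"].mean())
```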
There's an open question about where the pipeline starts: do we go all the way back to the initial stream of data coming from the client, or do we start with something that looks more like a curated data set? In practice, it really depends on the company. Some companies, LinkedIn for example, have a whole data mart, very clean, perfect data sets, that the BI teams, the analytics teams, the data analysts use. But the data scientists working on ML purposely don't use it. They prefer to have full control all the way back from the stream, so they can really make any feature they want, because in practice, metrics for analytics are a little bit simpler. For example, if you have a revenue metric, you want there to be one revenue metric, and you want it to be right, whatever "right" means. If you have 25 revenue metrics across the company, then as the CFO you might be a little concerned for your job. But if you're a data scientist doing ML, there may be many, many reasons to have 25 different versions of revenue, because in practice, with ML, features and signals are a means to an end. They're not actually meant for human consumption; they're meant for model consumption. So a feature with some weird cutoff, like "hey, I'm only going to take revenue from users who pay us more than $1,000 a year and are in the US or Europe, but also India": to a human, you'd look at the description of that table and be like, why would anyone use this? But to a model, for some reason, it might be exactly the signal that makes it perform very well.

One other thing you touched on a bit, which I didn't really get into but is another really core problem in this space: I talked to a bank and asked them how many features they have in production, and they said tens of thousands. That's a lot. I asked, okay, how many of those are actively used? And their answer was: we have no idea, but we don't know what we can and cannot turn off, and we don't want to see what breaks. The real way they'd have to do it is turn them off one by one and see who screams. There's some obvious value here in lineage, but also, because we work at a higher level of abstraction, we can decide what gets created and when, what's worth caching and materializing, and what's worth throwing away. In practice, we ask the user to make all transformations more or less pure functions; technically they can choose not to, you just get undefined behavior. That means everything from the raw data forward in Featureform should be fully reproducible. That opens up a few things. One: obviously, we can turn anything off and bring it back on at any time. Two, another really common thing: I've mentioned features like "top thing a user did in the last seven days" a lot, and it's common to have a last-7-days, a last-30-days, and a last-90-days variant. In these sorts of situations, we can actually be very smart about processing multiple features at once, if we know there's just a handful of window sizes, along the lines of the sketch below. All those things come into play.
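A sketch of that multi-window idea in pandas, with a hypothetical schema: one sorted, grouped pass over the raw events feeds all three window variants, instead of three separate jobs each re-reading the same data:

```python
import pandas as pd


def spend_windows(events: pd.DataFrame) -> pd.DataFrame:
    """events: raw (user, ts, amount) rows, ts as datetimes.

    All three window features come from a single sorted, grouped copy of
    the events, rather than one full scan per window size.
    """
    grouped = events.set_index("ts").sort_index().groupby("user")["amount"]
    return pd.DataFrame(
        {f"spend_{d}d": grouped.rolling(f"{d}D").sum() for d in (7, 30, 90)}
    )
```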
And again, we already do a decent amount of this. If we owned the infrastructure, we could really go into overdrive on it, because we'd own everything and know everything. But we have to work with all of these vendors, with very, very different needs and very different types of transformations, so there are fewer things we can take advantage of.

Well, thank you, that was awesome.

Thanks for the great question. Any other questions before I keep going? Cool.

I'm going to touch on two parts; I went back and forth on whether I should join these or not, but I'm going to end up joining them. Let me talk about the other side of this. Everything I've talked about so far has been streaming: with streaming, you're constantly filling the online store, the inference store, as new data comes in from the stream, while simultaneously maintaining, again, that log of historical feature values in the offline store. Batch is a little different, in that you might be running on a schedule, and what's going to happen is you first build that table, OLAP-style, in the offline store, but then you want to materialize it into the online store. That materialization problem is actually a very annoying problem that a lot of feature stores struggle with. The reason they struggle with it, especially if you don't own both sides, like in our case as a virtual feature store, is that we don't get to decide which offline store and which online store people use, and some offline stores are very easy to write to certain online stores while others aren't. For example, Snowflake and Redis: Snowflake as the offline store and Redis as the online store is very, very common, one of the most common pairings we see, along with Databricks and Redis. There is no native way to materialize data, to copy a table, from Snowflake into Redis. So we have to fill that in. A lot of Featureform in practice is, beyond the core orchestration problem, all of these little glue pieces, all these things that add up and are really annoying. This is what I mean about orchestration systems: the problem is much less one specific hard technical problem, the closest thing we have to that is the streaming problem I just talked about, and much more these missing pieces that people otherwise have to cobble together themselves, where you end up with some insane set of scripts to make this stuff work.

Going back a step: what does that materialize step look like? Is it batch upserts, or are you just doing a copy-into? What does materialization look like?

Yeah, the actual problem to solve is: I have a table of all the feature values over time, and I want the most recent value of each feature, for each user or entity, to be in the online store, so I can do a lookup and say, hey, what is currently Simba's value for this feature, so that when I make a prediction I have this nice cache of preprocessed features I can use. That's the problem to solve; a toy version is sketched below. How you actually solve it is another matter.
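A toy version of that materialization step, with a hypothetical schema and a plain dict standing in for the online store:

```python
import pandas as pd

# Toy online store: entity -> latest value, what the model hits at inference time.
online_store: dict[str, float] = {}


def materialize(feature_log: pd.DataFrame) -> None:
    """feature_log: historical (entity, ts, value) rows.

    Reduce the log to the most recent value per entity, then upsert
    those values into the online store.
    """
    latest = feature_log.sort_values("ts").groupby("entity").tail(1)
    for row in latest.itertuples():
        online_store[row.entity] = row.value
```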
Different companies and products do it in different ways. The way Feast does it is a very dumb copy-everything-over, upsert-style. When we do it, because in a funny way the most expensive operation in terms of time is actually copying data over the network, while processing the data in place in the offline store is typically cheaper, we maintain the last snapshot: we build a view in the offline store of what should be in the online store. The first time, we just copy it over more or less row for row. The second time, we keep the previous snapshot and the new one and take a diff, and then we just copy the diff over, so we make the minimal change to get the online store into the new state it should be in. That's the approach we take.

Sorry, is that diff done server-side on the online store, or do you basically do it in your Kubernetes thing?

We actually do it in Snowflake or Databricks or whatever offline store is in use.

Okay.

And then all we do in the Kubernetes job is the copy, and it's an embarrassingly parallel problem, because it's just N operations, they can just be inserts, so I can break that up across jobs and copy those things over. I think using Kubernetes in this way is something that's unique to us: being Kubernetes-native and really using Kubernetes not just as the thing that runs our servers, but as an operating system for distributed systems, where we can spawn jobs and take full advantage of whatever infrastructure they give us, whether it's on-prem or in Google Cloud or both. That's a nice value prop we get.

Thanks.

Then, getting more broadly into what Featureform looks like: there are a lot of different components. There's online serving, there's offline serving, there are infrastructure providers, there are these worker pods and worker jobs that get created, there's the coordinator and the metadata. There are other pieces too; this diagram is actually out of date, there are a few more pieces I've added since. But the core architecture, which again copies a bit of how Terraform does things, is that we treat the metadata as the source of truth. When you define things in Featureform, you are declaring your desired state, and Featureform saves that desired state. All the state lives in etcd, or, I guess, the services are stateful in places, but in practice all the state that matters is stored in etcd, and if you back up etcd, you've pretty much backed up that Featureform cluster. What works architecturally is that we create this source of truth that is the metadata, and the coordinator's job is just to take that desired state, look at what actually exists in the world on the infrastructure providers, and try to bring the two in sync. And if it can't, if a job fails or whatever, it lets the user know, with monitoring and alerting set up, along the lines of the loop sketched below.
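A minimal sketch of such a coordinator loop; metadata_store and providers are hypothetical interfaces, not Featureform's actual ones:

```python
import time


def reconcile(metadata_store, providers):
    """Drive real infrastructure toward the declared desired state.

    metadata_store holds the source of truth (the desired state);
    providers wrap the actual infrastructure (Spark, Redis, etc.).
    """
    while True:
        for resource in metadata_store.desired_resources():
            actual = providers.status(resource)
            if actual != resource.desired_state:
                try:
                    providers.apply(resource)  # e.g. launch the backfill job
                except Exception as err:
                    # Can't sync: surface it via monitoring and alerting.
                    metadata_store.mark_failed(resource, err)
        time.sleep(30)  # then re-check the world against the desired state
```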
I think that approach reflects two common distributed-systems lessons. One: with data systems, a log is usually the answer; it's funny how often just using a log comes up as the solution to almost everything in distributed systems. The other: immutability is your friend. Creating this kind of immutability, immutable in the sense that you can change things but you go from one immutable state to another, makes these systems much easier to build and reason about. There's a lot of value in that, and you'll see me use those two techniques and tactics over and over again in the systems I build.

The last thing we'll talk about is monitoring and concept drift. The idea is this: when I trained my feature, I might have had a distribution that looks like this blue distribution; when I'm doing inference, I might have a distribution that looks very different. This concept is called feature drift. Actually maintaining this view is, again, that mixture of the more analytical offline-store stuff with the more production-grade transactional stuff: combining them to maintain the current distribution of values while also holding, over a much bigger historical data set, what we were trained on. Then we have a set of heuristics, I'm not going to have time to go into all of them, around what we use and when we use it. I'd say this is maybe the fourth really hard technical problem that people reach out to us to solve, and something we had to solve ourselves: maintaining and building and coming up with the right heuristics, because it is all heuristics. In fact, for text data we actually built our own model: we use embeddings and other techniques to be able to actually figure out drift. Because it's all heuristics, the question is how you figure out the right heuristics, and how you do that in a way where someone can catch feature drift before it becomes a problem.

There's actually a story, also from Twitter, from probably four years ago now, where they missed revenue for the quarter because one of their feature values was set wrong. It created, and I'm making up the numbers, something like a 3% drop, some relatively minimal drop, in their recommender system's quality, but it was enough to make them miss their revenue target, and they had to go into their public earnings statement and say: we missed our revenue targets because we had the wrong feature in our model. Maybe I'll actually tie up on that point: keep track of what you're doing and why it matters as it moves up the stack, and you'd be surprised how often there's almost a direct line of sight from "we hit our revenue targets" to "the feature store was good."

So, yeah, I have more material, but I know we're getting on time and it's gone a little bit over, so I'll pause there if there are any last questions or anything else I can jump into.
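One common way to score drift, sketched here as an assumption rather than as Featureform's actual heuristic: a two-sample Kolmogorov-Smirnov test comparing the training distribution of a feature against a window of live inference-time values.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted(train_values: np.ndarray,
            live_values: np.ndarray,
            alpha: float = 0.01) -> bool:
    """Flag drift when the live distribution is statistically distinguishable
    from the (much larger, historical) training distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha
```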
Maybe I can ask a question, but let me see if someone else has a hand up or a question first. No? Then I'll go for it, and if you want to show any final conclusion slides, go for it. Awesome talk, Simba. If you look at the broader infrastructure landscape, people are trying to fold feature stores into the broader offerings they already have; everyone's trying to offer a streaming platform plus an ML platform plus a data science platform. What's the case to be made for an independent virtual feature store? I'm sure you get asked this all the time.

Great question. I think there are two kinds of problems to be solved. I've focused a lot of this talk on what I'd call the processing problems, the part engineers love to solve, the stuff that's why we chose to be engineers. There's another set of problems, which I didn't talk about because you'd be bored to death if I did: how do we do governance, what's the right workflow, how do we get a team of data scientists to work together, what's the correct way to do versioning. There's this set of what I'd call workflow problems; it's almost API design. There's a quote from one of my mentors, who ran some very, very large database companies, that every data company solves one of two problems: they either solve a hard technical problem, or they solve a workflow problem. An example of a hard technical problem would be Snowflake: Snowflake's API is more or less an API that has existed for a long time, but they arguably execute it better than everyone else, and that's why they're successful. For the other set of problems, look at Terraform and HashiCorp: it's not that AWS couldn't technically build HashiCorp, whereas they do seem unable to build a strong Snowflake competitor. The API problem is, one, an extremely hard problem, and two, not a problem you can throw people at; you can have five people build a better API than a hundred people working on the same problem. The problem we solve is fundamentally an API design and workflow problem, much more so than an infrastructure problem. Because of that, I'd say we're more qualified to solve it than even the Databrickses of the world would be, because it's all we think about. And two, it's very common that these types of problems actually get solved by open-source vendors who standardize the workflow and build around it, and we're open source; I didn't mention that in the talk, but Featureform is an open-source product.

Got it, thank you.

Of course. And I know better than to make any sort of definitive statement in a room full of database people.

Another question from the audience?

Hello. Yes, go for it.

I understand that you're trying to solve data quality by providing a feature store, providing a kind of manageable layer on top of the infrastructure for customers. You mentioned the Terraform-like desired state, and also drift. But what's the story for how you define that state? You mentioned a metadata layer and a desired state; what actually defines it in your feature store?

I didn't get into this, but I can do it real quick. If I end this slideshow, is it going to screw up your recording? I can just show something quickly.

As long as you don't stop sharing, it's fine.

Okay, cool. So I'll show some code real quick.
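Before the walkthrough, a from-scratch toy of the decorator-plus-apply style he's about to describe; none of these names are the real Featureform API.

```python
# Toy desired-state registry: a decorator records transformations, and an
# "apply" step (Terraform-style) would diff and deploy them.
REGISTRY = {}  # (name, variant) -> transformation function


def transformation(variant):
    def wrap(fn):
        REGISTRY[(fn.__name__, variant)] = fn  # declare desired state
        return fn                              # still usable directly in a notebook
    return wrap


@transformation(variant="v1")
def avg_bpm(listens):
    # e.g. a DataFrame of (user_id, bpm) listen events -- hypothetical schema.
    return listens.groupby("user_id")["bpm"].mean()


def apply():
    """Diff declared state against what's deployed and schedule the
    backfill/streaming jobs for anything new (stubbed out here)."""
    for (name, variant), fn in REGISTRY.items():
        print(f"would deploy {name}:{variant}")
```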
So how it works in practice is that you're building transformations. In this case it happens to look like PySpark code for batch, but all you're doing is decorating your Python transformations with Featureform decorators. Featureform is more or less not trying to build a new transformation language; we try to minimize that as much as we can, so you can continue to use the tools you love to use. What we're doing is providing a framework above them. So the answer to the question, which I think I heard correctly as "how do you actually define these things," is: in Python. And it's Terraform-like in that, in the end, you call featureform apply, and it will actually build these things, and you can interact with them as data frames. And again, it's an open-source product, so if you want to see more, feel free to check it out.

Thank you, we'll check it out.

So I have a question about how you're at a higher abstraction level than maybe Snowflake or something else, with data ingestion coming in from there and these transformations performed on it. I'm trying to get an overall idea: are these transformations performed each time you want to do some sort of analysis, or how exactly does it work?

Yeah. For streaming, there's the batch section, which is performed once, that's the backfill work, and then the stream, which continues to be processed. In the batch situation, you're typically going to set a schedule. For some jobs, depending on the raw sources, people actually want to just re-run the entire thing from scratch each time on the schedule. There's also a way in the API to define an incremental one, where, for example with SQL, you might just add a WHERE clause: where the timestamp is greater than the timestamp of the last run. So if it's easy to define your transformation to be incremental, you just run over the incremental new data as the schedule fires. In situations where you can't, in practice, you have to run the whole thing over again. This is actually something dbt gets backlash about for a reason: if you use it poorly, it can create a huge Snowflake bill, because you're almost running everything over again in the most expensive way possible. Super easy, but you pay for it somewhere else.

So what kind of performance objectives do people have, and how do you meet them?

Yeah, people ask us all the time what Featureform's performance overhead is, and the answer is that we aim to be a zero-cost abstraction. In practice, are we? We're close; we try to be. There's definitely a little bit of overhead, especially at serving time, but what we want to do is provide a framework that in the end maps to something very similar to what you'd be running yourself. In some cases it can even be more optimized, because we have the full context of what you plan to do in the end, since your end is training or inference. But in practice, most of the heavy lifting is being offloaded anyway.
Almost all of what we're doing ourselves is metadata and orchestration, which in practice is not a very compute-intensive problem; it's much more a state- and metadata-intensive problem.

Thank you very much.

All right, we're short on time. Do you want to quickly share your slides again and show the conclusion slide, and I'll splice that into the video? Code links if you have them, contact info, whether you're hiring?

I am hiring. I didn't really add a final slide, and that's it.

I can take care of that later. Yeah. You have a unique view of what infrastructure people are actually using, and you listed a bunch of these different frameworks. In terms of, say, Kafka versus Pulsar, what do you see more of?

Way more Kafka. This pains me because I'm a big Pulsar proponent; I think Pulsar is an amazing system. But there was a window where Pulsar was gaining hype and speed, and the question was whether Kafka could catch up to Pulsar's functionality, because I do think Pulsar was the better system, versus whether Pulsar could actually justify enough teams making the jump. In my opinion that's been put to rest, and Kafka has come out victorious. And in practice, all the things that make Pulsar such a great system are being backfilled into Kafka anyway.

And what about Flink versus Spark? What do you see more of?

Much more Spark in general. I think Databricks has just done such a fantastic job of expanding the footprint of Spark. Flink lacks that kind of company behind it, though there are many companies that have tried to take Flink to market, and some of them have probably spoken here before. There was Ververica, which was acquired by Alibaba, and now there's Decodable and a few others. The Confluent people just bought a new Flink company; I forget what it's called, they bought it last month.

I didn't know that, but yeah, it feels like there's never been a Confluent or Databricks equivalent for Flink or Pulsar, really. There have always been startups, but I think because of that it's hard to get that sort of buy-in, because it's such an expensive thing for a big company to commit to. But I do see a lot of Flink.

I will say there's a lot of Flink. It's also interesting that the ecosystem in China versus the US is quite different: there's a lot more Flink in China than in the US as a percentage. I'm not sure why, it's just something I've noticed.

That's probably part of it. I wonder if it predates that, actually; there may be other reasons for it too. Oh, and Confluent bought Immerok; that was the Flink company. They bought it back in January.