Hello, everybody. My name is Willem, and I'm one of the tech leads on the data science platform team at Gojek. I've seen some Gojek people here. Is there still anybody left? Any Gojek people? Today we're talking about Feast, an open source project that we developed over the last year or so in collaboration with Google Cloud.

The agenda today: why features are important, some of the challenges of working with features, the design goals we had in building a feature store, the MVP we built and the lessons we learned from it. I'll show you some examples of what it looks like to interact with the system, and then our vision going forward with Feast.

At a company like Gojek, and I think at most tech companies, you have two things that are really important: you have your products, and you have a lot of data. Sometimes the data is also a product. In Gojek's case especially, you want to power most of your products through intelligent systems, through machine learning, and these systems depend heavily on the data. We can think of features as data points on the different entities or aspects of your business: your customers, your drivers, your locations. Gojek has 18 different products, so we have a lot of variety in our data sources, and the features we can extract from those sources are critical in powering these ML systems. So for us, feature data is critical, and our models need features.

I'm going to start at a high level, actually a little higher level than the previous talk, so I'll go through it quickly because I'm sure you're all very technical. This is a basic data frame; it's completely synthetic. The green columns are features, or attributes: essentially properties or observations of a specific entity. In this case it's a driver table with driver IDs, and each driver is an entity of the entity type "driver". Aggregating and creating these features is the most important part of building an intelligent system.

So, a toy example. Let's say we have a booking table, essentially a transactional table for a company like Gojek, and features on our entities on the left. Say you're building an allocation system matching customers and drivers, and you want to predict which drivers are most likely to convert, that is, to have a successful trip. The column on the far right there says trip completed; that's just a Boolean flag on the transaction, and those are historical transactions. If you want to train a model on this, you need to enrich the data frame on the right with the driver data, and to do that you need a join.

Now, this is a very simplistic example, but in large and complex systems you often join hundreds of features onto a single frame. This is something data scientists do all the time, and it's somewhere a lot of errors can creep in, because you'll notice that the timestamps of the features are typically different from the timestamps on your transactions, or whatever you're joining them onto. Doing a point-in-time correct join is actually kind of tricky: if you don't match these two things together correctly, you're going to leak feature data. You'll train your model on data it won't have in production, and the accuracy will degrade.
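To make that concrete, here is a minimal, completely synthetic sketch of a point-in-time correct join in pandas. The column names are made up, but the idea is the one Feast automates: for every transaction, take the latest feature values at or before the transaction's timestamp, never after.

```python
import pandas as pd

# Historical transactions: the rows we want to enrich with driver features.
bookings = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2019-10-01 10:00", "2019-10-01 12:00"]),
    "driver_id": [1, 1],
    "trip_completed": [True, False],
}).sort_values("event_timestamp")

# Driver feature values, each stamped with the time it became known.
driver_features = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2019-10-01 09:00", "2019-10-01 11:00"]),
    "driver_id": [1, 1],
    "acceptance_rate": [0.80, 0.90],
}).sort_values("event_timestamp")

# direction="backward" only looks back in time: the 12:00 booking gets the
# 11:00 feature value and the 10:00 booking gets the 09:00 one, so no
# future information leaks into the training set.
training_df = pd.merge_asof(
    bookings,
    driver_features,
    on="event_timestamp",
    by="driver_id",
    direction="backward",
)
```

Get that direction wrong, or join naively on the ID alone, and the model trains on feature values it would never have seen at prediction time.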
And in production you often won't even notice that degradation unless you have very good monitoring systems. But let's say you can do all of that perfectly. You have your features, your inputs essentially, you have your outputs or outcomes, and you train your model. That's the product: you run through all these examples and train it. I'm looking at this from a data engineering perspective, because I'm an engineer, not a data scientist. It's up to them what they do inside the model; I care about providing the whole system around that modeling, so they can build end-to-end ML systems.

Once you have your model, as a data scientist you want to ship it into production, but you don't really know or care about the infrastructure around it. In production you'll have an incoming request, and that request will have, say, a driver ID, an entity ID, plus some feature data, and that gets fed into the model. An important thing to note here is that the features have to be the same for the model. They can't change. And it's not just that the types have to be the same: the actual source data has to be the same.

So the real question is, where do these features come from? They're not coming from production systems. If you're integrating with a production engineering system, it will have IDs; it's not going to store feature data for you. As an ML platform team or a data engineering team, we have to provide those features. We have to enrich that yellow driver ID with the green feature values at some point, and we have to do it in a way that's consistent with what was given to the model in training, while at the same time ensuring there's no staleness in the data and no inconsistency in the way it's transformed, created, or sourced, even if it's just skewed distributions.

Those were some of our concerns when we looked at feature engineering at Gojek: in a lot of our projects this was very hard to do. But we had some other challenges as well. There were two personas we were trying to address. The first is the data scientist. Data scientists at Gojek are typically very busy, because there are so many projects, so many teams, so many things going on, and most of their time goes into feature engineering and building data pipelines. Those pipelines are everything from sourcing data to transforming it, plus a lot of data validation and data quality checks. Essentially, there's a lot of plumbing, and we wanted to free them up from that.

We also wanted more reusability in the work they were doing, because a lot of it was being done from scratch. A year and a half ago there was essentially zero visibility across teams: you had to have a relationship, or a team had to have published something, for anybody else to know about it. That's a challenge in a large organization where you have hundreds of engineers and hundreds of technical people who want to reuse work. And there's a trust factor too: even if you know about somebody's work, teams often recreate it anyway, because they're uncertain about how it was created and they can't inspect the data sources and transformations. So we wanted to help with that. We also wanted to help with getting data into production.
That's the third point: data scientists shouldn't have to think about that at all. And finally, we wanted to ensure consistency between training and serving, to avoid model skew.

The second persona we wanted to address was the data engineer. In large organizations, data engineers are typically fielding a lot of ad hoc requests. If you don't have a platform that does everything for you, then each project that comes along requires a data engineer to go and spin up a Redis cluster, build an API, define schemas and contracts, and then manage and monitor all of that. You really want infrastructure that can handle this for you, especially because 70 to 80 percent of projects, and there are many projects, are not that different from each other. They're not unique enough to require ad hoc infrastructure. So we wanted the feature store to address the needs of this data engineer too. On the data processing side, we wanted to make sure data engineers don't have to manage scale or compute, whether that's importing data, exporting it, or serving features in production. It should be easy for them to just have fresh features there, being served to production systems.

So our design goals with Feast were these. First, we wanted reusability of features: teams should be able to see other teams' features, have confidence in how they were created, and use them in their own production systems. Second, we wanted consistency between training and serving. We were building an infrastructural system, which is a great opportunity to build consistency in as a first-class concept. Third, we wanted to decouple feature creation from feature usage. In most small teams you build one big pipeline that does everything from raw data to batch scoring at the end, and every time you iterate on your features you rerun the whole pipeline, which is extremely time-consuming and actually wasteful. Instead, you want asynchronous checkpoints, stores essentially, like a feature store or a model store, that break the pipeline apart, because the cadence of development is different in each of those stages. That way you can have a data engineering team on one side of the world developing features while, on the other side of the world, somebody is just selecting features. The fourth goal was to provide access to features in real time; I've already spoken to that problem. And finally, we wanted to support the tools that data scientists already use. We go to where they are: they can use pandas, they can use Python, and we don't force them into a foreign language or anything like that.

Our solution to this was Feast, and our vision for Feast is that it's essentially a bridge between models and data. We can't address all models; there will always be models that are too different, that sit out there as edge cases. But we believe most models can be addressed through a single interface, and the goal is for Feast to be that interface between essentially any source data and the model.

So where did we begin? We began with a collaboration with Google Cloud PSO, between Gojek and Google. The product of that collaboration was an MVP, Feast 0.1.
We had many design goals, but the primary focus areas were two: supporting all of our production ML systems, especially batch training and online serving, and enabling reusability and discovery of features. It's open source; you can Google the Google Cloud blog post on it from the start of the year.

So this is a high-level view of Feast 0.1, the MVP. I'll talk a bit afterwards about how things changed and some of the lessons we learned when we rolled it out. Firstly, this is an infrastructural system. Even though we complied with the design goal of making it easy for data scientists to use with tools they're familiar with, it's an API, a system you interface with, not something you run locally. It's centralized: a single warehouse, a single online serving API, and a backing store, which is the Redis cluster. In our case the warehouse is BigQuery, and ingestion is a layer within the system. You load data in from upstream sources, which can be Kafka or Pub/Sub or BigQuery or object stores; we have interfaces to many of those.

The data model was essentially entity-feature based, and I'll tell you a little later why we moved away from entity-feature relationships. In Feast 0.1, when you register an entity, it creates a table, and each feature becomes a column on that table, so it grows into a very, very wide table. When data is written by ingestion jobs, it goes into rows in that one single table. So if you have many different ingestion jobs writing from many different sources, you end up with a very sparse table. For example, if you have a stream writing every second and a lot of batch features that only update once a day, 80 or 90 percent of your rows' cells will just be empty. The serving APIs are separate, but the feature data stays consistent because a single ingestion job feeds both, so we met that design goal.

Now, the lessons we learned when we rolled this out to our production systems and teams at Gojek. The first: every engineering team wants their own Feast. They don't want to depend on a central Feast; that's too scary for them if they're not running it themselves. The second thing we ran into was that data scientists were creating really, really strange composite entities. They'd create something like service area, say an area where we run our services, like Jakarta, combined with service type, which is essentially a product, plus a flag for whether it's a weekend. That's not an entity like a driver or a customer or a digital merchant. No other team is going to want to use it, but it is useful to the specific team that created the composite entity, and they store their features on it. So we realized a real need was being fulfilled, but it was polluting our namespace: when a new team comes on and looks at the entities, they see these weird abstract entities, and it's just a very strange experience. The third lesson was that retrieval suffers with the sparse, very wide tables we had created. This data model of ours was not efficient.
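Just to illustrate the sparsity problem, here's a toy sketch with made-up columns (not the actual 0.1 storage format): a streaming job and a daily batch job both writing into one wide entity table.

```python
import pandas as pd

# A streaming job writes its narrow set of columns every second...
stream_rows = pd.DataFrame({
    "driver_id": [1, 1, 1],
    "event_timestamp": pd.to_datetime(
        ["2019-10-01 10:00:01", "2019-10-01 10:00:02", "2019-10-01 10:00:03"]
    ),
    "current_trip_count": [2, 3, 3],
})

# ...while a batch job fills in a different set of columns once a day.
batch_rows = pd.DataFrame({
    "driver_id": [1],
    "event_timestamp": pd.to_datetime(["2019-10-01 00:00:00"]),
    "acceptance_rate": [0.80],
})

# In a single wide entity table, each writer leaves the other's columns null.
# With hundreds of features arriving at mixed cadences, most cells are empty.
wide_table = pd.concat([stream_rows, batch_rows], ignore_index=True, sort=False)
print(wide_table.isna().mean())  # fraction of nulls per column
```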
Part of the reason was that we had depended on a specific technology, Bigtable, which we eventually moved away from toward Redis, and when you store data in this model in Redis it's really not as performant. So that was one of our pains. The fourth lesson was that feature validation is super important to data scientists, and if you're building an infrastructural system like Feast, it's the perfect place to put validation, statistics, and alerting on feature distributions, both in training and in serving. And finally, aggregations: Feast 0.1 doesn't do aggregations, all of that happens upstream, but feature creation and feature engineering at the transformation level is actually very important, so it's something we're looking at adding in a later release.

So where are we now? We released Feast at the end of 2018 and have been rolling it out to many production systems in 2019. We're at 0.3 right now. Feast 0.2 introduced decentralization: we broke Feast into two phases, essentially, by putting a Kafka stream in the middle. All data goes onto a Kafka stream when you ingest it, and from there you can sink it into various Feast deployments. At Gojek we have one central Kafka bus, and each team has its own Feast deployment for their systems. At the same time, you still have a centralized Feast Core, essentially a registry of features. So you get centralized management, but decentralized serving.

For Feast 0.3 we introduced some new concepts to reduce the sparsity in our data model. We have something called a feature set: you don't define just a feature, you define a group of features, and that group maps onto a source, a data frame or an event stream. So we know this information comes in together, and we know the time it comes in together, which means those rows are always written as full rows. There's zero sparsity, unless the creator of the event adds nulls or something. That completely changed the data model for us.

Current Feast is essentially an interface. We're moving away from an infrastructural system toward something as thin and as lean as possible. It can load from various sources: streams, warehouses, flat files. It can store and serve historical features for model training. It has a Python SDK, so we support Python as a first-class language. We also have online serving over a gRPC API, with Go and Java clients if you want to use those languages, and this lets us serve features at low latencies, two or three milliseconds. We have a consistency guarantee to avoid model skew, because the centralized bus distributes the same data to all of these stores. And then, obviously, we have centralized management of features and their metadata.

So how has Feast evolved from 0.1? These are the two concepts I introduced a few seconds ago. Feature sets completely changed our data model. They map onto sources, and they improve retrieval times: when you store dense rows you can compress them much more easily, about 70 to 80 percent compression, so you can store a lot more information, much more localized, and just look rows up and serve them to end users. Feature sets also provide isolation: if a team creates those strange composite entities, all of that complexity sits inside their feature sets.
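For instance, that service-area example might look roughly like this in the 0.3 Python SDK. Treat the exact class and argument names as approximations of the 0.x API, which shifted between releases; the point is that the composite entity is scoped to this one feature set rather than the global entity namespace.

```python
from feast import Entity, Feature, FeatureSet, ValueType

# A team-specific feature set with a composite entity. Other teams never see
# service_area / service_type as global entities; they live inside this
# feature set, which acts as the namespace. Feature names are made up.
area_product_fs = FeatureSet(
    name="area_product_features",
    entities=[
        Entity(name="service_area", dtype=ValueType.STRING),
        Entity(name="service_type", dtype=ValueType.STRING),
    ],
    features=[
        Feature(name="is_weekend", dtype=ValueType.BOOL),
        Feature(name="bookings_last_hour", dtype=ValueType.INT64),
    ],
)
```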
Nobody else has to see that feature set unless they go looking for it, so the feature set is an implicit namespacing mechanism. And then, like I said, with Feast 0.3 you have these decentralized stores. The architecture is something like what I have on the right there: you have multiple private clouds, or VPCs, and perhaps a shared VPC they can network with. The deployments can discover data sources, where to find data, based on what the core registry tells them, but they don't actually need the core registry to reach the data; they can find it themselves. They're completely isolated from each other, and if the shared VPC goes down, they can still operate and serve features.

Okay, so let me show you a quick example of what the API looks like. Let's assume we want to train a model on trip completion, the toy example I showed you earlier. We're going to do two things. First, we'll register a feature set and load a data frame into Feast, and we'll also register a stream and load in data from that stream. Then we'll retrieve that data for batch training, serve the model online, and retrieve from the online API.

The first step is the very simplistic one: you read a CSV into a data frame and create a feature set with a name; the name is the only thing we need. Then we infer the schema of the feature set from the source data: Feast looks at the properties of the source data frame to build the feature set, which gives it an understanding of what data is captured in that frame. You register it with Feast, and then you can load the data in. In this case it's a push-based model, so you're pushing the data into Feast; we currently support both push and pull. On the right you can see the schema it inferred: you get your acceptance rate, your conversion rate, and your daily trips, and it detects that driver ID is the entity, or entity type. Now this data is automatically available for serving and for training as well, and you can assume there are millions of rows, not just four.

If you want to use a stream, it's a little more explicit. Here you say: create a driver-stream feature set; it has this specific entity, this one feature, and there's a Kafka broker and a specific topic where that feature will be found. Feast maps fields onto those features if you use the same names. You apply the feature set, and it spins up jobs and automatically starts sourcing that data. What I'm not showing you here is how the serving stores are configured, but essentially each serving store has a subscription where you say: I want to subscribe to these feature sets, I only care about these. It then automatically starts pulling in that information, and when a client hits that serving store, it will have the data, or not, based on the subscription.

So now we've registered two feature sets: one for your batch data and one for your streaming data.
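Put together, the two registration flows look roughly like this with the Python SDK. Again, the method names (infer_fields_from_df, apply, ingest, KafkaSource) are my approximation of the 0.3-era API and may differ in your release; the addresses and the stream feature name are made up.

```python
import pandas as pd
from feast import Client, Entity, Feature, FeatureSet, ValueType
from feast.source import KafkaSource  # import path is an assumption

client = Client(core_url="feast-core.example.com:6565")  # hypothetical address

# Batch: read a CSV, infer the feature set schema from the data frame,
# register it with Feast Core, then push the rows in.
df = pd.read_csv("driver_features.csv")
driver_fs = FeatureSet("driver")
driver_fs.infer_fields_from_df(
    df, entities=[Entity(name="driver_id", dtype=ValueType.INT64)]
)
client.apply(driver_fs)       # register with Feast Core
client.ingest(driver_fs, df)  # push-based load into Feast

# Streaming: declare the feature set explicitly and point it at a Kafka topic.
# Feast spins up an ingestion job and maps fields onto features by name.
stream_fs = FeatureSet(
    name="driver_stream",
    entities=[Entity(name="driver_id", dtype=ValueType.INT64)],
    features=[Feature(name="current_trip_count", dtype=ValueType.INT64)],
    source=KafkaSource(brokers="kafka:9092", topic="driver-events"),
)
client.apply(stream_fs)
```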
So how do you train a model? You've got a list of features, and you hit the Feast API through the Python SDK with that list plus your driver entity rows. The entity rows, that drivers variable or object there, contain two things: the driver IDs, and the timestamps you want to join onto. Typically those timestamps are the ones the trip-completed event happened at, like the transaction timestamp. Feast joins all of the features onto those rows and returns them. You can call the to_dataframe method to materialize the result as a data frame, or you can access the Avro file, or whatever file the serving store produces for you, directly. Then you fit your model and train it.

You'll notice that when we go to online serving, the feature list stays the same. The API is completely consistent: you're just referencing a feature set name and a feature within that feature set. And, just to go back for a second, Feast automatically joins these features in a point-in-time correct way. It makes sure the timestamps are actually correct and the joins happen correctly. What I'm also not showing you here is that it can do staleness detection: if you try to reference features that are too old, it automatically drops them, warns you, and returns metadata about missing values and missing keys. So there's more complexity here, but this is the simple example.

When you go online, you do the same thing, except it's even easier: you just provide driver IDs and the list of features, and it returns the list of values to you. It enriches your ID. So you send in the driver ID, the yellow at the top there, it enriches it, and you can send that straight to your model. Feast automatically makes sure the shape and contents of this data are consistent with training, so your model knows exactly what it's going to get.

So what's next for Feast? We've just finished 0.3 development, and we're hoping to get Feast 0.4 out either by the end of this year or early next year, but the proposals should come very quickly. We want to do this completely in the open, so we're interfacing with the Kubeflow community as well as folks at other large tech companies that are using Feast, or at least playing around with it. We're also thinking of adding project namespacing to Feast as part of this proposal. One of our big next releases will be feature statistics with visualization, as well as capturing the schemas of data frames: if you export feature data with the batch API by creating a data set, we can understand that data's schema using a tool like TFDV or an equivalent, capture those schemas within Feast, and use them for validation both offline and online. We also want to add additional, non-GCP-specific sources. Right now that requires a little bit of dev work if you want to do it yourself, but we want to make Feast easy to extend. And finally, some other high-level goals: a Feast UI, feature transformations, and lazy loading of batch data so you don't have to push it into Feast.
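As a recap of the retrieval side from the example, here's a rough sketch using the 0.x Python SDK. Both the feature reference format and the entity-row types changed between releases, so treat the names and signatures as approximations; the stream feature and addresses are the hypothetical ones from earlier.

```python
import pandas as pd
from feast import Client

client = Client(
    core_url="feast-core.example.com:6565",       # hypothetical addresses
    serving_url="feast-serving.example.com:6566",
)

feature_refs = [
    "driver:acceptance_rate",
    "driver:conversion_rate",
    "driver_stream:current_trip_count",
]

# Batch training: entity rows carry the IDs plus the event timestamps to
# join onto, typically the transaction timestamps of the label events.
drivers = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-10-01 10:00", "2019-10-01 12:00"]),
    "driver_id": [1, 2],
})
job = client.get_batch_features(feature_refs, drivers)
train_df = job.to_dataframe()  # point-in-time correct, staleness-checked

# Online serving: the same feature references, but only IDs are needed.
online = client.get_online_features(
    feature_refs, entity_rows=[{"driver_id": 1}]
)
```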
So the key takeaways: Feast makes sharing and reusing features easy; it makes it easy to do point-in-time correct joins over large amounts of batch data for model training; and it lets you serve feature data online easily, in a way that's consistent with your batch training.

If you want to get involved, we've got a GitHub page over there with plenty of activity, and a Slack channel you can join: it's on the Kubeflow Slack, in the #feast channel. You can also join our mailing list, which I've linked as well, if you want updates on new releases. We're developing very actively at the moment, there are a lot of proposals coming, and we'd love to get more contributions from the community. Right, thank you. Questions? Anybody?

[Audience] I was wondering what kind of infrastructure this platform actually requires. If I want to play around with it, for training or sharing data, what would I need?

Right, so we have one hard dependency right now, and that's BigQuery. The warehouse for all versions of Feast requires BigQuery, but the warehouse is an optional component. You can spin up Feast locally: we've got a Docker Compose file in the repo and Minikube installation instructions, so you can do pretty much everything except batch retrieval. If you want that, you need to use BigQuery. An open source warehouse is probably the most requested functionality, so we really want to add that soon, hopefully by the end of the year, worst case early next year. We're thinking of building on top of an existing interface there so we can support many different external stores. But to answer your question, you can run pretty much the whole of Feast locally, in Minikube or with Docker.