Hi, my name is Willem and I'm an engineering lead on the data science platform team at Gojek. I'm joined today by my colleague Alexey Moskalenko, a senior engineer on the team. Today we're going to be talking about building a cloud-native feature store with Feast.

On today's agenda: we'll talk about the data challenges teams face when operationalizing machine learning, how Feast can help with those challenges, and what Feast is and what Feast is not. We'll have a quick demo of Feast. Then we'll talk about the project itself, where it's been and where it's going. We have three big announcements to make, and I'm hoping you can stick around for those.

How does data science work at most teams today? These projects are typically started because you want to target some kind of business outcome, some metric that you want to push up. A data scientist is often tasked with building the first model or proof of concept. They're given data and asked to build this model and see if it's actually viable. Often the inception of the project is a notebook. This is a linear end-to-end flow that a data scientist comes up with, and you evolve that notebook into an end-to-end machine learning system. You end up with something that resembles what's on the screen now: on the left you have your data sources, then you have transformations, let's say batch transformations, then model training, model deployment, and model serving. It's one end-to-end flow, and it resembles the notebook it was evolved from. The final product is a model that's integrated with the production system. This works, and most teams can get to this point by hacking together open source or proprietary products.
At some point, somebody will say we need real-time features and real-time data, so you hook up a stream. At some point, somebody will say we need more distributed computing, so you hook up Spark. Ultimately you're left with an end-to-end system, one monolith. These monoliths come with costs, sometimes very big costs.

The first problem is that they're very slow to iterate on. If you have one team that needs to work on feature engineering, another that needs to iterate on modeling, and another that iterates on online serving and the production requirements, they all want to iterate at different frequencies and paces, and they're different teams sitting in different places. But they have to work in unison because it's one monolithic system, and this slows them down.

The second problem is that features often need to be redeveloped when going from training to serving. This is often because features are written in Python and aren't performant enough, so they need to be rewritten in Java or Go. For all these reasons, inconsistencies arise when engineers and data scientists need to work together to produce one final product.

The third problem is that, because of these rewrites and changes from one environment to the next, you often have training-serving inconsistencies in the data, and this can lead to a performance drop in your models.

The fourth problem is that data quality problems arise when you don't have the proper monitoring in place, the proper metrics in place, or validation of your data. Basically, the tools that are available today are not meant to measure data, yet this data is affecting business outcomes. The data that goes into the models affects the predictions that are made, and those predictions affect the outcomes and the decisions that you make.
Finally, one of the biggest problems we found at Gojek, and that many other teams have also told us about, is that there's just a gross lack of reuse of features, even though feature engineering is one of the biggest costs teams incur. So these are the five big problems that we've seen ML teams face when it comes to operationalizing data.

We believe a need exists for a battle-tested open source feature store to address the problems we've highlighted, and we think Feast is that feature store. Feast is a system that attempts to solve the key data challenges in production machine learning. If we go back to the diagram I showed you earlier, typically what you'd have on the left is your data, then your transformations, and then your model training, deployment, and serving. This end-to-end flow would be duplicated for each project you undertake. You'd have no reuse, or very little reuse, across projects.

So how does Feast help with this kind of monolithic end-to-end architecture? Feast decouples it completely. The first problem we highlighted earlier was lack of iteration. Now you've decoupled the engineering and creation of features, because that process ends with Feast. That can be one team, whether engineers or data scientists. Then you have another team that's purely iterating on model training: they're selecting features from Feast, training a model, and shipping it to a registry. Then you have a third team, perhaps a production team of engineers. They're shipping models into production, connecting them to real-time production systems, and looking up online features at low latency. They can do this confidently, at scale, and with the monitoring and instrumentation that it requires. All three teams can iterate independently.
This solves both iteration speed and reuse, because teams can now independently use each other's artifacts, whether models or features, in this architecture.

So what does Feast provide? Feast provides a central registry. This registry is a catalog through which you can define your features, reuse them, and collaborate on them across teams and across projects. Feast also does ingestion: it provisions jobs that load data in from upstream sources, whether streaming or batch, into the stores, which enables the third point, serving. Feast provides a point-in-time correct, temporally correct serving layer that allows you to look up data at scale for training a model, and it will handle the joins and everything you need. It also allows you to do online lookups, low-latency lookups, for predictions in the online case. And it provides the monitoring tools to operate your system at scale in production.

The end-to-end flow is represented on the screen. On the left you have your data; these can be notebooks, data lakes, data warehouses, it doesn't really matter. You push your data into Feast, and this is done by triggering an ingestion job. That job is a Spark job, and it will write your data into the stores, in this case a consistent copy of the data you're ingesting into both the online and offline stores. The definitions of the features and the data you're ingesting are handled by Core. Feast Core is that central registry. Once you've ingested your data into the storage layer, it's available for all teams to use. If one team wants to try out a feature that has been published and ingested into Feast, they can simply query it out of the Feast serving layer and train a model. For training, Feast is used through an SDK, and for online serving it exposes a low-latency API. So what is Feast not?
Feast is not a workflow scheduler; it's not like Luigi or Airflow, it doesn't do scheduling. It's not just a data lake or data warehouse like BigQuery, because it has online functionality as well, although it uses BigQuery and some of these tools underneath. Feast doesn't do transformation, so it's unlike Spark and Pandas, although it will utilize those tools; those are upstream tasks. Feast does have some discovery and cataloging functionality, but it's not meant to be a discovery or cataloging system. Feast does not try to solve data lineage or data version control, and Feast is not a model serving, model tracking, or metadata tracking solution.

Hi, I'm Alexey from Gojek, and I want to show you Feast in action, but before that we need to deploy it. The primary way to deploy Feast is to roll out Docker images to Kubernetes using Helm. The only thing that needs to be done manually is to create some secrets. I'm already connected to my Kubernetes cluster, so I will create a namespace and add the secrets: first for Postgres, where you can see I specified just the Postgres password, and also a Google service account key, since our installation uses Google Cloud Storage. Now we can run helm install. After you clone the Feast repository, you can run helm install using the values override files that we keep in the repository. It takes some time to start all the containers, but when all the pods are running we can connect to the Jupyter notebook.

Now that we've finished the deployment, let me briefly talk about Feast components. The first one is Feast Core. It's essentially a registry which stores specifications of features, collections of features which we call feature tables, and entities, which are the keys in those feature tables. The next one is Feast Serving. It provides features in real time and with low latency, and it's mainly used by model serving for running real-time predictions. Feast Serving relies on the online feature storage, for which we use Redis.
The ingestion job is mainly responsible for populating this online feature storage. It pulls data from a data warehouse or data lake, which we call a batch source, or from Kafka or Kinesis, some kind of streaming source, which we call a stream source. Historical retrieval is another job which does a similar thing: it pulls data from the data warehouse or data lake to prepare a dataset for training. Both of those jobs aim to provide consistency between the data used for training and the data used for serving. The Feast SDK is what glues all of this together. It's used for creating features and feature tables, and for requesting features from Feast Serving to retrieve online features. It can also be used to trigger historical retrieval or to manage your streaming and batch ingestion jobs.

In this demo I want to demonstrate the basic functionality of Feast. I will register features in Feast Core, show you how to retrieve a historical dataset for training, and of course how to use Feast Serving to retrieve online features. We will also cover how to ingest data from batch and streaming sources into the online feature storage.

First we need to instantiate the Feast client. It's essentially an entry point to all the functions provided by the Feast SDK. We can configure it both by passing parameters to the Feast client constructor and through environment variables. Now we can declare our features. For this example I will be using a driver-trips dataset, which consists of features like average daily trips, conversion rate, and trips today. The interesting thing about these features is that some of them, like average daily trips or conversion rate, are updated daily, and some of them, like trips today, are updated in real time. Despite the fact that all of these features are connected to the same key, or entity, driver ID, I created two feature tables here.
That's because feature tables should be aligned to their source, and it's then Feast's job to join features from different feature tables and provide you all the features related to the same entity. So the ingestion job here pulls data from the batch source and stores it in the feature storage. Another ingestion job consumes data from the streaming source and also stores it in the feature storage. And Serving basically pulls all the features related to one entity, and it does so with low latency because we store all features related to the same entity together in Redis. An important point here is that populating both the data warehouse and the stream source is not part of Feast's responsibility. Feast relies on the fact that our customers already have tools for populating both batch and stream sources.

In addition to the features I already showed, a feature table also has two properties: a batch source and a stream source. Each feature table is required to have a batch source, because all feature tables should be able to participate in a training dataset, but not all features are available from a streaming source. In this specific example, we first create feature tables with batch sources only, so we will be able to use them in preparing the training dataset. We first create the entity, then declare a feature table referring to this entity, adding all the features that are updated daily and specifying their types, and we also specify the source. It will be a directory in Google Cloud Storage, in Parquet format. Also important is the timestamp column that we use here. It's used for point-in-time correctness, which I will describe in a moment, and to deduplicate data that may be duplicated in the source. There is also a date partition column, which we use for optimizing our Spark jobs. So we've declared both feature tables, and now we can easily register them in Core.
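As a side note, the serving-side storage layout described above, with all feature values for one entity living under one key, can be sketched with a plain Python dict standing in for Redis. Everything below (the helper names, the feature values) is an illustrative toy, not the Feast API.

```python
from typing import Dict

# Toy in-memory stand-in for Feast's online store (Redis in this talk).
# All feature values for one entity key live under that key, so a single
# lookup returns every feature, regardless of which feature table's
# ingestion job wrote it.

online_store: Dict[int, Dict[str, float]] = {}

def write_features(driver_id: int, features: Dict[str, float]) -> None:
    """An ingestion job merges its feature table's values into the entity's hash."""
    online_store.setdefault(driver_id, {}).update(features)

def get_online_features(driver_id: int) -> Dict[str, float]:
    """Serving: one key lookup returns features from every feature table."""
    return online_store.get(driver_id, {})

# The batch ingestion job writes the daily features...
write_features(1001, {"avg_daily_trips": 5.2, "conv_rate": 0.71})
# ...and the stream ingestion job writes real-time features to the same key.
write_features(1001, {"trips_today": 3})

print(get_online_features(1001))
```

Because both jobs write under the same entity key, serving never has to perform a join at request time, which is what keeps the online lookup fast.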
So I'm calling client.apply, which sends those feature table declarations to Core. Now we can verify that the feature tables were successfully stored. As I said, Feast assumes that you already have tools for populating your data warehouse or pushing data to Kafka, but for the sake of this demo I need to put some data into those Google Cloud Storage buckets. So I will just generate several data frames, and with this helper function I will store them in Google Cloud Storage. As you can see, the data has been successfully ingested into the feature table's batch source. We can check this by listing what's in the storage bucket: you can see a bunch of Parquet files, with the date column used as the partitioning column.

Now we can talk about the actual machine learning project. The first step in preparing the model is training, and for that we need to generate a training dataset. An important part of generating a dataset for training is point-in-time correctness. As you can see on this graph, when we make a prediction we use feature values with various timestamps. Those features, probably coming from different sources, arrived at different times, so the prediction function usually uses the latest available value. Since we want our training dataset to be as close as possible to what we will have at prediction time, we make this point-in-time correction for the training dataset as well. The data scientist must specify this line here, basically this point in time, and we do a backward search to find the most recent feature value relative to that point. Let me show an example. From the several data frames that I generated, we use the entities data frame to create a request for a training dataset. We take a sample of those entities and add a random event timestamp to each. This is the data frame we will provide as the entity source for our get historical features request.
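The point-in-time "backward search" described above can be illustrated with a small pure-Python sketch. The column names mirror the demo's driver dataset, but the code is a conceptual stand-in, not Feast's actual Spark job.

```python
from datetime import datetime

# Feature rows as they landed in the batch source, each with its own timestamp.
feature_rows = [
    {"driver_id": 1, "event_timestamp": datetime(2020, 10, 1), "conv_rate": 0.60},
    {"driver_id": 1, "event_timestamp": datetime(2020, 10, 3), "conv_rate": 0.75},
    {"driver_id": 2, "event_timestamp": datetime(2020, 10, 2), "conv_rate": 0.40},
]

# The entity data frame: each row asks "what would this feature have been
# for this driver at this moment?"
entity_rows = [
    {"driver_id": 1, "event_timestamp": datetime(2020, 10, 2)},  # before the 10-03 update
    {"driver_id": 1, "event_timestamp": datetime(2020, 10, 4)},
    {"driver_id": 2, "event_timestamp": datetime(2020, 10, 1)},  # no value exists yet
]

def point_in_time_join(entities, features, feature_name):
    out = []
    for e in entities:
        # Backward search: the most recent feature row at or before
        # the entity's timestamp (never a future value).
        candidates = [
            f for f in features
            if f["driver_id"] == e["driver_id"]
            and f["event_timestamp"] <= e["event_timestamp"]
        ]
        latest = max(candidates, key=lambda f: f["event_timestamp"], default=None)
        out.append({**e, feature_name: latest[feature_name] if latest else None})
    return out

training = point_in_time_join(entity_rows, feature_rows, "conv_rate")
for row in training:
    print(row["driver_id"], row["event_timestamp"].date(), row["conv_rate"])
```

Note the third row comes back as None: no value existed yet at that timestamp, which is exactly the null case the demo output shows.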
This get historical features call launches a Spark job, which can run on Dataproc, EMR, or a standalone Spark cluster. The Spark job will pull the features from your batch source, combine them with the entity dataset we just provided, apply the point-in-time correction, and return your training dataset. As you can see, features from the two feature tables were combined in the resulting dataset. Since I generated those event timestamps randomly, it's possible that at some points in time the values for these features hadn't been set yet, which is why we see nulls here, but in general this data can be used for training your model.

Now let's move on to online features. When you've trained your model, you will probably go to production with it, and in production you will do real-time prediction, so you will be using real-time online features. In order to do that, you need to have those features in the online storage. The simplest way to populate this storage is to take everything you have in batch storage and put the latest value of each feature, for each entity, into the online storage. So now we will do exactly that: we run the offline-to-online ingestion, which also starts a Spark job, one that reads from Google Cloud Storage and writes to Redis. When this job is complete, we can again generate a sample of entities and make a request to the online serving to retrieve features for those entities. As you can see, this is much faster, because it uses Redis as the backend. With those feature values you can run your production prediction.

As the next step, we can also add other features to our production prediction, namely the real-time features. In order to do that, we need to add a streaming source as one of the sources of our feature table.
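The offline-to-online ingestion described above conceptually reduces the batch data to the latest value per entity before writing it out. Here's a toy stand-in for that reduction (the field names are illustrative, and the real job is a Spark job writing to Redis, not a Python loop):

```python
from datetime import datetime

# Batch rows accumulated in offline storage; an entity can appear many times.
batch_rows = [
    {"driver_id": 1, "event_timestamp": datetime(2020, 10, 1), "avg_daily_trips": 4.0},
    {"driver_id": 1, "event_timestamp": datetime(2020, 10, 2), "avg_daily_trips": 5.5},
    {"driver_id": 2, "event_timestamp": datetime(2020, 10, 2), "avg_daily_trips": 2.0},
]

# Keep only the newest row per entity key; that is what gets written to the
# online store, since online serving only ever needs the latest value.
latest = {}
for row in batch_rows:
    current = latest.get(row["driver_id"])
    if current is None or row["event_timestamp"] > current["event_timestamp"]:
        latest[row["driver_id"]] = row

online_values = {k: v["avg_daily_trips"] for k, v in latest.items()}
print(online_values)  # driver 1 keeps only the 10-02 value
```

The timestamp comparison is also what protects against duplicated or out-of-order rows in the source: an older row can never overwrite a newer one.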
For this streaming source we also need to specify a bootstrap server and a topic, since we are using a Kafka source, and we need to specify a message format, which in this example is Avro. And of course we need the Avro schema, so that the ingestion job knows how to decode the records coming from the Kafka topic. The Avro schema in this example consists of the feature trips today, the entity driver ID, and an event timestamp of type datetime. Now that we've updated the feature table in Core, we can start our ingestion job. As soon as the job is consuming the specified Kafka topic, we can start to populate it. This function takes records from our trips data frame, encodes them with the given Avro schema, and puts them into the Kafka topic. Now we should be able to retrieve those features with our get online features method. We generate an entity sample again and make the request. You can see that we get both the average daily trips feature, which comes from the batch source and was ingested with the batch ingestion job, and trips today, which comes from the streaming source and was just ingested from Kafka, in the same response. These features can now be used for your model prediction. That is everything I wanted to cover in this demo. Thank you for your attention. You can find this notebook and other examples in our repository on GitHub.

Three years ago, Uber introduced the concept of a feature store through their blog post on Michelangelo, their machine learning platform. Since then, many companies have built their own feature stores, like Feast from Gojek, Zipline from Airbnb, and many others. Teams large and small are starting to realize how critical a feature store is in the ML stack of the future.

Where is Feast now? Well, we released 0.1 in late 2018. We rolled that out at Gojek, and we've gone from strength to strength. From 0.2 onwards, we've been working with our community. We developed decentralized serving. We developed point-in-time correctness. We added project isolation and namespacing.
We simplified concepts and added integration points. We integrated with TFX and TFDV. We added multiple-VPC support, request-response logging, and monitoring and metrics in production. And recently we've added Amazon support and Spark support with the help of teams like Tecton.

So who are our big adopters and contributors? We've been working closely with a community of large technology companies: Agoda has delivered Cassandra support, Postmates has delivered Bigtable support, Cimpress has added authentication, and we've been working with Microsoft and Farfetch, who've delivered Databricks and Delta support. And recently we've been working with Tecton, who've contributed Spark and Amazon support. There are over 43 contributors and more than six enterprise customers running Feast in production at scale today.

Let's talk about the Feast vision for a second. There are three aspects we want to look at. The first is open governance and standards. There are teams large and small that currently rely on Feast. They're not just running it in production, they're also contributing code back to the project and collaborating with us. These teams need to know that they can rely on the project at scale, but also that they can shape the direction of the project. So we believe open governance and a transparent process are important in driving the project and encouraging further adoption. Secondly, we want to build Feast into a multi-cloud or cloud-agnostic, modular, production-grade feature store. Meaning, if you're a team using Feast, you should be able to cherry-pick the components you need, deploy them wherever you're running your existing machine learning stack, and run that at scale, reliably. Lastly, we want to make Feast easy to integrate with existing machine learning tools and the ML tools of the future.
Meaning, whether these are machine learning platforms, upstream data transformation systems, or downstream model serving, we want to have clear API contracts and clear integration points. These are key parts of our vision, and we're already starting to work towards them.

On open governance and standards, we are super excited to announce that Feast is now officially part of the Linux Foundation. That means no single company has any kind of special privilege over the project. The trademark is now part of the Linux Foundation, and we're already operating on a governance structure defined by the Linux Foundation, which means the project is completely neutrally governed and the process and structure are completely public and transparent. There's an initial group of maintainers and a democratic process for new companies and teams to become maintainers of the project.

Secondly, we believe that we need to work with the greatest minds in the industry in order to build a best-in-class feature store, and we've already worked with many of them over the last couple of months. But one team in particular stands out, and we are super excited to announce that Tecton will be committing a significant amount of resources towards Feast development. We believe that by joining forces with Tecton we'll be able to build a world-class feature store that teams large and small can rely on, and together we hope to set the standard for what a feature store is. We're super excited about this collaboration, and all of it is happening under the Linux Foundation umbrella.

Then finally, we'd like to announce that we are now a top-level component of Kubeflow, and this is the start of our integration journey. Kubeflow is one of the best machine learning platforms out there, completely open source.
It's both best of breed and an end-to-end system, and we believe that by integrating with Kubeflow and improving that integration, we'll improve not just Feast's integration with ML platforms but also with sister projects like model serving, pipelining, and other managed services.

As for our roadmap: with Feast 0.9, which is currently under development, we believe we'll achieve cloud-agnostic deployments. We've already got Amazon and Google Cloud support, and we will be launching Azure support as well as on-prem deployment support. We'll be adding offline storage support for Delta, and we'll be adding feature engineering support, which will likely be through Spark SQL. We will also be sending out requests for comments on on-demand feature transformations and a feature discovery user interface.

And finally, come and say hello. Our homepage is over at feast.dev. Our source code is all online, completely open and Apache 2 licensed. We have a Feast channel on the Kubeflow Slack, and you can find a link to this deck at the tinyurl on the screen right now. Thank you.