Is this on? So, thank you very much and welcome to this lightning talk. My name is Kim and I'm a software engineer at a startup called Logical Clocks. We are based in Stockholm and we are developing a platform for doing big data analysis and machine learning at large scale, called Hopsworks. In this presentation I'll be talking about feature stores and trying to convince you why you need a feature store to manage the data in your machine learning pipelines. So it's a bit different from the previous talks.

Now, if you're a traditional machine learning person, the way your problems are set up is that you have these X and Y values, and your task is to model this data and learn the relationships between your variables. But as a data scientist, these X and Y values don't just show up clean by themselves. Rather, they have to come from somewhere, and that somewhere tends to be a script that some data engineer has written to extract the data, maybe from a Kafka topic or from a production data warehouse somewhere. What we're seeing now, increasingly, as machine learning gains adoption in industry, is that this process of extracting the X and Y values and getting the data in the right format at the right time, just so you can do machine learning, is a very big task. Some companies even argue that this is the hardest problem in machine learning right now. Uber, for example, claims this, and they are arguably one of the companies in the world that apply machine learning at the largest scale, so what they're saying is worth paying attention to. But even if you don't necessarily agree with everything Uber says in this quote, we at Logical Clocks agree with Uber on one point, and that is that there should not be a question mark up here. Rather, there should be a standardized interface for how you feed features into your model for training and serving, a standardized interface between data engineering and data science.
And that is the feature store, which I'll be explaining in the remainder of this talk.

Looking at the definition of a feature store, it is actually surprisingly simple: it is just a storage location within your organization where you store documented and curated features. But the power of the feature store really comes from the strong semantics we can enforce on top of it and the abstractions we can build on top of it. Some of the things we're trying to achieve with the feature store are feature versioning, automatic feature documentation and analysis, and feature backfilling. But the main point of the feature store, at least in my opinion, is the first one in the list: by using a feature store, it becomes much simpler to reuse features across different models and across different teams. Let me elaborate a little on why that is such a big improvement for many organizations today.

What we argue at Logical Clocks is that by investing in your feature store, the cost of your machine learning products will start to taper off. Our point is that instead of making a new big investment in each machine learning product you take on, building custom pipelines every time, you can make one-time investments in the lower layers of your data science stack, such as the feature store. As you build up that feature store with more and more high-quality features, the cost of building new models on top of it will start to go down.

Let me now look at an example to further make this point about feature reusability. This is a typical architecture at an organization today for doing machine learning, where you have these siloed machine learning pipelines, one for each product. It's a fairly simple workflow, but with this architecture it's very hard to do things like reusing features across the different pipelines.
So typically what you have to do, and what I've actually seen at organizations, is that if you have a feature here that you want to reuse over here, you just copy the definition and recompute it again. And if you want to do more complex things like feature backfilling or automatic feature analysis and documentation, that is basically impossible with this architecture.

What we suggest instead is to use what is called a feature store as the common interconnect between all of these pipelines. This means that a feature is no longer tightly coupled to a specific model; rather, a feature is now an independent, versioned and reproducible artifact that can potentially span many different models. We no longer have this tight coupling between model and feature; they are decoupled now, and that gives us a lot of freedom and flexibility as engineers and data scientists. For example, one big benefit is consistency: no matter whether we're using a feature for training or for serving, for a neural network or for a decision tree, we're always going to use exactly the same definition of that feature, because we're not redefining it over and over again; we're just reusing the same feature. This also means we can do nice things like this: we have all features centralized, so we can easily analyze them and look for correlations. We can also do things like feature backfilling, so we can be more efficient in how we compute the features. That is the promise of what the feature store will give you.

Now I will very briefly go into the more technical details of how we can build this thing, and I will speak from our experience of building a feature store on our platform. Internally, a feature store consists of five different components, starting with the storage layer where we store the actual feature data.
This tends to be predominantly numeric data like floats and integers, but it can also be categorical data or binary data, for example. Then we have the metadata layer, and this is, in my opinion, the most important layer to get right in a feature store; you'll see that in our implementation we exploit it quite a lot. This is where we store things like the version of a feature, the ownership of the feature, which models use it, which job computed it, and from which dataset it originates. This means we can automatically backfill features when necessary. We also store a lot of statistics about our features to help our data scientists understand them: the distribution of each feature over time, the correlations of features with each other, so Spearman and Pearson correlations, and also descriptive statistics, clustering analysis and so on.

Then we have the two final components of the feature store, which are the feature registry and the API, and these are the client-side interface to the feature store. The registry you can think of as the front end to the feature store. It is a place where we can publish our features, search for them, browse their metadata and see how the features were computed. You can think of it a little like an app store for machine learning: a place where you can discover new features within the organization, and where you can publish your own work so that other people can reuse it. The final component is the API. This is what we use inside our machine learning pipelines to read from and write to the store. These are just regular client-side libraries, and I'll show you a quick example of what our API looks like. Here we have two examples of the API, one for reading from the store and one for writing to the store, and as you can see, code-wise, I hope you agree with me, it's quite a simple API.
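As a rough illustration of what such a read/write API can look like, here is a minimal in-memory sketch. To be clear, the class and method names below are invented for this sketch and are not the actual Hopsworks client API.

```python
# Hypothetical sketch of a feature store client: read by feature names,
# write a dataframe-like structure of features. Not the real Hopsworks API.

class FeatureStoreClient:
    def __init__(self):
        # Columnar in-memory store: feature name -> list of values.
        self._columns = {}

    def insert(self, dataframe):
        """Write features: `dataframe` here is just a dict mapping
        column names to equal-length value lists."""
        self._columns.update(dataframe)

    def get_features(self, names):
        """Read features: select columns by name and return row tuples,
        the way a query planner would assemble them behind the scenes."""
        cols = [self._columns[n] for n in names]
        return list(zip(*cols))


store = FeatureStoreClient()
store.insert({"team_budget": [10.0, 20.0], "avg_attendance": [3.1, 4.2]})
rows = store.get_features(["team_budget", "avg_attendance"])
print(rows)  # [(10.0, 3.1), (20.0, 4.2)]
```

The point of the sketch is the shape of the calls, not the storage: the reader asks for features by name and never needs to know which pipeline produced them.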
So to read from the store you just provide a list of feature names, and to write to the store you just provide a dataframe of features. To make this API so simple, what we have done is abstract all of the heavy lifting into our client-side libraries, so we actually have a query planner that will analyze your query and give you the result.

And now my time is up, but if you're interested I have several references on the final slide where you can read more. Just to summarize this presentation: machine learning is a very powerful tool, but it also comes with a very high technical cost. In the beginning of this presentation I mentioned the complexity of getting the data in the right format at the right time for doing machine learning, and our solution to this problem is to invest in a data management layer specifically designed for machine learning, called a feature store. The world's first open-source feature store is available on our platform, Hopsworks. Thank you.

Yes, so the question is what types of features would be used for natural language processing. Typical features could be, for example, word embeddings, which are very common, but it can also be features such as whether a word is a noun or a synonym, those kinds of features. But word embeddings, I think, are the main feature used in NLP right now. Yes, you can do that as well. It depends a bit on which task you're trying to solve, but that's also an option; you can add more features to do better in most cases.
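As a small companion to the talk, here is a sketch of the kind of automatic feature analysis the metadata layer was said to store, Pearson and Spearman correlation between two feature columns, written in plain Python so it is self-contained. A real feature store would compute this at scale (for example with Spark); the helper names here are my own.

```python
# Pearson and Spearman correlation between two feature columns, pure Python.
import math

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks, averaging the rank over tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

clicks = [1.0, 2.0, 3.0, 4.0, 5.0]
spend = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly linear in clicks
print(round(pearson(clicks, spend), 6))   # 1.0
print(round(spearman(clicks, spend), 6))  # 1.0
```

Storing numbers like these per feature pair is what lets the feature store surface redundant or strongly related features to data scientists automatically.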