Hey, everyone. Welcome back. Hope you enjoyed the lunch. From experience I know half of you have not been to KubeCon before, but for those of you who maybe went to KubeCon EU this year, it wouldn't have been hard to improve on the food from KubeCon EU. I actually really enjoyed my lunch, so hopefully the rest of you did too. I'm going to hand over to Tiago, who's one of the committee here. He's going to talk a little bit about the shirts he brought to give out, and about trying to encourage and embrace diversity and global representation in tech.

Hi folks. How about playing stand-up-sit-down once again? Everybody stand up... and sit down. How many of you folks were born and raised in Europe? Okay, please sit down. How many of you folks were born and raised in Asia-Pacific? How many of you folks were born and raised in America? How many of you folks were born and raised in Latin America? How many of you folks live outside Latin America? Geographical diversity matters. So this t-shirt that I brought for you folks as part of the giveaway is a token of appreciation. I know how tough it is to make it into the tech industry without proper access to knowledge and resources. This jersey is from a minor-league soccer team from my blue-collar neighborhood; my father used to be a truck driver. Every time I wear this t-shirt, it reminds me how hard it is in some regions to achieve your goals. In Brazil you either become a soccer player or a so-called bird star, and it's much the same across Latin America. It also reminds me of the importance of giving back to the community. Thank you very much, and enjoy the rest of your day.

Amazing. Yeah, thank you.
Tiago. Okay, so our next talk's title is "Feathr: An Open Source, Battle-Tested, Scalable, and Enterprise-Grade Feature Store." I imagine if you wanted a feature store, those four things are definitely things you'd be looking for. I think the cool thing with Feathr is that it was built at a company that already has lots of scale and has been dealing with this for five years now: it came out of LinkedIn, and Hangfei Lin is going to talk to us about it. He's a staff software engineer at LinkedIn, so please give him a round of applause and welcome him.

Thank you. Can you hear me now? Okay, cool. Thank you. Yeah, I'm Hangfei, a staff software engineer at LinkedIn. Today I'm really happy to be here to share our work and progress on AI infra at Kubernetes AI Day.

A little bit of self-introduction: I joined LinkedIn about seven years ago. In the first three years I worked on building microservices and data pipelines, and later I switched to machine learning infra, focusing on building a feature store for the last four years. I like system design and product design, and I also like to hear users' feedback about our product, conduct user studies, understand their pain points, and build a product or solution for them. So if you have pain points in this area, I'm happy to talk to you.

With that, I'll give a very brief introduction to Feathr. Feathr is an open source, enterprise-grade, high-performance feature store. It was built at LinkedIn about five years ago, has been adopted by the majority of AI teams there, and is powering a lot of AI applications at LinkedIn. This April we made it open source, collaborating with Azure, and it's now integrated into the Azure cloud. And this September it was accepted by the LF AI & Data Foundation as a sandbox project.

Feature stores are both old and new. We have been developing a feature store for about five years, but there are still companies just starting to consider adopting a feature store, or thinking about whether they should use one, and every year there are new solutions trying to tackle this problem from different angles. So I feel it's important to share our view: how we formulate this problem, what problems we have seen, and how those problems led to the solution we have built. Generally, I'll cover the problem statement, the solution and architecture, the roadmap, and a demo if we have time.

A little bit of background that most people might be familiar with: our industry has really grown in the last decade. There are a lot of AI applications, like GPT-3 and Stable Diffusion for AI-generated content, and many other uses that help people build better applications. I believe AI will keep improving people's lives in this century and disrupt many other industries, and the demand is growing exponentially.
So is the demand for MLOps and ML infra. In the past, people paid more attention to models and less attention to data and features; as the presenter in the first keynote mentioned, there's a shift from model-centric AI towards data-centric AI. With data-centric AI, most people start to pay attention to unstructured data like images and text, and parts of the industry are also starting to pay attention to structured and semi-structured data, like user interaction data and static user information. We also see a trend of people shifting towards using more real-time features and building more real-time inference.

As for the goals for features: usually people want higher quality. They want the feature data to be more correct, without data leakage and so on. They want features to be more performant: high throughput for offline use, low latency for online inference. They want faster iteration, because building an AI application is a really iterative process, and the faster you can iterate, the faster you can deliver impact. And lastly, the importance of feature freshness has become more prominent over time. A lot of applications require fresher features; Netflix and TikTok, for example. If somebody viewed a comedy movie, it's likely they might want to view another comedy movie today, and it might not be very reasonable to recommend a horror movie. This is even more pronounced on TikTok: if you view something, you might view related content shortly after.

But how does this fit into production, into reality? This is actually a question one of our open source users asked me: "I'm new to this industry. How do I actually featurize my raw data for my applications?"
Take a typical example: you have a website with user activity data coming in. Usually you will have tracking events to collect a lot of data, a stream like Kafka, and you send that data into offline storage like a data lake. Part of that data will also be sent into an online database. In the past, for training or offline inference, you could just go through the offline database; that's a fairly mature process and workflow. But when it comes to online and streaming, it becomes relatively harder due to the complexity of the topology, and it usually takes quite some time for a company or team to make all their features available in the different scenarios: streaming, online, and offline.

That's the technical perspective; now let's look at the human perspective. There's a survey done by Forbes asking AI engineers and data scientists how much time they spend on data cleaning, preparing data, building training data sets, and fine-tuning their algorithms. 82% of the time is actually spent on data preparation, and 76% view it as the least enjoyable part of their work; it's usually kind of tedious.
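Going back to the pipeline just described, here is a minimal, self-contained sketch of the fan-out it implies (hypothetical names; plain dicts stand in for Kafka, the data lake, and the online database): each tracking event is appended to an offline log that keeps full history for training, and upserted into an online table that keeps only the latest value for inference.

```python
from collections import defaultdict

# Offline store: append-only history per key (stand-in for the data lake).
offline_log = defaultdict(list)
# Online store: latest value per key (stand-in for the online database).
online_table = {}

def ingest(key, ts, value):
    """Fan one tracking event out to both stores, as the stream pipeline would."""
    offline_log[key].append((ts, value))
    online_table[key] = value

def latest_online(key):
    """Online inference path: fetch the freshest value."""
    return online_table[key]

def value_as_of(key, ts):
    """Offline training path: the value as it was at time `ts`.
    Assumes events were ingested in timestamp order; keeping history
    like this is what lets training data avoid seeing the future."""
    result = None
    for t, v in offline_log[key]:
        if t <= ts:
            result = v
    return result
```

For example, after `ingest("user:42", 1, "comedy")` and `ingest("user:42", 5, "horror")`, `latest_online("user:42")` returns `"horror"`, while `value_as_of("user:42", 3)` still returns `"comedy"`. A real deployment replaces the dicts with a data lake and Redis, but the two access patterns are the same.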
I also surveyed our internal data scientists and AI engineers, some of my friends in other areas, and our users with a similar question: how much time do you spend on feature-engineering-related tasks, including feature serving, feature data cleaning, and preparing data for the model? On average they spend about 40% of their time on these tasks, and most of them don't really enjoy this type of work. Essentially, they enjoy feature engineering itself, but they don't enjoy, say, figuring out how to serve a feature online, how to clean the data, how to make sure the data quality is good, and all the monitoring.

So it's really hard to convert raw data into features and then serve them to models. It's pretty complex: many steps and different tools, heterogeneous environments (offline, online, nearline), and different requirements for different use cases and systems. At the organization or team level, it usually demands diverse experience, multiple skills, and different kinds of knowledge, and as an AI team scales, you will find a lot of tribal knowledge builds up around these feature-related tasks. In some companies one person does most of this; in other companies part of a team does it together, and data scientists may just focus on a very small part: they build a prototype, and after the prototype is done, they hand it over to other teams.
That's another model, and I think that model sometimes leads to the meme here: once they are done, they just hand it over and don't care about what happens to the rest of it, and then there will be complaints from the data ops team for sure. And lastly, people don't like to jump from one tool to another; that's really not fun.

So here's the problem statement I've summarized from those observations. There's coupling: people usually couple their feature engineering into the model, one whole big pipeline or one whole big notebook that does everything. Then there's training and inference skew: the training data and the inference data might not be the same, and that results in skew later on. And lastly there's an organization problem as well: it becomes harder to reuse and share your features across teams or across organizations.

So we have raw data on one side, and we want to turn it into features and serve them to models for inference. Apparently something is missing here; there's a gap we need to fill. That leads me to this question: can we build a layer to simplify this?
That brings us to our solution: a feature store. For some people this term can be confusing, because it's not a well-standardized term. In some marketing material, people define it narrowly to mean just the storage system that stores the feature data itself; for example, Redis as a feature store, or Google's TensorStore, which is mostly for offline training purposes, gets called a feature store as well. There's also a broader definition: a feature store as a feature management system that allows you to create, access, share, and discover features, usually composed of an ecosystem of tools. This is also the definition used by Wikipedia.

Feathr is a feature store in the broader sense: a feature management system. It's battle-tested at LinkedIn over five years, and we are not opinionated about which feature storage system you use; you can plug in Redis, Azure Cosmos DB, S3, and so on. It's now open source and an LF AI & Data sandbox project.

At a very high level, Feathr is an abstraction layer between raw data and the model. It helps you define features on top of raw data sources using simple APIs; get those features by name during model training and model inference, which avoids problems like training-serving skew; and share features across your team and organization.

I'll demonstrate some use cases and examples to give you a hands-on feel for it. Feathr starts with feature definitions for AI engineers and data scientists. For example, here you have New York taxi trip data, consisting of the pickup time, drop-off time, trip distance, and so on, sitting in your data lake in CSV format. You describe the path to that data and the timestamp column. The timestamp is mostly used for something called a point-in-time join, to prevent data leakage; we might not have time to cover it today, but if the data is time-sensitive, you just tell us the timestamp and its format. Then you tell us the key and give it a description. As for features, a feature can simply be a raw field you take out as-is, like the trip distance. And if you want to transform your raw data into a different feature, you can do that using SQL or Python as well; here we use a SQL expression to say whether this is a long trip distance. Lastly, you group the features together into a feature anchor. A feature anchor means a few features anchored to a specific source; so this defines your first two features for one source.

You can do the same thing for streaming features. It's mostly the same; the only difference is that for a Kafka source you have to tell us the brokers, the topics, and the schema, and the rest is pretty much identical. It's a unified API that's transferable from one platform to another.

After you have defined the features, you can use them in different use cases and scenarios. Let's talk about the most common one: building a training data set. To build a training data set, you just need to give us an observation data set. The observation data set usually contains the labels for your training, and if it's time-sensitive, you should give us the timestamp as well, so that we can join the features in a point-in-time-correct way. Then you just tell us the features you want by name, for example the location's average fare. Lastly, you
can just call get_offline_features with the metadata you provided; the Feathr engine will kick off the computation in the cluster you've chosen, and the features will be computed.

After the model is trained, you may want to serve it. Serving a model offline is relatively easy nowadays, but serving a model in an online setting is relatively harder. To serve features, the easiest way in Feathr is usually to push the data into Redis: you just specify the table you want to push to, tell us what features you need, and call materialize_features. We kick off an offline compute job for you and push the features to the online Redis. Usually after a few minutes the data is in Redis, and you can fetch it from your online inference cluster to do online inference.

We also provide a UI to help your team discover and share features. You can search for the features you are interested in, and we provide lineage metadata to show which features are derived from which features and sources. For enterprise needs, we know access control is pretty critical, so we provide that as well: you can specify which team members or teams can see which projects, with read, write, or manage access.

Lastly, we also provide derived features. Sometimes you want the ability to compose features, like one feature multiplied by another, or, as in this example, computing the cosine similarity between two embeddings. You can take a user embedding and an item embedding and apply a cosine-similarity function over them to produce a similarity score.
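As a sketch of how a derived feature like that similarity score works (hypothetical names; a plain dict stands in for the online store such as Redis): the two embeddings are materialized features, and the derived feature is computed on top of them at lookup time rather than read from raw data.

```python
import math

# Stand-in for the online store: materialized features keyed by entity.
online_store = {
    ("user", "u1"): {"user_embedding": [0.6, 0.8]},
    ("item", "i9"): {"item_embedding": [0.8, 0.6]},
}

def get_online_feature(entity_type, key, feature_name):
    """Fetch one materialized feature value for an entity."""
    return online_store[(entity_type, key)][feature_name]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_item_similarity(user_id, item_id):
    """Derived feature: composed from two stored features, not raw data."""
    u = get_online_feature("user", user_id, "user_embedding")
    v = get_online_feature("item", item_id, "item_embedding")
    return cosine_similarity(u, v)
```

Here `user_item_similarity("u1", "i9")` evaluates to approximately 0.96; the point of the pattern is that the composition is declared once and reused at both training and inference time.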
So oh, yeah, we invest a lot of A lot of our time on scalability like in linking our scale is pretty huge So we find actually if we don't do any scalability optimization It's pretty hard to complete the job to Create a training data set. So either sometimes it runs like many many days. Sometimes it even doesn't finish So we have invested a lot to make Processing like tens of billions of rows and the better by the scale data possible We build optimal native optimizations like a broom filters drawing plan optimizer optimizations and the sorted drawing to make it that possible Also, we have cloud friendly architecture that each individual component can scale out as it needs Security is also one thing we care about a lot So we have like a security credential manager and the key water manager a key water to help you to store credentials to help your team to Manage your credential in a secure way So there's a like our back. We just made we just demonstrated There to help you to control access as well So that's like a revisit this diagram again So this is like a impact in a real world what you have to do like there are a lot of branches different scenarios if you want to feature eyes access or managing features in different scenarios for offline online fashion So I highlight these like components in orange if we you want to build it kind of a mature featureization platform or pipeline So those are the components you need to build based on our experience If you want to build up like a production grade mature Kind of featureization platform or pipeline it usually may take each of them may take like a few quarters of engineering hours so and apart from that you have to think about other like components to support your ecosystem like a discoverability shareability and the monitoring and compliance and so on So each of them may take actually quite a significant time and and the investment So with fellow we are trying to say build a good abstraction to abstract away those 
infrastructure details from your AI engineers, your management teams, and so on. Your data science team can just use our APIs, with SQL or Python, to achieve what they want, and use the UI to discover and share features; managers and administrative staff can also use the UI to accomplish their tasks.

Here I want to talk about Feathr at LinkedIn, its history, present, and future, and its impact. We started initial development in 2017; in 2018 we started adoption within LinkedIn; by 2020 the majority of LinkedIn's ML applications had adopted it; and in 2022 we made it open source, collaborated with and integrated into Azure, and became part of the LF AI & Data sandbox. We also have people using Feathr on AWS. Right now Feathr is powering hundreds of AI models at LinkedIn, with thousands of features on many kinds of entities that power LinkedIn's Economic Graph, and it runs at petabyte scale. So these are several highlights.
I want to mention here the first one is reducing the feature engineering time Required to adding and experimenting and the serving features in the production from weeks to days in the past it usually requires a lot of people and the time to actually build the pipelines or build integration code or build applications to actually Put the features into production There are a lot of details that you need to worry about like a point in time drawing and the performance and the latency and how do you call a large amount data without causing GCP problems and so on and Second is related like a we we actually find actually fellow performs faster than some some customer build feature processing pipeline sometimes by as much as 50% And the fellow also enables feature sharing between similar applications like for example, you have a feature for your member or a feature for your job and the other team is building similar kind of AI applications, but with a slightly different motivation You can just use fellow and just call that feature name and we can help you to orchestrate all artifacts for you without too much hassle so Next I will talk about architecture and the and the roadmap So this is like an end-to-end workflow for the as you are integration The on the left top side is the AI engineers or data science They come in usually they will start with interacting with the UI to understand what features has already been created in the in the system and This is a power by restful API and the purview and the sequel as their data backends and we have a Python fellow Python client to help you to Create the manage and define feature definitions After you you after the data scientists create those then they can dispatch fellow engine will dispatch those computer to the corresponding Compute cluster like a spark or synapse. 
That talks to offline storage like Delta Lake or Snowflake, and it can also talk to Kafka and Event Hubs. For online, we can push the data to Redis or Cosmos DB, which can then be used by machine learning platforms.

Due to time concerns I won't cover the roadmap in detail; I'll skip it so I have some time for a summary. Feathr is an open source feature store, which can be seen as an abstraction layer between raw data and the model. It allows you to define features with transformations on top of raw data sources, get feature values by name during both training and inference, simplify the feature preparation workflow, and share features across teams and even organizations. We don't have time for the demo, but there's a link if you are interested; you can open it and try it out. It's a product recommendation scenario for an e-commerce website. And lastly, do we still have time for Q&A? Yeah, maybe we can take some questions. Okay.

Yep, round of applause. Yeah, thank you. Thank you. Great talk. Do we have any questions from the audience? Just stick up your hand and I'll come over. Yeah, there we go, at the back. Sorry, can I get you to speak into here so people online can hear?

Could you please talk a little bit about your compute engine? Is it all Spark?
Yeah, the question is about the compute engine we use. I think that's a great question; right now we are mostly using Spark.

Great, any others? Yeah, I have a question about the RBAC control. Can the RBAC be integrated with, say, the company's directory? How is it integrated in a real case, or do we just manage it within the Feathr context?

Yeah, that's a great question. The RBAC right now is integrated mostly with Azure's access control, but we have an abstraction layer; it's essentially just key-value pairs, so you can swap in a different identity management system. For example, we have a user that is actually using this on AWS, so we are working with them to make this available on AWS.

Great. Any final questions? Yes, one more here and we'll wrap up on that. When you did transformations earlier, do you support geospatial-type transformations: being able to say something's within a certain radius of a point, or within a certain bounding box?

Sorry, can you repeat the question? Do you support geospatial-type attributes? You mean geospatial transformations, or just the data types? Both, okay. We don't natively support geospatial-specific data types, but we support a rich set of data types, from one-dimensional tensors to multi-dimensional tensors. And for transformations, you can plug in whatever transformation you like, as long as it's SQL-based or Python-based. I know a lot of geospatial libraries are Python-based, so you can likely plug in Python-based geospatial transformations. All you have to do is include your dependencies, and we will download those dependencies when the computation runs.

Great. Right. Well, yeah, thanks again. Let's have a round of applause for Hangfei Lin.
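As a footnote to that last answer, here is the kind of Python-based geospatial transformation that could be plugged in as a UDF (hypothetical helper names, not built-ins of any feature store): a haversine great-circle distance used to derive a within-radius boolean feature.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(lat1, lon1, lat2, lon2, radius_km):
    """Boolean feature: is point 2 within `radius_km` of point 1?"""
    return haversine_km(lat1, lon1, lat2, lon2) <= radius_km
```

One degree of longitude at the equator is roughly 111 km, so `within_radius(0, 0, 0, 1, 200)` is true while a 100 km radius is not.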