Today I'm going to be talking about a feature platform that we built at Gojek. This is a project that's been going on for quite a while, and we're actually planning to open-source it soon. At the moment I can't give the finer details, but in this talk I'll give you the higher-level considerations that went into designing the system and the major components inside it.

So let's start with who, or what, Gojek is. Gojek is an Indonesian technology startup. I'm not sure how many of you know about Gojek, but there's an example of our application on the screen, showing some of our services and products. One of the most famous ones is GoRide, a ride-hailing service. The reason it's a motorcycle and not a car, like Ola, is because of the unique challenges in Indonesia. Indonesia has seen massive population growth; it's a very big country, the fourth largest in the world by population. So, as in India, traffic can be a big problem and people need to find ways around it, otherwise you're always stuck in traffic. The way they solve this is with ojeks: people on the corner of the road who will take you to your destination on a motorcycle.

When we started in 2011, we were just a call center, no application. You could call in and get a motorcycle to come and pick you up and take you to your destination. That went down really well in Indonesia, because now people had a way to quickly move around without having to wait for an ojek. Our founders were very pragmatic: they did a lot of research with customers, analyzed user behavior, and knew there was demand for more products and services. So in 2015 we launched a small basket of products in an application: GoRide for motorcycle ride-hailing, GoFood for food delivery, GoMart for grocery shopping, and a few others in e-commerce and logistics. Since then we've hit hypergrowth. We've gone from a call center to a unicorn, a multi-billion-dollar company, and we are one of the largest players in some of these verticals. We're in 15 verticals at the moment with 18 different products. Our approach has always been: understand the customer, be very localized to the region, and don't just randomly launch products — be very targeted.

A bit about our home, Indonesia. Our application has been downloaded 80 million times. We're in 60 cities. We have more than a million drivers on our platform, hundreds of thousands of whom are active at the same time. So just in one product we are a very big player. For GoFood, our food delivery, we are the biggest food delivery service in Southeast Asia, with more than 200,000 merchants on our platform. And in digital payments, we've got one of the largest e-money systems in Southeast Asia. So the scale at which we operate is massive, and when it comes to data, data science, and machine learning, the data is very rich and the engineering systems have to be very robust to handle it. I hope this animation works, but on the right you can see an animation of just Jakarta through one day. Each pixel is a person being picked up or dropped off on a ride. This gives you a bit of an idea of the scale at which things operate, and this is a pretty old animation.
It's at least a year old, but it gives you a good idea of the pulse of the city and the magnitude of orders. And this is only Jakarta; we're currently expanding into new territories, but this is just Jakarta. So the scale is very big, and most of these transactions are processed with ML models — they're engineering systems, but they're affected by ML.

But data science at Gojek is not just about machine learning; it's also about understanding the customers. The business looks to us to understand our customers and advise them on what products to launch and what actions to take. As the data science platform lead, it's my role to make sure that data scientists are very effective and efficient in what they're doing. So we looked at where the data scientists were spending their time, and this is what we found. This is a bar that represents a typical project workflow for data scientists. The colored bars represent tasks that are data science related — in their domain of expertise — and the gray bars represent tasks that we think they should not be doing: engineering-related tasks. When we looked at what they were doing, we figured out they were doing a lot of engineering-related work. This was a concern for us, and it was actually the precursor to knowing that we needed a feature platform. Even though making features was only one section of their tasks, we realized that many of the other tasks were in support of creating features. Building pipelines especially was one of the biggest time sinks for data scientists, because the feedback loop is so long and failure in creating a pipeline can be very costly. So we wanted to decouple them from that and free them up to focus more on modeling.

So why did we need a feature platform? After that investigation, we realized there were two problems: model service creation and feature engineering. Feature engineering was the one we wanted to solve, and that's what this talk is about. There were some specific problems.

The first is pipeline jungles. Data scientists were building all of these pipelines using hacky code — I'm sure you've seen it; there are many levels of competency in a data science team. What ends up happening is you have all these pipelines that are interdependent on each other in ways you don't understand, and it's very difficult to visualize and get a clear picture of that. The systems being built were not going to scale to multiple countries, to global scale. We needed to solve this in an engineering way.

The second is inconsistency between training and serving. This is actually a big problem. What happens is a data scientist builds a pipeline to create features, maybe using Python. When they go to production, they can't reuse those feature transformations because they're written in Python. So what ends up happening is the features get redeveloped for serving, in production, in Go or Java, and there's an inconsistency between the two sets of transformations — there's scope for error to creep in. We wanted to solve this with our feature platform.

The third is the need for real-time features. When you have production systems, you typically don't automatically get real-time features. You need to engineer a system that can stream the data in and store the features necessary for that model.
Having to do this for every project was a big cost. We wanted a system that could just give us any feature our models required.

The fourth is standardization. Just having a standard way to define a feature allows you to have logging, monitoring, lineage, and documentation — all the things that are often ignored but very important. If a data scientist opens a code base, there's a standard way in which they understand what the source code should look like. It shouldn't be a mystery they have to decode every time.

Then we wanted to be able to handle large volumes of feature data. If you have 18 products, tens of millions of customers, and more than a million drivers, you have a lot of rich data. You often have thousands of features that you want to pull into a single model, over months of data, and that can take days on the wrong platform. You want to be able to create that in hours, or maybe even minutes.

And the last is duplication between data scientists. We wanted the platform to have an element of discovery: one person creates a feature, and everybody can use it or improve on it. You don't want data scientists working in silos where they create features that only live in their realm. This is very important because some data scientists are a lot more productive at building features, and some features are very powerful across projects — not just useful for the one project they were built for.

To distill these requirements down, there were some core things we wanted from our platform. Abstract away the infra: we don't want data scientists working on engineering systems. Standardize feature creation, like I said. Consistency between serving and training, because inconsistency is a big avenue for bugs to enter production systems. Scalability in both serving and training: the training store should hold historic time-series feature data, and the serving store should scale to high QPS. And obviously real-time features in serving, and discoverability.

So, to reason about this problem — [audience question] — yes, you mean between serving and training. Typically what happens is that data scientists are somewhat disconnected from the engineering and production systems. They're fine building a pipeline that maybe queries a database, dumps the result to CSV, and transforms the CSV — that's the basic example — and now you've got features created. But how do you go to production with that? You can't just put those Python transformations into production; often it needs to be rewritten in a different language like Go or Java, something more performant. So you end up with the same features in two different languages, often using different libraries and dependencies, with inconsistencies in data types, and there's room for error to creep in. We want to avoid that and move feature creation upstream, so that a feature that's created is identical in both training and serving. So there's consistency.

So, we're talking about a feature platform. We needed to break it down into higher-level components instead of having this homogeneous "platform" thing that you can't really reason about. We knew we were going to have input data. In our case we're on Google Cloud Platform, so this would be BigQuery for raw relational data, an object store — cold storage on Google Cloud Storage — and then finally an event stream.
We use Kafka for our event stream. We have two consumers: model services in production, which need real-time features for serving, and data scientists training their models.

The components we broke it down into were, first, creation. These are essentially the transformations on the raw data — bounded and unbounded data; it can be streams or flat files or tables. We needed a storage layer to store the data for both training and serving. We needed an access layer: this is a very important part that is often missed. You don't want clients accessing the databases directly; you want an abstraction layer there to handle the load — we'll talk about that a bit later. And finally you need a discovery element. This can be a user interface or some way for data scientists to see which features are in the platform: what can I use? What can I use to train my models?

So let's talk about those components. The first one is creation, and it's actually one of the hardest. I hope you can see this, but this is a typical scenario of one of the problems we solved with machine learning: the dispatch problem, or allocation problem. If you're a customer, you want to go to a destination. You're at one point, you press book — we need to decide which driver to give you. This problem is very important because of the amount of money that flows through it. If you have millions of bookings every day, a small tweak in the ML model can make a big difference. So features are critical for this model. You want to determine whether a driver is going to take a long time to get to the customer, or whether he's already on a trip — and these things change in real time. Whether he's got a bad rating or a good rating — all these factors come into play. So for the dispatch model we needed both training and serving features, and a lot of them.

Some examples of the features we needed. For the driver, there's location data, speed, direction, ETA, and a whole range of other features. You have temporal features: time of day, day of week, is it a public holiday? If you train a model and it's suddenly Ramadan, your model becomes inconsistent, so you need to take these factors into account. There are regional features: is there a traffic spike? Are there many flights landing in the city at the moment, so you can anticipate what's going to happen in the near future? And then customer features as well: what is their behavior in the past, clicks and actions, are they canceling a lot of rides at the moment, are they price sensitive, or are they very tolerant of a driver being late, for example? These features are critical in this model's decision making.

So, like I said — I don't know if you can read this, but basically we had three stores that we needed to support with the system. BigQuery is our data warehouse; this is where all our transactional data is, our relational store. We have events on Kafka, and we have our data lake on Cloud Storage. The data lake is mainly used for extremely large volumes of data. Take GPS pings: if you have pings coming in every second for every driver, that's an immense amount of data. So that we store on Google Cloud Storage.
So this is where our input data is, and we looked at a lot of ways in which we could create features. Based on our requirements, the one we went for was Cloud Dataflow, and I'll talk a little about why. Our main options were basically Spark, Dataflow — or rather Dataflow/Apache Beam — and Flink. We considered Flink and Beam as the main options because of their first-class support for streams and batch: both of those concepts are first-class citizens. If you define a feature in Beam or in Flink, you can support any kind of bounded or unbounded data source. That's very powerful, because it means data scientists only need to define a feature once. So we scrapped Spark for that, although Spark has made a lot of progress over the last couple of months and years.

The reason we chose Beam over Flink was Dataflow. Dataflow is a managed service on Google Cloud that allows you to run Apache Beam code. One of the things we wanted from this platform was to abstract away the engineering and the infra, and Dataflow has an incredible amount of power in scaling and in abstracting away a lot of that pain. It's kind of like a serverless platform, in a way, because you only define your code. That was the main reason for choosing Dataflow.

Once you have feature definitions written once, as Apache Beam code running on Dataflow, you automatically have consistency between serving and training, because the data stores downstream always get identical data. It's up to you how you store that data, but the transformations happen before the data gets to the stores, not after. This is a very important point. And if you architect this correctly, you can replay the whole feature creation process from events you've captured on cold storage. So it's deterministic and idempotent.

Just an example — I hope you can see this — this is what a feature definition looks like in Beam. To reiterate: Apache Beam is the API, and you run it on Dataflow; Dataflow is the runner. This is a Python example. Typically we write Java code for the feature transformations because the feature set is larger, and you can't do windowing in Python on Beam. But this is a good example of what a feature transformation looks like. Trip events come in, and you apply a series of transforms, basically lambdas. In this case you filter out the successful trips, build a data structure that is basically a key-value pair of the driver key and the value one, do a group-by over all the events for that driver over the day, and you're left with the trip count for that driver. So this is a simple example of a feature: it takes a collection of events, does a transformation, and produces another collection, and these collections can be transformed over and over (there's a rough sketch of it below).

At a high level, this is what the creation process looked like for us. But we separated creation from our core platform, because we don't want to limit ourselves to only one way of creating features. Sometimes people want to create features offline on a CSV and just import them into our system, so we also added ingestion capabilities, and we have our own internal lambdas that we're building as well. We support multiple ways in which people can create features, but this is the primary way. So that's the first building block, creation. The second one I'd like to talk about is storage.
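Before we get to storage, here's a minimal, hedged sketch of that trip-count transformation in the Beam Python SDK. It's a simplified batch version of what I described: the real pipelines were mostly Java, read from Kafka or BigQuery rather than an in-memory list, and apply daily windowing, which is omitted here. The event field names are assumptions, not the actual schema.

```python
import apache_beam as beam

# Stand-in for trip events arriving from Kafka/BigQuery; field names are illustrative.
trip_events = [
    {"driver_id": "d1", "status": "COMPLETED"},
    {"driver_id": "d1", "status": "CANCELLED"},
    {"driver_id": "d2", "status": "COMPLETED"},
    {"driver_id": "d1", "status": "COMPLETED"},
]

with beam.Pipeline() as p:  # on Dataflow you'd pass DataflowRunner pipeline options here
    (
        p
        | "ReadTripEvents" >> beam.Create(trip_events)
        | "KeepSuccessful" >> beam.Filter(lambda e: e["status"] == "COMPLETED")
        | "KeyByDriver" >> beam.Map(lambda e: (e["driver_id"], 1))
        | "CountPerDriver" >> beam.CombinePerKey(sum)  # daily windowing omitted for brevity
        | "WriteFeature" >> beam.Map(print)  # stand-in for a Bigtable/BigQuery sink
    )
```

The same pipeline shape works over an unbounded Kafka source once you add windowing, which is exactly why defining the feature once covers both training and serving.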
When we got to storage, we realized something else. We were going to have all these components in our system, and they need to speak one language among themselves. If you're talking about a feature, what is a feature? Obviously we know it's an input to a model, but the system needs a way to understand what the bytes being streamed into it are, and how to validate whether they're correct. So we asked how to define what these features are, and we broke it down into at least three specifications. Our system has these YAML specifications — I can't show the exact YAMLs now, but in a month or two we're planning to open-source them. Basically you have base entities — entities as in drivers, merchants, customers, areas, and time, for example — and composite ones. So you can have, say, driver-area: the entity, and the features on that entity, for a driver in a specific area. Those are specifications that we create; once something is defined as a YAML spec, it's in our system.

Then we create feature specifications, and this is the interesting part. That's where you define what data structure the feature has. What is the distribution of the data in that feature? What granularity does the feature have? Where is the source code located? Who is the owner of the feature? These YAMLs are very important for us because they give you a source of truth — a master manifest of what this feature is. One thing that's very important to know is that it's very difficult to trace the creation of a feature all the way back from the source data. Your system needs to be built in a defensive way. Creation can happen from any source, so we can't necessarily trace a feature back to its origin, but we can at least say: your distribution of data is correct, your data structure is correct, you have ownership, et cetera.

We also use protocol buffers, protobufs, throughout our system, for two reasons. One, it's a lot less data usage. Two, we have many different languages in the system — we use Go, Java, and Python depending on which layer you're talking about — so this allows interoperability between the components. We have a command-line tool, a UI, ingestion, an access layer, and all these layers are not necessarily built in the same programming language. We also store the feature data inside our stores encoded as protobufs.

From a storage standpoint, what do we need from our training store? The first thing is that it has to be easy to use. The reason is that a feature training store will not necessarily have all the data a data scientist needs. They will maybe have labeled data that is external to the store, so they'll need access to the store — through a UI we've built, the database's own UI, or a command-line tool — and they'll need to join the feature data onto their labeled data, unless you have a way of importing the labeled data into the store, which in our case we don't. So we knew it needed to be easy to use. Secondly, it needed to scale to a very large amount of feature data, something I already touched on earlier; scalability is a big concern for us. And then we needed certain types of queries and transformations: we needed to store historic feature data for months, and we needed to join that feature data — sometimes with fuzzy joins — onto labeled data, as in the sketch below.
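To make the fuzzy-join idea concrete, here's a hedged sketch of the semantics using pandas' merge_asof. In production this kind of join would run in BigQuery over months of data; pandas is just the smallest way to illustrate it, and the column names are assumptions.

```python
import pandas as pd

# Labeled outcomes (e.g., whether the customer completed the order).
labels = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "event_time": pd.to_datetime(["2019-01-01 12:00:30", "2019-01-01 12:05:10"]),
    "label": [1, 0],
}).sort_values("event_time")

# Feature values with their own timestamps, which rarely align exactly.
features = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "event_time": pd.to_datetime(["2019-01-01 12:00:00", "2019-01-01 12:04:00"]),
    "cancel_rate_7d": [0.12, 0.40],
}).sort_values("event_time")

# For each label, take the most recent feature value at or before the label's
# timestamp, but only if it's at most one minute old — the "good enough" window.
training_set = pd.merge_asof(
    labels, features,
    on="event_time", by="customer_id",
    direction="backward", tolerance=pd.Timedelta("1min"),
)
print(training_set)  # c2's feature is 70s stale, so it comes out as NaN here
```

The tolerance is exactly the data scientist's judgment call in the example that follows: how stale a feature value is still acceptable for training.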
Imagine a customer buying a dish. He buys it at one moment in time, but you don't have a feature value at that exact same time. Typically a data scientist will say: well, if the feature data is from one minute earlier, it's maybe good enough to train the model. So sometimes you want fuzzy joins like that, and the training store needs to support them.

The store we actually went for — no big surprise — was BigQuery. BigQuery was where we were already storing our labeled and raw data, in our data warehouse. Now, with our feature data alongside our raw data, it's a lot easier for data scientists to generate training sets. Some of the benefits of BigQuery: there's no infrastructure to manage — it's completely hosted and managed. Data scientists understand it because it's SQL; it's easy to use. It scales, and it's integrated with GCP — with TensorFlow and Cloud ML Engine and the tools they want to use. What's cool about BigQuery, and what makes it so useful in this case, is that features are typically columns in a database, and BigQuery is a columnar store. Queries are very efficient when you're selecting specific columns, and that's one of the key things we found makes it very compelling. We looked at Presto and Hive and some of the other solutions, and a lot of them were compelling — they actually had capabilities that BigQuery did not have. Two reasons we didn't go for them: one, in our tests for our use cases they were not as fast as BigQuery, and two, they required infrastructure management. If we had the liberty of using other cloud providers we might include different training stores, but at the moment we're mostly on GCP, and this was the best tool for the job.

For our serving store we had more requirements. Low-latency reads — this is very, very important. Take the homepage of the application: if you're showing food recommendations, you want to send a response to the user immediately, otherwise it ruins the experience. You might only have about 40 or 50 milliseconds to produce a response, because it's not just your ML model generating the response; all kinds of other systems are building that UI. If you're on that critical path, the more time the feature lookup takes, the less time your model has to produce a prediction. So below 10 milliseconds was our target. Very high throughput: depending on the feature, this could go up to hundreds of thousands of lookups per second at peak times for a single feature. And scalable storage: for the persistence layer we needed to store terabytes of data, ideally managed as well.

The two choices we made were Redis and Bigtable. I'll touch on Bigtable first. Bigtable is interesting because of its scalability and performance. It allows you to read and write keys at very low latency — 10 milliseconds guaranteed, as long as you spread your key space out correctly. Each node can handle 10,000 combined reads and writes per second, which is very high, and it's very hard to find an open-source tool or system that does this. We looked at quite a few — Cassandra, HBase, Elasticsearch, and others — and their performance didn't compare. And finally it's scalable: you just add nodes, through the UI or Terraform, and it has persistence. So for us this was a no-brainer as the primary store for our real-time serving.
But in some cases — say you've got driver location data, speed, velocity, altitude, things like that — the data changes at such a high rate, and you only care about the latest data point. You don't want to have to create 20 Bigtable nodes that cost you something like 30,000 USD a month when you can have one or two Redis instances. So we also added Redis as an alternative for that kind of data. We use Cloud Memorystore, the hosted Google Cloud version, but you can use normal Redis. Redis is our non-persistent, higher-throughput store, and it essentially gives us lower-latency access to the data.

The third component is access. Before we talk about what we did with access, let's talk about the problem. What happens when a sports stadium or a music concert empties out is that many of our customers open their apps at the same time — without even making a booking — and the same drivers are sitting right there. Each of those sessions makes requests to our systems, and the system does feature lookups for all those drivers. But there's a big overlap: all the customers are in the same area, all the drivers are in the same area, so the same lookups happen over and over. We knew the access layer needed to handle this kind of load balancing and caching. Just a quick example: at our load, we ran the math, and at peak times we could have up to 150,000 key lookups per second for a single feature if we didn't have an access layer. That's not impossible to handle, but from an engineering standpoint it's a bad idea not to have an access layer that fixes that problem for you.

So we built a serving API at the front. It does intelligent load balancing, and we have our own query structure. When an external system calls our feature serving API, it sends a query; the serving API deconstructs that query, does things like user authentication and rate limiting, and then first looks up in a cache whether the answer already exists. And to come back to the feature specification from earlier: the specification has a time-to-live, so it tells the serving API, "hey, this feature can only be cached for 10 seconds — don't cache it for five minutes." That's how these systems interact with each other. The feature serving API handles load balancing; it knows where all the data is stored, whether you're sharding, what the replication factors are, all these things. And it only interacts with our serving stores, Redis and Bigtable. So that's what we did for the serving API; a rough sketch of the caching behavior follows below.
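Here's a hedged, in-memory sketch of how the spec's TTL can drive caching in the serving path. The real serving API is a separate service in front of Redis and Bigtable with load balancing, authentication, and rate limiting; the names and structure below are assumptions for illustration only.

```python
import time

# In the real system the TTL would come from the feature's YAML specification.
FEATURE_TTL_SECONDS = {"driver_trip_count": 10}

_cache = {}  # (feature_name, entity_id) -> (value, expiry_deadline)

def lookup_feature(feature_name, entity_id):
    key = (feature_name, entity_id)
    hit = _cache.get(key)
    if hit is not None and hit[1] > time.monotonic():
        return hit[0]  # fresh cached answer; the store is never touched
    value = _read_from_store(feature_name, entity_id)
    ttl = FEATURE_TTL_SECONDS.get(feature_name, 0)
    _cache[key] = (value, time.monotonic() + ttl)
    return value

def _read_from_store(feature_name, entity_id):
    # Placeholder for the load-balanced Redis/Bigtable read.
    return 42
```

With a 10-second TTL, the stadium-emptying scenario collapses thousands of identical lookups for the same drivers into roughly one store read per key per 10 seconds.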
And then on to discovery. The interesting thing is that once you have these three systems — creation, storage, and access — you actually have a lot of data about the features in the system: the source code of those features, who the owners are, statistics about each feature, like how many events are being ingested for it, how many files have been ingested, how many rows and CSV records, how fresh or stale the features are. If a model is being trained, you know the performance of that model even before it goes into production, and you know which features were the best in training it, so you can rank them in a relative way. You can store all this metadata about features — you could just log it to the console, but we dump it in a database.

So what we did is we took all this data, built a database around it, and exposed it. Unfortunately it doesn't render that well here, but essentially, with a simple BI tool, data scientists can query all the existing features that have been created. They can drill into a feature and see information like its type, changes to the feature, versioning, what validation exists, what errors there are, how much data exists and at which time ranges, and which gaps exist. Then, interestingly, you can also ask: on average, how good is this feature when used to train a model? How many hits is this feature getting on the serving API? This kind of information is very valuable to data scientists, because they can rank by the best, select, say, 50 features, train a model, and that's a good place to start, instead of having to work from scratch on every project. Building a richer interface to this is one of our main focuses: once you have a unified platform, this becomes one of the big selling factors. It allows for a lot of discovery insights, and it lets senior data scientists disseminate their insights to the juniors. We don't surface the model in our UI — they could find it if they looked at the logs; we do have an event log they can drill into and trace back. So this is one of the things we're still developing and hoping to enhance.

So, impact. When we started, we said that data scientists were spending a lot of time on engineering-related tasks: building pipelines and setting up infra, with only a little time spent on making features. We wanted to compress that time a lot. The two problems we wanted to solve were model service generation — which we didn't cover in this talk — and the pipelines and ETLs for feature creation. Ultimately, what we built is a system that takes raw data in our stores and standardizes the way features are created, with Dataflow running the feature transformations. This handles bounded and unbounded data, so you only have to define a feature once. We have serving stores with Redis and Bigtable, and a training store with BigQuery. BigQuery has months of our historical feature data; Redis has the latest feature data, which is not persistent; and in Bigtable we also have time-series data available in serving. We put a serving API in front, so feature lookups from our production systems are handled. As a data scientist, all you need to do now is: one, go to the feature explorer and find your features; export them through BigQuery into a file; train a model; and then you can go to production. You don't have to build any new systems in production. You don't actually need to build any features at the start, but if you do, you can go to Dataflow and write code there to build them.

Another thing we're looking at doing now is simplifying the process of feature creation — templatizing it. One idea we had: if you create a feature, the system can actually generate more features from it. Say you create a booking count or a trip count for a driver; the system can automatically compute things like averages, minimums, maximums, and aggregate over time periods, so the person doesn't have to define all of that manually. A sketch of that idea follows below.
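As a hedged sketch of that derived-feature idea: given one base feature — per-driver daily trip counts — generate windowed aggregates automatically. Pandas is used here purely for illustration; in the platform these would presumably be generated Dataflow transformations, and all names are assumptions.

```python
import pandas as pd

# Base feature: one trip-count value per driver per day.
base = pd.DataFrame({
    "driver_id": ["d1"] * 6,
    "day": pd.date_range("2019-01-01", periods=6, freq="D"),
    "trip_count": [12, 9, 15, 7, 11, 14],
})

# Automatically derive rolling 3-day aggregates; each aggregate becomes a
# new feature column the data scientist never has to define by hand.
derived = (
    base.set_index("day")
        .groupby("driver_id")["trip_count"]
        .rolling("3D")
        .agg(["mean", "min", "max", "sum"])
        .add_prefix("trip_count_3d_")
)
print(derived)
```

The window sizes and aggregate functions are exactly the kind of thing a template could enumerate, so one base feature fans out into a whole family of derived ones.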
[Audience question about feature selection.] I'm not sure what you mean by feature selection. We have the creation part, where you transform really raw data: you make more refined data, you conform to the specification of the feature and its protobuf data structure, and that gets stored. Then the selection part is basically the data scientist going into the feature explorer, finding the names of the features, going to BigQuery where the columns for those features exist, joining them onto his labeled data, exporting to CSV, and then training on Cloud ML Engine, on Spark, or wherever he wants to train.

Eventually, yes — in the ideal world, everything is in this platform. But at the moment we also support legacy systems, and that's why it was very important for us to support BigQuery. Data scientists can have their own Airflow ETLs, for example, or whatever they're using for automation or orchestration, and build their own features; as long as it becomes a table in BigQuery, or a CSV that can be federated as a table, they can use that as well. But that's not a long-term solution, because you lose the discoverability and the definition of what that feature is.

Yeah, so at the moment it's an exporting process. Once you've exported, you lose all the metadata about those features and where they've come from. So at the moment there's no way to differentiate between side-input features that you've created yourself and features from the feature store, except if you're exporting everything from the BigQuery dataset where our features are stored. I'm not sure I understand your question — maybe we should pick this one up afterwards.

So finally: we did compress the time to market a lot. After deploying the feature platform, data scientists were much more focused on modeling. They still spend time creating features, but these features end up impacting the business a lot more, and data exploration was another thing they could focus on much more. To wrap up the impact on Gojek: faster time to market, obviously, since we spend less of that time per project. Improved customer experience, because our models are more accurate and there's more feature data available. Given our international expansion, this project helped a lot with data scientists supporting more customers — you can't just double the data science team every time you expand into a new market. And because we're using BigQuery, Bigtable, and Dataflow, we have less infrastructure to manage. As an engineer, that's very compelling to me, because I don't want more Terraform files and more infrastructure that can go down in the middle of the night.

Right, thank you. We have time, so you can shoot questions.

[Audience:] Hi, the feature platform sounds like a great thing, but how do you handle the entity-feature relationship? Say I'm doing trip analysis and I want to select a feature that's at the driver level. It may be an inaccurate feature for my purpose, or I might need to do some aggregation or modification to it. How is that controlled? Is it handled by the feature platform, or does the data scientist have to do it?

Do you mean entity creation? [Say there's a feature at the driver level.] Okay, so you have an entity for the driver, obviously, and then a feature as well.
You'd need to create two specifications. It depends on the data type: if it's something simple like an integer, an int64, or a float, then you just create these two YAMLs. It's a merge request in Git — you submit it and an engineer reviews it. Does that answer your question?

[Is it handled by the feature platform?] By the platform. There's a check: the merge request in CI needs to make sure the specs are integrated into Git, to keep a source of truth. Then it gets deployed into the feature platform. Once that's done, it actually builds tables: in Bigtable it'll add columns, in BigQuery it'll create the datasets and tables. If you previously had that feature, deleted it, and created it again, it'll tell you the column already existed and there's data from maybe a month ago, so there's a conflicting name. There are all these little things that also need to be taken into consideration.

[Audience:] Do you also record the access time? One feature might be easy to fetch and retrieve in a real-time solution, but some features might take more time because there's computation behind them. — You mean the creation or the access? — The access part. When I'm trying to access features for a particular model, does the feature platform also tell me which features are likely to be accessed very fast and which would be slow?

We don't publish how fast a feature is available. We do publish information about granularity and the freshness of a feature. So we have a limit, let's say, on the age of a feature. Say an event is created; there's an event time attached to it. If it takes too long to propagate through the system and get into our store, we can calculate that you're not going to be happy with that feature based on your requirements. That's available, but there's no UI for it. Your other question was on the lookups — is that information available as part of the lookup? We do track feature access and statistics about performance; we observe that, and it's very important for debugging the serving API. That's the first thing we built: when a query comes in and is deconstructed and lookups happen against the stores — where are the bottlenecks, where are the hotspots? Because it often happens, especially with Bigtable, that certain hotspots are very slow while others are perfect. And fixing that problem is actually quite a lot of work.

[Audience:] My last question is around data privacy. You're dealing with lots of customer and driver data, which is very personal, and you're hosting it on a proprietary platform, say on Google. Do you face any challenges in terms of legal agreements, or the kind of enterprise handshake you need to ensure the data is protected and used in the right manner?

Yeah, there's an internal team that handles this for us. They follow all the regulations for the different regions we're in, and they layer access within the systems, so we have to stay within those boundaries — we can't have cross-layer joins, for example. I hope that's enough detail.

[Audience question about the specification.] OK, so the specification is more like a high-level manifest. I can show you an example, maybe, off my laptop — technically I'm probably not supposed to, and we're going to open-source it in the future, so this won't be a final version.
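For a sense of shape only, here's an invented illustration of what such a manifest might look like, loaded and checked from Python. Every field name below is an assumption; the talk didn't show the real format, which was due to be open-sourced later.

```python
import yaml  # PyYAML

SPEC = """
feature:
  name: driver_trip_count
  entity: driver
  owner: pricing-team@example.com
  valueType: INT64
  granularity: day
  options:
    servingTtlSeconds: 10      # how long the serving API may cache it
  validation:
    min: 0
    max: 1000
  source:
    repo: https://git.example.com/features/driver-trip-count
"""

spec = yaml.safe_load(SPEC)["feature"]
# Components like ingestion and serving would build their type checks from this.
assert spec["valueType"] == "INT64"
print(spec["name"], "TTL:", spec["options"]["servingTtlSeconds"], "seconds")
```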
It's more like a YAML that's basically a manifest: it says who the owner is, what data type it is, what the class path of the protobuf that defines the data structure is, what validation hooks need to exist, and at which points in the system to check the data. One thing to be clear about: the feature specification doesn't tell the system how to create the feature. It only tells it how to validate the data that comes through. All the different components of the system, like serving, take that feature spec in and then adapt themselves — they build types and checks based on the specification. And when a lookup is done, let's say from Redis, it'll validate if the specification says to validate lookups. There's a time-to-live, there's caching, there's write and read behavior for a feature, and all of this is defined on a per-feature basis. Ingestion uses those specs; lookup from serving uses those specs; discovery uses those specs; even the UI uses the specifications to generate the interface. That's the one we're building now.

[Audience question about validation.] At the moment that's quite rudimentary, but it's basically validation on the data. We allow platform engineers to define functions, and within the YAML spec that function can be called based on the content of the data. It's basically a predicate that says: this is good or bad, it passes validation or not. Data scientists or engineers can define those functions. Typically, at the moment, it's just the minimum and maximum of a value — an integer cannot be higher than 100 or lower than zero for a specific feature. And then the serving and ingestion pipelines make sure those validation hooks are called whenever processing happens. [Is this on a distribution of values, or on a single value?] Yes, it's on values.

[Audience:] You spoke about the area or location-specific features. A lot of times, with the granularity at which you define them — say you're defining an origin-destination-level feature — the number of combinations can just explode. You might have to put in some logic to prune it in some manner. Is that part owned by the feature platform or the data scientist? And how is it managed?

Yeah, that's a good example: one of the most painful features is a composite entity between an origin and a destination. If you have locations in space, there are just too many location permutations — the key space explodes. So you need to bucket them, and that's what we did: we bucket by time and we bucket by area. In Jakarta, for example, we have these football-field-sized areas, and if you have composite entities of origin-destination, you group the GPS locations into those areas. Then we bucket the features into, say, 20- or 30-minute windows, and then the key space fits into Bigtable. It doesn't fit into Redis — it's still terabytes and terabytes of data — but that's why Bigtable was very compelling for us: the amount of data we could store. But you're right, that is a big pain, and it's everybody's job to work it out, because creators can create features that fill up the system, and our system needs to defend itself against that, since creation is relatively open at the moment. There's a rough sketch of that bucketing below.
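Here's a hedged sketch of what that bucketing might look like when building a bounded row key for an origin-destination composite entity. The grid size and time window are illustrative guesses based on the talk ("football-field-sized areas", 20-30 minute buckets); the key layout is an assumption, not the actual scheme.

```python
import math
from datetime import datetime, timezone

CELL_DEG = 0.001        # ~110 m of latitude per cell, roughly football-field scale
TIME_BUCKET_MIN = 30    # bucket features into 30-minute windows

def area_id(lat, lng):
    """Snap a raw GPS point to a fixed grid cell, collapsing nearby points."""
    return f"{math.floor(lat / CELL_DEG)}:{math.floor(lng / CELL_DEG)}"

def od_row_key(origin, dest, ts):
    """Bigtable-style row key: origin cell | destination cell | time bucket."""
    bucket = int(ts.timestamp() // (TIME_BUCKET_MIN * 60))
    return f"{area_id(*origin)}|{area_id(*dest)}|{bucket}"

key = od_row_key(
    (-6.20876, 106.84559),   # an origin somewhere in Jakarta
    (-6.17511, 106.86503),   # a destination a few km away
    datetime(2019, 1, 1, 12, 7, tzinfo=timezone.utc),
)
print(key)  # all trips between these two cells in this half hour share one key
```

Bucketing both ends of the trip plus time turns an unbounded space of GPS permutations into a finite, predictable key space.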
[Audience:] You mentioned something about Dataflow helping you manage the differences between the experimentation state and the production state. Can you elaborate on that?

Yes — it helps us with consistency. If you have a model serving application with a transformation inside it, that transformation is disconnected from your batch pipeline, which has its own transformation. Dataflow helped us by moving the transformation upstream: you only transform once, and you store the refined feature values in all those stores. I don't know if that answers your question. [It does, thank you.] All right, thank you. Thank you.