Hi everyone. My name is João Santiago, or just Santiago for short. I hope everyone is having a blast at the conference; there are some really cool talks in the lineup. I'm personally super excited to be here with you today to talk about just-in-time features, also called on-demand features, for machine learning models, and to explore a bit why I think Clojure and just-in-time features are a match made in heaven that we are not exploring enough.

These days I work as a data scientist, leading the anti-fraud unit at a company called Billie. We are a fintech based in Berlin, mostly focused on buy-now-pay-later solutions for B2B online shops. We had a really nice Series C funding round in the past months, and we are hiring, so if you're looking for a new Clojure challenge, either look at the link or get in touch with me.

At Billie, we started to feel that our deployments for machine learning were not as smooth as they could be, and a big chunk of that issue was related to feature engineering. My whole point in this talk is that feature engineering and Clojure are a match made in heaven. But what does this mean? We'll start by defining what feature engineering means in the context of this talk, because different people may have different notions of what the term entails. We'll then look at a high-level architecture of how this is handled in the real world, some examples of possible solutions, and finally have a little demo of Bulgogi, a prototype I've been working on specifically to address the concerns we'll be exploring during the talk.

Let's start with feature engineering. What does that mean? Because I'm in the fraud department, let's imagine I need to build an anti-fraud model. Say these are transactions coming in from some online shop. Usually you'll have data that looks like this: an email from the buyer, an amount, a timestamp, and a couple of other things. Our job is to figure out whether this is fraud or not. We cannot simply send this into a machine learning model, because most models don't handle strings or complex data like this well; they much prefer to work with plain numbers. We need to go from this to a numeric representation of the data. That's what is meant by feature engineering: take raw data, pass it through some process that engineers these features, and then feed the result to a machine learning model.

In this case, let's imagine we know from experience or from research that the number of digits in an email and how long the email name is are predictive of fraudulent behavior. This is a Clojure conference, so we could write these as two simple Clojure functions. We can apply them in order or in parallel, it doesn't really matter; the point is that we need to map these two functions over the data to get these two features. Out of this feature engineering process we could get something like: the number of digits, the number of characters, and the amount of the order itself. The important distinction to make, in the context of today's talk, is what a just-in-time feature is. Here, only the number of digits and the number of characters are just-in-time features. Why? Because they are calculated right before we send the data to the model. We didn't do anything to the amount; it's already a number, so we don't need to change its nature, its type, before sending it into the prediction engine.
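To make that concrete, here is a minimal Clojure sketch of those two feature functions; the function names, map keys, and example values are my own illustration, not necessarily what Bulgogi itself uses.

```clojure
(require '[clojure.string :as str])

;; Two feature functions of the kind described above.
(defn email-name
  "The part of the email address before the @."
  [{:keys [email]}]
  (first (str/split email #"@")))

(defn num-digits
  "How many digit characters appear in the email name."
  [data]
  (count (re-seq #"\d" (email-name data))))

(defn num-characters
  "How long the email name is."
  [data]
  (count (email-name data)))

(def transaction {:email "ninja42@shop.example" :amount 99.90})

;; Map the feature functions over the raw data to get its numeric representation.
{:num-digits     (num-digits transaction)     ;; => 2
 :num-characters (num-characters transaction) ;; => 7
 :amount         (:amount transaction)}       ;; unchanged, already a number
```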
We send this data into the model, and out comes an actual prediction, which could look like this: it's fraud, plus some probability or score describing how we got to that classification. This is the general concept, and it's the frame of mind I'll have whenever we talk about feature engineering throughout the talk.

How does this look in a real-world context? Usually you have a development environment and an actual production environment where the model runs. In the development environment you might have a database, a CSV file, some sort of data you want to use. You may need to do a couple of other steps between the raw data and the feature engineering, hence the dot-dot-dot in the middle: joining other data to the data you're interested in, or massaging it into a certain structure or format. But at the end of the day you always need to go through the same process I showed before: turn any data that is not numeric into something the model can understand. And it's not only about being numeric; you may also need to turn numerical values into something else, like the average, or the difference between the current order and the previous order. It's still a numerical calculation, but it's something you can only do just in time, right before you send it into the model, because it depends on data you only have at that moment.

As an example, let's say our little ninja here is trying to get some stuff online for free. They produce this data in the front end: they write an email, they select some item to buy. The important thing to remember is that, ideally, in a data science world this data produced by the user is saved to the database exactly as it was produced. That means whatever you do in training, in the development row above, going from the raw data to the numerical representation we spoke about, you will need to do in production too. So you'll have a system doing exactly the same transformations: calculating the number of digits, calculating the number of characters, and then feeding the result to the model.

This is the crux of the problem I want you to keep in mind today: these two things are coupled together. They need to be deployed together. Any change you make in your development environment to how you calculate features, or even to which features you use, has to be replicated in production every single time.

So how do people usually deal with this? In the data science world, R and Python are the most commonly used languages. I personally use R, but there are also colleagues in my company who use Python. How this usually goes is that the two things, feature engineering and the actual model, are coupled together in an object most commonly called a pipeline. In R, the tidymodels framework calls these workflows, but the idea is essentially the same: it receives some new data as input, runs it through all the feature engineering steps, and finally applies the model to the resulting data.
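To make the coupling concrete, here is a Clojure-flavoured sketch of that pipeline idea (in practice it would be an R workflow or a Python pipeline object). It reuses the feature functions from the earlier sketch, and the model is just a stand-in.

```clojure
;; Feature engineering and the model bundled into one object, so they always
;; ship and change together. Names and values are illustrative only.
(def pipeline
  {:preprocess (fn [data]
                 {:num-digits     (num-digits data)
                  :num-characters (num-characters data)
                  :amount         (:amount data)})
   :model      (fn [features]
                 ;; stand-in for a real trained model
                 {:fraud? false :score 0.12})})

(defn predict
  "Run raw data through the bundled feature engineering, then the model."
  [{:keys [preprocess model]} raw-data]
  (-> raw-data preprocess model))

(predict pipeline {:email "ninja42@shop.example" :amount 99.90})
;; => {:fraud? false, :score 0.12}
```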
You can of course build your own solution for this, which is what we've done at Billie. We don't use a pipeline object per se; we keep things separate to make them slightly easier to manage and reuse. Now, the thing with Python and R is that they're both not the fastest languages around, they are single-threaded, anything concurrent is awkward at best, and if you write code in R it's not easy to reuse it in Python; you always need to translate it somehow. So it's not great when you work in a polyglot organization, as in my case.

Other companies or teams use something called Spark. There are of course other options; Spark is just one of the best known. It's originally a successor to Hadoop MapReduce, so it's meant mostly for large batch jobs, although it can also do streaming and real-time computations. It's a really great tool if you have massive amounts of data to crunch at once, but, and this is the case I'm making here, in my opinion it's overkill for a simple transactional use case. Remember, our ninja placing an order is not producing a continuous stream of data; they're looking at a spinner, waiting to know whether the order was approved or not. So you pay the cost of new concepts for your team to learn and added complexity: Spark means a cluster, which you now have to deal with and which is completely orthogonal to your problem. Some companies rewrite what the data scientists did in Scala just so they can use Spark. Yes, Spark has APIs for Java, Python, and R, but again you have the cost of translation between Scala, Spark's implementation language, and the languages the data scientists actually use. In the end it's added cost, not just in terms of having a team to manage this; on top of it some companies add Kafka to make everything work on streams, and now, in my opinion, you're really going with a full bazooka at a problem that seemed very simple some time ago.

So my question is: why not keep this simple? Especially if you already have good engineers in your company or team who use Clojure, why not just use Clojure? By Clojure I don't mean re-implementing the idea of a pipeline object in Clojure; I mean simply using the structures Clojure already gives you.

How do we actually keep it simple? I think the main goals here are threefold. We should use what we know; we don't want some big piece of technology we need to learn and manage. The number of moving parts should be as low as possible, to keep the whole system simple and understandable for everyone. And the main goal is to decouple feature engineering from the actual models, so that deployments are independent and the feature engineering becomes more reusable, not only across deployments but also across teams. If I write, for example, the number-of-digits-in-an-email feature and another team also needs it, they can just use this new system to get that feature, without worrying about implementation details or rewriting it in whatever language they use.

So how can we do this? First let's look at the architecture; this is my goal and what I came up with.
We're really centralizing the feature engineering process. We are very much into microservice architecture, so this could be just another microservice in your global ecosystem. This is nice because suddenly all the features in the company, across all your teams, can live in a single spot. If you start a new project and you're wondering whether you need to create the number-of-digits-in-an-email feature yet again, you just go to this centralized service and check whether the feature is already there. If it is, you can simply ask for it. It should be seen as a buffet of features: all of them are available, you go in and pick what you need. In practice that means providing some data and a list of the feature names you want, and then expecting the system to map those features over the data. Simple as that.

Another cool thing we can do here is save the actual data asynchronously, because there are no latency constraints on that part. We want to respond to the model and to the user as fast as possible, because remember, the user is most likely staring at a spinner; but once things are calculated we have all the time in the world to store the results in a database. The nice consequence is that if all your features are already calculated, you don't need to transform the data again whenever you want to train a model in development: you can go to this database and say, give me features A, B, C, D in this time span, and feed that directly into your model for training. This kind of specialized database is usually called a feature store, and I'd really invite everyone to read more about it; it's a very cool concept for productionizing and scaling up machine learning in organizations, and this is just one piece of it.

I called my prototype Bulgogi. It could be called something else and it could be done in different ways; other companies came up with different approaches. I simply feel the Clojure approach is not only simple, it's also elegant, and it makes it very easy to scale out and extend the system.
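Before the demo, here is roughly the shape of the exchange I have in mind for such a service, just to make it concrete; the values are made up, and the keys mirror what you'll see in the demo.

```clojure
;; Request: the raw data plus the names of the features you want computed.
{:input-data {:email "ninja42@shop.example" :amount 99.90}
 :features   ["num-digits" "num-characters"]}

;; Response: the requested features, computed and keyed by name.
{:num-digits 2 :num-characters 7}
```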
So let's actually look at some code. I have my editor here, the REPL is already loaded, and I'm just going to make sure everything is in my buffer, so all my functions are there. Right now Bulgogi is a single Babashka script.

This is how simply you can do this in Clojure, and it's one of the main arguments for using Clojure for this use case: the amount of boilerplate is pretty much next to none. I imagine this will grow as we add more things, maybe some metadata, maybe some spec to validate features, some test cases, but in general you just write Clojure; you're not doing anything alien.

Let's take a look here. As we saw in the first slides of this talk, features are just functions in Clojure. They are pure functions most of the time, and I would recommend keeping them pure so that they stay transparent and you can simply say: this function is the number of digits in the email. There's nothing very special about it; it just takes a map. All features in Bulgogi take a map as input, so each one has the task of choosing which data it takes from that map. That can be a point of criticism: you could say there's too much coupling between the name of the thing you're interested in and the function that calculates it. If you have that criticism, I would love to hear your opinions on how to do this better.

Another cool thing, because everything is just functions in a namespace, is that you get dependency management for free. We know that number-of-digits-in-email-name depends on getting the email name itself, and that is just yet another function that is called when you call number-of-digits; you don't need to do anything special. I can envision that for some performance-sensitive use cases we may want a pre-calculated DAG, a graph, so that we don't go through the same calculation twice: here, if you calculate number-of-digits and number-of-characters, the email name is computed twice. That's an optimization I'll probably look into adding to Bulgogi at some point, once it's more mature. So we can keep adding features as functions, always the same idea: they take a map, they output a value.

Then we get to the real meat of the whole system, which is the preprocess function. It takes a request map that looks like this: the request should always have an input-data key and a features key; input-data should be a map, and features should be a vector (an array in JSON). You can probably see where this is going: we have some data we're interested in transforming, and we're telling Bulgogi, with strings, which features, which for Clojure are just functions, we want mapped over this data.

The preprocess function itself is only six lines of code, and it's pretty much the whole business logic of Bulgogi. We extract the input data and the features; then, using the strings, we look into the namespace for functions with those names and make a list of them; we keywordize all the feature names so we end up with a nice map from the feature name, as a keyword, to the computed value; and then we simply pmap the functions over the data that was sent to us. We're parallelizing everything, which in my experience, from my specific use case, which might not be true for yours, greatly increases performance; it also makes sense because the features don't usually depend on each other. Finally we zipmap the whole package together.
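Here is a minimal sketch of that mechanism, not the actual script; in the real thing the lookup would target the namespace where the feature functions live, whereas this sketch simply resolves them in the current namespace where the earlier example functions were defined.

```clojure
(defn preprocess
  "Resolve the requested feature names to functions and pmap them over the
  input data, returning a map of keywordized feature name -> computed value."
  [{:keys [input-data features]}]
  (let [fns (map #(resolve (symbol %)) features)   ;; string -> function
        ks  (map keyword features)]                ;; string -> keyword
    (zipmap ks (pmap #(% input-data) fns))))       ;; apply in parallel

(preprocess {:input-data {:email "ninja42@shop.example" :amount 99.90}
             :features   ["num-digits" "num-characters"]})
;; => {:num-digits 2, :num-characters 7}
```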
Let me just evaluate this and simulate a request. You can imagine Bulgogi as a microservice with a simple REST interface: it gets an HTTP request at an endpoint, and we want to preprocess the data, respond back as fast as possible, and asynchronously write the results to a database; in this case I'm just writing to a file, to keep things simple.

So, very basic: you can see we asked for this data, these were the features we asked for, and we get back our nicely preprocessed data. If we want to add more data to the payload, we can, for instance, check whether there are any risky items in this order. Same thing again: now we get all of the features with their respective results. Here contains-risky-item is a Boolean, but expressed as an integer indicator, so one means yes; it seems the brands in our little scenario here are very risky.

And this is pretty much it. Obviously there's a lot of work to be done: there's no error checking, no validation of any kind. But I feel the main idea is already solid enough to build upon. We simply use namespaces to hold our features and make sure we can find them; we use the functions we already have in clojure.core to look those functions up, apply them to the data we got, and respond back; and then a future, or maybe a channel if that makes more sense as this develops, asynchronously saves everything into a database.
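A rough sketch of that request handling, under the same assumptions as the earlier snippets: respond with the preprocessed features right away, and let a future persist them asynchronously. The handler shape and the file name are my own illustration; a plain EDN file stands in for the feature store.

```clojure
(defn handle-request [request]
  (let [features (preprocess request)]
    ;; fire-and-forget persistence; the caller never waits on this
    (future
      (spit "feature-store.edn"
            (prn-str {:input-data (:input-data request) :features features})
            :append true))
    ;; respond to the caller as fast as possible
    features))

(handle-request
 {:input-data {:email "ninja42@shop.example" :amount 99.90}
  :features   ["num-digits" "num-characters"]})
;; => {:num-digits 2, :num-characters 7}
```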
So, to summarize, there are good things and there are bad things. The good thing is that it's just functions; nothing is more well known than that. The core logic, as you saw, is extremely simple: right now it takes six lines of Clojure to express what's supposed to happen when you get a request, so there's almost no boilerplate. This is very important not just for the developers or data engineers who build this, but also if you want to bring in data scientists who may not be completely familiar with Clojure; they just need to write a simple function, so they only have to learn a small set of how things work and then build from there. It's completely decoupled from the models: it's a separate system, the models don't need to know it exists in the sense that they are not deployed together, and we can deploy at separate times. And it makes features very simple to share between teams, because they now live in a single place; if I want to reuse something my colleagues wrote, I just call Bulgogi and say, hey, give me this feature, and that's it.

Now, obviously it's not all roses and unicorns; there are also some disadvantages to using something like this. The main one is that new features must be backfilled before training, meaning we need to send the data through Bulgogi and save it into the database before training a model, otherwise we have a chicken-and-egg problem. That means a somewhat slow feedback cycle: we need to wait for features to be available before we can actually train models. I still think this disadvantage is not bad enough that a system like Bulgogi doesn't make sense, because the benefits on the good side are just much greater in comparison. So it may not be ideal for all use cases, but I feel it's ideal for mine, and this is what I would like to know from the community: whether you see this as something you can use, or what changes would need to be made for it to fit a more general use case. And obviously I think it should scale, but maybe it doesn't; that needs a lot of real-world testing.

That's what I have for you today. Thank you very much for your attention. A very, very big thank you to Daniel Slutsky and the whole Scicloj community for helping organize this conference and for listening to me a couple of times talking about Bulgogi and some other topics, and especially to Dave Liepmann and Jack Rusher, with whom I exchanged a lot of ideas going over this first iteration of Bulgogi. Cool, see you all in the panel later.