Hello, and welcome back. I hope you had a nice lunch break. We're approaching the final straight here at the Big Things Conference; this afternoon is our final block. Now, to build a machine learning model, the quality and quantity of data are important. Of course they are. But our next speaker believes there is another, often overlooked type of data: metadata. To explain more, we have with us Jörg Schad, CTO of ArangoDB. Jörg, are you there?

Yes, I'm here. Thank you so much for the introduction. Great to see you, and thank you so much for organizing all of this. Yes, we'll dive into the many different ways in which graph data, or data in general, can be useful for your machine learning pipeline.

Whenever you're ready, Jörg, take it away.

I'm happy to take over, so let's get started. As mentioned, we will talk today about the different ways in which machine learning depends on different kinds of data, and in particular the different ways graph data is important for your machine learning pipeline: what ML metadata, or metadata in general, is and why it is so important; why you should care if you want to build a production-grade machine learning pipeline; and we'll briefly touch on different open source solutions to the problem as well. Interestingly, when I looked at the schedule, the other talk happening right now is about improving predictions by automatic drift detection. That touches on a closely related problem, namely what happens when your data changes, and we'll see how that relates to metadata as well.

About me: as already mentioned, I'm CTO over at ArangoDB. Basically my entire career has been about building large-scale systems, either database systems or large-scale machine learning platforms. That's why I'm very happy that in my current role I can combine the ML operations perspective with the database perspective and pursue both passions in one role. Let's see what this combination looks like.

I think many people I interact with still have the typical image of someone writing machine learning code, the data scientist: sitting in front of a laptop with TensorFlow installed, everything's great, and we can achieve really cool things. We are kind of the superhero. We have lots of data available nowadays, we have huge compute centers with plenty of compute resources, and we have really cool machine learning models and research enabling us to do impressive things. But the question is, how do we actually move from a prototype to something which involves our business, which moves us forward, which generates value? And that's a bit less exciting than we might imagine. A lot of people hold this image of the data scientist: they get some data from the business, they write really intelligent machine learning code, they train a model that performs great, they ship that model, and then they start over with the next problem. Unfortunately, reality looks a little different.

This is a chart from a Google paper, "Hidden Technical Debt in Machine Learning Systems," which many of you have probably seen. The authors tried to quantify where machine learning teams spend their time and resources, and only the little black box here in the middle is the actual machine learning code.
Basically every single surrounding box — data collection, data verification, resource management, et cetera — is far more cost-intensive in terms of time than the actual ML code.

So what does it typically look like? In reality, most machine learning pipelines I've seen across different companies follow roughly the same scheme as this one. On the very left we have some data, which can also be a data stream. Then our data engineers and data scientists go in and engineer features: they prepare the data, deal with missing values, and try to form meaningful features out of it. Next, we define our model and train it on that data. This is usually very compute-intensive, and we often run multiple iterations here, ending up with one or, often, multiple candidate models for serving. These models need to be managed and stored, and we also need a catalog to decide which of them we really want to deploy onto our live production system. And this is where model serving comes in — this is where we actually get value out of it. Imagine we have a web service distinguishing cats and dogs: this is when we deploy our machine learning model so a user can upload an image and we try to predict whether it shows a cat or a dog.

If you look at different open source systems — for example TensorFlow Extended (TFX), the ecosystem around the well-known TensorFlow framework — they follow a very similar model. On the left we have data ingestion, then data validation, then the actual transformation, i.e. feature engineering, then training, model management, and finally the pusher and model serving. (A minimal sketch of such a pipeline follows below.) And if you look at other real-world systems out there, Kubeflow or others, they typically follow this very same pattern.

What does that have to do with databases? As mentioned, I work at ArangoDB, which is a "graph and beyond" database. At its core it is a really scalable graph database, and it also supports full-text search, documents, et cetera — which is not really the core of this talk, but why is this interesting to someone working at a database company? What we see happening more and more is that, to build enterprise-grade machine learning systems and pipelines, database systems will play a pretty big role. A lot of the things invented for, or needed by, database systems — role management, permission management, tracking of different resources — can help us move from a prototype machine learning solution our research team built to something we can actually use in our enterprise. There is a paper by Microsoft on this, a really interesting read, and we'll see some aspects of it in the coming slides.

So if we go through this pipeline, let's look at where databases can actually play a role. The obvious place is on the data and streaming side: we could either load data from a database or store data in a database, and that can be the source of our training and verification data. The next place they come in is feature engineering and model training.
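Here is that TFX sketch: a minimal pipeline wiring up the components just listed. This is an illustration assuming TFX 1.x; the input path, module files, and pipeline names are placeholders, not anything from the talk.

```python
# Minimal TFX pipeline sketch mirroring the stages above (assumes TFX 1.x).
# Paths, pipeline name, and the module files are placeholders.
from tfx import v1 as tfx

def create_pipeline():
    # Data ingestion: read CSV files into example records.
    example_gen = tfx.components.CsvExampleGen(input_base="data/")

    # Data validation: compute statistics and infer a schema.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])

    # Feature engineering: user-defined preprocessing in a module file.
    transform = tfx.components.Transform(
        examples=example_gen.outputs["examples"],
        schema=schema_gen.outputs["schema"],
        module_file="preprocessing.py")

    # Model training: user-defined training code in a module file.
    trainer = tfx.components.Trainer(
        module_file="trainer.py",
        examples=transform.outputs["transformed_examples"],
        transform_graph=transform.outputs["transform_graph"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100))

    # Model serving: push the trained model to a serving directory.
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory="serving_models/")))

    return tfx.dsl.Pipeline(
        pipeline_name="demo_pipeline",
        pipeline_root="pipeline_root/",
        components=[example_gen, statistics_gen, schema_gen,
                    transform, trainer, pusher],
        # Every component also records its metadata here (MLMD, see later).
        metadata_connection_config=tfx.orchestration.metadata
            .sqlite_metadata_connection_config("metadata.db"))

tfx.orchestration.LocalDagRunner().run(create_pipeline())
```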
Interestingly, graph databases in particular are often used for analytics, and there we already have some unsupervised machine learning algorithms like k-nearest neighbors, community detection, and others. In that sense we don't really train a model, but we can, for example, extract features which can then be leveraged for our model training. So we often see databases, and in particular graph databases, being used in this preparation of data.

Most interestingly, this is also where we see a lot of the future of machine learning: explicitly leveraging graph machine learning. Look at what DeepMind is doing for Google Maps ETA predictions, or at what computational chemistry is doing for drug discovery — molecules are actually also graphs. A lot of the new research, if you look at NeurIPS and the other big conferences, revolves around graph machine learning and graph analytics, and I really like this quote: we can make better predictions if we utilize relationships explicitly. The example I typically give is this: imagine here in the middle we have some social network and we want to predict churn in that network. In traditional machine learning, we would have a feature vector for each user and try to predict that user's churn probability from it. Now imagine all my friends have just left LinkedIn, Facebook, or whatever favorite network we are talking about, and we want to predict the probability that I, here in the middle, will churn as well. Obviously it will be pretty high, because all my close friends have just left the network and I probably have a similar reason to leave as they did. That is the key idea of graph machine learning: we can explicitly leverage that neighborhood and connection information.

So what are the different stages here? We can have simple graph queries. For example, imagine you're at LinkedIn and you want to show a user who could give them an introduction to another user — basically, who is on the path between them? That's a simple graph query any graph database system can deal with. Beyond that, we have graph analytics: Who is the most connected person? Who are the influencers I need to target? Recommender systems are also often built on top of graph analytics. And then we have graph machine learning, which helps us build statistical machine learning models and make predictions such as: Who are potential connections, i.e. which edges are missing in our graph? Who is likely to churn, as in the social network example? Those are the different things we can deal with from a graph analytics and graph ML perspective.

I think the interesting insight here is that traditional machine learning often assumes individual records are independent and identically distributed. In reality this is often not the case. In real-world data scenarios and real-world graphs we often observe effects such as homophily — neighboring nodes are similar. As I said, if all my friends have just left a particular network, my likelihood of churning is probably higher than that of a random person in the network.
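To make that first category — the simple graph query — concrete, here is a sketch of the introduction-path lookup using the python-arango driver and AQL. The 'social' graph, 'users' collection, and credentials are hypothetical, not from the talk.

```python
# Sketch: a simple graph query ("who sits on the path between two users,
# and could therefore make an introduction?") via the python-arango driver.
# The 'social' graph, 'users' collection, and credentials are hypothetical.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "example", username="root", password="")

# Shortest path between two users: everyone along it could introduce them.
cursor = db.aql.execute(
    """
    FOR v IN OUTBOUND SHORTEST_PATH @me TO @target GRAPH 'social'
        RETURN v.name
    """,
    bind_vars={"me": "users/alice", "target": "users/bob"},
)
print(list(cursor))
```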
And this is what we see happening a lot right now: people go in and explicitly exploit the graph structure of their network. Most interestingly — and this is what most of this talk is actually about — databases can play a role for the entirety of your graph ML pipeline. What do I mean by that? Let's look at an example. I used to work at a healthcare startup, and we had the challenge that we were training models on privacy-relevant data. The question came up: what happens if one of those people withdraws their consent for us to use their data? It's still somewhat under debate whether, under CCPA and similar privacy laws, we would have to withdraw the model itself, since the model was influenced by that sensitive data. But at the very least we wanted to be able to identify which models were actually impacted by that data.

So we went through our entire pipeline. We looked at the data side and identified the data files which contained those patient records. Next: which features did we generate out of those datasets? A manual step, looking through logs, very lengthy. Then: which models did we train with that data? Again very lengthy, looking through logs, very annoying to check. Which models resulted from that — which models were in our model management software? And the last step: which of those models did we actually deploy to production? This was a manual lookup across all those pipeline stages, very annoying and very time-consuming. And this was the point where we said: we need a solution that identifies this automatically.

If you're interested in how much information can propagate through a machine learning pipeline, there's an interesting paper, "The Secret Sharer: Unintended Memorization in Neural Networks," which is essentially about how much of your training data can be retrieved from the model later on. Can I actually recover a credit card number that was in the training data from the live model? A really interesting read, and again it relates to how our initial data impacts the model in the end.

Going through this, we identified a number of other challenges we wanted to solve as well. As mentioned, the biggest problem for us was understanding the provenance of a model: what was it trained from, which features did it come from? Understanding that should also give us a complete version history, and an audit log — if you work in regulated environments, those audit logs are really important: who trained what, which parties were involved in training a certain model. Then there is comparing performance across different models as they evolve over time, and finding reusable steps — which features can be reused. That notion of feature stores and feature catalogs is becoming more popular as well, and it's something we looked at too. And then there is the question of my serving data: is the data distribution I see on the serving side actually the same as what I saw on the training side? (By the way, that's what the parallel talk right now is about.) And we basically concluded: all of this is currently pretty hard to answer.
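That last question — does the serving distribution match the training distribution? — can at least be spot-checked statistically. Here is a naive sketch, not from the talk, using a two-sample Kolmogorov–Smirnov test per numeric feature; the file paths and feature names are made up.

```python
# Sketch: naive training-vs-serving drift check with a two-sample
# Kolmogorov-Smirnov test per numeric feature. Paths and feature names
# are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("training_data.csv")
serving = pd.read_csv("serving_log.csv")

for feature in ["age", "visits_per_week"]:
    stat, p_value = ks_2samp(train[feature], serving[feature])
    if p_value < 0.01:
        print(f"Possible drift in '{feature}' "
              f"(KS={stat:.3f}, p={p_value:.4f})")
```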
What do we need? We need metadata. What is metadata? Here is a definition from Kubeflow — Kubeflow being, in short, a way to deploy machine learning pipelines on top of Kubernetes. The definition is basically: metadata is information about runs, executions — anything but the data itself. So we have meta-information about a certain dataset: when was it created, when was it accessed, how big is it, who has access? We have information about transformations: from which dataset was something derived, when was it transformed? And, if we talk about models, when was it trained? All of this is metadata for us.

Next, we looked at collecting the metadata across the entire pipeline. We wanted the metadata from all those different steps to be stored in a single database. As mentioned, when we went through that exercise it was all manual merging: the metadata about our datasets lived in one place, about our features somewhere else, about our models somewhere else again. The realization was: we need all of that in one single database. This is when we started developing an open source solution called ArangoML Pipeline, using ArangoDB as the common store. It's a simple Python interface on top, which plugs into different machine learning systems and makes it easy to store this metadata. The main goals when developing it were, first, to keep it open source so anyone can contribute and extend it, and second, to make it really extensible — because even in our own environment we had different training jobs: some Spark jobs, some TensorFlow jobs, some PyTorch jobs, et cetera, all producing different metadata.

This also drove our choice of ArangoDB underneath. We first evaluated relational database systems, but there we faced the challenge of having to force everything into a fixed schema. Next we evaluated document databases, where it was really hard to track the lineage between documents. We ended up with ArangoDB, which supports both a graph model and a document model. On the individual level, we could store different information in a flexible schema: the description of a dataset is just a JSON document with no fixed fields. We had a general schema for it, but it could really vary depending on whether the data was coming from HDFS or from, say, S3 or some other source. The same held for the transformations and the features: the information stored depended heavily on which system we were using.

The next thing we needed was to connect that information, and this is where the graph part comes in. With a graph database, we can connect those different pieces of information: this dataset is related to this transformation, this transformation is related to this feature, and so on. And keep in mind there might also be branches, or even loops, in that graph — one feature might be derived from another feature.
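This is not the ArangoML Pipeline API itself, but a hand-rolled sketch of the underlying idea — schema-free metadata documents connected by lineage edges — using the python-arango driver. All collection names and documents are made up.

```python
# Sketch of the underlying idea (not the ArangoML Pipeline API itself):
# schema-free metadata documents connected by lineage edges.
# Assumes the database and the collections below already exist.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "metadata", username="root", password="")

datasets = db.collection("datasets")        # document collection
featuresets = db.collection("featuresets")  # document collection
lineage = db.collection("derived_from")     # edge collection

# Flexible, per-source metadata: an HDFS dataset and an S3 dataset can
# carry entirely different fields in the same collection.
datasets.insert({"_key": "patients_2020", "source": "s3://bucket/raw/",
                 "created": "2020-03-01", "rows": 120000})
featuresets.insert({"_key": "patient_features_v3",
                    "transform": "normalize+impute"})

# The lineage edge is what later makes provenance a graph query.
lineage.insert({"_from": "featuresets/patient_features_v3",
                "_to": "datasets/patients_2020"})
```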
And with this kind of information, the question of which dataset impacts which models just turns into a graph reachability query, because you're basically asking: which models can I reach from a certain dataset in this graph? That, I think, is a very nice insight — we can query all of this from a graph perspective, because in the end it forms a graph of how different components depend on each other and what the provenance of each artifact in our pipeline is. Here, for example, a simple query looks up all the entities derived from a certain feature (a sketch of such a query follows below). And you end up with really simple queries for all the different challenges mentioned above — from finding the provenance to producing the audit log of who contributed to training a certain model, which, if you work in regulated environments such as finance or healthcare, can be a pretty important question to answer.

With that, it turned out to be a really useful tool for different kinds of people. Initially it was mostly driven from a data ops perspective: they wanted to track lineage and make sure we were compliant with regulations like GDPR and CCPA. But the data ops person can also use it for audit trails and for reproducible model building: if we want to retrain a certain model, we can identify all the artifacts that went into it — even the random seeds used for a particular training run — because it's all stored in the associated metadata, and that enabled us to retrain a model with a single click. The data scientists themselves also became really interested, because it makes it very easy to look up the relevant entities for a given model. Say you're at a large company with a model predicting, again, user churn, and you want to build a similar model. If you can go in and really understand the pipeline that was used for the previous model, your job becomes a whole lot easier, especially if you're new to the team and don't have the history. Onboarding new data scientists became much easier for us. Similarly, it's very nice for tracking performance differences: since we can serve multiple models in parallel, we can compare where the difference actually is, and why one behaves differently in terms of runtime or precision. Overall, it makes it much easier to reuse entities across your pipeline if you have a catalog of them. I've seen way too many companies build such a pipeline, train a model, and then treat that model as an entity by itself. I hope one of the takeaway messages from this talk is that you shouldn't treat a model as a singular entity; treating a model together with its history is what makes it valuable in production-grade scenarios.

Last but not least, our administrators — or SREs, nowadays — found tracking resource usage much easier too, because they could see who ran how many jobs on our various training platforms and who used which models in the end, which helped us do resource accounting across different teams and different models.
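Here is that sketch. Continuing the hand-rolled example from above, the consent-withdrawal question becomes a single traversal; the edge direction, depth bound, and collection names are illustrative, not ArangoML Pipeline's actual schema.

```python
# Sketch: "which deployed models were influenced by this dataset?" as a
# graph reachability query over the lineage edges from the previous sketch.
# Assumes models link to featuresets via the same 'derived_from' edge
# collection and carry a hypothetical 'deployed' flag.
cursor = db.aql.execute(
    """
    FOR v IN 1..10 INBOUND @dataset derived_from
        FILTER IS_SAME_COLLECTION('models', v) AND v.deployed == true
        RETURN v._key
    """,
    bind_vars={"dataset": "datasets/patients_2020"},
)
print(list(cursor))
```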
Maybe let's go into a little detail on how ArangoML Pipeline handles this itself. As I mentioned at the beginning, it's an open source project designed to be extensible — which is why I put "schema" in quotation marks here — but we tried to come up with a graph of entities that is useful for most people. For example, we have a preset entity for datasets, a preset entity for transformations, for features, for experiments with their performance characteristics, for the models later on, and for serving performance. All of this is easy to use and part of the pre-existing API. Again, the advantage of a graph here is flexibility. Actually, no system I've seen was quite like any other: real-world pipelines really vary in what they do and how they combine things, despite following this general scheme. But since you have a graph, you can change it dynamically: you can add new entities, change the edges in between, change what derives from what, and basically set the graph schema for your specific ML pipeline.

In the end, it's a Python package with an HTTP API — which is what most people use — that allows you to plug into any existing system out there. We also have a TensorFlow Extended integration. From within your machine learning code, you can simply register different entities and the provenance between them. It comes with a handy UI: you can discover entities and search your metadata — for example, if you want to build a new model for user predictions, you might search for models which do something similar and are used in production. And then, of course, there is also a graph view of your entire metadata.

This is not the only open source solution out there; there are others. Unfortunately, they weren't available when we developed ours, otherwise we might well have co-developed with them. For example, TensorFlow Extended — TFX, in short — also came up with its own metadata solution, called ML Metadata (MLMD). It plugs into the existing TensorFlow Extended system, and each component in TFX will, if you enable it, automatically write its metadata into that store. Interestingly, the metadata store follows the same graph pattern. We actually reached out to the TensorFlow Extended team about a more extensible interface that would support a graph backend directly, because in the beginning they simply wrote everything back to a relational database — which felt like turning everything into a graph, then transforming it into a relational model, then retrieving it again for querying. I think it's again an interesting observation that this metadata naturally forms a graph, and it's becoming more and more important now that TensorFlow Extended has implemented it. Kubeflow now also supports metadata tracking, and most of the big open source machine learning systems I see out there support some kind of metadata storage. And, as mentioned, if you're building your own platform, ArangoML Pipeline can be the solution for you, because it's just an HTTP API you can use with Spark, with TensorFlow, with PyTorch — with your favorite custom stack.
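For comparison, here is a minimal sketch of recording an artifact with MLMD directly, standalone with a local SQLite backend. This is an illustration of MLMD's Python API as I understand it, not code from the talk; the artifact type and URIs are made up.

```python
# Sketch: registering a model artifact directly with ML Metadata (MLMD),
# TFX's metadata store, using a local SQLite backend.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "mlmd.sqlite"
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Define an artifact type once...
model_type = metadata_store_pb2.ArtifactType()
model_type.name = "SavedModel"
model_type.properties["version"] = metadata_store_pb2.INT
type_id = store.put_artifact_type(model_type)

# ...then register concrete artifacts against it.
model = metadata_store_pb2.Artifact()
model.type_id = type_id
model.uri = "serving_models/demo_pipeline/1"
model.properties["version"].int_value = 1
[model_id] = store.put_artifacts([model])
print("registered model artifact with id", model_id)
```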
Coming back to ArangoML Pipeline: you don't need a pre-built machine learning pipeline supporting it out of the box — even though, I must say, that makes things much easier, because then it's really deeply built into the system. And by the way, you can also build a connector to plug into TensorFlow Extended, so even if you're using TFX, you can still leverage the UI and the other components.

How do you get started? There's a simple Docker image you can just run, and there are a number of tutorials and Jupyter notebooks out there to get you going. You can also run it on a cloud service, so you don't have to set up a database or anything yourself — a fully managed cloud solution with certain SLAs. There are temporary cloud instances for trying it out and exploring the idea, and there are also production instances.

So, thank you so much for listening. The key points I wanted to make: if you're building a production ML system, keep in mind that for real operational value you should keep track of the provenance — the metadata of what's happening in your system. In most scenarios, the machine learning model by itself won't buy you a lot of value if you can't reproducibly train it and can't explain where the data came from. So I would urge everyone to build a pipeline and then keep track of the individual items and the flow between them. We also have blog posts around this, and these slides include various links — for example, to TensorFlow Extended, to MLMD, and to the other systems you might want to use for managing your machine learning metadata. I think that still gives us a few minutes for questions. Is that correct?

Correct, Jörg. Thank you so much — a super talk from a superhero. Though I'm not clear: are you Batman or are you Deadpool? Which superhero are you?

I feel I'm more Robin, helping the Batman data scientist be more successful.

I don't believe that — I don't think you're the sidekick at all. But yes, you're right, we have a few minutes left. So let me pick up on something you said right at the beginning, where you compared what data scientists are actually doing with the common perception of their work. Let me ask: why aren't they doing what they should be doing? I don't think that came across as clearly as it could have.

I wouldn't necessarily say "what they should be doing." I think there's just a perception — if you talk to your favorite CTO or management at some company, there's often the impression, because they've read an article saying deep learning is the greatest thing ever, that they need to do deep learning, so they get a team to build a deep neural network model. And the perception is often that the team will just build a cool model, and once it's there the task is solved and they can move on to the next one. The challenge of actually operationalizing that model is really undervalued. So I wouldn't say data scientists should be doing something else, but I would say the challenge of operationalizing your machine learning pipeline beyond a research project is often underestimated.
And that's what I've seen across many different verticals and many different companies: the main challenge in really adopting machine learning and gaining value from it is that the focus is on "oh, we have a cool model, done," compared with the effort that actually needs to go into operationalizing such a model. If you take that into account, you can enable data scientists to focus more on model training and exploring data by building a real team around them. We saw on one of the last slides that distinction between roles, where you have a data engineer who is actually in charge of the pipeline. In my experience, most data scientists come from more of a mathematical background; they understand data and data analysis very well, but they're not necessarily the operations people who will build a full-fledged distributed system for model training and inference — which, of course, should all be fault-tolerant as well.

Okay. Some viewers are asking: if they're building their own platform, should they always include metadata in the ML pipeline, or are there cases where it's perhaps not so important?

I would say if I were building my own pipeline, I would at least keep in mind that this will become crucial. Personally, I'd say it's crucial for any pipeline, because you will want to look up how this one model which is running in production — and is now all of a sudden throwing errors — was actually trained. Imagine someone being woken up at 3 a.m. on pager duty: you really want all the information available, especially if it's someone new on the team or someone who didn't build the initial model. And it becomes more and more important as you grow your team. If you have a simple two-person team managing the entire pipeline and all the model training, they will probably have most of that context — they'll have the metadata in their heads most of the time. So if it's some very small scenario, or a research project that is never meant to be operationalized, then it's probably not that important. But my general recommendation would be: keep it in mind for anything you build.

All right, so there you have it. Anyone wanting to know more, please get in contact directly with Jörg. We are unfortunately running a little tight on time, so all that remains for me is to say thank you very much for this fascinating talk, Jörg, and we'll stay in touch. Thanks again.

Thanks a lot.