So today I want to talk a bit about the different ways in which data is important for your machine learning pipeline. Most of you are probably aware that data is super critical: data quality and data preparation are essential to get proper output and proper quality from your machine learning models. But on the other hand, capturing metadata is also crucial to actually run and operate a machine learning pipeline in a production environment. So we're going to talk about both of those aspects, or put another way, about the ways in which NoSQL databases such as graph databases, document databases, and full-text search engines have an impact on machine learning. First, feature engineering: how can I use them as part of my machine learning pipeline? That part is probably more appealing to data scientists. And then, as the title promises, we'll look at what it takes for machine learning infrastructure, and there we'll look at the metadata we might want to capture across our pipeline. This work is captured in an open-source initiative called ArangoML, and we'll learn a bit more about the heritage of that project throughout this talk. Basically, it's an open-source project that lets me do both: leverage NoSQL databases for data pre-processing for machine learning, and capture machine learning metadata across the various pipelines you might want to build.

I'm Jörg, head of engineering and machine learning at ArangoDB. Previously I kept switching back and forth between the database world and the machine learning or large-scale infrastructure world: databases during my PhD, then SAP HANA, then large-scale machine learning frameworks at Mesosphere and at Suki AI, a healthcare start-up. What I like about my current role is that it combines both of those passions, so I can work on the database side and merge that with the machine learning world.

Alright, why is machine learning taking off right now? We always imagine that with TensorFlow everyone can become a superhero and program really cool models. Why is it the right time? First, we have large amounts of data. Second, we have the computing capabilities: large data centers and cloud environments where you can easily get large deployments. And third, there have been recent algorithmic advances, especially in deep neural networks. This allows us to train self-driving cars, and there's DeepBach, which composes music I find pretty indistinguishable from real Bach, so it's pretty cool what we can all do with machine learning nowadays.

And yet, when people start a new machine learning project, even in big companies, I always hear: I'm just going to hire a data scientist. The data scientist is going to sit there with his laptop and TensorFlow.
We give him some data, he writes this cool model, we train it, and then we're done: we have this cool model and the data scientist can move on to the next big challenge. Unfortunately, reality looks very different. This is from a paper by Google, so from people who have a lot of experience with machine learning, and even for them the machine learning code is just a small box in the middle. The real challenge of productionizing machine learning, so not just coming up with one model but actually putting it into production, is all the bigger boxes around that small black box.

I believe what we should do as a community, and the metadata aspect is one part of this, is work together and define data science principles, similar to how we have software engineering principles like CI/CD, testing patterns, and everything else. If we work together as a community we can structure this picture much better, whereas currently I see a lot of people reinventing it from scratch and trying to come up with their own way of doing these things. Other people, like Ian Goodfellow, agree with that, and I think this is something where we as a community should really work together and thereby move AI, or rather the operability of AI, forward with shared best practices.

The first best practice I see: as mentioned, people start by hiring the superhero, the data scientist, but the superhero isn't very powerful without sidekicks who can help him manage this entire big map. Data scientists often come from a mathematical background, and dealing with large-scale distributed systems and scaling computation is usually not their core strength, which is why we need to work together as a team to solve this challenge. Most prominently, over roughly the past two years a new role has emerged, kind of the DevOps role for data science: data ops, data engineer, there are different names out there. This is a person combining data science skills with distributed systems engineering, a mix of computer science and mathematics, and this persona can help build the entire infrastructure, operate it, and make sure we implement it efficiently. So this is probably the first takeaway I'd like to give people: really think about operating your platform, and don't just hire a data scientist. If we divide this map between our different personas, the data scientist can really focus on the data itself: feature extraction, feature engineering, writing machine learning models, analyzing how well different models perform. Meanwhile a more traditional sysadmin or DevOps person takes care of the infrastructure, and our data engineer or data ops person takes care of the data aspects around the machine learning platform.

What typically comes out when we start such an initiative is that we end up building a machine learning pipeline with more or less the following steps. On the left side we have the data: how do we store and manage our incoming raw data.
Then comes the next step, feature engineering, and this is where data scientists probably spend most of their time. It involves data cleanup and extracting core values out of the raw data, basically transforming the raw data into something we can actually feed into a machine learning framework. Then comes the actual model training. This is the small black box where we have TensorFlow, Spark, PyTorch, MXNet, or whatever our favorite machine learning framework is, and we train a bunch of different models. Keep in mind we typically don't train just one model for a given problem; we train somewhere in the range of tens to fifty different models, for different hyperparameters and different approaches we want to try out. The next step is that I end up with a library of different models. Also keep in mind this might improve over time: I might get new input data, new training data, or a newer model from my data scientists, so I end up with a large library of trained models. The big task is then how to select the one I actually want to use in production for serving, which one should solve the real business problem, and that's model serving on the right side.

There are multiple open-source pipelines available for this. Probably the most prominent one is TensorFlow Extended, or TFX for short, which Google built around the TensorFlow ecosystem, and you can already see it pretty much matches what we saw earlier: data on the left, transformations or feature engineering and training in the middle, and model management and model serving toward the right. Very similar, and actually related in many aspects, is Kubeflow Pipelines. Kubeflow was originally a way to deploy TensorFlow on Kubernetes, but it has grown into a very large ecosystem of projects around machine learning, model serving, and feature stores, so they are basically building a similar ecosystem to TensorFlow Extended, and there is a lot of interaction between both communities. And then there is Hopsworks from Logical Clocks, also an open-source platform, again following similar steps: data ingest, data preparation, experiments and training, and then deployment on the right side. So we have a number of open-source options for building our pipeline.

Now, as promised, let's talk about how NoSQL databases fit into this picture. The obvious answer is on the data and streaming side, taking care of storing our data, so that's probably not so interesting to go into in much more detail. What's more interesting is the feature engineering aspect. As mentioned, this is where most of the time goes: taking raw data and turning it into something TensorFlow or another framework can actually do something with. What are the typical steps? First of all, we have to clean up our data: there might be missing values, different time or date formats, different units for distances, and so on, everything you learn to love and hate when you live in the US. The other thing is that I might also have to transform the data into something that can be understood by my machine learning framework.
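Just to make those cleanup steps concrete, here is a minimal sketch in pandas, assuming a hypothetical table with mixed date formats, missing values, and a runtime column given in hours; the column names are made up for illustration and are not from any data set used later in this talk.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Typical feature-engineering cleanup on a hypothetical movie table."""
    df = df.copy()
    # Normalize mixed date/time formats into a single datetime dtype.
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    # Fill missing numeric values with the column median instead of dropping rows.
    df["budget"] = df["budget"].fillna(df["budget"].median())
    # Convert units consistently, e.g. a runtime given in hours into minutes.
    df["runtime_min"] = df["runtime_hours"] * 60
    # Drop records that are still unusable after cleanup.
    return df.dropna(subset=["release_date"])
```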
Typically, TensorFlow and other frameworks work with something like TFRecords, which is basically a tabular format of input data. So if my data is actually a JSON document, a graph, or a key-value store, I probably have to do something to turn it into something meaningful that TensorFlow can understand and process efficiently. For example, if we have a graph, we might extract the number of movies made by each director, say if we want to predict which director we should choose for our next blockbuster movie. We need to extract features from the graph, because graphs are easily understood by humans but not handled so well by machines yet; we'll see later that there is research in that area, but this would be a typical feature engineering step.

Talking about different data models: why does it make sense to have different data models such as JSON documents, graphs, and key-value pairs? The most important aspect is how data is represented for us as humans. For example, it's very easy for us to think in terms of graphs, in terms of connections between entities, rather than in terms of one big table. It also often makes it much easier to express certain queries, for example graph traversals: finding everyone who is connected to us, or, as we'll see later, all the models related to a data set. That's an easy graph question, but it can end up becoming a really annoying join if we have to express it in SQL.

This is the thing we're working on with ArangoDB. In the past there have been specialized databases, and most of you probably know MongoDB, Neo4j, or Redis, which each help us with one of these data models. The challenge in real life is that for most use cases one data model isn't sufficient; I typically need a mix of them, and then I end up having my engineers write a lot of code merging data from my graph database with my document database, and often I even end up re-implementing a lot of database functionality on top. That was the motivation behind ArangoDB: it's an open-source multi-model database, so it natively supports all those different data models inside one core engine; it's distributed, so graphs can span multiple nodes; and for usability there is just one SQL-like query language that lets us query across all those data models, so I have uniform access to all of it.
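Coming back to the "number of movies per director" feature from a moment ago, here is a rough sketch of how you might extract it with a single AQL graph query through the python-arango driver and get back rows you could feed into a DataFrame or TFRecords. The database, collection, and edge names ("imdb", "directors", "directed") are assumptions about how the movie data could be modeled, not part of the original Kaggle format.

```python
from arango import ArangoClient

# Connect to a (hypothetical) ArangoDB instance holding the movie graph.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("imdb", username="root", password="")

query = """
FOR d IN directors
  LET movie_count = LENGTH(
    FOR m IN 1..1 OUTBOUND d directed   // traverse 'directed' edges to movies
      RETURN 1
  )
  RETURN { director: d.name, movie_count: movie_count }
"""
# Each result is already tabular, so it can go straight into a DataFrame or TFRecords.
rows = list(db.aql.execute(query))
```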
I promise that's the only ArangoDB slide, so let's move on. As mentioned, this topic of graphs and machine learning is where the connection between NoSQL databases, or different data models, and machine learning is probably most obvious. If you just search Google you'll find a lot about knowledge graphs and how we can connect deep learning with knowledge graphs, or with graphs in general, and I think this comes from the fact that for us as humans a lot of things can be easily expressed as a graph. Going back to the movie example: this is a Kaggle data set; Kaggle is a platform for data science competitions, and you can also learn a lot just by playing with those data sets. The data set originally comes in a tabular format, and in that format a director appears multiple times across different records. But when we as humans think about it, we think about connections: a given movie is connected to a director, that director is its own entity which in turn can be connected to other movies, and maybe even to particular actors, because Tarantino, for example, keeps choosing his favorite actors. So for us as humans it's very natural to think in a graph-like structure.

What's also really interesting, and where a lot of research is happening right now, is how we can combine this world of deep neural networks with graphs. This is the area of graph neural networks: how can we leverage machine learning to predict edges in a graph, how can we leverage it to predict labels, basically how can deep neural networks deal with graphs natively. I think this is very interesting, because then we don't have to do the feature step in between, where we manually extract features, but can leave it up to the machine learning framework to extract the valuable information from a network.

As mentioned, we also wanted to talk not just about graphs but about the other ways in which we can, or need to, leverage data as data scientists, and this is where different processing capabilities come in. We can do graph queries and graph traversals, but what's also super powerful just for feature engineering is, for example, full-text search. Going back to our movie data set, imagine we want to find all the Batman and Robin movies: we issue a full-text query to find all the matching movies, and then we could leverage a graph query to find all the directors connected to those movies.

Another aspect of features, now that we've talked about how to derive them, is feature reusability. Keep in mind that cleaning data and developing features takes up to 60 or 70 percent of a data scientist's time, so reusability is a very important concept, and this is where the idea of feature catalogs or feature stores has become quite prominent. A feature store basically allows me to reuse features a previous data scientist has worked on. Imagine I'm at Airbnb: data scientist one comes up with some kind of user representation, some transformation he uses to clean the data set and extract a feature describing a user. If I'm the next data scientist who needs to work on a similar problem, like predicting what might be booked by certain users, then that feature is probably a good starting point, rather than deriving everything from scratch. A good feature store, and this screenshot is from Logical Clocks again, lets me discover features other data scientists have worked on before, gives me versioning and consistency, and can also cache features so I don't always have to compute them from scratch, which can take a lot of time for large data sets.
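To illustrate the idea, here is a toy feature-catalog sketch. It is deliberately not the API of Hopsworks or any other real feature store; it just shows the register, version, and discover pattern described above.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureCatalog:
    """Toy feature catalog: stores descriptions of features, not their values."""
    _features: dict = field(default_factory=dict)  # (name, version) -> metadata

    def register(self, name: str, version: int, transform: str, owner: str) -> None:
        self._features[(name, version)] = {"transform": transform, "owner": owner}

    def latest(self, name: str):
        versions = [v for (n, v) in self._features if n == name]
        return self._features[(name, max(versions))] if versions else None

catalog = FeatureCatalog()
# Data scientist one registers a reusable user representation.
catalog.register("user_embedding", 1, "clean_users.py::embed", owner="data-scientist-1")
# The next data scientist discovers it instead of starting from scratch.
print(catalog.latest("user_embedding"))
```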
Okay, let's move on to the third part, where databases are super useful for machine learning: managing the metadata of the entire pipeline in one place. One motivating example, and what actually brought me back to join ArangoDB a while ago: I was building machine learning pipelines for finance and healthcare, and we often had problems like needing an audit trail, where did this model come from, which data sets were used to train it. But also the other way around: we were dealing with patient data and we needed to be able to identify which models were impacted by one particular patient record, for example because that patient withdrew his consent for us to use the data. Under GDPR it's a bit unclear whether consent concerning the original data set translates over to the model, but information from the original data set can be exposed in the final model, as has been shown in a paper, so at the very least we wanted to be able to identify which models were impacted by, or trained on, that one single record.

This turned out to be a pretty hard problem, because each of those pipeline steps had its own metadata. We knew where a data set was stored; we knew, when doing feature engineering, which of those stored data sets was used; we knew, when doing model training, which features had been used; we knew, when storing a model, from which training run it came; and on the model serving side we knew which stored model we were using. But answering the question meant manually joining data from five or six different systems to figure out which models were impacted and might have to be withdrawn from our production environment.

As mentioned, this is related to a number of other challenges. Understanding the provenance, the lineage of a model, is important for a lot of audit trails; in finance this was a very crucial thing. And a data scientist also needs to be able to see which reusable steps were used: we already saw the concept of reusing features other data scientists have built before, so I need to be able to discover them. Just being able to say, as a data scientist, this Jupyter notebook, these features, and these data sets were used for this training run is a very powerful tool. That's when we started looking into building a common metadata store: as mentioned, there are often individual metadata stores for each of those pipelines, but we were missing the common layer across all of them that we could join on.

Maybe just to establish a common term, let's define what we mean by metadata. The Kubeflow definition is that metadata is describing information about runs, models, and data sets. It's not the data set itself, but, for example, a description saying this data set has this size, has this name, is stored here, and has version x.y.z, which also lets me figure out that there's a newer version of it. Similarly with training runs, where really important metadata is training performance and test performance, and the same goes for notebooks, features, and everything else. It's basically a description of the different entities, but not the entities themselves.
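To make that definition concrete, here is roughly what such metadata records could look like. The fields are purely illustrative, not a fixed schema from ArangoML or Kubeflow.

```python
# Describes the data set (where it lives, which version), not the data itself.
dataset_metadata = {
    "name": "imdb-movies",
    "version": "1.3.0",
    "storage_uri": "s3://raw-data/imdb/movies.csv",
    "size_bytes": 734_003_200,
    "description": "Raw movie data set used for director features",
}

# Describes a training run; the performance numbers are key metadata here.
training_run_metadata = {
    "model": "director-ranker",
    "framework": "tensorflow",
    "train_accuracy": 0.91,
    "test_accuracy": 0.87,
    "random_seed": 42,
}
```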
This led us to build ArangoML Pipeline, a common, extensible metadata layer. We had seen that it's not really good to build it for one particular pipeline, for example just for TFX or just for Kubeflow, simply because most people vary a little in how they implement their pipeline, so we wanted it to be both extensible and flexible and not tied to one particular model. We ended up creating this open-source project based on ArangoDB.

Why did we choose ArangoDB here? We talked a little about this idea of multi-model data, and machine learning metadata turns out to be a really good use case for me to explain why multi-model is important. When we started out we used a document store, and that worked really nicely, because the individual metadata items for each pipeline stage and each entity fit really well into a document model: model training with different machine learning frameworks gives different outputs, so I get different metadata from my TensorFlow framework than from my MXNet framework, and I can easily model that as a JSON document and store it. It just turns out that modeling the connections between the different entities became a lot harder that way. We wanted to model the structure in between as a graph: this JSON document describing my experiment, my training run from yesterday, is related to this feature; this feature is related to this JSON document describing a different transformation; and that is related to a different data set. We basically wanted to bring out that graph structure, and that's why in the end we chose ArangoDB, because it allowed us to be flexible here: leverage the document store for the unstructured data, but at the same time have the structured connections in between as a graph.

Using that, queries like which models are impacted by this one data set turn out to be a simple graph traversal, a reachability query: we start at one data set and then it's basically a graph query asking which models can be reached from there. Keep in mind it's not always a linear chain: a feature might depend on another feature, or a model might be trained on top of another model using transfer learning, so you get a lot of branching and indirection, which makes writing one long SQL statement for this, if you stored it in a relational store, pretty difficult. By leveraging the graph structure, a lot of the things we needed to answer, like audit trails for a given model, turn out to be simple graph queries that I can easily express.
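As a rough sketch, the "which models are impacted by this data set" question could look like the following reachability query over the metadata graph. The database, graph, and document names are assumptions for illustration, not the exact ArangoML Pipeline schema.

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "metadata", username="root", password=""
)

# Walk the lineage graph outwards from one data set and collect every model
# that is reachable, no matter how many features or intermediate models sit
# in between (depth capped at 10 here).
impacted_models = list(db.aql.execute(
    """
    FOR v IN 1..10 OUTBOUND @dataset GRAPH @graph
      FILTER v.entity_type == "model"
      RETURN DISTINCT v.name
    """,
    bind_vars={"dataset": "datasets/patient-records-2019", "graph": "lineage"},
))
```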
Over time this enabled a number of use cases for different personas. Originally we were mostly targeting the data ops engineer, who needed to take care of the audit trail and the data lineage for GDPR, because he had to answer those questions in our environment. But it turned out this metadata was pretty useful for the other personas as well. Our data scientists, for example, could easily find the relevant entities for a model: we onboarded a new data scientist and told him, please look at this model and improve it, and just by looking at the metadata he could get the entire view of how that model had been trained in one go, whereas previously he would have had to go through all the different systems or ask the people who had implemented it which data set was used, for example. It also helps to search whether someone has already built, say, a user feature, and it really helped them to compare performance differences across models, because again they can issue a simple graph query and then compare training or test performance across various models based on very flexible criteria.

Then, obviously, the data ops engineer or data engineer was our main persona: he could do audit trails much more easily, track the lineage of data, and identify what is impacted by what. It also helped us achieve reproducible model builds, because we would store random seeds and other settings as attributes in that graph structure, and that helped us rebuild models with more or less the same outcome, which had been a pretty hard problem before; compliance was always pretty annoyed that we ended up with different model performance. It also really helped us in production with the problem of data shift. Data shift occurs when your production data ends up having, for example, different distributions over time: there are new attack vectors, or simply new usage patterns you haven't seen in your training data, and this often means the model you trained is no longer applicable to that data because it has different characteristics. Previously this was pretty hard to detect, but by keeping track of metadata for the serving environment as well, basically which requests we get, we were able to continuously compare the distribution and characteristics of the training data set against the data we were seeing in production, and this would, for example, trigger a rebuild of models once it moved outside a certain range.

For the more traditional DevOps person or system administrator it also proved very useful, simply because we could do resource accounting: we would know how many hours of training the teams had spent, because that's also something we simply monitored, and hence translate that into cost, how much was spent on training this one model, how much a team spent on solving a particular problem, which was very helpful in allocating resources. It also helped with permission tracking: we can basically see whether a data scientist is able to see all the entities involved in training a model, or whether he can't work on a certain problem because he isn't allowed to access the data set.

Alright, let's dive a bit into ArangoML and how we solved that problem in the end. This is the schema, so to speak, that we came up with for ArangoML, and I say "so to speak" mostly because there is no fixed schema. We came up with a number of entities which we felt were useful for most machine learning platforms, but in the end anyone can simply define a new entity in that schema and also change the connections in between, because it's simply a graph structure. For example, if I'm not using notebooks, I can connect a feature directly to a model, depending on the workflow at my company. So as mentioned there is no fixed schema, but some entities just have better support in the UI and are kind of first-class citizens.
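Because the schema is just collections of documents and edges, extending it comes down to inserting new documents and new edge types. Here is a minimal sketch using the plain python-arango driver; the collection names are illustrative and not the exact ArangoML defaults.

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "metadata", username="root", password=""
)

# A new entity type and a new connection type are just two more collections.
if not db.has_collection("featuresets"):
    db.create_collection("featuresets")
if not db.has_collection("featureset_to_model"):
    db.create_collection("featureset_to_model", edge=True)

db.collection("featuresets").insert({"_key": "user-features-v2", "owner": "team-a"})
# Link the feature set straight to a model, skipping notebooks entirely.
db.collection("featureset_to_model").insert({
    "_from": "featuresets/user-features-v2",
    "_to": "models/director-ranker",
})
```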
ArangoML comes with different APIs. We have a Python package available that you can use from your favorite notebook or data science environment, there's an HTTP API for cases where you're not working in a Python environment, and we're also just finalizing our TensorFlow Extended integration; I think there's a slide on that coming up. Basically, if you're using TFX, everything gets stored in ArangoML automatically if you want to leverage that.

This is a first screenshot of the UI. It allows you to find models or deployments, for example, and to track lineages: once I've found something in particular, I can see everything related to an experiment or to a deployment, and the next version will also let you drill down into a model and dive deeper there. So this gives me a really easy insight into everything involved, and again, for us humans graphs make it really easy to comprehend how certain things are connected, which is why we wanted a graph representation here, making it easy for people to understand the connections between the different entities.

As mentioned, TFX, TensorFlow Extended, has its own approach to metadata storage. They actually announced it after we started this project, and originally I was pretty annoyed, because we had been trying to solve exactly that. But in the end it turns out to be a pretty good fit: it's very targeted at the TensorFlow ecosystem, and those APIs are not something you really want to code by hand, they involve a lot of gRPC calls and a lot of code generation. In the end they defined an interface that allows us to easily plug in underneath, so basically just imagine that last box being ArangoML down here; we'll release that probably within the next months, and then we can easily integrate into that ecosystem as well. The other projects are also catching up on metadata, I think there's an alpha version out there, and we're working together with that community on metadata tracking as well. This problem of metadata keeps coming up on a lot of these machine learning platforms and pipelines, because it's typically required for production-grade scenarios and for anything involving audit trails or something similar: whenever compliance comes into the picture, I somehow need this metadata and need to be able to trace back what happened in building this model.

Alright, thanks for listening. I have about six minutes left, so I can briefly show you the project, because this is all open source: just go to arangopipe, and the slides will also be online after this talk, so the links are in there. You can jump into the notebooks and run them; let me reconnect to my session and see if it reconnects, but basically that entire example is there for you to run from a Jupyter notebook. Here's the UI, where we can search for things and explore whatever we have done with our models. The last thing I briefly wanted to show is the implementation detail for those entities, basically the way ArangoDB deals with them. You'll see collections there, which are kind of the equivalent of tables inside ArangoDB, except it's not just tables, it's collections of entities. Here is the overview of all the collections, with all the default entities involved.
You can see there are two different kinds of ways in which things are represented. There are documents, so for example here is a document describing a deployment, and then there are edges, which describe how things are connected. Because those two kinds of entities are independent of each other, I have the flexibility of always adding new edges and new entity types without being constrained to one particular schema or one particular setup.

Alright, with that, since we're also at our planned limit, I'll leave it open for some questions; otherwise feel free to find me somewhere afterwards, and thank you so much for listening. Three, two, one, no questions, okay, I'll be down here. Thanks.