All right. So this is the last talk of the day. My mandate is to be quick, and I promise I will be. So: metadata management for machine learning pipelines. My name is Steven Pimentel, and I work at Apple. I would normally give a 40-minute version of this presentation; this is my first attempt at compressing it to five minutes. I'm calling this the haiku version, and I hope it isn't more like a koan. I'm going to talk about a FoundationDB layer called Entity Store that I wrote for an internal system at Apple. We use it to store and manage metadata for machine learning pipelines. Entity Store exposes a data model for versioned entities with fine-grained authorization and lineage. That summary on the slide isn't very complete, but it is a haiku.

So why does metadata matter for machine learning? Say you're running machine learning pipelines on a cloud platform for a large number of users who span a diverse range of use cases and teams. Pipelines start with data sets. The data sets are many, varied, and usually raw, and there are often requirements around who is allowed to see which parts of which data sets.

Raw data sets are transformed to produce data sets suitable for training. Then we must track which raw data sets were used (two or more can be joined or otherwise used in combination), what transforms were applied, using what code, in what languages, with what versions of what libraries.

Training produces models. We must now track what architectures were used, what learning algorithms, what frameworks (TensorFlow, PyTorch), with what libraries and what versions of those libraries. And this is before we even get to hyperparameters. We may also use one or more pre-trained models in the case of transfer learning or fine-tuning.

In an experimentation framework, you often iterate through a number of models, varying architectures and algorithms, but all trained for the same task or use case. Now metadata must be tracked per run.
In addition to the previous types of metadata, we have additional run-specific types. Finally, we may want to serve a model in production. But as with any production system, a model will go through versions as it is updated and improved. You may have canary deployments, roll back to a previous version, or deploy multiple versions for A/B testing. We want to store and manage this metadata in a unified framework that tracks provenance for each version of each entity. And this is what Entity Store does.

So Entity Store is a FoundationDB layer that implements a data model for versioned entities with fine-grained authorization and lineage. It's implemented as a Python library exposing its own API above the FoundationDB Python bindings. Read and write authorizations are separately recorded at the level of individual fields, also known as cell-based security.

Versioning of entities is automatic, with modifications to immutable fields resulting in a new version rather than a mutation. Each version has a unique ID, like a Git commit ID, formed from the SHA-1 hash of its primary key, immutable fields, parent's version, and authorization groups. Versions form a parentage tree and can be explicitly selected for use, also like Git. Each entity is modeled as a collection of objects that represent its distinct versions, along with a core object for mutable, non-versioned fields.

Provenance, or lineage, is a record of data's origin and the transforms that have been applied to it. Entity Store records lineage via labeled directed multigraphs. The graph for version parentage is constructed by the versioning mechanism. Other forms of provenance, such as the derivation of training data sets from raw data sets, are represented by different labels on graph edges.

Objects are schema-less and consist of any number of fields with values. Fields can be single-valued or set-valued, and either kind can optionally be indexed. Fields can also have large blob values.
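To make the version-ID idea concrete, here is a minimal sketch of hashing a version's defining data, Git-commit style. The exact serialization Entity Store uses isn't described in the talk, so the encoding below (and the field names in the usage comment) are illustrative assumptions:

```python
import hashlib

def version_id(primary_key, immutable_fields, parent_version, auth_groups):
    """Derive a unique version ID from the data that defines a version:
    primary key, immutable fields, parent version, and authorization groups.
    The serialization here is a guess; only the SHA-1-of-defining-data idea
    comes from the talk."""
    h = hashlib.sha1()
    h.update(repr(primary_key).encode())
    # Sort fields and groups so the hash is independent of insertion order.
    for name in sorted(immutable_fields):
        h.update(repr((name, immutable_fields[name])).encode())
    h.update(repr(parent_version).encode())
    for group in sorted(auth_groups):
        h.update(repr(group).encode())
    return h.hexdigest()

# Modifying an immutable field yields a different ID, i.e. a new version
# rather than a mutation of the old one.
v1 = version_id("model-1", {"arch": "resnet50"}, None, {"team-a"})
v2 = version_id("model-1", {"arch": "resnet101"}, None, {"team-a"})
```

Because the parent's version feeds into the hash, each new ID pins down its ancestry, which is what lets versions form a parentage tree.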
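A labeled directed multigraph for lineage might be sketched as follows. This is not Entity Store's implementation; the edge labels ("derived_from", etc.) are hypothetical names for the kinds of provenance the talk mentions:

```python
from collections import defaultdict

class LineageGraph:
    """A labeled directed multigraph: edges carry labels so one graph can
    hold several kinds of provenance (version parentage, data derivation,
    and so on) side by side."""

    def __init__(self):
        # (label, source) -> list of destinations; a list, not a set,
        # because a multigraph permits repeated edges.
        self._edges = defaultdict(list)

    def add_edge(self, label, src, dst):
        self._edges[(label, src)].append(dst)

    def successors(self, label, src):
        return list(self._edges[(label, src)])

    def ancestry(self, label, node):
        """Follow edges of one label transitively, e.g. to trace a
        training set back to every raw data set it came from."""
        seen, stack = set(), [node]
        while stack:
            for nxt in self.successors(label, stack.pop()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

g = LineageGraph()
g.add_edge("derived_from", "train_v1", "joined")
g.add_edge("derived_from", "joined", "raw_a")
g.add_edge("derived_from", "joined", "raw_b")
```

Here `g.ancestry("derived_from", "train_v1")` walks the derivation edges and recovers every upstream data set, while version-parentage edges would live in the same graph under a different label.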
This rich data model is mapped onto the key-value store. Versioning, authorization, and lineage are incorporated directly into the data model layer, so clients get them for free. Transactions allow multiple clients to concurrently access entities without fear of inconsistent results or data corruption. And thank you very much.
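One way the mapping of fields and per-field (cell-based) authorization onto ordered keys might look is sketched below. An in-memory dict stands in for FoundationDB so the sketch is self-contained, the key layout is illustrative rather than Entity Store's actual encoding, and in the real layer these reads and writes would run inside FoundationDB transactions:

```python
store = {}  # stand-in for the FoundationDB key-value store

def pack(*parts):
    # Stand-in for the FoundationDB tuple layer's ordered key encoding.
    return "/".join(str(p) for p in parts)

def write_version(entity, version, fields, read_groups):
    """Lay out one version of an entity as key-value pairs: one key per
    field value, plus one key per field recording who may read it."""
    for name, value in fields.items():
        store[pack("entity", entity, "v", version, "field", name)] = value
        # Cell-based security: read authorization recorded per field.
        store[pack("entity", entity, "v", version, "acl", name)] = read_groups

def read_field(entity, version, name, group):
    """Check the field-level ACL before returning the value."""
    allowed = store[pack("entity", entity, "v", version, "acl", name)]
    if group not in allowed:
        raise PermissionError(name)
    return store[pack("entity", entity, "v", version, "field", name)]

write_version("dataset-7", "v1", {"format": "tfrecord"}, {"team-a"})
```

With this layout, `read_field("dataset-7", "v1", "format", "team-a")` succeeds while the same read as `"team-b"` is refused; wrapping such operations in a transaction is what lets many clients work on the same entities without seeing inconsistent state.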