So my name is Pamela Nerozzi. I'm a product manager in the CTO organization at Bloomberg, and today with me I have Eric.

I'm Eric Keppel, a software engineer and the architect on the model registry.

We're going to talk about model management. I'm going to cover some of the problems we identified in model management, and then Eric is going to cover how we came up with the solution, and how OCI plays a core part in that solution.

OK. If you attended the keynote, you've probably seen a portion of these slides, but I'm going to quickly go over what we do at Bloomberg. Bloomberg is a technology company that provides software solutions for financial professionals in the finance industry. One of our main products is the Bloomberg Terminal, which provides many functionalities that allow users to get access to raw data, insights, analytics tools, and communication tools. The scale of our operation is quite massive — we process large amounts of data on a daily basis. You can look at some of these numbers; they're quite large, and these are all daily numbers.

Because of the large amount of data we process every day, it's super important for that data to be structured in a way that is useful and digestible for our users. And because of that — you know, recently there's been a lot of hype around AI, large language models, OpenAI — but AI has been a big part of our products and a big part of our culture. We've been working on AI products for the past 15 years, since 2009. We have over 300 developers and researchers who work on machine learning, with expertise in natural language processing, machine learning, information retrieval, and search, and they deliver products that let us deliver value through the Bloomberg Terminal and Bloomberg's other products.

Internally we have a platform called the Data Science Platform. It provides infrastructure that is optimized for machine learning use cases — think GPUs for training, high-performance computing, inference; our colleague touched on some of these in the keynote. We also provide services and tools that make this infrastructure really usable, so our users can go through the whole process and deliver value with machine learning. They do this by going through the model development lifecycle. I have a diagram of what that looks like — a very simplified version — but you can think of every single stage here as a process a developer goes through to complete a task, and for every single step we provide platform offerings that allow our users to complete it. Starting from data gathering and exploration we have Jupyter notebooks, all the way to pipelines with Argo, and all of it is built on top of Kubernetes.

You can also see some of these logos: open source is a big part of our culture, a big part of Bloomberg and of our platform as a whole. In many cases we contribute to open source projects, we use open source projects, and in some cases we also start open source projects. One of the main things you can see here is our managed serving: we have KServe, which was co-started by us, Google, and IBM. It used to be called KFServing, then the name changed to KServe, and it's growing quite fast — earlier today my colleague Dan Sun went over that and how fast it's growing — and we're actually trying to transition it to the CNCF; there's a proposal for that now.

Going back to the visualization I had of the model development lifecycle, I think one word really pops up quite strongly in this diagram — it's repeated across almost all the stages. Can you guess what that word is? Yeah: model. I didn't make this diagram to make "model" seem important just because I made the slides — it actually is quite important; that's why it's called the model development lifecycle. We looked at this and we saw parallels to code in the software development lifecycle. They're quite similar in objective: you have code for software development, and the model for the model development lifecycle. And we saw that for code there are mature tools that let you manage it from the beginning to the end of that process. The question is: can you use the same tools for models and accomplish what you need to do? So we ended up looking at the industry, and we looked at users — what do they do, how do they manage models, how do they store them, how do they use them — and we asked ourselves: can we just use the tools we have for code for models?
Well, if we could, I guess I wouldn't have a presentation. You can't, and I think the reason is that the user base you have for models is not the same as for code. One of the things I want to highlight here: ML developers are actually delivering value with machine learning, and they have a different set of needs than software engineers in the software development lifecycle. Here "ML developer" represents multiple roles, not one singular role — some folks do data science, some do engineering, some do research — but ultimately there's a collaboration that happens between these user bases to actually deliver value.

What they really care about is being able to deliver that value fast. They want tools that are really intuitive and easy to use. Experimentation is super important; evaluation is super important. Machine learning is the process you go through to create this artifact, and the artifact is predictive, so you need to be able to experiment a lot, you need to be able to evaluate, and you need tools that are really easy to use. Collaboration is a big part of that process.

But just because the user base is different doesn't mean that some of the core concepts that are part of the software development lifecycle — and the tooling that exists for it — are not relevant. Proper versioning and immutability are still relevant in the process you go through to deliver value with models. A proper release process, provenance, auditability, compliance checks — these are all still relevant, so you want these things.

And if you really think about the users we have at Bloomberg: in most companies, among ML developers you have data scientists, you have ML engineers, you have production engineers. But at Bloomberg we usually have one engineer doing the end-to-end process — they do experimentation, and they're usually also responsible for taking that thing to production. Because of that, this issue is even more important for us. They're optimizing for speed, for delivering value — machine learning moves fast — but they also need to take care of software best practices and actually have a process to deliver that value.

So we believe we need to bring these two together, and this is where the concept of the model registry came to be. It's the name we have for the product internally. It tries to make software best practices part of the process for these machine learning developers, but without slowing them down, and without forcing them to change their workflows to adhere to tools that were made for software engineers and the code lifecycle.

So what are the requirements? This is the PM edition of the requirements — usually these are much longer; I condensed them for this presentation — but it's pretty much two big concepts. One: you need to allow these developers to do the work they want to do — experiment, collaborate — and at the same time make sure software best practices are ingrained in the platform itself, so that when they use the tool you provide, they don't really have to worry about a lot of those best-practice pieces; it's part of the platform.

And like any other offering within our Data Science Platform, we have some non-negotiables, you could call them. We want whatever we provide to be cloud native, built on top of existing solutions that people have already worked on — we don't want to reinvent the wheel if it already exists, and in some cases we may build something, collaborate, and release it for other people to use. We also need to provide a solution that works both on premises and in the cloud; hybrid is the future, so it needs to work in both environments seamlessly. It should be multi-tenant — we have multiple tenants, not multiple platforms — so that's non-negotiable. And lastly, but definitely not least, it needs to work really well with the rest of our platform. The ultimate goal of our users — whether it's one or two people or a team — is to deliver value, and to deliver value they're going to be using a bunch of tools, and those tools need to just work together. If they don't, there's going to be a weak point in the process. So it needs to work really well with the rest of our platform. And with that, I'm going to pass it to Eric, who's going to cover the solution we came up with.

Thanks, Pam. OK, so here we go — here's the fun part. We want to build this model registry; what does that sound like? These are some concepts that, as a default position, it has to cover. It has to be horizontally scalable. You heard Pam talk about multi-tenancy — in an enterprise organization, teams need isolation from one another. Multi-cloud: as different providers gain different capabilities, we need to be positioned for that. On-prem: Bloomberg does a lot of its core computation on premises, so that's non-negotiable. Authentication and authorization: we all hear the hubbub about model security and model transparency and traceability — how are you guaranteeing security around your assets? And lastly I'll put in, kind of from the software world, digesting and signing. So how do you know
I delivered exactly what you asked for? No problem, right? Sounds easy — but I haven't even gotten to the ML parts yet. We also have this entire domain of: what is a model? Is it partitioned? Is it large? Is it small? What's happening tomorrow? The questions go on and on and on. So how do you avoid going down this rabbit hole? Well, you shop around.

So we're going to go shopping for this model registry, and these are the primary things I'm going to be looking for. I need some storage. It needs to scale; it needs a redundancy concept, so if a file disappears, the file is still there; and partitioning support — frameworks are now sharding things, we want to be able to do distributed pulls, feed into data frames, all these things. Format: ostensibly some things are metadata and some are binary, but how are we structuring that, and is it extensible? One of the things I hear as I listen to everyone is that everything almost starts at S3 — just get your stuff into S3 and we'll take it from there. Is that really where the story starts? Security: I'm not going to ask for a whole lot here — basic ACLs, your POSIX kind of ideas — but how easily does it integrate into your systems, your user domain, your verification services, and so on? And lastly, kind of the capstone to it all: OK, you've created all this stuff — how do you find it, how do you manage it, and does it fit back into lifecycle concepts?

So let's go shopping. Oh — turns out you can't really find this in the store. So we're going to DIY this. My manager asks me how long the DIY is going to take, and I'm thinking: I'm really going to have to build the server for 300 developers and all these products before I even get to the machine learning. In the back of my head I'm like, oh my god, I've got to scale this thing, it's got to run in multiple data centers, I have people working on sensitive stuff, I have people working on public stuff. What I want is to get rid of all of that infrastructure stuff and just go deep into the domain — because that's where the fun is.

So we spun off looking at ways to bootstrap S3, use databases, all this crazy stuff. But I was also looking at what OCI and ORAS are doing — and I hope you're out there, because you guys are awesome. With Helm there's this idea of an artifact: if you strip away the runtime of the container, you're left with this kind of metadata and these blobs. So what else could you possibly do with that? This is where I started to see: hmm, registries are ubiquitous — every provider has one, and we have multiple at our organization. Is it possible to just use this?

What is OCI? It's basically three core ideas. There's the distribution spec — the REST side, shipping images around. Then, what is an image? That's the image spec, which defines how you construct these things. And lastly there's the runtime spec, which I'm not going to touch on in this particular talk — how do you execute this file bundle?

The registry in particular stood out. It's a basic repository: you stick content-addressable tar files in it — digest the tar, and that becomes its location — and it supports these really common workflows: push, pull, list, update, CRUD kind of stuff. So we started looking at the reference implementation, and I really liked what I saw. The images — really, all they are is this JSON. Yeah, there is some talk around YAML, but practically speaking everything is JSON, and then you're basically just lining up these tar files behind it. Very rudimentary, but it has digesting, so I can prevent tampering.
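The content-addressable idea is worth making concrete. Here's a minimal sketch of the core mechanic — not our implementation, just the general pattern of digesting bytes to get their address:

```python
import hashlib

def digest(blob: bytes) -> str:
    """Compute an OCI-style content address for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

# Stand-in bytes for a model weights file (purely illustrative).
weights = b"\x00\x01\x02 model weights"
addr = digest(weights)

# The registry stores the blob under its digest, so the same bytes
# always land at the same address, and any tampering changes the address.
store = {addr: weights}
assert store[digest(weights)] == weights
```

Because the address is derived from the content, verification on pull is just recomputing the digest and comparing — which is the tamper-prevention property being described here.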
I can do checksums. It has versioning and tag support, both on the spec side and on the artifact side. And then we have all the supply-chain tech that can scan these tars and tell us: are our licenses in check, are there vulnerabilities, and so on. That would make security people very happy. What they look like under the hood is what I present here — there's not really much to know other than there's this type idea, you have these layers, and there's front matter and URIs, so you can start to pick up on some of these keywords. And OCI 1.1 added this custom artifact type, which is kind of cute. I'm not going to delve into it, but there's a slide at the end if people want to download the deck and look into that.

So can these concepts provide a foundation for what we're trying to do? Let's go back to our shopping list. I'm looking for scale, resiliency, and partitioning. What is a registry node? It's basically a stateless service that sits on top of your storage provider. That means you can just line them up east to west and take on whatever load you want. They also have this concept of storage redirects — so as you hear people focus on LLMs and S3, I can kind of sleep at night thinking: OK, I can just redirect them to S3 and pick up whatever optimizations other people are working on. And you can stick a blob cache in there too, so you're really not hitting S3 as much as you'd think. But the key part is that registries are everywhere. Like I said earlier, all the major cloud providers have them, and there's a large assortment of toolkits to move stuff between them or to work against them. So I'm pretty happy with this. All right — I put my storage in my shopping cart, so now I'm going to go looking at formats.

What is an OCI image artifact? It's a stack of blobs: one is a config — that's the metadata — and then the rest is the content. When you're using a container, this explodes into what you see on the left. But what if a model is ostensibly the exact same content? In 1.1 they also added this notion called an artifact type, so you can now specify a custom type — maybe it's no longer this container; it can be whatever your client deems it to be.

That's a lot of jargon, so what does this look like — what am I proposing for the user? We have a builder API. They're building this conceptual container: they're putting in their model files, maybe their model cards, their metadata. Really, we're not going to hold them back too much — they can do the layout they want. When they close this up, it's pressed into the artifacts I've been describing, the OCI images. This is then pushed to a registry, and they're returned a logical URI. This is a key distinction: going back to the S3 case — what region am I in, what cluster am I in, what's my bucket, all that stuff — I wanted to avoid that. I'm following the Docker paradigm here: just give me the thing, at version 5.

Lastly — we'll touch a little more on this at the end — you've created this thing, and now it enters the lifecycle of our system, and you can experiment with, serve, and maintain these artifacts. What does that mean for the end user, or for the person you're sharing your creations with? Again, they're given this logical URI — go use model, version 5. They pull it, it explodes into the system, and they can use whatever was pushed. And we've actually taken this one step further: like Pam mentioned, our KServe project has native integration with this, so you can just say, OK, I want to launch this URI in KServe — push-button, and you're ready to send requests. There is a key point within that story, which really came from the product side, that I'll touch on later. So I'm pretty happy with this.
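To make the artifact shape concrete, here's roughly what one of these manifests looks like — a minimal sketch with made-up digests and made-up media types, not our exact schema:

```python
import json

# A minimal OCI 1.1-style image manifest for a model artifact.
# The artifactType and layer mediaTypes below are illustrative, not a standard.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "artifactType": "application/vnd.example.model.v1",  # custom type (OCI 1.1)
    "config": {
        # the config blob carries the metadata
        "mediaType": "application/vnd.example.model.config.v1+json",
        "digest": "sha256:aaaa...",  # placeholder digest
        "size": 321,
    },
    "layers": [
        {   # the model weights, stored as a content-addressed blob
            "mediaType": "application/vnd.example.model.weights.v1.tar",
            "digest": "sha256:bbbb...",
            "size": 123456789,
        },
        {   # model card / extra files alongside the weights
            "mediaType": "application/vnd.example.model.card.v1.tar",
            "digest": "sha256:cccc...",
            "size": 4096,
        },
    ],
}

print(json.dumps(manifest, indent=2))
```

The point is that nothing in this structure is container-specific: the config is "the metadata," the layers are "the content," and `artifactType` tells a client what it's looking at.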
I can store metadata, I can store binary, I can spec it, I can version the spec, I can version the content — that sounds great. It's got checksums, all this stuff. So: security. This is kind of one immature part of OCI — they don't really give you a lot of control. In their OAuth implementation you basically just get a name and then a few actions: push, pull, delete. They do have a working group on this, so I expect it to go further. But we wanted to support tenancies, so basically you have to provide an OAuth provider that is able to work with these concepts — and that's a little bit of how we did it. I think the more important part of security is federating it. I'm getting a little ahead of myself, but one of the key parts of the system is that we can attach as many registries as we want. There are just a couple of stipulations on a registry, one being that it has to have an OAuth provider — but most cloud providers have these (Cognito is an example), so it's really not a hurdle. So I've got some basic security concepts, and I know that because it's OAuth, it works with a bunch of things on our platform.

So now we have this discoverability question. We're creating all of this stuff — how do we manage it and push it through a lifecycle? This is where our Data Science Platform comes in, and because it's an internal tool I can't really go into it too much, but the idea is that someone can come to their tenancy, see what models they've created, see the origins or the experiments that created them, and move them between registries. So if you have different registries with different providers, you can copy them out and manage that as a tenancy.
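The auth model described above — a name plus a few actions — is the standard distribution-style token scope. A small sketch of what parsing one looks like; the repository name here is hypothetical:

```python
def parse_scope(scope: str) -> tuple[str, str, list[str]]:
    """Parse a distribution-style token scope such as
    'repository:team-a/sentiment-model:pull,push' into its parts."""
    # rsplit from the right so repository names containing '/' stay intact
    resource_type, name, actions = scope.rsplit(":", 2)
    return resource_type, name, actions.split(",")

# 'team-a/sentiment-model' is a made-up repository name for illustration.
rtype, name, actions = parse_scope("repository:team-a/sentiment-model:pull,push")
assert rtype == "repository"
assert actions == ["pull", "push"]
# A tenancy-aware OAuth provider decides which of these actions to grant
# for the caller's identity before minting the token.
```

This is the whole vocabulary the spec gives you today, which is why a tenancy model has to live in the OAuth provider rather than in the registry itself.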
So that's another level of an asset for you — and also promoting and deprecating models. I think this is a really immature part of the machine learning publishing story, the notion of deprecation and versions. It's not something you hear a lot about when you're talking with MLOps people.

I'm whipping through a lot of stuff here, so how do I bring it all together? This is our big picture. You have an AI/ML workload somewhere; as long as it has an OCI registry next to it, and as long as it has OAuth, we can tie it back into the central cataloging and use that as a distribution point for our software. And I want to shout out all the OTel people who are here, because I've been trying to add OTel to this, and it's fantastic what you're giving my team. Thank you very much.

Future possibilities. Unpacking into the current runtime — this is the one point I wanted to tie back to product. The story always starts at S3: you do some stuff, you put it in the bucket, and then the rest of the world takes it over. So they brought up this idea of the model as first-class, and I started thinking about it: I don't want my scientists to have to upload stuff into S3 and think about logistics and manage keys and endpoints and all that. They have the model — why can't they just save the model? And that's what we're trying to do: they give us the model, we save the model, and we put it back into their runtime when they need it. Much more complex than it sounds, but we're going for it.

More partitioning: as things get bigger, and as frameworks add more capability for sharding, we want to roll that back in. Attestations — this is kind of a personal project of mine. You basically have SHA, action, SHA; or URI, action, URI; or UUID, action, UUID — it all starts to coalesce into this world of triples, and you can graph it and data-frame it. People are working on this — some of the good stuff you've seen; it's really cool. I think if you can include the provenance with your artifact group, that's amazing. And lastly, some of you might be thinking we could DDoS ourselves. So we have been looking at things like Dragonfly — you put that over Harbor, and you can distribute the pulls if you're pulling a bunch of duplicate stuff. One other thing I'll add there is seekable tars — if you've looked at stargz, cool stuff. I'd like to be able to pull just enough for initialization and then let the rest load lazily. Authorization granularity: OCI has not really touched on authorization yet; there's a lot of discussion around it, and hopefully it can coalesce into something meaningful. Multi-tenancy: a lot of distros — Harbor, Keppel, Quay — focus on multiple tenancies. I think that's really important, and I'm hoping the reference distro gets it. And lastly, storage performance: as your models get bigger and bigger, most of your latency is now in the storage layer, which I don't own and am not particularly monitoring. So how do you roll that back in — how do you protect yourself, or gear yourself up, for more and more storage requirements?

So I'm going to wrap it up there. That was a whirlwind — we only get a little bit of time, and this was a very big project. If you want to hear more, obviously you can come work with us; we have positions all over the spectrum, from data science to cloud engineering. With that, I'm going to pass it back to Pam to close it up.

Yeah, that was great. So please give us feedback — I have a QR code; scan it and you can go to the Sched app and give us your feedback. But with that, I think we're done. We're open for questions if you have any. The woman in the red shirt has a mic; she can help.

Audience: Hi — thank you. Does it mean you basically built an in-house Ollama?

With Ollama you're really tying yourself to the world of LLMs. We're actually tackling the problem, I would say, a little bit higher than that.

Audience: Thank you. You mentioned that you can serve your models in KServe, but for that, KServe needs to understand how to run the model, right? So did you implement the runtime spec that you mentioned, or are you limited to a certain kind of model? For example, do you assume there is an ONNX binary inside the artifact, or a certain entry point? How do you do inference with the artifact that you generate?

Let me put it this way: the runtime spec doesn't necessarily have to define exactly how something is run. If you look at how Docker combines with a runtime spec, it's got users, environment variables, working directories, special paths — these are all important ideas that I can pass down to KServe that have nothing to do with how the model is executed. That's kind of where I'm taking it. Right now you can just pass that URI as an identifier to KServe, and KServe can take the information we have within that artifact and decide the best way to run it. Thank you.
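The KServe integration being discussed can be pictured roughly like this — a sketch of an InferenceService pointed at a registry URI. The `model-registry://` scheme and the model name are hypothetical; in KServe it's the storage initializer that would resolve such a URI and unpack the artifact for the serving runtime:

```python
# A sketch, not Bloomberg's actual integration: an InferenceService whose
# storageUri is a logical registry URI instead of an s3:// path.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sentiment-model"},          # hypothetical name
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},   # illustrative format
                # logical "name at version 5" URI, resolved at deploy time
                "storageUri": "model-registry://sentiment-model:5",
            }
        }
    },
}
```

The user never sees a bucket, a region, or a key — only "the thing, at version 5," which is the Docker-paradigm point made earlier.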
Audience: First of all, excuse my naivety on the problem, but the first question is: how big are the models? You didn't mention that. Second: how big of a challenge is this, really? You didn't convince me — how bad is S3? You could mount some S3 storage into your container; how big of a deal is that? You didn't convince me that this is the better approach. And if you do convince me that this is a really big artifact — what was the problem with Git for artifact management, that you didn't use it? And last one: are those images, or versions, immutable? Sorry for asking lots of questions.

Let me think about it — can you repeat the first one? ... All right, so on S3: it's not so much what S3 is, it's how people use it. You just have Joe Schmo developer uploading whatever they want into S3. That's not a software lifecycle; that's someone just pushing files. What I wanted was — the idea of S3 as storage is fine; it's really about what someone is putting into it, and how you apply that regulation, how you apply that structure. I really want that to be implicit — I don't want to tell people how CAS works, I don't want to tell them to use this cluster or that cluster. So really, my problem with S3 is just that it's too loose a paradigm. It attracts cowboys; it attracts a lot of problems. The second question — what was it? Model sizes: I can't really go into fine details — I'll refer you to our platform geniuses down here in the front row, who carry more weight than I do — but it goes from small to large. We have large, and we have small — very small. And what was the third one? Oh — why not Git?

And on the S3 thing, as you said, I think the issue is not S3 itself; the issue is the lifecycle around it. How do you store things and make sure that whatever you stored is what ends up going to production — and once it goes to production, no one can change their mind and delete that blob? If you set up a system that really makes that lifecycle happen, there's nothing wrong with S3 itself.

Audience: I take back my third question — go for the fourth. And I can add a new one: are the images immutable, and how do you guarantee immutability?

Yes, everything's immutable. In typical repositories you might have the idea of a snapshot being editable but a version being immutable. Right now we're all-in on the immutable side. We're opening up this experimentation workflow and seeing how we can integrate it. On the experiment side I wouldn't call the artifacts themselves immutable, but your workflow will involve a lot of choices — or will give you the ability to make choices — about which particular output you want to choose. Maybe it's metrics-driven, maybe it's test-driven or evaluation-driven. I really want that to be where people make choices, and out of that process comes this immutable, versioned artifact that has a SHA digest that can be checksummed — so every place I put it, using that checksum, I know exactly what bytes are there. That's a powerful idea for me.

Audience: OK, thank you. ... What is the frequency of generating new models? I can imagine that whenever a developer pushes an image, you're going to create a new model and store it somewhere — is it something like that?

So the question is how frequently people store models, and how we handle it when people run experiments and store models — do we store every single model? All I'll say is: S3 is cheap. Yeah, we do store a lot — when people run these models, we store them. We can certainly improve how we tackle checkpoints and the iterative process of experimentation. In terms of keeping every single artifact ever produced in the lifetime of every single project — there may be some work there, to make sure we don't keep everything and to be responsible with retention. But besides that, we just store all of them right now.

Audience: Are you taking advantage of layering when keeping track of models over a lifetime? Maybe you put fine-tunes in a separate layer so you can dedupe, etc.?

Yeah, we're actually just starting to work on some of these ideas right now. Our layering is primarily around the sharding functionality that certain frameworks provide, but we're hoping to keep pushing into this further and further. It's not just models we're thinking about — we're thinking about all kinds of assets that are used in this ecosystem. But it could certainly be a possible future where, if most of the model is the same all the time and only a small portion of it changes, there's the possibility of better layering so you don't change every single layer.

Audience: May I ask about something more on your shopping list: governance. Do you plan to govern your models the way we govern data assets — tracking the versioning, tracking all the stakeholders who have an impact on, or are impacted by, your models?

So the question is: when it comes to governance, how do we think about that, and its intersection with datasets and data — how data is governed versus the approach we're taking with models. Is that the question? ... OK. I wouldn't say we've approached anything specific to datasets.
I think the big question we're trying to answer is lineage, I guess — from when something was trained, or from whence it came. Outside of that, I don't know if I have a succinct answer. There are risk concerns, there's software vulnerability, there's data privacy, there's your location — what data center or what provider you're using, and what kind of security they provide there. I think we've stuck to the basics for the moment: you can authorize people to update your stuff, you can authorize people to read your stuff, and there's also ownership. I think these concepts carry over to data, but I don't really have a cohesive story for you. And I'm sorry if I'm not understanding people — I don't hear very well, so I'm relying on Pam.

Audience: Hi Eric, hi Pamela. My name is Feynman. I'm so excited to hear that you leveraged ORAS, because I'm an ORAS maintainer.

Great!

Audience: I have two questions regarding your use case. The first one is regarding the registry side. You mentioned OCI 1.1, so I'm curious: did you build your container registry — your model registry — on top of the Distribution registry, or are you leveraging an open source registry like Harbor? And is it an OCI 1.1-compliant registry?

Technically we're still on, I guess, 2.8.3 in our production environments, but we're using the 3.0 alpha in testing. I'm riding a slim line between 1.0 and 1.1 right now, but as soon as that reference implementation is released, we're going to go forward.

Audience: OK — on the first one I may need a more detailed conversation with you offline. The second question: you mentioned you have some plans to enhance the security of your OCI artifact distribution, right? You mentioned attestation — can you clarify your plan to add attestations to your ML models as OCI artifacts? And have you noticed that there's a new feature in the OCI spec — in the image spec and the distribution spec — that lets you link different OCI artifacts in a relationship in the registry? Can you clarify your plan for enhancing the security of your OCI artifact distribution?

Yeah, absolutely. When they're serialized, I think there are a couple of spots they can fit. You can have an inline blob descriptor — that's one option, the way you might put base64 into a stylesheet or something like that. So that was one idea: as you're building this artifact, you collect these attestations and actually present them as descriptors in the manifest, as inline descriptors. The other is the one you're talking about, with referrers and subjects — absolutely. How cool would it be if I could say, OK, for this artifact, give me all the signatories, and it would deliver those to me, or tell me whether or not there are subject artifacts out there in the registry? It's all very cool stuff, and I'm hoping to leverage it.

Audience: OK, thank you.

Oh — and I'll make a call-out on that too, for the Sigstore stuff: being able to have a large user base easily sign things and not lose their private keys. Frameworks like Sigstore are essential to these ecosystems too. So if you're out there, I'm watching you too.

Audience: Hello — a small question. I think for full reproducibility you need the exact same environment in training as in serving — I'm thinking Python dependency packages, the exact same versions. Is this also something that is included in the model, do you tackle this in the model registry, or do you tackle it in some other way?

We haven't tackled that yet, but we intend to. We're bringing these worlds together, and hopefully they can share the same templates and everything, but at the moment it's more declarative by the developer, I guess. We do track that in the platform, though — we may not track it in the registry, but we have that information: we track links between these artifacts, how an artifact came to be, and we track dependencies. We use buildpacks when people create the images they run, so we do have a record of how an artifact came to be — it may just not be in the registry itself right now. I'd like to see a meta-model materialize behind all of our platforms, and for it to be cohesive and comprehensive. We're working towards these ideas.

We are out of time, so if you have questions, we're going to be around. Thank you.