Thank you. Thank you. Can everybody hear me? Is this working? Okay, great. Yeah, this is sort of a technical deep dive. I'm not going to throw a whole lot of code at you, but I'll try to go into some details. So, my name is Damon Caswell. I'm one of the lead developers for HP's internal developer portal, the senior lead engineer. And I'd like to start by talking about a number.

So why am I showing you this? That is a rough estimate of how many objects HP has to ingest right now from one location, one provider. I'd like you to take a moment to think about how, in your environments, you would go about trying to ingest 50,000, 75,000, 100,000, 200,000 or more entities from a single source of assets, something where you read it once and you dump it into the catalog. How would you do that? Think about how you might accomplish that in a large, multifaceted company where you don't necessarily have any direct control over the data sources that you need to make searchable, that you need to ingest into the catalog. If you're picturing a world where all of your data partners give you a YAML catalog-info file that they keep updated for you, the science fiction and fantasy convention is down the street.

So Backstage is at the core of our developer portal, which we're using primarily for its catalog and relationship capabilities. Our stack includes a lot of technology, some of which you've seen today and which you're all very likely familiar with, plus a bunch of AWS services you probably know as well: Elasticsearch, RDS, Lambda, Docker, et cetera. There's a lot that goes into a typical Backstage deployment, and we're no different from anyone else on that.

Now, some of you are probably thinking, HP, the printer company. Some of you are probably even thinking this. Yeah, we still get a lot of jokes about that. Yes, HP makes software. We've got everything from printer drivers to firmware, machine learning, artificial intelligence, and obviously software for people's computers; we sell a lot of computer hardware. We've got web apps, we've got mobile applications. Everything you can imagine a modern tech company having, HP has too, and more; just because HP is old doesn't mean we don't have it.

Got a fun one here for you. This is one of the largest printers in the world. So yes, we are still a printer company. This printer has a doorway, more than one actually, it has a staircase, and it has over a million ink jets that have to operate simultaneously, coordinated by some really specialized software. So yeah, software is a big part of even something big and industrial like that. It's also really cool to watch in operation, all the paper going chun-chun-chun-chun. Don't stand too close to it; it won't care if you get caught up in it.

So we want all these assets, just like anybody else, to be in our catalog. But HP is big. We've got hundreds of thousands of them, and we're actually looking at millions down the road. We want to make all of those available in our catalog, and we want them, and the people who created them, to be searchable, which means finding a way to efficiently ingest all of these really large data sources. So this is the problem. I'm going to break this up into two main sections.
I'm going to outline the problem that we're facing, which is really a two-part problem, and then outline the two-part solution that we found.

So, ingestion is slow. As our ingestion needs went up, performance went down. The time from when an entity was ingested to when it was visible in the catalog kept getting longer and longer. When we first started, it was very manageable, but as time progressed and we added more and more, everything felt bogged down. We were new to the platform, and as people who were new to it and had never looked at it before, we followed the best practice guidelines that came from the community. We were duplicating entity providers, or actually, initially, just catalog processors, since we joined up before entity providers existed, and we followed the patterns that are available by default in Backstage plugins. We parsed locations and emitted them from one catalog processor, and then another processor would pick up each location and process the entities in that location. It's a pretty common design pattern. Actually, can I see a show of hands? How many of you are doing that now, emitting locations and then emitting entities from those locations? Not that complex yet? Well, that's where we started, and it actually worked really well, until it started getting too big.

So, problems with catalog processors. There's no real scheduling control. You can do things like configure the processing refresh rate, I can't remember the exact parameter name, but that's about it. We ended up with lots of locations, even locations that didn't technically exist and were just used to logically separate chunks of data in our catalog. Entities did eventually show up. It did work, but it took a long time. So we concluded that this sort of out-of-the-box design pattern is effective if you're dealing with a small number of entities, but as soon as we tried to ramp up, we needed something else. And this was, like I said, back before entity providers existed.
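To make that concrete, here's a minimal sketch of the location-emitting pattern described above: a processor that claims a custom location type and emits an entity for every record it finds there. The processor name, the example-service-list location type, and the fetch target are hypothetical; the readLocation hook and the processingResult helpers are the standard Backstage catalog processor API, though import paths vary somewhat between Backstage versions.

```typescript
import {
  CatalogProcessor,
  CatalogProcessorEmit,
  processingResult,
} from '@backstage/plugin-catalog-node';
import { LocationSpec } from '@backstage/plugin-catalog-common';
import { Entity } from '@backstage/catalog-model';

// A hypothetical processor that turns one "logical" location into entities.
// This is the pattern described above: one processor emits locations, and
// another (or the same one) later reads each location and emits its entities.
export class ExampleServiceListProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'ExampleServiceListProcessor';
  }

  async readLocation(
    location: LocationSpec,
    _optional: boolean,
    emit: CatalogProcessorEmit,
  ): Promise<boolean> {
    if (location.type !== 'example-service-list') {
      return false; // not ours, let other processors handle it
    }

    // The source is only read when the processing loop reaches this location,
    // and it is re-read on every pass through the loop.
    const response = await fetch(location.target);
    const services: { name: string }[] = await response.json();

    for (const service of services) {
      const entity: Entity = {
        apiVersion: 'backstage.io/v1alpha1',
        kind: 'Component',
        metadata: { name: service.name },
        spec: { type: 'service', owner: 'unknown', lifecycle: 'production' },
      };
      emit(processingResult.entity(location, entity));
    }
    return true;
  }
}
```

The catch, as described above, is that everything here happens inside the processing loop: entities only appear when the loop gets around to their location, and every extra location adds to the queue.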
When entity providers came our way, we jumped on them, because they looked like they would solve the problems. But they really didn't. I mean, we were thrilled to have a mechanism that was specifically designed to just ingest data. It's a great idea: ingest in the provider and leave your processors in charge of all the post-processing. But our design thinking hadn't really adapted to it yet, and on top of that, it still brought some problems. If you still ingest locations, you're still only getting entities whenever the catalog processor reaches them, and if you want to target a very large asset source, it introduced new problems that revealed that a default entity provider doesn't really scale comfortably to a very large data source. We still had locations polluting the catalog. We still had entities taking a long time to show up, sometimes even longer than before. Now, I do want to say that entity providers offered a lot of great things as well, and I didn't mean to get this far ahead yet, but they offered the first good mechanism for clearing orphans. That, in and of itself, was huge, because data sources change. Assets get removed, assets get added, and you don't want old assets lingering in your catalog, especially if that asset is a user who has left your company; you don't want them still appearing in your catalog. And of course, Backstage allows you to ingest users.

So this was somewhat of a game changer for us, but we still had problems. Again, that 200,000 number: we struggled with this one. A full mutation with an entity provider and 200,000 entities is impossible. You can't do it. We ended up breaking data sources up artificially, drawing artificial distinctions between arbitrary categories of data in order to subdivide them into ingestible chunks, and creating a separate entity provider for each of those chunks. It was a desperate stopgap at best, not a long-term solution. For one thing, you won't necessarily have control over all of those data sources; whatever arbitrary factor you're using to subdivide them will change, and if you have no control over that, you need a different way to ingest such a large set. On top of that, we found that when we dumped a very large number of entities into the catalog all at once, the amount of time it took for pre-processing and post-processing skyrocketed, because there was just too much in the catalog at once, and if you're the entity at the back of the line, you're at the back of the line.

So, to summarize the issues we saw with both of the default, out-of-the-box design patterns for ingestion, because neither really fit our needs. The catalog processor does have some pros for ingestion: it's simple to implement, mature, and a one-stop shop for entity ingestion. That's fine, but it leaves behind orphans and creates long delays in entity ingestion. What we found is that it's really better for side effects, for emitting relations. Everybody is used to the idea that Backstage is eventually in sync, not real-time; well, HP kind of wanted to go real-time, and catalog processors couldn't get us there when used for ingestion. Now, entity providers are objectively better. They're strictly dedicated to ingestion, they don't do anything else, they're much more configurable with scheduling, and of course there's the cleaning up of orphans that we saw. But you still can't emit relations from one, and there are only two options for ingestion: full and delta mutations.

Yeah, the full mutation, that's the very large data set. We ran into that JavaScript heap out-of-memory error. Has anybody else gotten that? Has anybody else tried to ingest a large enough data set to see it? No? Wow. Okay, well, you will. Oh, I see a hand. Hey, it's nice to know our pain is felt by somebody else out there. If you ingest a very large amount of data, you'll eventually run into Node.js's own limits on how much you can shove into a single array, because that's what a full mutation does: you put everything into a single array to be ingested. You can work around that. You can do things like increase your max heap size, but you're going to run into problems there too, eventually. That's a band-aid.
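For reference, this is roughly what the two mutation types of a standard entity provider look like. The provider name and the fetchEverything helper are hypothetical stand-ins; applyMutation with 'full' and 'delta' types is the standard EntityProviderConnection API, and the 'full' case is where that single giant array comes from.

```typescript
import {
  EntityProvider,
  EntityProviderConnection,
} from '@backstage/plugin-catalog-node';
import { Entity } from '@backstage/catalog-model';

// Hypothetical helper that reads the entire upstream source into memory.
async function fetchEverything(): Promise<Entity[]> {
  const response = await fetch('https://assets.example.internal/api/all-items');
  return (await response.json()) as Entity[];
}

export class ExampleAssetProvider implements EntityProvider {
  private connection?: EntityProviderConnection;

  getProviderName(): string {
    return 'example-asset-provider';
  }

  async connect(connection: EntityProviderConnection): Promise<void> {
    this.connection = connection;
  }

  // In practice this would be driven by a scheduled task.
  async runFullMutation(): Promise<void> {
    // A full mutation replaces everything this provider owns, which is what
    // clears orphans, but it also means building one array holding every
    // entity. At hundreds of thousands of entities, this is where the
    // Node.js heap limits mentioned above start to bite.
    const entities = await fetchEverything();
    await this.connection!.applyMutation({
      type: 'full',
      entities: entities.map(entity => ({ entity })),
    });
  }

  async runDeltaMutation(added: Entity[], removed: Entity[]): Promise<void> {
    // A delta mutation is cheap, but you have to know exactly what changed,
    // which doesn't help when you can't diff the upstream source yourself.
    await this.connection!.applyMutation({
      type: 'delta',
      added: added.map(entity => ({ entity })),
      removed: removed.map(entity => ({ entity })),
    });
  }
}
```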
Think about your asset catalogs. What are they going to do? They're going to grow, and grow some more, and keep growing. They grow because asset sources grow. You add more APIs. You add more data assets. You add new organizations, new actual sources of data, entirely new entity types, more Docker images, more Harbor artifacts, more charts, more pipelines, more everything. Everything grows. Here's another one: I'd like a show of hands from anybody whose asset catalogs are shrinking. Yeah? Okay, thought not.

So that's going to cause some problems. You're going to run into issues where your backend can't handle it in production. Yeah, that's fun. And by the way, you can tell I'm not a graphic designer; that's the best image of a burning database I could grab off the interwebs. And that's just to start with. There are other issues that make themselves noticeable when you try to scale up into the hundreds of thousands or millions, as we eventually plan to. Database concurrency is one of the things we ran into. When you've got that many entities going in at once, and you're trying to post-process at the same time, and you've got four or more pods in your Kubernetes environment all talking to the same database, all trying to get to the front of the line: hey, I've got an entity to ingest; hey, I've got an entity to pre-process; hey, I've got an entity to post-process. Eventually Postgres is not very happy with that. It can get messy, because every single entity needs to go through the same processing loops. Before we solved this issue, and I want to assure you we did solve it, it could take hours, or in the worst case we ever saw, days, for an entity that was just added to the catalog to finish its post-processing loop and actually become searchable. And, where was I? Oh my gosh, I've lost my place. That never happens. So here we are.

We started asking ourselves, why is that? Why were the pre- and post-processors taking so long? We looked at this diagram, which is taken from Backstage's own documentation, thank you, Backstage; it's their graph of the processing loop, and the source URL is at the bottom. We asked ourselves which specific operations cause a delay in the pre- or post-processor, and we started measuring exactly which ones took the longest. That was not easy, but it was fruitful, because in every case we identified asynchronous actions in the pre- and post-processors. You've got an entity that you need to post-process, and one of the things you need to do in post-processing is grab an additional piece of data from some other source. It's just an HTTP GET. It'll only add 500 milliseconds to that entity, and to all 200,000 other entities you're ingesting and processing. Yeah, that adds up. Here's my calculation: when there's a backlog of 50,000 entities and you've got four pods processing entities at the same time, the lag caused by an average 500 millisecond delay per entity results, just by itself, in an additional half an hour for the entities at the end of that line. The entities at the end of the queue wait an additional half an hour, and that queue is never ending, because as soon as stuff reaches the front of the queue, there's more stuff at the back.

So, to summarize: initial ingestion is very slow with catalog processors; entity providers can't really ingest very large data sources; you end up with a catalog polluted with locations; and asynchronous operations during processing cause delays. The standard patterns for emitting entities after reading locations are inefficient because of those asynchronous operations.
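To illustrate the kind of asynchronous call that costs those 500 milliseconds per entity, here's a hedged sketch of the anti-pattern: an awaited HTTP GET inside postProcessEntity, done so that a relation can be emitted. The ownership-lookup URL and response shape are made up; the hooks and helpers are the standard catalog processor API.

```typescript
import {
  CatalogProcessor,
  CatalogProcessorEmit,
  processingResult,
} from '@backstage/plugin-catalog-node';
import { LocationSpec } from '@backstage/plugin-catalog-common';
import {
  Entity,
  getCompoundEntityRef,
  parseEntityRef,
} from '@backstage/catalog-model';

// Anti-pattern: an awaited remote lookup inside the processing loop.
export class OwnershipRelationProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'OwnershipRelationProcessor';
  }

  async postProcessEntity(
    entity: Entity,
    _location: LocationSpec,
    emit: CatalogProcessorEmit,
  ): Promise<Entity> {
    // This HTTP GET only adds a few hundred milliseconds per entity,
    // but it runs for every entity, on every pass through the loop.
    const response = await fetch(
      `https://ownership.example.internal/owners/${entity.metadata.name}`,
    );
    const { ownerRef } = await response.json();

    emit(
      processingResult.relation({
        type: 'ownedBy',
        source: getCompoundEntityRef(entity),
        target: parseEntityRef(ownerRef, { defaultKind: 'Group' }),
      }),
    );
    return entity;
  }
}
```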
Okay, that's enough talking about the problems. Let's talk about the solutions. Part one: incremental entity providers. We created our own type of entity provider with help from Frontside. Thank you, Taras, who some of you probably know; actually, all of you should know him from the Backstage community by now. You heard him talk earlier. So yeah, Taras rocks. Thank you. With their help, we created something that works a little differently.

This type of entity provider takes a large data source and ingests it in bite-sized, configurable chunks, with the ability to back off if the data source is generating errors. Anybody run into issues where you're most of the way through an ingestion, and then the API you're trying to read from crashes, or gives an invalid response, or something else goes wrong, and, oops, better start over? Yeah. This kind of entity provider has a back-off mechanism: if an error happens, it can back off and then retry that particular burst of data. It uses paging to break a large ingestion up into chunks, and it pauses between chunks, so you put constant, even pressure on your data sources instead of spikes of really high pressure, which is really useful for data sources that rate-limit. I'll just say this: we do use GitHub, and GitHub rate-limits. Oh, I'm seeing a lot of nods of agreement here. I'm glad we're not the only ones. And every aspect of how these run is configurable: how often it runs, how long it does a burst of ingestion before pausing, how long it pauses between bursts, and how long it rests between full ingestions. If you've got a data source that changes very infrequently but is really, really large, you might not want to ingest it more than once a month. It goes through its thing, ingests, and finishes.

We also needed ways to control when these incremental entity providers start and stop, ways to pause them or reset them if something goes wrong. So we also created a suite of administrative tools. They're just basic REST endpoints you call when you want to manipulate the activities being performed by these incremental providers, and you can use a web front end for them or any REST-style tool. This is a screenshot of Postman; I'm sure you all know what that is.

I want to talk a little bit about the internals. We created a new schema that we add to your database. We don't push this stuff directly into the public schema in Postgres; we keep it separate. There are three main tables: the ingestions table, which tracks the status of a running provider; the ingestion marks table, which is essentially a way of tracking the cursor for the page of data we're on; and the ingestion mark entities table. That last one is how we split the difference between delta and full entity providers. We still needed a way to get rid of orphans, and tracking the state of each entity, so that we know whether it was still there on the last ingestion or not, is how we got to the point of being able to wipe out orphans in a way very similar to what a full mutation does with a regular entity provider.

There are some requirements for this. Your data source has to paginate. If you've got a very, very large data source that does not paginate, well, first off, why do you have a very, very large data source that does not paginate? But if it doesn't, an incremental provider is not going to work for it. It requires pagination because it needs to be able to track where it is in the process.
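As a rough illustration of the shape this takes, here's a sketch of a paginated, cursor-based ingestion burst. The class, the PageCursor and result types, and the asset API are all hypothetical, written from the description above rather than from the exact interface of the module being open sourced; the point is that each call ingests one page and returns a cursor that can be persisted in the ingestion marks table and picked up by any replica.

```typescript
import { DeferredEntity } from '@backstage/plugin-catalog-node';

// Hypothetical shapes sketched from the description in the talk.
export interface PageCursor {
  nextPageUrl?: string;
}

export interface IncrementalIngestionResult {
  done: boolean;
  entities: DeferredEntity[];
  cursor: PageCursor;
}

export class ExampleIncrementalAssetProvider {
  getProviderName(): string {
    return 'example-incremental-asset-provider';
  }

  // Called once per burst; the previous burst's cursor is loaded from the
  // database, so any replica can pick up where the last one left off.
  async next(cursor?: PageCursor): Promise<IncrementalIngestionResult> {
    const url =
      cursor?.nextPageUrl ?? 'https://assets.example.internal/api/items?page=1';
    const response = await fetch(url);
    if (!response.ok) {
      // Throwing lets the framework back off and retry this burst later,
      // instead of restarting the whole ingestion from page one.
      throw new Error(`Asset source returned ${response.status}`);
    }
    const page: { items: { name: string }[]; next?: string } =
      await response.json();

    return {
      done: page.next === undefined,
      cursor: { nextPageUrl: page.next },
      entities: page.items.map(item => ({
        entity: {
          apiVersion: 'backstage.io/v1alpha1',
          kind: 'Component',
          metadata: {
            name: item.name,
            annotations: {
              // Provider-emitted entities need location annotations; these
              // values are illustrative.
              'backstage.io/managed-by-location': `url:${url}`,
              'backstage.io/managed-by-origin-location': `url:${url}`,
            },
          },
          spec: { type: 'service', owner: 'unknown', lifecycle: 'production' },
        },
      })),
    };
  }
}
```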
Also, because we're adding new tables and one of those tables tracks your entities, there are storage considerations. If you're always riding the edge on how much storage your Postgres database is using, you're going to need to bump that up a little if you implement an incremental entity provider. One thing that did not make it onto my slide, but that I should mention, is that it also does not support stateful APIs. So, for instance, if you're using something like LDAP as your directory provider, LDAP is a stateful API. I know that's kind of old school, but a lot of places still use it. An incremental entity provider is not going to work for that, and the reason is that after it has finished ingesting a chunk of data, the cursor we store is only valid for the client and the session that was opened when it communicated with that data source. If you're running in Kubernetes with multiple pods, as we are, with four replicas per environment, some other replica might pick up the next chunk of data, and it doesn't have an open session to LDAP. So you can't use it with a stateful API. Stateless is what most of them are anyway, so it's not a huge deal, but it's something to keep in mind depending on what sorts of legacy systems you want to ingest from.

So that solved half the problem. It got us to the point where we could ingest those very large sources of data, and do it in small enough chunks that there was never a point where they were absolutely flooding the catalog. Which leads me to the second part of the solution: optimizing the processing loops. It was still taking too long to get from an ingested entity to an entity visible in the catalog. Like I mentioned earlier, we identified asynchronous operations, all those HTTP GETs that you put in your catalog processors, as the culprit. And it's a little frustrating, because those are also the natural places to do that work, the places where it feels right. You're trying to emit a relation, so you grab the data you need to emit the relation and you emit it. Fine, but it's not efficient.

So fixing the problem meant committing to something that honestly looked a little ugly at first: front-loading all asynchronous operations into the entity provider. We banned them from catalog processors. We don't use that mechanism at all anymore; our catalog processors have no asynchronous operations whatsoever, and we moved all of that into the entity providers, these incremental entity providers. It's not a trivial task. It meant that for every entity kind we deal with, we needed to include all of the methods and functions used to fetch that asynchronous data right there in the entity provider; it has to happen at that stage. That's okay, because you're only running it once in a while, and ideally, since you're not ingesting a huge chunk all at once but doing it in bite-sized chunks, it never adds a huge amount of data or time all at once. But it was work to get there. Every entity provider needed to be rewritten, every catalog processor needed to have its asynchronous operations removed, and, the ugliest thing of all, every entity schema we created needed to be updated so that the entity itself could carry all of the data that would eventually be used to emit its relations and other side effects, to run that emit process.
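Here's a hedged sketch of that front-loading pattern, reusing the hypothetical ownership lookup from earlier. The ownerRef spec field and the lookup URL are made up for illustration: the provider does the HTTP GET once at ingestion time and stashes the result on the entity, and the processor then emits the relation from data that's already there and strips the temporary field, with no awaits against external systems.

```typescript
import {
  CatalogProcessor,
  CatalogProcessorEmit,
  processingResult,
} from '@backstage/plugin-catalog-node';
import { LocationSpec } from '@backstage/plugin-catalog-common';
import {
  Entity,
  getCompoundEntityRef,
  parseEntityRef,
} from '@backstage/catalog-model';

// Provider side: the asynchronous lookup happens once, at ingestion time,
// and the result is stashed on the entity itself in a temporary field.
export async function enrichWithOwner(entity: Entity): Promise<Entity> {
  const response = await fetch(
    `https://ownership.example.internal/owners/${entity.metadata.name}`,
  );
  const { ownerRef } = await response.json();
  return {
    ...entity,
    spec: { ...entity.spec, ownerRef }, // hypothetical temporary field
  };
}

// Processor side: no awaits against external systems, just read the embedded
// data, emit the relation, and remove the temporary field afterwards.
export class FrontLoadedOwnershipProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'FrontLoadedOwnershipProcessor';
  }

  async postProcessEntity(
    entity: Entity,
    _location: LocationSpec,
    emit: CatalogProcessorEmit,
  ): Promise<Entity> {
    const ownerRef = entity.spec?.ownerRef as string | undefined;
    if (!ownerRef) {
      return entity;
    }

    emit(
      processingResult.relation({
        type: 'ownedBy',
        source: getCompoundEntityRef(entity),
        target: parseEntityRef(ownerRef, { defaultKind: 'Group' }),
      }),
    );

    // The carried data was only there to avoid an HTTP GET here; drop it.
    const newSpec = { ...entity.spec };
    delete newSpec.ownerRef;
    return { ...entity, spec: newSpec };
  }
}
```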
That meant updating a load of schemas with data that was never intended to be there permanently. It's only there for the purpose of ingestion, so that the entity doesn't have to make an additional asynchronous call once it reaches the pre- or post-processor. And honestly, in all of our pre- and post-processors, if that data is still there after we've emitted the relations and such, we remove it. It doesn't need to stay as part of the entity; it's just a mechanism for front-loading the asynchronous activity so that you don't do it in the processor. It's still a little ugly, but the results speak for themselves.

What you're seeing here is a Grafana graph of the refresh state table. If you're not familiar with that part of Backstage's internals, the refresh state table tracks the status of each ingested entity, and part of that status is when it's scheduled to be processed. This graph shows a running average of the discrepancy between when an entity is scheduled for processing and when it actually passes through the processing loops, in hours. That's a processing time of 0.57 hours, so on this graph we were down to just a little more than half an hour across hundreds of thousands of entities. That's the total amount of time it takes for an entity at the back of the queue to reach the front and get processed, the total discrepancy between when it was scheduled to be processed and when it actually got processed. We have historical data showing the same graph from a couple of months ago; I'm not going to show it, but it was different: four hours, or sometimes more. Like I said earlier, the longest we ever saw was over a day.

Here's our dev environment, which actually has a few more refinements in it: 0.259 hours. That is 15 minutes. 15 minutes from the point the entity entered the database as an unprocessed entity and was scheduled for processing, to the point where it was actually processed. That graph looks like it's going up a little bit there, but I should tell you something: using those admin tools I mentioned earlier, I manually started all of our entity providers before running this check. So that's with all of our entity providers running.

And if you look at the queue over time, back on July 10th, and yes, I guess I can include some of that data, it was 4.67 hours. That was a painful time. That was how long it took back in July, and we had to tell all of our stakeholders, you're going to wait an average of that much time when you add a new entity before it actually shows up. Now? An average of 20 minutes. On average, in our production environment, 20 minutes. A 92 percent decrease. I thought you might like that. And we're not done; there are still optimizations to be made. We want processing time to drop further. The lowest I've seen, and I think we can get it down there, is seven minutes. I've seen that graph I showed earlier at just seven minutes. I'd like it to stay there, and I'd like it to go even lower if possible. I don't know if we can get there super consistently, but we're sure going to try.
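Coming back to that refresh-state measurement for a moment: the lag in those graphs can be approximated directly from the catalog's refresh_state table. Below is a hedged sketch of one way to do it with Knex, the query builder the catalog backend uses; it assumes Postgres and the standard next_update_at column, and it's an illustration of the idea rather than the exact query behind the Grafana dashboard shown in the talk.

```typescript
import { Knex } from 'knex';

// Approximates the "processing lag" described above: for entities whose
// scheduled processing time (next_update_at) has already passed, how far
// behind schedule are we on average? Assumes Postgres and the catalog's
// standard refresh_state table; the real dashboard query may differ.
export async function averageProcessingLagHours(db: Knex): Promise<number> {
  const rows = await db('refresh_state')
    .where('next_update_at', '<', db.fn.now())
    .select(
      db.raw('avg(extract(epoch from now() - next_update_at)) as lag_seconds'),
    );
  const lagSeconds = Number(rows[0]?.lag_seconds ?? 0);
  return lagSeconds / 3600; // e.g. 0.259 hours is roughly the 15 minutes mentioned
}
```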
So, to summarize: incremental entity providers. More flexible scheduling, more resilient, more manageable, able to handle very large data sources. And here's something that's not on this list: they're due to be open sourced November 1st. Yeah, we're going to release them. Alongside that, reconfigured processors with no asynchronous operations.

I want to talk a little bit about what we learned on this path. We learned that the advice the Backstage community offers out of the box needs to be updated. It needs to scale, the way we eventually managed to scale, from the start. There was a lot of trial and error, and a lot of effort involved in getting from where we began, with the advice and the design patterns we were given out of the gate, to the point of being able to ingest hundreds of thousands of entities without a significant lag and without anything getting bogged down. That default advice people get should scale from day one to day 700, when you've added all those asset sources. And it can. This process might have been difficult, but it also made me a true believer in Backstage. Backstage can do this. We have the graphs to prove it. We have the active, HP-wide implementation to prove it. Backstage is very, very powerful.

So this last one is not really a technical piece of information that we learned; it's more of a social one. The Backstage community needs to start talking to potential adopters not just about the easiest way to get from zero to ingesting, but about the way to get from zero to ingesting that's going to work two years from now. Is the same design pattern going to work in two years? Is it going to scale to all of the assets you're going to add in two years? For companies of HP's size and scale, that change in how we talk about ingestion, and in the design patterns we offer from the start, will likely be the difference between Backstage being a little toy proof of concept that doesn't really scale to what they need, and a full-fledged, enterprise-sized, enterprise-functional deployment.

And now I'd like to take some questions, if I have time. I talked a lot up here. Did I overshoot? A little bit? Oh no. Okay, tell you what: I'll take questions at the end of the day. Anybody who wants to talk to me about this can come up and ask me about it. So thank you all.