All right, everybody, we have Maarten. How's it going, sir? Wave to the crowd. Hey, everyone. Good evening or good morning, or whatever time zone you are in. All right. Take it away, sir. Let's talk about indexing NuGet.

Good. Hello, everyone. Welcome to this talk. I'll quickly move my Skype window away. The idea of this talk is to talk about indexing NuGet with Azure Functions. At JetBrains, we've been building a couple of things that need data from NuGet, and we decided to rebuild something that we've had for a while using Azure Functions. So that's going to be the story for the next half hour.

So what we have in ReSharper and Rider is "find this type on NuGet.org". Whenever you write some code that uses a namespace or a type that comes from a NuGet package, but you don't have that NuGet package installed yet, you can actually search for that namespace and that type from a quick-fix window and then find that type on NuGet.org. So for example, you see on screen here. We can't see your screen. If you use, sorry? We can't see your screen. All right. Let me quickly share my screen. You were so excited, you forgot to share it. Exactly. Well, that's the thing that happens. That's right. Don't worry about it. That's all good. Where do I share my screen? Yes, let's start sharing. Okay, there we go. So can you see now? Yes, we can see just fine. Go for it.

All right. So in ReSharper and Rider, we have "find this type on NuGet.org". As I explained, when you type JsonConvert, for example, and that type is not yet installed, it comes from the Newtonsoft.Json package. We have a quick fix that you can use to find that type on NuGet.org, and ReSharper will offer to install that package into the project you are building. Now, we've had that for a while. It was introduced in ReSharper 9, I believe, somewhere in 2015. Visual Studio came up with something similar in 2017, I believe.

And the thing that we've built consists of, of course, the ReSharper functionality. We have to have that quick fix. We have to have the analyzer to check whether the type you're using is actually installed or available on your system or not. But of course, we also need a service somewhere, a cloud service, and that cloud service should index NuGet and power the actual search that you are doing from the IDE. That thing historically runs on an Azure cloud service. I don't know if anyone is still using them, but they are web and worker roles. Web roles are basically HTTP services that you can host, and they serve up the API that ReSharper is using. And we have the worker role on the back end, basically scanning NuGet.org, downloading all the packages there, and indexing whatever is available.

Now, that indexer uses the NuGet OData feed right now, and I'll quickly show you what that OData feed looks like. It's basically this. You can address nuget.org/api/v2, and you get back a list of packages. Now, what we can do in the URL to that API is select the fields that we want to see. Basically, we are interested in the package ID and package version, so that we can build the download URL for it. And we can order by the last-edited date. So for example, if we wanted to download all the packages from the last hour, we could hit this URL and query NuGet to get that data. Now, one thing you will see somewhere at the bottom of this OData response is a pagination URL.
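To make the shape of that V2 query concrete, here is a minimal sketch of hitting the feed from C#. The field names in $select and $orderby follow the V2 feed as described above, but the feed is deprecated, so treat this as illustrative rather than a reference:

```csharp
// Minimal sketch of querying the (now-deprecated) NuGet V2 OData feed.
// The $select/$orderby field names are assumptions based on the feed as
// described in the talk; verify against the live endpoint before relying on them.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class V2FeedSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // Ask only for the fields we need (ID + version) and order by the
        // last-edited timestamp so we can pick up recent changes.
        var url = "https://www.nuget.org/api/v2/Packages?" +
                  "$select=Id,Version,LastEdited&" +
                  "$orderby=LastEdited%20desc";

        var atomXml = await http.GetStringAsync(url);

        // The response is an Atom/XML document; a pagination link near the
        // bottom points at the next page of results.
        Console.WriteLine($"{atomXml.Length} bytes of OData/Atom received");
    }
}
```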
So you will see that at the end of this document, you will be redirected to the next page for the next 100 results. Now, this is all fine if there are only a couple of packages to be downloaded, but of course NuGet has been growing in popularity, and the list of packages gets really, really long. And if we have to do a re-index of this thing over OData, things get pretty slow, because this is actually hitting the database on NuGet.org; it's basically HTTP over SQL Server. So that's not ideal.

Let's jump back to slides. NuGet has been growing over time. One of our developers reported somewhere last year that NuGet was 1.9 terabytes back then. I think by now it's over two terabytes in size. So NuGet has been growing, but also the OData feed that we are using in this indexer is being deprecated soon. So we need to do something. We have to make this better and use a different API to fetch the data that we will actually need.

So let's talk a bit about the NuGet server-side API. We've already seen the OData protocol, but there's also the new V3 protocol. That V3 protocol is JSON-based: no more OData, no more XML. And it's really an entry point into different resources in NuGet. If you go to api.nuget.org/v3/index.json, you will find a list of resources that provide endpoints for various purposes. For example, if you're using search in Visual Studio or in Rider or in VS Code, you will hit the search endpoint that is defined in this v3/index.json response. Same thing with autocomplete: if you're editing your packages.config file or your project file and you're making use of the autocompletion in there, you're probably hitting the autocomplete endpoint defined in here. There's also one for package restore, because it's not very useful to use the search endpoint, and burn all that compute, to do a package restore when you basically already have the package IDs and package versions. So there's an endpoint that supports that scenario.

And the one thing that we are really interested in is the catalog, which is currently only found on NuGet.org. The catalog is an append-only event log of every event that happened on NuGet.org since the beginning of time. So every package addition, package update, and package delete is described in that catalog. The nice thing about the catalog is that it is chronological. Everything you see is ordered by time, which makes it very easy to use a cursor, for example, to say "index everything that was added in the last 24 hours", because we can simply point our cursor at that timestamp, start getting all the data from that point, and continue where we left off. This also allows restoring NuGet.org to a given point in time. So if you are doing data mining or whatever, the catalog may be an interesting endpoint if you want to, for example, restore NuGet.org to the state it had on the 1st of January 2017.

Now, the catalog is based on a specific structure. You have the root of the catalog, which I'll show you in a bit; that's basically a collection of all the URLs to whatever pages are in the catalog. You will see a URL and a timestamp, and if you hit a page, you will see a list of leaves. The page itself is, again, a collection of just package ID, package version, and links to more details about that package. And then, of course, the leaves themselves provide all the info that we need. So let me quickly show you what this looks like and how we can use this programmatically.
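As a small sketch of that entry-point idea, this is roughly how you could read the service index and locate the catalog resource yourself. The "Catalog/3.0.0" type string is my recollection of the documented resource type, so verify it against the live index.json:

```csharp
// Sketch: read the V3 service index and find the catalog root URL.
// The "Catalog/3.0.0" @type value is an assumption based on the documented
// resource types; check it against https://api.nuget.org/v3/index.json.
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class ServiceIndexSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var json = await http.GetStringAsync("https://api.nuget.org/v3/index.json");
        using var doc = JsonDocument.Parse(json);

        // Each resource entry has an @id (its URL) and an @type describing it.
        var catalogUrl = doc.RootElement
            .GetProperty("resources")
            .EnumerateArray()
            .First(r => r.GetProperty("@type").GetString() == "Catalog/3.0.0")
            .GetProperty("@id")
            .GetString();

        Console.WriteLine($"Catalog root: {catalogUrl}");
    }
}
```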
So this is the index.json for NuGet.org. This is the main endpoint that you will see, and it has all those resources. For example, if you want to use the search query service, you can hit this URL, but we are interested in the catalog. So let's find the catalog in there, and we will find the root of the catalog. Let's go there. And what you will see in the catalog, like I already described, is a set of catalog pages, a commit ID, and a timestamp. So you will see the URL to a certain catalog page, a timestamp, and so on.

So let's open a random page. Let's pick this one. Huh, that's a really big JSON file. So let's pick this one and go to that catalog page. And what you will see in there is, again, a list of packages. You see the package ID, the package version, and there's a commit ID and a commit timestamp. So again, using a timestamp, we can fetch packages that were added or deleted at a specific time. One interesting thing to note is this type attribute here. You see that this is "PackageDetails". Normally, when NuGet adds or updates a package, you will see this as the type. But of course, for compliance reasons and so on, NuGet also supports package deletes. So occasionally, you will also find "PackageDelete" as the type there.

Now of course, just the ID and the version is probably not enough to get all the metadata that we want in our index. So we can navigate to the ID here, and we will find all the package details. Again, this is a PackageDetails type, so an addition or an update of a package. The package ID is in there, but also the package icon, the hash, where we can find the package, what dependencies are in there, and so on. So we don't really have to download the binary to get some basic information about a NuGet package. We can simply use the catalog, run through it, and see what is there.

Now of course, navigating raw JSON is not something that we may want to do by hand. So what we can do is make use of a NuGet package called NuGet.Protocol.Catalog. I'm using the sources here in my project, but you can simply install it into your project and start using it. So what we can do is write a program that uses an HttpClient to access the internet. Of course, we want to store an in-memory cursor of where we left off in the catalog. By default, it will start from the beginning of time, but just for the sake of the demo, let's say that we want to use a DateTimeOffset of UtcNow minus one hour. So we get the packages from the last hour from NuGet.

And then we can create a catalog processor. This thing contains some plumbing to make use of the HttpClient that we have, and so on. But what it will do is download the catalog index, navigate through exactly the pages that match the cursor timestamp, and then, for every package addition and deletion, make a callback into our code. So for example, when a package is added, we get the package, the version of the package, the package ID, and so on. So we can write something to the console and do something with it. Same thing for deletes. We get some information about the package delete. Not as much; it's just going to be the ID and the version, but still, we get notified about the fact that a delete happened on NuGet. And if we run this, we will see that we get the packages from NuGet.org from the last hour. So let's quickly run through this. This is usually pretty fast because it doesn't really have to query NuGet.
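A sketch of that console program follows. The type and member names (CatalogClient, CatalogProcessor, CatalogProcessorSettings, InMemoryCursor, ICatalogLeafProcessor) are from the NuGet.Protocol.Catalog source package as I recall its API; exact signatures may differ between versions, so treat this as a sketch rather than a reference implementation:

```csharp
// Sketch of consuming the catalog with the NuGet.Protocol.Catalog package.
// Type and member names follow that package's API as recalled (it ships as
// a source package); verify signatures against the version you pull in.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging.Abstractions;
using NuGet.Protocol.Catalog;

class ConsoleLeafProcessor : ICatalogLeafProcessor
{
    // Called for every package addition or update in the catalog.
    public Task<bool> ProcessPackageDetailsAsync(PackageDetailsCatalogLeaf leaf)
    {
        Console.WriteLine($"Added/updated: {leaf.PackageId} {leaf.PackageVersion}");
        return Task.FromResult(true); // true = keep processing
    }

    // Called for every package delete (only ID and version are available).
    public Task<bool> ProcessPackageDeleteAsync(PackageDeleteCatalogLeaf leaf)
    {
        Console.WriteLine($"Deleted: {leaf.PackageId} {leaf.PackageVersion}");
        return Task.FromResult(true);
    }
}

class CatalogSketch
{
    static async Task Main()
    {
        var httpClient = new HttpClient();
        var catalogClient = new CatalogClient(httpClient, NullLogger<CatalogClient>.Instance);

        // In-memory cursor: start one hour back instead of the beginning of time.
        var cursor = new InMemoryCursor();
        var settings = new CatalogProcessorSettings
        {
            DefaultMinCommitTimestamp = DateTimeOffset.UtcNow.AddHours(-1)
        };

        var processor = new CatalogProcessor(
            cursor, catalogClient, new ConsoleLeafProcessor(), settings,
            NullLogger<CatalogProcessor>.Instance);

        await processor.ProcessAsync(CancellationToken.None);
    }
}
```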
It just points at a certain JSON file, parses the JSON, and starts outputting data. So in the last hour, a number of packages were added, apparently.

Right. So, "find this type on NuGet.org" in ReSharper and Rider. Since OData is going away on NuGet.org, we have to migrate it to make use of V3. And we've actually been doing that. There are a couple of things still missing in the APIs, but we'll get there and have our worker role and our web role updated to make use of the V3 protocol. Now, we were also thinking: OK, this thing is still running on cloud services. Maybe we should build a new version of this thing that makes use of the V3 protocol and this catalog to fetch its data. And that's going to be the remainder of this talk.

So let's build a new version. What do we need? We need something to watch the NuGet.org catalog for changes in packages. For every package change, we want to download the package. Of course, crack it open; it's just a zip file. And then scan all the assemblies that are in there. So we want to open up the assemblies and look at what public types are available. And then store somewhere the relation between the package ID and version and the public namespaces and types that are in there, so that we can search for them later on in the API. That, of course, has to be compatible with all ReSharper and Rider versions going back to when we introduced this feature.

Now, watching the catalog can be a periodic check. We could do this every day, every hour, every five minutes; it doesn't really matter. But it's a periodic check where we just point our code at the catalog and fetch whatever data was added or deleted since the last time we ran that check. Now, that's going to be one process, just reading from the catalog. But maybe there are 1,000 additions. I can imagine that when .NET Core 3, for example, shipped this week, all of a sudden a lot of package changes were present in the catalog. So maybe we want to fan that out and have multiple processes running based off a queue, so that for every package change, we can run that code as quickly as possible to index that package into our own index that we use in ReSharper and Rider. Of course, the API that we want to build can be anything. It could be a web API, could be something that we run in Azure Functions, or whatever; it doesn't really matter that much, as long as we can serve up some JSON and handle some HTTP requests.

Now, those first two things, the periodic check and the queue that makes something happen based on a message appearing in it, kind of sound like a job for Azure Functions. So let's see what we can do. We will have the NuGet.org catalog as our starting point, and then build a function that watches the catalog and fetches the changes that are in there, looking at whatever additions and deletes happen in the catalog. Then for every update and delete, we create two messages on two different queues: one on an index queue, one on a download queue. The download queue is not really important, but we want to keep a copy of all the binaries so that we can quickly restore our index if needed, and open packages locally instead of having to hit the NuGet.org endpoint every single time we want to re-index, for example, a specific binary. Now, the more interesting one is going to be the index command queue.
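Before looking at the functions themselves, it helps to pin down what those queue messages could carry. This PackageOperation shape is purely illustrative; the name and fields are assumptions for this sketch, not the actual contract from the project:

```csharp
// Illustrative message contract for the index and download queues. The
// PackageOperation name and its fields are assumptions for this sketch;
// the real project may use a different shape.
using System;

public enum PackageOperationType
{
    Added,   // package was added or updated (catalog "PackageDetails")
    Deleted  // package was deleted (catalog "PackageDelete")
}

public class PackageOperation
{
    public PackageOperationType Type { get; set; }
    public string PackageId { get; set; }
    public string PackageVersion { get; set; }
    public string PackageUrl { get; set; }              // where the .nupkg can be fetched
    public DateTimeOffset CommitTimestamp { get; set; } // catalog commit time, for cursor bookkeeping
}
```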
The index command queue will have a function listening for whatever messages pop up in there, which then indexes the package, stores everything in a search index, and makes it available to the API that we will use in the end, of course.

Right, so let's collect from the catalog in an Azure Function. What we can do is build, for example, a new timer trigger. Let's say this is going to be .NET Conf; let's call it that. So we can create a .NET Conf timer trigger, and our timer trigger will run every... yes, let's add this to Git... will run every, let's say, every second, so that we have something to demo. Let's call this one "timer", remove some of this, and then add some code. So this is roughly the exact same code that we had before, only this time it will run in Azure Functions. This is how easy it is to create an Azure Function and have something running serverless. What we do here, again, is make use of that HttpClient that we want to use in our function. We again make use of that catalog processor, and we will just dump to the console the fact that a package was added or deleted. Of course, we would have to wire this up to the queue so that we can actually process the package that was added, but this is going to be good enough for this demo.

Now, the more interesting thing here is that the in-memory cursor has been redefined. The TimerInfo we get from the Azure Functions runtime contains the last time our function ran. So the first time it runs, it will be the minimum value, and we will start reading the catalog from the beginning of time. But the next time it runs, ScheduleStatus.Last will contain the last time the timer ran, and we can easily start from where we left off and continue processing the packages in the catalog. So let's quickly run this function and see what it does. If all goes well, we should see packages pop up in our console, coming from the Azure Functions runtime. Hmm, it didn't do anything. Yes, this has to be async. So let's make this async. All right, there we go. Let's run that again. So what we will see is some output coming from the catalog, live packages as they were added since the beginning of time. We will see the first packages that were ever added to NuGet.org appear, because our ScheduleStatus is going to be empty. So let's wait for this to boot up, and you will see our host runs, Functions runs, our timer gets fired, and our function is executed. So that's roughly it. We have a timer, and every couple of minutes, every couple of seconds, we fire that timer and fetch new packages from the catalog on NuGet.org. So that's pretty easy.

Now, there are a couple of things that I want to note here. First of all, Functions best practices. I don't know if anyone knows Paul Johnston, but he worked at Amazon, and he has a lot of blog posts on serverless best practices. They apply to AWS, but also to Azure and to whatever other functions platform you may use. In one of his blog posts, he mentions that each function should do only one thing, and do it well. This makes it easier to do error handling, because if a specific function fails, it's easy to pinpoint which function is failing in the entire chain of functions that you may be running. But it's also easier to scale your functions.
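Here is a sketch of that timer-triggered collector, reusing the catalog types from the earlier console sketch. The function name and schedule are demo choices, and the way ScheduleStatus.Last is mapped onto the minimum commit timestamp is an assumption about how the real code wires things up:

```csharp
// Sketch of the timer-triggered collector. The CRON schedule and names are
// demo choices; the catalog types are those from the earlier
// NuGet.Protocol.Catalog sketch. ScheduleStatus.Last replaces the cursor.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using NuGet.Protocol.Catalog;

public static class CollectPackagesTimer
{
    private static readonly HttpClient HttpClient = new HttpClient();

    [FunctionName("CollectPackages")]
    public static async Task Run(
        [TimerTrigger("*/1 * * * * *")] TimerInfo timer, // every second, demo only
        ILogger log)
    {
        // First run: ScheduleStatus.Last is the minimum value, so we read the
        // catalog from the beginning of time. Later runs continue from the
        // previous invocation's timestamp.
        var lastRun = timer.ScheduleStatus?.Last ?? DateTime.MinValue;

        var settings = new CatalogProcessorSettings
        {
            DefaultMinCommitTimestamp =
                new DateTimeOffset(DateTime.SpecifyKind(lastRun, DateTimeKind.Utc))
        };

        var processor = new CatalogProcessor(
            new InMemoryCursor(),
            new CatalogClient(HttpClient, NullLogger<CatalogClient>.Instance),
            new ConsoleLeafProcessor(), // logs additions/deletions, as before
            settings,
            NullLogger<CatalogProcessor>.Instance);

        await processor.ProcessAsync(CancellationToken.None);
    }
}
```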
For example, if we have our timer trigger fetching data from the catalog and our next function doing the actual indexing, well, the indexing function may be scaled out separately and may run more quickly. Also, by doing this, we learn to use messages and queues, because every function has to communicate with another one. Ideally that's not going to be RPC-based; we want them to communicate asynchronously, and that's typically going to be something using messages or queues.

Now, talking with inputs and outputs can be done using bindings. We've already seen the timer trigger binding in this presentation, which is an attribute that we add and that the Functions runtime uses to trigger our function at a specific interval. There's also an HTTP trigger binding: when an HTTP message comes in, our function runs in response to it. Those are trigger bindings; they start the logic that we want to run. But there are also input and output bindings. For example, if we want to make use of a blob in Azure Storage, we can use an input binding that gives us a stream, instead of us having to write all of the logic, all of the retry logic, to reach out to an Azure Storage account and fetch the data from there. Same thing with outputs. Why should we care about creating a new blob, making sure that the container exists, handling retries, and so on, if we can just ask the Azure Functions runtime for a stream that we write to and that gets stored in blob storage automatically?

You can build your own input and output bindings. I have seen SQL Server bindings; there's a Dropbox binding that I found out there. But we could also build our own custom trigger binding. Ideally, what I would want to do is remove all of that clutter from the code that I just showed you, and simply write a trigger binding that reads from the NuGet catalog and, whenever a package appears, calls into my function and gives me a package that I can process. Now, custom trigger bindings are not officially supported in Azure Functions. A reason for that is that they do not play well with the consumption plan. Because a trigger has to run continuously, it has to run on a provisioned instance, and only then will it be able to call into the functions that you have. We've been running this thing since May, and so far it has been working perfectly. So as long as you have a provisioned instance, it actually works.

Let's see if we can create a trigger binding that makes it easier to make use of all this stuff. I'm going to delete this one here and show you how we could create a trigger binding. What we want to do is create a binding that does something like this, where we have our function that will run, and it will trigger based on a NuGet catalog trigger that has a cursor blob name, so that we store somewhere the timestamp of when this function last ran, plus a couple of arguments that we may want to pass in. For demo purposes, I added this PreviousHours property so that I can get data from the last hour. Now, what will happen is: if this function is triggered because of this NuGet catalog trigger, all we get is a package operation that contains the ID, the version, and so on of the package that I want to process later on in the pipeline. And that's roughly it; my function will trigger based on something in the catalog, as the sketch below shows.
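A sketch of what that function could look like, covering both the trigger and the queue hand-off described next. The NuGetCatalogTrigger attribute, its properties, and the queue names follow the talk's description, but their exact spelling is an assumption; Queue and IAsyncCollector are standard WebJobs SDK pieces:

```csharp
// Sketch of the catalog-watching function: a custom NuGetCatalogTrigger
// feeds us one PackageOperation at a time, which we forward to the index
// and download queues. Attribute and queue names are assumptions.
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class WatchCatalogFunction
{
    [FunctionName("WatchCatalog")]
    public static async Task Run(
        [NuGetCatalogTrigger(CursorBlobName = "catalog-cursor.json", PreviousHours = 1)]
        PackageOperation packageOperation,
        [Queue("index-queue")] IAsyncCollector<PackageOperation> indexQueue,
        [Queue("download-queue")] IAsyncCollector<PackageOperation> downloadQueue)
    {
        // One catalog watcher instance fans work out to many queue consumers.
        await indexQueue.AddAsync(packageOperation);
        await downloadQueue.AddAsync(packageOperation);
    }
}
```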
I get the data, and all my function does is add that package operation to those two queues that I want to have: the indexing queue and the downloading queue. Again, the catalog watcher will only be one function instance running, but I still want to be able to fan out all the work that we have. So that's why this function is basically passing things along from the catalog trigger into a queue, and based on that queue, we can of course fan out and have multiple functions listening for whatever package operations are added.

Now, let's look at this NuGet catalog trigger. The catalog trigger is a real attribute, like you have in C#, and that attribute contains a couple of things. First of all, some properties with all of the metadata that our binding will need later on in the process. What we need is the service index URL, which by default will be the api.nuget.org one. We need, for example, this PreviousHours thing so I can scan, say, the last hour. And a connection to a storage account, because somewhere we want to store a JSON file that contains the timestamp of the latest package that we have processed. Interesting here is the AppSetting attribute that we add on Connection. This basically tells the Functions runtime to bind the Connection property to a setting that we may configure in the Azure portal, for example. The attribute itself is not very special; it's just the metadata that is required to run our binding. The only thing we have to do is tell the Azure Functions runtime that this is an actual binding, so we add this Binding attribute down there.

The next thing we have to do is make sure that this binding has a binding configuration. So we want to write a NuGetCatalogTriggerExtensionConfigProvider, and I know that's a very long name, but there is some magic going on with string-based conventions: if I have a NuGetCatalog binding, I will have to have a NuGetCatalogTriggerExtension and so on. I define this as an extension to the Azure Functions runtime. Now, this configuration really tells the runtime what our binding is going to do. So what we do is say: OK, Functions runtime, whenever you see a NuGetCatalogTrigger attribute, I want you to bind this to an actual trigger, and that's going to be the trigger binding provider.

Now, this trigger binding provider gets added into our runtime by the Azure Functions runtime, of course, and is where a lot of the work happens for the custom binding that we are creating. What it does is take a name resolver; that's basically to handle some magic parameters that our binding may use. It wants a storage account provider so that I can actually write that cursor blob into storage, and then, of course, a logger so that I can write some things to the logs. Whenever the Functions runtime initializes our trigger binding, it calls into TryCreateAsync, and we can use that to verify that we're actually making use of our NuGetCatalogTrigger attribute, so that we have all of the metadata that we require. This binding provider simply makes sure that everything is in order and then creates the actual binding implementation. We return that to the runtime, and the runtime boots it up and makes sure that it can run and fetch data from our catalog.
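For reference, the attribute just described could look roughly like this. [Binding] and [AppSetting] are the WebJobs SDK extensibility attributes; the property set mirrors the talk and is an assumption about the real implementation:

```csharp
// Sketch of the trigger attribute. [Binding] marks it as a WebJobs binding
// and [AppSetting] tells the runtime to resolve Connection from application
// settings; the properties mirror the talk's description.
using System;
using Microsoft.Azure.WebJobs.Description;

[Binding]
[AttributeUsage(AttributeTargets.Parameter)]
public sealed class NuGetCatalogTriggerAttribute : Attribute
{
    // Service index to watch; defaults to the official NuGet.org V3 endpoint.
    public string ServiceIndexUrl { get; set; } = "https://api.nuget.org/v3/index.json";

    // Blob that stores the cursor timestamp so processing resumes after restarts.
    public string CursorBlobName { get; set; }

    // How far back to scan, for demo purposes.
    public int PreviousHours { get; set; }

    // Resolved from app settings (e.g. configured in the Azure portal)
    // rather than used as a literal connection string.
    [AppSetting]
    public string Connection { get; set; }
}
```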
The binding itself takes parameters, the service index URL, and so on; basically all of the properties that we had on our attribute, and it makes use of those to create an actual listener. The trigger binding itself is not very special, in that the only thing it does is create the actual listener that checks the catalog pages and makes sure that our function can be called. But it is kind of important if you are mixing multiple languages on Azure Functions. For example, if you are writing the binding in C#, but you're writing your actual function using Node.js, or Python, or even PowerShell, what you also do here is define what the data contract for our binding looks like. So here we say we have Data, which is going to be the raw JSON that we get from the catalog, but also that we have an Id property of type string, a Version property of type string, et cetera. This really defines for the runtime what the data format coming from our trigger binding looks like, so that we can use this binding in languages other than C#. If you're only concerned about C#, your code will probably look something like this, where you just pass along the data, but it's always nice to be able to use the other languages in the Functions runtime as well.

The meat of the implementation is in the NuGet catalog listener. This one does whatever we've been doing so far. What we have here is, again, our processor. The processor makes use of a static HttpClient in this class, and whenever a package is added, it calls this local function; whenever a package is deleted, it calls the package-deleted local function. Now, one thing I added here is a cloud blob cursor, because again, we want to make sure that if our function restarts, if the host restarts, we can continue where we left off in the NuGet catalog. We want to store that cursor somewhere in a blob. So this is an implementation of the cursor class from the NuGet package, to basically get the timestamp of where we left off. And as you can see, in storage we will also find a blob in our storage account that holds that timestamp. Let me quickly show you in the storage emulator. I think this is going to be the file there. So this is a simple JSON file containing that timestamp.

Anyway, back to our catalog listener. This runs using the exact same logic that we had before, but what it does now is call into package-added or package-deleted. Those are roughly doing what we did before, but instead of writing to the console, they call executor.TryExecuteAsync. So what they do is tell the Azure Functions runtime: please try to execute a function for me using the data that I give you, which is going to be my package ID, package version, package URL, and so on. And the Functions runtime, through this executor instance, handles the fact that the function has to execute, and our trigger binding runs. This makes our code much more readable. I know it's a lot of work to build this binding and make sure that all of that works, but the actual implementation of our function itself is now much cleaner, because all we do is define that NuGet catalog trigger, we get a package operation in there, and then we append it to a queue.
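A sketch of that hand-off from the listener to the runtime, assuming the WebJobs ITriggeredFunctionExecutor and the illustrative PackageOperation contract from earlier; the class and method names here are placeholders, not the project's actual code:

```csharp
// Sketch: the listener asks the Functions runtime to execute the user's
// function for each catalog leaf. ITriggeredFunctionExecutor and
// TriggeredFunctionData come from the WebJobs SDK; class/method names
// are placeholders for illustration.
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Host.Executors;
using NuGet.Protocol.Catalog;

public class NuGetCatalogListenerSketch
{
    private readonly ITriggeredFunctionExecutor _executor;

    public NuGetCatalogListenerSketch(ITriggeredFunctionExecutor executor)
    {
        _executor = executor;
    }

    // Called by the catalog processor for every package addition/update.
    private async Task<bool> PackageAddedAsync(PackageDetailsCatalogLeaf leaf, CancellationToken token)
    {
        var input = new TriggeredFunctionData
        {
            TriggerValue = new PackageOperation
            {
                Type = PackageOperationType.Added,
                PackageId = leaf.PackageId,
                PackageVersion = leaf.PackageVersion
            }
        };

        // Ask the runtime to execute the bound function with this payload.
        FunctionResult result = await _executor.TryExecuteAsync(input, token);
        return result.Succeeded;
    }
}
```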
So instead of having multiple lines of code, all we have to do here is simply write that into a queue and then make use of it.

Right, the indexing itself is going to be easy. It reads from the queue that we just populated, downloads the NuGet package, and opens it up; it's a simple zip file. All of the assemblies in there we read using System.Reflection.Metadata, another really nice NuGet package to get metadata from assemblies without having to load them using reflection. And then we store the relation between package ID, version, namespace, and type in Azure Search and make it available. The next step is of course to make the API compatible with all ReSharper and Rider versions, but in the interest of time, I will not dive into the details of that one. If you want the full story of how this thing was built in its entirety, I have that on my blog, and you can find the full implementation there if you want to go through the code base as well.

The only thing I quickly want to show you is that this actually works with ReSharper now. If we start ReSharper in internal mode, we can point it to a different test server URL. And if we now make use of, for example, a new JObject, and we want to create a new instance of that, we will see that if we use ReSharper's quick fix to find this type on NuGet.org, it actually queries the functions that we just created, with the data that we just indexed, and makes it available for us to install into our project. And we also get the metadata if we want. We can see basically whatever the catalog provided us, plus the fact that JObject is inside that package.

Right, a couple of learnings from this thing, and then we'll wrap up. We have built all the functions that we want: one to collect changes from the catalog, one to download the binaries for later use if we need them, one to index the binaries into Azure Search, and one API function to make it available as an API. Every function should do one thing. We created a couple of bindings for that. One was the NuGet catalog trigger binding that triggers our function to run when new packages appear. But we also created another binding to store data into Azure Search, because again, Azure Search needs retry logic, you have to make sure that the index exists, and so on. We abstracted that away into an output binding that we can reuse more easily, so we simply pass along objects and have the Azure Functions runtime plus our binding handle that.

All of our functions can scale, and, also important, fail independently. We have the indexer that runs in a separate function, the catalog watcher that runs separately, and the API that runs separately. So even if the indexing fails, we still have all the data in search, and the API keeps running, and so on. We did a full index of NuGet.org in May 2019, so this year, and that took about 12 hours on two B1 instances. Now, if you're familiar with Azure instance sizes, the B1 instances are not very powerful. So it's still impressive that using Functions, we were able to run this in 12 hours, while currently, I think, it takes us about two weeks to run the entire indexing process on top of the OData feed. So this is infinitely better. I think this could be faster on multiple CPUs and better machines, but 12 hours is still a huge step up from what we have currently. There are about 2.1 million unique packages in the NuGet catalog.
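To make that indexing step concrete, here is a minimal sketch of opening a .nupkg as a zip and enumerating public types with System.Reflection.Metadata. Error handling, nested types, and non-assembly entries are glossed over; this is a sketch of the technique, not the project's actual indexer:

```csharp
// Sketch of the indexing step: open the .nupkg as a zip and enumerate
// public top-level types with System.Reflection.Metadata, without loading
// assemblies via reflection. Edge cases are deliberately omitted.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Reflection;
using System.Reflection.Metadata;
using System.Reflection.PortableExecutable;

public static class PackageTypeScanner
{
    public static IEnumerable<string> GetPublicTypes(string nupkgPath)
    {
        using var zip = ZipFile.OpenRead(nupkgPath);
        foreach (var entry in zip.Entries)
        {
            if (!entry.FullName.EndsWith(".dll", StringComparison.OrdinalIgnoreCase))
                continue;

            // Zip entry streams aren't seekable, so buffer into memory first.
            var buffer = new MemoryStream();
            using (var entryStream = entry.Open())
                entryStream.CopyTo(buffer);
            buffer.Position = 0;

            using var pe = new PEReader(buffer);
            if (!pe.HasMetadata)
                continue;

            var reader = pe.GetMetadataReader();
            foreach (var handle in reader.TypeDefinitions)
            {
                var type = reader.GetTypeDefinition(handle);

                // Keep only publicly visible, top-level types.
                if ((type.Attributes & TypeAttributes.VisibilityMask) != TypeAttributes.Public)
                    continue;

                var ns = reader.GetString(type.Namespace);
                var name = reader.GetString(type.Name);
                yield return string.IsNullOrEmpty(ns) ? name : $"{ns}.{name}";
            }
        }
    }
}
```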
And we found about 8,400 catalog pages and 4,200,000 catalog leaves. Now, you may wonder: if there are only 2.1 million unique packages, but 4,200,000 catalog leaves, what happened? Why are packages duplicated? Well, if you go to the NuGet blog, you will see that at some point last year, they started repository-signing packages. So what they did was add a signature to every single NuGet package. And of course, that triggered an update in their systems and basically caused every package that was on NuGet.org up to that point to appear in the catalog twice: once without the signature and once with the signature that was added.

What we also learned was that deploying this in separate function apps is interesting, and not only because of fault boundaries. We're gonna need to wrap up here. We're a little bit over time. We're gonna need to wrap up so we let the next person have it. Maybe 10 seconds? Yes. Perfect. So the nice thing about running this in separate function apps is that we have different fault boundaries, but it's also interesting in terms of cost. Our trigger has to run on a provisioned instance, whereas the indexing can run on on-demand instances, so basically the pay-per-use model. If no one is using the indexer, if no one is using the API, we don't pay for those instances. So it's very nice to use that. And also, if you're building Azure Functions, make use of those bindings, because they really simplify the actual code of your function and make it easy and nice to read and work with. With that, thank you. Again, find the full story on my blog, and hit me up on Twitter if you have any questions. Thanks.

All right, Maarten, that was great. Thank you. Thank you so much. I love seeing how much you were able to simplify by building the new binding, the new trigger there. Yeah, there's a bit of code there, but it really makes it nice to write a couple of lines in your function and be done with it. Absolutely. All right, that's great. Well, thanks so much. We're gonna get ready for our next speaker. Thank you for having me, and have a great remainder of .NET Conf. All right, catch you later, Maarten. See ya. We'll be right back.