Okay, we are recording now. Hi all, welcome to the Cloud Native Special Interest Group meeting. Today we have a presentation by Sumit, who will present external fingerprint storage. This is a project he is working on as part of Google Summer of Code. We will have a demo, and after that we will spend some time on the design review and discussion. I guess that's the plan. Okay, so the floor is yours.
Before we begin, how do you prefer to take questions during the presentation: right away or afterwards?
I'm fine either way. I also have a Q&A slide at the end.
Since we don't have so many people on the call, let's just answer them on the fly.
Perfect. So can I share my screen?
Yeah.
Is my screen visible?
Yes, it is.
Okay, great. Hi everybody. Welcome to the Cloud Native meeting and today's demo of the External Fingerprint Storage project, which is one of the GSoC projects for this year. For the agenda today, I'll start with my personal introduction, then what exactly fingerprints are, what the current architecture is, and what our vision for the project is, followed by a demo and Q&A.
So who am I? I'm the GSoC student for this project. I'm currently pursuing a bachelor's in instrumentation and control engineering at NSIT Delhi. I started contributing to Jenkins in December 2019, and most of my contributions have been around fingerprints, which got me interested in pursuing this project for GSoC. I'm really excited about it. We had a community bonding period of about one month, maybe less, and we have spent almost four weeks coding. So I'll be presenting that work today. For anybody who is new to fingerprints, I'll start with a small introduction to fingerprinting.
Fingerprinting in Jenkins is a way to track files, artifacts, or anything else across the CI/CD flow. Say you want to track which artifact was used in a particular build by some other job; you need to know which version it used, right? Fingerprints help us identify these files and track their versions, and they can be a huge benefit to different plugins. They can also be a huge benefit in tracking down the source of an issue. Maybe our developers have fixed something in a particular build and another team is on another build; fingerprinting helps in that way.
This is how the fingerprints UI looks in Jenkins. Say you have a job called demo and you go to a particular build. You can click on "See Fingerprints" beside it, and it will show you all the fingerprints that build recorded. You can click on any of those fingerprints and you'll get a view of its usage. Here it says number one to number three, so it's showing that the file was used in builds one, two, and three of the job demo. That is how fingerprinting is presented in Jenkins.
So what is the current fingerprint storage? The main crux of this project is that fingerprints are currently stored locally, in XML files inside the Jenkins home, and that has all the issues that come with relying on a physical storage disk. As fingerprints grow, they consume the storage of the Jenkins master and increase reliance on physical storage. We cannot configure them with cloud storage, which offers benefits like pay-as-you-use pricing and is often cheaper. With external databases we could even configure replica sets, which helps with reliability and availability.
Lastly, we currently cannot track fingerprints across Jenkins instances. That is another aspect of this project, which may come in a later phase: tracking fingerprints across different Jenkins instances. That is also something we want to solve in this project.
So what is the vision for our project? As shown in this diagram, we want a pluggable architecture for fingerprint storage, so that a storage plugin can implement it for a particular storage and then store the fingerprints there. Say we have a MongoDB storage plugin, or MySQL, or a Redis-based one; any plugin can come in and we can use it to store our fingerprints. Then maybe down the line we build another plugin, the external fingerprint storage API plugin, so that we can track all of this flow and represent it in a good manner. That is something we'll do as we go. So this is the current model and how we are going to change it.
Right, now I'll go into the demo. We currently have a reference implementation for our plugin, which is backed by Redis. It lives in this repository, the Redis Fingerprint Storage plugin, inside the jenkinsci organization. We also have a guide for running it locally, which I'll basically just walk through. The pluggable API that we are building on the Jenkins core side is available as an incremental build. It's still not merged, but it exists as a pull request; I think it's 4731 in Jenkins core. So it's still a beta API. I've cloned this repository to my local desktop, so what I'll do is build the plugin and run it.
It takes a few seconds to start up. So it's ready.
Maybe you could just refresh the page, because development mode sometimes shows odd behaviors that would not occur in production.
So we have Jenkins. We'll go to Manage Jenkins and Configure System. We have a configuration section for our plugin; here it is. We can configure the host, port, SSL, and database (Redis has integer-indexed databases), and we can change timeouts. We can add credentials, and we can test the connection. It's failing because I don't have an instance set up, so I'll just start an instance locally. It's on 6794. Now the test is successful. I'll hit apply and hit save. I'll also start a command-line interface into Redis, select database zero, and see what we have there. As we can see, it's currently empty, right?
What we can do now is create a new job. Let's call it demo, and I will select freestyle project. I'll add a post-build action to record fingerprints. demo.txt is the file that will get fingerprinted. I'll apply it and hit save. So now I have this job and I can start a build for it. And I get a success. If I look into this build and go to "See Fingerprints", I can see that demo.txt is recorded and that it was used in build one of this particular job, right?
So now in my Redis CLI... okay. I know what happened. I forgot to enable this plugin, just a second. In the configuration I forgot to enable it. We will probably be moving the enabling portion to Jenkins core, but for now I had to tick this box. I hit apply and I hit save, right? Some clashes may happen, so I'll just create a new job. I'll just call it demo.
So right now you still have an in-memory representation of the fingerprints, right?
Right. Yeah, I can show that. We have this here. As we can see, I have this fingerprint for demo.txt, which was stored locally. That should not happen in this new build. So I'll just record the fingerprints for demo.txt, hit apply, and hit save. Now in this new job that I've configured, I'll start a build. It's a success, and you can see fingerprints for demo.txt in build one, right? Now let's see what happens in our CLI. Yes, now I'm getting a fingerprint. We can check what it stores: it has the XML blob for the fingerprint. And just out of curiosity, let's see whether a fingerprint was recorded locally or not. This time I'll go to the fingerprints. This is the same old fingerprint for demo.txt; I don't have any other. So no fingerprint was recorded locally for the new job that I created. We can also configure, say, the Docker build and publish plugin; if it records fingerprint facets, they are also handled and also get saved in the Redis instance. So I think that's about it for the demo.
Yeah, go ahead.
You answered my question anyway.
I want to talk about next steps. As of now we have not implemented any SaveableListener support. There is a component called SaveableListener inside Jenkins that fires events at particular times. It does not get fired right now because its API requires us to pass an XML file to it. So maybe we'll pass a virtual file, or we might change the API altogether.
So that is one of the next steps for us. Another is extending the API: we want to look at plugins and see what API we can introduce to help them down the line. Then fingerprint cleanup, which currently happens only for file storage: as old builds get discarded, their fingerprints get discarded routinely, and that cleanup facility is not implemented yet for external storage. Then we want to tackle migrations. When a person configures an external storage, the fingerprints that are present locally need to be migrated to the external storage; or they may go back from external storage to internal storage, or configure a new plugin. These are the migrations we want to tackle. Tracing across Jenkins instances is something I discussed earlier. And ORM: at the moment we leave it to fingerprint storage plugins to decide how they want to save fingerprints, and the Redis plugin right now saves them as XML blobs. Maybe we want to change that to something more portable in the database. These are some of the things we might be working on in the next few months. I have some links at the end of the slides, but before that, if anybody has any questions or suggestions, that would be really nice for our project.
Thanks for the presentation. Any questions or comments from others? Sumit, maybe you could share the current JEP, because it's one of the important topics. We would really like to land a Jenkins Enhancement Proposal for this project, similarly to the other pluggable storage JEPs. For that, I guess it still needs some feedback from potential stakeholders so that we can make sure the JEP addresses all concerns. Just to clarify for those who are not familiar with the JEP process, there will be a draft stage.
So basically it's a formally consistent JEP for discussion, and that is the plan for the next phase. But if you see any major issues or topics to be addressed, it's a good time to discuss them so that they are added to the draft and referenced in future discussion. This is the JEP, in case anybody wants to read it. Our current plan is to ship an alpha version within the first coding phase, so let's say by the end of the month. In addition to the JEP, there is a pull request to Jenkins core. This pull request becomes a major topic for the alpha release because we still need the API to be available in a public release, so that we can use all the standard features like Beta annotations, etc. From those who are on the call, it would be great to get feedback on whether you see anything major today that needs to be discussed before we proceed with the merge. Functionality-wise, it works pretty well right now. It also passes all tests, and we got some additional test coverage as a part of the project. So personally, as a reviewer, I am tempted to say that it's ready to be merged. I wonder what the others think, especially Jesse and Vincent, because they have worked on pluggable storage implementations before. Do we have any open comments on that pull request at the moment?
I think there is one that I'll address. We discussed yesterday that the enable button I just showed is needed on the core side. So I'll probably add a UI for configuring that from a dropdown inside Jenkins core. That is the only thing that's left.
Yeah, this actually brings up a question about extensibility, because in Jenkins you have two extensibility models. One is, let's say, a factory as an extension point: you have an object which can produce a fingerprint storage, but the extension point is the factory. The other approach is to have the fingerprint storage as an extension point on its own. Both approaches have their own advantages and disadvantages.
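The two extensibility models described above can be contrasted in a small sketch. It is written in Python for brevity (Jenkins itself is Java), and every name in it (`FingerprintStorage`, `InMemoryStorage`, `pick_storage`, `MigratingFactory`) is hypothetical and not taken from the Jenkins core pull request. It only illustrates why a factory makes it easier to keep two storages side by side, for example during a migration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class FingerprintStorage(ABC):
    """Hypothetical storage contract: save and load blobs by fingerprint ID."""

    @abstractmethod
    def save(self, fingerprint_id: str, blob: str) -> None: ...

    @abstractmethod
    def load(self, fingerprint_id: str) -> Optional[str]: ...


class InMemoryStorage(FingerprintStorage):
    """Stand-in for a real backend such as local files or Redis."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._blobs: Dict[str, str] = {}

    def save(self, fingerprint_id: str, blob: str) -> None:
        self._blobs[fingerprint_id] = blob

    def load(self, fingerprint_id: str) -> Optional[str]:
        return self._blobs.get(fingerprint_id)


# Model A: the storage itself is the extension point.
# The first registered storage is used; there is no per-fingerprint choice.
def pick_storage(extensions: List[FingerprintStorage]) -> Optional[FingerprintStorage]:
    return extensions[0] if extensions else None


# Model B: a factory is the extension point.
# The factory decides which storage to hand out, so several can coexist.
class MigratingFactory:
    def __init__(self, old: FingerprintStorage, new: FingerprintStorage) -> None:
        self.old, self.new = old, new

    def storage_for_write(self) -> FingerprintStorage:
        return self.new  # new fingerprints always go to the new backend

    def storage_for_read(self, fingerprint_id: str) -> FingerprintStorage:
        # Keep reading from the old backend while only it holds the ID.
        in_new = self.new.load(fingerprint_id) is not None
        in_old = self.old.load(fingerprint_id) is not None
        return self.old if (in_old and not in_new) else self.new


local, redis_like = InMemoryStorage("local"), InMemoryStorage("redis-like")
local.save("abc", "<fingerprint/>")
factory = MigratingFactory(old=local, new=redis_like)
factory.storage_for_write().save("def", "<fingerprint/>")
```

With model A the choice is made once for the whole instance; with model B the lookup can fall back to the old storage for IDs that only exist there, which is one way the migration case could be handled.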
For example, if you want to support multiple fingerprint storages, let's say for migration paths, maybe it's better to think about the factory approach. But I'm not 100% sure about that. In principle, it looks feasible for this particular case.
I would say for a beta API you can probably just keep it as simple as possible and activate the new storage unconditionally when this plugin is installed. Once there are multiple implementations and a lot of users, you can start to think about whether it makes sense to have multiple implementations simultaneously. Do you need to control which one is used in certain folders? Do you need to mark which engine was used for a given build? Things like that. It's probably premature to do that now.
So, there was the same question with JEP 210, which was about pluggable logging, and JEP 202, which was about pluggable artifacts. For 202, we wound up creating a configuration UI for this. For 210, there's still nothing at the moment.
Well, that was a prototype, which I didn't touch.
Yes, but both stories finally ended up using factories as an extension point. What was our solution in that case, for example for build histories? Both stories actually depend on the build history, and there we were able to keep a storage reference right inside the build metadata. Hence we didn't have to worry about the migration flow, because our migration flow was that you introduce the new storage, old builds keep working with the old storage, and eventually, when you delete all that history, you don't need it anymore. Something like that.
Yeah, there are different design choices in terms of how much information about the external storage you actually want to keep in Jenkins build metadata. For the artifact manager S3, we chose to...
Well, the artifact manager records the fact that it is using S3 for a particular build, but it does not record details such as the name of the S3 bucket. That's part of the Jenkins global configuration. We did that intentionally so that, for example, you could rename a bucket, change only one spot in the Jenkins global configuration, and still have access to historical build records. So depending on what the usage model is and how important access to historical builds is, you can make different choices about where to keep different pieces of information and where they should be updated.
So, Jesse, what would be your advice for fingerprint storage? Because it's not really tied to builds; it's rather a system-level storage.
It refers to builds by ranges, but yes, it is system-level storage, essentially. I haven't thought about it enough to really have an opinion.
Yeah, it might be a bit difficult. Our advantage here is that fingerprints have unique IDs. So potentially we could match on the level of fingerprint IDs: if an ID exists in an existing storage, just keep using that. When you choose the default storage, it becomes a lot more tricky if you really want to operate with multiple storages based on folders, et cetera. But I think these use cases are rather out of scope for the GSoC project.
Yes.
Because it's really for large-scale instances, and to do that we need a lot of feedback so that we design it properly.
Yes, I think people running servers large enough to benefit from external storage are probably willing to do some extra manual setup or migration. So it's more important not to worry so much about those kinds of things at this stage, I think.
And I think Redis is a good choice, because we don't have serialization issues when we use it. If we took another storage like Elasticsearch, there would be a lot of serialization issues on that side.
In principle, it would have been easier to search the data with Elasticsearch. But for storage like this, where we store blobs, Redis looks to be a perfect choice.
One question about the keys. Currently, at least the file fingerprints in Jenkins use MD5 hashes. We would like to replace that with SHA-256 or similar in the future. I'm just wondering whether that has any relationship to this, or whether the interface you're choosing will automatically support different kinds of hash algorithms if we switch to them in the future.
I think Oleg had the same thing in mind when we switched our terminology to IDs. At the moment we have left our API flexible enough that if we do make the switch, say to choosing between different hashing algorithms, the same API can still work.
Maybe it makes sense to show the code here, because I guess it's already integrated in this pull request.
So this is the storage, right? Here we've used the ID terminology, so we've not used MD5 markers for this particular API. We can change the implementation of that getHashString method in the future if we need to.
Right. When we were discussing the API: first of all, this API is behind the Beta annotation, so it can be changed later if needed. There was a discussion of whether we want to pass a byte array or strings in the API. I believe our decision was to postpone that decision until we see all the use cases. A specific of the Jenkins API is that there are already a lot of conversions between byte arrays and strings, and if we could reduce the number of these conversions, it would also be better for performance, etc. I think it's subject to change.
Yeah, you suggested this, and at least for Redis we were using strings as it is, so it made sense for Redis. But in the future, if we find that it does not, we can change it.
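To illustrate the point about opaque IDs, here is a minimal sketch, assuming the storage only ever sees a string ID and never needs to know which hash algorithm produced it. It is in Python rather than Java for brevity, and `fingerprint_id` and `BlobStorage` are hypothetical names, not the API from the pull request.

```python
import hashlib
from typing import Dict, Optional


def fingerprint_id(content: bytes, algorithm: str = "md5") -> str:
    """Return a hex digest that the storage treats as an opaque string ID."""
    return hashlib.new(algorithm, content).hexdigest()


class BlobStorage:
    """Hypothetical key-value storage keyed by string ID only."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def save(self, fid: str, blob: str) -> None:
        self._blobs[fid] = blob

    def load(self, fid: str) -> Optional[str]:
        return self._blobs.get(fid)


storage = BlobStorage()
content = b"hello"

# The same storage accepts IDs from either algorithm without any changes.
md5_id = fingerprint_id(content)            # 32 hex characters
sha_id = fingerprint_id(content, "sha256")  # 64 hex characters
storage.save(md5_id, "<fingerprint algorithm='md5'/>")
storage.save(sha_id, "<fingerprint algorithm='sha256'/>")
```

Because the key is just a string, switching the hashing algorithm changes how IDs are computed but not the storage contract, which matches the flexibility described in the discussion.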
The only important thing for this API is to make sure that you've identified all of the access patterns so that you know what the correct keys are. I think for fingerprint storage it's relatively simple. You can put a fingerprint, but the only query really is: I'm giving you an ID, and I want to know the set of build records associated with it, and I can get all of that data in a single response, an HTTP response or something like that. For some other kinds of storage systems, like JUnit for example, it's a lot more complicated to decide what kinds of queries we might want to support. But here I think there's no need, for example, for an efficient way of finding all of the fingerprints produced by one job, because there is no Jenkins feature that does that. The only thing we do is look up a particular fingerprint and get builds back from it.
I'm not sure, because there are plugins which list all fingerprints. For example, you can open the UI of a build and there is the "See Fingerprints" action, which basically lists all fingerprints for the build. There are also inverse searches, for example on fingerprint pages. So for queries, I don't think it's a fully done deal what they would look like.
So you're saying: given a build, enumerate all associated fingerprints. Yes, that certainly needs to be supported by the API. So that would need to be added to this extension point.
Maybe lookup APIs too, where you just check for existence, etc., without really pulling the data and deserializing it, and there might be a lot of iterations. But for fingerprint storage we actually have an advantage, because the number of use cases currently in Jenkins is not that high. It's artifacts, it's credentials, and to some extent it's Docker images. There are very few existing plugins, so we can just take a look at these use cases, maybe create some benchmarks using the Jenkins test harness, etc., and start from there.
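The access patterns raised in this exchange (look up usages by ID, enumerate fingerprints for a build, and check existence without loading data) can be sketched with two in-memory indexes. This is an illustration in Python under the assumption that the storage maintains a reverse index itself; `IndexedFingerprintStore` and its methods are hypothetical and not part of any Jenkins API.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class IndexedFingerprintStore:
    """Hypothetical store with a forward index and a reverse index."""

    def __init__(self) -> None:
        # Forward index: fingerprint ID -> job name -> build numbers.
        self._usages: Dict[str, Dict[str, Set[int]]] = defaultdict(
            lambda: defaultdict(set)
        )
        # Reverse index: (job name, build number) -> fingerprint IDs.
        self._by_build: Dict[Tuple[str, int], Set[str]] = defaultdict(set)

    def record(self, fingerprint_id: str, job: str, build: int) -> None:
        self._usages[fingerprint_id][job].add(build)
        self._by_build[(job, build)].add(fingerprint_id)

    def exists(self, fingerprint_id: str) -> bool:
        # Existence check that does not load or deserialize any usage data.
        return fingerprint_id in self._usages

    def usages(self, fingerprint_id: str) -> Dict[str, List[int]]:
        jobs = self._usages.get(fingerprint_id, {})
        return {job: sorted(builds) for job, builds in jobs.items()}

    def fingerprints_for_build(self, job: str, build: int) -> List[str]:
        return sorted(self._by_build.get((job, build), set()))


store = IndexedFingerprintStore()
for build in (1, 2, 3):
    store.record("id-demo-txt", "demo", build)
store.record("id-other", "demo", 3)
```

Here `usages("id-demo-txt")` answers the "given an ID, which builds used it" query, while `fingerprints_for_build("demo", 3)` answers the inverse query without scanning every fingerprint.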
Right, I think the number of plugins potentially affected by the API is very small: Credentials, Docker Commons, or Docker Workflow.
Yeah, Docker Traceability, if it still works.
So we have a pretty small set of places to look for all of the valid use cases, as far as I know. But you do need to do that search: go through and do some sort of text search of open source Jenkins plugins for interesting uses of fingerprints, and decide whether those uses are doing something that would be covered by this API or, if not, whether they're doing something important enough to add.
That's right. Anyway, it's a great start, because basically you have something working by the end of the first phase. If you get an alpha release, you will probably get some feedback from users. Fingerprints are probably not the biggest offender in terms of disk space, etc., but it's still a nice use case we could address, and it will give us some insights into what we do next for all the storages.
For some users it can be pretty big. For example, if you use the Pipeline Maven plugin and don't turn off the feature to keep fingerprints, you can quickly get multi-gigabyte databases.
Yeah, and actually it's a good subject to explore, because Pipeline Maven is quite a widely used plugin. So why not? I really didn't think about focusing testing on this plugin, but it might make sense. And right now the Pipeline Maven plugin also has its own local database, so maybe there are some opportunities to see whether it could be integrated with the fingerprint storage. I'm not sure.
I don't remember what that database is exactly. I think it's snapshot tracking, which is certainly related to fingerprints but not exactly the same. You could maybe reimplement that on top of fingerprints. I'm not sure.
Yeah, it might be interesting, because otherwise somebody would have to maintain these Pipeline Maven plugin databases, and it already has at least two or three: it had H2, it had Postgres, and it had MySQL or MariaDB. So it has fingerprints if I'm right. Any additional feedback or comments?
Yeah, I just checked that question about the lookup of fingerprints per build. That doesn't seem to be an issue as far as I can tell, because that's stored as a run action.
Yeah, I believe it's due to the storage limitation at the moment. We have file system storage and we do not really cache fingerprints, so we load them on demand. In order to get any kind of feasible performance without real query operations, we record fingerprint IDs in actions and persist these actions. So basically this database still exists, but it exists as an action, not in the database itself.
Right, so a nicer approach would be to allow this fingerprint storage extension point to turn off the fingerprint action and have the equivalent functionality re-implemented by a proper indexed search. I'm not sure if that's something you could do in Redis exactly, because that's something you would more likely do in a relational database, perhaps. I don't know Redis.
It is relational enough, I would say. Well, it might still pose some issues with indexing, but I think it's a topic which could be researched.
Yeah, that would be nice, because the implementation of the fingerprint action has some complicated optimizations in it to try to minimize the amount of heap memory being used in each case, because that has often been a performance problem for us. So I would just keep that out of the build record entirely and look it up from external storage if and when that information is needed. That would be much nicer.
Definitely. Does anybody have any new use cases for fingerprints?
Beyond artifacts, credentials, and images? Anywhere you would like to see fingerprints?
It might be interesting to see if there's any useful interaction with an artifact repository like Nexus or something like that. If your build is uploading artifacts to Nexus, for example, which already does its own hashing (I'm not sure if it keeps any kind of index of those hashes), is there some sort of interesting interaction we could have with that, so that you would be able to see not only the Jenkins build that produced a given file, but also look it up in Nexus, or vice versa? I don't know if there's anything to do here, but it's something to consider if you're thinking about the use cases for fingerprints and how they relate to external systems.
It would be interesting, especially since Artifactory and Nexus are quite popular. There were discussions about creating a native artifact manager for them. I believe it hasn't happened yet, but in principle it would be possible and maybe useful for some companies. Nice use case.
I was also thinking about using fingerprints for script approval. In Jenkins we have a script security engine where in some cases you can approve scripts or particular pipeline commands or methods for use. So I wonder whether it would make sense to add more traceability there. If you participated in the UI/UX hackathon, Wadeck presented a new UI which adds some additional tracing for script security, but maybe with fingerprints we could go even further.
Yeah, I don't think you should use fingerprints for that. First of all, it should be a very small number of items, ideally zero if you're running Jenkins properly. Second, it has much more serious security implications than fingerprint matching does. Anyway, I would keep that separate.
Just making notes. Any additional feedback?
I think we should really target integrating this pull request, at least as beta with all disclaimers. I did some testing, and file storage basically works as is, pretty much as Sumit presented during the demo. It also contributes to code refactoring, which is a really great thing, because the fingerprint code is not exactly the most structured code in Jenkins core. It's not that big, but we can make a lot of improvements there. Okay. If there is no more feedback, thanks a lot to Sumit for the presentation. Looking forward to seeing it in production.
Thank you.
Thanks all for participating. If you have any questions, we have a Gitter chat where you can join and ask them. I think we will share this presentation after the call, so you can find all the references there. Thanks all. Thank you for joining us on the recording.