Welcome to the talk. It's going to get a little darker in here, as we were just warned. The reason, of course, is that we're using a very dark slide theme, which is actually our company's slide theme, and the lights were washing it out. Enjoy the darkness for a little bit. You can have a nap if you want; we're not going to be mad about it. I was going to make a joke about how great it is to see so many people attending, but now I can't see anybody anymore, so I'm just going to assume you're all still here.

The talk is SneakOps: getting users to GitOps without them even knowing about it. My name is Eamon Ryan. I'm a principal field engineer at Grafana Labs, and I'm joined here by my partner in crime.

I'm Heds Simons. A quick note on what the field engineering team does: we support the solutions engineers, all the sales engineers at Grafana Labs, to make sure they have the breadth and the technical depth that they need for their engagements. They know a lot about how the products are used. We know a lot about how the products are architected. We also write workshops, we write material, and we also build and maintain the demo kit.

So, this is... am I too close to something? I can't really tell. Is it you? Is it this one? Right, that seems better. Okay, so this is a very basic view of the environment Heds mentioned, called the demo kit, that we support. It is essentially a GKE cluster with a whole bunch of stuff deployed on top of it. The buzzing is back. Can you just turn this one off? It's off? Okay. Anyway, it's this large instance. It's got Grafana in there, and I'm sure everyone here is familiar with Grafana, though maybe not with all the other stuff it connects to. Lots of people only connect it to one or two things.
But because we are Grafana Labs, and one of the teams we support gives demos of what Grafana can talk to, we have it hooked up to literally everything we possibly can. On the left side it's Prometheus, as well as all the other projects and products that Grafana Labs has, like OnCall, Mimir, Incident, Loki, Tempo, ML, and k6. On the right-hand side we have all the third-party things that we can also connect to, especially with the enterprise version, like ServiceNow, Oracle, Splunk, and so on. So it's a lot of things to connect to, and it's a lot of work to keep up and to make sure that all of the different demos are actually working all the time.

So, we've got a lot of dashboards. As far as our role goes, we create all the data sources that the solutions engineers need, and we usually create a few demo dashboards that they can use as a basis going forward. And they do do this, but the problem is, well, customers are always the problem, right? The customer says, we want a demo that does this with this data source and that with that data source, and let's see it all working together. So the solutions engineers take the data sources and the dashboards that they have and start copying them so that they can tailor them for exactly what they need to do, which is great. But they do it again, and again, and again, and all of a sudden we have so many dashboards within the Grafana demo kit that we're not entirely sure what's going on or which are the golden dashboards that they should be using. Solutions engineers start copying other solutions engineers' dashboards, and so on and so forth. It really becomes a situation where even we have problems going through it and trying to figure out what we should be keeping and what we shouldn't.
This basically comes in almost literally every week, in the form of this emoji in our Grafana Slack channel. A bot comes in and writes it, built out of something like nine smaller emojis, so it comes in huge, which is great, and it works in every channel. Thank you.

So, our plan to solve this is to have multiple Grafana environments instead. The first is a sandbox: complete anarchy, Mad Max. It looks very much like what we have now, except that it's intended to be that way, instead of having evolved that way over the...

And then we have a second environment, which is our pristine production environment, and the way to keep it pristine is, of course, to not let people change it easily. So what we're going to do with that one is lock it down, so that the SEs and the other people who use the environment can't directly edit things in there. Lock it down with permissions, make sure users aren't admins except for our team, and make sure they can't just add whatever they like.

What you can do with Grafana is feed dashboards in programmatically, and we want to do that via Git. That would require solutions engineers to file a GitHub PR with the dashboards that they want to add to the Grafana instance. Sounds great, except that Grafana doesn't have a great flow for automating this by itself. You have to use the API or something like that; it doesn't really have a native integration for this yet. But in theory, we thought this would help and would give us a nice, cleaner setup. However...

So, the challenge is that we as the admins want to use GitOps. It's fairly obvious why. It makes things much easier for us. It's very low-touch. We can get stuff in without having to do a lot of work ourselves.
So, the problem is that, across the range of different sales practices that we have and the engineers in them, not everybody really knows about GitOps, Git, or the source control systems that we're using. And because they're constantly in meetings, they don't really have time to go away and learn all this stuff, all the processes, all the flows that we want them to use. And, of course, the fact that the majority of the solutions engineers aren't putting PRs in, or even thinking about this, means that the mess remains. It just continues. We end up in a situation each week where we have to go in and start pruning dashboards and looking at data sources, seeing exactly what needs to be there and what needs to come out.

We actually have an additional problem. As we add more Grafana instances, going from a single centralized one to having one in the U.S., one in Europe, and one in the Asia-Pacific region, the SEs are adding dashboards in the main one and then saying, oh, it's not in the other one, and it's not in the other one. So dashboards are not really propagating around. If they had gone through the Git process, maybe we could do that. But as Heds said, it's not usually part of their day-to-day job to play around with Git, so they never really get used to the process.

So we're going to GitOps it all. We're going to use some fairly standard tools, but we'll go through what those tools are and a little bit about what they do, just to set the scene for the demo.

The first is Grafana itself. Grafana Labs obviously has many products; in this case we mean the beautiful dashboarding tool that everybody knows and loves. It has a full REST API. That REST API works with JSON data. It allows you to upload dashboards. It allows you to upload data sources.
It allows you to pretty much do whatever you want with any of the resources that Grafana supports. Full CRUD operations are supported. In this particular case, we're going to use it to download dashboards from one Grafana instance and upload them into our production instance instead, where they will be locked down so that people can't start mucking around with them.

So we're going to run a little daemon in a pod on our Kubernetes cluster. It's going to talk to our Wild West, anarchy development system, where the solutions engineers are going to put the dashboards that they want stored in our production Grafana. And Grafana is going to send the dashboard JSON back to that daemon running in the pod.

Another component that we're using here, which I'm sure lots of people are already familiar with because it's been around for years, is Terraform. Briefly, if you're not: it allows you to do infrastructure as code. It works idempotently, which means you can rerun things and it won't stamp all over everything you've done. It works off desired versus actual state. And there is a Terraform provider for Grafana, and one of its basic resources is the dashboard resource. So you can add dashboards, no problem. You can remove dashboards. As a side note, we've put a ton of dev effort into that provider this year, so it does way more than it used to, but obviously one of its most basic features is putting dashboards into Grafana. That works really well. It means that as we push dashboards into Grafana, it will only add the ones it needs to, not the ones that are already there. It works on a diff basis.

So we have our stuff inside of Git. Terraform is going to react to that with an extra tool in play, which we'll get to in a second.
And it's going to push the dashboards into the production Grafana that landed in Git from the sandbox Grafana, via the other component Heds was just talking about.

So this is where Atlantis comes in. Obviously Terraform doesn't run by itself. We're going to store these dashboards inside Git, GitHub in this case. What we want to happen is that whenever a PR is created, and the daemon itself is going to create that PR for us, we can first verify the dashboards, which means an approval process involving somebody who is responsible for the production environment. Then, once that approval has been given, we want to be able to actually do something with it. Atlantis is a way of running Terraform via Git pull requests. It's a fully fledged system. It allows you to do various things, such as only applying Terraform when something has been approved, cleaning up on a merge, all of that kind of stuff that you need for a GitOps workflow.

So the general idea is that we will have our dashboards in GitHub. The daemon that's doing all the work of pulling these in from our development Grafana will create a new PR. The Terraform plan will be run at the point where that PR is created, so that we can assure ourselves that we're not going to break anything if we go ahead with merging it. Somebody will then come along and approve it, having looked at the dashboards. And after looking at the JSON produced by Grafana day in, day out for two and a half years, you can actually start seeing into the Matrix just by looking at a dashboard JSON file, which is quite scary, really. Then, once it has been approved, we're going to apply the Terraform that goes along with it, and that will get us into a position where we can merge against our production instance.

And one other small thing on this side.
If people were thinking of asking the question, hey, why didn't you use the Terraform controller in Flux? I didn't know that existed until this Tuesday. So we'll probably go and look at it. We were at GitOpsCon on Tuesday and thought, oh, excuse us, we should maybe look at that. But yeah, I literally didn't know it existed until this week.

Another little piece of glue we're using is GitHub Actions. It allows you to really easily run little workflows inside of GitHub, things that can be triggered by PR creation, comments, all that kind of stuff. We were just using it as a little bit of glue because of the way that Atlantis currently works. Atlantis requires you to comment on the PR to allow it to proceed, but we wanted that to happen in a slightly more automated fashion. So we have an action that comes in and automates that for us, so that we don't have to do it in quite as manual a fashion.

All right, this brings us... Oh, no, this is me again. By the way, all these pieces of artwork we've been showing in the background are just fun output from the different AI art generation tools that you've seen around lately. We asked them for things like GitHub Actions, which created a very strange image here with what looks like a fox. And this is what it came up with for Atlantis, even though it very much looks like Rapture from BioShock. Kind of okay with that, honestly. That's pretty cool. This is one of the ones we got for SneakOps: this very shady character in your data center. I thought that was perfect. And there are a few more of them as we move along.

Okay, so our solution is... I actually have to move closer because I can't even read the text now. We take our sandbox Grafana here, and down here we have our solutions engineer. He's going to go in and add a dashboard, which will have a bunch of characteristics, as you might imagine. It'll have an ID. It'll have tags associated with it.
And that dashboard, what we're going to have them do is save it into a very specific folder; call it the magic folder. It doesn't really matter which folder it is, as long as it's the one that we designate. So they save it into that folder, and there you have the dashboard with specific tags on it.

Then our component, the dashboard daemon or SneakOps daemon that Heds was describing (looks very shady, more AI art), is going to come in through the Grafana API, look at what's in that folder, pull it out, and file it as a PR in GitHub. That means that in GitHub we now have a folder structure that includes the dashboards being pulled across from the sandbox Grafana. And this works for as many as we need: multiple folders, multiple dashboards, it doesn't matter.

Then, as we mentioned, Atlantis is going to come in. Atlantis is a Terraform runner. It's going to create the plan to run this against the production Grafana, and that will allow us to get as many dashboards into as many folders as we want. It means we will have a system where all the person does, the SE in the bottom left, is save the dashboard they want into a specific folder with a particular set of tags, and then off it goes. But Heds is going to explain the custom bit in more detail.

So we keep talking about this daemon that's going to work with dashboards, but what actually is it? This is what our SneakOps daemon actually is. It's written in a language that rhymes with "toad"; I'm not proud of it. It is going to be rewritten in Go or Rust. What it basically does is look in this golden folder in the development environment. Every time something new appears there, it goes away and looks at exactly what it is. It looks at the tags that are attached to the dashboard to determine the folder in the production environment that we're going to put it in. It looks at a version tag as well.
It's very important that we don't allow SEs to create new dashboards that are going to overwrite production dashboards that are already in use. That would get very messy very quickly. So it does a load of different checks to ensure that it's actually safe to provision this dashboard to the production environment. It raises errors if any of the information it needs isn't there or if something is going to go wrong. Then, if everything is all right, it will raise a PR in our GitHub instance so that we can go through the entire GitOps flow and get the dashboard into production. If there's an author tag on it as well, it will also ping you. It logs to standard out, which gets picked up by Loki, oddly enough. But we also have a Slack integration, so that anybody who's interested can come along and look at the Slack channel where all of this work is going on and see whether their dashboards have thrown an error or are actually working.

So Heds is going to switch over and show us a demo of this.

This is not a pre-recorded demo. So fingers crossed and toes crossed. I'm praying to the elder gods that everything is going to work. In this particular case, you can see that we have our development environment, which currently has a couple of folders, and we have our Wild West folder here, where maybe a solutions engineer is working on an amazing demo dashboard. We have our production instance as well, which has a few different folders that people are using at the moment for demos. What I'm going to do is go away and open my dashboard. It doesn't really matter what the dashboard has in it, but I'm going to save it now into the folder that we use to pick up any changes to go into production. So let's move that into our SneakOps folder and let's save it.
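
The checks described above (a production folder tag, a version tag, and a guard against overwriting a version already in production) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the speakers' daemon, which has not been published. The tag prefixes `prod-folder:`, `prod-version:` and `author:`, and the names `check_dashboard` and `Verdict`, are invented here for illustration; the talk does not show the real tag format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tag prefixes. The talk only says that a production folder
# tag and a version tag are required, and that an author tag is optional.
FOLDER_PREFIX = "prod-folder:"
VERSION_PREFIX = "prod-version:"
AUTHOR_PREFIX = "author:"


@dataclass
class Verdict:
    """Result of checking one dashboard's tags."""
    ok: bool
    folder: Optional[str] = None
    version: Optional[int] = None
    author: Optional[str] = None
    errors: List[str] = field(default_factory=list)


def _tag_value(tags: List[str], prefix: str) -> Optional[str]:
    """Return the text after `prefix` in the first matching tag, if any."""
    for tag in tags:
        if tag.startswith(prefix):
            return tag[len(prefix):].strip()
    return None


def check_dashboard(tags: List[str], prod_version: Optional[int] = None) -> Verdict:
    """Decide whether a sandbox dashboard is safe to propose for production.

    `prod_version` is the version already provisioned in production, or
    None if the dashboard does not exist there yet.
    """
    errors: List[str] = []
    folder = _tag_value(tags, FOLDER_PREFIX)
    raw_version = _tag_value(tags, VERSION_PREFIX)
    author = _tag_value(tags, AUTHOR_PREFIX)

    if not folder:
        errors.append("missing production folder tag")

    version: Optional[int] = None
    if raw_version is None:
        errors.append("missing production version tag")
    elif not raw_version.isdigit():
        errors.append(f"version tag is not a whole number: {raw_version!r}")
    else:
        version = int(raw_version)
        # Refuse to overwrite a version that production already uses.
        if prod_version is not None and version <= prod_version:
            errors.append(
                f"version {version} does not exceed production version {prod_version}"
            )

    return Verdict(ok=not errors, folder=folder, version=version,
                   author=author, errors=errors)


if __name__ == "__main__":
    print(check_dashboard(["demo"]))
    print(check_dashboard(["prod-folder:Amazing Demos", "prod-version:1"]))
```

Collecting every problem into `errors` rather than stopping at the first one means a single rejection message can list everything the dashboard's owner needs to fix.
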
Now, what's going on at the moment: the way that our Dash Daemon actually works is that it polls every 10 minutes. We don't want it to poll every 10 minutes for this demo; we would go way over time. So it's currently running in a little demo cluster here. You can see that we're running Atlantis. We're running our daemon. We're running a few things that we need for observability. And we're running both our Grafana development instance and our production instance as well.

Hopefully by the time we go back to Slack... he says. Of course we're going to hit the 30 seconds the other way. There we go. So the Dash Daemon has told us that we can't provision that. We need some more tags that we don't have: the prod version and the prod folder. I've not told it where to put the dashboard and I've not told it what version it is. So what I need to do is go back to my dashboard now and change my tags. We're going to remove the tags that the Dash Daemon has put on there. Those are there to help us: they help both us and the solutions engineers determine why the dashboard couldn't be provisioned, and they give an epoch date for when it happened. So let's remove those. It's dark, of course, so I can't see my screen properly, but we're going to put that in there. Let's call it the amazing demos folder. We're going to give it a version of one. So that will allow us, hopefully, to try again.

The Dash Daemon should come along shortly and pick up the fact that we want to put it in production in that folder with that version. I could whistle The Girl from Ipanema. I'm hoping that this is actually going to go quickly. Hope everyone's holding their breath. I am. There we go. So now the Dash Daemon has picked this up. It said, okay, you've given me everything I need to provision this dashboard, and it's also very handily given us the link to GitHub where the PR has been created.
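
The failure tags mentioned above, which the daemon adds to a rejected dashboard together with an epoch timestamp, might look something like this. Again this is only a hedged Python sketch: the `sneakops-error:` prefix, the `reason@epoch` layout, and the function names `mark_failed` and `clear_errors` are assumptions made for illustration, since the talk does not show the real tag format.

```python
import time
from typing import Iterable, List, Optional

# Hypothetical prefix for tags the daemon writes back onto a rejected dashboard.
ERROR_PREFIX = "sneakops-error:"


def clear_errors(tags: Iterable[str]) -> List[str]:
    """Drop any failure tags left over from an earlier rejected attempt."""
    return [tag for tag in tags if not tag.startswith(ERROR_PREFIX)]


def mark_failed(tags: Iterable[str], reasons: Iterable[str],
                now: Optional[float] = None) -> List[str]:
    """Return the tag list with one failure tag per reason appended.

    Each failure tag carries the epoch second at which the rejection
    happened, so both admins and dashboard owners can see why and when.
    """
    stamp = int(time.time() if now is None else now)
    kept = clear_errors(tags)
    return kept + [f"{ERROR_PREFIX}{reason}@{stamp}" for reason in reasons]


if __name__ == "__main__":
    tagged = mark_failed(["demo"], ["missing-prod-folder", "missing-prod-version"])
    print(tagged)
    print(clear_errors(tagged))
```

In the demo the presenter removes these tags by hand before retrying; `clear_errors` shows how the same cleanup could be automated on the next successful pass.
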
So if we go and have a look at that PR, obviously we can't show you our production demo environment, and we also can't show you the GitHub repos that we're using for this, so I've just set something up in one of my own organizations. What's going to happen now is that Atlantis is going to run the Terraform plan. It's going to make sure that it can actually do something. It's usually fairly quick. It shouldn't take too long, he says. There we go. It has actually run; it's just that GitHub didn't refresh. So we now have an output here, and Atlantis has run the plan. Terraform has told us exactly what's going to change. We're going to add a new dashboard. We're going to add a new folder. And at this point, we're in a good position for somebody who maintains the production environment to come along and approve the changes. Button boy, do your thing.

I get to press one button. That's my contribution to the demo.

This usually wouldn't allow you to merge either. It's just because it's a private repo in my organization that I'm not actually paying for, so I've not been able to turn branch protection on. Eamon has come along and said a few things in the comments. GitHub Actions has now discovered that an approval has occurred on the PR. Because of that, it's gone along and commented atlantis apply. So Atlantis is now going to do its thing. It's going to apply Terraform on our PR, which again, hopefully, should not take too long. It hasn't. There we go. It's done it. It's applied our dashboard changes. It's closed our branch. And if we now go back to our production instance and take another look, we can see we have our new folder, and our dashboard has been moved across fully, along with all of our tags. Now the SE can come along and demo this dashboard, and we know we're in a good position and that it's a dashboard that should actually be there.
Of course, if anything had gone wrong, what I could have done is go into Grafana. You know, we're Grafana, so I've logged it all. There are metrics. There are traces. But that's essentially the demo.

Clicker... the table is so dark. So, quick recap. Our environment: a big shared Grafana instance, tons of data sources, tons of integrations, lots of users. Our whole SE team are maybe the primary users. There are dozens of them: fewer than 100, more than 50. But it's actually open to everybody at Grafana. It gets used by engineering teams. It gets used by customer support teams. It gets used by professional services. It's used by everybody. So, lots of people in there.

So the situation was that we have massive dashboard sprawl. We have lots of duplicated dashboards. We have dashboards that basically don't work anymore. There's a real danger of somebody saying, okay, I'm going to show you this lovely demo, and then nothing works. We have a lot of old dashboards. It's just not a situation that we can continue with.

So, like I said earlier, our plan is to have multiple Grafanas, with at least one sandbox. It's intentionally a big giant mess. That's okay. But we have a nice clean production one, and we want to provision things across via GitOps.

But the challenge is that a lot of our users don't actually use Git on a day-to-day basis. So we needed to build something that would allow them to use the tools that they know and love very, very well, that still gives us some view of what's going on, and that takes very minimal effort on their part.

And our solution is, well, we'll build this tool. Let them save dashboards to a magic folder. They don't ever have to open GitHub or Git or even know how to run Git commands. Dashboards are pulled out via the Grafana API. They get committed to a Git folder structure.
Atlantis runs a Terraform script that I wrote, which is able to iterate over that same folder structure, so a folder of folders of dashboards, run that in a loop, and drop them in exactly where they need to be.

So let's do some wrap-up. Should you do this? This is a custom solution to a specific problem that we had. It is something we are actually rolling out for real in our internal environment for those groups. We're not necessarily saying everybody should do it exactly the way that we did it, but this is intended to be an example: say you have something and you want to GitOps it, but your users are not really Git savvy. This is an example of what you could do. As long as the tool that you're trying to GitOps has some kind of API, a way of speaking to it programmatically, you could do something like this even if it wasn't exactly this. The tools are there. You just need to stick them together. Do a bit of work. Bang your head on the wall because it didn't work 17 times. All that kind of stuff.

On maintenance... well, excuse me. As you can see, we've tried to pick some off-the-shelf tools to make it very easy. The only thing we had to do was write a daemon that works with the Grafana API and creates the pull requests. Having looked through GitHub Actions a bit more, as I've slowly been learning it, I think we can do an awful lot of this without having to have a daemon at all. If you can pick the tools that you need and configure them in such a way that you can maintain them with as little effort as possible, then it probably is a really good thing to do. As Eamon says, if you've got users who work at a very high level and don't actually know the tooling they'd need, this could be a really, really good way of ensuring that they can self-regulate what they're doing.

On the future roadmap: this is also partially an answer to the inevitable question of, is this open source, can I have it?
We haven't actually made it public at the moment. We're kind of on the fence about whether we should, but it's not because we don't want to give it to people. It's also because there's significant effort going on in the official Grafana code base to improve this kind of integration to a large degree, which, I'm going to be very loose about this, should be something for next year. Early on. Don't hold me to it, but yeah. There is official work there, and we're a little hesitant to just hand this to people, because then people will start using it and maybe that's awkward when the new version of Grafana comes out. So we don't really know what we should do there. But within the tool itself, we were thinking of doing things like adding data sources.

Yeah. One of the things I mentioned earlier is that you get very good at reading JSON. That's fine if you've been reading it for a long, long time, but not so good if you normally do something else and you're just trying to admin and maintain what's going on. So we're going to run a Grafana instance when the PR is created that gives you a screenshot. There are nice little bits and pieces like that that we're going to start adding to make it a lot easier.

So it looks like we have five minutes left, and that's actually perfect. I'll leave this up. This is the rate-the-session QR code and our Twitter links, but we have a couple of minutes for questions.

I was going to say, Twitter links at the moment. Let's see how long we last on Twitter. So, questions. There you are.

If you remove something from the magic folder, does it also remove it from the production environment? So does it also do something like a terraform destroy of the dashboard?

Not at the moment, but that is again something that we're thinking over for the future. To be bluntly honest, part of the problem is that we have a very small field engineering team.
So we try and do things like this in bits and pieces when we're not talking in depth to customers about Tempo, Loki, or, you know, Mimir, that kind of thing. So there are a lot of things on the roadmap, but unfortunately that's not one of them at the moment.

It wouldn't be that much work to add that bit, because the actual provisioning piece is effectively Terraform. If we change it so that when something is removed in a certain way, it gets removed from the Git folder, Terraform will take care of it without running terraform destroy. The apply will recognize that it's not there and will actually delete it. So half of it is already done by nature of it being Terraform.

So you guys showed... Sorry, go ahead. Sorry.

So the statement was that the tool creates the PRs automatically. Will it update the PR if they then update the dashboard because they did something wrong?

Yeah, it will. If the PR is still open at the point where somebody comes along and changes the dashboard again, the polling will go and look at all the PRs that have been created, and the daemon will say, the version number has gone up, so I'm going to update what we've got here, do the same with the tags, and ensure that the plan runs again, so that we're not in a situation where maybe somebody else has come along at the same time and done the same thing. Again, this goes back to checking that there are no conflicting versions with the same folders, the same dashboards, et cetera.

I saw your hand up.

Will you be adding webhooks to Grafana, so that it can call out to some action when a change happens?

I knew I shouldn't have said polling. I'm just repeating the question for everybody. We have a daemon that polls, and the question was whether we've thought about enhancing it to support webhooks on top of that: webhooks from Grafana that call out when a change happens. That's a really good question.

It is a really good question.
It has actually been debated internally before. I honestly don't know what the situation is.

grafana/grafana issues, please.

I have heard people ask for a custom webhook from Grafana several times; I don't know what the official answer on it has been so far. But I bet there's an issue already there for it. I wouldn't be surprised.

This person way over here. Is the mic going to work? We have time for maybe one more question, I think. One second. If you're quick, we can maybe do two. Hurry. Yell.

Was that a question or a statement? Okay, I think I understand. So you're asking whether we considered doing it in Crossplane, using a Kubernetes CRD for applying a Grafana dashboard? That would be neat to do, but no, we didn't actually look at it. That would kind of fit, I suppose. I have weird thoughts about Crossplane: it's provisioning against stuff that could technically be outside the cluster, and you're adding an extra layer, and I'm not sure the complexity is worth it in some cases. But it's a good question, and we haven't really looked at it yet.

Okay, last one, real quick. Last question. Yes.

It kind of works. So that's a good question, and at the moment, no. Although ideally what we should be doing is not applying until the merge occurs.

Yeah, the way that Atlantis seems to work is that you comment atlantis apply, and then it applies, and the PR is still open, and you're supposed to merge it afterwards. It's kind of weird, because you feel like you should merge it and then that should result in the apply, but that just seems to be the way Atlantis works. I wasn't able to find out whether you could reconfigure it to work a different way.
That's also why we had the GitHub Action that posts the atlantis apply comment for us when you approve the PR. We stuck that in as a bit of extra glue. Now, maybe Flux's Terraform controller would actually do this in a better way. We won't know until we try it out.

At worst, we just create another GitHub Action on atlantis apply.

Probably one more. We're getting extra time for a question. Oh, crap.

Sorry, how do you manage changes to the production Grafana? Because in this case you would get a huge difference when you try to apply changes for...

So you're asking what happens if somebody changes the dashboard in the production Grafana. They can't, because we're literally disallowing them from doing that. The only people who could do it would be our team of four people, because we'll be the only people with permissions to do that. If that were to happen, I guess you'd just delete it out of production and re-push it from Git. Obviously Terraform's going to provision everything again.

The thing about Grafana now is that from Grafana 9 we have a very, very detailed, fine-grained access control system, so we can give lots of people permissions to do everything they need to do. Thankfully, Grafana now has the ability to do all of that very easily without us having to worry about it.

I forgot this is 35 minutes, not 30. Yep, one over here.

What do you get out of running Atlantis instead of just running terraform apply in, like, a CI?

It was quick. It was quick and easy to set up.

The other thing that you get from not running it automatically in CI is that you get to see it before it runs. The PR shows you the whole plan and lets you see what the diff will be, so that people in our team can go and look and say, oh, it's going to do this, that looks okay, we can let it run. Whereas if you run it in CI, it always assumes everything is okay, and everything may not be okay.
Something weird could be happening, and you might never see the diff, and all of a sudden Terraform has wiped out every dashboard you have. So you can see what it's going to do and, hopefully, save yourself. But thanks very much, folks. Thank you.