Hello, everyone. Thank you for coming to my talk, "How Cortex Became a Community." This is a talk that I'm really passionate about, so let's just dive in.

I'm Gautam. I'm a senior software engineer at Grafana Labs, and I've been involved in the Prometheus community for about five years now. I started off using Prometheus while I was doing an internship, and then I did my Google Summer of Code with Prometheus, contributing to its storage engine. But for the past three years, I've been working on this other CNCF project called Cortex, which we use at Grafana Labs to provide a hosted Prometheus service. I was also one of the initial co-authors of Grafana Loki, a log aggregation system inspired by Prometheus and Cortex. I've seen Loki go from launch to a huge community and a huge user base, and that's been pretty interesting.

Before we dive in, I just want to cover a little bit about Cortex and its history. Cortex is a horizontally scalable, multi-tenant version of Prometheus, or rather, it implements the Prometheus APIs. It's about five years old now. It started off in June 2016, and quite recently it became a CNCF incubating project. It was started at Weaveworks to power its SaaS. Weaveworks was creating Weave Cloud, and they wanted to give all their users a hosted Prometheus solution. It had a few early adopters in EA, MayaData, and a couple of others. Later, Grafana Labs started contributing to Cortex and started offering a hosted Prometheus solution. And today, it is being run by tens of companies. I think I know of 25 companies that are running Cortex, and it is powering their mission-critical monitoring systems.

With that context, I'm going to go into a few numbers. Cortex started off somewhere in 2016. I'm not even going to cover that part of the graph. Let's start somewhere at the end of 2017, and you can see it's pretty flat.
But in recent years, there's been a huge spike, and the community has been growing over time. This becomes a lot more apparent if you look at the number of contributors that are contributing every month. It was roughly around 10 contributors. And then in 2019, once we joined the CNCF, it starts going up and there's been a huge spike. In January 2021, we had 75 unique contributors in that month, which is kind of crazy, to be honest. And I absolutely love this growth.

We also have huge growth in adoption. You can see this here. This is the public adopters list we have. We have a lot of big names like Etsy, Gojek, DigitalOcean, and EA. And quite recently, we've added AWS. AWS is using Cortex to power its hosted Prometheus service, AWS Managed Prometheus. So while these are just 15 companies that are public, I've also been speaking to a lot of other companies, a lot of banks and enterprises that can't make their Cortex usage public. And I think I know more than 25 companies overall that are using Cortex in production, which is kind of crazy.

So why this talk? If you look at the numbers, they are actually not that big, but we're quite proud of the numbers we have and the adoption that we saw. Over time, we've experimented with a lot of things, and we've noticed that some things contributed a lot to the growth while some other things did not. We've tried to do the same things with Loki, and we saw incredible adoption in Loki. Over time, I think I built this playbook in my mind as to what open source projects should do if they want to have more users, more maintainers, and more adoption. I was talking to a couple of friends of mine who were involved with early-stage open source projects that were kind of struggling to build adoption. And while I was giving them tips, I realized that I could give a full talk that would be useful for more open source projects.
And by no means are we experts at this. We're still learning a lot about how to build communities. This is just a subset of the items that I think are important. If you know of other items that you think are important, please share them with me after this talk. I would love to write a blog post or a living doc on what open source projects can do.

With that context, let's dive in. This first one is going to be obvious: documentation will make or break your project. You can build the best open source system in the world, but if you don't have good documentation, nobody will use it. This is obvious, right? But it turns out we actually didn't do a good job of it until quite recently. We had docs. They were all in GitHub. You had to click through Markdown files to figure out where things were. They didn't have a proper structure. And we did not have a website until December 2019. In December 2019, one of the Loki maintainers, Cyril, who was also contributing to Cortex at that time, decided to just use one of the open source templates and create cortexmetrics.io. This is basically when we started to focus a lot on usability and documentation. We redid our docs. We added a lot more structure. And once we launched a website with proper docs, we got a lot of feedback that the usability and adoption story was like night and day. People who were struggling to use Cortex before found it much easier to adopt, and they were loving the new website.

So if your project doesn't have a website, add a website, and make sure the website has a search bar, because people will use it. One of the first things you need is a getting started guide. Make sure you have a very simple, very easy-to-use getting started guide that lets people install the system on their laptop, play around, and click around. And also make sure to include a production guide and a troubleshooting guide.
The production guide will tell them, okay, this is how much you need to provision; for this scale, these are the tweaks that you need to make, and things like that. The troubleshooting guide will be like, oh, if you're not seeing any of your metrics, look at these numbers or look at these logs, and maybe this is the issue. A simple FAQ based on the common requests or common errors that you get would be quite nice. If possible, have a proper page that describes the config file and the different options, one that users can browse and search through. That would be super helpful for users. We did a good job automating the generation of this page from the config object that we have, and that page has been the go-to page for a lot of people. So if you can do that, that would also be nice. Basically, put the man page in the browser.

So documentation is important. Again, make sure you have all these guides and make sure you have a website.

The next one is something that we've learned the hard way, I would say. Make sure your project is extremely easy to run. When I say extremely easy, I mean make it really, really easy. Before May 2019, if you opened the Cortex website, it would show you this architecture diagram. Now this is a lot of moving parts and a lot of dependencies. You have to depend on a NoSQL store like Cassandra or Bigtable, a Consul, a Memcached, and possibly even an object store. There are a lot of these moving parts. If you are looking for a long-term storage solution for Prometheus and you look at this diagram, you would be immediately discouraged. But that was the reality of Cortex. And there's a good reason why the Cortex architecture is this way, why it's so disaggregated. Cortex runs at massive scale. We handle 20 million samples a second. We ingest and compress them.
And we need to make sure that every single component that we are running is highly available and horizontally scalable. There's a reason why every one of these components exists. And we've added one more component recently because the query frontend needed some additional features and needed to be scalable.

With that in mind, let's go back a little bit to understand how Loki was built. A lot of Loki's coding was done on transatlantic flights. One of the reasons that Cortex was a little hard to adopt, or a little hard to contribute to, was the development process: you make your changes, you write tests for them, you make sure all the tests pass, and then you have to deploy to a remote dev environment in the cloud. This means the dev cycle was slow. And if you did not have a dev environment set up, it was kind of hard to contribute. But in Loki's case, we made sure all of Loki runs as a single binary. You do ./loki and it just simply works. There are no dependencies. It works off of your file system, and you can test and play around with Loki on your laptop. This airplane mode, this simplicity of Loki, and it's the same for Prometheus, was a huge part of why Loki was adopted widely.

We've learned from that, and we focused on two things: making Cortex much simpler to run, and reducing the dependencies that Cortex has. Today, if you look at the Cortex architecture, it looks something like this. You do ./cortex. You don't need to run 10 different microservices. All the microservices sit in a single binary. You can run multiple replicas of it. We use gossip to communicate between these replicas, so you don't need Consul or etcd. And we don't even need a NoSQL store anymore. We just write directly to an object store. Object stores are really cool because, one, they're inexpensive, and two, there's virtually zero configuration that you need to do.
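To make the single-binary idea concrete, here is a minimal sketch in Go. It is not Cortex's or Loki's actual code: the module names, the `resolveTargets` helper, and the `-target` flag as written here are illustrative assumptions. The point is only that one executable can hold every component and decide at startup whether to run all of them together (the laptop case) or a subset (the scaled-out case).

```go
package main

import (
	"flag"
	"fmt"
	"sort"
	"strings"
)

// modules maps a hypothetical component name to the function that starts it.
// In a real system each function would start a server; here it only reports.
var modules = map[string]func() string{
	"distributor":    func() string { return "distributor started" },
	"ingester":       func() string { return "ingester started" },
	"querier":        func() string { return "querier started" },
	"query-frontend": func() string { return "query-frontend started" },
}

// resolveTargets expands "all" to every known module (sorted by name) and
// rejects any name that is not a known module.
func resolveTargets(target string) ([]string, error) {
	if target == "all" {
		names := make([]string, 0, len(modules))
		for name := range modules {
			names = append(names, name)
		}
		sort.Strings(names)
		return names, nil
	}
	names := strings.Split(target, ",")
	for _, name := range names {
		if _, ok := modules[name]; !ok {
			return nil, fmt.Errorf("unknown module %q", name)
		}
	}
	return names, nil
}

func main() {
	// Hypothetical flag: the default runs every component in one process.
	target := flag.String("target", "all", "comma-separated modules to run, or \"all\"")
	flag.Parse()

	names, err := resolveTargets(*target)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range names {
		fmt.Println(modules[name]())
	}
}
```

Running the binary with no arguments starts everything, which is what a first-time user wants. Passing something like `-target=ingester` to the same binary is how an operator at larger scale could still split the components apart.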
So when we made these changes, the first one, moving to a single-binary mode, was actually a simple one. That alone made the number of people who tried Cortex increase drastically. And then once we removed the NoSQL requirement, we saw bigger and bigger users start to look at and adopt Cortex. So if you are an open source project, make it really simple to run. Just create a single binary or a single point of entry where people can run that one thing and everything just works. Try to reduce the number of external dependencies that your project has.

One of our adopters, Opstrace, wrote a blog post on why they adopted Cortex. They were like, yes, we were looking at possible solutions and we realized that Cortex needed a database like DynamoDB or Cassandra. But then one of our community members came in and contributed a new storage engine, based on our sister project Thanos, that simply uses object storage. And we spent a year optimizing the engine, contributing to Thanos, Prometheus, and Cortex, to make sure that Cortex now only depends on an object store. This has increased our adoption massively.

So, yeah, simplify running your system. There's a good reason Cortex has all those microservices. At the scale that we run, we still need a lot of those microservices. But a lot of people don't have that scale, and they can run a much simpler version of Cortex. So if you can simplify your project that way, try to do so. It will improve adoption, it will make things easier to develop, and it will also improve contributions.

One more thing I do want to cover is: be accessible. So what does this mean? I have a huge issue with Slack for community support. The problem with Slack is that people ask you a question, you answer it, and then 20 days later, when someone else has the same question, they ask it again and you answer it again. There's no searchability here.
A Google Groups mailing list is really good, which is what we use in Prometheus. When somebody asks a question and you answer it, it's persisted on Google. So the next time somebody Googles the same error or the same issue, they will find your answer. While that is really cool, when your project is just starting out, it's really important that you have a synchronous chat system, something like Slack. And this was even true for Prometheus. When I was using Prometheus five years ago, when it was 0.8 or something, I initially asked my question on the mailing list and I got an answer. The answer was not very clear. I immediately jumped on to IRC, which is what Prometheus uses for community support, and I started asking the maintainers my questions. They could answer immediately, without the latency of a mailing list. That basically helped me fix my Prometheus issues, and it also helped them understand some of the common problems people were facing. So if you can have a synchronous chat system combined with a mailing list, do that.

Having said that, make sure questions don't go unanswered. When a contributor asks a question and doesn't receive a reply within a few days, it's really discouraging. And when you have a ton of those questions, and somebody new joins the community or the channel and sees a ton of unanswered questions, that's very discouraging too. So make sure all the questions that are being asked are answered. The caveat is that you should expect to spend quite some time on this in the beginning. Something like 10 to 15% of your time will go into answering these questions. It's important, and you will be doing it for a few months, but it will give the community such a good experience that once they have your system up and running, they're going to stick around and answer questions themselves, because they want to contribute back and that's the easiest way they can contribute back.
So yeah, in the initial months after you launch, or if you're a newer project, spend a decent amount of time answering questions on Slack or whatever chat option you choose.

The other thing I want to mention is: have regular office hours or a community call, where the maintainers just jump on for 30 minutes or an hour, whichever works for you, give the highlights, like what's in the new release and what bugs were fixed, and then open up for questions from the community. It's really important for the community to have a space where they can talk directly to the maintainers with video on. I would highly suggest that every project that doesn't have a community call start one that is at least 30 minutes, at least once a month. That's really useful. In Cortex, we have one every three weeks. We have new users jumping on all the time telling us about their use cases, and we also have all the maintainers from different companies and time zones talking about different issues that they're seeing and potential fixes. So that's really cool.

Having said that, I must warn you: the moment you launch a community call, unless you're a big project, you won't have people joining it. It will be quite lonely for a month or two. I would say it will take something like five community calls before people start showing up. But you have to consistently advertise it. Post on Twitter that, okay, this month's call is at X time tomorrow, and things like that. The reason for this is that people who are using your project will keep seeing these updates, and they will probably join a few calls later. If you cancel the calls because there's no agenda, people won't realize that there's even a call. So for those 30 minutes, just hang out with your maintainers and talk about random stuff if nobody shows up. But yeah, try to have regular office hours or a community call.
The other thing I want to mention is that we are doing one for Grafana Tempo, which is another open source project that we launched. Initially, I don't think there were a lot of updates, but recently a lot of people have been jumping in and showing us how they're using Tempo. So have a community call if possible.

This is another big thing: dedicate time for outreach. By outreach, I mean make some noise about your project, do some marketing for it. Don't do vendor marketing, but market your project. Write regular blog posts. Write a blog post every month if possible and share it on social media. Twitter is a good medium for this. Now you might wonder, hey, what blog posts should I write? Write about the new features that came out in the last month. Or if there was a release, write about what's new in the release and why that release is important. Or if you found some nice performance optimization, write about that. There are many things you can write about. Write about the different use cases people can use your project for. But make sure you continuously ship blog posts about your project. This is because in the future, people are going to Google their problems. Say people are going to Google multi-tenancy in Prometheus. I'm pretty sure there are five or six blog posts that we wrote about Cortex that are going to show up. So they're going to read about Cortex, and then they're going to try things out.

So yeah, always try to publish a lot of blog posts. I prefer blog posts over conference talks and meetups because they show up on Google, people can search for them, and text is easier to consume. Having said that, always keep applying to conferences and also to meetups. Good conference talks at good conferences will actually give you a lot of early adopters, a lot of useful feedback, and contributors.
Meetups are the same, and I kind of prefer talking at meetups over conferences because the setting is a little bit more intimate, even if it's an online meetup. I feel like there's a lot more engagement, and there's a higher chance of people going back and trying and running the project after a meetup than after a conference.

A couple of pieces of advice I can give here. Have a set of slides ready, like a simple Cortex 101 or an introduction to Cortex deck, that is public and that people can just copy to give talks at their local meetups or local conferences. This way it's not just you; your community is also empowered to talk about the project that they really like. And one of the things that we've done at Grafana Labs that I think is super cool and has worked really well is that we have an OKR to publish a blog post or a talk per person per quarter. This essentially means that if you have a team of three people, you already have a blog post or a talk per month, which is really cool. So if you can have OKRs that have to do with outreach, try to add them and see how that works.

The good thing about Cortex is that it's not just one company using it. A lot of companies are using it, and each of them is writing Cortex blog posts, which is really nice. Here we have an example of a blog post from Weaveworks, one from AWS, and one from Grafana Labs. And you can see all the different kinds of blog posts. The first one is about a performance optimization. The second is about AWS's journey of adopting Cortex. And the third one is about what we did in Cortex in 2020.

This next one is actually my favorite advice of all. One of the things that I really like to do is work closely with a few users. This gives you a lot of insight, and it helps you improve the project a lot. So what do I mean by working closely? If possible, have a weekly call with the user.
This can be right after you launch, when you're struggling to find adoption and you find a user who seriously wants something like your project or wants to try it. If they're super keen on it, set up weekly calls with them, see what they're struggling with, and prioritize their requirements over yours. If possible, even create a shared Slack channel, because they might be trying it at different times. If they can have access to you over DMs or a Slack channel, some synchronous communication channel, that would be super useful. And this will actually help you build the right product.

I've done this a couple of times for a couple of projects, and it need not even be right after you launch your project. Let's say you're working on a large feature, a huge change, or a huge performance optimization, and the kind of problems that you have are different from the problems that users have. Whenever we're doing something big, we try to find a beta user who would be okay testing it out in their environment and giving us regular feedback. This works really well for improving documentation, because they're running the project on their own systems and they're going to hit issues. It will help you understand the struggles that end users face and the problems your project actually has that you need to fix so that the adoption story gets better.

One caveat: make sure the end user that you're working with is as invested as you are. You might want to help them, but they might not have the bandwidth to do this. So it's kind of hard to find these users. This is why I like meetups and conference talks, especially meetups. You give a talk, and then you'll likely find one person who's super keen on trying this and testing this because they have the same problem as you.
It also happens at conferences. For example, I've done this with Gojek. Gojek is an Indonesian company, but it has engineering offices in India. I gave a conference talk about Cortex, and they were like, oh, we're having this issue with scaling InfluxDB and we really want to try Cortex. So I was like, hey, this is amazing, I really want to help you. Can I come to your office to give you a tailored presentation and look at your problems? They said yes. I visited their office a couple of times, and they had a direct connection to me. As they were adopting Cortex, we had to improve our documentation, publish some of our configs to fix issues at scale, and things like that. I had a lot of fun working closely with the team and improving the documentation and the adoption story over a couple of months. And one of the really cool things is this: this is Ankit, you see him in the picture. He was part of the Gojek team, and he has since moved from Gojek to a company in Berlin. He moved to Berlin, and now we are really good friends and we hang out all the time. Open source has helped me make a lot of friends, and Ankit is just one of those amazing friends I made through open source, which is super cool.

So if possible, try to work with one or two companies. You won't get a lot out of it, but it will help you improve your product and build the right features. And also ask them if they want to do a conference talk or a case study with you afterwards.

That brings me to my final piece of advice: do a case study. Case studies are documents where end users outline the problems that they faced, the requirements they had for the solution they were going to pick, why they picked your project, or in our case Cortex, what challenges they faced and how they overcame them, and what benefits Cortex or your project gave them.
We have a few case studies, and we actually have three more in the pipeline that I'm going to publish. These are super useful because it turns out other people read the case studies. The first one we did was Gojek, and we had a couple of users read the Gojek case study who were struggling with InfluxDB as well. They read it and they were like, oh, this is great, Cortex works for an actual company, and here's an end user case study. We've done a couple more of those, we have a couple more in the pipeline, and they're really, really useful. So if you have a very interested end user, do a case study with them.

Having said that, make it really easy for them. If you ask an end user to write a case study, it most likely won't ever happen. So what we do is jump on a call with them. We have a technical writer interview them based on a template that we create, we record it, and then we do the write-up ourselves and share it with that company. Once the end user has sign-off from legal, we publish it. This way, all the end user has to do is jump on a call for an hour and talk about their journey, which most people are happy to do. So if you can do the interviews and write up the case studies for approval yourself, do that. It's much easier for the end users. And having one case study is much, much better than having no case studies. If you already have companies that you're working closely with, try to use them for the first case study.

One caveat: make sure the case study is approved by the company's legal team before you do it, because sometimes the person who's running your software might be really interested in doing a case study but might not have the right approvals.

And continuously encourage your end users to submit conference talks and to write blog posts.
End user talks are much more appreciated and have a lot more nuance than vendor or maintainer talks. Again, I want to reiterate: if you already have a Cortex 101 deck, that is also super useful for end users. It reduces the amount of time they have to put in.

To sum up: have good documentation and a good website. Make the project really, really easy to run. Be accessible to your users, especially in the early days of the project. Have a synchronous communication channel. Have a community call. Have office hours. Spend a significant amount of time, 20% even, working on community and outreach. Write blog posts. This is how you're going to get adoption. Work closely with a few users to refine your product and your features. These users will become champions for your product in the future, and that will drive more adoption. And leverage your end users to generate more content, and make sure to have some case studies. If possible, write the case studies yourself after doing an interview with them, and then send them for approval.

Any questions so far?