Hi, good morning everyone. How are you doing today? Great. I hope your day one at Rootconf was exciting and you learned something new. I'd like to kick off day two with a topic that's close to home: the culture and strategy of SRE, the things we did at my current company to build the SRE team from the ground up, and the decisions and solutions that helped us build a good team that could scale with time.

Let me start with a brief outline of what I'm going to cover today. First, I'll give you an overview of the scale at Trusting Social: what we do, the data we deal with, and what our scale looks like. Then the essence of SRE according to my experience, and some war stories from Trusting Social that cover five main topics: why we have to reliably manage infrastructure (I'll start with infrastructure management), the art of access control, how to make debugging easy and why you should remove the admin from administration, why you have to automate certain tasks and how you know the right time to do it, and when not to manage your storage yourself versus how to manage it. These will be the key takeaways today, so let's start with the talk.

So, what do we do at Trusting Social? I had the opportunity to be part of the SRE team ever since it started, which was last year, and I find myself in a position to share my experience; I hope you can take some key points away from it. We started off as a four-member team, and in a span of eight to nine months we expanded to 15 people. We operate out of four countries, mainly in Southeast Asia, and we have production code running in five data centers, expanding with time. We work across three cloud providers, and there are a dozen products and services running right now. We ingest around one TB of metrics per week, so there is a fair amount of scale that we're talking about, and the team has stayed the same ever since its inception — attrition has been very low.

When I speak of the essence of SRE — many people ask what an SRE is, or what a site reliability engineer actually does — I think we need to get back to the core principles, and three things come to my mind. The first is automation. As site reliability engineers, we are basically the bridge between developers and operations, and in order to make development easy and to streamline operations, we have to come up with workflows and processes that automate the entire flow. That's where automation comes into the picture. The second is maintainability. Let's say Trusting Social grows tomorrow; in any company that's growing fast, the main problem is scale, and to manage that scale you need to be sure the processes you've put in place can be applied repeatedly, that they can bring up infrastructure, and that deployments stay fast no matter what your scale is. That's when you need maintainable workflows. The third is visibility. As site reliability engineers, and in a DevOps culture generally, we have on-call schedules and everyone keeps rotating; you won't have one particular engineer dealing with one product.
So in order to make rotating on-call easy, you need more context into what the issue was and how to solve it if it comes up again. For that, you need visibility into your systems: you need to know how the services talk to each other, and you need to know what the first on-call engineer did. That's where visibility comes into the picture.

So let's go through the five things we did ever since we started at Trusting Social. The first thing I'm going to talk about is infrastructure management — I think that's the core of site reliability engineering in any case. If you have to take code to production, you need infrastructure, you need different environments such as production, staging and dev, and you have to manage all of that in a way that is automated rather than manual. You have to be able to reliably ensure that infrastructure is brought up and maintained. What people usually end up doing is infrastructure as code; there's a tool out there called Terraform, and people quite commonly use Terraform to bring up infrastructure these days — I think most of you have worked with it. We also started off using Terraform. Terraform is just a CLI that brings up infrastructure in the cloud, but there were certain problems we identified with this process. If I have to reliably maintain my infrastructure, there were a few key things I needed: I wanted to manage concurrent access to the infrastructure; I wanted to know which engineer brought up which infrastructure; I wanted access control on top of it; I wanted retries and rollbacks, which in Terraform are quite manual, done in an automated way; and I wanted the ability to maintain versions of the infrastructure over time, so that I know when my infrastructure changed and how to manage it.

For this, we thought of a small data model. We identified the key pointers that identify any piece of infrastructure we bring up. Trusting Social operates in the cloud and in data centers, as I mentioned, across different cloud providers, so the provider or data center is one key. I would also identify the infrastructure by its intent — whether it's dev, staging or production — and by the region it was brought up in. These three key fields — the cloud provider or data center, the intent, and the region — helped us identify and segregate infrastructure, and that's how we knew which product went into which workspace. So the first thing in the data model was the workspace, which tells us which cloud we're using, which region, and who can access it; that's also how we were able to put ACLs on top of it. Then there's a layout, which is the set of Terraform files that define what machines I brought up, what the flavor was, and the other key attributes of the infrastructure. We maintained versioning on the layout files, and we could also add success and failure hooks: if the infrastructure comes up successfully, go ahead and run a playbook that provisions tools on the new machines; if it doesn't, clean it up gracefully and roll back.
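Roughly, you can picture that data model like this — a minimal sketch in Go with illustrative field names, not Tessellate's actual schema:

```go
// A rough sketch of the workspace/layout data model described above.
// Names and types here are illustrative, not Tessellate's actual schema.
package infra

// Workspace identifies where infrastructure lives and who may touch it:
// the cloud provider or data center, the intent (dev/staging/production),
// and the region. ACLs hang off the workspace.
type Workspace struct {
	Provider string   // e.g. "aws", "gcp", or a data-center identifier
	Intent   string   // "dev", "staging", "production"
	Region   string   // e.g. "ap-southeast-1"
	Allowed  []string // engineers or teams allowed to apply in this workspace
}

// Layout is the versioned set of Terraform files that describe which
// machines to bring up, their flavor, and other key attributes.
type Layout struct {
	Workspace Workspace
	Version   int               // bumped on every change, so state can be rolled back
	Files     map[string]string // Terraform file name -> contents
	Hooks     Hooks
}

// Hooks run after an apply: on success, provision tooling on the new
// machines; on failure, clean up gracefully and roll back.
type Hooks struct {
	OnSuccess []string
	OnFailure []string
}
```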
So these were the things we identified as a solution for reliably maintaining infrastructure, and the first tool we built was called Tessellate. What is Tessellate? It has two parts, a client and a server. The Tessellate client is a small binary that sits on your local machine, or on one central machine accessible to everyone, and you invoke it with CLI commands to apply, destroy or perform other cloud operations on your infrastructure. Along with that, you provide fields such as the workspace, the layout, and the webhooks you want to add. If I have to describe the workflow: we submit these JSON files via the Tessellate client, which talks to the server; the server schedules these infrastructure jobs, and using a scheduler it brings up the infrastructure. The backend we use is Consul — that's where we maintain versions, rollbacks and state. So that was the first thing I wanted to talk about: how to reliably maintain infrastructure.

The next problem we faced was controlling access across the different deployments that go out. Being a team of 8, now 15, people, we have to give everybody the same kind of access for any deployment that goes out, and there were certain mess-ups that kept happening — certain incidents that made us question how to streamline the process and make sure the right service gets deployed. For this, I'd like to give some background. For scheduling and bringing up services in production we used something called Nomad, and the reason was that we wanted homogeneity across all the deployments that went out. Whether it's a virtualized environment or a Dockerized container service, we wanted one wrapper that worked well for us and made managing deployments easy. That's why we used Nomad: it matched our scale and gave us homogeneity. Now let's say there's an SRE who's on call, there's a new deployment coming out, and he wants to deploy service A, but accidentally, because of some problem, he ends up deploying another service or breaking something. If we have to ensure these things don't happen, how do we do it? Should we just give SRE one access only to service A and not to service B? But then how do we manage rotating on-call schedules? So we decided there should be a mechanism that made sure two people were involved in any deployment: the on-call SRE would get a 2FA or OTP token from, say, the product owner, who knows that his service is going out, and if the SRE accidentally tries to deploy service B, the OTP won't work because it was issued for service A. This was the workflow we came up with; we thought it was great and matched our scale, so we went ahead and implemented it on top of Nomad, which we use for scheduling our services. So there's a product owner who uses a small CLI tool that we built in house, called TS2FA.
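A minimal sketch of that idea — not the actual TS2FA implementation — is below: the one-time code is derived from a shared secret, the service name and the current time window, so a code issued for service A will not validate for a deploy of service B, and it expires quickly.

```go
// A sketch of a per-service, time-limited deploy code (TOTP-style), using only
// the standard library. Not the actual TS2FA tool.
package twofa

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"time"
)

const step = 30 * time.Second // validity window of one code

// Code returns a 6-digit code for a given service at time t.
func Code(secret []byte, service string, t time.Time) string {
	var counter [8]byte
	binary.BigEndian.PutUint64(counter[:], uint64(t.Unix()/int64(step/time.Second)))

	mac := hmac.New(sha1.New, secret)
	mac.Write([]byte(service)) // bind the code to one specific service
	mac.Write(counter[:])
	sum := mac.Sum(nil)

	// Dynamic truncation down to 6 digits, as in HOTP/TOTP.
	off := sum[len(sum)-1] & 0x0f
	v := binary.BigEndian.Uint32(sum[off:off+4]) & 0x7fffffff
	return fmt.Sprintf("%06d", v%1_000_000)
}

// Verify checks a code for the given service, allowing one step of clock skew.
// A proxy in front of the scheduler would run this before dispatching a job.
func Verify(secret []byte, service, code string, now time.Time) bool {
	for _, d := range []time.Duration{0, -step, step} {
		if hmac.Equal([]byte(code), []byte(Code(secret, service, now.Add(d)))) {
			return true
		}
	}
	return false
}
```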
The product owner generates an OTP for his product, and the SRE who has to deploy the service in production asks him for that OTP and uses it, along with Nomad, to deploy the service. This way two people are involved in every deployment, and the 2FA token ensures everyone is on the same page and mess-ups don't happen. The way we did this was by building a small proxy in front of Nomad, which reads the token, checks whether it matches the service that is about to go out, and if it matches, dispatches the job; if it doesn't, it throws an error. This worked really well for us in the early days and it still does. It helped us manage access easily — we didn't have to go out and hunt for different tools, we could come up with solutions and designs on top of the tooling we were already using.

The next topic is having the right information to debug a problem. Every time you want to see what state the system is in and what's been going wrong, there are multiple pointers or sources you could look at: logs, metrics, the pages you get. So how do you know what the right thing to look at is? How do you know what the first on-call SRE did, and what key steps they took when they were trying to fix a particular problem? To get that kind of visibility, you would have to manually talk to the person who fixed it, or look at documentation, which can be hard to maintain when you're just starting an SRE team. What used to happen initially was that if three SREs were accessing different networks, they would always access them via one admin account, and using that admin account they would go and fix things. What I'm talking about here is any kind of manual effort taken to debug a problem: in this situation it's very hard to track down why a particular command was run and what the intent behind that action was, and we don't even know who did it — so how do we know who's the person who knows how to solve it? Context was lost for any recurring issue, and you could never trace actions back to the person who performed them. For this, we used LDAP, a protocol that helped us gain more visibility into our systems. All the SREs connect through a common VPN client before they access their networks, and the VPN client helps us establish the identity of each SRE because it's tied to their email. There is a central network running an LDAP server that holds the list of all the users, and whichever machine an SRE accesses has an LDAP client running that connects to that server. That's how we know that SRE one ran a particular command because he wanted to do a particular thing, and that's how LDAP gave us better debuggability. We were able to track which SRE did what, and recurring issues became easier to solve because the context could be maintained over time. This was the first step we took towards gaining more visibility.
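The talk doesn't go into how the boxes are wired up (typically the LDAP client hooks into the login path so every shell session is attributed to a named user), but the principle — authenticate as a named engineer against the directory instead of sharing an admin login — looks roughly like this sketch using the go-ldap client, with placeholder host and DN values:

```go
// A sketch of authenticating a named engineer against the directory instead of
// using a shared admin account. Hostname and DNs here are placeholders.
package auth

import (
	"fmt"

	"github.com/go-ldap/ldap/v3"
)

// Authenticate binds to the LDAP server with the engineer's own credentials.
// If the bind succeeds, we know exactly which person is on the box.
func Authenticate(username, password string) error {
	conn, err := ldap.DialURL("ldaps://ldap.internal.example:636")
	if err != nil {
		return fmt.Errorf("connect to directory: %w", err)
	}
	defer conn.Close()

	// Each engineer has their own entry; nobody shares an "admin" identity.
	dn := fmt.Sprintf("uid=%s,ou=engineers,dc=example,dc=org", username)
	if err := conn.Bind(dn, password); err != nil {
		return fmt.Errorf("bind as %s: %w", dn, err)
	}
	return nil
}
```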
After this, I'd like to talk a little about networks and the things that revolve around them. In any network with multiple machines — and this particularly applied to our use case, where our code was running in data centers and the machines were simply handed to us — very little control could be established by us, and all the firewall settings and access rules were already defined by the vendors. So how would we, as SREs, make sure that a particular port is accessible from another subnetwork? If I have to go and deploy a service, will I be able to? Will this database be accessible from another subnetwork? Can a machine reach an HTTP port, a port where TCP runs, or a port where a UDP service runs, and how do I make sure all of this works? One way is to go into the machine and run a Linux utility that checks the ports and returns the result, but you would have to do this for every permutation and combination, which means that with n machines in a network there's a complete mesh of ports you need to check, and doing that manually is tedious. There would also be manual error, back and forth, and once again the problem of maintaining context, which we didn't want.

To deal with this problem, we wrote a small utility called extra. Depending on which protocol you want to check, and which ports should be accessible across the subnetworks, you provide it a file in JSON format, and it checks whether those protocols and ports work. For example, on machine A there would be an extra server running, which starts a process on all of those n ports for that particular protocol, and on machine B the extra client runs and checks whether all those processes, all those ports, are reachable. This is an automated way of doing it, it worked really well in a mesh kind of environment, and it made sure the results were recorded and could be carried forward, again in an automated way. So, how extra works: there's a server and a client. The server takes as input the JSON and the host IP of the machine it has to run on, and the extra client is where you say which machine all the ports have to be checked from. Again, a small piece of utility that helped get us started.
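The core of such a check is small. Here's a sketch — not the actual extra tool — that dials a list of host/port pairs over TCP from the machine it runs on and reports which ones are reachable:

```go
// A sketch of the port-reachability idea: from the machine this runs on, try
// every target host/port pair over TCP and report which ones are reachable.
// Run it once per machine to cover the full mesh. Not the actual extra tool.
package main

import (
	"fmt"
	"net"
	"time"
)

// Target is one endpoint to verify, e.g. a database port on another subnet.
type Target struct {
	Host string
	Port int
}

func main() {
	// In the real setup these would come from a JSON file per protocol.
	targets := []Target{
		{Host: "10.0.1.15", Port: 5432},
		{Host: "10.0.2.20", Port: 8080},
	}

	for _, t := range targets {
		addr := fmt.Sprintf("%s:%d", t.Host, t.Port)
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("UNREACHABLE %s: %v\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("ok          %s\n", addr)
	}
}
```

A plain dial only covers TCP; for UDP and application-level checks you need something listening and answering on the far end, which is why the tool described above runs a server process on the target machine.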
The next problem, or incident, that came up was sharing storage across machines. Whenever you have a service, or a set of API services, running in a reliable manner, you want all of them to write to the same database or access the same stateful service. Since most of our deployments go out as Dockerized containers, you would want a Dockerized solution for your stateful services as well, and maintaining that reliably was an overhead with its own challenges: you would probably have to manage the clusters and the different databases, and you would probably need a dedicated DB admin of sorts, which at our scale we didn't think was the right choice. So we explored a bit more and went through the problems we had. We wanted a solution that was platform independent and that treated Dockerized containers as a first-class citizen. That's when we came across a tool called Portworx. Portworx helps ensure that stateful services can be reliably managed in Dockerized containers: it basically lays out a volume across machines, which is then used by the database running on one of them. Portworx was something we felt fit us well — we didn't have to go out and write our own tooling — and that's why we picked it for sharing storage.

Let's quickly go back to the essence of SRE that we saw at the start: automation, maintainability and visibility. These were the problems we identified and fixed early on, when the scale of Trusting Social was still small, and today, where we are now, we haven't really had to change our tooling or our stack. We keep using the same things even as scale keeps coming in. My point is that if you look at these essences of SRE and keep certain key things in mind, the culture you build is what helps you identify the loopholes. In our initial days we spent a lot of time on solutioning, discussing where a particular system could break, what the failure points were, and what the risk factors were; identifying those helped us come up with the solutions and tools that today help us scale better. Tooling is something that will naturally follow if you have the right mindset, and that's what I mean when I say that culture eats tooling for breakfast. We used technology for anything that was repeatable, and there were three questions we found ourselves asking every time: is this particular problem repeatable, and can the solution be applied repeatedly; is my solution reliable, or will it fail at certain points; and is this an economical solution, or is there a better way to do it?

Another thing: all of these tools that we designed in house — Tessellate, extra, TS2FA and so on — are open source. We intend to make sure that what worked for us can also work for you, and I hope you can go and check them out in our Trusting Social organization on GitHub, see how they work, and see whether they work for you. If you find value in them, you can always submit an issue or contribute as well. That's about it. That's my GitHub handle and my Twitter, and feel free to reach out to any of us at the booth — all of us are SREs and we can have a good discussion about solutions that might work for your problems. Thanks.

So, the difference between DevOps and SRE: DevOps is bigger — it's a discipline, more of a theory — and SRE is one part, a subset of DevOps, where there's a particular mindset you apply to your day-to-day activities. It pretty much overlaps, because SRE follows a DevOps culture in a manner, but over and above that, what SRE does is ask whether there's a better way to do something.
That's the mindset — you wouldn't really distinguish between the job profiles or the engineers as such; it's a particular mindset you need to build over time. That's my answer in a very short way, I guess.

Hi, Talina. Hi. Great job, lots of information packed in. A quick question: you said your SRE setup is pretty big — it's looking after a hybrid stack, it cuts across multiple cloud providers and a couple of data centers. How do you ramp up new engineers? Do you have a process — how do you onboard them, how do they ramp up?

Okay, so we keep improvising over time. It's always hard to provide that much context to someone who newly joins us, but I think the key factor is that we all sit in one room and context is usually discussed openly. We don't believe in keeping silos of information, which means anybody can just come up and ask, hey, how did you actually implement this, how does this work, how does this piece of code even work? They ask some of the core members who have been working on it for the past few months, and that's how context gets passed around. We also have a process of maintaining RFCs and proper documentation, which again is accessible to everyone. Another thing is that we try to involve all the SREs in the different products and tools we've built, and we keep switching context for more visibility. For example, we keep rotating on-call: you'd be on call for four to five days out of a week or two, we do morning and evening shifts, and it's separated by area — there's the runtime platform, which is the in-house tooling we've built, there's an on-call for the people who talk directly to products, and there are the people in operations. So there are three to four on-call rotations.

Hi. Although I haven't worked on it, maybe — it kind of gives you a map of the entire network. Yeah.

Hi. From my understanding there's a bigger aspect to SRE, which is application development and bug fixes as well, along with a number of improvements. Do you have an application team integrated with SRE — one team that encompasses both infrastructure and application folks — or do you have a separate application team with a clear boundary from the infrastructure team? Because this looks more like an infrastructure operations team, just leveraging DevOps tooling, CI/CD and a number of automations that are there in any normal infrastructure operation anyway. But in the core concept of SRE there should be one team, with everyone capable of doing everything, and a fixed number of hours or a lot of effort devoted towards improvements and enhancements in operations.

So when you say fixing application bugs, do you mean product-related bugs, or the internal tooling that we build? No, whatever product you are running on the platform — so you have one team that completely understands the infrastructure, plus understands the nature of the application, plus the business and the application as well, so they are able to take care of anything and everything, right?

Right. So, as the SRE team at Trusting Social, there are two aspects to it. One is the people who handle the business line and talk directly to the product owners.
If a bug ever comes up in production, they are on call, they communicate it to the product teams, and the product teams go ahead and fix it. We never really fix product-related bugs ourselves; we are mostly there to reliably maintain the product in live or production services. The second aspect is the people working on the tooling that helps those product lines — the people in operations — to make sure deployments happen faster and in a reliable way. The people who built the in-house tooling keep enhancing those tools based on the requests that come from the operations side. That's how it's usually managed. So we'll take the last question. Yeah, I think he can.

Yeah, hi. Is this working? Okay, cool. So, I head reliability engineering at Trusting Social, to answer your question — I don't know where it came from, who asked it? Oh, you did. All right, cool. So the SRE essence is exactly what you said it is. What we try to do is have multiple teams. One is a team responsible for understanding the entire company's needs, because we have around two dozen products running across multiple countries and data centers. Somebody has to keep the bar of excellence up and understand: these are the common challenges I see every day — how can I take this back into an engineering room and build something that works uniformly for everybody? The point is not to build unique solutions for every product, because then you end up maintaining large teams, each with its own onboarding; you build something common. That part is pure engineering. It's possible you'll find that a lot of what we do repeats things that already exist — you might say, hey, there's an open source tool for this, there's X for Y, there's Z for alpha, there's P for beta — but bringing all of it under one single roof is the first thing. Then there are special teams that integrate with the products. Their only job is to act the way an AWS person would work with you through a pre-sales cycle, so that your product gets onboarded onto the platform, and their entire job is built around customer empathy: if you are a product of mine and I want you onboarded onto my platform, which runs commonly for the entire company, that entire team is dedicated to it. Then there's another team that focuses only on how we can optimize operations, because traditionally what happens is that we see DevOps as, okay, we'll click a few things, we'll install X, we'll install Y, and we'll maintain it, and we tend to forget to take this back to the boardroom and keep extracting the benefits into a central tool chain — not at the cost of overengineering, but understanding how a product and a business can benefit from it. It's a cycle of teams working together, the end goal being that there's one single team, as we all call it. There are other factors that aren't covered here because the talk is short — I think this talk should have been an hour long, but they didn't give us a sponsor slot that long. Anyway, there are other aspects as well, like security. Nobody talks about security in DevOps, right? That is also one of the key ingredients. So there's a team dedicated only to thinking about whatever product is running out there on a public domain.
Is data being leaked? Is any PII information going out? Are the APIs robust? Am I leaking anything on a GET path that shouldn't be there, where our proxy is going to cache the request? So there's a lot more than just this — this is just the tip of the iceberg, I would say. If you want to know more, come see us at the booth.

So, that was our head of engineering for site reliability engineering, and he can definitely help answer any questions you have, even ones not related to SRE — you can just go have a chat with him.