My name is Veenu and I'm a software engineer on the cloud infrastructure management team at Twitter. I'm also a proud member of Twitter Women, Tech Women, and Women Who Code. Today I'll be talking about how we built a set of tools and processes, a framework, to improve the utilization of infrastructure resources. I'll start by sharing some history and context around the Twitter stack, then deep dive into Twitter's Chargeback, a service that helped us gain visibility and build accountability around infrastructure resource utilization. We learned a ton while building the Chargeback service, and I'm super excited to share more details on the next iteration of it, called Kite, our service lifecycle manager. It aims to address some fundamental challenges around tracking, managing, and attributing ownership of resources in a dynamic environment where microservices are deployed at scale. I'll close out with some impact and future work on both Kite and Chargeback.

Let's start with history and context. If you didn't already know, Twitter migrated from a large monolithic Rails service into microservices to address the developer productivity and reliability concerns of the time. This chart, although a bit old, is a great representation of the service graph that powers Twitter and its associated products. What you're seeing here is the RPC trace for a request to create a tweet. The highlighted service is our ingress service and reverse proxy, called TFE. It routes the request to numerous downstream services, which further fan out to even more services. If it hasn't already crossed your mind: Twitter has a lot of services. When I joined the company, I was in awe of the breadth and depth of the stack. Over time, as my team and I worked on resource allocation, utilization, and, more importantly, ownership attribution for these gazillion services, I transitioned to more or less this.

Anyway, Twitter has invested heavily in both in-house and open source technologies that power its core infrastructure primitives such as compute, storage, databases, etc. In fact, the focus has been on enabling multi-tenancy right from the start. This introduces unique resource allocation, capacity planning, and ownership challenges. Just to close the thought, here is a slide that captures, at a high level, Twitter's platform and infrastructure world. At the bottom-most layer we have data center management. We run Mesos/Aurora for long-running services, Hadoop for batch compute, and Manhattan for key-value storage. These services make up our core infrastructure primitives. Moving up the stack, we have services such as cache, observability, messaging, etc. built on top of these core primitives. So it shouldn't be a surprise that these services make up pretty much two-thirds of our entire server footprint. Twitter runs in multiple data centers and has hundreds of thousands of servers. Planning capacity for such large services is a challenge by itself. Given the footprint and the complexity around multi-tenancy, we started asking ourselves: How do we get visibility into the footprint of individual jobs and datasets within these shared platforms? How do we attribute resource consumption to teams or organizations? How do we incentivize the right behavior to improve the efficiency of resource usage in these platforms? Thus enters Chargeback. So what is Chargeback?
It provides the ability to track and measure infrastructure allocation and utilization of resources per service, per project, and per engineering team, to improve visibility and enable accountability. Let us look at three key features of Chargeback. I'm going to start with its capability to support a diverse set of shared platform services.

One of the first challenges for my team was to model a resource catalog for a set of disparate infrastructure services. It became clear that there was a need for a consistent way to identify and inventory these resources. A resource in this context means either a physical or virtual entity that is consumed by applications, thus driving the overall capacity of the underlying infrastructure or platform service. We had to support resource fluidity: with the utility computing model, resources are no longer a permanent entity, from either the infrastructure owners' or the service owners' perspective. It was much easier to model capacity and understand costs when everything ran on bare metal machines. In the current service-oriented architecture, we need to treat resources in a more granular way, break them down into primitives like CPU, RAM, storage, reads or writes per second, etc., and create abstract resources as we go higher up the stack, for example tweets per second. Infrastructure resources also evolve over time, and our model captures this. The second challenge in supporting a diverse set of infrastructure services is the need to track clients and ensure they are mapped to owners. Without correct ownership, it is difficult to ensure accountability. I'll talk about how we solved this problem later in the talk.

Let's look at the entity model. The unit price of a resource is captured in the offering measure cost; in our case, we have a process to review and change the unit price every quarter. An offering measure captures the resource itself, for example core-days or server-days. This alone cannot define what we are truly measuring and costing out; there is a need for context around the resource, in the form of the offering, the infrastructure service, and the provider. Providers can represent either internal data centers or public cloud providers. Here is an example for Aurora: Aurora offers compute, which is measured in core-days, where compute is the offering in this case. By contrast, an infrastructure like Hadoop offers clusters that store and process data as separate offerings, with measures such as GB, RAM, number of accesses per day, etc. We were able to onboard other infrastructures like Manhattan and Blobstore in a similar way, as our entity model can represent these disparate infrastructures consistently. We worked with each individual infrastructure team to identify the resources, and subsequently worked with capacity planners and the finance team to define a unit cost.

Now that you have seen our entity model, let us look at how it translates into APIs. Here is our measures endpoint, which tracks each resource per infrastructure. We defer to the infrastructure service owners to define a resource, and recommend they add ones that have an impact on overall capacity.
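To make the entity model concrete, here is a minimal sketch in Python of how the entities described above could hang together. The class and field names, and the example price, are my own illustrative assumptions, not Chargeback's actual schema.

    # A minimal sketch of the Chargeback entity model described in the talk.
    # All names and numbers here are hypothetical.
    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass(frozen=True)
    class Provider:
        name: str                       # internal data center or a public cloud

    @dataclass(frozen=True)
    class InfrastructureService:
        name: str                       # e.g. "aurora", "hadoop", "manhattan"
        provider: Provider

    @dataclass(frozen=True)
    class Offering:
        service: InfrastructureService
        name: str                       # e.g. "compute", "storage"

    @dataclass(frozen=True)
    class OfferingMeasure:
        offering: Offering
        unit: str                       # e.g. "core-day", "GB-day", "accesses/day"

    @dataclass(frozen=True)
    class OfferingMeasureCost:
        measure: OfferingMeasure
        unit_price_usd: float           # reviewed and revised every quarter
        effective_from: date            # unit prices are time-varying dimensions
        effective_to: Optional[date] = None

    # Example: Aurora offers compute, measured in core-days.
    aurora = InfrastructureService("aurora", Provider("internal-dc"))
    core_day = OfferingMeasure(Offering(aurora, "compute"), "core-day")
    price = OfferingMeasureCost(core_day, 0.50, date(2017, 1, 1))  # made-up price

Under a model like this, Hadoop would carry its storage and its processing as two separate Offerings on the same InfrastructureService, each with its own measures, which is how one catalog can describe very different platforms.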
Next, let us revisit the question we had earlier: how do we incentivize the right behavior to improve the efficiency of resource usage in these platforms? We decided pricing is one way. So here is an example of how we price Twitter's compute platform, Aurora. We started defining unit cost from the bottom up, that is, from the bare metal layer where we incur the true cost of buying physical machines. We calculated a cost per server per day, which includes the CAPEX and OPEX costs of machines, licenses, headroom, utilities, inefficiencies, and the human cost of operating that resource. It is important to keep in mind that TCO evolves over time to account for depreciation; in order to track these changes, our entity model is designed to capture these unit prices with a time-varying dimension. These form the basis for calculating other service unit prices as we move up the stack. Next, we looked at the total used cores in Aurora. This is a combination of cores used by both production and non-production jobs; we use the default container stat metrics exported by Mesos. Next, we account for excess cores, which are due to incorrect container sizing and extra quota allocated to absorb, say, traffic spikes. The goal here is to incentivize folks to right-size their jobs and to request just enough quota, and, if possible, to tax them for holding excess cores. Finally, the unit price also accounts for cores used by the compute team themselves for testing, operations, and maintenance. We treat this as a general tax that is distributed equally across the pricing of all the charged-back cores. I know that's a lot of components to consider for a unit price computation, but we wanted to keep the cost true to the model, not some funny money we charge out to customers to drive resource utilization.

The second feature is the metering service that tracks utilization of resources. We worked with every infrastructure service owner to export their resource usage data. These are classic ETL pipelines that extract data from different sources, for example the Aurora scheduler or the Hadoop master, and transform and load it into a generalized Chargeback schema called the raw fact. Our contract is pretty simple: the raw fact expects the client identifier, the offering measure, the volume, additional metadata such as the job environment or directory paths when it comes to file accesses, and the timestamps. This is then processed by the data transform service, whose job is twofold. First, it resolves the ownership of a client identifier in a raw fact to an appropriate team or organization. Second, it uses the offering measure and volume to compute the total cost. The transformed data is stored in our resolved fact table. Before being consumed by the reports, the data is validated by the data fidelity service. Data fidelity is very important to us, to ensure that the usage and the cost stay true to the cost of ownership, so we invested early on in a system that looks at the data and sends out alerts when there is missing data and/or huge deviations in usage or cost. The system is currently being extended to notify service owners directly when a given project shows a huge deviation, so they can take immediate action to resolve any issues. The resolved fact table is the centralized source used to generate a variety of reports.
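To make the raw fact contract concrete, here is a hedged sketch of what a single transform step could look like. The field names, the ownership map, and the price lookup are my illustrative assumptions; the real system is a set of ETL jobs, not one function.

    # A sketch of the two jobs of the data transform service: resolve ownership,
    # then price the usage. Names and structures are hypothetical.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class RawFact:
        client_id: str           # infrastructure-scoped client identifier
        offering_measure: str    # e.g. "aurora/compute/core-day"
        volume: float            # units consumed on this day
        metadata: dict           # e.g. job environment, directory paths
        day: date

    @dataclass
    class ResolvedFact:
        team: str
        offering_measure: str
        volume: float
        cost_usd: float
        day: date

    def transform(fact: RawFact, owners: dict, unit_prices: dict) -> ResolvedFact:
        # 1) Resolve the client identifier to a team or organization.
        team = owners.get(fact.client_id, "unknown")   # unknowns get flagged later
        # 2) Cost = volume times the unit price effective on that day.
        cost = fact.volume * unit_prices[fact.offering_measure]
        return ResolvedFact(team, fact.offering_measure, fact.volume, cost, fact.day)

And going back to the Aurora pricing walkthrough above, the unit price arithmetic amounts to something like the following. The talk only names the components, so the exact treatment of the excess-core and general taxes, and all the numbers, are assumptions on my part.

    # Illustrative core-day unit price: fleet TCO spread over chargeable cores.
    def core_day_unit_price(server_tco_per_day: float, servers: int,
                            used_cores: float, excess_cores: float,
                            ops_cores: float) -> float:
        total_daily_cost = server_tco_per_day * servers
        # Excess cores stay chargeable -- holders are taxed for them, which is
        # the incentive to right-size. The compute team's own ops/testing cores
        # are not billed to any customer, so their cost is spread across all
        # charged cores: the "general tax".
        chargeable_cores = used_cores + excess_cores - ops_cores
        return total_daily_cost / chargeable_cores

    # Made-up numbers: 1,000 servers at $10/day TCO.
    print(core_day_unit_price(10.0, 1000, 40_000, 5_000, 1_000))  # ~$0.23/core-day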
That's a perfect segue into the reporting capabilities of Chargeback. We have numerous internal customers, and each has different requirements. For instance, the infrastructure service owners require information on overall cluster growth and client growth over time to help plan capacity. They also use this data to figure out how each client is doing in terms of allocated capacity versus utilization. We generate P&L reports for all the infrastructure services that have been onboarded into the Chargeback system. These show the income and expense per infrastructure service and tell us whether the service is efficient. For instance, we don't want these services to make a profit; instead, they should be close to zero margin. If a service is on the profit side, we know it may be due to some optimization on the infrastructure side, and we can subsequently reduce the unit prices for everyone's benefit. If it's on the negative side, we know the infrastructure service either has excess capacity or some inefficiency, and we need to identify how to improve that.

Our other customers are the service owners and developers. Here is a view of team spend per infrastructure. We provide detailed bills per team and per org across the various platform services. Each section is a platform service with its offerings listed; we share usage numbers and show the overall cost. For obvious legal reasons, we cannot show real numbers. Although this gives a good understanding of how a team is spending on each infrastructure service, this view is not good enough when you want to work on optimization projects. So we created a drill-down view of project-level spend, which drills down even further to client identifiers. So if you see a cost being incurred in Aurora, you can find the exact project and the exact job incurring that cost and judge whether it is necessary. It is much more effective for engineers to know the exact dataset or job contributing to their spend.

To summarize, Chargeback supports a diverse set of infrastructure services, has the ability to meter resources at daily granularity, and provides primitives to build sophisticated reports.
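As a toy sketch of that zero-margin check (the threshold and the report structure are my assumptions):

    # P&L health check: income charged to customers versus the true expense of
    # running each infrastructure. Numbers and threshold are made up.
    def pnl_health(income_usd: dict, expense_usd: dict, tolerance: float = 0.05):
        for service, income in income_usd.items():
            margin = (income - expense_usd[service]) / expense_usd[service]
            if margin > tolerance:
                yield service, margin, "profit: consider lowering unit prices"
            elif margin < -tolerance:
                yield service, margin, "loss: excess capacity or inefficiency"
            else:
                yield service, margin, "near zero margin: healthy"

    for svc, margin, verdict in pnl_health(
            {"aurora": 10.8e6, "hadoop": 8.8e6},     # made-up income
            {"aurora": 10.0e6, "hadoop": 9.6e6}):    # made-up expense
        print(f"{svc}: {margin:+.1%} -- {verdict}")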
Here are some of the learnings from the system. One of the most important lessons is that the data must be trustworthy. As mentioned earlier, the Chargeback reports are used for a variety of use cases, so it is all the more important to ensure the data is accurate and trustworthy. We were also able to extend the service to notify us when there are data inconsistencies, for example a sudden increase or decrease, as I mentioned earlier.

Ownership accuracy of client identifiers is extremely important. When we first deployed the Chargeback system for a physical service, over 60% of the identifiers were attributed to unknown owners. Even worse, the ownership mappings were maintained in a spreadsheet, which goes out of date pretty quickly. We realized this was an unsustainable model and had to figure out a solution. One of the reasons the ownership problem is hard to solve is the lack of clear grouping primitives. Ownership mapping to teams seems too coarse, while ownership mapping to client identifiers seems too granular. There was a need for some form of logical grouping tied to owners. A project is the logical group we decided upon: a grouping of resources such as services, grants, ad hoc jobs, etc. It provides an easily manageable, perceptible unit.

Everything changes. As mentioned earlier, TCO evolves over time to account for depreciation, and projects and identifiers can be handed off to other teams as team priorities change. In order to track these changes, we had to customize the entity model to capture the change history of resources using time-varying dimensions.

How did we address and incorporate these learnings? Introducing Kite, a service lifecycle manager that simplifies the workflows involved in the creation and management of services. At a high level, Kite provides a single pane of glass for some common steps when building and launching services. This includes setting up a project, creating identifiers, provisioning certificates, acquiring capacity and quota to use infrastructure services, and finally, viewing cost and usage. We started working on this in Q4 of last year, and as of today we have integrated with eight infrastructure services, aggregated 10,000 identifiers, and created over 1,000 projects that are used by over 100 teams. The Kite V1 UI serves as that single pane of glass and allows every engineer in the company to manage their projects and their client accounts for infrastructure services, and to acquire quota for Aurora.

Chargeback exposed us to the difficulty of creating and managing ownership of client identifiers per infrastructure. One of the things we focused on early with Kite was building an identity system that explicitly captures org structure and exposes teams: group client identifiers into projects, associate teams with projects, and allow management operations on these entities. To expand on the model itself from the bottom: client identifiers are scoped per infrastructure, and they are associated with a service account. The backend that powers the service accounts is modeled to be pluggable; we use LDAP groups. Service accounts are primarily used for authentication and are expected to eventually be used for authorization as well. Client identifiers are logically grouped into projects. Projects belong to a team, which in turn belongs to a cost center or department, essentially some entity that can be responsible for the dollars spent. Here is an example. We have Revenue, which is one of our cost centers, and one of its teams is called Ads Prediction. That team owns several services, one of which is Prediction. Prediction runs under the service account ads-prediction. A service can have multiple service accounts based on its access control needs, for example one for each environment. The same model can support all the infrastructures, be it Hadoop or Manhattan or any other platform service. By default, every entity is a time-varying dimension. Thus, if a team were to get reorganized tomorrow, or a project moved from team A to team B, the entity model supports each of these changes, which means that at any point in time, as long as we have this data, we can say which team owned a project and what cost that team incurred on a specific day.
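Here is a minimal sketch of that time-varying ownership lookup, with hypothetical entity and field names:

    # Point-in-time ownership: "which team owned this project on day X?"
    # Names and fields are illustrative assumptions.
    from dataclasses import dataclass, field
    from datetime import date
    from typing import List, Optional

    @dataclass(frozen=True)
    class Team:
        name: str
        cost_center: str

    @dataclass(frozen=True)
    class OwnershipRecord:
        team: Team
        valid_from: date
        valid_to: Optional[date] = None   # open-ended for the current owner

    @dataclass
    class Project:
        name: str
        history: List[OwnershipRecord] = field(default_factory=list)

        def owner_on(self, day: date) -> Optional[Team]:
            for rec in self.history:
                if rec.valid_from <= day and (rec.valid_to is None or day < rec.valid_to):
                    return rec.team
            return None   # pre-dates the project, or an unknown gap

    # Example: the prediction project survives a reorg intact.
    ads_prediction = Team("ads-prediction", cost_center="revenue")
    ads_platform = Team("ads-platform", cost_center="revenue")
    prediction = Project("prediction", [
        OwnershipRecord(ads_prediction, date(2016, 1, 1), date(2017, 4, 1)),
        OwnershipRecord(ads_platform, date(2017, 4, 1)),
    ])
    assert prediction.owner_on(date(2016, 6, 1)) == ads_prediction
    assert prediction.owner_on(date(2017, 6, 1)) == ads_platform

Joining a lookup like this against the resolved fact table is what makes it possible to say what a team spent on a given day, even across reorgs.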
The impact of Kite has been enormous. For example, Kite currently manages over 10,000 identifiers. Here is the split by infrastructure service. Clearly, Hadoop and our observability (monitoring) service contain the largest sets of client identifiers, while databases have a small set. This allowed us to improve ownership significantly and reduce unknowns. Here is the UI that helps users claim ownership of unknown identifiers: our infrastructure services report the client identifiers through our metering pipeline, and owners can go and claim ownership of those identifiers. The identity service powering the Kite UI enables everyone in the company to search and discover owners for projects, client identifiers, etc. If you are a site reliability engineer and at any point you want to identify who a job belongs to, you can easily find out from Kite.

Chargeback data has enabled teams to understand their spend, attribute costs, and attribute peaks and troughs to actual events. For example, let me walk you through some of the things our own team was able to improve or understand over a period of time. About a year back, we realized we had some underutilized resources, and we were able to release those back into the resource pool. I mentioned earlier that unit prices are updated quarterly; that is the dip after the unit price update in Q2. Lastly, the increase in daily cost is around the launch of a new project, Kite itself, and the resources we requested for it. Similarly, you can see at the end, around November, where we launched another project, which caused the increase in resource utilization.

Kite also tracks additional metadata such as tier, monitoring dashboards, runbooks, PagerDuty URLs, and HipChat rooms per project, making it extremely simple for SREs, infrastructure service owners, and teams to build tools, scripts, and automation around this data. For example, the ops folks use our API to query for tier-zero systems, look up their SLAs, run queries to ensure everyone meets their SLA, and subsequently alert the correct HipChat room or email group.
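Here is a hedged sketch of that kind of automation. The endpoint path, query parameter, and response fields are my assumptions; Kite's actual API is not public.

    # Hypothetical use of Kite's API: find tier-0 projects and alert their
    # HipChat rooms when an SLA check fails. All names here are assumptions.
    import requests

    KITE_API = "https://kite.example.com/api/v1"   # placeholder URL

    def alert_tier0_sla_violations(check_sla) -> None:
        resp = requests.get(f"{KITE_API}/projects", params={"tier": 0})
        resp.raise_for_status()
        for project in resp.json():
            if not check_sla(project["name"], project["sla"]):
                # Each project carries its own escalation metadata, so the
                # alert routes itself -- no hand-maintained mapping needed.
                notify(project["hipchat_room"],
                       f"{project['name']} is missing its SLA")

    def notify(room: str, message: str) -> None:
        print(f"[{room}] {message}")   # stand-in for a real HipChat/email call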
In the previous graph, if you want more granularity than the team level, to know exactly how a particular project is spending over a period of time, our overview page has project-level spend as well. The idea is that, as an engineering manager or tech lead, you care about the team and how it is performing and utilizing resources over time, while as a developer working on one or two projects belonging to that team, you can keep an eye on how your project is performing over time.

We recently enabled quota management from Kite for jobs running in Aurora. A user need not go to a different user interface to acquire quota for their production service; they can acquire it directly from Kite and get a sense of how much it costs. This forces developers to be a bit more conscious when they design and plan the release of their systems, so they do not over-provision resources.

For future work, we are focusing on three things. The first one is resource provisioning. After the success of the Aurora quota manager, we plan to extend it to include Hadoop, storage, and other systems. This way, service owners have a single place to go to request and manage quota. The other advantage, as I mentioned earlier, is that right now it is a create-first, claim-after-the-fact model, but by provisioning resources right from Kite, we attribute proper ownership to the client identifiers right from the start, resulting in zero unknown cost and a conscious decision by developers when requesting any form of resource.

The second one is project deprecation. Similar to the production review process, it is important to have a proper deprecation review process. Kite provides the list of resources associated with a project, and with the Chargeback daily usage pipeline we can validate usage and notify users to release resources back into the pool. We want to build a process based on a deprecation policy. Why is this important? Think about running an Aurora service that has a cache. You request Aurora capacity, and you request the cache separately, and when the time comes to kill the Aurora service, you may never release your cache pool. So your cache stays assigned to you, unutilized. With a proper project deprecation service, we know all the resources associated with the project, so we can inform you of the resources you haven't released and help you release them.

Last but not least, capacity planning. Every time there are conversations about capacity planning, the requests are collected with a top-down approach. Service owners usually add some extra headroom on top of their actual request. At the end of the day, as these requests travel down to the infrastructure owners and reach the capacity planning meeting, we have this: all those extra headroom requests added up at each layer, not to mention the extra headroom added by the infrastructure service owners themselves. The outcome is over-provisioning and over-requesting of resources. With all the Chargeback data we have, we feel we are in a much better position to provide service owners with growth forecasts for their services.

So, just to summarize: we started off with visibility into resource utilization through Chargeback. Although we were able to find underutilized resources, without a proper ownership model we weren't able to attribute them or act upon them. With service identity, we came up with a model that captures the organization and ties these resources to teams. Our reports have helped our customers release unused resources, including our own team. Our current model is claim-after-create; we want to minimize this effort through resource provisioning, where we capture ownership at the time of creation. With Chargeback and the Kite ownership model, we are in a better place to provide teams with forecasts of their resource needs so they can make informed decisions. That's all I have for today. Thanks for listening.