Hello, my name is Tim Kajasa and I'm going to talk about using Prometheus to automate power capacity planning. A quick intro to myself: I was a traditional network engineer, but now I focus primarily on tool development and monitoring for the systems and networking team here at Hudson River Trading. Hudson River Trading, or HRT, is an automated trading firm that was founded back in 2002. Being an automated trading firm simply means we use computers to buy and sell financial instruments such as stocks and bonds.

Since HRT is a trading company, latency is key for us, and the desire for low latency necessitates that we co-locate our computers as close as possible to an exchange's trading platform, never in the cloud. Operating in the physical computing space presents some interesting capacity planning challenges for us. Since we don't operate in the cloud and we can't click a button to add capacity, capacity planning is extremely important. Do we have enough servers to run additional workloads? Do we have enough rack space for these servers? And, what is almost always the gating factor, do we have enough power for everything? If we are looking to grow and the answer to any of these is no, the turnaround time for adding capacity grows from days to weeks or months. So the focus of this talk is on how we make our power capacity planning more accurate and how we can then automate it.

In a typical data center, our servers are divided up into multiple racks, and within each rack there are multiple PDUs, or power distribution units. These PDUs are what supply power to all the servers. Like all capacity planning, when we are looking to grow our server footprint, we look to see where there is excess capacity, and with power that means where the most available power is. To find the available power, we just need to know two things: how much power is being used (the load on the PDU) and how much power has been allocated to that PDU.
From that, we can calculate how much available power there is. So this should be simple, right? In an ideal world, yes, this should be simple.

The first step to tracking power usage is easy: point the SNMP exporter at the PDUs and we know how much power is being used there. And that's exactly what we do. Where the trouble comes in is that we have multiple PDU vendors. With multiple PDU vendors come multiple SNMP MIBs, which result in multiple metrics. So the way a PDU reports power load differs across vendors. One PDU vendor may report the total load of the PDU in amps, another in tenths of amps, while for yet another we may need to sum multiple metrics to get the total load on that PDU. The solution is to use recording rules to abstract away the underlying hardware and generate a common metric name. Once we verify that all the recording rules are correct, there's no need to touch any of the downstream tooling.

Now that we have a consistent metric representing the load on a particular PDU, we need to know how much power each PDU has been allocated. Again, in an ideal world, all PDUs would have the same amount of power allocated to them. However, we operate in the messy real world. Power is expensive and can be limited in some trading colos, which results in us having different power allocations per PDU. One site might be able to have 24 amps per PDU, while another might only have 16, and we even have variations within colos. To handle this, we have a standalone script that queries NetBox, an open source data center management tool, to determine how much power we have allocated to each PDU. The script then writes these power allocation thresholds out to a file for the textfile exporter to read in, turning what used to be configuration data into metrics. Now, with a consistent metric name for the load on each PDU and the per-PDU thresholds, calculating the free power in each rack is easy.
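To make the last two steps concrete, here is a minimal Python sketch. It is an illustration, not the script described in the talk. The metric name `pdu_power_allocation_amps`, the `site`/`rack`/`pdu` labels, and both function names are hypothetical. The NetBox lookup is replaced by a plain dictionary, and label values are assumed to contain no quotes or backslashes. The sketch also assumes that free power in a rack is the sum of the headroom on each of its PDUs. With redundant feeds, taking the minimum across PDUs may be the better choice.

```python
from collections import defaultdict

# Hypothetical metric name; the talk does not say what the real one is.
ALLOCATION_METRIC = "pdu_power_allocation_amps"


def render_allocations(allocations):
    """Render {(site, rack, pdu): amps} in the Prometheus text exposition format."""
    lines = [
        f"# HELP {ALLOCATION_METRIC} Power allocated to the PDU, in amps.",
        f"# TYPE {ALLOCATION_METRIC} gauge",
    ]
    for (site, rack, pdu), amps in sorted(allocations.items()):
        # Assumes label values need no escaping (no quotes or backslashes).
        labels = f'site="{site}",rack="{rack}",pdu="{pdu}"'
        lines.append(f"{ALLOCATION_METRIC}{{{labels}}} {amps}")
    return "\n".join(lines) + "\n"


def free_amps_per_rack(allocations, loads):
    """Return {(site, rack): free amps}, summing allocation minus load per PDU.

    PDUs without a load reading are skipped, so missing data never shows up
    as free capacity.
    """
    free = defaultdict(float)
    for key, allocated in allocations.items():
        if key not in loads:
            continue
        site, rack, _pdu = key
        free[(site, rack)] += allocated - loads[key]
    return dict(free)


if __name__ == "__main__":
    # Made-up example data: allocations as they might come from NetBox,
    # loads as they might come from the normalized load metric.
    allocations = {
        ("site-a", "rack-1", "pdu-a"): 24,
        ("site-a", "rack-1", "pdu-b"): 24,
        ("site-b", "rack-7", "pdu-a"): 16,
    }
    loads = {
        ("site-a", "rack-1", "pdu-a"): 10.5,
        ("site-a", "rack-1", "pdu-b"): 9.5,
        ("site-b", "rack-7", "pdu-a"): 12.0,
    }
    print(render_allocations(allocations), end="")
    print(free_amps_per_rack(allocations, loads))
```

In practice the rendered text would go in a file in the directory that the node exporter's textfile collector reads, which is presumably what is meant here. It is best written to a temporary file and then renamed, so a partial file is never scraped. The subtraction itself could equally be done in PromQL once both metrics share the same labels.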
From here, it's even simple to write a tool that queries Prometheus to report where new servers should be racked. Even if the power load is highly variable, like it is here, we can leverage Prometheus to make better capacity planning decisions. We can even use it to create nice dashboards. Thanks for listening.