Hey everybody. How many people, before you saw the schedule, had ever heard of Chartbeat? I imagine no one. Excellent — as I expected. Before I get started I'll warn you: I'm from New York, I talk fast. I'm going to try to slow down a little bit. I might say some funny things, but, you know, that is what it is. So let's get started.

So yeah, the topic of this talk is a year with Apache Aurora. I put a little asterisk there because it's actually been almost two years at this point since I started submitting the talk, and maybe 18 months that we've actively been using Aurora.

Okay, so here's what we're going to talk about today. First, who is Chartbeat — because, as we've established, no one has ever heard of us. What our architecture looks like. How and why we adopted Mesos and Aurora. How we actually use Aurora. And then we're going to take a deeper look into some of the interesting features that we've found.

So, about Chartbeat. We're 75 employees. We're an eight-year-old, venture-capital-backed startup in New York City. We have somewhere north of 20 engineers; that includes front-end engineers, data scientists, a bunch of back-end engineers, and my team, which is five of us — we're the platform and DevOps team. We call ourselves the platypus team. And that five includes myself and also our CTO, who unfortunately is usually in meetings. But he's a really good programmer, so we like it when we get him. We're in New York City — if you know New York, we're just south of Union Square, above the Strand bookstore. It's an awesome location. We're entirely hosted on AWS, and every engineer at our organization pushes code all the time.

So what do we do? This is the marketing slide — you know, they gave me a couple I had to put in here. Our mission statement is: we are the content intelligence platform that empowers storytellers, audience builders, and analysts to drive the stories that change the world. Our customers are the press. We work for most of the large news agencies, online magazines, and big blogs — everyone from the New York Times to the Washington Post to the BBC, plus a lot of the press in Europe — and we get a lot of traffic. Basically, the short story is: we put a little JavaScript on everyone's page, and that pings us all the time, so we can measure how far a user scrolled, how long they spent reading a page, and all that. It turns out to be about 300,000 requests a second coming in, just from those JavaScript pings. We are on over 50,000 websites around the world, and we track about 50 billion page visits a month.

Okay, this is what it looks like. So we have — oh, I just went too far trying to get the laser pointer here; okay, forget the laser pointer — we have dashboards: real-time, historic, and video dashboards. When our customers log in, they get a pretty view, in real time, of how people are using their page. This is PRI, Public Radio International — they're one of the customers that lets us use their data — and they currently have 494 people on what I believe is their home page. I did this early in the morning, so, you know, the traffic picks up later. The average time someone spends on this page is 45 seconds. You can drill into all the different pages, you can pivot on author and section, and all that cool stuff.

We also do optimization. We have something we call the heads-up display: when you go onto your website — and this is nrc.nl —
you can see these bars that show how each of the click-throughs is behaving, for every story on the page. They use this to decide: hey, that story down there about Obama is doing really well, let's keep it on the home page. The one next to it — something about Quebec; sorry, I don't speak Dutch — isn't doing that well, so they might want to push it down, do something else with it, or move it up into a better position. Maybe they want to swap the two and get more traffic going to that story.

So I think it's really important in any talk like this to understand how the organization approaches software engineering. I don't think we're quite unique, but we definitely have our own way of doing things. This is my favorite quote — I try to use it any time I give a talk. It's really old; it's one of the quotes that inspired me when I was 22 years old, right out of engineering school. It's by Larry Wall, who wrote Perl, and he said, in the first edition of the Programming Perl book: "We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris." He got some pushback on this, so in follow-up editions they expanded on what they really meant. They don't mean I should be lazy — they mean I should be actively writing code to allow myself to be lazy. I should be impatient because the computer is not doing things fast enough for me — it's not giving me back my response — so I'm going to write code to make it do things faster. And if you don't have a lot of hubris you can't be an engineer, because you have to believe that you can do anything and make that computer do what you want.

So how does this translate into Chartbeat's engineering standards? The platform team wrote a mission statement — because they told us we had to, for some sort of OKR/KPI thing — and we came up with this: our mission is to build an effective, efficient, and secure development platform for Chartbeat engineers, because we believe that an efficient and effective development platform leads to fast execution.

So how do we operate on a day-to-day basis? Like I said, I don't think this is unique — though it's definitely different from engineering 25 years ago when I started, when we had CVS. Git is the source of truth for everything. We store the configuration for our entire Amazon infrastructure in git, so we can always reproduce it any time we want. Everything that gets deployed is deployed by a git hash — except for Java stuff, where we use semantic versioning; we might be able to fix that later. Engineers run the code on their laptop, they run it in a dev environment, they run it in a production environment. We prefer the command line — everyone likes to do things from the command line. I'm probably one of the few that actually uses an IDE, because I'm old and I like IDEs, because they didn't exist when I was a kid; other people just use Vim or whatever. We prefer writing scripts to memorizing commands, because my brain is full of things other than esoteric commands. And we don't reinvent things that work — we're small. We make templates and we write scripts to automate things. As far as programming goes, we are almost exclusively Python and Clojure, with the exception of JavaScript for the front end — but I don't understand that, I don't use it.

So why Mesos, and why now? We're eight years old.
Everyone seems to think we're a very successful company in our industry, so why would we make such a big switch? I feel like the freedom to innovate is the result of a successful product. We want to set ourselves up for the next five years. Chartbeat's gotten to this great point — we have a lot of customers, we have good revenue — so where do we want to be five years from now? That's what this project was about. We wanted to reduce our server footprint to save money, as a lot of folks here have mentioned. We wanted to provide faster, more reliable service to our customers. We wanted to migrate all of our jobs, in one year, to whatever system we decided to use. And while we were doing that, we wanted to pay off tech debt. Because if you're going to take the effort to move a bunch of jobs over to a new system, you should probably ask: hey, does this job really need to exist anymore? Should we spend a couple of days tightening it up? And you've always got stuff running where it's like — oh my god, that's still running? It's been up for five years and no one knew. So let's not move that. Most importantly, we wanted to make life better for our engineers — we wanted happy engineers.

So, coming to today: we have a moderate-sized cluster compared to what I've heard around here — thirteen hundred and fifty cores — and almost everything we run is now in Mesos.

Okay, so what's a happy engineer? Happy engineers are productive engineers, because engineers want to be productive. You got into this industry because you're curious, you want to build stuff, and if you can't build, you're going to get frustrated. They like uneventful on-call rotations. Like a lot of companies our size, every engineer is on call — it's a one-week rotation with a backup person. You get woken up in the middle of the night and you're like, dude, I just got woken up, this sucks. So they don't want to have to do anything when they're on call. They want to push things quickly — they don't want to jump through hoops to get their code into production. They want to be able to monitor and debug their applications easily. They want to be able to scale their applications. Really — and when I talk about engineers now, I mean our product engineers, the folks we work for — they want to write product code. They want to write JavaScript, they want to write Python APIs. They don't want to mess with DevOps. So they want self-service DevOps that's easy to use, and that's what we set out to build.

So what did we have before? Before Mesos we had a lot of Puppet. We had Hiera roles in Puppet that mapped to AWS tags on the instances. We built virtualenvs into Debian packages for our Python code so that we could capture all the dependencies. Mostly we had single-purpose servers. We used Fabric to go in and restart jobs and that sort of thing. It is flexible — I mean, Puppet's great, it's very flexible — but it's really complicated. You've got Ruby, and you've got Hiera, and it's really complicated. So say you have this project foo. Foo has an API service, a Kafka consumer, and some cron-job workers that go and do database roll-ups and stuff. You basically build out foo-api-01, 02, 03, 04, 05; foo-kafkaconsumer-1 through 16; and all these workers.
You basically just scale out horizontally with your single app. So what happens when you've got a whole bunch of apps? All of a sudden foo has a whole bunch of servers, bar has a whole bunch of servers, baz has a whole bunch of servers. We found ourselves with 773 EC2 instances. For a company with 25 engineers, that's a lot. During the US election last year we broke a thousand instances, and we were like: that's a lot. We had 125 different roles in Puppet. It was really hard on DevOps, it was confusing for the product engineers, we wasted resources, and it was really hard to scale. We started looking at it and realized we were using something like 40 to 50 percent of our CPU and RAM, and that's just not cool.

So we decided that whatever we built had to let us solve the Python dependency-management problem once and for all. It had to play nicely with our current workflow — we didn't want to tell all of our engineers, oh, now you have to do everything this new way, because they're used to doing things their way. It had to be hackable, so that we could customize it and tweak it. It had to be open source — we only use open source software, with the exception of some Amazon databases. It had to be supported by an active community that's actually using this stuff in the real world. It had to allow us to do the migration slowly, over time. And it had to make our engineers happy.

So we chose Aurora. I'm not going to get into why we chose Aurora — happy to have a beer later if you want to talk about that — and I've also, by the way, written a couple of blog posts where I get into some of the details of why we chose Aurora versus Marathon.

So what is Aurora, if you haven't used it? It's a Mesos framework for long-running services and cron jobs. It was built by Twitter, and it was based on Borg — they had an engineer who had previously worked on Borg. They launched it at Twitter in 2010, and it joined the Apache Incubator in 2013. I'm not quite sure when it became a top-level project, but I'm sure someone here knows. They're currently on release 0.18, and about every six months they roll a new release; new releases always support the latest Mesos version. It's got a very active user community, and it's written in Java and Python.

So basically, this is Aurora. Over on the left side you have the scheduler, a framework that's registered with Mesos. There's an agent that runs on each of your servers and receives the instructions to go ahead and launch a new job. Everything in Aurora runs inside a sandbox — a chroot, essentially: you get a directory, all your stuff goes in there, and whatever user you launch a job as has permissions on that chroot. There's an observer, which basically lets you come in through a UI and look at your jobs — the log files and that sort of stuff. And there's an executor that monitors the life of the job.

Aurora defines things as jobs. So a job might be an API server, and I say I want 42 of them.
That gives you 42 tasks, and Mesos schedules them inside Thermos, the executor. Inside Thermos you get processes, so you can run multiple things in parallel, or in pipelines — something can install a job, then run the job, then run health checkers; those are all processes.

So, some of the features of Aurora that we've found useful. All the job templating is in Python, which means you can do anything — and everyone loves Python these days. One problem we had with Puppet-managed instances is that when something died, or got wedged — which is a very technical term that we like to use — someone had to actually log into the machine and start figuring out what was going on. Now, if a process gets wedged, a health checker in Aurora will say: hey, that thing's not doing whatever you said the health check was — I don't see anything in the log file, I can't hit this port — and it kills it and reschedules it, and no one cares; no one even knows. I mean, we obviously do know, but no one has to do anything. It has a very hackable CLI, which I'm going to get into. It does service discovery through ZooKeeper, with Finagle-compatible server sets. You can map ports. It has a good API. And something I found really cool — and initially thought was kind of weird — is the way they name jobs: by cluster, role, environment, and then the job name, which helps in ZooKeeper and in all of our metrics for knowing what's what.

So this is what they call a .aurora file. This is where you define your job description in Aurora. Every job has a .aurora file, which can define multiple jobs. They're Python, as I said, and the processes are basically any kind of Unix thing you can do. In this case — this is the hello world from the Aurora website, by the way; I took a couple of things out because they were esoteric — you define the name of the script you want to run. You have a process, which you call "fetch_package": it copies the thing from this /vagrant directory into your local sandbox, prints something saying "hey, I did this," and then makes it executable. The next process runs it — so I had one process that installed it, and now this one runs it, and since it just calls Python, it's obviously just a Unix command line. Then I link those two things together: do the install, then run hello world; I need one CPU, a meg of RAM, and eight megs of disk to do so, and I call that the hello_world task. Then I define my top-level job, and I say it's going to run on the dev cluster, in the environment devel — there are devel, prod, and test environments — as the www-data user, it's called hello_world, and Aurora goes ahead and runs it.
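Pieced back together, the example I'm describing looks roughly like this — this is my reconstruction from the Aurora docs, so treat the details as approximate:

```python
# hello_world.aurora -- reconstruction of the Aurora tutorial example.
# Process, SequentialTask, Resources, Service, MB, etc. are injected by
# Aurora when it evaluates this file; no imports are needed.

install = Process(
    name='fetch_package',
    # Copy the script into the sandbox, announce it, make it executable.
    cmdline='cp /vagrant/hello_world.py . && echo fetched && chmod +x hello_world.py')

run = Process(
    name='hello_world',
    cmdline='python -u hello_world.py')

# SequentialTask runs the processes in order: install, then run.
hello_world_task = SequentialTask(
    processes=[install, run],
    resources=Resources(cpu=1, ram=1 * MB, disk=8 * MB))

jobs = [
    Service(
        cluster='devcluster',
        environment='devel',   # devel, prod, and test environments exist
        role='www-data',       # the user the job runs as
        name='hello_world',
        task=hello_world_task)
]
```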
So, I find that DevOps is a balance between flexibility and reliability. You want to let your users do what they want to do, but you need to be safe — you want to protect them from doing silly things, and at the same time make your job as a DevOps engineer easier. Because if you know how things are running, if you have control over it, you can manage it. If you let your users do whatever they want, that's great — until it becomes a nightmare for you. So we wanted to make this much more structured, and the reason is that when we looked at the work we do, almost everything falls into three categories. We have Kafka consumers, which read off a Kafka topic and write to a database or to another Kafka topic. We have workers, which listen to a RabbitMQ queue for work — most of them do something like take an hour's worth of data from a database and turn it into a five-minute roll-up, or copy some stuff from here to there. And we have APIs. So we have Clojure Kafka consumers, Python workers, and Python API servers. That's pretty much 95% of the stuff we do.

So, big decision time — what do we want to do? We decided we're going to adopt Pants — and obviously adopt Aurora; that's the one I already let you in on. I'll talk about Pants in a second. We're going to wrap the Aurora command-line interface with our own client, so we can add some more control around it — I'll talk about that in a sec. We're going to create a library of Aurora templates that make it easy to do repetitive things, like pull an artifact from S3, drop it in place, make it executable, and run it. We're going to just let Aurora do its thing with log file management — we've always had issues with log files; it's one of the oldest problems in the book: oh, file handles, dammit, disks filling up. Aurora has a disk quota, so when a job hits that quota, it kills the job and restarts it, and it periodically goes through and cleans out the old sandboxes. So we said: let it do that. And we're not even going to bother with containers. At the point we made this decision there was no Docker support in Aurora, and we barely use Docker — a little bit, but not very much — so we didn't really care about that. We said we're just going to go with the sandboxes; that sounds fine to us.

So how do we make Aurora fit into our workflow? The .aurora file is very powerful — you can really do anything. You can define a whole bunch of jobs and go ahead and run them. But that gets very confusing, because as a DevOps engineer you no longer know what's running where. So we decided we're going to take all the common config options out of the .aurora file and put them in a YAML file — things like how many CPUs do you need, how much RAM do you need, that sort of stuff,
and flags that you might want to pass on a command line. The .aurora files then become much simpler. We decided we're going to require versioned artifacts built by our build server — which is what we did before, but now we actually tie it into the client: you have to specify the git hash in the YAML file for what you want to deploy, and our client checks that that artifact exists before you try to launch it. So we put in some safety nets for our product engineers. There's also an extra confirmation step if you want to push something to the production environment, because people do silly things, especially at two in the morning when they've just been woken up.

We also decided that every YAML file specifies exactly one job — that job could be running in devel, test, prod, whatever — but multiple YAML files can point to the same .aurora file, which has the definitions of how the job runs. I'll talk about that in a sec; it gets very interesting. All of our configs, as I mentioned, live in the repo, which makes it really easy to find jobs. We have one directory where all the job configurations live; you can go and grep and replace stuff, and it's very easy to make major changes or to figure out what's running where. And then we added some additional functionality, for things like tailing log files as jobs are running.

So what's the difference? On the top is what it takes to run foo_server.aurora with the basic Aurora command line. You say "aurora create" — there's also "aurora update", which does a rolling update, and you can do a restart or a kill — and then you name the job. Remember I said they have these funny names: that "a" is a cluster name — we have "a" and "bb" because one of our guys used to work at Google, and apparently that's what they do — then "cbops" is the role the job runs as, then it's running in production, it's called foo_server, and there's a path to the .aurora file. The way we do it, we have this "aurora_manage" command. It mimics all of the Aurora commands, but it maps to that YAML file and pulls out all the stuff you're not going to remember — like the user it runs as, because if someone launches a job as a different user, that could cause havoc, right? And you have to specify whether you want it to run in dev or prod, because engineers launch things, they mistype, and they launch things in the wrong place. So it's all about safety nets.
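We haven't published aurora_manage, so this is just a hedged sketch of the idea — the YAML field names, the S3 bucket layout, and the helpers are my own invention for illustration, not our actual code:

```python
#!/usr/bin/env python
"""aurora_manage -- sketch of a safety-net wrapper around the aurora CLI.

Hypothetical: field names and S3 layout are illustrative only.
"""
import subprocess
import sys

import boto3
import yaml

CLUSTER = 'a'  # our clusters have short, Google-style names


def main(command, job_name, env):
    # Each job has exactly one YAML file checked into the repo.
    with open('configs/%s.yaml' % job_name) as f:
        cfg = yaml.safe_load(f)

    # Refuse to launch an artifact that was never built: the git hash
    # named in the YAML must exist in S3.
    key = '%s/%s-%s-trusty-x86_64.pex' % (
        cfg['build_name'], cfg['build_name'], cfg['githash'])
    s3 = boto3.client('s3')
    try:
        s3.head_object(Bucket='example-artifacts', Key=key)
    except s3.exceptions.ClientError:
        sys.exit('no such artifact in S3: %s' % key)

    # Extra confirmation before touching production.
    if env == 'prod' and input('Really deploy to prod? [y/N] ') != 'y':
        sys.exit('aborted')

    # Fill in the full job key (cluster/role/env/name) so nobody has to
    # remember the role, then hand off to the real aurora client.
    job_key = '%s/%s/%s/%s' % (CLUSTER, cfg['user'], env, job_name)
    subprocess.check_call(['aurora', command, job_key, cfg['file']])


if __name__ == '__main__':
    main(*sys.argv[1:4])  # e.g. aurora_manage create foo_server prod
```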
So this YAML file that we designed: at the top it has information about the job. This one is for a thing called eightball, which drives one of the dials on our dashboard. You specify the .aurora file it's going to run, and the user it runs as — here, cbe. The build name refers to an artifact that will be found in S3; in this case it's called eightball. We're launching this one by git hash; we also allow semantic versions for Java things, and we have a "static" type for third-party things, like Grafana, that we just build once and deploy. Then come all of your configurations — the things most people are going to want to change frequently. We let you specify values that get used in a way you've defined in your .aurora file, like command-line arguments to whatever you're running. And we let you override all of this per stage. It's very common that in dev I want two instances and in prod I want 56; in dev I want these command-line arguments and in prod I want those. So we've broken it out so you can override them. And you always have to specify a git hash.
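As a rough picture — and again, this is me sketching the shape of the file, so the exact keys are illustrative, not our schema:

```yaml
# eightball.yaml -- illustrative shape of one of our job config files
file: api/eightball.aurora   # the .aurora file with the job definition
user: cbe                    # role the job runs as
build_name: eightball        # artifact name to look up in S3
deploy: git                  # git | version (Java) | static (third-party)
githash: 0f3a9c1             # placeholder; must exist in S3 or the client refuses
config:
  args: "--dial=eightball"
  instances: 2
overrides:
  prod:                      # per-stage overrides
    args: "--dial=eightball --workers=8"
    instances: 56
```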
So, Pants. Here's my one slide on what Pants is. We discovered Pants because they use it to build Aurora — it's also from Twitter. You can find it at pantsbuild.io. It's a build system for big monorepos, especially Python ones. They do support other things, but as far as I know I've never seen anyone use it for anything but Python, and it's awesome. If you're familiar with Maven, it's essentially Maven for Python. It creates pex files — pex files are executable Python environments. You take all of your dependencies — you need PyYAML, and you want requests 2.3, and all this stuff, which is a problem when you're trying to deploy multiple things on one server, because normally it's all installed at the top level. So instead of a virtualenv, Pants basically builds a directory with all of your dependencies, all of your code, and any extra resources like YAML files or other config files, puts them in a zip file, and makes it executable. You now have one artifact with all of your Python stuff in one place. We tag them with a git hash, we name them for whether they're built for trusty or precise or whatever, and we upload them to S3 from our Jenkins.

Pants has directory-level BUILD files, which is kind of a lot of files, but it's actually very flexible, and the reason for them is that it does incremental builds across the monorepo. If you make a change in some Python code here, you don't want to kick off a build of everything; Pants can figure out what else in your repo depends on that changed file and build just that. We have no more repo-level dependency conflicts — you can even specify different versions of third-party stuff. This was a big migration, so we decided early: everything's going to Aurora, and by the way, it has to be pants-built before it goes to Aurora. We obviously helped our product engineers do this, and it was a great way of getting rid of some tech debt.

So what does Pants look like? This is a Pants BUILD file for the fiddler server — one of our API servers. You basically specify the entry point into your code, and the dependencies, which are either relative to your code, somewhere else in your repo, or third-party. This one will include PyYAML, a handlers directory, some of our in-house logging utils and memcache utils, Sharknado 3 — which is our API server framework, because people love naming things — and the Sharknado 3 gevent plugin, which we use; and there's also a constants YAML file that needs to get included. All of this gets put in one directory, tarred and zipped up, and made executable. It's like magic. And it will build fiddler_server-{githash}-trusty-x86_64.pex.
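A BUILD file along those lines would look something like this — target names and paths are my approximation of what's on the slide, not a copy of our repo:

```python
# fiddler/BUILD -- approximate shape of the BUILD file for fiddler_server

python_binary(
    name='fiddler_server',
    # Entry point: the module pex should run when executed.
    entry_point='fiddler.server',
    dependencies=[
        ':handlers',                     # relative: handlers dir next to this file
        ':constants',                    # the constants.yaml resource below
        'src/python/cb/logging:utils',   # elsewhere in the monorepo (in-house logging)
        'src/python/cb/memcache:utils',  # in-house memcache utils
        'src/python/sharknado3',         # our API server framework
        'src/python/sharknado3:gevent',  # its gevent plugin
        '3rdparty/python:pyyaml',        # third-party dep, pinned centrally
    ],
)

# Non-code files ride along inside the pex as resources.
resources(
    name='constants',
    sources=['constants.yaml'],
)
```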
The next thing we did was write a bunch of templates for doing the common things. Since the .aurora file is just Python, we wrote a whole library of templates to automate the things people want to do over and over, like installers. We have installers for jars, tars, pex files, gzips, whatever, found in various S3 directories: pull them in and drop them in place. No one needs to know how to do that — someone did it once. JVM and JMX configuration options, all sorts of environment stuff. If you have to create a config file from some inputs and drop it into your chroot before your thing runs, we support that — very easy. Access credentials: we install those on the machines with Puppet, but they're hidden away, and we have ways to get at them. We also have shared resources — here's the list of all the databases, here are all the Kafka brokers, here are all the ZooKeepers, that sort of stuff. And then there are the supporting actors in this world. We have this thing called authproxy, which every API has to run and which authenticates users against the database for the different APIs. Someone wrote that once; now it's a one-liner to drop it in and use it. And health checkers: Aurora has an HTTP health checker, and we wrote ones that tail log files, or proxy another service to see if it's up, and someone just says "I want that health check" — no problem.

So this is what one of our .aurora files looks like, for eightball. Aurora makes heavy use of this thing called Pystachio. It's a type-safe dict in Python — almost like a struct — and it allows mustache templates to fill things in either at the time you define the job, or once the job has been assigned to a server. Something like the HTTP port you don't know until the job lands on a server; something like the name of my MySQL database I know up front, when I want to run it. So it allows for different evaluation times, and things get bound in as they become known. Aurora exposes Pystachio templates for the port mappings and that sort of thing — the instance ID, the name of the server — and we wrote our own for the things we need to do.

It works by adding all these profiles. We define a bunch of profiles up top that we're going to bind later on. You can see in this "ops" struct someone's defining command-line arguments for the API server that's going to run. Here are our memcache servers — they're in the services struct. The port is assigned by the Thermos executor: if you hand any string into the ports template, it'll assign a port with that name, and you can use it later. We use a private port, a public port, and the JMX port — all these ports get assigned. Down here you say: I need to run eightball. Here's the command line — run the pex file from that struct up there, with the options from that thing over there — and it builds the command line. Then I add in authproxy, which is just a one-liner. I want this health check, which is going to check my API server at this URL, on a port that will be assigned later. And then I run it all: first install eightball and run eightball — but don't stop there, let that keep going — then run the authproxy and the health-check processes. So we've defined these pipelines that you can use to describe how your things run. And finally, you define the job, you bind all the profiles, and Aurora runs it.
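To make the profile-and-binding pattern concrete, here's a heavily simplified sketch — the struct fields, helper processes, and resource numbers are illustrative, not our real templates:

```python
# eightball.aurora -- simplified sketch of the Pystachio profile pattern.

class Ops(Struct):
  # Things product engineers override per stage via the YAML file.
  args = Required(String)

install = Process(
    name='install_eightball',
    # Stand-in for our S3 installer template.
    cmdline='s3cp example-bucket/eightball-{{githash}}.pex eightball.pex '
            '&& chmod +x eightball.pex')

run = Process(
    name='run_eightball',
    # {{ops.args}} binds when the job is defined; {{thermos.ports[private]}}
    # binds later, once the task lands on a machine and gets a port.
    cmdline='./eightball.pex --port={{thermos.ports[private]}} {{ops.args}}')

authproxy = Process(
    name='authproxy',
    # Supporting actor: public traffic hits this, it authenticates the
    # user, then forwards to the API server's private port.
    cmdline='./authproxy.pex --listen={{thermos.ports[http]}} '
            '--upstream={{thermos.ports[private]}}')

health = Process(
    name='health_check',
    cmdline='./healthcheck.pex '
            '--url=http://localhost:{{thermos.ports[private]}}/status')

eightball_task = Task(
    name='eightball',
    processes=[install, run, authproxy, health],
    # Install must finish first; run, authproxy, and health then coexist.
    constraints=order(install, run),
    resources=Resources(cpu=2, ram=1 * GB, disk=1 * GB))

jobs = [
    Service(cluster='a', role='cbe', environment='prod', name='eightball',
            task=eightball_task)
        .bind(ops=Ops(args='--dial=eightball'), githash='0f3a9c1')
]
```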
We've taken this idea of custom templates a step further. We have 104 workers that all do basically the same thing: they listen to a RabbitMQ queue and run some stuff. We have this thing called Igor, which is our worker framework, and the difference between one worker job and another is which queue it listens to, plus some Python imports — because you don't want to import the giant data-science library for something that's not using it. So, remember I said you can have multiple YAML job definitions pointing to the same Aurora config? We said: all right, let's have one Aurora config for all of our workers, and just define YAML for each of the different ones. Now if someone wants to add a worker, they just have to write a little bit of YAML that says: use this config, this is the Rabbit queue to listen to, and here are some command-line arguments. Right now, in the directory where we store all our Aurora configs, we have 104 of these defined, and you can see some of the different ones — Elasticsearch indexing is a big one.
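So adding worker number 105 is, give or take the exact key names — this is a sketch, not our schema — about this much work:

```yaml
# workers/thumbnail_resizer.yaml -- hypothetical new worker; every worker
# points at the same shared igor_worker.aurora definition.
file: workers/igor_worker.aurora
user: cbe
build_name: igor
deploy: git
githash: 0f3a9c1               # placeholder
config:
  queue: thumbnail_resize      # the RabbitMQ queue this worker listens to
  args: "--imports=cb.images"  # the extra Python imports this worker needs
  instances: 4
```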
Then we took it another step further and created an ETL pipeline called Deepwater — because someone who may or may not be sitting in the front row wanted to name something after an oil spill. Deepwater lets you define a whole workflow that runs as Aurora jobs. You define your steps as Python classes, and each step in the pipeline gets its own Aurora job. You can give different steps different resource requirements and scale them out however you want. It also uses Postgres for consistency, so if a job fails, it's marked in Postgres and we know about it.

That's only part of the story, though. Before deploying anything, we had to figure out — okay, we want to use Aurora; now, how do we solve every other problem that comes with this migration? Here are some of the things we had to deal with: request routing, metrics and monitoring, log file collection, configuration management, and a bunch of other stuff.

So the first one is routing: how do you route traffic as jobs move around the cluster? We used to do this thing where a job ran on foo-api-01, and it was always going to run on foo-api-01, and if foo-api-01 crashed — because it's Amazon — we'd launch another foo-api-01. And it was always going to be port 9000 for this and port 9001 for that. This obviously changes with Mesos. So we introduced HAProxy and Synapse. Everyone knows what HAProxy is; Synapse was written by Airbnb — thank you, Airbnb. Its config is YAML, and it's a superset of the HAProxy config.

This is roughly how it works. All of our API jobs have authproxy, the API server itself, and a health check. We bind authproxy to the public HTTP port and we bind the API server to the private port. Internally, if someone needs to make a request to that API server — say a batch job that needs to read that data — they go straight to the private port. Someone coming from the outside world gets sent to authproxy, which proxies them through to the API server. The health checker never needs to talk to HAProxy, because it's running on the same Mesos server, right next to the API. So a request coming in for the public HUD API scroll-depth endpoint, or the private one, goes through HAProxy into the Mesos cluster.

Here's how Synapse helps us out. When a job is launched in Aurora, Aurora announces into ZooKeeper: hey, this HUD API is running, you've got three instances of it, and here are their ports. Synapse is polling ZooKeeper, and when it detects a change, it generates a new HAProxy config and bounces HAProxy. This happens pretty quickly. I know someone was talking earlier about an update to HAProxy that means you don't lose any connections; when ours bounces, we probably lose a couple, but all of these APIs are being accessed from JavaScript that will just retry the request, so it's not a big deal — it takes something like 200 milliseconds to restart. We use Puppet to manage the HAProxy and Synapse configs, so if a user needs to add a new route, they add it in Puppet and push that out, and it gets picked up on our HAProxies. And our HAProxies all sit behind an ELB.

Okay, question number two: metrics collection. Metrics collection is really important, and we wanted to make it easy. We had several different ways people were doing this before — everything from TSDB to all sorts of other collection schemes — so we decided to consolidate everything on OpenTSDB and Grafana. OpenTSDB originally came out of StumbleUpon, and I think most people are familiar with Grafana — it's a great dashboard for visualizing this stuff. So our flow is OpenTSDB into Grafana; Nagios polls Grafana, and also TSDB directly, to see if something is wonky; and then we use PagerDuty for alerting.

The cool thing about the naming of everything in Aurora is that it makes the tagging easy. TSDB works with tags: it's a time-series database, so you have a time series, you get a point in time, and it carries a bunch of tags. One tag is the name of the job, one tag is which environment it's running in, one tag is which user it's running as. So it becomes very easy to say: let me look at the HUD API in dev; now let me look at the HUD API in prod — because everything is consistently named. We were able to write tools that easily let engineers put this data in, and that scrape all the data coming out of Aurora for the running jobs, so we can generate dashboards for any job: how much CPU is it using, how much RAM is it using. All of our JMX metrics get in there too, so we can graph things very easily, and everyone knows where to look for any running job. We've written libraries in Python and Clojure that do all of this auto-tagging based on the Aurora names, and we wrote what we call the JMX collector, which polls any job running in Aurora, pulls out the JMX metrics if there's a JMX port defined, and stores them in TSDB. And there are Grafana dashboards for everything.

Engineers love pretty graphs, so here's our generic Mesos job graph. Up at the top you can see CPU used per task — the red line is the limit that's been assigned. This job is nice: it peaks a little, but not too badly, which is how we like it. And you can see if any specific tasks are using more CPU than the others, which actually happens a lot with Kafka consumers, especially if you have an unbalanced topic — we can say, oh, this one task is doing way more work. One thing we haven't quite figured out is a way to give certain tasks more CPU than others, but I think that's probably a pipe dream; it's better to just figure out how to balance the topic.
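The auto-tagging I mentioned amounts to something like this — a minimal sketch, assuming a plain-socket OpenTSDB endpoint and environment variables carrying the job's Aurora identity (our real library is more involved):

```python
# cbstats.py -- minimal sketch of auto-tagged metrics. Assumes the Aurora
# job key is exposed via environment variables and that we write straight
# to OpenTSDB's telnet-style "put" interface; both are assumptions.
import os
import socket
import time


def aurora_tags():
    """Tags derived from the Aurora job key: cluster/role/environment/job."""
    return {
        'cluster': os.environ.get('AURORA_CLUSTER', 'a'),
        'role': os.environ.get('AURORA_ROLE', 'cbe'),
        'env': os.environ.get('AURORA_ENV', 'devel'),
        'job': os.environ.get('AURORA_JOB', 'unknown'),
    }


def put(metric, value, tsdb_host='tsdb.example.internal', tsdb_port=4242):
    """Send one data point. The metric name stays identical across jobs
    and stages; the job identity lives entirely in the tags."""
    tags = ' '.join('%s=%s' % kv for kv in sorted(aurora_tags().items()))
    line = 'put %s %d %s %s\n' % (metric, int(time.time()), value, tags)
    with socket.create_connection((tsdb_host, tsdb_port), timeout=5) as s:
        s.sendall(line.encode())


# e.g. put('hud.api.requests', 1) shows up tagged env=devel or env=prod,
# so one Grafana dashboard works for every stage of every job.
```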
So, log file analysis. This was actually a really big one for our users, I think mostly because log file analysis has always been kind of tough. When we told users they couldn't just SSH in and use polysh to tail all their logs, they were like: no, that's horrible. First we tried to pull everything into Elasticsearch with Kibana — and I know some people love it; I hate it. It was really messy and incredibly expensive. So instead we chose to use Flume, Athena — which is an Amazon product, basically Presto run as a service by Amazon — and something we call "tailall," named that because we already had a tail tool by the obvious name.

So what do those look like? I mentioned users wanted polysh — if you're not familiar with it, polysh is a little Python program that lets you shell into, say, ten servers, run the same command, and see all the output nicely separated, even with colors. It's really cool. So we wrote an addition to our Aurora client: "aurora_manage tailall" plus a job name. Since the Aurora client knows where everything is running, you can say, hey, I want to tail this job, and it figures out where the tasks are, logs into all of them, pulls back all of your log files, and prints them out. There's also the Aurora web UI, where you can just click on a job and look at the log file, just like in Mesos — but that doesn't help with a cron job that already stopped running, or one whose sandbox got cleaned up. So we suck all of our log files into Athena through Flume. Athena lets you do ad-hoc SQL-like queries across S3 buckets, so if you want to do historical forensics on why a thing died yesterday, you can just go into Athena and look at the logs. We also have a lot of stuff that's not running in Mesos — our Kafka brokers, our databases (I didn't mention it, but we don't put databases in Aurora, and I don't think we ever will) — and all of those logs go there as well.

So, two years later, and we're really psyched. We've reduced our on-call events dramatically. We've cut our EC2 instance cost by about 33 percent, and at the same time we've built new stuff — it's not like we stopped building. We recently did an engineering survey, because we couldn't figure out any other way to measure our KPIs — we said, let's do a Google survey and ask our engineers what they think. We asked a bunch of somewhat leading questions, maybe, but they said that they rarely experience blockers deploying stuff, which is great. And it's honestly changed our entire approach to DevOps. One of the most telling examples is how we use pex files. We now build pex files for all of our command-line tools: we have a pex file that runs our S3 command, pex files for all of our tools for launching instances, and so on. And we wrote a thing called pex_runner, which we deploy on all of our machines. You say "pex run," give it the name of a job and a git hash, and it downloads the artifact from S3 and caches it locally — which we do on our Mesos servers as well: oh, you want this pex at this git hash? I already have it; I don't have to go get it from S3. That's really changed the way we approach bundling our Python stuff. So that's been great.
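pex_runner is in-house, so here's just the gist of it as I'd sketch it — hypothetical names, and a plain download-then-exec flow:

```python
#!/usr/bin/env python
"""pex_run -- sketch of the pex_runner idea: fetch a pex from S3 by name
and git hash, cache it locally, and exec it. The bucket name, cache path,
and naming scheme are illustrative, not our actual layout."""
import os
import sys

import boto3

CACHE = '/var/cache/pex'


def pex_run(name, githash, *args):
    filename = '%s-%s-trusty-x86_64.pex' % (name, githash)
    local = os.path.join(CACHE, filename)

    # Only hit S3 if we've never seen this name+githash before; the hash
    # makes the artifact immutable, so the cache never goes stale.
    if not os.path.exists(local):
        os.makedirs(CACHE, exist_ok=True)
        boto3.client('s3').download_file(
            'example-artifacts', '%s/%s' % (name, filename), local)
        os.chmod(local, 0o755)

    # Replace this process with the pex itself.
    os.execv(local, [local] + list(args))


if __name__ == '__main__':
    pex_run(*sys.argv[1:])  # e.g. pex_run s3tool 0f3a9c1 ls ...
```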
So that's it — I want to leave time for questions. I should say: that's our engineering blog up there, and there are posts about all of this that get into more detail. I'm on Twitter at rmangi, on GitHub at rmangi, and pretty much rmangi everywhere else — except at work, where I'm Rick. So that's it. Thank you.