So this is an astronomer's journey from cloud lock-in to open-source software independence. This is the obligatory slide where you can verify that I am who I say I am: Aaron Brongersma, VP of Engineering at Astronomer.

At Astronomer, we focus on two major core competencies: one is streaming data, and the other is batch-style ETL. We bring a very code-first approach. We think that to empower data engineers and data scientists, we need to let them run native Python code as quickly as possible, so they can understand their business logic better than any predictive whiz-bang tool would.

So Astronomer: it all started out where we were chained to our cloud provider, but one day we got away. Our early priorities at Astronomer were getting to market quickly for validation, so we focused really heavily on quick prototyping. The next goal for us was to get data in motion, and we were always doubling down on exploration.

V1 of Astronomer looked very different than it does today. It started off with heavy utilization of AWS Lambda, with API Gateway in front of that acting as our API layer. We used CloudWatch for every bit of our metrics gathering, and we bolted in Amazon Kinesis and Elastic Beanstalk to handle our auto scaling. Now, this works when you have a team of three to five people. It worked out well for our startup: it got us to market quickly and it got our customers onto version one of our application.

But there was trouble in paradise. We had some engineering obstacles. As we scaled our user base, the number of events we were sending through API Gateway was becoming cost prohibitive. The next thing we learned, and we learned it the hard way, was that Kinesis isn't Kafka. Once you start seeing the really awesome things happening in the Kafka ecosystem, it leaves you wanting more when you're running your data through only a fraction of what Kafka can do. We also found that provisioning with Elastic Beanstalk was slow and antiquated. It was one of the first ways to deploy containers on AWS, but moving forward it seriously left us lacking in that department. This is my face when I saw the Amazon bill the very first time we ran a billion events through API Gateway. So that's my thoughts on API Gateway.

We also had product obstacles. Our customers had issues giving us access to all of their data. If you're a SaaS provider and you want to operationalize someone else's data warehouse, that either becomes a very custom VPN solution, or they have to open ports to the public and whitelist IP addresses. What we found is that we really needed to start doing our work inside of a firewall, inside of a private network. To quote our CTO, we basically needed a stack that could run our data engineering tools anywhere we wanted to be, anywhere our customers were, and we didn't want to be limited by any one cloud service provider.

So that landed us on DC/OS. We knew that a lot of the components we were using at Astronomer, things like Cassandra and Spark, were first-class citizens in Mesos. What we didn't know was how to get Mesos up and running quickly. We were a small engineering team, and we'd never really worked on the SMACK stack before. DC/OS was a really quick way for us to get up and running and have best practices rolled out throughout our fleet. So, DC/OS at Astronomer.
We use DC/OS to run all of our Apache Airflow tasks. Apache Airflow is the tool we use for all of our batch ETL. It came out of Airbnb, and it has a Mesos executor that is pretty awesome: we can scale out all of our ETL jobs inside of Docker containers across our entire DC/OS Mesos cluster. We moved Marathon in to replace Elastic Beanstalk, and that may be one of the greatest things we did; it was a night and day change. And then we really focused on task scheduling. At first we were doing nightly batches into Amazon Redshift, and we found that either our customers created too many events or it wasn't real-time enough for them. So we started moving more toward Spark Streaming, which allows us to do micro-batches: in windows ranging from roughly ten minutes up to two hours, we can micro-batch into Redshift using Spark Streaming.

So, Apache Airflow. The reason we stumbled upon Apache Airflow is that we were running batch ETL jobs, but they weren't running in a dependency-driven manner. One of the things that would happen is that our nightly ETL tasks that were supposed to populate our data warehouse didn't trigger. Traditionally with cron, job one would run every night at midnight and job two might run at one in the morning, and there was nothing blocking job two when job one failed. With Apache Airflow, tasks are laid out in what's called a DAG, a directed acyclic graph. It's just a graph of tasks: start here, do this, do all of these, and then report back when you're done. That's really important when you're doing transforms or enrichments. When you're bringing in data from a number of different data sources, you can make sure that if, say, your connection to your MongoDB database timed out, you can rerun just that one particular task.

So why Airflow on Mesos? The first thing was that we were able to use the Mesos scheduler, which is robust. One of the things we really love is that the actual task scheduling is one of the least painful parts of running our fleet; we spend a lot more time tuning the Docker images or making sure the Python DAGs are written correctly. When it comes to scaling out elastically, we're easily able to do millions of events an hour. We were up and running quickly: as soon as we had DC/OS, we had Airflow, and with our new custom executor we were able to start distributing tasks within an hour of scaling up a brand new cluster. And like I said, millions of tasks a day. That was really important for a lot of customers who were used to only having nightly ETL jobs; now we're able to give them insights up to the hour, maybe even less, depending on where the data sources are.

So, more on Airflow at Astronomer. We use it for all of our data pipelines. Originally we had a toolkit called Aries, and Aries was a JavaScript operator that allowed you to use Node.js inside of your Airflow tasks. That's absolutely insane; it really went against the original principles of Airflow and being dependency driven. We were very much focused on streaming in the beginning, so we've dialed back the JavaScript and doubled down on Python. The next thing we really needed Airflow for is our intelligent Redshift loading, where we guarantee that the data comes in.
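To make the dependency-driven idea concrete, here's a minimal sketch of a pipeline laid out as a DAG. This is not our production DAG; the dag_id, task names, and callables are illustrative.

```python
# A minimal, dependency-driven pipeline in Airflow -- illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_events():
    pass  # pull raw events for the run's time window


def transform_events():
    pass  # enrich / clean the extracted batch


def load_redshift():
    pass  # COPY the transformed batch into Redshift


dag = DAG(
    dag_id="nightly_warehouse_load",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

extract = PythonOperator(task_id="extract", python_callable=extract_events, dag=dag)
transform = PythonOperator(task_id="transform", python_callable=transform_events, dag=dag)
load = PythonOperator(task_id="load", python_callable=load_redshift, dag=dag)

# Unlike two independent cron entries, a failed upstream task blocks its
# downstream tasks, and a single failed task can be re-run on its own.
extract.set_downstream(transform)
transform.set_downstream(load)
```

Because the dependencies are explicit, the load never runs against a half-populated warehouse, and a single failed extract can be retried without rerunning the whole pipeline.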
When that data lands, it also triggers a task that executes a Spark Streaming job, and then we're able to do our schema inference, so we get much better-defined tables when the data goes into the data warehouse. And once again, it's all dependency-driven task running.

The next thing we had to kill at Astronomer was getting from Kinesis to Kafka. More and more of the customers we work with were sending more and more events through our pipeline; we went from 500 million and now we're up to 3 billion. You start feeling the pain of having started out on Kinesis, and you start feeling the pain of using API Gateway.

Some things jumped out at us. The first thing we really struggled with was the Kinesis Client Library. The Kinesis Client Library looks pretty cool at first because it lets you use any language you want: as long as you're communicating through standard out, you can process a stream and do all of your checkpointing. The thing we ran into is that since it logs directly to standard out and it's very sensitive about standard error, it becomes really difficult to troubleshoot when things actually go wrong. We've had quite a few race conditions that were really hard to pull out of our ELK stack logs. It's also not available everywhere. More and more of the companies we do business with are not running on Amazon; they're running in Azure, in Google Compute, or still in their own data centers. It seems like back home in Cincinnati, the bigger the tower, the more likely people are still running inside their own data centers rather than adopting the cloud.

So the road to Kafka for us started with a rewrite of our API. We moved off the JavaScript API that was running behind API Gateway on Lambda functions and rewrote it in Go, and we got a 10x improvement in performance out of that. While we were rewriting the API in Go, we started working on the abstractions that would let us switch our underlying message service: we abstracted everything down to generic messages, and those generic messages can go into either Kafka or Kinesis. We also improved our provisioning, monitoring, and testing. It's one thing to do a docker-compose up and have a Kafka; it's another to roll it out in production. There's a lot of tuning and a lot of micro-adjustments you need to make before you run production client loads over it, so we've been spending quite a bit more time on that, and that's one of the bigger pain points of running your own Kafka versus using something like Kinesis. The other thing we needed to do was run both systems in parallel. What we have now is 99% of our traffic still running through Kinesis, alongside the new canary version of our stream-processing service, written in Go against Kafka. Once we're through the final checks there, we'll move that into production.
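The producer abstraction itself is simple in spirit. Ours is written in Go, but here's a rough Python sketch of the same idea; the class names, stream names, and broker address are made up, and it assumes the boto3 and kafka-python packages:

```python
# Sketch of a generic message sink that can publish to Kinesis or Kafka.
import json
from abc import ABC, abstractmethod


class MessageSink(ABC):
    """Anything that can accept one of our generic messages."""

    @abstractmethod
    def publish(self, stream: str, key: str, payload: dict) -> None:
        ...


class KinesisSink(MessageSink):
    def __init__(self, region="us-east-1"):
        import boto3  # assumes AWS credentials are configured
        self._client = boto3.client("kinesis", region_name=region)

    def publish(self, stream, key, payload):
        self._client.put_record(
            StreamName=stream,
            Data=json.dumps(payload).encode("utf-8"),
            PartitionKey=key,
        )


class KafkaSink(MessageSink):
    def __init__(self, brokers="localhost:9092"):
        from kafka import KafkaProducer  # kafka-python
        self._producer = KafkaProducer(
            bootstrap_servers=brokers,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

    def publish(self, stream, key, payload):
        self._producer.send(stream, key=key.encode("utf-8"), value=payload)


# Calling code only ever sees a MessageSink, so which backend it hits is a
# wiring decision at startup.
sink = KafkaSink()  # or KinesisSink(), chosen by configuration
sink.publish("clickstream", key="customer-42", payload={"event": "page_view"})
```

That's what lets us run the Kinesis path and the Kafka canary side by side: the rest of the pipeline doesn't care which sink it was handed.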
One of the things that jumps out at me, though, is that not all AWS tools are created equal. That's something you forget when you work with tools like Amazon S3, which actually had an incident yesterday and went down, but which tends to be very durable. RDS tends to be very durable; S3 is durable. But oftentimes you start reaching for newer tools, like the Simple Workflow Service that came out of Amazon, and those tools aren't necessarily as battle-hardened. They might be more in a beta state even though they've been released to the public. A lot of the more bleeding-edge tools we tried to use definitely didn't hold up to the workloads we were putting against them.

Exploration. This is one of the key philosophies we have at Astronomer, and through that exploration we've made some really awesome improvements to how we deploy DC/OS in our fleet.

CloudFormation to Terraform. When my team first started at Astronomer, it was the first time we'd ever operationalized Mesos anywhere, so we had a lot of work to do to understand what we were working with: how to support it when it goes down, or when nodes go down, and how to do healthy housekeeping across that fleet. One of the first steps was reverse engineering our CloudFormation templates and bringing them into Terraform. That gave any of our developers a really easy way to bring up test environments on the fly: if we needed to test a new version of DC/OS, it was really easy for us to make micro-adjustments in our config. It also let us scale out different node types fairly easily, so when we wanted to roll out new high-CPU or high-memory nodes, or change an underlying instance type, it was fairly easy to add those in.

This really set forth infrastructure as code for us. All of our changes are stored in our Terraform configs, with shared state kept in Consul, so any one of our engineers can be on any machine and we all see the same state. That was really good for change control. We focused on 100% repeatable installs without any under-the-hood manual configuration. That was really big for bringing up the underlying components: Postgres on RDS and all of our networking inside Amazon, inside our VPC, all of that is done in one click. It provisions everything from your bootstrap node to all of your masters and worker nodes, across any number of availability zones.

We've scaled this up to over a couple hundred VMs now, but we have had issues along the way. It can feel more like a pets situation than a cattle situation: as we scale Terraform out larger and larger in our organization, we definitely have pockets of very sensitive servers that need to be cared for like pets. That seems like a bit of an anti-pattern to me, so my team has been doing quite a bit more work to make sure we have healthy rescheduling of tasks that die. One of the things that's really hurt us lately is that on every one of our servers we have to be very careful when we're detaching things like REX-Ray volumes.

The next thing we focused on, and one of the tools we've had the best success with, is Prometheus. We went right after the throat of our CloudWatch metrics and decided we wanted to keep all of that data in-house. That meant we had to write custom exporters for Prometheus. We write those in Go and contribute them back to the open source community, and it's pretty awesome to be able to adjust exactly what you want to monitor.
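Writing an exporter is less work than it sounds. Ours are in Go, but here's a minimal sketch of the same idea using the official Python client, prometheus_client; the metric name, label, and port are made up for illustration:

```python
# A tiny custom exporter: expose one gauge on /metrics for Prometheus to scrape.
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge(
    "pipeline_queue_depth",              # hypothetical metric name
    "Events waiting to be processed",
    ["stream"],
)


def collect():
    # A real exporter would ask the service it watches for this number.
    QUEUE_DEPTH.labels(stream="clickstream").set(random.randint(0, 1000))


if __name__ == "__main__":
    start_http_server(9101)  # exposes /metrics for Prometheus to scrape
    while True:
        collect()
        time.sleep(15)
```

Because Prometheus pulls from that endpoint on its own schedule, the exporter stays dumb and stateless, which is part of why alerting on top of it is so easy.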
All of our systems are monitored out of the box, so every machine that comes up automatically registers with the correct tags, and we never really worry about that one instance that came up without the monitoring agents on it. I already touched on the fact that we write our own exporters, which is actually really easy, and the pull model makes alerting really easy. One distinction we make is that real-time logs flow into Elasticsearch, so we can do pattern matching and regular-expression matching there, while Prometheus looks at the health of a service: whether the numbers are out of whack, whether you're using too much CPU. So we really have two different types of alerting in our fleet.

Kong and the inevitable demise of the API gateway. We found Kong almost by chance; we weren't looking for it in the beginning. We were a heavy NGINX shop, and early on we needed to start replacing some of our GraphQL authentication, which got us down the path of looking at Kong. We already run Cassandra throughout our fleet, so it was really nice to run the different Kong nodes across all the availability zones and still have the Cassandra nodes spread across availability zones as well. We have much more fault tolerance in our API gateway than we had before, plus we're able to run it at a fixed cost: three instances handle just as much traffic as we were running through API Gateway, and we haven't seen a hit there. It has built-in authentication, it handles all of our rate limiting, and it will also invoke some of our old Lambda API functions, which was really nice during the transition to Go. And it's backed by Cassandra.

Time series databases for fun and profit. KairosDB didn't seem like a first choice for us. Our engineering team had done a lot of work with InfluxDB in the past and had great success, but as we passed 500 million events we really wanted better durability for that data than we were getting from Influx. That started us down the path of Kairos. KairosDB sits on top of Cassandra and works as an abstraction layer that gives you a time series database. It's highly opinionated about how you insert data, but it's very performant for writes, and we have it do automatic roll-ups and aggregations. Our UI and all of our billing services are backed by it, so whenever we want to pull up any of our usage metrics, all of that sits in Cassandra. We've seen extreme durability, even when nodes go down; Kairos does a really great job of housekeeping and re-indexing. Like I said, it came in as an underdog in our stack selection, because there are definitely some more up-and-coming technologies, but we haven't been burned by Kairos yet, and it's been awesome.
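To give a feel for how opinionated those inserts are, here's roughly what a write looks like against KairosDB's REST API. The host, metric name, and tags below are made up for illustration:

```python
# Writing a single datapoint to KairosDB over its REST API.
import time

import requests

KAIROS = "http://kairosdb.example.com:8080"  # hypothetical host


def write_datapoint(metric, value, tags):
    body = [{
        "name": metric,
        "datapoints": [[int(time.time() * 1000), value]],  # epoch milliseconds
        "tags": tags,  # KairosDB expects at least one tag on every series
    }]
    resp = requests.post(KAIROS + "/api/v1/datapoints", json=body)
    resp.raise_for_status()


write_datapoint("events_ingested", 42, {"customer": "acme", "source": "clickstream"})
```

You give up flexibility in how the data is shaped, but in exchange the writes are fast and the roll-ups and aggregations come along for free.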
So, what's next? Under development, we've been spending quite a bit more time with Apache Druid. A lot of the queries we run need backfill as well: maybe we get a task up to date within the last hour or two, but sometimes our customers ask us to go back and backfill the last month. One of the things we'd love is better tooling for pulling that data out of RAM instead of pulling it back off of disk, so we want to start powering some of our backfill workloads directly off of Druid.

We're also super pumped about Kafka Connect. That's one of the reasons we pushed hard to move off of Kinesis: there's definitely a need in the ecosystem we sell into for more real-time replication out of traditional relational database systems. It's not all of the data that we want to replicate across, but once you start recognizing the patterns you're looking for in your transactions or your databases, those can drive some really awesome business insights. It has also stopped us from having to run some of the more traditional batch ETL jobs; we've been able to replace them outright with Kafka Connect.

One component we still struggle to move off of is Amazon S3. The fact that it's just durable, and that it's up most of the time, and I'm going to knock on wood here real quick because it wasn't yesterday, makes it really difficult for us to go make an investment in something like Ceph or Minio. We use Minio a lot more in our development environments, but Ceph is certainly a lot more technical debt for our team to take on. We're looking at Q1 of next year to have Ceph figured out and operationalized.

Now, when you're focused on a software-as-a-service platform, interconnectivity between your services really starts to get dangerous. In the beginning you can firewall and reject almost everything with an iptables rule, but that doesn't scale once you start adding hundreds and hundreds of new containers to your fleet. So we're doing a lot more R&D into tools like Istio, Weave, and also Linkerd; it's very important for us to figure out that service-mesh fabric for multi-tenant environments. We also kind of danced a jig when we saw Kubernetes become available, because there are some lighter-weight workloads we're looking at running that don't necessarily need all of the durability, or the ability to roll out the more JVM-based tasks. The thought of having these lighter-weight Kubernetes-based pods that we can run for some of our stateless services really looks like a pretty awesome competitive advantage.

So that's who I am. You can look me up on Twitter or LinkedIn. I'm more than happy to take some questions and make this more interactive. I know there aren't many of us in here, but I'd love to hear if you've had any interesting challenges along your own journey toward a more open source environment.

Audience question: I had a question regarding your decision on storage. I understand it's up most of the time, but you always want redundancy, so I'm wondering why you wouldn't consider either on-premise storage or another provider, something like that?

We're definitely looking at more providers and more redundancy. The interesting thing is where we're at in our startup growth: you start determining the level of durability you need across each of your services. We have some particular tools that we run, like our UI; the user interface isn't the primary interaction with our platform, so we hold it to a much lower service-level measurement. And certain things, Amazon S3 for example, are not a persistent long-term storage solution for us; it's more of an intermediate. When we run a particular task in Apache Airflow, it will go gather all the data for a time range and dump that into a file in S3.
Then we're able to reload that into Spark and do our transforms, so it really becomes more of an intermediary space. Knowing that we have the ability to operationalize that and bring it on-prem if we need to for a particular customer, that's when Ceph would come into play for us. But there's a lot of pain removed by just being able to use Amazon S3 at this point. We have a smaller DevOps team, so like I said, it's one more component of the stack we would have to master. It's always easy to bring these services up; but when something actually goes wrong and you don't have the technical depth to deep-dive it, I always get a little uncomfortable rolling my team into supporting something we've only run in development for a short time.

Audience question: It sounds like you have a fairly small team and you're moving pretty fast, and it looks like you have a pretty good stack here, so I'm impressed by what you've been able to do. Can you talk a little bit about the composition of the team that worked on this, what their skill sets are, and the timeline you've had moving from Amazon to DC/OS and so on?

Sure. The size of our team now is about ten product engineers, weighted heavily toward the DevOps side. I came in originally as head of infrastructure, with an infrastructure team. We've since shaken up our engineering organization so that we have two teams for our two product lines: a Stream team and a Batch team. Each team has front-end engineers, a product manager, and then DevOps and product development engineers who sit in the two flexible roles. We share a lot of information, and some of our issues are cross-team; working on something like Apache Airflow would be one of those cross-functional requirements, but most of the time people snap back into their functional role, whether that's streaming or batch.

As far as getting up to speed, it took most of the team about six months to really get our heads around it and start running more and more of our production data through. One of the first things I did when I started was move a lot of our production systems of record off of DC/OS so we could figure out how all of the REX-Ray volume mounting actually worked: what happens when you restart a node, or a node gets scheduled for termination in Amazon. We really wanted to figure that out before we just trusted the magic REX-Ray box. Since then we've brought it all back in, and we feel like we've operationalized those components and understand how they work. We have a really good set of checklists for handling a node failure where the REX-Ray volume hasn't been detached and for migrating those volumes over. I'm really glad we did the extra diligence to figure that out.

Moving forward, there's still a fairly high overhead for getting people up and running on it, especially since a lot of our talent in Cincinnati comes from the Docker community. Mesos tends to be a more mature framework that a lot of the up-and-comers don't get direct access to. But despite that minor challenge of finding new talent who are already established Mesos people, we've been able to get a lot of our engineering team up and running quickly.
A lot of that is just the ability to bring up an environment in Terraform in about five to ten minutes and then tear it down at the end of the day if you have problems. Cool? All right, everybody, thanks for coming out.