Welcome, everyone, to theCUBE's presentation of the AWS Startup Showcase: AI/ML Top Startups Building Foundation Model Infrastructure. This is season three, episode one of our ongoing series covering exciting startups from the AWS ecosystem, here to talk about data and analytics. I'm your host, Lisa Martin, and today we're excited to be joined by two guests from Astronomer: Stephen Hillion, its Chief Data Officer, and Jeff Fletcher, its Director of ML. They're here to talk about machine learning and data orchestration. Guys, thank you so much for joining us today.

Thank you. It's great to be here.

Before we get into machine learning, let's give the audience an overview of Astronomer. Talk about what that is, Stephen, and talk about what you mean by data orchestration.

Yeah, let's start with Astronomer. We're the Airflow company, basically, the commercial developer behind the open source project Apache Airflow. I don't know if you've heard of Airflow. It's sort of the de facto standard these days for orchestrating data pipelines, data engineering pipelines, and, as we'll talk about later, machine learning pipelines. It really is the de facto standard; I think we're up to about 12 million downloads a month as an open source project, and by some measures it's at this point more popular than Slack. Airflow was created by Airbnb some years ago to manage all of their data pipelines and all of their workflows, and now it powers the data ecosystem for organizations as diverse as Electronic Arts and Condé Nast, which is one of our big customers and a big user of Airflow, not to mention the biggest banks on Wall Street, which use Airflow and Astronomer to power the flow of data throughout their organizations.

Talk about that a little bit more, Stephen, in terms of the business impact. You mentioned some great customer names there. What are the business impacts or outcomes that a data orchestration strategy enables businesses to achieve?

Yeah, at the heart of it is quite simply scheduling and managing data pipelines. If you're some enormous retailer managing the flow of information throughout your organization, you may literally have thousands or even tens of thousands of data pipelines that need to execute every day, to do things as simple as delivering metrics for the executives to consume at the end of the day, or producing, on a weekly basis, new machine learning models that can be used to drive product recommendations. One of our customers, for example, is a British food delivery service, and you get those recommendations in your application that say, well, maybe you want to have samosas with your curry. That sort of thing is powered by machine learning models that they train on a regular basis to reflect changing conditions in the market, and those are produced through Airflow and through the Astronomer platform, which is essentially a managed platform for running Airflow. So at its simplest, it really is just scheduling and managing those workflows. But that's easier said than done, of course. If you have 10,000 of those things, you need to make sure that they all run and that they all have sufficient compute resources. If things fail, how do you track those down across those 10,000 workflows? How easy is it for an average data scientist or data engineer to contribute their code, their Python notebooks or their SQL code, into a production environment?
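For anyone who hasn't seen Airflow, here's roughly what one of those pipelines looks like in code. This is a minimal sketch, assuming Airflow 2.4 or later; the DAG id, schedule, and task logic are illustrative, not taken from any actual customer pipeline.

```python
# A minimal Airflow DAG: two Python tasks on a daily schedule, with an
# explicit dependency between them. Names and values are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_metrics():
    # Stand-in for pulling the day's numbers from a source system.
    return {"orders": 1200, "revenue": 48000}


def publish_dashboard(ti):
    # Read the upstream task's result (handed off via XCom) and publish it.
    metrics = ti.xcom_pull(task_ids="extract_metrics")
    print(f"Publishing daily metrics: {metrics}")


with DAG(
    dag_id="daily_executive_metrics",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # the scheduler runs this once a day
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_metrics", python_callable=extract_metrics
    )
    publish = PythonOperator(
        task_id="publish_dashboard", python_callable=publish_dashboard
    )

    extract >> publish  # publish only runs after extract succeeds
```

Multiply a pipeline like this by ten thousand, each with its own schedule, dependencies, and compute needs, and the management problem Stephen describes comes into focus.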
And then you've got reproducibility, governance, auditing. Managing data flows across an organization, which is what we think of as orchestrating them, is much more than just scheduling. It becomes really complicated pretty quickly.

I imagine there's a fair amount of complexity there. Jeff, let's bring you into the conversation. Talk a little bit about Astronomer through your lens, data orchestration, and how it applies to MLOps.

So I come from a machine learning background, and for me, the interesting part is that machine learning expands naturally into orchestration. A lot of the same things that you use to develop and build pipelines in a standard data orchestration space apply equally well in a machine learning orchestration space. What you're doing is moving data between different locations and different tools, and then tasking different types of tools to act on that data. So extending it made logical sense from an implementation perspective. A lot of my focus at Astronomer is really to explain how Airflow can be used well in a machine learning context. It is being used well, and it is being used a lot, by the customers that we have and also by users of the open source version, but it's really about being able to explain to people why it's a natural extension and how well it fits. A lot of it is also extending some of the infrastructure capabilities that Astronomer provides to those customers, so they can run some of the more platform-specific requirements that come with doing machine learning pipelines.

Let's get into some of the things that make Astronomer unique. Jeff, sticking with you, when you're in customer conversations, what are some of the key differentiators that you articulate to customers?

So a lot of it is that we're not specific to one cloud provider; we have the ability to operate across all of the big cloud providers. We also have, I'm certain, the best developers, people who understand best practices and implementations for data orchestration. So we spend a lot of time talking not just about business outcomes with the business users of the product, but also with the technical people, helping them better implement things that they may have come across in a Stack Overflow article, or that haven't necessarily kept up with how the product has evolved. So it's the ability to run it wherever you need to run it, and our ability to help you, the customer, better implement and understand those workflows, that I think are two of the primary differentiators that we have.

I'll add another one, if you don't mind: lineage and dependencies between workflows. One thing we've done is to augment core Airflow with lineage services, using the OpenLineage framework, another open source framework, to track data sets as they move from one workflow to another, one team to another, one data source to another. That's a really key component of what we do, and we bundle it within the service so that as a developer or a production engineer, you really don't have to worry about lineage; it just happens. Jeff may show us some of this later: as data flows from a source through to a data warehouse, and out through a Python notebook to produce a predictive model or a dashboard, you can actually see how those data products relate to each other.
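The lineage tracking Stephen describes is wired in at the platform level through OpenLineage rather than written by hand, but open source Airflow has a code-level cousin of the same idea: data-aware scheduling with Datasets, where one workflow declares what it produces and another is triggered whenever that data is updated. A minimal sketch, assuming Airflow 2.4 or later; the dataset URI and DAG ids are hypothetical.

```python
# Cross-workflow dependency via Airflow Datasets (2.4+): the producer
# declares an outlet, and the consumer is scheduled on that dataset
# rather than on a cron expression. URIs and DAG ids are hypothetical.
from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

orders_table = Dataset("warehouse://analytics/orders_clean")


def clean_orders():
    print("Writing cleaned orders to the warehouse...")


def train_recommender():
    print("Training on the freshly cleaned orders table...")


# Producer DAG, e.g. owned by a data engineering team.
with DAG(
    dag_id="clean_orders_daily",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="clean_orders",
        python_callable=clean_orders,
        outlets=[orders_table],  # declares what this task produces
    )

# Consumer DAG, e.g. owned by an ML team; runs when the dataset updates.
with DAG(
    dag_id="train_recommender",
    start_date=datetime(2023, 1, 1),
    schedule=[orders_table],  # data-aware trigger, not a cron schedule
    catchup=False,
):
    PythonOperator(task_id="train", python_callable=train_recommender)
```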
And when something goes wrong, you can figure out what happened upstream that may have caused the problem, or, if you're about to change something, figure out what the impact is going to be on the rest of the organization. So lineage is a big deal for us.

Got it.

And just to add on to that, the other thing to think about is that traditional Airflow is actually a complicated implementation. It required quite a lot of time spent understanding what was almost a bespoke language that you needed to develop in to write these DAGs, which are the fundamental pipelines. So part of what we're focusing on is tooling that makes it more accessible to, say, a data analyst or a data scientist, someone who doesn't have, and doesn't really need to gain, the necessary background in how the semantics of Airflow DAGs work, so they can still get the benefit of what Airflow can do. There are new features and capabilities built into the Astro cloud platform that effectively abstract away the need to understand some of the deep work that goes on. You still have that capability if you want it, but we're expanding things so that orchestrated, repeatable processes are accessible to more teams within the business.

In terms of accessibility to more teams in the business, you talked about data scientists, data analysts, developers. Stephen, I want to ask you, as a Chief Data Officer yourself: are you having more and more conversations with that role, and how is it emerging and evolving within your customer base?

That's a good question, and it is evolving. If you look historically at the way that Airflow has been used, it's often from the ground up: you have individual data engineers, or maybe a single data engineering team, who adopt Airflow because it's very popular, lots of people know how to use it, and they bring it into the organization and say, hey, let's use this to run our data pipelines. But increasingly, as you turn from pure workflow management and job scheduling to the larger topic of orchestration, you realize it gets pretty complicated. You want coordination across teams, and you want standardization in the way that you manage your data pipelines. And so having a managed service for Airflow in the cloud, one that's easy to spin up as you expand usage across the organization, and thinking long-term about that in the context of orchestration, that's where I think the Chief Data Officer or the head of analytics tends to get involved, because they really want to think of this as a strategic investment that they're making: not just individual, per-team Airflow deployments, but a network of data orchestration.

Right, that network is key. You know, every company these days has to be a data company. We talk about companies being data-driven. It's a common phrase, but it's true: whether it's a grocer or a bank or a hospital, they've got to be data companies. So talk to me a little bit about Astronomer's business model. How is this available? How do customers get their hands on it?

So, Jeff, go ahead.

Yeah, so we have a managed cloud service, and we have two modes of operation. One, you can bring your own cloud infrastructure, so you can say, here is an account in, say, AWS or Azure, and we can go and deploy the necessary infrastructure into that. Or, alternatively, we can host everything for you.
So it becomes a full SaaS offering. We then provide a platform that connects the backend to your internal IdP process, or however you are authenticating users, to make sure that the correct people are accessing the services they need, with role-based access control. From there, we deploy the different services and capabilities through Kubernetes into either your cloud account or into an account that we host. And from there, Airflow does what Airflow does, which is reach out to different data systems and data platforms and then run the orchestration. We make sure we do this securely; we have all the necessary compliance certifications, for GDPR in Europe, HIPAA in the US, and a whole host of others. So it is a secure platform that can run in the place you need it to run, but it is a managed Airflow that includes a lot of extra capabilities, like the cloud developer environment and the OpenLineage services, to enhance the overall Airflow experience.

Enhance the overall experience. So Stephen, going back to you: if I'm a Condé Nast or another organization, what are some of the key business outcomes that I can expect? One of the things I think we've learned during the pandemic is that access to real-time data is no longer a nice-to-have for organizations; it's really an imperative. It's that demanding consumer who wants personalized, customized, instant access to a product or a service. So if I'm Condé Nast, or I'm one of your customers, what can I expect my business to be able to achieve as a result of data orchestration?

Yeah, I think in a nutshell, it's about providing a reliable, scalable, and easy-to-use service for developing and running data workflows. And talking of demanding customers, I'm actually a customer myself. As you mentioned, I'm the head of data for Astronomer, and you won't be surprised to hear that we actually use Astronomer and Airflow to run all of our data pipelines, so I can talk about my own experience. When I started, I was of course familiar with Airflow, but it always seemed a little bit unapproachable to me, and if I was introducing it to a new team of data scientists, they wouldn't necessarily want to have to think about learning something new. But because of the layers that Astronomer has provided with our Astro service around Airflow, it was pretty easy for me to get up and running. Of course, I've got an incentive for doing that, I work for the Airflow company, but we went from about 500 data tasks that we were running on a daily basis at the beginning of last year to about 15,000 every day; we run something like a million data operations every month within my team. And so one outcome is just the ability to spin up new production workflows, essentially in a single day. You go from an idea in the morning to a new dashboard or a new model in the afternoon. That's really the business outcome: removing the friction from operationalizing your machine learning and data workflows.

And I imagine, too... oh, go ahead, Jeff.

Yeah, to add to that, one of the things that becomes part of the business cycle is repeatable capability for things like reporting and for things like new machine learning models.
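As a sketch of what one of those repeatable deliverables might look like in code, here's a weekly report written with open source Airflow's TaskFlow API, which hides much of the classic operator boilerplate behind plain Python functions. This is illustrative only; the function names and numbers are made up, and it isn't the cloud IDE output Jeff demos below.

```python
# A weekly report as a TaskFlow-API DAG (Airflow 2.x): plain Python
# functions decorated as tasks, with dependencies and data handoff
# (via XCom) inferred from the function calls. Names are illustrative.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule="@weekly", catchup=False)
def weekly_exec_report():
    @task
    def fetch_numbers() -> list[int]:
        return [3, 1, 4, 1, 5]  # stand-in for a real query

    @task
    def summarize(numbers: list[int]) -> float:
        return sum(numbers) / len(numbers)

    @task
    def publish(average: float):
        print(f"Average for the week: {average:.2f}")

    # Passing return values between tasks creates the dependencies.
    publish(summarize(fetch_numbers()))


weekly_exec_report()
```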
And the impediment that has existed is that it's difficult to take that from the analyst team or the data science team that produces it, through to the data engineering team that has to work the workflow all the way through. What we're trying to unlock is the ability for those teams to get direct access to scheduling and orchestration capabilities, so that a business analyst can have a new report for C-suite execs that needs to be produced once a week, but with a much shorter time to repeatability. It's then immediately in the hands of the person that needs to see it; it doesn't have to go into a long list of to-dos for an already overworked data engineering team that eventually gets to it in a month's time. So that's also part of it: a lot of people get the benefit of being able to orchestrate things within a business, but having more people able to do it, and shortening the time to that repeatability, is one of the main benefits of good managed orchestration.

So, a lot of workforce productivity improvements in what you're doing to simplify things, giving more people access to data to be able to make those faster decisions, which ultimately helps the end user on the other end get the product or service that they're expecting. Jeff, I understand you have a demo that you can share so we can dig into this.

Yeah, let me take you through a quick look at how the whole thing works. So our starting point is our cloud infrastructure. This is the login; you go to the portal, and you can see there are a bunch of workspaces available. Workspaces are kind of like individual places for people to operate in. I'm not going to delve into all the deep technical details here, but the starting point for a lot of our data science customers is what we call our cloud IDE, which is a web-based development environment for writing and building out DAGs without actually having to know how the underpinnings of Airflow work. This is an internal one, something that we use. You have a notebook-like interface that lets you write Python code, SQL code, and a bunch of bespoke block types if you want. They all get pulled together to create a workflow. So this is a workflow, which gets compiled into what looks like a complicated set of Python code: the DAG. I then have a CI/CD pipeline where I commit this through to my GitHub repo. So this comes to a repo here, which is where the DAGs that I created in the previous step live. I can then go and say, all right, I want to see how those particular DAGs have been running.

We then get to the actual Airflow part. So this is the managed Airflow component. We add the ability for teams to fairly easily bring up an Airflow instance and write code inside our notebook-like environment to get it into that instance. So you can see it's been running. That same process that we built, that graph, ends up here inside the instance, but you don't need to know the fundamentals of Airflow in order to get this going. Then we can run one of these. It runs in the background, and we can manage how it goes.
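Outside the UI, the same trigger-and-monitor step can also be scripted against Airflow's stable REST API. A sketch, assuming the API is enabled with basic auth; the host, credentials, and DAG id below are placeholders.

```python
# Trigger a DAG run over Airflow's stable REST API (Airflow 2.x) and
# poll until it finishes. Host, credentials, and dag_id are placeholders.
import time

import requests

BASE = "https://airflow.example.com/api/v1"
AUTH = ("analyst", "not-a-real-password")
DAG_ID = "daily_executive_metrics"

# Kick off a new run; an empty JSON body uses the default run settings.
resp = requests.post(f"{BASE}/dags/{DAG_ID}/dagRuns", json={}, auth=AUTH)
resp.raise_for_status()
run_id = resp.json()["dag_run_id"]

# The run executes in the background, as in the demo; poll its state.
while True:
    run = requests.get(f"{BASE}/dags/{DAG_ID}/dagRuns/{run_id}", auth=AUTH).json()
    if run["state"] in ("success", "failed"):
        print(f"Run {run_id} finished with state: {run['state']}")
        break
    time.sleep(10)
```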
And every time this runs, it emits to a process underneath, the OpenLineage service, the lineage integration, which allows me to come in here, have a look, and see that this was the same graph that we built, but now it's the historic version. So I know where things started, where things are going, and how it ran. And then I can also do a comparison. If I want to see how this particular run worked compared to one historically, I can grab one from a previous date, and it will show me the comparison between the two. So that combination, managed Airflow that gets Airflow up and running very quickly, the cloud IDE that lets you write code without needing to know how to get something into a repeatable format, getting that into Airflow, and having it attached to the lineage process, adds up to a complete end-to-end orchestration process for any business looking to get the benefit of orchestration.

Outstanding. Thank you so much, Jeff, for digging into that. So one of my last questions, Stephen, is for you. This is exciting; there's a lot that you're enabling organizations to achieve here to really become data-driven companies. So where can folks go to get their hands on this?

Yeah, just go to astronomer.io. We have plenty of resources if you're new to Airflow; you can read our documentation and our guides to getting started. We have a CLI that you can download, which is really, I think, the easiest way to get started with Airflow. But you can also sign up for a trial, or for a guided trial where our teams, we have a team of experts, really the world experts on getting Airflow up and running, will take you through it and allow you to actually kick the tires and see how this works with your data. And I think you'll see pretty quickly that it's very easy to get started with Airflow, whether you're doing that from the command line or in our cloud service. All of that is available on our website, astronomer.io.

Jeff, last question for you. What are you excited about? There's so much going on here. What are some of the things, maybe give us a sneak peek, coming down the road that prospects and existing customers should be excited about?

I think a lot of the development around the data awareness component. One of the things that's traditionally been complicated with orchestration is that you leave your data in the place that you're operating on, but we're starting to have more data processing capability built into Airflow, and from an Astronomer perspective we're adding more capabilities around working with larger data sets, doing bigger data manipulation inside the Airflow process itself, and that lends itself to better machine learning implementations. So as we start to grow and get better in the machine learning context, well, in the data awareness context, that unlocks a lot more capability to implement proper machine learning pipelines.

Awesome, guys, exciting stuff. Thank you so much for talking to me about Astronomer, machine learning, and data orchestration, and really the value in it for your customers. Stephen and Jeff, we appreciate your time.

Thank you.

My pleasure.

And we thank you for watching. This is season three, episode one of our ongoing series covering exciting startups from the AWS ecosystem. I'm your host, Lisa Martin. You're watching theCUBE, the leader in live tech coverage.