Hello, and welcome back to our special SuperCloud 5 edition. This is a battle for cloud supremacy, and AWS is battling for it. We've got live coverage on location in Las Vegas for AWS's annual conference, plus our in-studio special presentation, breaking down all the hot trends and the most important conversations happening with the experts, customers, and engineers. Everyone's weighing in on this next-level cloud where generative AI is front and center, and data and AI is the value proposition that's going to shape the future. With us is Steven Hillion, SVP of Data and AI at Astronomer — Airflow, Astro; if you're familiar with those, you'll know all about this company. Behind every online interaction, from checkouts to dashboards and all kinds of processes, data is at the center of the action, and data pipelines are essential to operational analytics. As an expert in this and where it's headed, Steven, great to have you here on this special edition of SuperCloud 5. Happy to be here. So we love talking about the future. I remember saying on theCUBE back in 2013, Dave, data developer will be a category — the idea that data will be programmable, not locked in some siloed data warehouse. We're coming off the trough of the Hadoop era, the where-are-the-big-data-results days, and okay, that was a nightmare. But now it's happening, finally. It takes 10 years to gestate, but we're in the middle of a perfect storm of innovation. Some compare generative AI to the early web wave — nascent, growing fast — but here the hype is matched by reality: people are putting stuff into production faster. It's going to take a changeover in mindset for data, though, and you guys are doing this. Take us through, real quick, what the company does, the core product, and how it fits into this big wave. Yeah, you're right.
There was that cycle of going through new data platforms, and I think we've now arrived at a whole infrastructure that may still be very complicated, but it's so much easier to produce data products. And that's what it's about. It's about data production, about things that are meaningful for driving the business, right? To do that, you need to orchestrate the flow of data throughout the organization, and that's what we do. Astronomer is the commercial developer behind Apache Airflow, developed of course originally by Airbnb to manage all of their data pipelines. Well, now it manages the world's data pipelines, and we're providing a cloud service for that. Take us through — you have a lot of experience, you've seen this movie before across multiple cycles of innovation, but the data piece in particular, with AI, is set up. If you go back and look at infrastructure as code, the DevOps movement, DevSecOps — now data has that same vibe at large scale. It feels like the SRE conversation we used to have about 10 years ago, as SRE became the standard for platform engineering as it is today. But those platform engineering teams now have to deal with a new architecture that factors in data engineering. You guys are in the middle of this. What does data engineering mean here? What are you seeing? This is a big part of your product. Historically, data engineers have been about the production of meaningful data sets that are used by data scientists and application developers to power their applications. What really matters there is less writing this particular SQL query or generating this particular feature, but does it run operationally? Does it actually power things lights-out? Is it a reliable data set with SLAs associated with it? So you're exactly right: building that platform has now become what's foundational, rather than just the individual data sets. It's about the operational aspects of that.
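Orchestrating the flow of data, at its core, means running tasks in dependency order. Here's a minimal plain-Python sketch of the idea — this is not Airflow's actual API (Airflow expresses the same thing with DAGs of operators), just an illustration of what an orchestrator fundamentally does:

```python
# Minimal sketch of what an orchestrator does: run tasks in dependency order.
# Plain Python for illustration only — not Airflow's actual API.
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """tasks: name -> callable(upstream_results); deps: name -> set of upstream names."""
    order = list(TopologicalSorter(deps).static_order())  # dependencies come first
    results = {}
    for name in order:
        upstream = {d: results[d] for d in deps.get(name, set())}
        results[name] = tasks[name](upstream)
    return results

# A toy three-step pipeline: ingest -> transform -> publish a data product.
tasks = {
    "extract":   lambda up: [1, 2, 3],
    "transform": lambda up: [x * 10 for x in up["extract"]],
    "load":      lambda up: sum(up["transform"]),
}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
```

The orchestration layer's value is exactly that separation: task authors write the callables; the platform worries about ordering, retries, and passing results downstream.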
It's about DataOps. And data engineers, I think, have long enjoyed generating new data sets, but now they're enjoying creating frameworks that allow everybody to build data sets. This is a killer feature I want to get into, because data frameworks have a programming vibe to them. You mentioned Airbnb and Airflow; we also feature Uber on our show, where the platform has to deal with multiple databases and different modalities of data in real time. So you have this idea that, okay, if I have this kind of new architecture, what does that mean for the developer? And with generative AI being a big part of this modern stack, we hear that I don't need five PhDs to do this, or I might not need five data engineers to do this. What's that going to turn into? As generative AI comes into this modern, multimodal data environment, what's the core of its impact here? What does it do? I think it has a two-fold impact. First of all, you've got a lot of tooling that's suddenly available to produce very rich results with less effort than before. You've got pre-trained foundation models. You've got a whole layer of tooling around that. But it's the same old story in the sense that, how do you even get started, right? How do you pick your way through this forest of different technologies? So I think the impact is that you have more tooling, but now you have to make a decision about what the right architecture is. And that, in a sense, is where orchestration comes in, because it's the one thing that ties together all of these different technologies. And so it's gonna have an opinion about those, right? One of the most powerful things we did as a company was to produce a registry of connectors from the Airflow pipeline — from the Airflow conveyor belt, if you like — connectors to different technologies.
That's inherently opinionated, because it says, yeah, if you wanna build a production model, you may wanna do that in memory on our platform, but if it's gonna have to scale to very large data sets, then maybe you wanna use Databricks. If you wanna use large language models, there's a whole host of different toolkits and different providers of LLMs, but these are the ones we've pointed to, whether that's Cohere or OpenAI or Weaviate or pgvector. We have off-the-shelf connectors for those that make life a lot easier for the data engineer. So take us through what you guys do real quick, because I wanna connect a lot of things to what you just said. There's a picks-and-shovels market going on right now in generative AI, and a lot of choices to be made; people are trying to figure out where they are on the spectrum. Is there a power law in models? What's a specialty model? How do I have my data interact? Which data sets should I protect as my intellectual property, and which are open? I mean, these are all questions. How does a company engage with you guys? The enthusiasm is there, but what do you do to give them the confidence to move forward? Well, the first thing we wanna do is provide a stable platform so that you can build these applications in a way that's reliable in a production setting, right? Just having a cloud service for orchestrating data pipelines — from data ingest and feature generation out through model development, model deployment, and model monitoring — a reliable platform for that is number one. But what we've found over time is that people actually want us to be more opinionated. If we orchestrate these things, well, what are the things that we should be orchestrating?
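The registry-of-connectors idea can be sketched in plain Python: a catalog of named integrations that pipelines look up by key, so the opinionated choices live in one place. The connector names and factory functions below are illustrative stand-ins, not Astronomer's actual registry:

```python
# Sketch of an opinionated connector registry: the platform exposes a catalog
# of named integrations, and pipelines look them up by key.
# The names ("s3", "weaviate") and factories here are made up for illustration.

REGISTRY = {}

def register(name):
    """Decorator that adds a connector factory to the catalog."""
    def wrap(factory):
        REGISTRY[name] = factory
        return factory
    return wrap

@register("s3")
def make_s3(bucket):
    return {"kind": "object-store", "target": bucket}

@register("weaviate")
def make_weaviate(url):
    return {"kind": "vector-db", "target": url}

def connect(name, *args):
    """Resolve a connector by name, failing loudly with the known catalog."""
    if name not in REGISTRY:
        raise KeyError(f"no connector for {name!r}; known: {sorted(REGISTRY)}")
    return REGISTRY[name](*args)
```

The design point is that the lookup failure itself is useful: it tells the data engineer what the platform's curated options are.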
So we're starting to produce more reference implementations that say, if you're gonna build a RAG-style application on large language models, well, we've got a serving suggestion, if you like, on the box: we'll tell you what components you're probably gonna need and how to fit them together. So I think that's really a big push for us this year — Astronomer is not just the better Airflow service, but has actual, again, serving suggestions for how to build real applications. Who's your target customer? Who's the user? Who's the buyer? I can imagine it's probably platform engineering. Take us through who's actually engaging with it. Yeah, the end users are the data engineer and the machine learning engineer. This is where they meet, in fact. They're both adept at using Python and SQL — the language of our platform really is the language of the data scientist and the data engineer. So this is where they meet. They're the end users, writing Python pipelines in the Airflow framework. They don't have to do an awful lot of extra work beyond what they were already doing in their notebooks to get that running in production. So we wanna make their lives easier. It's a thrill for them, right, to say, I'm not just experimenting with a model — now it's actually running in production, essentially at the click of a button. The buyer, the person who really cares about this, who says, I want all of you folks to be using one platform, is the chief data officer, the VP of data, the VP of machine learning — someone who really wants a platform that makes it easier to push these things into production and that encourages the use of standards and best practices. And what are people saying when they use you? Is it faster time for the developers, more time for operations? What are the benefits you guys are offering? A lot of it is reliability and scalability and uptime, basically.
I wanna use Airflow because Airflow is the de facto standard for orchestrating pipelines, but I don't wanna have to manage it myself with an army of infrastructure engineers. So let's just make it a cloud service. A lot of it is cost as well, right? Especially for machine learning workloads, you want those to scale up at the right time, when you're building models with large quantities of data — both the data processing and the actual training of the models — but you also want them to scale down when you're not using them. That's very common in the machine learning world, and the same goes for the data engineers. I've got to ask you, as the person with product roadmap responsibility — you've got the keys to the kingdom for the company, and you talk to a lot of customers. As the AI world evolves, I'm envisioning a new kind of operating system, an AI-like system that combines data pipelines, real-time management, policy, guardrails, all these things we hear about. Is there a future soon where there's going to be some automation around managing data pipelines? Because moving data around is very expensive. Why should I even build an LLM if I can just use a pre-existing one: call an API, send a prompt in, get data back? That's like an API to me — a prompt is just a call, right? Why not? It seems like a system is emerging that will scale. What does that look like? That's right, I think that's true. You want that to be straightforward and safe and easy to do. But there's still some artistry involved, right? I mean, you talked about prompt engineering; there's fine-tuning of models. In a sense, it's no different from the traditional world of model building, where there's a lot of finesse in the data sets that you bring to it, the features that you generate, and the way that you work with those models and monitor them. So that's important, I think.
But as for how to manage that, it's interesting to look to AI itself and ask what it can do to help, because obviously AI has something to say about code generation. So what we're finding — and we've already built prototypes and features around this — is: if I'm building a pipeline, help me build that pipeline. You already know the context of all my data, what's upstream and downstream. So can you create the framework for the DAG — in our language, the pipeline, the workflow — and then fill in some of the code? Code generation, Copilot and so on, is already widely in use, and we're making that available to the data engineer too. I think that's going to be a great accelerant for building new things, because if you have that kind of copilot — human plus AI, knowing a lot about the environment — it's going to be faster to deploy new stuff. Look, my other job, actually, is running the data team for Astronomer, right? We analyze our own data, all of that rich Airflow data, so that we can provide a better service to our customers — all the usual stuff, as well as making suggestions to them and building those reference implementations. And for me, this accelerant is crazy. We went from something like 1,500 tasks per month when I first started the team to, two years later, 1.5 million tasks a month, just for a relatively small company like ours, because it's so powerful and so easy to do. Yeah, and that's just improving existing stuff — you get better at what you have. Now the question is, what happens next? What new things are emerging? When you think about orchestration — okay, you've got this nice platform that's enabling more scale and more tasks — what's net new that you see coming out of this that you're gonna harvest? We're so overwhelmed by ideas right now. I just started a new Notion page, and I'm saying, stop, people — stop telling me about ideas. My brain's full.
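The pipeline-generation idea — scaffold the DAG from lineage the orchestrator already knows, then let a copilot fill in the bodies — can be sketched like this. This is a hypothetical helper that emits a plain-text skeleton, not real Airflow code or any actual Astronomer feature:

```python
# Sketch: scaffold a pipeline skeleton from known task dependencies.
# A code-generation copilot would fill in the TODO bodies; here we only
# emit the frame as text. Task and pipeline names are illustrative.

def scaffold_pipeline(name, deps):
    """deps maps each task to the list of tasks upstream of it."""
    lines = [f"# pipeline: {name}"]
    for task, ups in deps.items():
        upstream = ", ".join(ups) if ups else "none"
        lines.append(f"def {task}():  # upstream: {upstream}")
        lines.append("    ...  # TODO: copilot-generated body")
    return "\n".join(lines)
```

The point is that the structural half of the work — which tasks exist and how they depend on each other — is already known to the platform, so only the task bodies need the human-plus-AI treatment.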
But already we're starting to look at things like pipeline generation, for example, and automating that process with code gen. We're also looking at — well, everybody is building a chatbot, and why not? They're really powerful, and it's better than searching through all the documentation. But you can do that in a way that's very contextual, right? If you've got a problem with your pipeline, or this data set hasn't arrived in time, you can look in the logs and say, oh, this may be the cause of the issue. So we're looking at stuff like that. What's most exciting for me — it's gonna sound really boring — is data documentation. Nobody documents their data. There are no data dictionaries out there worth looking at. But why can't those be automatically generated? As the orchestrator, we have a lot of the context: we know all the data sets and the relationships between them. So maybe we can help you with your documentation and handle that for you. You know, the low-hanging fruit is that there's a lot out there: getting success, getting more budget, getting more efficiency that frees up creative time. I mean, we're entering a creative class. I don't think I'd ever used the word artisan before. I like it, because craftsmanship is coming back. I won't say voodoo, because prompts aren't magic — you can actually engineer them. That's right. Well, it applies to our customers, not just the developers, right? They now get to focus on generating business value. There are so many use cases I could talk to you about. We've got some of the largest retailers in the world running all their pipelines on our infrastructure. But my favorite is Laurel. They used to be called Time by Ping. Their job is to automate the process of creating the time sheets that lawyers and accountants have to submit at the end of the day. It's an incredibly tedious process. Lawyers hate doing it.
Well, if you think about it, you can just look at what they're doing on their screens to explain and document what they did and how they spent their time, in 15-minute increments. Then you can summarize that with these large language models and submit it to the client. So all of that work has gone away for these lawyers, and they can focus on actually doing their jobs and being creative. And the client gets accurate billing on the other side — it's all good. Well, before we get into the customer applications — building models for the app side, which is a big part of what comes out of the enablement you guys provide — I want to ask about something you mentioned earlier: RAG, retrieval-augmented generation. A big part of that is vector databases and embeddings, kind of new stuff. One thing that's come up is, how do you handle the observability aspect of this new stuff? How do you know when a prompt works? There are new signals coming out of the engineering and efficiency of this — it's net-new data, and there's no established observability playbook yet for understanding it, say from a retrieval standpoint. So there are interesting dynamics here. How do you know what's good? How do you keep track of things? Yes, well, that's very important. And again, you're not just calling these models, right? You're populating the vector databases, you're retrieving those documents later, you're doing the prompt engineering and so on. All of that needs to be monitored, as well as the accuracy of the models themselves. Not to be glib, but that is exactly what the orchestrator does, right? It's about running all of those things in a single platform, having the log access, having the statistics recorded to a central location.
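Recording model-call statistics to a central location can be sketched as a wrapper that logs latency and rough token counts for every call. The "model" here is a toy stand-in, the token counting is a crude word-split proxy, and none of this is a specific vendor's API:

```python
# Sketch: central statistics for model calls — wrap any model-calling
# function so every invocation is recorded to one place.
# The wrapped "model" below is a toy; real token accounting would come
# from the provider's response metadata.
import time

CALL_LOG = []  # the "central location" for this sketch

def monitored(model_name, model_fn):
    """Return a version of model_fn that records stats for every call."""
    def call(prompt):
        start = time.perf_counter()
        reply = model_fn(prompt)
        CALL_LOG.append({
            "model": model_name,
            "latency_s": time.perf_counter() - start,
            "prompt_tokens": len(prompt.split()),  # crude token proxy
            "reply_tokens": len(reply.split()),
        })
        return reply
    return call

# Toy "model" for illustration: just upper-cases the prompt.
echo = monitored("toy-echo", lambda p: p.upper())
```

With every call funneled through one ledger, questions like "which prompts are slow" or "where are the tokens going" become simple queries over `CALL_LOG` instead of archaeology across services.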
We're also integrating with tools like LangSmith, for example, which is becoming a mechanism people often use for looking at the way large language models are used and executed, both in development and in production. So we're integrating that into our tooling as well. We view ourselves as sort of the orchestrator of orchestrators, the observer of observers, in some ways. And I think this orchestration model is pivotal, because you can build on that foundation. I want you to address that, because everyone's asking, do I need another platform? In this case it's more than another tool. Talk about the dynamic between platform and tools in this context, and what it means to the customer. It's very important for us to be agnostic, right? I mean, we're not building the models. We're not executing the SQL queries. It's very important for us to be opinionated but agnostic — having references to particular databases and compute platforms and tooling and so on. We wanna be the one-stop shop for all of the different toolkits available in the modern data stack, but we wanna make it easy to plug those together. So it's very important for us to have integrations that are built in and to make it easier for people to use them — for example, through higher-level abstractions that we can build. Say you're taking a bunch of documents from a standard location like S3 and plugging them into a standard vector database like Weaviate. Why should you have to write all the individual lines of code to do that? That's a standard orchestration building block, so we'll provide it for you. Yeah, and one of the things about these embeddings is that they're not compatible across models — they've got to match the data store piece of it. These are considerations every customer has to figure out, and that's why platforms tend to do well. Yes, that's right.
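That documents-into-a-vector-database building block can be sketched end to end in miniature: a toy bag-of-words embedder standing in for a real embedding model (OpenAI, Cohere, etc.) and an in-memory list standing in for Weaviate. Everything here is illustrative, not any real client library:

```python
# Sketch: embed documents and retrieve by cosine similarity.
# The embedder is a toy bag-of-words hash, NOT a real embedding model;
# the "store" is an in-memory list, NOT a real vector database.
import math
import zlib

DIMS = 256  # a real model has fixed dims too; mixing models breaks compatibility

def embed(text):
    """Toy embedding: hash each word into a fixed-size count vector."""
    vec = [0.0] * DIMS
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % DIMS] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    def __init__(self):
        self.rows = []  # (document text, vector) pairs

    def add(self, text):
        self.rows.append((text, embed(text)))

    def query(self, text, k=1):
        q = embed(text)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]
```

This also makes the compatibility point concrete: query vectors only make sense against vectors produced by the same embedder with the same dimensionality, which is why the embedding model and the data store have to be chosen together.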
What should companies and people be wary of as they go in here? What should they stay away from? Are there tripwires in the process you're seeing out there? Many people have talked about data security and the ethics around the use of these models — Biden's executive order that just came out is, I think, a good first step in that direction. But the other thing to think about is cost, actually. Over the last 10 to 15 years, as people have adopted more cloud services, there's been a bit of an overcorrection, and now people are saying, let's monitor these cloud costs, make sure we're getting the biggest bang for the buck, and be able to scale back when we need to. That's as true with large language models as with anything else. It's very exciting to use these things, but they do cost. So we actually think one of the reasons you need a centralized data platform for managing all these flows is that it gives you one place to go to assign responsibility — one place where you can monitor the lineage of all of your data sets and the costs that accrue, from your data queries to your model calls. What do you say to the folks who say, hey, everyone's coming at me saying, I'm the data platform, pick me? Is there a beauty contest going on out there, or can platforms coexist? Obviously the pipeline is super-critical infrastructure, and you've got the data piece of it, then security platforms and other platforms. Is it a platform of platforms? Can a multitude of platforms coexist? How do customers navigate this? Yeah, I think again, the role we can play is in creating these reference implementations and having integrations — we're never gonna do all of them.
So we're gonna be selective, and we're gonna keep our ear to the ground to hear what people are using and the success they're having. We can literally monitor that and get some sense of what works and what doesn't. In general, data scientists and machine learning engineers and data engineers have their favorite tools, and the open source community is very rich and constantly creating new things. What matters, I think, is how well those things run in a production environment and how well integrated they are. We can, of course, be the glue, but we can only go so far. So there's a game to be played here between us, as the diplomats, and all of the different technologies: if they work well, we'll add them into our interfaces and provide those services. You know, Steven, one of the things being played out here in the conversation around AWS's annual user conference — next-gen cloud, generative AI, as they try to get that more into play — is, how do I run this at the end of the day? So take us through an example of a customer: a modern, progressive company with a modern stack that has AI and NLP and has built some models. They probably believe they have intellectual property in their data, so they get it. What do they do next? How do you engage with them? They probably know your open source projects; they're involved. That's right. What happens? Take us through a day in the life. So what we've seen so far is companies like Ramp, for example, one of our customers, developing internal applications that make use of large language models so they can serve their customers better. And they've built those internal applications, but quickly realized that, again, those have to be fed with data.
I think even now, because in a way this is highly analogous to traditional data engineering and machine learning, if not downright identical, you can use standard best practices around ingestion of data, model training, and model monitoring. And we have examples of how to do that on the registry I mentioned. So that's often the first point of engagement: they'll just come to our website and start using those. We've actually started building our own applications, as I mentioned, using generative AI, for example, and we've made those very public. So they can just jump right in and start using them? That's exactly right. We have DAGs pre-written that are available right now on GitHub — you can search for them on our website — that allow you to build pipelines that do exactly the sorts of things we've just been talking about: processing documents, interacting with vector databases, calling out to large language models, monitoring them, looking at costs, and so on. So you have a flywheel going: as you get these out there, you're getting data from your own environments that makes the product better for customers. That's why your ideation Notion board's going crazy right now, probably thanks to your team. It's why I love my job, honestly, because I get to listen to what customers are doing and say, okay, I'm going to try that too — I'm going to copy you, but do it in a way that aggregates what the community is doing, and then share that back out. You mentioned tasks going from thousands to millions. What's the big takeaway as you look back and pinch yourself and say, wow, this GenAI is legit, next level? What have you observed? What can you share? What jumped out at you the most that says the game has changed?
Yeah, when we first put out one of our own internal chatbot applications, I was thinking to myself, would this be useful for some of our professional services work? And then I saw a conversation going on in our internal Slack: hey, I've got this old Airflow pipeline and I need to migrate it to a modern one — how would I do that? A suggestion was made, and then it was, well, now suppose this was a little more complicated, or I had two of these things — now what if I had a thousand of them? The conversation kept going, and eventually they arrived at an answer. Well, one of those participants was our professional services person, but the other was our internal chatbot, and it really opened my eyes. We've all had that moment, of course, watching the interaction between humans and large language models, but the application to something that's a real burden to users — the migration of pipelines, especially from legacy platforms — gave me this sort of road-to-Damascus moment, where I realized we can be using these things to radically simplify the production of new pipelines. Yeah, I mean, every new wave is about simplification — reducing the steps it takes to do something, making it easier — and this is what we're seeing at scale. What's the final word? Give a quick plug for what you guys are doing. Why should people pay attention to what you're working on? For the people watching who are in data — I'm sure the data engineer is going to explode as a persona; we're seeing it — what's the message you'd like to share? Well, let's focus on generative AI, because I think that's what a lot of people here at the conference are talking about, and rightly so, because it is exciting. There are a lot of opportunities, but you have to approach it, maybe not so much with caution, but with a view to how to make it real. And I would say there should be no limits to people's imaginations.
But in the end, if it's not running in a production setting, like any other predictive model, then it's meaningless, right? The fun stops at a certain point, and things get real when you actually want this to drive the business — and that's kind of fun too. So to me, the most important thing is to look at what your colleagues and compatriots are doing across the industry and learn from them, because there are so many people now engaged with this technology, and there are lots of great examples. Our hope is that we can be of service there by consolidating some of those into a central location. And then everyone's probably going to realize, as we're reporting here on the ground, that everyone's going to be in the data business: data aggregation, data pipelines, building infrastructure, plumbing, abstractions to manage data flows and feed them into applications. That's exactly right. I mean, sometimes I resist using the word plumbing, because it's hardly the most romantic or glamorous thing, but that is exactly what we're doing. We are the plumbing for the modern data infrastructure. Well, I really appreciate what you do. Having the product responsibility is a huge task — you've got the keys to the kingdom, you've got the customer side, high-velocity change, internal change in the company — but we're in a world where data engineering is at the beginning of something bigger, I think, like we've seen with the cloud. So congratulations. Yeah, it's moving very fast and it's very exciting. Thanks for coming on theCUBE. Appreciate you sponsoring us here at SuperCloud 5 and being part of our program. My pleasure, thank you. Thanks. Okay, that's our coverage here — we'll be back with more after this short break at SuperCloud 5. AWS's annual conference is going on; all the action here on theCUBE.net from Palo Alto. We'll be right back.