All right, well, hi everyone, and welcome to the last day of the conference. You made it, congrats. Today we're going to talk about how Spotify has used Flyte to revamp their financial analytics platform. My name is Haytham Abuelfutuh; I am one of the co-founders and maintainers of the Flyte open source project. Joining me today is Dylan Wilder from Spotify. Unfortunately, due to personal circumstances, he wasn't able to be here in person, but he was gracious enough to share his thoughts in a recorded video. So without further ado, I'll let him introduce himself and talk about what Spotify has been trying to do.

My name is Dylan. I'm an engineering manager on a team called Vivaldi within Spotify; we work in the finance department, and I'm also the tech lead on a project called One Model, which is what this talk is about, and One Model is really powered by Flyte. I just want to give some background on what we're talking about. One Model is a financial forecasting problem, and quickly, what financial forecasting is: every quarter, as part of compliance obligations, Spotify is required to project two years into the future what we think our profits and losses will be. That's part of being a public company. It's basically a projection across all of Spotify's various business processes: what revenues do we think we'll take in, and what costs do we think we'll have to spend in order to get there? In addition to being a requirement, it also forms a big part of Spotify's business planning. You can imagine this is useful for investors, and it's just as useful internally for the people making decisions, the C-suite, etc. Unfortunately, it has organically evolved over the years into a bunch of heterogeneous processes across many different teams. As you might imagine, Spotify is a pretty complicated company; we have a lot of different products.
We work in a lot of different markets, we deal with licensors, music law, things like that, and so the expertise needed to run this system is scattered across a lot of different domain experts. As it stands, this process takes about three to four weeks every quarter and spans as many as eight different teams. These teams work in silos; some of these pieces are Excel models, and at the end of the day, team one is handing off a Google Sheet to team two before team two can start running. If the number of subscribers we get is an input to the amount of revenue we think we're going to get, team one has to hand that off to team two. So it's a complicated process.

Goal number one that we're trying to accomplish may be obvious: it's an automation problem. We want to fix these manual handoffs, we want to speed up the feedback time of this process, and we want to reduce errors. The secondary piece is that, because it takes so long to run, we can only really do this once per quarter; it's too much of an effort investment. If, by automating, we can run this end to end within a couple of hours, we unlock the second piece of the problem: business case and scenario analysis. The people making the decisions can come in and ask, what happens if we increase our subscribers by 20 million people in India, versus what happens if we open a new market in Eastern Europe? You can start to ask these questions more frequently without needing humans in the loop to answer them.

There we go. So this is a high-level overview of what it looks like. Again, it's a complicated process. Each of the nodes in this graph isn't one Flyte task per se; it's a bunch of different things.
These are just high-level logical pieces of Spotify's business model. As you can see here, each individual one of these might be owned by a different team, and there are lots of complex dependencies between teams. This is our best shot at what the whole thing looks like; there is no one person in charge of the overall end-to-end process who knows exactly what it looks like. It's very much a distributed ownership model. So, yeah: One Model is about automating all the components, and once we do that, we can unlock this business case and scenario analysis, and we think that'll be really cool.

All right, and thank you, Dylan. You can think of that part as our requirements-gathering phase. Now we have a problem statement, and this is Spotify, right? You can imagine that the problem replicates at a lot of other, even smaller, companies that don't have the capabilities or the infrastructure they do. Before I talk about how Flyte has helped them, and how it has been used at Lyft as well, I'll go quickly through how we got here.

It was not a very long time ago that this was the software mantra: you develop software, ship some golden image on a CD, and throw it over the wall to the poor ops engineers who have to deploy, scale, and service your application and handle incidents and all of that. In 2009 the term DevOps was coined, and it has skyrocketed since. The coining of a term doesn't really mean we finally figured it out or solved the problem.
I think it's more of a recognition that there is a problem, or a set of requirements, to be solved. Over the years, we as an industry have built quite a lot of tools: CI systems, monitoring, incident management and response, and you can see a lot of them on the showcase floor downstairs if you've been there. A couple of years after that, another term got coined: RPA, robotic process automation. That was more on the business side. A lot of companies looked at DevOps and all the innovation that had been happening there and said, well, we want some of that. But their problems are a little bit different: the data they deal with is different, and the people who need these automations are not necessarily developers who will write code for it, so they needed a slightly different set of solutions. By 2018 or so, there was a peak number of RPA solutions out there; now things are sort of converging. And in 2015 the term MLOps was coined, already six years after DevOps. Again, the dates don't mean a whole lot, but one of the signals you can deduce from them is that the industry's understanding, or recognition, of the difference in problems and requirements for ML and data only really matured around that time. Since then, we have been developing tools and processes, trying to make the lives of ML engineers and data scientists better. Before that, I would say, there were a lot of attempts to force DevOps tools onto ML engineers. They conceptually sort of work: if you want to deploy your model, yes, you can use one of the existing CI tools for it. If you have multiple steps, training and pulling data, you
can superimpose that onto an existing CI tool, but it didn't really quite work; a lot of engineers found themselves fighting with infrastructure to get anything done. I found this meme very applicable because, in a few months, it will be exactly seven years since that term was coined, and it pretty much feels like only an hour has passed in terms of our progress on solving this problem.

I wanted to summarize some of the key differences I see between the requirements for DevOps and for data and ML ops. From a 10,000-foot view they look like pretty much the same set of problems, but when you look at the details, they diverge. The inputs to a data or ML pipeline, a model training job for example, are not just the code you wrote to train the model; they also include the terabytes of data you want to train the model on, and when that data changes, your outputs change. In the dev world you just have code, usually measured in megabytes, and it's okay to have 10 or 100 different CI pipelines that each check out the code every time, because that's cheap. You can't really do the same if you are running an ML job; you want to share as much as you can of the tasks and steps that do the heavy lifting or query a lot of data. Similarly for the actual job that runs: in the dev world, the CI system maybe takes minutes to finish, 20 minutes if it's a bit slow, and you start getting grumpy and optimizing it. In the ML world, it's very normal for a training job to take hours, even days in some cases.
We have cases at Lyft where a training job, building the map of the world essentially, gets refreshed every couple of days, and it takes a long time to build: a lot of regions, a lot of slicing and dicing of the data, a massive amount of information to process. So it's expected that these things take time, and when they start taking time, you start seeing problems you haven't experienced in the simpler use cases. Machines can die. There are fundamental issues you just don't think about in a typical CI system, and you can't simply rerun everything when system failures happen; you need to be able to checkpoint, cache things, and rerun exactly from where you failed last time. A lot of other requirements stem from just this one change.

For iteration, people still want pretty much the same flexibility: they want to make a change and test it locally, write unit tests; when somebody else makes a change, they want to make sure it's validated before it runs. You want to be able to promote things that succeed to production. So you want the same semantics: the monitoring, the incident management and alerting, all of those things. The trick is to offer them, to make them available, where you are writing your code and developing your workflows. And obviously, the stakeholders are different between the two worlds. In the next few slides, I will talk about how Flyte addresses some of these concerns.
It's really our take on the problem. Flyte was developed at Lyft; we open-sourced it two years ago, and since then, tens of companies have adopted open source Flyte for various platforms inside their companies: from building ML platforms and data ops on top, to companies that resell or repackage Flyte with vertical integration in certain fields, with Flyte powering the underlying orchestration. Flyte is a workflow automation platform for business-critical data and ML processes. That may sound like a mouthful, but hopefully it will become clear as we go on.

We have a ton of integrations, and at this point, it's probably fair to say most of them were built by the community rather than by the original authors of the platform. They were all built to be used; they were not built just so somebody could keep count of how many integrations are on the platform. They were built purposely to solve certain business use cases, and we worked quite hard to make the platform extensible through plugins, with safe and isolated execution of those plugins, so it's quite easy to expand that list as business needs arise.

These are some of the companies, and I think we just added a couple more yesterday: Lyft, Spotify, Intel, and a bunch of other big companies and startups.

Any product, I guess, starts with some assumptions, and this is our assumption for how an ML engineer goes through the process of developing a workflow or a pipeline or training a model; a data scientist does the same. They start on the left here with some idea: they want to write some business logic to try out something, maybe query some data, process data a certain way, train a model, things like that. And they want to write it locally.
They want to open up the laptop, their favorite IDE or Jupyter notebook or what have you, and just write it there. If you start telling them, well, you can't do that, you have to go somewhere else to start writing, then things fall apart. They want to be in their favorite tools, in a comfortable environment where they can pull in the library requirements they need and just write their code. You don't want to impose restrictions on which versions of libraries they can and cannot use just because some other person on the team is using an older version and now they can't upgrade. They just want to sit down, write the code they want to write, and test it right there.

But at the same time, once they exceed the capacity of their laptop, maybe because they were running on sample data locally and now want to run on the full data sets, they should also be able to confidently run the code they wrote in a remote environment, where they can leverage bigger, beefier machines and larger compute environments to process bigger and bigger data sets. That process should guarantee, to a certain degree, that the code will behave exactly the same: if it worked on my machine, it should work remotely. All the problems of mismatched requirements or mismatched versions of some dependency should just not be a thing they think about.

Once they are ready, once things start shaping up, they usually connect some of these steps together.
Maybe somebody else on the team wrote a task that queries the data I need; I can just use it, connect a few of those, and end up with a workflow. You then want to run that, again, in a remote environment, reliably, one that can scale and that can parallelize the steps that can be parallelized without me having to think about it. If I'm running in Python, Python is effectively single-threaded; I shouldn't have to do additional work to get parallel execution or optimize it that way. It should just work.

You should also be able to promote the successful workflows, the ones that produced artifacts you deem good, maybe by running other tests, to production, just like we do with services: you deploy to a staging environment, you run multiple versions, and the version that passes all the end-to-end tests and integration tests gets promoted to production. We should give you the same semantics, but within an environment that looks familiar.

Once those things happen, you should be able to run on a schedule. Especially for data processing pipelines, we see this very often: data keeps coming in from external sensors, from external sources, and you want to keep processing the data as it comes in. So you want to trigger those pipelines on schedules or based on events, and you want to monitor them: get notified when things fail, alerts, all the conveniences we take for granted as software engineers need to be offered for these pipelines as well. Again, the key point here is that we should meet users where they are, not ask them to adopt a pile of different tools to make this work.

Finally, as you have more and more workflows running in production,
you will start seeing failures from scheduled runs, or maybe you want to do some introspection: you got some bad predictions and you want to figure out where things went wrong. You should be able to introspect any execution or any artifact and trace it all the way back, ideally even to the version of the table that came in and caused us to produce the model that produced this bad prediction. This is where we started: the platform should give you all of these tools, as much as we can, without you having to do anything.

This is an example of Python code annotated to be Flyte code. There are three functions here; they are just normal Python functions, and you write your logic in there. The one thing that is currently optional in Python is types. We do require type annotations for the inputs and outputs: you see the pay_multiplier at the top here takes two inputs and produces one data frame, and we require you to declare those. I know it's a contentious point, but my take is that any production code should really be strongly typed. Types are optional in Python, and in many cases when you prototype you don't necessarily care about them, but if this is code you are writing to be productionized, you should declare the types. Flyte uses the types in various other ways, which we'll talk about as we progress, to ensure tasks fit together, to ensure caching works, and so on.

So once you have your function, you can run it locally and write unit tests. It's just a Python function.
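The typed-interface requirement he describes can be sketched without Flyte at all: a minimal decorator that rejects any function whose parameters or return value lack type annotations. This is an illustrative stand-in under assumed names (`typed_task`, `pay_multiplier`); it is not flytekit's actual `@task` implementation.

```python
import inspect

def typed_task(fn):
    """Reject functions whose inputs/outputs are not fully annotated,
    mimicking how a typed workflow system validates task interfaces."""
    sig = inspect.signature(fn)
    for name, param in sig.parameters.items():
        if param.annotation is inspect.Parameter.empty:
            raise TypeError(f"task {fn.__name__!r}: parameter {name!r} needs a type annotation")
    if sig.return_annotation is inspect.Signature.empty:
        raise TypeError(f"task {fn.__name__!r}: missing return type annotation")
    return fn  # still a plain Python function, callable and unit-testable locally

@typed_task
def pay_multiplier(base: float, factor: int) -> float:
    # Declared types on every input and on the output, as required.
    return base * factor
```

Decorating an unannotated function fails immediately at declaration time, which is the point: interface errors surface when the task is defined, not when a long pipeline finally runs.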
All you need to do is add this annotation at the top, @task, and now it becomes a Flyte task. Flyte now knows that this is a sort of atomic execution unit: it knows it can take this function, plus, of course, any other function it calls, and run it in a separate, sandboxed hardware environment. It can run in parallel if it needs to; it's a standalone task, in our language. We wrote a couple of those, and then the third one here, again a function, we marked as a @workflow. What that does is tell Flyte: when you execute this workflow, don't really call the tasks in the body. You see me calling pay_multiplier and total_spend; it will not really call these functions. All it will do is record that you want to call these tasks and what the data flow is: the inputs that come into the workflow are sent as inputs to this task, and the output of this task is sent to that one. So it builds up the graph, the DAG, by executing the code you write here. Yes, in this example it's just one line, but this can be arbitrarily complex. We will execute it just once, statically, with no inputs, to understand the graph of execution, and from that point on it gets translated into our own representation of graphs. We use protobuf to represent the typing, the graphs, and all of this information, and that enables Flyte to be targeted through different SDKs. We started with Python, but Spotify contributed the Java SDK and a Scala SDK, and they can all interoperate. So some people write, say, a Spark Java task to process things, and then their teammates write Python tasks and Python workflows that call this Java task. They are all uniquely addressable and run in completely sandboxed environments, so we don't care about your different requirements or dependency versions.
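The record-instead-of-call behavior of the workflow decorator can be illustrated with a toy compiler: run the workflow body once, substituting each task with a recorder that appends a graph node and returns a placeholder "promise" instead of a real value. All names here are invented for illustration; flytekit's real compiler works on typed promises and a protobuf graph representation, as described above.

```python
class Promise:
    """Placeholder for a value that will only exist at run time."""
    def __init__(self, source):
        self.source = source  # name of the node that produces this value

def compile_workflow(wf_body, task_names, input_names):
    """Execute the workflow body once, swapping each task for a recorder
    that appends a graph node and returns a Promise instead of calling it."""
    graph = []

    def recorder(name):
        def call(*args):
            deps = [a.source for a in args if isinstance(a, Promise)]
            graph.append((name, deps))     # record node and its data dependencies
            return Promise(source=name)    # downstream tasks receive a promise
        return call

    stubs = {name: recorder(name) for name in task_names}
    inputs = {k: Promise(source=f"input:{k}") for k in input_names}
    wf_body(stubs, inputs)  # one static execution, no real data
    return graph

# A workflow body written against the stubbed tasks.
def my_workflow(t, wf_inputs):
    m = t["pay_multiplier"](wf_inputs["base"])
    return t["total_spend"](m)

dag = compile_workflow(my_workflow, ["pay_multiplier", "total_spend"], ["base"])
```

The recorded `dag` captures exactly the data flow the speaker describes: the workflow input feeds the first task, and the first task's output feeds the second.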
It's all completely isolated and containerized. In this example, everything here is going to be in one container, but in a real environment, these things could be spread over multiple GitHub repos, completely independent of each other. And then lastly, even after marking things like this, you can still obviously call them as Python functions, because they are Python functions. So you can use pytest and all the testing you do locally in Python, and add as many checks as you want, so that anyone else who comes in and revises this code can make sure it behaves as you expect given the inputs.

There are a couple of nifty features you can add on top. At the top here, I added cache=True to this one task. This is me telling Flyte that I know this task is deterministic: if I pass it one and two, it will always produce three. That enables Flyte to automatically cache, or memoize, the outputs of this task for given inputs and a given signature. So if somebody else on my team, or in the company, calls exactly the same task with the same version and the same inputs, Flyte will not run it again. It will just say, well, here's the output; somebody else already ran this and saved you the time. And again, all of this is tracked, so you can always go and see which execution ran the first time for this input, go there, and look at the logs and all of that.
So it's all tracked in the system. Once things run locally and you are happy with what you see, and you want to run in a remote environment with more resources, we give you capabilities to define, or ask for, more resources for given tasks. Maybe one task needs GPUs and the other tasks don't, so you don't really need to pay the cost of a GPU machine when you're running code that doesn't need a GPU. You can isolate that part into a task and just say, for this part, I need GPUs. For the second one here, total_spend, I converted it into a Spark task. All you have to do is say the task config is Spark, and Flyte, or rather the Spark plugin in Flyte, will take care of setting up a containerized Spark cluster for you, initializing the Spark context and the connectivity and all of that. Then your function will start running, and you will have that Spark context if you need it. In this case, it will convert automatically.
We also have type transformers that will automatically convert, for example, a pandas data frame to a PySpark data frame, so you don't have to do anything; you just take this and start running your Spark code right here. And you can see I still use it exactly the same way: I am still calling total_spend in the workflow exactly the same way, nothing changes here, but now it's a Spark task, and you can do all the map-reduce magic that Spark lets you do.

And finally, at the bottom, we have this concept called launch plans. Launch plans are overlays on top of a workflow where you can define certain behavior for certain inputs. You can also define schedules: you can say, run this every hour on the hour, with cron schedules or fixed rates. You can have notifications when things fail, so you can integrate this into your existing incident management systems, Slack notifications, email notifications, all of that. It will integrate well into your existing infrastructure.

Finally, Flyte also integrates well into your existing GitOps tools. If you have a CI system, and after merging a new workflow into the GitHub repo you want to kick off a run of that workflow, you can use flytectl, our kubectl-inspired CLI. It can do pretty much everything the API layer in Flyte exposes: you can create executions, register entities, wait for executions, and consume or download the outputs of executions. You can do exactly the same things with FlyteRemote, in the middle here.
FlyteRemote is a Python library that is a wrapper on top of, again, the same API. It lets you interact with our backend a little more nicely than calling raw REST or gRPC endpoints, so you can script it. We have people using it in Jupyter notebooks to script or schedule interactions with our backend.

We also have a UI, and I wanted to show a couple of screenshots; they look very low resolution from here. On the top left: once you register a workflow with the backend, you can see a rendering of our understanding of the execution graph, so you can tell which parts will run in parallel, which parts have dependencies on other parts, and how the inputs and the data flow between these execution units. On the bottom left is the home page, of sorts, for a workflow or a task. There's a histogram that shows you the past few runs, how long they took, and whether they succeeded or failed, and there's a list of versions. Every change you make to a task or a workflow is tracked as a separate version; everything is immutable, and that enables us to always be able to go back to the exact version of the code that produced a certain artifact. That obviously applies to all of your dependency tasks and everything else you call. On the right here is a launch form, so you can launch executions from the UI as well. It's auto-generated from the input types we ask users to declare, so we can render the right controls for those inputs.
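The shape of such a remote client, a thin Python wrapper over the server API so that notebooks and scripts can drive it, might look like the sketch below. The class, endpoints, and injected transport are hypothetical; this is not FlyteRemote's real interface.

```python
class RemoteClient:
    """Tiny illustrative wrapper over a workflow-service API.
    `transport` is any callable (method, path, payload) -> dict,
    which keeps the client scriptable and easy to test."""
    def __init__(self, transport):
        self._call = transport

    def launch(self, workflow, inputs):
        # Kick off an execution of `workflow` with the given inputs.
        return self._call("POST", f"/executions/{workflow}", inputs)["execution_id"]

    def outputs(self, execution_id):
        # Fetch the outputs of a finished execution.
        return self._call("GET", f"/executions/{execution_id}/outputs", None)

# A fake transport stands in for a real HTTP/gRPC channel.
def fake_transport(method, path, payload):
    if method == "POST":
        return {"execution_id": "exec-1", "inputs": payload}
    return {"result": 42}

client = RemoteClient(fake_transport)
```

Because the transport is injected, the same client code works against a live backend or, as here, against a stub in a notebook or unit test.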
If it's a boolean, you have a toggle; if it's a string, you get a text field; there's one for arrays, and so on. This is also pluggable: you can enrich it if you have new input types. Once you run, or start, an execution, you get a big graph of what's happening, and you can see what the progress is, what tasks succeeded, what failed. You can see there are boxes within boxes here, because this is one of the more complex graphs we have. It uses composition: we have a workflow that's calling other workflows, which call other workflows, with branching and if-conditions and so on, and the UI is able to render the multiple nesting levels. We are experimenting with a few different ways of rendering the nested graphs, because this looks a bit more complex than you can easily understand, but feedback is always welcome, so let us know.

When things go bad, as they will at some point or another, Flyte also tries to pull in the logs and stack traces, as much information as we can, and bubble that up in the UI. You can get exactly the same information from the CLI or from the remote library I showed you. We try to make this as actionable as possible: if it's an image pull failure from Kubernetes, you will see that right up here. Whatever the actual error is bubbles up all the way to the top layer in the UI, and you will see it right in front of you.

These are just some of the features I wanted to share that we hear about quite often, things people like once they start using the platform. It sounds like there are a few people here who are more on the infrastructure side; we'll be happy to share more details about how these things run and are deployed afterwards. But I want to take a second and share what Dylan heard from his own customers within
Spotify about how things are going with Flyte. So let me go here.

Really quickly, I thought this slide was fun: why did we pick Flyte? This was actually written by our data science team, so this is not from us, and we think that's really cool, because as engineers, a lot of this might be kind of table stakes for us, but for the data scientists, being able to get up and running on Flyte and getting all of this stuff for free has been a really big win, and they've been really excited about it. The ability to share models with each other and compose things easily. Out-of-the-box parallelism; again, it may not seem like the biggest deal, but when you're writing Python scripts, everything runs serially and takes a certain amount of time, whereas now we get parallelism across tasks for free, and they think that's really cool. The caching. Just connectivity with GCS. They've been really excited about all of it.

Yep, so we talked about composition: you can call workflows from one another. We haven't really talked too much about this, but everything in the system is tracked and addressable, globally addressable within a deployment, and other people can start using your tasks and workflows. One of the teams we worked with, their infrastructure or platform team, built one Flyte project with just common tasks that everybody else uses: things like "query this table with these dates," parameterized. And as Dylan talked about, there's the caching we covered, the storage, and compile-time validation, which we also feel strongly about. Most of the errors you get should really surface as early as possible, so you get type checks, like, does this task produce a type that the next task can consume or not, things like that.
We try to surface that as early as possible.

Just a couple more things. If you have a MacBook, this might be the easiest way to just try Flyte: you can brew install flytectl, and flytectl sandbox start will give you a single container that starts up a k3s cluster with all the Flyte components running. You have a server, you can play with the UI, and you can start writing and running Flyte code against it. The flytesnacks repo has very simple examples you can start with. Lastly, Hacktoberfest starts October 1st; it's an event that happens in October where a lot of open source projects ask the community to contribute. We have tagged some good-first-issue kind of things you can play with, and the Hacktoberfest page on flyte.org is where you can go check out what issues are available and start contributing. And the coolest thing is you can win swag. I know we've probably had enough swag from the conference, but who can say no to that? So go check it out, and that will be all. I will be happy to take your questions now.

Oh, thank you. Say it again: streaming data, do we handle streaming data? Not natively, but one of our customers wrote a Flink plugin for Flyte, and that lets you do that. But I guess the question really is, can you execute these workflows on a stream, can it consume a stream and just run the workflows? That kind of execution requires workflows to run in seconds, in a very low-overhead situation, and most of the workloads we see run in containers, where there's always the container startup time and so on. So at the moment, the answer is no, but it is something we are thinking about: how can we integrate well with streaming solutions and, more importantly, how do we offer a unified dev experience?
Yes. Yeah, so, to repeat the question: are the task execution environments just individual containers, or can they be more than one? Right, that's the question. Yes: this behavior is controlled by plugins. Even the single task I demonstrated runs through a plugin, what we call the pod plugin, so in the Kubernetes world it runs as a pod. But we have plugins that do distributed PyTorch, and those start up a master node and a cluster of workers, distribute the data that way, and do the training. Spark also starts a driver pod and some executor pods. So depending on the plugin, the environment will look different, but we do take care of annotating and tracking the resources created by them, so we make sure things get cleaned up. And we make sure that when we have retries, and that's another feature I did not talk about, you can declare retries at the individual task level, Flyte makes sure you start from a clean, sandboxed environment, with the data separated, so the next retry will not read corrupted data left behind by the previous retry, things like that. So we do a lot to make sure these start up in isolation, but yes, there can be an arbitrarily large number of containers.

This session has ended, so if anyone has questions, they can come directly and see our speaker, but otherwise you're welcome to go to your next session. All right, thank you.
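The retry isolation described in that answer, where each attempt gets a fresh working area so a failed attempt's half-written data cannot poison the next one, can be sketched in plain Python. The helper below is illustrative only, not how Flyte actually implements retries.

```python
import os
import shutil
import tempfile

def run_with_retries(task, retries):
    """Run `task(workdir)` up to retries+1 times, giving every attempt a
    brand-new scratch directory so attempts never see each other's
    partial output, then cleaning up like tracked resources."""
    last_err = None
    for attempt in range(retries + 1):
        workdir = tempfile.mkdtemp(prefix=f"attempt-{attempt}-")
        try:
            return task(workdir)
        except Exception as e:
            last_err = e
        finally:
            shutil.rmtree(workdir, ignore_errors=True)
    raise last_err

attempts = []

def flaky(workdir):
    # Fails on the first attempt after writing partial output; the retry
    # starts in an empty directory and never sees that partial file.
    assert os.listdir(workdir) == []
    attempts.append(workdir)
    if len(attempts) == 1:
        open(os.path.join(workdir, "partial.out"), "w").close()
        raise RuntimeError("machine died")
    return "ok"
```

The key property is the assertion inside `flaky`: every attempt observes an empty sandbox, which is exactly the "next retry will not read corrupted data" guarantee from the answer above.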