Hello and welcome to this session on Flyte. Hopefully you are having a wonderful Open Source Summit. We have a lot of content to cover today, about 45 slides. We have kept them verbose so you can follow along by reading them, but we will also try to move through them quickly because there is a lot of material. Please ask your questions as we go along, and we will try to answer all of them at the end.

Now that the logistics are out of the way, let's introduce ourselves. My name is Ketan. I am an engineer at Lyft, where I lead the project called Flyte, and I also lead the machine learning platform at Lyft. Co-presenting with me today is Matthew Smith, who is also an engineer at Lyft; he used to work on Flyte and now leads forecasting at Lyft.

Flyte was open-sourced last year at KubeCon, in December 2019, and it has been about six months of very interesting times in the world, and very interesting times for Flyte as well. We will start with a quick refresher on why we created Flyte and what its concepts and features are. We will then dive into the architecture and some of the challenges we have uncovered while scaling Flyte up over the last six months, and then I will hand over to Matt, who will present a case study of how forecasting uses Flyte to deliver real business value for our customers at Lyft. If we have time we will do a demo; we'll keep it really short, and we'll introduce new Flyte concepts only within the demo. Then we'll keep five minutes for questions.

So let's get started. This slide is very interesting: it is actually how we started building Flyte. I used to lead a team called ETA. ETA stands for estimated time of arrival.
It is a very business-critical metric for Lyft: when you open the app, you get an estimate of how far away the driver is, and based on that we decide which driver to assign to you, or how much to charge you. When I was leading the team, I worked specifically on the offline models that estimated various parameters, fed into the fare calculations, estimated traffic on the road, and so on, and all of these were inherently machine learning problems. But I realized that machine learning problems are never developed in isolation. They depend on data that is either generated from raw sources, like users, or produced by other teams. And the models I wrote powered dispatch and many other downstream systems, so they affected those systems and the data and models they generated in turn. Data and ML form this complex beast whose parts interact and are interdependent.

And if you go back to the previous slide, each of these boxes represents a team at Lyft, and the arrows represent the interconnections between them, the flow of data and ML models. Interestingly, all of them are powered by Flyte. This is not a complete picture of Lyft, by the way; it is only a zoomed-in part of a much larger picture that would not fit on one slide. So when we thought about all of these challenges, we asked: how do we attack and tackle them?
We decided to tackle some part of the stack, not all of it, and we wanted to do a good job at what we were doing. This is a very popular diagram that everybody has probably seen. The color coding here represents the part of the stack that Flyte tries to tackle: the more purple a box, the better Flyte handles it. Gray means Flyte does not handle it. For example, Flyte does not provide any serving infrastructure; it can be integrated with serving infrastructure, but on its own it is not one. Similarly, Flyte is not a great tool for data collection, because you might be getting data from different sources and streaming it, and Flyte is not a streaming engine. You would store that data in S3, but you could then use Flyte to clean the data or extract features from it; hence the color coding.

Once we knew the part of the stack we wanted to handle, we realized that most users had a set of workflows they dealt with every day. For example, they would start by discovering data, then clean the data, then extract features, then train a model, and then downstream use that model to batch-predict or feed a serving system. This meant we wanted to orchestrate these processes for the users, and we wanted to do a great job at that orchestration. That is why Flyte's tagline is now "production-grade orchestration for data and ML."

The other very common problem we saw at Lyft was that different teams would redo the same pieces of work, because collaboration and reuse of developed IP is very hard in data systems. For example, there might be an upstream team that knows how to extract data from an open data source; in this case,
let's talk about OpenStreetMap. That upstream team would extract the data and provide a representation of it for the downstream teams. But in some cases a downstream team may want to access the original data source and extract different information, and in that case they would not be able to leverage any of the processes built by the upstream team. So we wanted to enable this sort of collaborative reuse, and we also wanted to make it really easy to perform standardized MLOps across all the teams.

So what is Flyte? Flyte is essentially a hosted, scalable platform for a company. It is a fabric that connects multiple open-source or closed-source technologies and allows the development of user-defined workflows which are auditable, repeatable, and secure, all while maintaining extensibility and observability. We intend for every company that uses it to provide the system as a centralized platform, because there are many advantages you can leverage from centralization.

To understand Flyte better, let's first go over some concepts. Two of the most important core entities in Flyte are tasks and workflows. Both of these entities are purely declarative, versioned, and have strongly typed interfaces. A task is the smallest indivisible entity within Flyte. If you take the analogy of a programming language, it represents a function: you can write a function, specify that it takes a few parameters and produces a few others, and this is called the interface of the task. A workflow orchestrates multiple tasks, managing the data flow between them, and itself takes some inputs and produces outputs. On the right-hand side is an example of a workflow: it takes an integer, a float, and a string, and produces a string and a binary. But to produce those two outputs it has to go through multiple tasks. The first task on the left is
a Spark task, and in parallel to it there is a simple Python task. What this means is that a user is writing one piece of task code that runs on Spark, and at the same time may write another piece of code that runs in Python; and downstream, if you look, there is code written in C++. From the platform's point of view, all of these are simply tasks demarcated by their interfaces. Flyte will automatically create a Spark cluster for the user, manage the lifecycle of the cluster and all the dependencies within the system, massage the data into Spark, let Spark do the right thing, then store the output and massage it into the next step.

So, as promised, let's take a few examples of tasks. On the right-hand side is a Spark task. You write PySpark code; it is no different from writing any regular PySpark code. If you are counting words, you can take an example from the internet, drop it into a function, and decorate it with one of the decorators that comes with Flytekit. Optionally, you might also annotate it with the set of inputs it takes and the set of outputs it produces. This informs the platform of the interface, parameterized just like any other type system in a programming language, so it is very composable.

The second example, at the bottom, runs an arbitrary container. It takes a container from open source, in this case an OpenCV container, passes an image to it, and produces another image that is just the edges extracted from the input. On the left-hand side is an example written in Scala. This was created by Spotify: they have a new Flytekit called flytekit-java that allows you to express tasks and workflows in Java or Scala. And interestingly, tasks are independent entities.
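Since the slides are not in the transcript, here is a minimal, hypothetical pure-Python sketch (not the real Flytekit API) of the core idea above: a task is just a function whose typed interface is declared and recorded, so the platform can reason about it without running it.

```python
import inspect

def task(fn):
    """Hypothetical decorator: records the function's typed interface
    from its annotations, the way a Flyte task declares inputs/outputs."""
    sig = inspect.signature(fn)
    fn.interface = {
        "inputs": {name: p.annotation for name, p in sig.parameters.items()},
        "outputs": sig.return_annotation,
    }
    return fn

@task
def count_words(text: str) -> int:
    # The body is ordinary user code (the talk's word-count example);
    # the platform only cares about the declared interface above.
    return len(text.split())

# The platform can inspect the interface without executing the code:
print(count_words.interface["inputs"])   # {'text': <class 'str'>}
print(count_words("hello flyte world"))  # 3
```

The real Flytekit decorators additionally handle containerization, data movement, and cluster lifecycle; this sketch only shows the typed-interface idea.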
They can be executed independently; we'll show that in the demo.

Okay, now that we know tasks, let's put them together into a workflow. Workflows are composable and declarative entities. On the left-hand side is an example of a workflow composed of other workflows. A workflow can also be composed of tasks, which is what's on the right-hand side. It is a typical workflow for a machine learning pipeline: you take some data, split it into training and validation sets, fit a model, and then compute metrics on it. This is a very standard workflow for machine learning at Lyft, and at most other companies. You can also associate schedules with a workflow, even more than one schedule, because workflows exist in isolation and each execution exists in isolation as well.

When we built Flyte initially, we wanted everything to be static, with the interfaces statically defined as well. This makes it extremely easy to verify that the workflow will work, or at least compile. Compilation here means verifying that all the inputs passed into the workflow, which are passed into a task and further down to other tasks, will work because their types match. For example, if a task produces, say, an integer and a string, and a downstream task consumes two float values, you cannot create a data connection between the upstream task and the downstream task. It will cause a compile-time failure, just like the compilation failures you would get in a type-safe language like Go or C++. This is possible only if everything is statically defined, meaning you define the structure of the workflow and of the tasks ahead of time.

But as people started using Flyte, they wanted some dynamism, and dynamism is very useful in some cases. Let's take the example on the top right.
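The compile-time type check described above can be sketched in plain Python. This is illustrative only, not Flyte's actual compiler: a data connection between two tasks is valid only when the upstream output types match the downstream input types.

```python
def can_connect(upstream_outputs, downstream_inputs):
    """Hypothetical compile-time check: a connection is valid only if the
    upstream output types match the downstream input types, position by
    position (like the (int, str) vs (float, float) example in the talk)."""
    if len(upstream_outputs) != len(downstream_inputs):
        return False
    return all(up == down for up, down in zip(upstream_outputs, downstream_inputs))

# Upstream produces (int, str); downstream expects (float, float):
assert not can_connect((int, str), (float, float))  # compile-time failure

# Matching interfaces compile fine:
assert can_connect((int, str), (int, str))
```

Flyte performs this kind of check when a workflow is registered, so type mismatches surface before anything runs, not days into an execution.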
There is a workflow where D is a node in the graph that wants to generate a new workflow based on some user-provided input. So it just emits a workflow that the Flyte engine knows how to work with. This comes with some risks: the generated workflow may not be conformant, so it may not produce the exact outputs you desire, or it may be broken simply because it is uncompilable. This is similar to Java's JIT compilation: the user produces the workflow, and the Flyte backend dynamically compiles it and makes sure it is going to run; otherwise they get an error. That is the tradeoff you make with dynamism.

Another very good example of where dynamism is useful: given a data corpus, let's assume a million images, I want to classify whether each image contains a dog or a cat. This is a very classical CNN model that you can use. So let's say we use the model; the problem is, how do I scale it so that it happens very quickly? I may run a hundred instances of the predictor, where each instance handles a thousand or ten thousand images. Flyte makes this really easy. These are called array jobs: you just tell Flyte that you want to launch 100 containers, and it will shard the data correctly across them. This is done using dynamic workflows.

Another important concept in Flyte is projects and domains. Projects are logical groupings of workflows, and domains provide CI/CD-like semantics for workflows. When users are iterating on their workflows, they usually just use the development domain; once they are confident, they push to something like preprod, where they can run smoke tests; and once they are confident with that, they push to production, where the workflows start serving production traffic. This is also great for accounting and auditability. It is also great for sharing. What do I mean by that?
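The array-job fan-out described above can be sketched in plain Python. This is a hedged illustration of the mechanics only; in Flyte, each shard would run in its own container, scheduled by the backend.

```python
def shard(items, n_workers):
    """Split a corpus into n_workers roughly equal shards, the way an
    array job assigns data slices to containers."""
    k, rem = divmod(len(items), n_workers)
    shards, start = [], 0
    for i in range(n_workers):
        size = k + (1 if i < rem else 0)  # spread the remainder evenly
        shards.append(items[start:start + size])
        start += size
    return shards

def classify(image_id):
    # Stand-in for the real CNN predictor: a toy labeling rule.
    return "dog" if image_id % 2 == 0 else "cat"

images = list(range(10_000))   # pretend corpus of image IDs
shards = shard(images, 100)    # "launch 100 containers"
assert len(shards) == 100 and sum(len(s) for s in shards) == len(images)

# Each "container" processes its own shard independently:
results = [[classify(i) for i in s] for s in shards]
```

In the real system the user only specifies the predictor and the parallelism; Flyte handles the sharding, scheduling, and gathering of results.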
Let's take an example. Project A, on the left-hand side, has a pipeline; let's assume it is a feature engineering pipeline, so it takes in some datasets and computes features from them. They also have a model that works with those features. Project B, on the right-hand side, wants to use the same set of features, but apply another transform to them and then run the same model. In the typical scenario, you would copy-paste, or you would have to create a centralized representation of the features and then share it. In Flyte, instead, you can share the code without really sharing the code: you share a reference to it, so that Project A can keep updating the versions of the code while Project B is just a consumer. They can consume the model in the same way. As you see on the right-hand side, we can just fetch a representation of the pipeline, compose a new pipeline out of it, and add a task that transforms the features using T1.

But there is a problem. Does that mean you compute the same features again and again, now that you've composed them into a new pipeline? No, you don't need to, and this is where Flyte's data catalog fits in. The data catalog enables something called memoization. Let's take an example: W1 is a workflow composed of two other workflows, W2 and W3, and some execution of W1 fails in W3's task H. Let's complicate this example further: what if the entire execution takes many, many days, say two days, while task H takes five minutes to execute but has a bug? You can fix the bug, but what is the modus operandi going forward for the user of W1?
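A minimal sketch of the signature-based memoization that resolves this situation (pure Python, standing in for Flyte's data catalog; the names here are invented for illustration):

```python
CACHE = {}  # signature -> stored outputs, standing in for the data catalog

def signature(task_name, version, inputs):
    """A task's cache key: its identity, its version, and its exact inputs.
    If all three match a previous run, the stored outputs can be reused."""
    return (task_name, version, tuple(sorted(inputs.items())))

def run_memoized(task_name, version, inputs, fn):
    key = signature(task_name, version, inputs)
    if key in CACHE:
        return CACHE[key], True          # cache hit: skip recomputation
    out = fn(**inputs)
    CACHE[key] = out
    return out, False                    # cache miss: computed fresh

# First execution of a W2 task computes for real...
out1, hit1 = run_memoized("w2.features", "v1", {"day": "2020-06-01"},
                          lambda day: f"features[{day}]")
# ...re-running W1 later finds the same signature and skips straight past it.
out2, hit2 = run_memoized("w2.features", "v1", {"day": "2020-06-01"},
                          lambda day: f"features[{day}]")
assert (hit1, hit2) == (False, True) and out1 == out2
```

Note that bumping the task version (the bug fix to task H) changes its signature, so only the fixed task re-runs while everything upstream is served from the catalog.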
Are they supposed to launch a new execution of W1 and pay the penalty of two days of computation again? If you use Flyte's catalog memoization, you don't have to. We know the inputs and outputs, and we know the version of the task, so we can quickly reuse the previously memoized outputs whenever the inputs and the version match. This is called the signature of the task. In this case, when you run W1 again, you will see that W2 previously completed and task G previously completed, so Flyte will immediately start executing task H. This brings the total iteration time down by two days in this example. There is another interesting artifact of this: the catalog maintains a causal dependency structure of all the data as it was produced and consumed, which is what we call artifact lineage.

Okay, when we decided to build all of this, we wanted our users to not think about machines, and we wanted a centralized system where we could amortize the cost of running lots of large, data-intensive applications. So we decided to make it serverless for the users. My team manages the services and the machines for all of the users; they just request CPUs, GPUs, memory, or a number of Spark executors, and all of this is created dynamically and handed back. The entire system is available behind a gRPC/REST interface, and because of this construction it is completely language-agnostic. This is proven by the two Flytekit variants we have, flytekit-python and flytekit-java, or you can just use a raw container from the internet. When something fails or succeeds, you can get notifications, and you always keep a history of all the executions: modulo the retention configuration, you can retrieve results for as long as you want, because Flyte captures all executions and their outputs.

Flyte has gone through a couple of iterations within the company, and when we
first launched, we realized that all our users wanted to constantly evolve and change the platform. So when we were building the latest version, the one that is now open source, we realized we should absolutely make it extensible to the core. Matt will talk about how he extended the Python SDK to do interesting things, but we also wanted the backend to be extensible, and backend extensibility is extremely useful for adding new capabilities to Flyte, like distributed training, Spark, or SageMaker; we'll see some of that in the demo. This comes built in with Flyte, so you can keep adding extensions and making it smarter and smarter.

Okay, if that has not convinced you, let me give you another slide on why we think Flyte would be beneficial for you to use. Flyte is used at Lyft in production at scale: we run more than a million workflow executions. If you look to the far right of this diagram, the ramp keeps going up higher every day. But what is that central blip? There was one time, earlier this year or late last year, when we had a huge spike and then a drop; we'll get back to that in a couple of slides.

To power this execution engine, it is important to architect the system for scale. The system has been architected following the pattern set by Kubernetes: a simple user plane / control plane / data plane split, like all other cloud-native technologies. FlyteAdmin and FlyteConsole, on the right-hand side, form the control plane; FlytePropeller and Kubernetes itself form the data plane; and on the left-hand side are the various components that interact with these through the service API. By default, you can run all of this in a single Kubernetes cluster just by applying a YAML that we publish every time we release a new version.

All right, so let's get back to what happened. So, what
happened is that we had super-exponential growth: in two days we ramped from, I want to say, about 50,000 workflows a day, or maybe less than 50,000 initially, to 100,000 or 150,000. It was about 6x, and that was hard on Flyte because we were not ready. Users lacked visibility, system admins were overwhelmed, and various Kubernetes components started failing. We saw problems in etcd, in the scheduler, in the control plane, and in FlytePropeller itself, which is pretty complex and handles a lot of data. So we decided to dive deep and started addressing the problems one by one.

Problem number one: scale. We decided to scale out and scale up. By scale up, we mean FlytePropeller itself was optimized heavily so that it can run 2,000-plus concurrent workflows without any change in latency characteristics, and this happens on one machine; you can also easily shard FlytePropeller across multiple machines. We also decided to scale out to multiple Kubernetes clusters. This gives us separate fault domains and mitigates many problems with the Kubernetes control plane.

Another problem was multi-tenancy. There was a large cohort of users who would come and run their large workloads, and the smaller users would not get their fair share of resources. So we started using projects to provide multi-tenancy primitives; for example, we leverage Kubernetes resource quotas. But that was not enough, so we had to build our own resource manager on top of the Kubernetes quotas. The resource manager also provides things like fair queues, and it helps maintain quotas on downstream services; for example, BigQuery has a concurrency limit, and you can protect it from browning out.

Another problem was visibility. Flyte has been trying to improve visibility from day one, and we kept on improving it.
For example, if you run a Spark job, you get an immediate link to the Spark logs and the Spark history UI. We provide a Grafana template that shows how a workflow is performing, or how all the workflows in a user's project and domain are performing, so they can see how memory usage is changing over time, and so on. Flyte also demarcates user versus system errors really nicely, so that users know exactly when to contact the system administrators. We also wanted extreme visibility for ourselves, so that we can maintain Flyte very easily; that is another Grafana template, and it's coming to open source soon.

Moreover, FlyteAdmin has an interesting and flexible routing scheme, so it can route various workflows to different clusters. This helps us deploy the control plane and the data plane in a progressive manner: you can bake a release in a lower-criticality cluster before rolling it out to a more critical one.

And eventually, because it is a centralized platform, you want it to be efficient. We saw a lot of problems with the Cluster Autoscaler and the Kubernetes scheduler, and we started optimizing them. We have observed more than 25 percent in savings just from some of these optimizations, and they are now available in Flyte open source. We also utilize spot instances if you are on AWS, and we give our users deep visibility into where they are spending money; this has been the biggest driver of cost savings for our users. On the right-hand side is a dashboard that we built on top of Superset, and we are trying to see how we can open-source it.

All right, I know that was very quick and a lot of information, but hopefully you got something out of it, and you can ask me more questions later. I'll hand over to Matt, who will talk about real-time forecasting and how his team uses Flyte for it. Thank you.
Thanks, Ketan. So I'll give a quick introduction and a little bit of background about myself. As Ketan mentioned at the beginning, he and I worked together all the way back on the ETA team, building the first iteration of Flyte. Out of that, we expanded the platform until it supported much of the company, growing to maybe 30 or 40 teams using it, and I stayed with Flyte through open-sourcing in December; it was a really exciting journey. Then, just a few months ago, I moved over to lead the real-time forecasting team at Lyft. That makes me maybe a slightly unique user of Flyte, in the sense that we are perhaps the first team where someone who has actually worked on Flyte and really knows the system well has taken it and built a real ML product on top of it. That isn't to say other teams aren't doing it, but I like to think I have a little more in-depth knowledge than some.

To give a little background on what real-time forecasts are: we are business-focused streaming infrastructure, and I put "business focused" in italics because we really don't want to spend a lot of time dealing with compute on the backend. We want to use these serverless platforms, weave together best-in-breed solutions from open source and from within Lyft, and make them mesh together in a very agile way for delivering real-time forecasts. That is what our team is evaluated against: our ability to generate high-quality forecasts to power the product.

So there is this streaming, online component, but as you probably know if you are familiar with what a real-time model actually does, there are massive offline components here too. There's the training.
There's the tuning. There are regular evaluations of model health. You want to be able to run large-scale backtests on new models. And as far as actually advancing models into production, there are workflow-driven aspects of that as well: you might want to run integration tests and health checks in a simulated environment before advancing things to serve real traffic. If you serve real traffic without going through these gates, you might find out that you're really causing some pain for the business.

So here's a high-level view of our current architecture, and I'll start with the online component, which is up at the top in the dotted box labeled "streaming process." What we're doing is taking observations, temporal-spatial observations, so these are big, fat requests with information about very minute details of a region, and we're bringing those into this streaming engine, which does some filtering.
It's fairly simplistic at the moment, but eventually it has to call the model predict code, which we currently host using Flask, and by some magic provided by Lyft it's very easy for us to deploy those. We'll even have multiple versions of maybe the same model running at the same time, producing pieces of data that we can A/B test against each other. Eventually those, or at least a summary of them, end up in Druid, which we then serve to the teams that consume our forecasts.

Besides that, as I mentioned, this data is very high-bandwidth, so we also do additional work to drop it into an offline context. As you can see, there are arrows where data splits off to the side and we put things into S3. At that point we haven't really had time to organize it or do anything with it, because that would be too detrimental to our online performance. And that's where Flyte comes in: it drives our offline orchestration.

Right now we have three major things running in Flyte. First, our training pipelines. These consume the data that's being dropped over time, maybe going way back in history, pretty much the entire history we've ever collected, to retrain our model and adapt to changes in the environment. Especially with things like COVID going on today, that agility in retraining is extremely important. We also have some data compaction and schema evolution workflows happening offline.
These pull in the, let's say, unschematized data that's just being dropped into S3, make sense of it, pre-join it, put it into Hive, and make it available to our dashboards. And then we also have monitoring workflows, which do batch analyses that might require massive Spark jobs looking far back in time, and which put metrics out into our Grafana dashboards, but also drive pages, Slack messages, and notifications letting us know when our system is maybe on a downturn.

In building out this product, the way we work as a team is that there's the engineering group, which I lead, and we work very closely with research scientists who really just want to focus on the business logic. So what we try to do is bridge the gap between the backend systems of Flyte and Flask and the simple model code on their side. I blew up this little pink box here, which shows the divide: the system code is owned by my team. Basically, we have a framework library that sits on top of either Flytekit or Flask, depending on which environment it's executing in, and then we have a simple way for our scientists to implement predict and evaluate methods; I'll go into a little more detail on that.

So in this user journey, it's pretty typical: you start with an ideate-and-iterate phase. That's where you write the business logic; you want to test locally at small scale and iterate very quickly, and then eventually put it into an offline environment where you can look through a large amount of real production data, detect anomalies, and test things out. You're probably going to go through this cycle a few times, and you want to do it very quickly and efficiently. But once you're happy with that, you move into the productionization phase: you want to promote your model in a CI/CD-driven way; you'll want
to add schedules onto the training pipelines, and you want to be able to do ad hoc executions to recover from failures and that sort of thing, and of course monitor and get a good understanding of the data the system is producing.

Then there's the final phase: retrieve and replay. Obviously, things eventually go wrong: models drift, bugs come in, the systems themselves evolve, maybe the features you're ingesting change, and you want to be able to go back quickly to that ideate-and-iterate phase to repair things. So it becomes very important to be able to retrieve your artifact lineage and have reproducible environments where you can isolate the error, iterate very quickly to repair it, and go back into production.

So in this first phase, ideation, we have the "write business logic" portion, and as I mentioned, there's this divide: there's a framework author, and there's the ML engineer or scientist who's writing the model code, on the left. This is the interface that we produce for our scientists and ML engineers. If you've developed a model at your company, you probably have a very similar abstraction: it's just a Python class with some inputs and outputs, a predict method, and a train method. The framework author on the other side has authored this base class so that it sits nicely on top of Flytekit for the offline executions, and also on top of Flask for the things that are actually running in production.

Another thing to mention is that these ML engineers and scientists never actually deal with Flyte directly. Flyte is a very powerful tool, but all the knobs can get kind of overwhelming.
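The predict/train base-class abstraction just described might look roughly like this. This is a hypothetical sketch with invented names, not Lyft's actual framework; the point is that scientists subclass one small class, and the framework wires the same class into Flytekit offline and Flask online.

```python
class ForecastModel:
    """Hypothetical framework base class: scientists subclass it and write
    only business logic; the framework runs the same class under Flytekit
    (offline training/backtests) or Flask (online serving)."""

    def train(self, features):
        raise NotImplementedError

    def predict(self, features):
        raise NotImplementedError

    def evaluate(self, features, labels):
        # Shared toy health metric: fraction of exactly correct predictions.
        preds = [self.predict(f) for f in features]
        return sum(p == l for p, l in zip(preds, labels)) / len(labels)


class MeanModel(ForecastModel):
    """Toy scientist-authored model: always predicts the training mean."""

    def train(self, features):
        self.mean = sum(features) / len(features)

    def predict(self, features):
        return round(self.mean)


m = MeanModel()
m.train([1, 2, 3])       # offline: a Flyte task would invoke this
print(m.predict(None))   # online: the Flask endpoint would invoke this -> 2
```

Because both execution environments call the same methods on the same container image, the online and offline views of the model stay consistent, which is the property Matt emphasizes later in the talk.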
So what we try to do is abstract that away: we define the Flyte tasks, the workflows, etc., and define the semantics by which these methods driven by business logic fit into the overall framework.

As part of ideation, we want to smooth out testing for our users, and again we can leverage Flyte's reproducibility here. As Ketan mentioned, we have an API where you can retrieve all the information of a past execution, and even be alerted when executions fail. We can take that information and use Flyte to resurrect those inputs, bring them into a local context, and allow people to play around with them. To show what the flow might look like: here's a snippet of our Flyte UI. The user goes in, sees that something failed, and maybe just copies the execution ID into a notebook where they've loaded our library, and voilà, they get a pandas DataFrame they can dig into to find where maybe a null or some value was causing havoc in their code.

So that makes it pretty easy to do these small-scale testing evaluations. But then again, we eventually want the full big-picture view, to see how something stacks up against, say, a year's worth of data. And again, to make that easy on the model author, all they do is open a pull request and wait for a container to build; in the background, our framework has ensured that all the tasks and workflows are created and registered. We've also made some tools for them, like a command-line interface, which will kick off backtesting executions that then produce a report of the model's performance.

So there's a lot of code here, but this is an example of what a testing workflow supported by our framework looks like. It fetches
data, trains the model with the user code specified in the container, predicts again with that exact user code that was submitted in the PR, and then runs health metrics on it and produces a report. This all happens invisibly behind a CLI called forecast-cli: you give it a test command, a few models, and start and end dates (for us, a whole year), and it goes off in the background and executes on Flyte. Flyte sees to it that all the data is brought into these massive processes, which might involve a fleet of hundreds of machines at a time churning on the information, synthesizes the outputs, and reports back to us. It was very easy to make this happen on our side of the framework because we just use flytekit, which has some nice wrappers around the APIs; it's literally maybe ten lines of code to support that kind of use case. Now, moving into the productionization phase: again, the model author checks in code and waits for integration tests to pass. We might look at past data, bring it into a large offline process, look at the health report that comes out, and assert against some thresholds. Then they roll out stage by stage, going into staging, canary, and then production, with the ability to monitor along the way. Thanks to Flyte we get a lot of that for free, and again it's just the framework author pulling these pieces together through the SDK and API. So the framework author might create an integration-testing method, again using just flytekit, that simply asserts, after the run is completed on Flyte,
The run is totally managed by Flyte; the method asserts that the scores pass, and if they do, we go through the gate and keep going. I'll skip over this slide, but it shows that when we go to release things, we promote these pipelines side by side with the code that's going into our online process. The container holding all the user business logic is simultaneously promoted as a Flask service, and at the same time the exact same container runs in the offline processes. That gives us a really beautiful, consistent view across our entire process, online and offline; it makes it easy to keep your data lined up when the interface changes and to evolve schemas nicely, so Flyte has been very helpful there. We've also hooked into the integrations Flyte has with PagerDuty, Wavefront, and Slack: we get emails and Slack notifications when our training jobs fail, and we get paged so we can take corrective action. If our health metrics start to decline, we get paged, and we can look at Wavefront and track the output of these processes over time. As people probably know, monitoring model health is an extremely difficult problem, so it's great when a lot of it comes for free; it becomes just a matter of making sure your dashboard is focused on the right metrics. So then, things might have been working nicely in production for a while, but inevitably something goes wrong: there's an anomaly in behavior, the market shifts, the model drifts, or we come across an edge-case bug. How do we go back and make use of this wealth of data we've gathered
over time, and re-ideate on this model so we can iterate and correct things? We want a situation where there's basically a one-click tool for common debug paths. You can retry an inference that failed in production and see the exact data that came in, or, if there's drift, do a batch comparison in an offline manner. So we have a tool, again in forecast-cli, that lets users debug a failure: they give it an inference ID, and we drop them into an auto-generated notebook pre-populated with the exact code, the exact inputs, and the exact outputs. Automatically, right away, you have a reproduction of the failed environment at your fingertips. Then you can modify some functions locally in your notebook, twist things around, and get back into that ideation loop. The reason this is possible is, again, Flyte and its auditability: we can go back and find the actual executions linked to a specific git SHA, so we have exactly the code and exactly the training artifacts that powered that model, and we can pull this full closure of information together very seamlessly. So it's easy for users, and it's easy on our side thanks to Flyte's rich APIs. That pretty much wraps up how we use Flyte for real-time forecasting today. We see this going even farther: right now we serve our forecasts in a real-time online system, but we want to onboard more teams and let them consume our forecasts in their offline environments, so they can train against forecasts in a really seamless manner. With the shareability Ketan was talking about,
we think that's a pretty simple next step. So we're going to drive more adoption of forecasting at Lyft, and I think Flyte is going to be the main powering component of that. I'll hand this back to Ketan now to give a quick demo or answer questions, given we have only about five minutes left. Thanks.

Hey, thank you, Matt. I think forecasting is one of the most complicated use cases because it is a platform of its own: they produce an artifact that is consumed downstream by a lot of other teams, so they have built a very slick framework on top of Flyte. There are many other users of Flyte who consume it directly without building any frameworks, and those are simpler use cases; the real-time/offline parity is what makes the forecasting framework really interesting and hard. Thank you, Matt, for the excellent overview. Let me try to do a demo. I will time-box it to about two minutes; I don't know how far I'll get, but I have some primed Jupyter notebooks here that I'm going to quickly walk through and execute as we go. Here are a couple of functions, including a simple method that detects edges in a given image. I had the Linux penguin image locally, I ran edge detection on it, and the function works fine; it's written locally in the Jupyter notebook. Now I'm going to run the same thing on Flyte. I declare a new task type available in Flyte that lets you run raw containers; I picked up an OpenCV container from open source, wrote the script, passed the script in as an argument along with the image, and retrieved the results. So let's run that. I can execute just that one function by itself, and the moment I do, I get a link; we can jump to it and see it executing. Let's give it a moment to run.
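The demo runs its real edge detection inside an OpenCV container as a Flyte raw-container task, but the underlying idea is easy to sketch in plain Python. The function below is a toy stand-in, not the demo's actual script: it flags a pixel as an edge whenever the grayscale intensity jumps sharply between neighbors, which is roughly the gradient step that real detectors such as Canny begin with.

```python
def detect_edges(img, threshold=10):
    """Toy gradient-threshold edge detector.

    img: 2-D list of grayscale intensities (0-255).
    Returns a same-sized grid of 0/1 edge flags.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = abs(img[y][x] - img[y][x - 1]) if x > 0 else 0  # horizontal jump
            gy = abs(img[y][x] - img[y - 1][x]) if y > 0 else 0  # vertical jump
            edges[y][x] = 1 if max(gx, gy) > threshold else 0
    return edges

# A flat region with one bright column: only the boundary registers as an edge.
image = [
    [0, 0, 200],
    [0, 0, 200],
]
flags = detect_edges(image)
```

In the demo, the equivalent logic is shipped as a script argument to the raw-container task, so the notebook itself never needs OpenCV installed.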
In the meantime, I'll show you another thing we're working on: a SageMaker integration. SageMaker is a popular framework, or rather platform, from Amazon that lets you train models in a distributed setting if you want. In this case we've taken an XGBoost model; that's the algorithm specification, and SageMaker allows multiple algorithms to be specified. We wrap it into a training task, and you specify what type of instances you want and the instance counts. And of course with SageMaker you can use Flyte primitives like caching, so you are creating an artifact lineage here. But let's not just run this model; let's run a hyperparameter optimization on top of it. That means you take one model, try out different hyperparameters, and find the best-fitting model. To do that, we wrap the training task with a hyperparameter optimizer. These are just shims written in the Python SDK so we can easily call the SageMaker API; the SageMaker API itself is integrated as a backend plugin. When you execute it, it launches an execution, and this is actually running on our staging cluster right now. Let's go back to the earlier run. Oh, that completed. Let's click on the task: you can see the inputs and outputs, and that some output was produced; we can take that output and see what was generated. You can of course compose this into a workflow, register it, and execute it, but I'm going to skip to the last part. We'll just modify this and see.
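Before the results come back, it may help to sketch what "wrap the training task with a hyperparameter optimizer" means. The snippet below is a hypothetical, purely local illustration: `train_and_score` is a fake stand-in for a remote SageMaker/XGBoost training job, and the grid loop plays the role of the tuner. The real plugin hands the parameter ranges to SageMaker's tuning service instead of looping locally.

```python
from itertools import product

def train_and_score(lr, depth):
    # Fake stand-in for a remote training job; returns a score, higher is better.
    # The peak is placed at lr=0.1, depth=6 purely for illustration.
    return -abs(lr - 0.1) - abs(depth - 6)

def grid_hpo(train_fn, grid):
    """Exhaustive search over a dict of parameter ranges; keeps the best params."""
    best_score, best_params = float("-inf"), None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = train_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params

ranges = {"lr": [0.01, 0.1, 0.3], "depth": [3, 6, 9]}
best = grid_hpo(train_and_score, ranges)
```

The Flyte version of this keeps each trained artifact cached, so the lineage from hyperparameter ranges to the winning model is preserved automatically.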
So what happened is that I passed this image, which was an HTTP link from the web, into the edge-detection algorithm, and it performed edge detection remotely. I did not leave the Jupyter notebook at all, and the entire thing is captured in the UI, so you can go back and see the history of executions. At the same time, we also ran the hyperparameter optimizer: the inputs were the hyperparameter ranges, and the output was a model. If we go back, we can do a sync and ask what model we got; boom, I can see the model, and you can download it and use it. All of this shows single-task execution, which speaks to a question people have asked us recently: is Flyte a workflow orchestration tool? Yes, it is, but we think it's the user's workflow orchestration tool, and users usually start their journey by running one task at a time, such as training a model or writing a Spark job. Flyte helps you there: you write one function and execute that one function. Then you realize one function was not enough; say you are extracting a feature and want to apply a distributed transform to it, so you write another function. Then you put them together into a pipeline. Then you add another function and fold it into the pipeline. Then you productionize it and schedule it to run, or run it ad hoc based on some triggers. That was a very, very quick demo, and I'm really sorry I had to rush through it, but I want to jump through the next couple of slides.
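The journey just described, from one task to a pipeline, can be sketched with ordinary function composition. This is an illustrative toy, not flytekit code: the real `@task` and `@workflow` decorators layer typing, caching, and remote execution on top of essentially this idea.

```python
def pipeline(*steps):
    """Chain single-task functions: each step's output feeds the next."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

def extract(rows):
    # Stand-in for feature extraction, the first lone task.
    return [r * 2 for r in rows]

def transform(rows):
    # Stand-in for the distributed transform added as a second function.
    return [r + 1 for r in rows]

def train(rows):
    # Stand-in for model training at the end of the pipeline.
    return {"model_weight": sum(rows)}

workflow = pipeline(extract, transform, train)
result = workflow([1, 2, 3])
```

Productionizing then amounts to registering the composed workflow and attaching a schedule or trigger, rather than rewriting any of the individual tasks.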
Basically, our ecosystem is evolving every day. We're adding more and more along the way, and we develop and ship only the things that are currently used at Lyft and that we are confident will run at scale. But if you have requirements, bring them to us and we will work with you. Flyte has a lot of new tooling coming, but we are going to focus heavily on UI improvements and data catalog visualization. And that's about it; let's open up the session for questions. I actually don't see many questions, so we finished our talk a couple of minutes early; we'll stay on for a few more minutes. Okay, here we go, we got one: what do we use for time-series forecasting? Can you hear me, by the way, Ketan? Yes, I can. Okay, so the algorithm we use is actually entirely in-house, concocted by our research scientists. I think our most recent one is an online model that looks at past history and trains, I believe, a gradient-descent scikit-learn model. But really we try to abstract away what the actual algorithm is; we just care about piping the right inputs and outputs and letting the user go wild on the actual predicting or training. Okay, the next question: have you tried Prophet for time-series forecasting? Matt, that's a question for you again, I think. We have not, so I'll have to look into it; it sounds pretty interesting. Yeah, it's by Facebook. That's the interesting part from my point of view: I think Prophet could be integrated with Flyte, and from Matt's point of view these are black boxes; you could use an RNN-style model for time-series forecasting, or you could use Prophet. Next: how do you compare Flyte with MLflow? I am not an expert on MLflow,
so I may be wrong in some cases, but from my understanding MLflow is a library to track experiments, and you can use it with anything, Spark or other frameworks. We are more than that: we are a workflow orchestration and computation platform. We run Spark itself, and tracking experiments is a byproduct of running everything on Flyte, whereas with MLflow it's not a byproduct; you can go to Databricks and run your Spark job with MLflow or without it. That's my understanding. The next question: what do you think about feature stores? I actually like the idea of feature stores, but I don't think a feature store should exist purely as precomputed data, because the amount of data you compute becomes exorbitantly high. If you keep computing every single feature that users want, and you don't have great discovery for them, it's very easy to overwhelm the system. So I believe a feature store should, in some cases, hold references to the code that generated the features, and then generate the features from that. Yeah, I could maybe add a little bit there as well: the work we're doing in streaming forecasts is filling gaps that currently exist in the feature store solutions at Lyft. We don't really have an approach that works for the bandwidth of data we deal with, so in a way we've created an abstraction like a feature store that's driven by Flyte, and as Ketan mentioned, it's very reactive; it responds to the need for data as opposed to precomputing. We think we can maybe upstream this concept to a broader offering within Lyft, because ideally this should be fully abstracted for everyone. Getting access to features is a very common problem, and being able to take a federated approach is really important in an ML system. Yeah, I think a feature store is more of a hybrid system,
and that's what we're getting at: computation plus storage, rather than storage alone. I don't know if we have any more time for questions, moderator. Please wrap up? Okay, so we are ready to wrap up. We can take questions offline; join our Flyte Slack and ask questions there, or go to flyte.org. Looking forward to more questions; I would love to interact with the community. Thank you. All right.