Good evening, everyone. Thank you for making it to this very last session of the day. You've probably all had a long and exciting day; I have as well. In this last session, I'll be talking about the data pipelines behind carbon credits and why we at Pachama chose Flyte as the workflow orchestration system that powers them. Maybe through a show of hands: who here uses a workflow orchestrator, so that I have a bit of an idea? OK, there are some. My name is Bernard. In my day job, I'm a staff engineer at Pachama, and in my free time I develop and contribute to the open source project Flyte. I sit on the technical steering committee there, so I'm generally fairly involved. If you want to get in touch, I'm available after this talk, I'll be around, and there will be a question and answer session. Feel free to also go to my website, where you'll find links to most of my socials: my GitHub, my LinkedIn, and all of that. Today I'll give a short intro to Pachama and what we generally do, an intro to Flyte, what it is and roughly how it works, and then cover why we at Pachama chose Flyte over Airflow or Argo or Kubeflow or any of the other orchestrators out there. This should be interesting, because you might see features that your workflow orchestration system doesn't have, and if you don't have one yet, you can use these questions to evaluate whether a candidate is useful to you. At the end, I'm going to talk about a few common pitfalls that I see when people start out using Flyte. There are one or two tricky bits that you need to remember at first, and I'm showing these because they're typical things I see our developers running into. So what is Pachama? Pachama tries to find the best forest carbon restoration and conservation projects, and these projects typically sell carbon credits as a means of income.
They need to rent land, they need to plant trees, and carbon credits are usually the means for them to restore land or to conserve it for the future. One carbon credit is usually one ton of CO2 that's either removed or conserved. So for you as a buyer, it is important that the carbon accounting is right. When you offset a ton of carbon, after you've reduced everything that you could, sometimes you just need to offset, you want to make sure that the project is trustworthy, that the carbon stays stored for a long time, and that the land will not be deforested after a while, for example. At the moment, the evaluation of these projects mostly happens manually. Usually you'll fly in a bunch of forest scientists, they will come in, measure trees on a few plots of land, and from that extrapolate the amount of carbon that exists on the whole property or project. And they will do that every five years or so. So there is a lot of subjectiveness, and there are a lot of places for error inherent in this process. Sadly, this has led to a lot of scrutiny, and probably a lot of you here think carbon credits are not really worth buying. To an extent, that's true, right? There are a lot of bad actors and a lot of not-so-great projects, for different reasons, and Pachama tries to change that. We're trying to use modern technologies such as satellite imagery, other remote sensing data, LiDAR sources, and machine learning to make the whole forest carbon project process more transparent. That's essentially what we do. In a nutshell, it looks like this: we gather data, satellite imagery, LiDAR data, anything that can let us estimate carbon on the ground, as well as auxiliary data, such as whether the project developer has experience developing these projects, or what the political situation is in the country the project is located in.
Because all of these have an effect on the projects. Based on this data, we have transparent evaluation criteria that we use to determine whether a project is good or not, and we then approve or decline these projects. To give you a sense, at the moment around 30% of conservation projects get accepted through our platform, which means around 70% of projects don't meet our standards. So transparency and integrity are the core values of why Pachama exists. We want to make this market more transparent. We think that restoration is generally a good thing to do, and saving our forests is generally a good thing to do, but we want to be transparent about it and we want to get to reproducible results. We try to bring this into our engineering practices, and the key tool we use to do that is Flyte, our workflow orchestration system. And Flyte is truly open source. It was originally developed at Lyft and then open sourced a few years ago. It is now part of the Linux Foundation and a graduated project there, so you shouldn't expect it to ever go back to being closed source, and a lot of companies are already using it. These are probably the biggest logos using Flyte internally; most of them are also active in our community, and most contribute back, so it's great to see that we have a lot of adoption there. Now I'd like to give you a brief overview of what Flyte looks like, and afterwards talk about the core features that we think are important for our day-to-day workflows. You can see three little Python functions here. The first one returns a string, the second one adds an exclamation mark, and the third one orchestrates the two together. You can run this locally, and it should work. And this is great, right? These could be arbitrarily complex tasks. Now if you want to make them runnable on Flyte, the only thing you have to do is add three decorators.
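To make that concrete, here is a minimal sketch of the hello-world example from the slide. In real Flyte the decorators come from flytekit (`from flytekit import task, workflow`); to keep this runnable without a Flyte installation, minimal stand-in decorators are defined that only mimic the type check Flyte performs at registration time.

```python
# Stand-ins for flytekit's @task and @workflow decorators; the real ones
# additionally register the functions with the Flyte control plane.
def task(fn):
    # Flyte requires type annotations so it can type-check the graph at
    # registration time; we mimic that check here.
    assert "return" in fn.__annotations__, (
        f"{fn.__name__}: Flyte tasks must declare a return type")
    return fn

def workflow(fn):
    # The real decorator compiles the body into a static DAG instead of
    # simply calling the Python functions.
    return fn

@task
def say_hello() -> str:
    return "hello world"

@task
def add_exclamation(greeting: str) -> str:
    return greeting + "!"

@workflow
def hello_wf() -> str:
    # In a real workflow body these calls create graph nodes, not values.
    return add_exclamation(greeting=say_hello())

print(hello_wf())  # hello world!
```

Run locally, this behaves like plain Python; run remotely, Flyte would schedule each task as its own pod, as described next.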
This is probably familiar if you already have an orchestration system; most of them do something similar. You decorate the two task functions with @task, and @workflow creates the static graph of how the tasks interact with each other. Flyte is statically typed, so already at registration time we check whether all the types fit together. And it is inherently immutable, which is a core feature for us: whenever you run an execution, you create a new version. You can archive executions, but they are undeletable, so you always have full data lineage of all the workflows that you run. Once you run them remotely, you of course have a UI and a CLI that let you trigger workflows, see the state of an execution, see logs, and all the things you would expect from a typical workflow orchestrator. You also see a graph of what the execution looks like, and this can be arbitrarily complex; this is the simplest of all examples. Under the hood, Flyte is Kubernetes native. This is a simplified representation, but what happens is that FlyteAdmin, the control plane, creates a workflow custom resource, and then Flyte itself runs as an operator called FlytePropeller. It picks up the Flyte workflow resource and schedules the tasks in the according order. Each task is a pod. In this particular case, the hello-world pod would be scheduled first, its outputs would be taken and put into the add-exclamation-mark task, and they would run in this order. The important thing to remember here: each task represents a pod, and this will become important later in the slides. At Pachama, we're using Flyte for all kinds of things, because people started liking it and it has become a bit of an internal platform, so they started adding new features. Most of it, I would say, is traditional data engineering.
So we are processing heaps and lots of satellite imagery: a lot of nodes, a lot of compute power, all managed through Flyte. We are doing some machine learning, because one of the key problems we face is that we need to know how many tons of CO2 are in one pixel or one hectare. This is still a somewhat open research question, and we have a lot of very good machine learning engineers working on that problem, because it is the underlying core of carbon accounting for most projects. And the last part is simply running tasks in order: if a project comes in, you might need as many as 40 or 50 checks, and we manage all of them through Flyte to get complete data lineage; that's how we run them. So those were the first two sections, a short intro to Pachama and a short intro to Flyte, and now I'd like to walk you through what our decision process looked like and the main features that we found really helpful in Flyte. I tried to do a comparison to all the orchestrators out there, but quite frankly, there are too many, and all of them are slightly different. So please use this as a catalog of features that exist, and then you can compare it against your orchestrator or against new orchestrators. By no means is Flyte for everyone, but these are features that we like and find really useful; that's why we picked it. To walk you through this, I created a little demo graph of what it might look like to evaluate a project, highly simplified: we ingest satellite imagery, we pre-process it, we might combine bands, we might do some analysis on it, then machine learning infers how much carbon is in that satellite imagery, and together with some other results we then decide whether the project is approved or declined, or we at least give a human evaluator a lot of information to make that decision.
One of the first features we really needed, and you can already see it here: some of these tasks might be really beefy. The machine learning tasks will need a GPU and a big machine, whereas others will probably be tiny Python functions that just need to quickly run somewhere. So resource allocation, defining resources per task, was one of our core requirements. It's already possible in a lot of orchestrators; in Flyte it looks like this. In the task decorator that you saw before, you add requests and limits, you can select which GPUs you want to use for training or inference, and under the hood Kubernetes will use taints, tolerations, and node selectors, all configurable, to schedule your task on the correct nodes, and it will use Kubernetes requests and limits, of course. The second feature, one that our finance team and our sustainability team find incredibly helpful: you introduce an orchestrator, and at some point people don't just run it on one project but on plenty, and they find new use cases, so suddenly you don't run one project workflow every once in a while, but a lot of them. The easiest way to cut down cost, at least in our experience, is to switch from on-demand instances to spot instances, using compute capacity that's available in the cloud but not currently used. Usually you'll get these for around 10% to 25% of the cost, depending on the cloud provider, so you can immediately cut your cost by 75% to 90%. Not every workload is suitable to run on a spot instance, but if yours is, and most of them are, I would say, the only thing it takes is one line in a decorator, and your workflows will be scheduled onto interruptible nodes.
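A sketch of what per-task resources and the spot-instance flag look like. The `Resources` fields and the `requests`/`limits`/`interruptible` arguments mirror flytekit's task decorator, but a minimal stand-in decorator is used here so the snippet runs without Flyte; the task body is a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    # Mirrors the fields of flytekit's Resources object
    cpu: str = ""
    mem: str = ""
    gpu: str = ""

def task(requests=None, limits=None, interruptible=False):
    # Stand-in for flytekit's @task(...): records the scheduling hints
    # that Flyte translates into Kubernetes requests/limits and, for
    # interruptible=True, the taints/tolerations of spot node pools.
    def wrap(fn):
        fn.task_config = {
            "requests": requests,
            "limits": limits,
            "interruptible": interruptible,
        }
        return fn
    return wrap

@task(requests=Resources(cpu="4", mem="16Gi", gpu="1"),
      limits=Resources(cpu="8", mem="32Gi", gpu="1"),
      interruptible=True)  # the one line that moves this onto spot nodes
def infer_carbon() -> float:
    return 0.0  # placeholder for the real GPU inference job

print(infer_carbon.task_config["interruptible"])  # True
```

The beefy ML task gets a GPU node; a tiny helper task would simply omit these arguments and land on a small default node.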
When your interruptible node gets preempted, Flyte will of course reschedule on a new node, and it's transparent for the user, so there's no downtime and no error, unless it fails ten times. The next feature that we were looking for involves the inner development cycle. Imagine you're developing this workflow, and the part you're focused on is inferring how much carbon is in a pixel. You always rely on the first two tasks running before you can work on your task. So you will be developing, and whenever you run this remotely, the other two tasks need to run before you can try out yours. Of course, you could save the intermediate results and develop against those, but Flyte offers a nicer way to do this. Again, it's two lines in a decorator: you set cache to true, and you set a cache version. Caching is tricky; you want to set something better than "1.0" here. But that means that once a task has completed successfully, Flyte will reuse that execution and its output, and the task will never run again for the same inputs. Again, saving resources and making your development experience a bit quicker. We use this a lot; most of our stuff is cached. The third part revolves around working with a lot of data. We work in geospatial engineering. Satellite images are huge, and you usually also get a lot of them, so it's usually not enough to boot up one beefy machine and run everything on it. Usually you need some form of distributed compute framework, similar to the Spark setup from the last talk, or you can use Dask, or Ray, you name it. And wouldn't it be great, for example, if the pre-process-data task here could run on any of these distributed compute frameworks? For the example here, we'll use Dask, but this works for pretty much any of them.
And again, it's as simple as putting on a decorator. You add a decorator that makes this task a Dask task, and you specify what the cluster should look like: how many workers you want, whether you want the workers to be interruptible, the requests, et cetera. Any features that you'd usually expect also work here. And as I said, this works for most distributed frameworks; if yours is not on there, we're happy to implement it. It's usually fairly easy, and we've added a bunch lately. This works the same for developers developing locally: you can still run this locally, and locally we will use a local Dask cluster, in this case, or a local Spark executor. Remotely, in Kubernetes, and remember the sketch from before, each task is started as a pod. This now switches to each task creating its own Dask job resource, and if you have the Dask operator installed, it will bring up a cluster. You get an on-demand cluster that only boots up when you need it. It runs the workload, and it will have the same set of dependencies as your client, which is really important especially when working with Dask, so you will never get serialization issues. The work runs, and then the operator tears your cluster down again. Again, saving resources and giving you an on-demand cluster with very little effort. And if you have Flyte deployed already, it's usually a half-hour change to set this up. For completeness, we also integrate with other frameworks that don't have a Kubernetes operator. There's a system called Flyte Agents: you write a small web server where you basically need to implement three or four functions, and then you can add your own sources. Maybe you have an internal API that you would like to hook up, or maybe you'd like to use any of the agents that we already provide, for example Snowflake or Databricks.
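Going back to the Dask task for a moment, the configuration might look roughly like this. The `Dask` and `WorkerGroup` names are modeled on the flytekit Dask plugin but should be treated as assumptions (check the plugin docs for the exact API); stand-in classes are used so the sketch runs without Flyte.

```python
from dataclasses import dataclass, field

# Stand-ins roughly mirroring the Dask plugin's task_config objects;
# the names here are illustrative, not the authoritative API.
@dataclass
class WorkerGroup:
    number_of_workers: int = 1
    # requests/limits and an interruptible flag would also live here

@dataclass
class Dask:
    workers: WorkerGroup = field(default_factory=WorkerGroup)

def task(task_config=None):
    # Stand-in for flytekit's @task(task_config=...): Flyte turns this
    # config into a Dask job custom resource that the Dask operator
    # expands into an ephemeral cluster for exactly this task.
    def wrap(fn):
        fn.task_config = task_config
        return fn
    return wrap

@task(task_config=Dask(workers=WorkerGroup(number_of_workers=16)))
def preprocess_bands() -> str:
    # Inside the real task, the Dask client connects to the on-demand
    # cluster that Flyte spun up; it is torn down when the task ends.
    return "mosaic.tif"

print(preprocess_bands.task_config.workers.number_of_workers)  # 16
```

Because the workers run the same image as the task itself, client and cluster share one dependency set, which is what avoids the serialization issues mentioned above.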
That's supported as well. The last big feature is one that I'm personally really excited about, because I've been working at the intersection of geospatial and ML for a very long time. If you've worked with Python a bunch, you've probably been in dependency hell at some point or another, and I find that at the intersection of geospatial and ML you usually get all the crazy dependencies. One of the things that really helps us there: imagine that in the pipeline above, these are three completely different teams. It's simplified here, but maybe you have a team specialized in getting geospatial data; they know GDAL, Earth Engine, all the geospatial craziness. You might have a data engineering team that is really good at using Spark or Dask or another compute framework, because that's quite specialized knowledge. And team C might be your typical machine learning team: people working with TensorFlow, Torch, CUDA, any of that. So wouldn't it be great if all of them could have their own set of dependencies? Each team only needs to manage their own stuff and make sure it works, and other teams can depend on it without adding transitive dependencies, basically like pip-installing a library. Again, Flyte has a system that makes this very easy. Imagine the first team creates this ingest-data task; it could be really complex, it doesn't have to be just one line. Then the only thing the second team needs to do is reference a task that already exists in the platform, and they can use it as if it were theirs. But at runtime, Flyte will start the correct Docker image with the correct set of dependencies. That makes for a very, very nice workflow.
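The per-team dependency isolation can be sketched as follows. flytekit's task decorator does accept a per-task container image argument; the image names below are hypothetical, and a stand-in decorator is used so the snippet runs without Flyte.

```python
def task(container_image=None):
    # Stand-in for flytekit's @task(container_image=...): in real Flyte,
    # each task pod is started from the image given here, so every team
    # ships and maintains its own dependency set.
    def wrap(fn):
        fn.container_image = container_image
        return fn
    return wrap

# Team A: geospatial ingestion with a GDAL / Earth Engine stack
@task(container_image="ghcr.io/example/geo-ingest:1.4")  # hypothetical image
def ingest_scene(scene_id: str) -> str:
    return "gs://raw/" + scene_id + ".tif"

# Team C: ML inference with a CUDA / Torch stack in a different image
@task(container_image="ghcr.io/example/ml-infer:0.9")  # hypothetical image
def infer_carbon(path: str) -> float:
    return 0.0  # placeholder for the real model inference

# Teams reference each other's tasks; at runtime Flyte starts each pod
# from that task's own image, so the dependency sets never mix.
print(ingest_scene.container_image != infer_carbon.container_image)  # True
```

The typed interface between tasks is the only contract the teams share; the transitive Python dependencies stay inside each image.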
I have one more tiny thing, because this was recently donated to the project by LinkedIn: interactive debugging. If you're running really large machine learning workloads in Kubernetes on beefy machines, you can add another decorator, and once the task starts up, it will start a full VS Code server that you can connect to. You need to either set up ingress through Kubernetes or do a kubectl port-forward. Machine learning engineers can then actually debug the workloads in the cloud, using all the resources available there, because sometimes it simply might not be possible to develop on your local machine. This wraps up the features that we think make Flyte unique and that really help us develop machine learning and data engineering pipelines quickly and transparently. For all the projects that went through our platform, we have complete lineage of what data went in, which version of which algorithm was used, and how we got to an outcome, which we think is important to show that we are not just guessing. We want to know what went in. We also make errors, as everybody does, and if there is a bug, we want to know what impact it had; that full lineage gives us that. To finish off, two common pitfalls if you do end up trying this out. The first of the most common beginner issues: when you add a Flyte workflow decorator, the Python code inside is not really Python code anymore. It's a domain-specific language, a DSL, for creating a static graph. On line three here you see an `if` on the greeting, and that's really hard to represent in a static graph, so it doesn't work. Flyte will tell you so, but as a mental model: code in a workflow is not really Python code, it's a language to describe a graph. Flyte of course has mechanisms for this dynamism; they are called dynamic workflows and conditionals.
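To make this pitfall concrete: inside a workflow body, a task call hands back a placeholder referring to a future output, not a value, so plain Python branching has nothing to branch on. The `Promise` class below is a stand-in for flytekit's internal placeholder objects, simplified to show why the `if` fails.

```python
class Promise:
    # Stand-in for the placeholder objects flytekit passes around inside
    # a @workflow body: they refer to a future task output, so branching
    # on them with a plain Python `if` cannot work at graph-build time.
    def __bool__(self):
        raise TypeError(
            "cannot branch on a task output inside a workflow body; "
            "use Flyte's conditional construct instead")

greeting = Promise()  # what say_hello() would hand back while building the graph
try:
    if greeting:  # the pitfall from the slide
        pass
except TypeError as err:
    print(err)
```

Flyte's conditional construct expresses the branch as part of the static graph instead, which is why it works where the `if` does not.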
I won't go into those, but we have docs to discover, because that is a typical use case. Just be aware that plain Python control flow is not possible. The second pitfall that I often see: while Flyte is really good at making sure your types match, there's a slight downside, because all values are serialized to protobuf between tasks. That's what makes Flyte language agnostic: you can write one task in Python, one in Java, and one in C++. But on the downside, some things you might expect to work won't. In this case, the task returns a custom type, which is serialized to, I think, a protobuf struct. If the workflow then wants to return only a part of this struct, that's not possible, because Flyte can't introspect what's inside the struct in the DAG here. It's a typical case people run into in the beginning, but once you wrap your head around it, it's fairly easy to work with. Again, the link to the docs down here shows how to work around the situation. So the takeaways of this session: should you ever evaluate whether you want to use an orchestrator, or you already have one and want to evaluate whether you should switch, these are some questions you could ask yourself. Does it support resources per task? That's really powerful and really helps bring down your cloud bill. Are spot instances supported? Again, one line that cuts a lot of your cloud spend. Does it support caching? That usually helps with development speed, because you don't need to wait on your dependencies; they're already cached. How well can it integrate with other tools, such as Spark, Snowflake, Dask, you name it? And how easy is it to add a new integration in case one doesn't exist yet? And, for me the most important one: how well can you isolate dependencies, so that teams can move quickly and are not stuck in dependency hell for a very long time?
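The struct pitfall looks like this in miniature. Again the `Promise` class is a simplified stand-in for flytekit's placeholder: the real `ProjectResult` only exists once the task pod has run, so picking a field out of it at graph-build time has nothing to resolve against.

```python
from dataclasses import dataclass

@dataclass
class ProjectResult:
    # Custom return type; Flyte serializes this to a protobuf struct
    # between tasks, so the engine can no longer see individual fields.
    approved: bool
    carbon_tons: float

class Promise:
    # Stand-in for the placeholder a task call returns while the static
    # graph is being built, before any pod has produced a real value.
    def __getattr__(self, name):
        raise AttributeError(
            "cannot access '." + name + "' on a task output inside a "
            "workflow; return the whole object or split the outputs")

result = Promise()  # what evaluate_project() would return at graph-build time
try:
    result.approved  # the pitfall: plucking one field out of the struct
except AttributeError as err:
    print(err)
```

The usual workarounds are to return the whole object from the workflow, or to split the task so the needed field is a separate, typed output.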
I hope you've gotten a sense that carbon offsets are a hard problem. Not just a hard social problem associated with a lot of tricky bits, but also a hard engineering problem. Bringing transparency to the space is still an open research question to some extent, and also an open engineering question, because there is no platform that lets you trace all the steps that went into the carbon accounting of a project. Related to that, I'd also like to make a small call-out to the CNCF TAG Environmental Sustainability. I've recently learned about what they do, and I think it's great work: on the engineering side, what can we as engineers do to make sure we're only using the resources that we really need? As far as I know, they're in the Project Pavilion, and I think some of them are also sitting here, so go check them out. I think they're there tomorrow and Friday as well, if I'm not mistaken. I don't think I have time for a demo, so I'd encourage you to not trust me and try the whole thing yourself. This is all you should need; the slides are up in the schedule, so you don't need to take a picture. We have flytectl, a small Go binary similar to kubectl, that can start up a demo cluster on your machine. `flytectl demo start` will bring up a k3s cluster with all the dependencies installed. Then you install flytekit in your Python environment, write your workflow, and the last line, `pyflyte run --remote`, will bring it to Kubernetes. A few shout-outs: to Pachama, the company I work at, because it's one thing to use open source, and I think it's another to also contribute back and to give me the opportunity to talk about what we and Flyte do here. I think that's not usual, and that's nice. To Union, the folks who originally developed Flyte at Lyft; they're now offering a managed solution.
If that's something you're interested in; and they're super active in our community, still contributing back a lot. And to the whole Flyte community, because that's probably what makes Flyte really enjoyable: we have a very active Slack, everybody's really nice, and you get really quick responses to your problems. So in case you try it out and have an issue, post in our Slack, and usually within a few minutes you should have a resolution. That's it from my end. There are links to our docs and to my website, in case you want to connect; I'm happy to chat about anything around carbon credits, Flyte, open source, and tricky questions too, no problem. The middle one brings you to the session feedback in Sched, so if you do have any feedback, I'm very open to it. And with that, I'll open it up to questions if there are any. I think there are two mics here. Yes, please. Hello, thank you. How do you manage immutability in the carbon offsetting problem? I guess it is important to ensure that the data is not changed during the entire process. Do you manage that with some tool, or what does the process look like? Yes, there is a lot to this. For our input data, we try to keep full copies on our own buckets, so that we can be sure we still have the input data. With Flyte, because we save the whole workflow graph, we have full lineage, so we could rerun all the intermediate results if we wanted to, and generally we also save them. So we try to preserve as much as possible, and because we're using Flyte, if we have the input data around, we can fully reproduce the output data, if that answers your question. OK, yes, I think this mic. First of all, thank you very much for your talk, it's been great. My question circles around you as a company: Pachama as a company is mostly focused on carbon credits.
Do you measure your own consumption of what you've just shown, and do you offset that? How do you deal with your own footprint? I think that's a great question, and of course we do think about that, because we're working in this space. We generally try to use cloud providers that use renewable energy. I think we're using GCP at the moment, trying to use the regions where they offer basically net-zero operations. I'm not sure whether we're tracking all the compute yet; this is also something I learned in the talk you gave yesterday, and I would need to investigate. I personally don't think we track how many resources we're actually using, or what part of that is already offset through our cloud provider and what isn't. So I don't think we have that. That's all right, thank you very much for your answer, and we from the TAG are of course there to help you. Happy to; I will definitely come back, we've already chatted. Hi, so are you self-hosting Flyte in your Kubernetes? Yeah, of course; I mean, I work there, so yes. How easy was the process? Given that I also develop a lot on Flyte, for me personally it was fairly easy, and we try to make it as easy as possible for our users as well. It should be one Helm install that sets up the whole infrastructure. You will need a Postgres database and a storage account, or cloud storage somewhere. We support all three major cloud providers, AWS, GCP, and Azure, and you can self-host it on premise as well; again, you would need a database and cloud storage. The two things that are usually tricky are ingress, because everybody sets up their own ingress; we personally use Istio, but you can use any ingress really, so you need to be comfortable setting up ingress. And if you're using secret injection, you also need to be comfortable setting that up; we use Vault internally.
You can use any secret provider, any major one should be supported. Cool. And a second question: are you using Ray to distribute the workload? I know that it's possible to use Ray, Ray Serve, et cetera. Are you using that right now, to distribute the workload, to share the GPUs and all of that? We are personally not using Ray, so I don't want to speak to that too much; I would need to get back to you on that, because someone else maintains the Ray integration and I'm not sure what's possible there. Thanks. Any other questions? OK. Thank you very much. If you have difficult carbon questions, I'm around.