Hey everyone. I started with jokes already, so now it's time to get serious. Welcome, first of all — it's my first time here at the conference in Brno, and also in the Czech Republic, and so far thumbs up for everything. Thank you for making it like that. My name is Ivica and I will try to make the next 25 to 35 minutes interesting for you. Because we are now in the post-lunch, sleepy-time energy dip, I'll try to make this interesting, I promise. If you want to make it interactive, you know the drill: ask your questions and we will answer all of them at the end — there will be a slide for that. But we also need to get our blood flowing. So, is there anyone here in the audience who considers themselves a data engineer, or someone who works with ETL, with data processing, with whatever? Don't be shy. Okay, a few of you. I'll raise my hand myself — well, I'm lying actually, but I just wanted to provide an example for you.

Did you also hear that data is the new oil? Kind of? Yes? No? What about this one: data is the new gold? I heard it, and I'm not even a data engineer. But I agree with both of those statements, because they are true. And when we talk about ETL, the E stands for extracting, and it means getting something out. In our case, it means getting the gold out of the ground. And that is hard work, absolutely — I haven't tried it, but I heard dwarfs talking about it. First of all, you need to know where the gold is. And even if you know where the gold is, you need a permit to dig. There could also be dragons guarding it — just ask Bilbo Baggins, he knows for sure. But knowing where to dig is only half the story, because the other half is getting the permit, and usually, in my own country, you need to bribe a few people. You need to, you know, grease a few palms and whatnot. Yes, that is how it works. In any case, the scenario that we will go through today goes exactly like that: you know there is gold in those hills and you have the permit, but the local clerk who gave you the permit was a little bit on the bottle, and the permit says you need to dig from your home. What? Well, exactly. We are going to dig for gold from our own home.

And speaking about home, my professional home is one of the largest fashion companies that you have never heard of. It's called Bestseller. But I'm hopeful that you have heard of at least a few of its brands, such as Jack & Jones, or Vero Moda, or Only, or about 18 other ones. As I said, my name is Ivica, and I'm definitely not a dwarf — although all my Dutch colleagues might disagree with that. I work as a data platform team lead at said company, Bestseller. What we do is essentially provide the analytics platform to our colleagues and stakeholders so that they can make business decisions and be informed about those decisions. Some of those decisions are: do I buy this shirt? Is it going to sell? How many of them do I need? When do I show them on the website? When do I take them down? And you do all of that nine months in advance, because you don't buy today for tomorrow — you buy for the next season. So there are challenges in running this platform and providing these insights, and one of those challenges is definitely around data governance and data residency. The scenario that I want to share with you today is about how you can use Apache Airflow and ECS Anywhere to solve those challenges.
But we first need to understand what this residency thing is. What does it mean? I'm not a native English speaker — you can probably hear that — so I reached for the dictionary. There it is: residency is the fact of living somewhere, of residing there. In my example, I'm living in the Netherlands; that's where I'm a resident. But the fact that I'm allowed to live there is regulated by law. I also pay taxes there, which is, again, regulated by law. Which brings us to the next term: governance. What does it mean to govern something — a society, a country, an organization? Well, it means coming up with laws and making sure that they are enforced within that organization or country.

Now that we understand what residency and governance are, let's also see how the statement "Acme, the company that makes everything, acquires a competitor" can turn into "Acme acquires a competitor in the US and now they have problems paying Christmas bonuses." And it's not what you think: they still have the money to pay the Christmas bonuses, but they cannot do it. In the world of big fish acquires small fish, our own big fish, Acme, acquired their US competitor called Cyberline Systems. Nothing wrong with that, right? Companies buy companies. But if we consider that these two companies are in two different geographic regions of the world, with different governing parties and different laws around data processing, that's when things start to get interesting — or nasty, depending on who you ask. Because we absolutely know that there is valuable data hosted somewhere in the US, where Cyberline Systems is, and the leadership at Acme would give pretty much anything to get insights from that data. But their ETL tool of choice is hosted somewhere in the EU, and you are not allowed by law to process US citizen data outside of the US. So you have to process the data in the US, and that's the goal. And that is the challenge we have, because we need to get to that goal from our own home. Clear? We get it? Okay.

Essentially, we are in this situation architecturally. As I said, Acme, the existing company, has its infrastructure hosted somewhere in the EU. They have a container management platform, and they also have an ETL platform. What they recently bought is two data centers, one of them in New York, the other one in San Francisco. It is, of course, technically possible to load data that sits somewhere remote, but we are not allowed to do this. This is where I would ask you to pick up your phone, scan this, and tell me: how would you approach the problem? How would you solve it? I'm going to give you a few minutes and drink some water. I was in the blue half a few months ago, yes. So, a pretty even distribution, I would say. Thank you for making me feel more at home. Now it becomes a game, right? There's a whole concept of gamification; this is how you gamify it. Okay, but whatever the results are, I'm going to tell you anyway that there is no wrong answer, so let's just continue.

Roughly a third of you said: let's do a site-to-site VPN. A valid solution — there is nothing wrong with it until you consider the complexity. We need skilled people to set it up and to monitor it. And because this is now a critical part of your infrastructure, you need to make sure there are two redundant links, and you need to make sure there are people on call to babysit it when it needs to be babysat.
And we're talking about two geographic locations, so different time zones, which means you need people around the clock, and you need to pay them to pick up the PagerDuty calls. Not ideal. Another approach, as some of you said, could be to run the ETL tool of choice per location. Technically it's very easy — most of these tools are containerized, so you just run them in all the locations. But what if you have 70 locations? What if you have 25, or 250? Who's going to manage that? Who's going to manage access to all of those tools, because they are now separate tools? How do you manage the permissions these tools have on the datasets you want them to process? How do you reconcile logs? You have dependencies between jobs in location 1 and location 17, and one of them fails — how do you reconcile logs across multiple locations? It simply becomes an operational nightmare very quickly. So let's not do this. Maybe it is safer to cry in a corner.

You know who made money during the gold rush? The people who were selling shovels. I don't have any shovels on me, but I would like to introduce you to two shovels: Apache Airflow and AWS ECS. Let's talk about them briefly. By the way, is anyone in the room using either of these tools? Okay. Any happy users of those tools? Yes? Yes? That's what I wanted to hear. Okay.

AWS ECS is the Elastic Container Service, and it is essentially a container management platform. Some call it a poor man's Kubernetes. Maybe it is, maybe it isn't, I don't know. But what I do know is that it allows you to run containers, to scale them, and to make sure that they keep running. It does the job right in most cases — it is far from perfect, but it works. One of its best features is the integration into the entire AWS ecosystem: if you need to handle permissions for those containers, it's IAM; if you need to change firewall rules for those containers, it's VPC, again on the AWS side. The way we use ECS within Bestseller is to run services 24/7, mainly the services that make or break our business — things like refund services, order management systems, returns, anything that, as I said, can make or break our business. But we also run scheduled jobs, and one-off containers once a month when we need something done, et cetera. Those processes are still important; they can be waited on, but they are still business critical. One thing that ECS doesn't do, though, is workflow orchestration. It's not going to wait for something else to happen before running your container — it just runs it on a schedule. So how do you orchestrate this?

Well, that's where Apache Airflow comes in, because it is an open-source platform for developing and orchestrating workflows, specifically workflows that have dependencies between them. It allows you to create workflows called DAGs — directed acyclic graphs — through Python code, which means you can version control them, collaborate on them with your teammates, and get all the good stuff that comes from that. You can also create intricate dependencies between the jobs in a workflow, and you can create dependencies between the workflows themselves. Look at a very simple example: making a pizza. I guess you've done it at least once in your life, right? So how does it go? You prepare the toppings, you prepare the base, and then you bake them. But you can also do the toppings and the base in parallel: your partner does one, you do the other — the chopping, the taking care of everything that falls on the floor, et cetera.
So you can do them in parallel, and in the end you will make a pizza, but you need both of those processes to finish before you put everything together. Well, that's exactly what Airflow does. It allows you to organize these jobs, these tasks, into workflows that it calls DAGs, and then you can configure dependencies between them: which comes first, which runs in parallel, et cetera. And keep in mind that Airflow is not imposing these dependencies — they come from the real world. You cannot make a pizza without the dough and the toppings. You can without, what's it called, pineapple? You can do without pineapple, right? That's all you need.

The real power of Airflow is that it comes with batteries included, and these batteries have been committed by the community, because it is an open-source tool. There are more than 10 million downloads per month, so you could say it's pretty popular. Big players such as Adobe, Airbnb, LinkedIn and others all use it, because it is a good tool. But let's talk about these batteries. One of the first batteries that comes included with Airflow is called operators. So what do they do? They operate — not the way doctors operate, but they operate on something else. As you can see in this example, we are using the Redshift operator, and what it allows you to do is provide a SQL query which is going to be executed on a Redshift cluster, without you having to know too much about how to make that happen. That's what operators do: they do the heavy lifting for us so that we don't have to. Makes sense? Okay.

So, enough about ECS and Airflow. Let's talk about the features that make them work together, and work well together. The first one is called ECS Anywhere, and it is essentially a feature that allows you to reuse the existing infrastructure that you have, including Raspberry Pis, to extend your cluster and run containers on those machines. The next one is Airflow's ECS operator, and I think you can guess what this guy does: it allows you to run containers on ECS. You already see how they work together? Yes? No? Well, I'm here to tell you. With these two, essentially, we can orchestrate containers on our existing infrastructure using this guy — the ECS operator for Airflow — which runs a container somewhere on ECS.

Looking at the final solution architecture, this is what we ended up with when we said, okay, let's use ECS and Airflow together. There's a lot going on in this diagram, so let's go through it. This is the existing container management platform, the existing ECS, and it is somewhere in the EU. We have it, it's up and running, it's making money, it's all good. What we also have is the existing ETL solution, again located somewhere in the EU. It's also running, it's providing value. But then we have these two data centers that are remote and that were untouchable for us up to this point. One of them is in New York, and we can extend the existing container management platform to use that data center as well, by using ECS Anywhere. And we can do the same for the San Francisco data center. So now that we understand what the final architecture looks like, let's see it in practice. This would be demo time. And usually my experience with live demos is that the demo gods are not favorable, so I will just show you screenshots of what I did.
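For reference, the "extend your cluster" part with ECS Anywhere boils down to roughly the following — this is a hedged sketch based on the usual ECS Anywhere registration flow, not the exact commands from the demo; the cluster name, role name, and region are placeholders:

```sh
# 1) In AWS: create an SSM activation tied to an IAM role the external host will use.
#    "ecsAnywhereRole" is a placeholder role with the SSM and ECS managed policies attached.
aws ssm create-activation --iam-role ecsAnywhereRole
# -> returns an ActivationId and ActivationCode

# 2) On the on-prem machine (e.g. the New York VM): download and run the install script,
#    pointing it at the existing cluster. "acme-cluster" is a placeholder name.
curl --proto "https" -o /tmp/ecs-anywhere-install.sh \
  "https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh"
sudo bash /tmp/ecs-anywhere-install.sh \
  --region eu-west-1 \
  --cluster acme-cluster \
  --activation-id "<ActivationId>" \
  --activation-code "<ActivationCode>"
```

After that, the machine shows up in the cluster as an external container instance, which is what the demo relies on.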
Screenshots are easier, and you have to trust me — that's why it's easier, right? Cool. If you don't trust me, everything that I'm showing here is available in a repository. It's all code, so you can run it on your own and prove me wrong, please. Okay.

The first thing we are going to do is look at the ETL scripts that we have, because they are the basis of everything that follows. We have two of them. The first one is the bonus calculation script — we want to give our people the Christmas bonus they earned working hard for us. The script itself is fairly simple: it loads a CSV file, does some data processing, saves the result into another CSV file, and uploads it to AWS S3. The second script is the tax calculation. This is the one that calculates company taxes based on the offices they have, lease contracts, and so on — it's too much for me to go into, but we must include these new branch offices in the US. And it is very similar to the first ETL script: it loads CSV data, processes it, and uploads the result back into S3. So let's verify that they work. This gentleman said he doesn't trust me, so I have a screenshot to prove him wrong. I can just run the script like this, with valid inputs, and a few seconds later, if I list the contents of this S3 bucket — you can see I did it this morning — sure enough, it works. The result file is there. The same goes for the bonus calculation script: pretty much the same deal, same steps, same destination, and we get the results. Nice.

Since we've seen that these scripts work on their own, standalone, now is the time to containerize them, because remember, we need to run these scripts as containers, and they need to be scheduled somehow. Dockerfiles, docker build, docker push — boring, boring. The only thing you need to know is that they build, and I pushed the container to where it needs to be, because the next step is to actually use that container in an architecture like this one.

What I have running right now is two sets of resources. One part is the resources running in AWS, provisioned with the same code that I mentioned. But we also have the blue area here, and no, I didn't buy any property in New York or San Francisco — I just spun up two VMs; they're good enough. What I also did is extend the existing container platform to use these virtual machines, which are running on my machine, as external cluster capacity. Makes sense? Clear? Okay. Why two? Well, because we have two ETL scripts — one for the Christmas bonus, one for the tax calculation — and because each of these machines has access to a different dataset. One host has access only to employee data, while the other host only has access to property data, lease contracts and the like. Which means that our infrastructure is ready to run the ETL script as a container, so let's see if it works. Here I'm just using the AWS CLI and essentially telling it: hey, go run this thing on this cluster, and please use the external part of the cluster, which is the two VMs I spoke about. And sure enough, looking at the logs, it works — it does something, it doesn't complain, and to me that's good enough. But you can also notice that it runs on one of these machines: it runs on the bonus host, as I call it.
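For the curious, here is a minimal sketch of what a bonus-calculation script like the one described above might look like — purely illustrative: the file names, the bucket, the column names and the bonus rule are made up for this example, not taken from the demo repository:

```python
# Illustrative only -- roughly the shape of the bonus-calculation ETL script.
import boto3
import pandas as pd

INPUT_FILE = "employees.csv"        # only present on the "bonus" host
OUTPUT_FILE = "bonus_results.csv"
BUCKET = "acme-etl-results"         # hypothetical results bucket back in the EU

def main() -> None:
    # Extract: load the employee data this host has access to
    employees = pd.read_csv(INPUT_FILE)

    # Transform: toy bonus rule -- 10% of salary, capped at 5,000
    employees["bonus"] = (employees["salary"] * 0.10).clip(upper=5000)

    # Load: write only the aggregated result (IDs and numbers, no raw personal data)
    employees[["employee_id", "bonus"]].to_csv(OUTPUT_FILE, index=False)
    boto3.client("s3").upload_file(OUTPUT_FILE, BUCKET, OUTPUT_FILE)

if __name__ == "__main__":
    main()
```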
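And the run-task call just described looks roughly like this — again illustrative, with a placeholder cluster and task definition; the EXTERNAL launch type is what targets the ECS Anywhere instances:

```sh
# Run the containerized ETL task on the external (on-prem) part of the cluster.
# No placement constraint yet -- ECS will pick either of the two registered VMs.
aws ecs run-task \
  --cluster acme-cluster \
  --task-definition tax-etl \
  --launch-type EXTERNAL
```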
So what would happen if I ran the same thing again? Remember, we have two machines — would it run on the same one, or would it use a different machine? I see a lot of people yawning, so let's do some stretching. Who of you thinks, by raising your hand, that the next time I run the container it is going to run on the same machine? And who of you thinks it is going to run on a different machine? Okay. You probably noticed I raised both hands, because both are true — thanks to the round-robin algorithm, there is a 50% chance of it landing on the "incorrect" machine, because we have two of them. This is the situation we end up with, because the task can run on either of them. And that's a no-no — not because the hosts themselves are wrong, but because the dataset they have access to is wrong. The tax ETL script can run on the bonus host and say: no can do, boss, I don't have access to these files. So how do we make this work? How do we approach this? Tags? Nice one. Real metadata tags, or some other kind? He says tags. Okay, well, you're not wrong, my friend — thank you for speaking up.

This here represents the algorithm, let's say, behind how ECS schedules a task — how it chooses where to run your container. As you can see, it first looks at CPU and memory, meaning: hey, can I run this container with this much CPU and memory? If it can, it then looks at location, instance type, et cetera, meaning you can tell your container: hey, please run on instances that have access to a GPU, if you need a GPU; or hey, please run in this region of AWS, because that's where my other infrastructure is. But what you can also use is something called custom attributes, or custom placement constraints. As the name suggests, they are completely custom, meaning you can tag your container instances with whatever you wish and then use those tags when you run your container. This is something we can use to tell each of these ETL scripts to run here, or there, or nowhere — because you can also mess up these attributes, and then the container will simply not be scheduled. This is how the custom attributes are set: what we have, essentially, is a custom attribute called purpose with the value bonus, and the same goes for the other one: purpose, tax. And as you can see, I set them on two different container instances, so now they are completely separate. What we can also do, when we run the container, when we try to schedule it, is to say: hey, please include this placement constraint and only run this container on this specific set of machines — for us, this one machine. And by using these task placement constraints and the custom attributes we just set, we can now guarantee that the proper scheduling is going to happen, such as this one.

This is the last step, but also the most complicated one: now is the time to schedule these ETL containers on our very specific infrastructure using Airflow, and using everything we mentioned up to this point. So what I did is I went ahead and created these DAGs in Airflow — one for the Christmas bonus, one for the tax calculation — and they are pretty similar in code. I did so by using the guy we already met about 10 minutes ago: the ECS run-task operator. To schedule these containers, we can use code like this, and if you have a keen eye, you can notice that it's very similar to how we ran the container using the AWS CLI.
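What follows is a hedged reconstruction of what such a DAG task might look like, not the exact code from the repository — the DAG name, cluster, and task definition are illustrative, and it assumes a recent Airflow 2.x with the Amazon provider installed:

```python
# Sketch: scheduling the bonus ETL container on the "bonus" external instance via Airflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="christmas_bonus_etl",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,                     # triggered manually for the demo
    catchup=False,
):
    run_bonus_etl = EcsRunTaskOperator(
        task_id="run_bonus_etl",
        cluster="acme-cluster",        # the existing EU cluster, extended with ECS Anywhere
        task_definition="bonus-etl",   # the containerized bonus script
        launch_type="EXTERNAL",        # target the on-prem / ECS Anywhere instances
        overrides={},                  # no container overrides needed here
        placement_constraints=[
            {
                "type": "memberOf",
                "expression": "attribute:purpose == bonus",
            }
        ],
    )
```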
We specify the name of the cluster where we want this thing to run; we also specify which task definition — which container definition — to use, and where to run it, meaning external instances or internal ones. But we can also provide the placement constraint, saying: hey dude, this is the Christmas bonus calculation ETL script, please run it only on instances whose attribute purpose equals bonus. And ECS is going to respect this. Now, the DAG ran, the container itself was scheduled and it ran, as you can see from the logs, and we can also see that the exit code is zero. What does zero mean? Can you remind me? Yeah — it was successful. But what the hell happened just now? Do we know? Let's go through it a bit. The first thing that happened is that Airflow, as a tool, ran a DAG. It scheduled this DAG, and the DAG itself started a container on ECS. ECS then said: okay, this guy wants to schedule a container on this specific machine type in this specific location, and I can do that. So ECS schedules the container on the remote machine in the remote data center — the one with access to the proper data. That's when the container executes: it starts the ETL script, which chugs along, processes the file, and in the end uploads the results back to Europe. And this is fine, because these are just the results — it didn't upload any data on US citizens, it just uploaded a bunch of IDs and numbers. So we are still compliant.

And because time is running out — and we are having fun, right? Yes, thank you — it's time to recap. We saw that these processes can be simple, they can be complex, they can be run manually or on a schedule, et cetera. But they are necessary, because the data we process comes in many shapes and forms and needs to be transformed and processed along the way — you know this better than I do, probably. We also saw that ECS is the right tool for the job, but it does come with trade-offs. We also saw that Airflow is the right tool for the job, but it needs a bit of a boost from a container technology started more than 10 years ago — proving that Docker is still valuable, if you use it in the right way. And in the end, something that I learned along the way in my career: there are technical solutions to technical problems, but there are also non-technical solutions to technical problems, and vice versa. The sooner you get this, the sooner you can advance your career and become the famous 10x engineer that you can be. With this, I would very much like to thank you for your time and for listening to me. If you have any questions, you can shout them out, but you can also use our good friend Slido and ask any questions that you may have.

Okay, so we still have some time — please ask those questions before people start pouring in again. If you have them, please. You mentioned your DAG tasks uploaded back the processed data — is this processed data just scraped from the output of the container? No, it is a CSV file that is created by the container. It's a CSV file — and is that automatic, or is it part of the job and of the integration to the platform? Yeah, it is part of the ETL script. Essentially, the ETL script is completely custom — it can be whatever; this is just a toy example of calculating Christmas bonuses. So it can be whichever you like. There was another question here — yes, please. I think you raised your hand, maybe I'm wrong. Okay, yeah, please go ahead. You were mentioning the placement constraints. Yes. I was thinking — so basically, it's Airflow that does the scheduling.
Kubernetes has taints and tolerations — is this basically duplicating the same thing, or is it using taints and tolerations under the hood? Okay, so the question is about custom task placement constraints and Airflow scheduling with them, and the point is: what is the added value? It can work with Kubernetes as well; in this case it was working with AWS. And the reason it is duplicating the work, and encroaching on the infrastructure part, is because Airflow is the one scheduling this. Airflow is the orchestrator now — you are giving it the power to do so. And also the operators. Yes, because without operators, just using containers, you would have to write everything on your own, and with these, inside Airflow, it's just a few lines and everything is done. In this very specific example, yes, because the operator itself is simple and we are just running the task. But the real power of this is that, while Airflow is written in Python and those workflows are written in Python, if you're not someone who does Python — if you prefer Rust or whatever else — containers let you do that; you're not locked into Python anymore.

I think we are out of time. We still have two minutes? Okay, please — two more minutes, the session is still running. Yes. Yeah. So the question is: have we considered using other solutions, such as Lambdas, to achieve the same task? It's a very valid question, but you have to think again about whether the tool brings anything unique. Lambda is completely custom — it's the code that you put into it that is valuable. And it has limitations, such as how far it can scale in terms of memory, but also time limitations, because you can only run for 15 minutes — and what if we have a 10-gig CSV file that's going to take three hours to process? Also, the reason we are using Airflow is because it is the ETL tool of choice and it comes with batteries included, meaning we don't have to write the ECS scheduling logic ourselves — it does it for us.

Please go ahead. Yeah. I was just wondering about the experience of running it — I guess you are running your own Airflow instance, so what was the experience like? Is it maintenance heavy, maintenance free? Depends on how you look at it. So the question is: are we running our own Airflow instance, and is it difficult to manage, or difficult to just keep running? As I said, we have both: an Airflow which is self-managed containers, and it is not updated very often because we are bad at doing this — it's not Airflow, it's us, like in every relationship. But we also have another deployment of Airflow which is completely managed by AWS, and that one is much easier to work with, but it also costs five or six times more per month. Yes. If there are no more questions, thank you very much again for attending.