You're welcome. Good evening, everyone, and thanks. I'll try to keep this quick, and I'll try to make it interesting for everybody. So, what I'm talking about today is a new initiative. Here we go. So, before this: anyone who works as an engineer probably knows cron. It's something well known, scheduled jobs running on a remote server. We have since adopted Apache Airflow, and that is what we are moving towards now. But first, some background. Oh, sorry, I forgot to introduce myself. My name is Daniel. I'm an engineer with WeGo, and I work with the team that builds our internal tools and pipelines, the ones used by our engineers and our data people. One of our goals is to make our engineers' lives easier, because when people have better tools, or an easier way to ship their work, they can move faster.

So, a few reasons why we decided to move away from cron. Here at WeGo we run a lot of scheduled tasks on GCP, and all of these tasks have to run regularly. A lot of them are about moving data around. As was mentioned earlier, we have a lot of data coming in, and scheduled jobs are what make business intelligence happen on top of that data. We do things like ETL from third-party providers. Sometimes the sources differ: sometimes we watch for a data dump landing in S3 and basically sync it over. Another thing we do is, on a schedule, go over the data and build reports from it. Sometimes we just pull from our data warehouse to make numbers available to the business people, and sometimes we load data into BigQuery. And some of these providers require us to pull their data on a schedule.

So the idea is that all of these tasks are very independent. They depend on a lot of things going right, and they need to be reliable. One important thing is that they have schedules and dependencies. Our tasks have schedules and they have to run on time. For example, we have jobs that run maybe daily, maybe hourly, and they can't just miss a run. And sometimes they fail. So if I build a task and schedule it, I need to know when it fails, and I need to be able to fix it when it breaks.

So you would think this is very easy, that we just need to schedule things. Let's start with a single instance. We have one code repo, and in it you have the tasks. So you clone the repo, you install all the dependencies, you install something like Anaconda, and you set up cron. So that is something that might run maybe every day, every hour. And, you know, as things go on, you keep installing more dependencies and tools onto your instance.
Or maybe the new jobs end up on the same instance. So it's a bit better than nothing, but still not great. You might create instances manually to do this, set up cron by hand, and so on. So let's walk through it. When a new task comes along, how does that go? You create an instance from the UI. You provision it, you install the dependencies. Or maybe there is an existing instance, and you just install whatever extra things you need onto it. You create a cron entry for the task. You set up SSH keys, because you need to pull the code. So that is the naive way this goes when a new task comes in: you provision an instance, you set up cron and, eventually, the task runs.

So, what's wrong with this setup? If anyone here has any experience building something like this, I think, you know, everyone starts somewhere. There's no shame in a small team or a small project beginning this way. But very quickly you find that there are several problems with this setup.

So, okay, several problems. The first thing: we have a problem with dependencies. We said the worker runs everything in one place. First of all, the system-level dependencies are shared, because you install packages globally onto the instance, so every task sees them. And the Python environment is shared as well. I mean, you might share one environment, let's say with Anaconda, and whatever you install, maybe for just one task, can upgrade packages for everything else. So if you have task A and task B, and they share dependencies, and you upgrade something for task A, task B can break. Something completely unrelated to the problem you are working on right now breaks.

The second thing is that you have a problem with credentials. With credentials, you need to worry about making sure they never get committed to the repo in the first place, and about how you distribute them, which is very important, especially when credentials get rotated.

The third thing is SSH keys. Even if you're using SSH keys with Git, if you keep keys on your instances, you have to make sure they never leak. And when you want a new instance to be able to pull your code, you need to provision those keys onto it as well. So, SSH key management.

The next problem is cron itself. Cron works, but you will run into problems with it. For example, say you have a job that runs every hour, but one day your job takes ten hours to finish. Then you are going to have a problem.
Because those overlapping runs will cause problems as they pile up. So the next thing we need to worry about is logs. With the manual approach, all the logs from the jobs you run end up scattered across the instance, and there is no way to figure out what went wrong unless you SSH in to dig through them. And the last thing, which is probably the easiest to overlook, is deployment. So if, say, I need to install a new package, and I deploy it to an instance that is busy running jobs, that deployment can interfere with the tasks that are currently running. So this might cause the task to fail, or maybe the instance might crash.

So based on these things, I kind of divided it up into three major categories of problems. The first one is a process problem: it can be solved with DevOps, it can be solved with a better process. The next one is about how we design the project. And finally there is the program itself that is running all this. So this is kind of how we decided to classify these issues, and based on this, we can come up with solutions that meet these three problem categories.

So basically the goal when we came in to redesign the system is that we wanted to make everything reproducible, to define everything as code. Like what Tian mentioned, one of the only ways that we can move as fast as we do is that we try to define everything as code. So that, for example, if the instance goes down, we are able to immediately bring it back up, because we have defined it as code. That goes for infrastructure and for instance provisioning. Even the CI/CD pipeline, we also want to define as code. And this is something that is new for us as well: previously we were sharing the dependencies, and what we want to do now is to define the individual tasks' Python dependencies as code too.

So, scheduling. We need it to be stable and easy to use. One of the issues that we faced was that, with all these random tasks running at once on a single instance, we didn't have a lot of insight into what was going on. When things failed, we couldn't really tell. So we wanted something rock solid and stable. And more importantly, easy to use. Because the users of these tools might not be full-fledged engineers. They might be people from the data team, they might be people from marketing, who just want to run some tasks on a remote server. Scalability is nice, but it wasn't at the forefront when it came to designing the system. And we wanted to remove most of the manual interaction: you shouldn't have to SSH into the server and tweak things by hand. A nice UI is always nice.

So okay, the first thing we probably want to talk about, to dive a bit deeper, is what we mean by infrastructure as code. This is the stack that we use. Here we go. When it comes to instance provisioning, we are using Packer together with Ansible. What this means is that we are able to create a machine image with everything installed. We're using Ansible for that, and we're using Packer to create machine images based on the same code base. So for example, if we need to create a Docker container, we can use that; if we want to create an AWS AMI, or a machine image for Google Cloud, we can use pretty much the exact same code base to generate that.
So when it comes to infrastructure management, instances are not the only bit of our infrastructure. We also have other supporting infrastructure, such as databases, such as load balancers, and all these things need to be automated somehow as well. We use Terraform for this. When it comes to deployment and automation, we are using Jenkins. More specifically, we're using Jenkins Pipelines and the Job DSL. This allows you to basically create everything from scratch. So let's say we spin up a brand new Jenkins instance. We need to create the jobs that would run for certain specific tasks, and all of these can be defined as code using Jenkins Pipelines and the Job DSL. And for deployment, we also use Ansible.

So okay, let's come back to the architecture components that support Airflow. We have an instance called the builder. What the builder is, is that it controls the deployment as well as the automation involved. So it's kind of the brains, the key point of interaction. It's basically a Jenkins server that uses Ansible for automation as well as deployment. Then, the star of the show: Airflow. This is our scheduler. It focuses on recurring tasks as well as monitoring the tasks that have run. And finally, we have the worker, which is a very humble, basic instance that only has runtime dependencies.

So the key thing is that we are using Pipenv. For those who are familiar with the issues concerning dependency management in Python, the key problem is basically that without Pipenv you are always left not knowing what actually is installed. Even if you define something like a requirements.txt and say, okay, I'm going to lock this top-level package down at version one, the underlying dependencies might still shift. So something that Pipenv solves is that it gives you absolute certainty when it comes to reproducible environments. For those of you who haven't checked out Pipenv, it's a great project. Do check it out.

So now, okay, we have explained these three components. How do they interact? We have our job repo and we have our schedule repo. We split these two up since they contain different information, and it might be easier to extend. Our job repo contains all the code of the tasks that need to run. The schedule repo contains the Airflow schedules. I'll go into a bit more detail about what exactly the Airflow schedules are, but the terminology is that they're called DAGs, directed acyclic graphs. Just keep that in mind first. Jenkins will do two things. It will build the virtual environment using Pipenv and deploy the code to the worker. Jenkins will then also transfer the schedule files to Airflow. The way that Airflow interacts with the worker is through SSH: there's something called the SSH operator that allows you to run code remotely.

So this is an example of our Jenkins pipeline job that is used for deploying code. As you can see, we are able to define the branch that we want to deploy. We are also able to define the tasks that we want to deploy, and that code would be packaged up and just sent over to the worker. So essentially what it does is that it pulls the latest code from Git. It builds the virtual environment from the Pipfile and the Pipfile.lock using Pipenv. Then it copies the code and virtual environments over to the worker using Ansible. Finally, it templates the credentials using Ansible Vault, which is the encryption tool we use to keep all our credentials safe.
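As a side note, the SSH operator mentioned a moment ago is just a regular Airflow task. Here is a minimal sketch of how a DAG might trigger a command on the worker over SSH (assuming Airflow 1.x, where the operator lives under contrib; the DAG name, connection ID, and paths are hypothetical):

```python
from datetime import datetime

from airflow import DAG
# In Airflow 1.x the SSH operator lives in the contrib package.
from airflow.contrib.operators.ssh_operator import SSHOperator

dag = DAG(
    dag_id="remote_task_example",      # hypothetical DAG name
    start_date=datetime(2018, 7, 1),
    schedule_interval="@daily",
)

# Runs a shell command on the worker over SSH. "worker_ssh" is a
# hypothetical connection ID configured in Airflow's connection settings.
run_task = SSHOperator(
    task_id="run_task",
    ssh_conn_id="worker_ssh",
    command="cd /opt/tasks/my_task && pipenv run python main.py",
    dag=dag,
)
```

Since the command is just a shell string, the worker only needs the runtime pieces that Jenkins deployed; Airflow itself never imports the task's code.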
So whenever we deploy something, Ansible Vault decrypts the credentials and makes them available to be templated. So this is a visual representation of a specific task that we're running. It gives it a clean slate: it cleans the workspace, it checks out the latest version of the source code, and then, based on the input, we can specify one or many tasks that we want to build. The first thing that happens when a task is being built is that it creates the virtual environment. Earlier we were talking about resource starvation, if, let's say, we built the virtual environment on the worker itself. This helps avoid that case, because we are creating the virtual environment on the builder, which is entirely separate from the worker. So the worker can just focus on running the tasks, and the builder can focus on doing what it does best: building. So it builds the Python environment, and finally it uses Ansible to deploy everything that we need. And because of the way that we have structured our infrastructure, by keeping the builder, the worker and Airflow separate, we are actually able to scale out the number of workers that we want. For example, if a worker goes down, we can always recreate it without any disruption.

So, right. When it comes to Airflow, a job is a sequence of tasks. They're defined as Python code, and they're called DAGs. What DAGs are, is directed acyclic graphs. All that means is that you can have any kind of tree-like sequence of tasks; as long as it doesn't loop back on itself, it's fine. Everything is always defined in relation to, for example, its parents, so things can run in sequence. And there's a cron-like scheduler. Since we are moving away from cron and we are familiar with cron, it's not too far removed from what we know. Things like concurrency controls can be set per DAG: if I only want one run going at a time, it can be set so that the rest of the jobs will be queued up after it. So this is a huge step up from cron, because cron is not aware of the currently running tasks. We also have real-time logs that are viewable in the UI, and everyone likes a good UI. We can actually see the progress of the jobs as they are running. And one of the nice things is that there's inbuilt support for backfill. So if, let's say, I've said that this task is supposed to start on the 1st of July, and the server was down for a couple of days and finally comes up on the 5th, what Airflow can do is backfill for those days. And of course you can turn that off if you don't want it.

So okay, this is the first bit of code that we've seen. It's a bit small, but this is basically a snippet of code directly from the documentation to explain what a DAG is. The bits that you probably want to look out for are over here. When a DAG is created, as you can see, this will look very familiar to Python users, because it's basically just Python. You create a DAG, which is like a job, and within this job you assign tasks to it. So in this case there are three tasks, t1, t2 and t3. What happens inside each task is not very important, but each of them handles a single thing. And finally, as you can see, t2 is set upstream to task 1, and t3 is set upstream to task 1 as well. What this means is that task 1 will run first, and then task 2 and task 3 will run in parallel. So this is extremely powerful.
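The snippet on the slide isn't reproduced in this transcript, but the documentation example being described is along these lines. A condensed sketch (assuming Airflow 1.x; the task commands are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG is the job; tasks get assigned to it.
dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2018, 7, 1),
    schedule_interval="@daily",
)

# Three tasks; what each one actually does isn't the point here.
t1 = BashOperator(task_id="t1", bash_command="echo step one", dag=dag)
t2 = BashOperator(task_id="t2", bash_command="echo step two", dag=dag)
t3 = BashOperator(task_id="t3", bash_command="echo step three", dag=dag)

# t1 runs first; t2 and t3 both depend on it, so they run in parallel afterwards.
t2.set_upstream(t1)
t3.set_upstream(t1)
```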
People are able to build super complex graphs based on this, because it's all just about the relationships between the tasks that have to run. And one of the great things is that, for example, if all these things need to run in this sequence and task 1 fails, task 2 and task 3 won't run. This is a huge lifesaver when it comes to making sure that your prerequisites have been met before moving on to the next task.

So okay, since we have talked about the relationship between the builder and the worker, now let's talk about the relationship between the builder and the scheduler. This is an extremely simple job that basically just takes the schedules and loads them into the Airflow server. It pulls the latest code from the task schedules repo and sends it to GCS, and then into the scheduler. So that's all: it's just checking out the latest code and loading it to GCS and then into Airflow. The reason why we load it to GCS is so that we can use Ansible to reload everything on a restart. What that means is that if, let's say, we recreate the Airflow server, we are able to just take the latest state of whatever we have loaded and replicate it.

So this is how the scheduler looks. We have different DAGs, and each of them has different tasks within it. They can all run separately. There's a very visual interface showing what has run, when it ran, with a breakdown of which of the individual tasks have run and how long they took. The scheduler automatically loads these DAGs once we have sent them over, and we are using the SSH operator to communicate between the scheduler and the worker. That's how we are able to trigger jobs remotely. If you guys are interested in looking at some of the code that we have, or having a live demonstration, I'm open to doing it right now. So maybe I can just show how it looks. Let me just... So this is how Airflow looks, basically. We can see all the previously run tasks.

So this is one of the particular tasks that we are running right now; there's nothing really sensitive in it. It's basically getting data from a raw data dump in S3 from one of our third-party providers, and we need to sync it over to our internal systems every day. So in this case, the way that we have structured this task is that it does three things. First, it checks to see whether the data is in S3, because that's where the data lands. Then it performs the sync between S3 and GCS, and finally it loads it into BigQuery. The great thing about this is that it's a perfect fit for Airflow, because this data is beyond our control. We don't know when exactly it will come into S3. For this particular case, we have seen that the data comes in from 12 noon all the way to 4pm, so it can come in anytime. So what happens is that we have set up this task to start at 12 noon, but if, let's say, the S3 check fails, we have set up 10 retries, half-hourly. So basically, every half an hour, if the check fails, it will just wait half an hour and then try again. That way we don't have to worry about the task running when the data in S3 is not complete. Then finally, once the check passes, we move on to the next step to sync it over, and finally to load it into BigQuery. So this was one of those use cases that was really well suited for Airflow. All these small tricks can be done just because of the inbuilt features of Airflow. So once again, it comes down to the script itself; this is basically how the Python script is structured.
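As a rough sketch of how that S3-to-BigQuery task might be wired up as a DAG (assuming Airflow 1.x; the DAG name, connection ID, command paths, and the exact UTC schedule are all hypothetical):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

default_args = {
    "retries": 10,                         # keep checking until the data has landed
    "retry_delay": timedelta(minutes=30),  # half-hourly, as described above
}

dag = DAG(
    dag_id="s3_to_bigquery_sync",          # hypothetical name
    default_args=default_args,
    start_date=datetime(2018, 7, 1),
    schedule_interval="0 4 * * *",         # hypothetical: noon local time, in UTC
)

# Each step is a command run on the worker over SSH; "worker_ssh" is a
# hypothetical connection ID and the paths/subcommands are made up.
check = SSHOperator(task_id="s3_check", ssh_conn_id="worker_ssh",
                    command="cd /opt/tasks/sync && pipenv run python main.py s3-check",
                    dag=dag)
sync = SSHOperator(task_id="gcs_sync", ssh_conn_id="worker_ssh",
                   command="cd /opt/tasks/sync && pipenv run python main.py gcs-sync",
                   dag=dag)
load = SSHOperator(task_id="load_bigquery", ssh_conn_id="worker_ssh",
                   command="cd /opt/tasks/sync && pipenv run python main.py load-bigquery",
                   dag=dag)

check >> sync >> load  # each step runs only once the previous one succeeds
```

The SSH operator fails the task when the remote command exits non-zero, which is what drives the retry loop on the S3 check.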
So it's three separate commands over here. It runs the S3 check, it runs the GCS sync, and it loads into BigQuery. Even though they all have the same entry point, main.py, they can run as different commands. So you can think of them as just three separate commands. And this can run basically anything. It can run Python, it can run Bash if you want to, since this is going through the SSH operator. It's just running anything on a remote system.

Sorry, I'm not sure what RP is. Yeah, I guess so. No, it's not. There's a server that runs this. So there are two components when it comes to Airflow: there's a web server and there's a scheduler as well. To be able to see the UI, you basically have to run the web server process. And the scheduler is the one that kicks off the jobs; it does its own calculation to find out whether you have passed a certain interval and whether to run a job. Of course you can run things manually, but the power of this is that you are able to schedule things in a cron-like fashion. It's part of Airflow. Yeah. So for example, in Airflow here, let's just take a look. This is a schedule; it runs at 4:30. You'll notice that all of this is in UTC. It's recommended, at least in this version of Airflow, to stick to UTC. But I mean, it's just a small inconvenience. So, alright. Any questions, anyone?

Yes. On whether this can be resolved with pip freeze: so it's more like, for example, if I install, I can't think of any specific example, but a common package. In this case, what happens is that each of our tasks has its own Pipfile. So it's quite rare that you would install two different packages that have some kind of conflicting underlying dependency. For example, if I have task A and task B, we purposely separate them out, such that if they handle different things, they are just completely isolated. One of the issues that we had was with a lot of packages around oauth2. That caused a lot of grief for us, because a lot of the client libraries we were using depended on conflicting versions of it. So that was a huge pain point. A lot of other packages lock the top-level dependencies but let the underlying dependencies shift. So just having a requirements.txt would not be fully reproducible. The only way to do it is with Pipenv. It's really been a godsend.

No, it's not so much the requirements. Yes, previously it was all shared, and there's no way around it if everything is shared. The way that we work now is that there is a tasks folder, and everything below that is an individual task. They never communicate with each other, and they each have their own Pipfile and Pipfile.lock. So, like in this case for example, I only have these things, and I don't have to worry about them shifting, or someone installing something, or whatever.

Ansible is more of an automation tool; Docker is different from Ansible. So, I mean, when it comes to... okay, let's go back. When it comes to this slide, we use Ansible for two different things. We use Ansible to provision our instances. You can actually use Ansible instead of a Dockerfile to create a Docker container. One of the benefits is that it's a lot more configurable, and if you're using Ansible together with Packer, you can use the same Ansible code to create something that's not a Docker container. You can create maybe a Google Cloud image, you can create an AMI.
That's one of the cases that we use Ansible for. Another one is to communicate between already deployed instances. So, for example, the worker: the builder is a separate instance from the worker, and when we want to get some configuration, some code, from the builder to the worker, we use Ansible. It can run locally and it can run remotely as well. So it serves two different functions.

Alright, Ansible and Airflow. Airflow does not run Ansible. Airflow has some internal functionality to run SSH commands. Ansible is used together with Jenkins just as a bit of automation, a bit of glue to get the code onto the worker and the schedules into Airflow. Airflow itself does not run Ansible; it just uses, you can check it out, the SSH operator. There are different operators. For example, with the Bash operator you can run Bash commands locally on the Airflow server itself, and you can run Python as well, that's the Python operator. The one that we chose, since we need to run things on a remote server, is the SSH operator.

Okay, one of the limitations, at least in this setup, since we're using SSH, is that we need to define ahead of time where things should run. There's no automatic load balancing or automatic service discovery. If you're looking for that, there is actually a Celery executor, and you can also run the Kubernetes executor. That one is really, really new; I'm not sure whether anyone is using it in production yet, but you can check it out. So instead of pointing directly at another instance, you can use one of these distributed frameworks to run jobs and scale out automatically.

For now, it's getting into production and it's quite stable; it's been running fine. Airflow itself is still incubating, but I think a bunch of people are running it in production. If you're looking for something more stable and you're on Google Cloud, they have Composer, which is managed Airflow. But in the end, we wanted something a bit more flexible, because we were thinking of using this not just here but also on AWS. So if this pays off, we can actually run a whole bunch of other tasks on AWS as well. But so far it's been fine. We are slowly migrating some stuff into production, but maybe ask me in a month's time and see whether I regret my decision. So far it's fine.

So far, not that I know of. The resource usage is quite moderate, so it's not really eating away at everything just by scheduling jobs. That's one of the reasons why we broke things up into three separate sets of instances: one dedicated builder, one dedicated scheduler, and the workers. If anything goes down, we can just spin up, for example, a new Airflow server with a bigger memory capacity or just more cores. And there is an option built in to send you an email on retry and on failure, but since no one uses email anymore, you can just look at the UI, and I think there's a way to alert you on Slack if something fails. You can set things like the retries on the DAG level, so if it's a one-off error, then it can retry by itself.
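Those retry and notification settings are just arguments on the DAG. A minimal sketch of the kind of default_args being described (the address and the numbers are made up):

```python
from datetime import timedelta

# Hypothetical defaults applied to every task in a DAG,
# via DAG(default_args=default_args, ...).
default_args = {
    "email": ["data-alerts@example.com"],  # made-up address
    "email_on_failure": True,              # the built-in email on failure...
    "email_on_retry": True,                # ...and on retry
    "retries": 2,                          # let one-off errors retry by themselves
    "retry_delay": timedelta(minutes=10),
}
```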
If not, there should be some ways in there, but for now we just monitor the UI if anything fails. Right now it's daily jobs, so we put in retries as well. If anything fails, normally it's just a hiccup, maybe an intermittent failure, so it just retries by itself and that usually fixes it. I've never tried it before, but maybe you can; they have quite a lot of plugins. So let's say something doesn't really meet your needs: it's just Python code, so you can create your own operator to send out whatever messages you want. It's quite flexible. You can even send out HTTP requests, maybe if you need to trigger something on a remote server, you can do that as well. There's a whole bunch of stuff that you can do with it, but I think we've only just scratched the surface, so maybe in a while we'll try out all these fancy things. Alright.
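On that last point about writing your own operator: since it's all Python, a custom operator is just a subclass. A bare-bones sketch for Airflow 1.x (the operator name and endpoint are hypothetical, and it assumes the requests library is installed alongside Airflow):

```python
import requests  # assumption: available in the Airflow environment

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class HttpNotifyOperator(BaseOperator):
    """Hypothetical operator that POSTs a message to an HTTP endpoint,
    e.g. a chat webhook or a trigger on a remote system."""

    @apply_defaults
    def __init__(self, endpoint, message, *args, **kwargs):
        super(HttpNotifyOperator, self).__init__(*args, **kwargs)
        self.endpoint = endpoint
        self.message = message

    def execute(self, context):
        # Runs when the task is scheduled; a non-2xx response fails the task.
        response = requests.post(self.endpoint, json={"text": self.message})
        response.raise_for_status()
```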