Thank you very much. Ladies and gentlemen, welcome to our humble office. Actually, this is just one of the offices that we have; we also have another one on the 11th floor. That's why we plan to hire many more people. Just a brief introduction: my name is Didi, Didi Ahmadi. You can call me Didi. I work as an engineering manager here, overseeing a few teams: first cloud infrastructure, then security, data engineering tools, and also data engineering, which interfaces with a bunch of businesses across Traveloka, ranging across whatever products we have. Previously, I worked at a visual effects company. They make movies; I mean, like Star Wars and things like that. It's called Lucasfilm. I'm dealing a lot with infrastructure right now, and I used to work in telco and some healthcare as well. I think it's my 12th year in Singapore, so time flies, I guess. Yes. The subject here is our journey. I mean, the title has changed a bit; it used to be, what, battle stories? "Battle Stories on Productionizing Spark at Traveloka". The main reason we chose "battle stories" is that not everything is sunshine and rainbows. The main reason we wanted to have this kind of sharing session is that we faced a lot of interesting things, some of them definitely not fun, and we'd like people to learn from our mistakes. That's the thing. So it's not about fancy stuff. It might be fancy; I mean, you decide whether it's fancy or not. But that's where we're going to start. Next, please. So, this is the Spark tech stack that we have right now. We do data processing, which is ETL, and we do data exploration. I'm trying to keep it simple here, because everything can be lumped together into those two things. And we start from 2016, not 2017; we'll explain a bit more about what the tech stack looked like in 2016.
We started with self-managed Spark clusters. I mean, how many of you have ever set up a self-managed production Spark cluster on your own? Cool. One, two. How fun was it? It was not fun. Precisely, right? Okay. Cool. So it wasn't fun. But why did we do it? Because I think it was the best option available at that time, and we were just curious. And of course it's not fun; we'll explain more about why later. Then we introduced a bit of EMR for certain backfilling and things like that, but the overall stack stayed the same. By 2018, a lot of our data was already on GCP. So we asked ourselves, what's next? Most of the stack at that point was on AWS, and we still use AWS for a number of use cases, but a lot of our stack was already on GCP. So we decided to move a number of workloads onto Databricks. We do have our own internal code names, but we prefer not to share them for now. What we're planning to explain in detail is the pain points. We're not going to talk too much about the easy parts, but rather about how we made the transition seamless, including the customizations. For example, we have hundreds of jobs; how do you change one line of code and move everything over without problems? We did the flip within about one day: just change it, and everything is sent to the new place. We also have a number of Qubole instances here being used for many things, but Qubole is not just for Spark, and what we're going to discuss here is just Spark. We still have Qubole instances for a number of things, and we still have a Databricks instance for now. And something I need to mention: at a certain company or organization scale, you cannot just leverage the existing offering as-is.
As I've mentioned to a few people, you need to customize it in a certain way. You need to build certain features. You need to build a wrapper on top of it. You need to build orchestration. That's basically what we're going to talk about. Even for self-managed, you might think you just need to run it, right? No. You need to optimize it. You need to build things to ensure that it's suitable for your use case. And then we have another code name, RS, on Dataproc: basically general Spark submission on Google Cloud, with schedulers, key management, and everything in between. So it's not as fancy a click-and-go experience as Qubole or Databricks, perhaps, but it serves a certain use case. Seventy, eighty percent is good enough for us, because it caters to a specific set of engineers with a certain skill set, not to everyone. Okay, I don't want to talk too much, because a lot of the details will be presented by Nisrina and Andri, and I need to keep things brief. So basically the next session goes to Nisrina. There you go. Okay. Hi. My name is Nisrina, and thank you for coming. I'm going to begin by talking about the ETL part of Spark. Just to give you a bit of background, this is how our pipeline looked around 2016. No big data tech stack at all. All of our data, tracking data, production data, everything was in MongoDB, either sharded or just read replicas. And our ETL, vanilla Python code, was running off one node, one gigantic node that we kept upgrading. Yeah. So that obviously doesn't scale very well, and that's where Spark and all of its friends come in. We needed something that scales horizontally really well, and as a processing platform we chose Spark because, well, it scales really well.
Python at the time was also a big reason why we chose Spark, because that's a language a lot of our data engineers are familiar with. Okay, next. For the first iteration, it was simply a single fixed-size cluster. We didn't start off having it this big; I think in the end it was about 20-something r3.2xlarge EC2 nodes, around one terabyte of memory in total. We used high-availability mode. At the beginning we didn't use high-availability mode; it was just one master node, and then, a few accidents later, we decided we needed failover. Some ancillaries: three ZooKeeper nodes to support the HA mode, and one separate node for the Timeline Server. What else? Right. The distribution was HDP, Hortonworks, provisioned with Ambari and some Ansible playbooks. The provisioning part is one thing we worked on a lot: the Ansible playbooks because we want things to be reproducible, and Ansible is what a lot of other engineers in our organization are using. The Ambari part is specifically for the Hadoop stack, because we first tried to write Ansible roles to install Spark and friends ourselves, but it's just complicated. And we were running around 250-something ETL jobs, mostly with runtimes of a few minutes to a few hours. So they're not very big jobs; they're kind of medium-sized, maybe in the ten minutes to two-to-three hours runtime range. We were running Spark version 1.6, so still mostly RDDs. Okay. You've probably seen this diagram a hundred times before if you're familiar with Spark: just your usual setup of two masters, some ZooKeepers, and the Spark workers that we kept scaling out as needed. And we were using Airflow to schedule jobs, and because we run spark-submit from those Airflow nodes here, we install the Spark client there too, so spark-submit runs locally on them.
And yeah, those Airflow nodes are pretty static; this is where we release scripts. The worker side is the one that kind of keeps growing as needed. That's the background. Okay. All right. So what I think is interesting to share is the provisioning part. The cluster configuration is mostly vanilla, with several tweaks we made, and this is how we mix Ansible and Ambari for provisioning. Generally, we have four Ansible playbooks for each node type. A node type is, for example, the Spark worker, the Spark master, the ZooKeeper node, the Airflow worker. For each of those, we create these four playbooks that define different things. We tend to do it uniformly this way just for, you know, uniformity, so that other people who are familiar with this format can work on it together. The first playbook defines the resources: say, for my Airflow worker, what are the instance types, where are the volumes, how am I going to mount the volumes, things like that. That's one playbook. The second is the VPC rules, and this is quite important because we separate our nodes into VPCs in AWS. There's one playbook that defines solely inbound and outbound rules; so, for example, a worker node can only expose a certain port to a master node, that kind of relationship. The third is software provisioning. This is where, if we installed Spark with pip, we would just put it, but we don't; it's the other pip libraries that we use for transformations that are installed in this playbook. And the fourth one is the monitoring rules. We're using Datadog for monitoring.
It's mostly just Datadog checks. And the Ambari part provisions the Hadoop stack, Hadoop and then Spark, to complement playbook number three there. Oh, this is just a screenshot of one of our Ansible playbooks for provisioning. So it's like, yeah, PyMongo, cryptography, pycountry, things like that. Those are just a list of things that we install, so that whenever we add a new node, we can just run this playbook and it's reproducible. And users submit their scripts via an Airflow operator. So when they want to write an ETL, first they write the Spark portion, the one that actually contains the processing logic, and then they schedule it via Airflow. Airflow in our team is used to orchestrate a lot of other things too; for example, after running your Spark job, you write it to S3, and then from S3 you load it into Redshift or some other place. So Spark is just one part of the entire pipeline. We created a custom operator; this extends the BashOperator, if you're familiar with Airflow. It essentially gets translated into a spark-submit command. If you look at the arguments, it's pretty obvious how they translate to spark-submit: there's the master and cluster deploy mode, and then you've got the Spark arguments. Yeah, we use Datadog for monitoring. In a production cluster, especially if it's just one cluster, it's important to keep it alive 24/7, and monitoring and alerting are important. We initially only used Ambari, which has a built-in kind of alerting feature that you can use, but we ended up moving to Datadog just because of the richer interface; it's easier to configure things.
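To make the operator idea concrete, here is a hedged sketch of how the Spark arguments might be rendered into the spark-submit string. In the setup described above this string would become the `bash_command` of a BashOperator subclass; the function and argument names here are illustrative, not Traveloka's actual code.

```python
def build_spark_submit(application, master="yarn", deploy_mode="cluster",
                       spark_args=None, app_args=None):
    """Render a spark-submit command from a dict of Spark arguments.

    spark_args, e.g. {"num-executors": "10", "executor-memory": "6g"},
    becomes the --flag value pairs; sorted for a stable command string.
    """
    spark_args = spark_args or {}
    app_args = app_args or []
    flags = " ".join("--%s %s" % (k, v) for k, v in sorted(spark_args.items()))
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if flags:
        parts.append(flags)
    parts.append(application)   # the ETL script with the processing logic
    parts.extend(app_args)
    return " ".join(parts)
```

In an Airflow custom operator, this string would simply be passed as `bash_command`, which is why extending BashOperator is enough.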
And it has all sorts of Hadoop-related checks that you can use, and it's fairly convenient: you configure a few things and you can get various metrics from the ResourceManager and the NodeManagers, streamed to the Datadog service, and then you can just build dashboards and alerts on top of that. Yeah, this is roughly what our monitoring playbook looks like. So, for example, I think this one is for one of the master nodes: we plug the URI into the YARN integration, the HDFS NameNode, and, oh yeah, the Spark URI. And there are several other checks that we use. For example, if you want to check that your ResourceManager process is alive, you can just use the process check to see if a given string appears in your usual ps aux output. And there's the TCP check for several ports; we keep checking that port 8088, the ResourceManager web UI, is accepting requests. Yeah, I kind of added this slide because it looks way nicer than what our previous dashboard looked like. You can build this kind of thing on top, after you stream everything in. Right. So, to conclude the self-managed part: it's fairly stable. It works okay. But if we could improve a few things, this is what we'd want to improve. Our ETL mostly serves dashboards and reports that people expect first thing in the morning when they get to the office, so it usually runs from about two to eight AM, before people expect the results. It's idle most of the rest of the time. It really is just sitting there doing nothing. So there's lots of room for cost saving, and scaling up still requires manual work.
If you want to add nodes, that requires us to run the Ansible playbooks manually, and then we need to configure some things in the Ambari platform manually; it's half a day to one day of work to really keep things stable. We also want more cost transparency. As Didi's part of the presentation showed, we have many products, many sub-teams, many business units, and sometimes you want to know exactly how much a given product is paying to run its ETL. You have this huge cluster, but you don't know who is actually taking up most of the resources, and we want that transparency. Also, upgrading from version 1.6: we tried it once, upgrading 1.4 to 1.6 on Ambari, and that wasn't very straightforward; lots of trial and error, essentially. And less time spent managing the operation: keeping a cluster alive 24/7 comes with a sustained cost. So, yeah, conclusion: we want transient clusters. Just to give you an idea of how underutilized the cluster was: this line is the maximum, the one-terabyte memory capacity, and this is the actual usage. So yeah, look at all that free space there. It does hit peaks in certain ranges; this graph kind of smooths it out, but it hits peak only for a short amount of time. Right. So we're moving on to transient clusters. Instead of just spark-submitting to an existing cluster, we spin things up, run the job, and shut it down, which should be more efficient cost-wise. And also, we did some preliminary calculations. We have the Spark argument data, so we have things like number of executors, amount of memory, vCores requested by each script. And since we use Airflow, we conveniently have the runtime duration for each script.
And we calculated the gigabyte-hours and vCore-hours for each script, and from that we got an idea of how much of the entire cluster we were actually utilizing. It was only 20 percent. Yeah, sure. The major work here was basically making sure we were moving in the right direction. We made the case by calculating, from the past data we have in Airflow and other sources, the actual numbers: the current setup was only using a fraction of the capacity, so why not use something transient instead? Right. Thank you, Nisrina. So, number one again: we don't have to maintain a long-running cluster. Transient cluster, okay, check. But which one? We ended up choosing Dataproc. For one thing, it spins up really fast, one to two minutes. And then, this was a big reason too, most of our other infrastructure was moving to GCP, and we don't want half here, half there. And it's also cost-effective. We calculated, with that 20 percent utilization, how much this would cost, and it's pretty good. Cost-effective here means not just comparing with one solution; we compared around four to five products, calculated everything against our users' workloads, and concluded that this was the most cost-effective for our users. Just be mindful that this is based on our users; if you do your own research, the numbers might be different. So it works for us, but you're encouraged to try it on your own. Yeah.
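The utilization estimate described above can be sketched roughly like this, assuming per-script Spark arguments and Airflow runtimes are available. The field names and numbers are illustrative, not the team's actual schema.

```python
def requested_gb_hours(jobs):
    """Sum executor-memory * num-executors * runtime over all jobs."""
    return sum(j["num_executors"] * j["executor_memory_gb"] * j["runtime_hours"]
               for j in jobs)


def utilization(jobs, cluster_memory_gb, window_hours):
    """Fraction of the always-on cluster's GB-hours actually requested."""
    capacity_gb_hours = cluster_memory_gb * window_hours
    return requested_gb_hours(jobs) / capacity_gb_hours
```

A similar calculation over vCore-hours would use the requested cores instead of memory; comparing either figure against the fixed cluster's capacity is what produced the "only 20%" number.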
Okay, this is, I don't know quite how to explain it, just the general flow of how the transient setup works. We first upload the script: every time a job runs, the Airflow worker uploads the script to a temporary GCS path, because the script changes sometimes, so we upload it as needed when it's about to be submitted. Then we create a cluster based on a certain spec, wait until the cluster is created, submit the job, and then delete the cluster. And if you remember, there was some stuff that used to be in the Ansible scripts, things like pip libraries that we installed and jars we had to download. Those are now in a common GCS bucket that is fairly static. And for pip installs specifically, we use an init script on the Dataproc cluster, so they're executed each time a cluster is spun up. The main difference from the standalone cluster is in steps two and four there: we needed to build something to create a cluster based on the spec. This has to be done automatically, because we want to migrate just by changing one line, and of course we needed to create a translation layer to figure out how big a cluster each job actually needs, without people defining it on their own. Nisrina can speak to the details of that part. Oh yeah, question: when you spin up Dataproc, do you point it at a central metastore? Because it spins up the entire cluster with Spark on it, so what about your metastore, your YARN config? Wouldn't it be easier to centralize that part? The metastore as in... oh, okay. All the specification you need to create a cluster is in that step two there, and it can be transient.
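The upload, create, submit, delete flow described above can be expressed as the sequence of `gcloud` commands each step would run. This is a simplified, hypothetical sketch: the real setup wraps these steps in Airflow with retries and automatic sizing, and the cluster name, region, and machine type here are placeholders.

```python
def transient_job_commands(cluster, region, script_gcs_path,
                           workers, machine_type):
    """Return the three gcloud command lines for one transient Dataproc job."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              "--region", region,
              "--num-workers", str(workers),
              "--worker-machine-type", machine_type]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark",
              script_gcs_path,
              "--cluster", cluster, "--region", region]
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              "--region", region, "--quiet"]
    # delete should run even if submit fails, so the cluster never lingers
    return [create, submit, delete]
```

In practice each command list would be passed to something like `subprocess.run`, with the delete step in a `finally` block so an ephemeral cluster is never left running.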
We don't really store anything other than things like logs, and those are stored by Dataproc; it has its own logging, so the driver and job logs go into Dataproc. And for the Hive metastore, there's no connection to a central one. No connection. Yep. Next question: in case of any kind of event or sale, you'd need to scale up because the volume of jobs is high; how do you manage that? If I'm getting it correctly: how do we manage it? We roll up the cluster every time, right, with certain specs based on the sizing analysis. That's right. But at the time of an event or sale, you need a bigger cluster to handle all the jobs, so how do you manage those specs? Okay. So we still configure each script manually. For example, if one day a particular script really needs a much bigger cluster, then someone needs to manually change the cluster spec. Usually you can just use our backfill kind of code, so it's a one-time change and you don't have to change it for the rest. When we set this up, I don't think there was an autoscaler on Dataproc yet; that only arrived recently, and it's still in beta as well. So it's more suited to the use case that we have; we more or less understand the workloads. In cases where something has to handle a surge like that, yes, we'd still need to go and configure it. Again, the goal for this was mainly to move, because the pain of managing self-managed clusters outweighs what we still have to enhance. Think of it this way: Nisrina was effectively the sole person maintaining that kind of thing, on top of a number of her other work.
Having a single point of failure in terms of operations is definitely not good. Right. So when we thought about how to scale the whole operation while still managing everything, and still aligning with our strategy, we chose this route. The goals were: first, ease the whole manageability; second, bring the cost into more of a pay-per-use model; and third, make sure the transition is quick and fast while still serving the whole purpose. That's the process we went through. Of course, after this, all the data engineers working on our ETLs are basically responsible for their own Spark jobs. They know their own workloads, so we leave that to them. We put the overall pieces in place, make sure the transition is smooth, and after that we train them how to use it, et cetera. And the transition was quite fast; we were lucky to have the right people in the team. So there it goes; so far, everything is still okay. Okay, any other questions? Okay. Yeah, maybe the next part is relevant to cluster sizing. So again, we have 250-something scripts. How do we make sure each runs with its own configuration? There's already executor memory, number of executors and so on in each script's submission. So how do we translate that into an appropriate cluster size? And we want the migration to be as seamless as possible. We don't want to ask people, hey, can you look at your script configuration and translate it into a cluster spec? So we tried to automate it, and this is how we do it. Again, we have the Spark arguments from the Airflow operator.
We looked at most of the scripts, and this is what people usually configure; you've probably seen this many times: things like number of executors, executor memory, driver memory, cores, and the overheads. And the end result we landed on is actually fairly simple. We tried several things. For example, we tried manually accounting for how many containers each node would take, say if one node is one container; sometimes we tried grouping things together. But the solution was pretty simple: we just make sure that the total amount of memory and the total number of vCores match what the script requested. For example, if your script asks for 10 executors with six gigabytes of memory each, the end result needs at least 60 gigabytes in total. It could be, I don't know, 12 gigs times five; we're just matching the total. And then we choose the instance type with the minimum price; there are a lot of instance types available, so which one is cheapest? And in case there are several setups with the same price, we choose the fewest instances; we'd rather have larger but fewer instances, just to reduce network shuffling. And for a few cases, really few, that we need to handle manually, then yeah, we define and tune as needed. Right. So that's the whole cluster-sizing part. And again, one thing we wanted to do is make the ETL cost more transparent, and that's pretty straightforward with Dataproc: you can just label things, and the labels become available in the billing data.
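The sizing rule just described could be sketched as follows: match the script's total requested memory and vCores, pick the cheapest machine type that covers both, and among price ties prefer fewer, larger nodes. The machine catalogue and prices below are made-up placeholders, not actual GCP pricing.

```python
import math

# (name, vcores, memory_gb, hourly_price) -- illustrative numbers only
MACHINE_TYPES = [
    ("n1-standard-4", 4, 15, 0.19),
    ("n1-standard-8", 8, 30, 0.38),
    ("n1-highmem-8", 8, 52, 0.47),
]


def pick_cluster(num_executors, executor_memory_gb, executor_cores):
    """Pick (machine_type, node_count) covering the script's totals."""
    total_mem = num_executors * executor_memory_gb
    total_cores = num_executors * executor_cores
    candidates = []
    for name, vcores, mem_gb, price in MACHINE_TYPES:
        # enough nodes to cover both memory and cores (min 2 workers)
        n = max(math.ceil(total_mem / mem_gb),
                math.ceil(total_cores / vcores), 2)
        candidates.append((n * price, n, name))
    # cheapest first; on a price tie, fewer nodes win (less shuffling)
    cost, n, name = min(candidates)
    return name, n
```

With the placeholder catalogue, a script asking for 10 executors of 6 GB each resolves to two n1-standard-8 nodes rather than four n1-standard-4 nodes, even though both cost the same, illustrating the fewer-larger-nodes tie-break.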
So you can slice and dice how much each thing costs, filtering by, say, a certain script name. These are just some of the labels that we apply, things like is-backfill. We thought it's important to label backfills, because sometimes when you're backfilling you get a sudden spike in cost and you want to be able to explain that. And then script name, what product this belongs to, environment, things like that. Yeah. And after you label things, you can just build this kind of dashboard on top of the billing data: which script name costs how much, for how long. So, sorry that we need to blur this one. Yeah, it's mostly blocked, but just imagine something there. Basically, the idea is to have an attribution: this particular workload costs this much. We want that kind of material, and so far it's quite hard to get, even though for a company it's a common requirement: per run, how much does it cost? On the back end we can set the attribution per business unit; we know precisely how much each business unit's processing costs. Whether making it visible brings good or not, we're trying to just put the data out and let everybody take a look at it. Great. Okay, a little bit of review: how is it going so far? Overall it's been working really well. The Dataproc cost was really only 20 to 25 percent of the cost of the fixed-size cluster. Yeah, one thing, though: you know our data is in S3 and our processing clusters are in GCP. We knew there was going to be some data transfer cost, but it turned out to be non-negligible. It's still a saving in the end, because that 20 percent was such a huge factor, but if you ever want to do cross-cloud processing, you might want to check the transfer cost, just so you're not surprised.
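The labelling step above might look something like this sketch: every transient cluster gets a set of labels that later surface in the billing export. The label keys are examples, not the exact production set, and the sanitizer reflects GCP's general label rules (lowercase letters, digits, hyphens and underscores, limited length).

```python
import re


def sanitize(value):
    """Coerce a free-form value into a GCP-label-safe string."""
    return re.sub(r"[^a-z0-9_-]", "-", value.lower())[:63]


def cost_labels(script_name, product, environment, is_backfill):
    """Build the label dict attached at cluster-creation time."""
    return {
        "script-name": sanitize(script_name),
        "product": sanitize(product),
        "environment": sanitize(environment),
        "is-backfill": "true" if is_backfill else "false",
    }
```

Because these labels ride along into the billing data, a dashboard can then group spend by `script-name` or `product`, and backfill spikes are explainable by filtering on `is-backfill`.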
And how many clusters you can spawn is still bounded by some GCP quotas: things like the number of GCE instances per project, or how many API calls you can make to a specific endpoint per second or per day. So you still have to throttle somewhere; you can't just spawn infinitely. All right, that concludes the ETL part of the presentation. I'm going to pass this along to Andri. Yep, okay. Can you pass me the laptop? Oops. Oh yeah, sure. So yeah, thank you, Nisrina, for the microphone, and for explaining our batch processing environment, from self-managed to transient Dataproc clusters. So now comes the Databricks era. Please wait a second. Okay. So basically, in Databricks we do two things: one, a notebook environment, where we do all the data exploration, and two, scheduled ETL jobs. Using Databricks comes with a lot of advantages. For example, it's very easy to use: you have your code in the notebook environment, you can spawn Spark clusters, attach the notebook to the clusters, and voila, you can run it. It's very seamless. And there's a scheduler on top of it as well. And there are a lot of useful features in Databricks, such as auto-scaling, where you can auto-scale your Spark clusters, and a secrets feature, where you can store and retrieve your secrets safely without exposing sensitive information in your code. And yeah, easy monitoring and logging as well; everything is embedded in the UI and it's very rich. Having said that, there are some drawbacks as well. For example, right now there's no straightforward connection to GCP; connecting to a GCP environment is not that easy and requires a lot of manual intervention.
And there's no custom default template when spawning clusters; I'll explain that later on. And also, it's not so easy to monitor the cost of our usage; I'll explain that later as well. So here, yeah, we have our fair share of challenges using Databricks. The first is picking a suitable cluster configuration. Basically, lots of people care about the logic they want to write. They care about business problems. They care about getting things done. However, some of them might not know how to tune a cluster for their workload. So how would they choose their cluster configuration? This is the default page when I want to spawn a cluster for my job; it's the first thing I see. And some of the properties here are not preferable for our generalized use cases; in our opinion, it could be optimized better. But people who don't know how to tune their workload just go with these default configurations. And basically, we can't control what people see when they open this page, and we wish we could. Our solution for this is to provide some documentation, some basic best practices for people who don't know how to tune their workload. For example: enabling the Databricks auto-scaling feature; whether to spawn a few big clusters versus a larger number of small ones; mixing AWS spot instances with on-demand instances, because Databricks has a feature where you can use spot instances and fall back to on-demand in case the spot instances get reclaimed by AWS; and also: please use a dedicated cluster for scheduled jobs, so you have no dependency hell. Our second challenge was, yeah, the limited cost-monitoring features, in our opinion. This is the graph provided to us by Databricks.
Looking at this graph, it's not really apparent how to see things like our usage over a certain period of time. We believe some information could be added here, and the visualization could be made better as well. So our solution was to create a custom dashboard. Basically, we pull from the AWS and Databricks APIs and build our own dashboard. Here, instead of a metric such as DBUs, we use real money spent, and on top of the Databricks cost we also include our AWS cost as infrastructure cost, so we really know how much we spend on a specific project, a specific business unit, or a specific team. Our third challenge: yeah, unfortunately, we still find bare secrets in notebooks. Bare secrets, as in passwords or API keys in a notebook, which is not really aligned with our security best practices. So what's our solution for this? Secret scanning. We have a secret-scanning pipeline that periodically scans our notebook environment for these bare secrets, and then we put the findings up in a dashboard, our internal dashboard, for people to see and fix. Unfortunately, I can't show you the dashboard here, because it contains some sensitive information. Question from the audience: do you export the notebooks somewhere first and then scan them, or how do you scan? Yeah, exactly, we export them first. Oh, and by the way, the secret scanning is not just for Databricks notebooks; it's for everything in our code repositories. We keep track of the history of it, like who the biggest offender is, things like that. And if you work collaboratively, you need to send an announcement to everyone. Yeah, okay, so I'll continue. So here, some of our takeaways from using Databricks: Databricks, in our opinion, is an excellent Spark platform.
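A minimal sketch of a secret-scanning pass like the one described above: walk exported notebook or code text and flag lines matching common secret patterns. The patterns here are illustrative; a production scanner would typically add entropy checks and a curated, regularly updated ruleset.

```python
import re

# Illustrative patterns: quoted assignments to secret-ish names,
# plus the shape of an AWS access key ID.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|secret|api[_-]?key)\s*[=:]\s*['\"][^'\"]+['\"]"),
    re.compile(r"AKIA[0-9A-Z]{16}"),
]


def scan_text(text, source="<unknown>"):
    """Return (source, line_number, line) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in SECRET_PATTERNS:
            if pattern.search(line):
                findings.append((source, lineno, line.strip()))
                break  # one finding per line is enough for the dashboard
    return findings
```

Run periodically over exported notebooks, the findings list is exactly the kind of data that can feed an internal dashboard of offending files and lines.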
And it's very easy to use. Basically, you just get your own notebook, write your code there, explore there, then plug it into a Spark cluster and run it. It has a lot of fantastic built-in features such as auto-scaling, secrets, and falling back from spot instances to on-demand instances, which helps us cut costs. Logging and monitoring are easy as well; a lot of it is embedded in the UI. In our opinion, some features could be made better, for example, as I shared before, cost monitoring and cluster-spec enforcement. And we also hope for better GCP integration. So, our current ETL lifecycle in Databricks is that we do data exploration and development in the notebook. Then, once we are confident we can get the result we want, we put a scheduler on top of it, and that becomes our production environment. However, as we grow our engineering expertise, we believe this process can be made better, and since we are moving our infrastructure to GCP, we made our own in-house framework to tackle this. So, introducing RS. What's RS? Basically, RS is our code name for our in-house general orchestration framework on GCP. It leverages a lot of GCP technologies, such as Cloud Composer and Cloud Dataproc, which are basically managed Airflow and Spark on GCP. And our core focus here, I want to emphasize this, is to maintain simplicity for our users without compromising on engineering quality. We also want everything on GitHub for better transparency, so we can monitor how the code evolves. Some of the key components of our framework are a scheduler, an execution engine, a CI/CD pipeline, and secret management in the form of an SDK that people can plug in and use to store and retrieve secrets safely, kind of like the Databricks secrets I mentioned before.
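To illustrate what "secret management in the form of an SDK" might look like from a job author's side, here is a hypothetical sketch. The class name, method names, and the in-memory store are all invented for illustration; the real SDK and its backing store are internal.

```python
# Hypothetical sketch of a per-job secrets SDK as described above.
# Names are invented for illustration; the real SDK is internal and
# would be backed by a managed secret store, not a dict.
class JobSecrets:
    def __init__(self, job_name, store=None):
        self.job_name = job_name  # secrets are namespaced per job
        self._store = store if store is not None else {}

    def put(self, key, value):
        # Permission isolation: a job can only write into its own namespace.
        self._store[(self.job_name, key)] = value

    def get(self, key):
        # A job can only read back its own secrets.
        try:
            return self._store[(self.job_name, key)]
        except KeyError:
            raise KeyError(f"secret '{key}' not found for job '{self.job_name}'")
```

The per-job namespace mirrors the permission isolation described: each job can store and retrieve only its own credentials, rather than sharing one global secret scope.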
And we also have a CLI wrapper that users can use to interact with the whole system. And this is the overall architecture of our framework. So basically, as you can see here, users use the CLI to interact with the whole system. The beauty of this is that they don't have to know about the internals; it's abstracted away from them by the CLI. They just have to fire a few commands in the CLI to interact with the components. Basically, what they need to do is write their ETL job and submit it to the upstream repo. Once the upstream repo on GitHub receives their code, CI/CD kicks in, does all the necessary work, puts all the files in the necessary places, and the scheduler picks it up and runs the job. So this is kind of the engineering diagram, but I want to talk about the user's point of view too. From the user's point of view, the system is very simple. They only need to care about a few things: their ETL job, the list of its dependencies, for example Python packages in the form of a requirements file, and the user configuration, that is, how they want to run the job, in YAML. Then they just fire up the CLI, and the CLI takes all of this, packages it up, and pushes it into the system, and their ETL is in production in no time. So, talking about user configuration: this is the user configuration. They just have to tell the system, in YAML format, how they want to run the job.
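A hypothetical example of such a job configuration; the field names below are invented for illustration, since the actual schema is internal:

```yaml
# Hypothetical job configuration; field names are illustrative only.
job_name: daily_sales_aggregation
owner_email: data-team@example.com   # notified if the job fails
schedule: "0 2 * * *"                # cron expression: daily at 02:00
cluster:
  machine_type: n1-standard-8        # Dataproc worker machine type
  num_workers: 4                     # number of Spark workers
```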
For example: the schedule, the cron schedule, the machine type, the number of workers, that is, how many Spark workers you want, the owner email in case you want to be notified when the job fails, and so on. Basically, the system parses this YAML configuration and transforms it into an Airflow DAG, so it can be interpreted by the scheduler to run the job. So yeah, some of our feature highlights: secret management in the form of an SDK, as I explained before, where you can store and retrieve your secrets safely, and permission isolation as well. In this framework, each job has its own set of permissions, what it can do and what it cannot do, and the CI/CD pipeline and the scheduler have their own permissions as well. We also provide a backfill capability. We have a CLI abstraction that people can use to interact with the whole system without knowing the implementation. Also GitHub: we want everything checked into GitHub, because it provides better transparency, we know how the code evolves, and ownership attribution as well. And, same as with the cost dashboard earlier, we label every resource, so that we know how much we really spend on a certain business unit, certain projects, or a certain team. Having said that, we still believe there are lots of improvements to be made. It's still far from perfect, and we continue working to make it better. And yeah, that wraps up our story of using Databricks and building our own solution on GCP. So I will pass it to Didi. All right, cool. Just going to hold it like this. Thank you very much, Nina and Andres. I think that's pretty much the journey and history that we have right now.
Because of the time, the presentation just tries to touch on certain things, to give a hint of the kind of complexity involved, but we cannot deep-dive, because any one section could fill a discussion of two or three days, or the whole week: how it's done, all the details, and so on. So I think the next step is a bit of Q&A, in case any of you would like to ask something. Go ahead, I guess. Yes. I saw in your architecture diagram that you are using Gobblin after Kafka. What exactly is the reason for using Gobblin there? Can't we just use Kafka consumers instead of Gobblin? So the question is: on the first slide, why use Gobblin after Kafka instead of consuming from Kafka directly, right? Yeah, sure. Let's go back to the 2016 to 2019 slide so you can have a look at that setup. Okay. So Kafka is where we send our tracking data, and Gobblin, well, I think your question is, why not custom Kafka consumers? Why not just use Kafka consumers themselves, multiple of them? Right. Just for simplicity, I guess. I mean, Gobblin is essentially Kafka consumers. It's built exactly for this kind of thing: pulling data from Kafka and then dumping it into a nicely time-partitioned folder structure in S3. So yeah, just for simplicity. Any other questions? Cool. Yeah, these questions keep coming up. I have more of a question about cross-cloud connectivity. One of the challenges you mentioned is that the integration is a bit lacking. Do you have cases where you need a join computation across data sets in two clouds, something coming from S3 and something coming from GCS? Yes. You do? I mean, someone trying to join data from GCP and also from AWS? Yes, actually, that has happened many times.
Let me repeat the question first. I'm still holding my cup because of breakfast. So yeah, there are a number of cases in which, basically, the data is not on GCP yet, we need to pull something from AWS, and then we need to join both of them. We have the flexibility to do so. Pretty much every platform we have right now, all three or four of them, has to be able to do that. That's also one of the main reasons we have secret management here: basically to ensure that the whole platform can access data from S3 as well. So regardless of the location, you should be able to pull data from here, pull data from there, and just mix it together. I hope that answers your question. Yes, it does. I was asking because we also have people joining across two clouds. Another question, if I'm allowed. I don't want to block people whose questions might be more interesting. Yeah, go ahead. Okay, so my next question: I think you mentioned the challenge that you don't have control over how people configure clusters. So is it that for individual workloads you're creating a separate cluster every time? How does that deployment model work? In which area, by the way? I think you're talking about cost governance, cost control, which is one of the challenges you mentioned. Yeah, you're right. What we're trying to do is empower people with a certain magic at their fingertips. First of all, we like to solve problems, and we trust our people up to a certain point. We cannot lock them down first. We need to trust them first and then move forward. It means, for example, by having Databricks and Qubole, we give people the power to create. Okay, we trust you. You can create clusters as much as you want based on your workload. We just need to train you better.
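Coming back to the cross-cloud join answer for a moment: for a single Spark job to read from both S3 and GCS, the session mainly needs credentials for both connectors. A minimal sketch, assuming the s3a and GCS Hadoop connectors are on the classpath; the helper function name and paths are illustrative, and in practice the credential values would come from the secret-management SDK rather than being hard-coded:

```python
# Sketch of the Hadoop/Spark configuration a single job needs to read both
# S3 (s3a connector) and GCS (GCS connector). The helper name and paths are
# illustrative; credentials would come from the secret-management SDK.
def build_cross_cloud_conf(aws_access_key, aws_secret_key, gcp_keyfile_path):
    return {
        # AWS side: s3a connector credentials
        "spark.hadoop.fs.s3a.access.key": aws_access_key,
        "spark.hadoop.fs.s3a.secret.key": aws_secret_key,
        # GCP side: GCS connector plus a service-account key file
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
            gcp_keyfile_path,
    }

# With this conf applied to a SparkSession, the join itself is ordinary Spark:
#   left   = spark.read.parquet("s3a://some-bucket/events/")
#   right  = spark.read.parquet("gs://some-bucket/users/")
#   joined = left.join(right, "user_id")
```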
Some people do that religiously. Some people, because of time frames and things like that, just want to get it done and over with. So it comes with certain consequences. Without the flexibility to enforce things along the way, with some people doing cross-checks and some not, people can end up with a somewhat lazy perspective: oh, you know what, I can do this any time. At Trafaloka, we work in small, independent teams, and we give people the flexibility to act. Look, we trust you, do as much as you want, but please be mindful. This is why one of the key principles across all our systems is making sure we have proper attribution. Right? From maybe one or two years ago, we decided: look, we need to empower users, but we need to make sure people are responsible. The easiest way is to attribute usage to them with what? Money. Money is the easy number. Everybody understands that $100 is $100. Having that in place reassures us, as the people managing the platform: we give you power, and then we can say, okay, this is how much you spend. Is that fair enough? People can adjust accordingly. Hopefully that answers your question. Right. Anytime. Anything else? I'm trying to be mindful of the timing here. Yes, sir. Can you elaborate again: when you have 300-something jobs, and sometimes they start simultaneously, and some of them need one core, and some need 20 cores and 30 gigabytes of memory, does each job run in an independent cluster? Or do they sometimes run in the same cluster? How does it work? I think the last question was: when two jobs go to the same cluster, how do you split the cluster?
Okay, okay. The question is, when you have a number of jobs, hundreds of jobs... I think your question is more about the self-managed cluster, the first setup? About the current architecture, okay. For the current architecture, actually, you know what, Nina wants to answer that. She's quite eager, so I'll just give it to her. Right. So in the current architecture, we spawn a new cluster for every execution. So if there are 350 jobs running daily, there are 350 clusters spawned that run the work and are turned off exactly after they're done. Was that all thanks to Ansible? Ansible was more essential in our previous setup. The current setup is simply a Python API call. Dataproc has an API that we can call when we want to spawn a cluster with a certain spec. We just call the API, wait until it's completed, and then we know the endpoint. Then we call another API to submit the job. The Ansible part was more for provisioning things back when we were managing it ourselves. Can you do the same with Databricks? Do you run a separate cluster per job in Databricks as well? You want to answer that? Oh, sorry, again? Do you run a separate cluster for every job in Databricks as well? Well, for our scheduled jobs, yes, we spawn a new automated job cluster. But for our EDA purposes, we spawn an interactive cluster for all the people to use. Does that answer your question? Yeah. Okay, a follow-up question then. Do you see lower utilization of each cluster? Because every cluster has some overhead, right? So do you see an opportunity in squeezing jobs into the same cluster simultaneously, or does that not make sense?
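As an aside, the per-execution Dataproc flow just described (call the API to create a cluster with a certain spec, wait, then submit the job) can be sketched through the request payloads involved. The field names below follow the Dataproc REST API, but the helper names, cluster name, and sizes are illustrative, and the actual API calls and polling are elided:

```python
# Sketch of the per-job Dataproc lifecycle: build a cluster spec, create the
# cluster, submit the job, then delete the cluster. Payload fields follow the
# Dataproc REST API; helper names and sizes are illustrative.
def cluster_payload(name, machine_type, num_workers):
    return {
        "clusterName": name,
        "config": {
            "masterConfig": {"numInstances": 1, "machineTypeUri": machine_type},
            "workerConfig": {"numInstances": num_workers,
                             "machineTypeUri": machine_type},
        },
    }

def job_payload(cluster_name, main_py_uri):
    return {
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {"mainPythonFileUri": main_py_uri},
    }

# Lifecycle (API calls elided): create cluster -> poll until RUNNING ->
# submit job -> poll until DONE -> delete cluster.
```

Because the cluster exists only for the lifetime of one job, there is no shared YARN queue to split: each job gets exactly the spec in its payload.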
Well, some of the reasons we use a new cluster for each scheduled job: first, we don't want any dependency hell, so each job has its own environment, kind of like a Kubernetes environment, and the dependencies won't battle with each other. Second, sometimes we do trial and error. For a job, at first we don't know how many resources it will use, so we do a trial run, and since, as I mentioned before, the logging and monitoring are very easy, we do some monitoring there, CPU usage and memory usage, and we tune the cluster based on that. And the third thing is that Databricks offers a much lower cost for scheduled jobs than for interactive clusters. So yeah, that's our reasoning. If you want to tap into more knowledge about Databricks, I don't want to interrupt, but there are two or three solutions architects from Databricks in the room today. Two of them are wearing Databricks T-shirts, and I don't know why. So if you catch us later offline, we'll talk about instance pools and related features, which probably work differently from YARN clusters. Yeah, we have another five to seven minutes, so it's all right. One question. You talked about moving from AWS to Google Cloud, right? And I find that very interesting. Apart from this transient spinning up and tearing down of clusters, what were the other factors? Because you're moving a lot of other stuff to Google Cloud as well. Can you please highlight the different factors? It's a big decision, right? Yes, it's a big decision. I cannot explain it end to end in detail in this kind of forum, but there were a number of key factors in deciding what kind of platform to build on. The decision that made sense for us at that time might not be suitable for the decision you need to make right now.
I'm a big believer that as long as you do your best at any given point in time, you won't have regrets. So around 2017, I guess, we realized that if we wanted to rebuild everything, I mean our systems on AWS, there were a number of things we would have to build from scratch, and the proposition from GCP at that time was much more compelling. Another key factor was, for example, BigQuery. BigQuery is really, really compelling. Nowadays the competition is quite tough, right? There's Redshift, Redshift Spectrum, Azure, for example. We tried a few things, but we felt that when rebuilding our stack on GCP, with what they offer right now, for certain stacks and certain use cases it made much more sense and would perhaps bring us certain benefits compared to AWS or compared to others. I mean, I cannot disclose right now which things exactly and when. But we still have a presence in AWS, for example, and if we feel that a use case is more suitable on AWS, we say: stick with it. We are quite practical, and we keep evaluating what's happening in the market, because we live in a very interesting world. Things change every month, or perhaps every three months. And if we find that something looks suitable for us, and after our due diligence it gives us more benefit, we say: why not? I guess our decision might be different from your company's or your organization's, so there are a lot of variables here. The point is: for certain use cases we have right now, it's much more suitable to build on GCP rather than on AWS, and for the use cases that are still good on AWS, we just keep them as they are. That's just being practical about the business.
Hopefully that answers your question. All right, anytime. Anything else, I guess? No? So let's take a five-minute break. All right. You can stretch your legs, have a coffee. The second part will be about the Delta announcements that came from the Spark + AI Summit. And if you need any more information, yeah, we're going to be mingling around here in case you would like to ask something.