Okay, so my name is Raymond. I work in Scala most of the time; it's what pays the bills. My work is really about building a modern data platform, or rather data pipelines. I found this picture yesterday and ripped it off. A lot of what I'm going to talk about is very dependent on the context I'm in right now, and the solution I picked is also situational, but I tried to pick the right tool for the right job. So, small disclaimer: this is my interpretation of the problem. You might have some friendly disagreements, and I definitely welcome those ideas, but at this point in time this is basically what works for us.

This talk is divided into two parts. The first part is mostly about the problem itself, the team, and so on. In the second part we dive into the more technical stuff, so I won't be flooding you with lots of Scala code like I did last year.

I think it's befitting to start by telling you about the company, what we basically do, and the nature of our systems. In a nutshell, we are B2B. Whatever goes on behind this side, which is us, versus the customer side, is completely hidden. I think this is quite familiar to everybody, if you have a good tolerance for this kind of stuff. Typically, at least in our setup, these are roughly the components we already have. The company I currently work for is five years old; we no longer classify it as a startup per se, so you could say it's a post-startup kind of thing. These are the usual things we have inside, and you will probably find them very familiar: machine learning pipelines, caches, and so on.

But that was the problem. Five years is a long time for a startup, and I've only been with this company for the past year. One thing I noticed is that as the system keeps growing from its initial stage, where it was a single proof-of-concept system, it grows and grows. Documentation used to be top notch, but over time it detaches itself; the tests no longer test what they're supposed to do; and people start having their own interpretations of the problem. I'm sure that sounds familiar. Over time the coupling became very tight, the system became harder to debug, not to mention the performance problems it introduced across the board, and it was generally a real pain for everybody.

I walked into this knowing it. Before I joined, I asked the CEO and the technical director what state the system was in, and they basically laid it out for me. I thought, okay, I'm here to fix up the ship, so that's what I'll be doing. But when I joined the company I told them one thing: you already know there's a problem, so you must give me the freedom to change the system in a way that will help you grow. The point of this whole thing is really: don't cling to a mistake just because you spent a long time making it. I don't know where I got that from, but I ripped it off, and really, don't.

The reason is that I've been building data systems for a few years already, and I've noticed the entire scene has changed.
It used to be Hadoop and MapReduce; people wrote scripts in Pig, et cetera. It worked until you pushed to production, then things were falling left and right and all kinds of problems started coming up. The industry, together with academia, ramped up pretty fast from 2013 until today. In the last five years you've seen a plethora of tools you can use, literally, but none of them solves the problem 100% of the time. They only solve a part of the problem at a particular time.

So when I jump in, the first thing I always do is face reality. What I mean is: when you jump into a system, you have to migrate not only the system but, more importantly, the team together with the system, away from a very monolithic approach, and start taking steps to break the system down so it can be reasoned about. One of the problems we found is that when the coupling in the code becomes too tight, reasoning becomes very difficult. You must have seen code like this before. So for me the important thing was to assess how much change is actually needed versus how much change I really want to make.

I always like to show this picture, because at this point in time the whole system has been running already. The CEO looking at this system sees that it's running. It seems to be running; it's not very fast, but yes, it's running, so why spend time rebuilding it? From anybody outside the system, everything looks the same. Once you lift the hood of the car, everything looks the same: a V8 engine is still a V8 engine, but people sometimes forget how many pistons there are. One thing I noticed is that when you start analysing the system, you have to understand there are a lot of moving parts inside, and each of those moving parts is usually made up of smaller moving parts.

Has anyone played this game before? I suck at it; I can only ever get one side done. But there are far more clever people than me. When you start analysing this stuff, you notice that when you attempt to change A, it affects B; B affects C and D; and then D affects somebody else. You thought this was manageable, and then I went in and saw that the number of variables inside the system was a lot more than I anticipated. Naturally you start to panic. The first rule of a system migration is: never panic. Slow down, relax, tackle one thing at a time. If somebody can design an engine and put it together, then this four-piston engine is built up from its individual parts, and if it can be put together, it can be taken apart. The key thing is doing it incrementally. Personally, depending on the complexity of the systems, I typically spend three weeks to about three months to fully identify the data flow patterns, the black boxes, and so on.

And then comes this general idea: designing a system is actually quite hard. Initially, if you're in a startup, designing a system is very easy, because the number of ideas you have to validate at that point in time is actually very small.
But as you keep building the system and adding features, and the company, like the one I'm serving now, has been going for five years, there's really a lot going on inside. So I tend to focus on a few things. I always try to think in terms of black boxes, and I always look at the data flow patterns, because these are the very important things people sometimes ignore. I'm a programmer myself, so it's very important to be able to lift yourself out, put on an architect role or whatever you want to call it, and start looking at how components talk to one another. When components talk to one another, they're not doing something idle. Well, you can find some dead code in the systems, that's fine, but there must be something driving the systems to communicate so that they generate revenue for the company. Those things come first, because those are the key things.

The next thing I look at is data correctness. What I mean by data correctness is basically: how correct does the data need to be at any point in time? When you're building these systems, people will tell you: go use Cassandra, go use HBase, go use this, go use that, go use Kafka with exactly-once semantics. There are so many things to pick from, so how do you make sure things fit and gel together? It's actually a very difficult task. My own personal habit is to read the papers first, let the paper tell me what ideas it's trying to solve, and then go test it out. Typically it's a long, laborious process, because you read the paper and start wondering about certain things; it could be that they hadn't implemented it at that point, or they hadn't actually figured out the solution for it, in which case they might as well not have said it.

The last point is that during the analysis of a system, depending on how big the thing under your charge is, you will be flooded with a lot of information. There will be people who wrote code that does this and that but doesn't fulfil any business value at all. They just chucked it in to realise something and then closed the JIRA task, and the manager said good job, but it doesn't tie back to revenue. If you look at it, it's basically dead code, redundant code; it doesn't do anything. Once you figure out how to separate the systems and start looking at which things you should care about and which things you can chuck aside, you will be faced with the ultimate temptation: I want to design the ideal system; this will be my legacy. Please, please don't go down that route, because it doesn't exist. I studied a bit of the CAP theorem from Eric Brewer, and that tripartite relationship tells me there's no perfect system; it just can't happen the way we imagine it. That sort of helps me ground myself.

In the analysis I sometimes really want to give up; some of the people who built the code and are still with the company today also found it laborious. The thing I keep telling myself is that the engine can be broken into its separate parts.
And I keep telling myself this as I keep analysing, because my tendency, when I look at the system, is to look at how the data is flowing and then decide. Most of these things are hidden within functions, classes or whatever, so you have to go in and understand exactly what it is doing. Why? Because I have to convince my team that this new thing we're going to do is the better way, and that this other thing can be chucked away. For me to do that, I have to bring the translation as close as possible to what they are doing. So this is something I always tell myself: you can't really change something if you don't really understand it. Maybe you should tell yourself that as well.

Once you're done with the analysis, you have this semi-grand plan and you're ready to inspire your developers to join you on this epic journey. But before you go there, you need to understand the makeup of your team. What kind of people are they? Are they hardcore? As I was telling Tony just now, when I went to the Haskell meetup group, some of those guys are way smarter than me; some of the things they say I can barely comprehend, it's really beyond me. Knowing where the team is at is actually very important, because they are the ones who will help you build this whole thing, so you need buy-in from them.

It also boils down to my own personal belief that when you are building a data pipeline system, you can never get it to perfection, because the moment you release your data pipeline, people are going to ask you to add features, do this, do that. You don't really get a chance to redo it again.

The third point is really about effecting change in a team. If the team is already a bunch of software engineers, sometimes getting them to change their mindset is the hardest part. This is where I try to convince management that training them, giving them the time to learn, is the correct thing to do, because you have one advantage that no other company has: these people are already working in the company, so why aren't you building them up? This point is particularly important because a lot of the time management forgets what it's like to learn something. Learning requires time, and for that knowledge to become really effective requires even longer. The point I'm trying to make is that when you're building a system, you get to a point where you start seeing performance improvements across the whole system.
Then management has this tendency to come in and say: we're ready to launch, I'm going to talk to HR, I'm going to talk to comms, I'm going to put out the publication, everything is ready. And they forget this whole painful lesson you just went through, spending months arming the team with the mindset of how to learn, and then they just rush through. I've worked on a few projects in the past where, once the system was released, features kept piling up, because you had actually sold the team the idea that the organisation was supporting this initiative to learn and apply, and suddenly they realise it's just bollocks. So this is probably the key message for the stakeholders: please remember, there is no perfect data architecture. There's no such thing at all. You only have something that works for the current situation, because what you're going to need in the future is going to be very different.

In 2013 I ran a research centre that built a data aggregation platform for Hewlett-Packard Labs itself. We started that project, ran it for about three years, and did something most people couldn't do with a very small team: we actually sold commercial licenses. Once that was done, the whole thing disbanded. But five years on, the stuff we sold customers back then is completely irrelevant. They're asking for real-time data analytics, real-time machine learning, and so on. Nobody has figured out the answer to that, nobody; you can try to get as close as you possibly can, but it's really difficult.

Why am I showing you this slide again? It's to understand that there are always trade-offs. Think about this: the technologies we use today are purported to support the data sizes of the future, but few companies actually have the data size to exercise, prove or disprove what those companies claim. What is clear is that there's going to be more data in the future than there is today. Within five years, on my own personal timeline, I've seen so much change; I'm pretty sure there will be even more rapid change in the next five years.

So the key point is to create a culture of learning and an appetite for adventure. Not just the second one; developers already have the second one, but not really the first. It's really difficult to have an organisation that supports both, but the point I'm trying to make is that the organisation must encourage its engineering team to continuously try things out, to just keep on trying. It's like that famous 20% thing that Google keeps advertising, which is completely fake. Anyway, that's my take. That's basically the end of part one; I was wondering if you have any questions.

Question from the audience: you mentioned a change of mindset. What kind of challenges did you face in changing mindsets, and how did you deal with them?
Oh wow, okay, that's a good question. One of the cornerstones is probably that when developers start building code, they think monolithically. They say function A takes in something and returns me something, so if I have to expand on it, I naturally build function B, and then something else, and something else. When they build monolithically, they build these structures, but they don't really know how to take things apart. So it sometimes doesn't dawn on them that they could do function B in a different way, that they don't have to be constrained by the language itself, and sometimes that feature B has already been solved by an open source technology, if you have time to go and explore; we just didn't know. If you look at the whole ecosystem, most of the time, at least in my experience, developers aren't given the opportunity to go and explore. Meetups are a good way to explore things, conferences are another excellent way, and so on.

Another thing I discovered is that most of the time they don't really know how to use multi-core, multi-threaded processing. For example, not to diss the Python fans here, but Python has no real multi-threading programming model. It doesn't have one, no matter what the documentation says. So what's my preference? I always choose a language that has a memory model. The memory model dictates what rules the program will obey in a multi-threaded environment, whether it has data races, and so on. I don't really waste my time looking at the others. That's the other point I wanted to make. Any other questions?

Question from the audience: you want to let developers have time to learn and explore, but in the real world people are constrained by time, money and resources. First, how do you convince management to give developers that time? And second, how are you able to speed the process up?

Ah, okay, wow, this is a bit subjective. To answer the first part: it depends on how you frame it to your senior management. A lot of the time I ask my mum to take care of my son, and sometimes she forgets; she hasn't raised a son in 44 years, and raising a son today is very different from the time she raised me, so she forgets about things and obviously there are conflicts. To draw an analogy, sometimes the manager is too far away from the engineering organisation, and sometimes that manager or CEO didn't come from an engineering background, so it's very difficult for them to empathise with the situation. So the question becomes: how do you get them to empathise? In my current role I'm quite lucky, because the CEO himself came from an engineering background, so it was easier for us to talk face to face; he can relate. I don't have a perfect answer; it really depends on the situation you're in.

The second question was how to help them speed up. We actually engaged a consultancy to come on board so people could learn from each other. To be honest, we tried this approach: we engaged people you might have heard of, AUT E. I like them because I worked with them before; they bring new ways of thinking about the problem and force you not to repeat certain mistakes, and they do it in a collective manner: you join them for pairing sessions for five days.
I found that very helpful for me back then, many years ago, and I've continued to engage with them. But that also requires management buy-in; they have to believe these people should be trained, because if you don't train them and they stick around, what does that mean? You know what I'm saying.

The other part is that I asked management to lay off the KPIs. As companies keep growing, they put in these things called OKRs, performance management systems, and they try to measure and reward certain behaviours and penalise others. I basically told them: why don't you take another route? You can still do OKRs, but reward people for learning, reward them for sharing what they've learned, like going to conferences and sharing what they learned there, or for using a new technique: say they learn a new programming technique and apply it to the current problem. If you're a team lead yourself and you browse their code, you'll typically see the change; reward them for that. I don't know what you want to give them, maybe a book voucher or whatever, but have the organisation reward them, and have it done publicly. I find it very useful to have it done publicly, because when somebody gives somebody a present and everybody else in the room sees it, it encourages that kind of positivity. It can't be a half-baked idea; it has to be believed to a certain extent. Any other questions? No? Alright, let's talk tech.

So this is the problem we actually have. The top one is the problem we have, or rather are transitioning out of, and the second one is my idea of how things should be done. When you're at university you read these textbooks, and they always say the way to avoid technical debt is to make things modular, and then they show you all kinds of constructs for making things modular, and you don't really quite understand what they're trying to do. The problem I'm solving is the breaking up of data pipelines. We basically have a monolithic data pipeline, typically written in one programming language, with strong coupling, and when this pipeline runs, it runs in a single place: single node, single CPU; I call it the executor. What I want is to reach this second picture: I want to be able to deconstruct the pipeline into its individual components so that I can string them together and they can run. I'm happy to say we've done that.

The second point I wanted to make is that if you look at these two pipelines, they're basically the same, aren't they? This one is saying that D will consume from B and C, and all of them consume from A. That is the same thing as function composition. It's the same, except that during execution it might look slightly different, but the core idea is always the same. I have a tendency to separate the computation from the execution of the computation. What I mean is: in this situation, why does A need to execute together with B and C and then eventually D? Why do I have to do that? If I can break things apart enough, I don't have to.

So came this quest of mine. I spent three months in the same company on a project that uses Apache Flink; has anyone used Flink before? Flink is awesome. Flink is literally a data pipeline engine that understands both streaming and batch. What does this have to do with anything? What I'm looking for are three important things: first, a proper abstraction over streaming and batching, which is paramount; second, I want to be able to decompose the pipeline; and finally, I want to separate the execution from the underlying implementation.
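To make the composition point concrete, here is a minimal Scala sketch of the idea, not our actual code: each stage is just a value wrapping a function, the pipeline is ordinary composition, and a separate runner decides how and where the composed thing executes. The names (Stage, runLocally) are illustrative only.

```scala
// Illustrative only: a pipeline described as composable stages, with execution kept separate.
final case class Stage[A, B](name: String, run: A => B) {
  def andThen[C](next: Stage[B, C]): Stage[A, C] =
    Stage(s"$name -> ${next.name}", run andThen next.run)
}

object PipelineSketch {
  // A produces events; B and C independently transform A's output; D gathers both.
  val a: Stage[Unit, List[String]]          = Stage("A", _ => List("evt-1", "evt-2"))
  val b: Stage[List[String], Int]           = Stage("B", _.size)
  val c: Stage[List[String], List[String]]  = Stage("C", _.map(_.toUpperCase))
  val d: Stage[(Int, List[String]), String] = Stage("D", { case (n, xs) => s"$n:${xs.mkString(",")}" })

  // The description above says nothing about where it runs; this local executor is one choice,
  // a Beam runner or a remote cluster would be another.
  def runLocally(): String = {
    val events = a.run(())
    d.run((b.run(events), c.run(events))) // scatter to B and C, gather in D
  }

  def main(args: Array[String]): Unit = println(runLocally())
}
```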
It sounds like tech from the future, and to a certain extent it actually is. So we decided to go with Beam. I should take a moment to explain: has anyone heard of Beam before? Did anyone go to the ThoughtWorks talk yesterday? Turns out somebody did. Beam is really cool. It has three main ideas: the API, the model and the engine. It gives you a streaming and batching model for programming a data pipeline, which can be expressed in these three languages we all know; if you're a Scala developer, this is like your second home. The nice thing about Beam is that the model is the same regardless of which language you choose, because there is this unified idea, and it's supported on the multiple runners you see here. This is the one I use; I haven't used Apex or Spark. Dataflow is the Google version, the Google-managed service for Apache Beam itself. I haven't used Samza, but I've used Gearpump before. I highly encourage you to go read Google's VLDB paper from 2015; Tyler Akidau is the main guy behind this idea, and he explains very clearly what the Dataflow model, which Beam embodies, is all about.

I always like to break things down, and Beam gives me four questions to break things down with. What results do I want to calculate? Where in event time are those results calculated? That is usually the time the event was generated, which in our case could be on the customer side; customers generate those events, and sometimes they tag them with a timestamp and sometimes they don't. When in processing time are results materialised? That's basically about watermarks, and what happens if data arrives past the watermark. And how do refinements of results relate, that is, how do elements that arrive late combine with what was already emitted?

Now, the general idea, in my understanding, is this: a batch is nothing more than a window of a stream. If you think about it, there's no difference. When you're processing batch data, you're simply taking a window of the data. In Scala parlance it's like a list of elements; sliding or not, you're creating windows; you're already creating windows. And I like the fact that the same idea is done properly for all the languages Beam supports, which means you have one unified model to reason about the pipeline you're building. That's very important in terms of terminology. If you build data pipelines: what is, in your opinion, event time? Processing time? How do I assign event time and processing time? You'll have a group of five people arguing about how it should be done, because there was no standardised terminology. Google has solved that problem: it basically says, this is how we think about it, here's a framework that manifests our thinking, let's use it.

Now, why am I showing you these lines? Because I used Flink, and this is the Beam API; they look remarkably similar. They are colour-coded intentionally; the colours relate to those four ideas, the what, where, when and how. In the Flink API, this is how they relate to one another: we are basically creating windows, and this whole thing can be translated into a Beam equivalent; I'm just trying to draw the resemblance between the two. The point I want to make next is that this was 2016; in 2018 it's a very different picture, and you can see the two models converging. Flink is trying to find its place in the world, and it has this idea of stateful stream processing, which literally means we can build checkpoints. We have actually experimented with this feature, and it works very well.
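To ground the what/where/when/how questions, here is a minimal sketch against the Beam Java SDK, called from Scala. It is illustrative rather than one of our actual pipelines: the GCS path is made up, and the window size, trigger and lateness choices are just examples of where each of the four questions is answered.

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Count
import org.apache.beam.sdk.transforms.windowing.{AfterPane, AfterWatermark, FixedWindows, Window}
import org.joda.time.Duration

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val opts = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val p = Pipeline.create(opts)

    p.apply(TextIO.read().from("gs://my-bucket/events/*"))                       // WHAT: the data we compute over (hypothetical path)
      .apply(Window.into[String](FixedWindows.of(Duration.standardMinutes(5)))   // WHERE: 5-minute event-time windows
        .triggering(AfterWatermark.pastEndOfWindow()                             // WHEN: emit when the watermark passes the window
          .withLateFirings(AfterPane.elementCountAtLeast(1)))                    // WHEN: plus a firing for each late element
        .withAllowedLateness(Duration.standardMinutes(30))                       // WHEN: accept data up to 30 minutes late
        .accumulatingFiredPanes())                                               // HOW: late refinements accumulate into earlier results
      .apply(Count.perElement[String]())                                         // WHAT: counts per element, per window

    p.run().waitUntilFinish()
  }
}
```

The Flink snippet on the slide expresses the same shape; only the API names differ.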
Okay, so the whole story is that we want to find a way to model what a job should be, to help the team find the proper abstraction for building a data pipeline, and to contain the idea by using the Beam model. Coming back to this: at this point we had pretty much decided that each job would be a Beam job. The next thing is that I can't let Beam jobs just fire off any old how, so we decided to build something. I had to figure out a way to build, for lack of a better term, a scheduler, an engine; we had to build some sort of DSL to string these jobs together, to express the idea of a DAG, of a workflow. In reality what happens is that this is a Beam job: it takes inputs and produces outputs consumed by the next one, and so on, and the DSL literally schedules and runs them, because Beam doesn't understand the idea of scheduling; I'll come to that a little later.

At one point I asked myself, do I really want to do this? But at that point in time, just a few months ago, there was no open source I could find that did what I wanted, so we went and built one. It's all done in about 8,000 lines of Scala; half of that is tests and the other half is the actual engine. Let me explain a bit so everybody understands the DAG. I think everybody has done workflow scheduling before. These are Beam jobs; each has a descriptor that describes what it does. What the engine does is literally lift those descriptions into itself and run validation to make sure there's no loop. If you're interested, the tech stack is Quiver; have you heard of Quiver? It's a functional graph library, a direct translation of Erwig's FGL (the functional graph library); I don't really know how to pronounce his name, but he's the original author of FGL. I basically use the interpretation by Ross Baker and the people behind http4s, who wrote Quiver. I use that to represent the DAG, and then I built an Akka engine around it that manipulates the DAG to fire, to schedule, and to do all kinds of things.

These are some of the data structures inside. We have a RESTful API so that people outside the system can query the workflows. These are literally the data structures that sit in shared memory, and these are the lifecycle handlers. Each of these jobs is modelled as an Akka actor; there's no command sourcing. Any questions? The stack I'm using here is Scala 2.12, Akka 2.5, Cats, as well as Quiver. It's as simple as that, no fanfare.

Next, I just want to quickly introduce the data types we're using. Everybody knows what a case class is, Scala developers; it's too simple, I'd be insulting you. I basically only formed two abstractions: the idea of a job, and the idea of a workflow, which is literally a graph containing the nodes and edges that describe it. The DSL validates that the whole configuration is a valid DAG with valid arguments and valid outputs, and once it's validated, it's strung into an actual workflow. That's it. As for patterns, it's the usual stuff: we basically use two programming techniques inside the whole DSL, monads and monad transformers, and finally some Akka programming; nothing overly complicated. This is the whole DSL driving this DAG of pipelines; it literally has the simplest operations you can think of: creation, start, update, stop, and discovering what to run next. Nothing magical.
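To give a flavour of those two abstractions and the DAG validation, here is a hedged sketch. The case class fields and the cycle check are illustrative, not our actual data structures, and the only things assumed from Quiver's API are LNode, LEdge, mkGraph, nodes and successors.

```scala
import quiver.{Graph, LEdge, LNode, mkGraph}

// Hypothetical shapes: a job descriptor and a workflow wrapping a labelled graph.
final case class JobDescriptor(id: String, image: String, args: List[String])
final case class Workflow(dag: Graph[String, JobDescriptor, Unit])

object WorkflowDsl {
  // Validate that the configuration forms a DAG before turning it into a workflow.
  def build(jobs: Seq[JobDescriptor], edges: Seq[(String, String)]): Either[String, Workflow] = {
    val g = mkGraph(
      jobs.map(j => LNode(j.id, j)),
      edges.map { case (from, to) => LEdge(from, to, ()) }
    )
    if (hasCycle(g)) Left("configuration is not a DAG") else Right(Workflow(g))
  }

  // Naive DFS cycle check over successors; fine for the small graphs a scheduler handles.
  private def hasCycle(g: Graph[String, JobDescriptor, Unit]): Boolean = {
    def visit(n: String, seen: Set[String]): Boolean =
      if (seen(n)) true
      else g.successors(n).exists(visit(_, seen + n))
    g.nodes.exists(visit(_, Set.empty))
  }
}
```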
This is another simple state function that we use. Our code has no variables; everything is run through state functions. I definitely hate the idea of a variable sitting there mocking my very presence, so I model everything after this. These are literally state objects with operations defined over them. It's quite clear that I'm trying to bind a Google Dataflow job to a workflow ID so that I can track it across the system; nothing magical here. Finally, MTL, monad transformers: this is literally what happens when somebody issues a stop-workflow request. We have these EitherT stacks of transformers; has anybody seen this before? It basically says: try to stop it, then try to deactivate, then try to cancel if it's a Dataflow job, and then update, and so on. It's as simple as that.
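The shape of that stop-workflow stack, as a hedged sketch using cats' EitherT over Future; the step names and the error type are hypothetical stand-ins for the engine's real operations, not our code.

```scala
import cats.data.EitherT
import cats.implicits._
import scala.concurrent.{ExecutionContext, Future}

final case class WorkflowId(value: String)

sealed trait StopError
case object NotRunning extends StopError
final case class Backend(msg: String) extends StopError

class StopWorkflow(implicit ec: ExecutionContext) {
  // Each step either succeeds or short-circuits the whole stack with a StopError.
  def stopJob(id: WorkflowId): EitherT[Future, StopError, Unit]        = EitherT.rightT[Future, StopError](())
  def deactivate(id: WorkflowId): EitherT[Future, StopError, Unit]     = EitherT.rightT[Future, StopError](())
  def cancelDataflow(id: WorkflowId): EitherT[Future, StopError, Unit] = EitherT.rightT[Future, StopError](())
  def persistStatus(id: WorkflowId): EitherT[Future, StopError, Unit]  = EitherT.rightT[Future, StopError](())

  // Mirrors "stop, deactivate, cancel if it's Dataflow, then update": the for-comprehension
  // stops at the first Left.
  def run(id: WorkflowId): Future[Either[StopError, Unit]] =
    (for {
      _ <- stopJob(id)
      _ <- deactivate(id)
      _ <- cancelDataflow(id) // only relevant when the runner is Dataflow
      _ <- persistStatus(id)
    } yield ()).value
}
```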
Just to bring the whole idea back: we basically have a Beam job housed in its own repo that understands how to do that job and that job only. Its specification is stored in our documentation database, so anybody who wants to reuse that Beam job simply looks up what it does and what inputs it takes, and that allows us to reuse it. The job is loaded into our engine and stays there, and then you can dynamically create workflows: you read the documentation, see that you have this, this and this, and you want a workflow that does that; all you have to do is submit a simple REST call and we piece it together for you, and from that point it stays in memory until you invoke it.

Now comes the other part of the question: where do you run it? I want to answer three things here: where do I run it, how do I scale it, and how do I support diverse workloads? This is the final data architecture we actually have, so let me go into slightly more detail, but not too much. At this point, imagine you have a couple of pipelines scheduled in the engine and you need to fire one. A Beam job is, by right, a Java program or a Python program or a Go program, and it allows you to run locally on one of your servers, or you can send it to Google and have Google run it for you, because there are parameters you can pass to Beam that command it to do so. All of this is housed in the configuration for each of these jobs. So when a job is being scheduled, what we actually do is look at the compute classes we have visibility into, because the compute class tells me how many resources we actually have in that network zone and what kind of pipelines we can run there. When a job is lazily dispatched, before it's dispatched we basically ask the clusters: are you busy, can you do some work for me? If they say yes, we dispatch; the job drops into one of the nodes running in the cluster, and depending on its Beam configuration it can either go and run in Google's environment or run locally on the machine it's been assigned to.

How do we do this? We use Mesos. Has anyone heard of Mesos? Okay, one, two, not many. Good old Mesos: the monster nobody tells you about. Mesos started a long time ago, I can't remember exactly when; a few years ago a company formed around it, called Mesosphere. The key thing about Mesos is sharing: it provides a general abstraction for sharing the resources of the servers in a zone for you to consume. It also has a module system, so it's basically an ecosystem in itself.

The nice thing about Mesos is that it doesn't take the centralised-scheduler approach to dispatching jobs. What I mean is: typically, with a centralised scheduler, you build a pipeline that needs requirements A, B, C and it seems to work; then you build another, more complicated pipeline that needs A, B, C, D, E, you have to make customisations to the scheduler, and the scheduler code keeps bloating. What Mesos does instead is leave the decision of how to execute to the pipeline or the jobs you wrote. The job does nothing more than say: I need GPUs, I need CPUs, I need this much RAM, tell me where to go. Once you ask the Mesos cluster for what you need, it broadcasts a message, and the first framework that answers the offer literally gets assigned the job.

Why did we do this? Because when I was building the engine, I didn't want to write centralised scheduling code and logic to distribute pipelines with very different requirements. Some of our pipelines actually need OpenCV, and if you've done Beam before, you know you cannot command Google to run your job in an environment with OpenCV installed; you have to solve the problem of providing that resource to your job yourself. So you need an abstraction layer to hide this from the developer of the pipeline. From the developer's point of view: why should I have to worry about this? You should take care of the problem for me; it's not really my concern. So we basically use Mesos to solve this problem. During testing we actually run these Beam jobs in our own data centre environment, and when they're ready, we simply switch. (Oh god, is it me again? Yes, it was me again, sorry.) When it's about to run in the Google environment, we basically switch a tag that changes the runner type; it's literally this runner type here. When we switch the runner type, by changing a few characters, the job switches its behaviour and goes to run on these platforms. If you think about it, that allows me to take one program and, by switching parameters, change where it runs. It allows me to do testing: I can stand up a TensorFlow cluster, which is supported by Mesos, and just tell the job, go run there for your testing; I don't have to do anything at all.
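A minimal sketch of what that runner switch looks like from the job's side, assuming the standard Beam option flags; the project and bucket names are invented.

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

object RunnerSwitchSketch {
  def main(args: Array[String]): Unit = {
    // Local test run:  --runner=DirectRunner
    // Google run:      --runner=DataflowRunner --project=my-gcp-project \
    //                  --tempLocation=gs://my-bucket/tmp
    // Nothing else about the pipeline changes; only those few characters do.
    val opts = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    val p = Pipeline.create(opts)
    // ... the same transforms are applied here regardless of the runner ...
    p.run().waitUntilFinish()
  }
}
```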
So, coming back to this: it turns out that with this proper abstraction I can free the developer. I just want the developer to focus on one thing: translate all the data pipelines we have in the business into Beam, and I'll string them together with the engine and let them run amok inside our data infrastructure. Then we built lifecycle management on top to manage how these jobs behave: do I need to restart, what's the status of my job, and if it's running as a Google job, how do I monitor it? We solved those problems because we wanted to. If you go and read about Mesos, these are the four main problems Mesos attempts to solve, and it solves them very well; I used it in another deep learning project two years back and it worked beautifully.

To reiterate: the developer focuses on building the Beam jobs, which are strung together by our DSL, and the developer is freed from worrying about where the thing runs, because we've solved that problem for them. But, in case you missed the point, some of our pipelines do need OpenCV to some extent, for things like cropping images and cropping videos, and we don't want the developer to worry about where that has to run. So we set up a small Mesos cluster and had those machines carry all the dependencies. When developers are ready to test, all they have to do is run it; it automatically gets scheduled, executed and monitored. How do we do this? It's the idea of a homogeneous cluster. These are the three compute classes you saw previously on one of the slides. What we do is inject the job and its dependencies, the binaries or whatever they are, Go, Python, whatever, into each of those nodes. So this becomes a DevOps operation; there's a fair bit, actually just a little bit, of ops work, nothing that automation cannot handle.

Okay, cool. I just want to share some key observations; we're coming to the end of the talk, in case you were wondering. I've noticed three things while building data architectures, or whatever you want to call them, data processing engines. There is no perfect data architecture; never, not in a million years, I don't care what you try to tell your boss, it doesn't exist. The main reason is this idea of change: you always hear the old cliché that the only constant is change, and it's true, it's 110% true. The last point is pretty key: you really want a team that can adapt, and you need to enable them to adapt. I don't know how you're going to do it, but learning, for me, is key. One of the things I tell my guys is that when we start on this project, two things: we're going to go slow, take the time to do it, and do the best job you can. Those are the two key messages. It's very difficult for them to believe it's real; you can ask them. They look at me, their eyes start darting, and you can sense this look of disgust, like, how dare you lie to my face. But you have to have endorsement, basically management buy-in, because it's the only thing that makes sense.

The last two slides before I end: read the references later, I'll be posting them to the meetup site and you can download them. Is anyone using Scio from Spotify? Awesome, Arun, you're always the most awesome, man. I also highly encourage you to take a look at Verizon's Quiver, a very nice library for simple multigraph modelling. Here are some papers: these are the key papers I used to think about the problem of scheduling; they're roughly the same people who contributed to Mesos in the end, so I highly recommend you read them. These two talk about streaming and batching systems. And lastly, in case you think I'm lying, go check this link: it has all the parameters for choosing things like how big a machine you want to run on Google, what type of machine, without worrying about how it's run. Google has a very nice interface that displays all your pipelines, very nice colours, and tells you about data ingestion rates and so on. And that's it, thank you. Do you have any questions? Okay, you raised your hand first.

Question from the audience: can you inject dependencies into the Google cluster the same way? What you can do is treat the VM in Google Cloud as a preamble to creating an image: we take a blank image, put in the stuff we want, test it first, cut an image, and push it into the cluster.
Once that's done, what you have is multiple nodes all carrying the same image, and then you can start testing things like this. You can have multiple images for different node types; that's one way to take it. The second route is to look at how Mesos tags resources: you can introduce this idea of tags, which means you can mark certain nodes to say, for example, you only do TensorFlow. What happens then, in the programming model, is that your framework needs to remember — I was telling you about the resource offers in Mesos — so if you want a tagged resource, during that offer phase you have to say I want it and tell Mesos; after that, it's yours. That's it. There's a small sketch of that offer callback just after this answer.
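Here is that offer phase as a hedged sketch against the Mesos Java bindings: the framework's scheduler receives offers, inspects the resources (and, by extension, whatever tags or roles it cares about), and either uses the offer or declines it. Building and launching the actual TaskInfo is elided, and the CPU/memory thresholds are made up.

```scala
import org.apache.mesos.{Protos, Scheduler, SchedulerDriver}
import scala.collection.JavaConverters._

class OfferOnlyScheduler extends Scheduler {
  // Mesos broadcasts offers; the first framework to claim one gets the resources.
  override def resourceOffers(driver: SchedulerDriver,
                              offers: java.util.List[Protos.Offer]): Unit =
    offers.asScala.foreach { offer =>
      val cpus = offer.getResourcesList.asScala
        .find(_.getName == "cpus").map(_.getScalar.getValue).getOrElse(0.0)
      val mem = offer.getResourcesList.asScala
        .find(_.getName == "mem").map(_.getScalar.getValue).getOrElse(0.0)
      if (cpus >= 2.0 && mem >= 4096) {
        // A real framework would build a Protos.TaskInfo for the pending Beam job here and
        // call driver.launchTasks(java.util.Collections.singletonList(offer.getId), tasks).
        println(s"offer ${offer.getId.getValue} fits; would launch here")
        driver.declineOffer(offer.getId) // placeholder while no task is pending
      } else {
        driver.declineOffer(offer.getId)
      }
    }

  // Remaining Scheduler callbacks left as no-ops for brevity.
  override def registered(d: SchedulerDriver, id: Protos.FrameworkID, m: Protos.MasterInfo): Unit = ()
  override def reregistered(d: SchedulerDriver, m: Protos.MasterInfo): Unit = ()
  override def offerRescinded(d: SchedulerDriver, id: Protos.OfferID): Unit = ()
  override def statusUpdate(d: SchedulerDriver, s: Protos.TaskStatus): Unit = ()
  override def frameworkMessage(d: SchedulerDriver, e: Protos.ExecutorID, s: Protos.SlaveID, data: Array[Byte]): Unit = ()
  override def disconnected(d: SchedulerDriver): Unit = ()
  override def slaveLost(d: SchedulerDriver, id: Protos.SlaveID): Unit = ()
  override def executorLost(d: SchedulerDriver, e: Protos.ExecutorID, s: Protos.SlaveID, status: Int): Unit = ()
  override def error(d: SchedulerDriver, message: String): Unit = ()
}
```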
Question from the audience: just out of curiosity, something like Ansible? I'm not that good with that box, but if you have SSH access to a node you can just use Ansible to tell it to install something. Oh, okay, there you go, there's already another solution; thanks for sharing. Like I said at the start of the talk, this is my interpretation of the problem. Any other questions?

Question from the audience: I'm just curious, during your design phase, when you were considering Mesos, did you ever consider Kubernetes as an option? My advice on Kubernetes is: if you have a data setup of about 300 nodes, you can use Kubernetes; if you have fewer than 100 nodes, I'd say use anything else. The reason, and I'm not trying to insult your intelligence, your supreme intelligence, is that the amount of learning for Kubernetes, at least when I started, was huge. It's really a good DSL, but the amount of work to get it going and string everything together, back then, for me, and that was about two years ago, was too much; I couldn't justify it for launching a 20-node cluster. Follow-up: that makes sense, because two years back Kubernetes wasn't stable yet; and I was also thinking, Kubernetes is offered by Google, or would you go with Terraform? Terraform is different. Terraform is very nice when you have to prepare images for on-premise installations. The other thing I didn't tell you about why I'm doing this is that the company may decide to install our product on-premise, so I need a generic way of doing that without doing too much work.

Question from the audience: what are the response times we're talking about when you call the system and it does NLP or something? Most probably because some of our clients do scraping of news, earnings news, things like that, the response time is pretty important; at least the order of magnitude of it. The honest answer is that we haven't fully measured it yet, because we're still in the process of translating all these pipelines into the new framework; we're at the order of seconds to milliseconds. And this is something for Google to work on: when you tell Google to launch a Beam job on its infrastructure, it packs all your binaries and dependencies, ships them across their network, insanely fast provisions a server, maybe picks one at runtime, injects the whole thing in, and then starts the job. What I'm pointing out is that because there's this overhead of carrying your binaries over to the environment, you have to make sure the time spent processing the data is more than the latency of getting it there. And I can tell you, Google starts behaving irrationally from about 5pm Singapore time to about 6.30pm; I don't know, maybe maintenance or something, it's just crazy. Okay, any other questions?

Question from the audience: in the beginning you mentioned there was streaming data as well as batch data. If you're deploying these workflows for streaming data, streaming is typically always on, so why would you need to deploy it and stop it at any time, rather than deploy it and just leave it there? Right, but all software fails some of the time. The nice thing about Google, as we were playing with it, is that it gives you parameters, execution-wise, to define an upper bound and a lower bound on how many workers, or rather threads, you want to run the job on, and you can also specify the type of machine, like the instance sizes AWS offers you, as a command-line parameter, to alter how your streaming job is supposed to scale.
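Those bounds and the machine type are ordinary Dataflow pipeline options. Here is a hedged sketch, with invented project and bucket names, of what setting them programmatically looks like via Beam's DataflowPipelineOptions.

```scala
import org.apache.beam.runners.dataflow.DataflowRunner
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType
import org.apache.beam.sdk.options.PipelineOptionsFactory

object ScalingOptionsSketch {
  def main(args: Array[String]): Unit = {
    val opts = PipelineOptionsFactory
      .fromArgs(args: _*).withValidation().as(classOf[DataflowPipelineOptions])
    opts.setRunner(classOf[DataflowRunner])
    opts.setProject("my-gcp-project")            // hypothetical project
    opts.setTempLocation("gs://my-bucket/tmp")   // hypothetical bucket
    opts.setStreaming(true)                      // long-running streaming job
    opts.setNumWorkers(2)                        // lower bound to start with
    opts.setMaxNumWorkers(10)                    // upper bound for autoscaling
    opts.setWorkerMachineType("n1-standard-4")   // the "size of the machine"
    opts.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED)
    // Pipeline.create(opts) ... build the transforms ... p.run()
  }
}
```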
So you give it a bound, but it doesn't just work; at least I haven't seen it run 24x7 where it just sits there ingesting and nobody has to care about it, because I've noticed things stop running between 5 and 6.30 and I can't explain it. Question from the audience: when that failure happens, does it handle it itself, or do you have to go in and fix things? You have to go in and fix those things. The streaming model — even Tyler Akidau, in his paper, admits it's not a silver bullet, but it solves the problem of providing a unified model for thinking about streaming and batch data. Operations-wise, if you choose to run it on Google infrastructure, you're basically leaving your good software in the hands of Google and hoping it doesn't keel over.

Question from the audience: how different is it if someone uses Kafka and Structured Streaming in Spark versus using Beam? Oh wow, that's a big question. It's actually hidden in the references; there's a Beam-versus-Spark comparison, let me show it to you again. Go read that; it gives a much better exposition than I possibly can.

Question from the audience: I saw that you're using Scala for building the runner; what about Scio? This is Spotify's wrapper around Apache Beam, and if you ask me, I think it's better than writing a ParDo; ParDo, or a DoFn, is a programming construct in Beam, and it's literally the Java programming model. Here's the thing: because the team's makeup is very diverse, I need to find the right abstraction; that's another decision we had to make. We are building an abstraction, and what I'm looking for is the right abstraction. The language it runs in sometimes doesn't matter, because some of these pipelines are really trivial; you don't have to learn how to write Scala in order to use it. If the same model can be fulfilled by writing a simple version in Python, which Beam supports, then you should go for it, because that's the rational decision. You don't have to do Scala for everything; it pains me to say it, but no, you don't have to.

Question from the audience: what kind of developer experience do they have? This might be a silly thing to ask, but sometimes you need to debug something; how does that work in this setup? How do you debug one concrete stage, and then how do you make sure the entire thing still makes sense — integration testing? So the basic thing is, once you've identified what your pipeline should do and you've broken it down into stages, all the developers have to agree on what the input is and what the output is: you ingest my input, I consume your output, and so on. These inputs and outputs we have chosen to put on cloud storage, Google Cloud Storage, but because we have abstracted input and output away, we can literally replace them with any data source we want.

Follow-up from the audience: the basic type is the stream, and I can process a stream in different flavours; I can split it on different characters. A trivial example would be a stage that splits on quotes, or a stage that splits on something else. How do you make sure, when you construct your stream-splitting pipeline, that you actually use the right stages from the library and plug them together correctly? It would make sense to use type safety to make sure they play together nicely, but the internal behaviours might differ — so something at the integration-test level? If you ask me, if you are used to writing things the Scala way, Scio is actually how you would imagine writing it in Scala, and Scio takes care of the splitting for you; that's the nice part about translating the ideas. If a job is too big to be processed, what we've been doing in integration testing is write out intermediate results and have a later stage read that data back in. The nice thing about the Dataflow model, versus Flink, is that they all operate on keys and values: it's all grouping by keys and grouping by values, so if you're used to Spark's idea of grouping, the Beam way translates quite naturally. I'm not sure whether I'm answering your question. More or less; I think my question is too big to be answered. Okay, thank you, very kind. You're welcome. You have a question?

Question from the audience: thanks, man, good to see you. I have a couple of questions. One is that you mentioned the libraries are bundled along; and I saw from your constructs that you're trying to kick off a Beam job — is it going to be a separate cluster, and what parameters do you pass as part of the workflow in order to kick off the job? That part is not in the presentation. Yes, correct, it's intentional. Let me answer the first question first: it's literally an Akka FSM that runs each of these jobs, but I don't launch all of them at once. When I ingest the workflow A, B, C and D, I only launch whatever is needed at that point in time, so there will be one FSM for A; B and C exist only in the configuration, and they never materialise as actors until they're actually needed. That's where the other part of the presentation links in, the idea of mapping the states in Google Beam — sorry, Apache Beam — to what we should ingest, because with Beam's states we had to make a conscious decision about which to ignore and which to mimic. If a job is in an unknown state, we should never proceed to the next stage on our side, because the previous one is already unknown; there's no point firing the next stage. Things like this. Sorry, what was the second part of the question? The second part is the parameters, because I saw you're using doobie. Doobie is pretty cool, actually. So it's going to run in a cluster, but if it's running in a cluster you have multiple workers — what about data sharding?
Not yet implemented. And as for how we choose between different nodes: honestly, I haven't really thought about that problem yet; I know the abstraction I want to use, but I haven't thought it through.

Okay, maybe one last question: you decided to build your own DSL to create this DAG — why? I asked myself that question many times. There's a specific challenge the Apache Beam programming model poses when creating these data pipelines, and I get it from my team members day in, day out. The main reason is that Beam is a programming model for the idea of a pipeline; it doesn't understand the idea of scheduling per se. And once we had broken things down into these individual tasks, it also meant that other pipelines we translate in the future could reuse them just by altering the parameters, the inputs, so I needed a way to string those ideas together. Follow-up: so you have these jobs that can execute in parallel, and you stitch the data flow together as required? Exactly, exactly. If you look back at the earlier slide, there's A, then B and C, and finally D: it's a classic scatter-gather. For B and C there's no reason why they cannot be launched concurrently. The DAG gives you the idea of who should go first; it's not even a topological sort. What I realised is that it's left to the implementation to decide, and I decided to launch them concurrently: when A completes properly, then B and C, which conceptually in the DAG can launch at the same time, really do launch at the same time, and that's exactly what we did, provided the source data is already available. If one depends on the other, then that's not possible; so one thing we built into the engine is tracking the progress of each stage, and we never launch anything downstream unless we're very sure the upstream has actually completed. That's the basic thing we do at the moment; it's not perfect, but it serves our purpose. Okay, thank you. Thanks, guys.