Good morning, and thanks. My name is Pallavi. I work for a company called InMobi. The clicker is not working, so I'm going to be stuck at my laptop. So that's who I am: I'm a committer and a PMC member for Apache Falcon, and I also contribute to Apache Pig, mostly in the area of Pig on Spark. This is what we're going to do in the next 30 minutes: find out why we built Apache Falcon at InMobi, how we use it, and how it has helped us.

OK, a little bit of a story. How many of you have heard of InMobi? Great, great. For those of you who haven't: you all have free apps on your phone, and if you are like me, you wouldn't spend even 10 bucks buying an app. InMobi is in the business of making these apps free for you. That's what we do: by inserting annoying little ads, the apps stay free for you. So when InMobi started six years ago, we were just in the US, doing the ad business, primarily on mobile. And we needed a platform which would collect all the data in terms of ad views, ad clicks, app installs, so we could eventually bill our customer, that is the advertiser, and show how much the advertising cost them.

So this is how the simple data pipeline looked. Everything that a user does with an ad, we collected in our logs. The click logs we would enhance with some metadata: advertiser details, publisher details, app details, or even device details, and so on. Those enhanced clicks we would aggregate hourly and say, OK, from this advertiser we saw so many ad views and so many clicks for this particular ad. And then we would aggregate it daily, and we would bill the advertiser. So it was pretty straightforward when we started out.

But there were a few requirements here and there, apart from the standard pipeline that we saw here. As data flowed through this pipeline, you had to control when these jobs ran: the hourly jobs had to run every hour, the daily ones every day, and so on. And the data, you didn't want to keep around forever, for different reasons: you don't want to waste storage, and there are compliance reasons why you don't want to keep data around. So each of the data sets we acted on had a different retention, and there were replication needs also. So all of this used to happen. And how did we solve it in the beginning? Simple. We just wrote cron jobs: a bunch of cron jobs to schedule the processing, a bunch of cron jobs to delete data, a bunch of cron jobs to replicate data. And if anything failed, it was just email notifications that went to the right developer, and you would go look at it.

But as the business grew, our pipelines became this. Don't try to understand that; I still don't. So it became a little more complex. We started seeing different ad formats, we expanded into different geographies, and people wanted to analyze data differently. We didn't just need to analyze the data to bill our customers: our ad campaign managers said, I want to see how my advertisements are performing, and we wanted to do user targeting, the right ad at the right place kind of thing. So our analytical pipelines started becoming more and more complex. And as we expanded geographically, to China, Asia-Pacific, then Europe, and so on, we were running similar data pipelines across different colos. So it just became more and more difficult to manage all of this. Operability became a big nightmare. We had to handle failures across these pipelines. Data used to arrive late.
Then we needed to do reprocessing. And each pipeline had different data replication and replay requirements, archival requirements. And we needed to meet stricter SLAs: the customer wouldn't wait an entire day to get billed, the burn rate had to be calculated on an hour-to-hour basis, and so on. So we were struggling to operate the whole system, and that's why we said, let's take a step back and see how we can solve this better.

So we said, let's look at the problem pattern. This is typically what we do: we import data from an external data source, then we do some processing on top of it. You evict data, you archive data, you export data into warehouses. And you might want to replicate data and do a different kind of processing on a different cluster. When I say external, it's just external to the Hadoop cluster. So this, we said, looks like the problem pattern that repeats across all our data pipelines. How do we go about solving it? That's when we built Falcon.

So this is how it looks. Falcon builds an abstraction layer on top of your Hadoop cluster: on top of your storage, be it HDFS or Hive, and on top of your execution jobs, be it MR, Spark, Pig, or even Hive queries for that matter. Falcon talks to the underlying cluster while keeping you abstracted from all the complexities. You just interact with Falcon: you define a data set, you define a process. Which execution engine it uses in the background, you wouldn't worry so much about; during implementation, yes, but not during operation. But you would have to tell Falcon about the endpoints: how to access data, where it is sitting, whether it's HDFS or Hive, where things execute, where the registry (HCatalog) is, and so on. You give all these details to Falcon, and Falcon communicates with the Hadoop cluster on your behalf.

So when you look at the pipeline that we've been talking about, the simple pipeline, everything remains the same. The only thing is that the data sets have now become Falcon feeds, the processes have now become Falcon processes, and the other operations we saw, the ones auxiliary to your main business logic, have now become the dotted lines, which means Falcon has started managing them for you. So you don't have to write your cron jobs any longer. All of this happens within what we call the Falcon cluster, inside your Hadoop cluster.

So if you look at the problem areas that Falcon tries to solve, there are these three. One is process management: we handle retries, we handle reruns, we handle relays; we'll get into each of these in a bit of detail. Then data management: import, export, retention, replication, and so on. And then data governance: you need to be able to find the lineage of any data set instance you produced, all the way back to the source. And you need to be able to monitor your SLAs, so you can say, OK, I've waited too long, not just for failures; I've waited too long for some job to run, and I need to act on it. Let's look at each of these in a little more detail.

Process management. What do we mean by process relays? When you build a process, more often than not, even in the sample data pipeline that I showed, there's a data dependency, which means I process something and I produce an output.
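Before going further into relays, a quick aside on the cluster definition mentioned a moment ago: the endpoints you hand to Falcon are declared once, in a cluster entity. The following is a rough, illustrative sketch, not taken from the talk; the hostnames, ports, and versions are placeholders.

<!-- Illustrative Falcon cluster entity: tells Falcon where HDFS, YARN,
     Oozie, and the HCatalog registry live. Hosts, ports, and versions
     below are made-up placeholders. -->
<cluster name="primary" description="example colo" colo="us-west"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly"  endpoint="hftp://namenode:50070"     version="2.6.0"/>
    <interface type="write"     endpoint="hdfs://namenode:8020"      version="2.6.0"/>
    <interface type="execute"   endpoint="resourcemanager:8050"      version="2.6.0"/>
    <interface type="workflow"  endpoint="http://oozie:11000/oozie/" version="4.1.0"/>
    <interface type="registry"  endpoint="thrift://hcatalog:9083"    version="0.13.0"/>
    <interface type="messaging" endpoint="tcp://activemq:61616"      version="5.4.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primary/staging"/>
    <location name="working" path="/apps/falcon/primary/working"/>
  </locations>
</cluster>

Feeds and processes then refer to this cluster by name, so the individual jobs never hard-code the endpoints themselves.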
Back to process relays: there are one or more consumers for that output, and each of them has to wait for the data to arrive before it starts processing. That is what we mean by process relays. There's a dependency graph, a DAG, of all these processes, and we should be able to manage it well. So when some data is produced, something else triggers; that produces more data, a third process gets triggered, and so on. The processes here need to wait for the data to arrive, and it could be either imported data or even replicated data; you could be waiting on replicated data too.

Late data arrival. What do we mean by late data arrival? Let's take the same clicks example that I've been talking about. Say a click was supposed to have been logged at the zeroth hour, but for some reason, some delay upstream, it doesn't arrive until the first hour. But by the time it comes in at the first hour, all my zeroth-hour processing has already been done across the whole pipeline; my hourly billing has already happened. So I need to handle this differently now: I'll probably need to reprocess the zeroth-hour data all over again to account for these new clicks that came in later. That is what I mean by late data arrival. The way you want to handle this is: let the existing pipelines go through, let the first-hour data go through, the second-hour data go through, don't affect those, but still be able to detect that something arrived late and process it when it does. And sometimes you might want to run different logic on top of it: you might say, if the data arrives late, I want to execute a separate process, or let it be the same process. Falcon gives you the ability to do both.

And all of this declaratively. You don't have to write a single line of code to address any of the problem areas I've been talking about; you just declare it to Falcon. So when you write a process specification, this is what you say. You say where the process should run, in which cluster. And remember, we already defined in the cluster the Hadoop endpoints: the YARN endpoints, the Oozie endpoints, the HDFS endpoints, and so on. Then you tell Falcon how to run it: what frequency it needs to run at, what order it needs to run in, and the amount of parallelism you want to give it. And then of course you tell it what the input is, what the output is, and where your processing logic sits. In this case it's an Oozie workflow, but it could just as well be a Pig script, a Hive query, or even a Spark job. And then this is how you specify late data processing: you basically say, every one hour, check whether the data has arrived or has changed; if so, do something, either execute the same process all over again or execute a different process.

The same applies to retries. When there are failures, it's not always because of a bug in your program; often it's a transient failure in the infrastructure, say the network went down. So Falcon has logic to automatically retry for you, and you can specify the policy. Periodic basically means you retry at a fixed interval; then there is exponential backoff, where you wait for one hour, two hours, four hours, eight hours, and so on.
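To make the declarative part concrete, here is a rough, illustrative sketch of a process entity covering the pieces just described: frequency, ordering, parallelism, inputs and outputs, the workflow holding the business logic, a retry policy, and a late-data handler. The feed names, paths, and times are made up for the example, not the actual InMobi definitions.

<process name="hourly-click-aggregator" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>          <!-- how many instances may run concurrently -->
  <order>FIFO</order>             <!-- order in which pending instances run -->
  <frequency>hours(1)</frequency> <!-- one run every hour -->
  <inputs>
    <!-- wait for the previous hour of the "clicks" feed before running -->
    <input name="clicks" feed="clicks" start="now(-1,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="summary" feed="click-summary" instance="now(0,0)"/>
  </outputs>
  <!-- business logic lives in an Oozie workflow; could also be Pig/Hive/Spark -->
  <workflow engine="oozie" path="/apps/clicks/aggregate-workflow"/>
  <!-- transient failures: retry every 30 minutes, up to 3 times -->
  <retry policy="periodic" delay="minutes(30)" attempts="3"/>
  <!-- late data: keep checking the input, and run a separate workflow if it changed -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="clicks" workflow-path="/apps/clicks/late-workflow"/>
  </late-process>
</process>

Behind the scenes, Falcon translates a declaration like this into Oozie coordinator jobs, so the retry and late-data handling never show up in your own code.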
So you configure all those policies, retries and late data processing, declaratively, and Falcon makes sure it schedules the right set of processes and jobs in the background to handle all of it. That is process management.

Coming to data management: retention as a service. Just like we did for processes, we declaratively tell Falcon how long it needs to keep data around, and whether it should be deleted or archived. We know why we need to remove data: either because we don't want to keep temporary data around forever, or because of legal and compliance reasons. Whatever it is, you tell Falcon declaratively how long it needs to keep data around, and it automatically takes care of the retention. Same with replication. You would typically use replication for two reasons: disaster recovery, and the global/local aggregation we'll talk about a little later. So we replicate data from one cluster to another, again declaratively. The other part of it is configurable resource consumption. You can tell Falcon that a data set is high priority, that you want it replicated as quickly as possible, taking the maximum network bandwidth; or you can say this is a low-priority one and it can take a little longer to replicate. So you can configure the resource consumption during replication too, and all of that, again, declaratively.

Then data import and export. For import, you might want a snapshot of the data, as in the case of metadata, where you import it once and, if it changes, you pull the whole dump again; or it could be incremental updates. And on the export side, you process something on your Hadoop cluster, generate some data, and ship it out to your data warehouse for analytics. Both modes, snapshot and incremental, are supported.

And this is how you would do it, at a very high level. This is what is called a feed. A feed is basically your data set, and a process, of course, is your business logic. So in the feed you say how frequently data is supposed to arrive in this data set. Then there's the SLA monitoring we touched on: here you're saying that if the data doesn't arrive in two hours, something is wrong, that's a warning level; if the data doesn't arrive in three hours, then alert. So SLA monitoring can be set up right here. Then there's data retention where, as you see, in just one line you say keep data for the last two days and remove everything else; instead of delete, you could have archive also. Replication: if you look at the clusters, the first one says source and the second says target, so whatever comes into the source is replicated into the target cluster as and when it arrives. And finally, the location of the data, your data path. This of course is HDFS, but like I said, it could be HCatalog too. A sketch of such a feed definition is shown below.

Moving on to data governance. One of the capabilities Falcon supports is the dependency graph: when you are building these processes, you should be able to visualize the various dependencies across processes and feeds, and Falcon has APIs to do that. And then lineage: like I mentioned, when I produce an output, I have to be able to trace it back to its source, in case there's a problem with the output, or for auditing purposes, to figure out where the data came from.
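Here is the feed sketch referred to above, pulling together the data-management pieces just described: arrival frequency, SLA thresholds, retention, replication from a source to a target cluster, and the data location. It is illustrative only; names, paths, and limits are placeholders, not InMobi's actual settings.

<feed name="click-summary" description="hourly click summaries" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>              <!-- a new instance is expected every hour -->
  <sla slaLow="hours(2)" slaHigh="hours(3)"/>  <!-- warn at 2 hours, alert at 3 -->
  <late-arrival cut-off="hours(6)"/>           <!-- how long to keep watching for late data -->
  <clusters>
    <cluster name="local-colo" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>   <!-- keep two days, then evict -->
    </cluster>
    <cluster name="global-colo" type="target">       <!-- replicate to the global cluster -->
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(3)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/click-summary/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

The source/target cluster pair is all it takes to get the replication described earlier: from this one definition, Falcon schedules the copy and the evictions on both clusters.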
Coming back to lineage: there are APIs to help you get this lineage. And again, SLA monitoring, which we've talked about a little already. For the dependency graph, you can get the dependencies for a set of entities; when I say entities, that means both feeds and processes. Or it can be at the level of an instance; when I say instance, it just means one run of a process or one window of a data set, what we call a feed instance and a process instance. An example would be something like this: there's a click process producing clicks, and an impression process, an impression being pretty much a view, an ad view, producing impressions, and you want to summarize those. So this is how your dependency graph looks at the entity level. At the instance level, it would say, for example, that the summary produced at 12:30 hours came from the clicks and impressions data produced at 12:00 hours. That is an instance of a run. So you can track dependencies both at the entity level and at the instance level.

Let's move on to a simple case study. This is an example, at a very high level, of how processing looks in a single data center at InMobi. There are ad requests, there are views, there are clicks, and there are conversions; conversions are basically app downloads or installs. All of these get processed and some summary is produced, either for billing or for various analytical reasons. When we were just in the US, this was good enough for us. But when we expanded across locations, we had multiple such data centers, and we wanted the data to go to the data center it is co-located with. For example, any ads that we serve here in India go to a data center in Hong Kong; any ads served in the US go to a data center in the US. That is because we want the latency to be as low as possible, so we do all the processing locally. But we still need an aggregated view: when we give out a bill to an advertiser, we have to say, across all the geographies, here's how much you spent. To do that, we could either move all our data to one data center and process it there, or process it in the individual data centers, get the local summaries, and aggregate them at a global data center. That's what we do at InMobi: all the computation happens in the individual data centers, which are pretty much replicas of each other in terms of the processing they do. They produce these small summaries, and the summaries are shipped out to a bigger cluster, which rolls them all up across four or five data centers. This is where replication comes into play: when we generate a summary in the local data center, we replicate it to the global data center, and an aggregation gets triggered over there. So that's one more use case for data replication.

And since we have multiple data centers, we typically have one Falcon server sitting in each data center, coordinating all the processes there. And then we have something called the prism. The prism pretty much acts like a router: it sends your requests out to the different Falcon servers. So you could say, schedule process one in colos one and two, process two in colos two and three, process three in all colos.
So you can tell that to the prism and it routes the request accordingly for you. Even the management happens at the prism level; you don't have to go to the individual Falcon servers for that. This is what we call the distributed mode in which Falcon works.

A few numbers here. We have five clusters at InMobi: four are local data centers, actual physical data centers, and the fifth is more of a logical cluster for the global aggregation I just mentioned. And these are a few numbers which you can take in a little later; I won't read through them, but we do process quite a bit of data. We have more than 200K Hadoop jobs running every day, and we're gradually catching up on Spark also, so we have at least a few hundred running there, and quite a bit of data: about one petabyte of data in storage, and new data of about six terabytes.

OK, so when I talk about Falcon, you'll ask what else is out there, just for comparison's sake, to have a level playing field: which other products compare to Falcon? If you look at the ingestion side, the data management side, there are Apache NiFi and LinkedIn's Gobblin. Both of them are more focused on ingestion, on getting data into Hadoop, and that is where they play really well. Then there are Apache Oozie and Azkaban for orchestration of processes, where you want to schedule processes and build your DAGs. And there is Atlas for data governance. They are all equally good at what they do, but to be fair, there's not one product that does all of it, process management, data management, and data governance, in one go.

All right, so if you want to learn more, that's our website. It's an open source project, so feel free to contribute. There are a bunch of JIRAs that you can go and look at; we've marked some JIRAs as newbie, if you are just starting out with Falcon, and you can contribute right there. I'll stop here and take questions.

Is there any limitation or constraint on the kind of data that flows through Falcon? I mean, like a data format, or could it be anything?

No. Falcon acts at a very abstract level. The data format and all that is up to you, how you want to dump the data. In fact, at InMobi we use Thrift, we use Protobuf, there's a mixture of things, RC, ORC. There are various data formats flowing across, but that is specific to your Hadoop job; Falcon doesn't worry about it. As long as it sees data in that location, it's fine; it notifies, saying this data has arrived, and it's up to the process to decide how it wants to read it.

And how about ingestion of data? You've mentioned Hadoop; is there any other way I can get data into Falcon, like a Kafka topic, or MQ, or some other way to ingest data into a Falcon process?

Right now, for the import bit, we have integration with any JDBC-compliant data source. As for Kafka, we have cases where we do dump data into Kafka and then, for batch processing, it eventually comes into Hadoop. For that, what we have worked out is an in-built tool where we mirror all the data that arrives in Kafka into HDFS as minute-level files; we do a simple copy of it. There's a consumer listening to that queue which simply does a minute-level dump, and that becomes your batch: we compress it and start processing it in our batch pipeline.

Sure, so kind of an adapter.
Kind of an adapter, yes.

Hi. I have a question on the process management flow, where you get data into any Hadoop cluster or any other cluster. Is there any validation framework around that? Suppose the data has arrived, not that it hasn't arrived, it has arrived, but it doesn't have sufficient data. Do we have any benchmark or validation management?

I'll come to that. When I talked about the data management bit, I mentioned retention, replication, import, export. But what is really there is what we call the lifecycle data management framework, and retention is just one plugin there. You could have your own plugin which says, OK, I need to count the records, and if it is less than a particular threshold, I won't even consider the data. You can write that plugin right there. So the ones I talked about are plugins that come out of the box, but the lifecycle framework lets you plug in exactly the kind of logic you're talking about.

Hello, yeah, one question. In the first slide you showed a Falcon server. Do we need to have a dedicated server for it?

It's a simple Tomcat server that runs. Typically we don't have a dedicated server; there are multiple processes that run alongside it. It's a very lightweight process, a simple web app, so it can share a box with anything else that you have.

OK, so it can be invoked even through REST APIs?

It can, yes.

One question here. The declarative processing that you mentioned, how complicated can the processes written in it be? Can we do some kind of window operations, or is it pretty much temporal data?

So, the windowing. You can actually specify the window; I probably didn't call it out. Here, if you see the input, there's a start and an end to it. That is the window we talk about. When this says now, it means this point in time, and minus one is the last hour's data. This is the hourly click processing we talked about in the example: it will consume anything from the last hour up to now. That is the data window.

And one more question: the retention framework that you mentioned, does it take care of referential integrity also, or is it pretty much file-based?

It's file-based right now; it doesn't.

OK, thanks.

So this is what we were talking about: you can track the lineage via the dependencies. When I say lineage, I mean: whatever data was produced at 12:30, where did it actually come from? That is the lineage part. There's an API where you say, this is the summary, and this is the hour's data that I'm trying to track; it takes you all the way back along these two processes, tracks both of them, and says, here's where the data came from. That's what I mean by lineage: you can track it using instance dependencies.

A follow-up on lineage: do you only track lineage within the Falcon declarations, or do you go into Oozie and all that as well?

No, because Falcon is only aware of what it knows. Beyond that, it's a little beyond Falcon's scope to figure out how things are done. As long as you declare something as a Falcon process or feed, it will track it, but not outside.

Not outside.
Hi, on the process management, do you support conditional execution, and maybe loops and such?

Sorry?

Oozie supports conditional execution and so on. Do you support conditional execution?

So, the thing is, there are a lot of things that Oozie does, and what we are saying is: reuse that. Falcon is not going to rebuild any of those. We do have plans to build something called a native scheduler, but we say, if there are Oozie capabilities that you need, go ahead and give us an Oozie workflow, and we just run it.

Is the same true for Azkaban as well?

No, we don't integrate with Azkaban right now. It's pretty much Oozie, Pig, Hive, and Spark.

Is that pluggable, so someone could build Azkaban support?

Yes, you can build it. The engines are pluggable. Like I said, it's just an execution engine that we use; it doesn't matter which. Right now it's Oozie, and we have something native also. It could be Azkaban tomorrow, but you would have to build it.

OK, thanks.

Yeah, do you have connectivity to Elasticsearch from Falcon?

Only in terms of import and export, if you want to send something out to Elasticsearch or get something in from it. But like I said, we only have JDBC adapter support, so as long as you have an adapter sitting there which pulls from Elasticsearch over JDBC, it's fine; otherwise there's no native integration as such.

OK. So if I want real-time data in my Elasticsearch, I'd use Kafka for the streaming information. Then for processing the data before putting it into Elasticsearch, would Falcon be useful?

Although we do plan to bring streaming into this as well, right now it is pretty much restricted to batch processing. So you would want to dump the data and then do aggregations on top of it. The streaming integration is not there right now, but it's on the roadmap.

OK, thank you.

We have time for one last question.

Hello. When the data arrives and the job fails while processing it due to unknown reasons, and later you want to re-trigger just that one job, does Falcon do that by itself?

Yes. Even if there are sub-actions in your process, it understands where it failed and re-triggers from the point of failure. It doesn't redo everything all over again.

How does it know? Do we give it a job instance ID?

Yes, yes, it figures it out. You can force it and say redo everything, but by default it re-triggers from the point of failure.

All right, thanks everyone.