We are building a new data management and processing platform called Ivory, and that is what this talk is about. For the last few years, effectively all of our data processing has happened on Hadoop; the conventional SQL style of processing is used only for the final, last-mile delivery of reports. What we have realized is that this data needs to be managed in a much more seamless fashion than simply treating it as a set of files and directories with Hadoop jobs running over them.

To give a sense of scale, about 3 billion events flow into our Hadoop system every day, amounting to several terabytes of data. The data arrives in minute-wise and hour-wise chunks and needs to flow through pipelines, where a pipeline is a series of processing steps that run one after the other. These pipelines are very SLA-sensitive, so we have to make sure they run on time and nothing breaks. Beyond that, we need to maintain retention policies on the data, archive it, and copy it across multiple data centers. So there is a lot of routine lifecycle management to do on each feed, as well as managing the processing itself. That is the need we saw, and that is why we started building this platform called Ivory.

Let me go over, in a bit of detail, the use cases we are trying to solve with this product. The first is feed management. We wanted to stop looking at data as just a bunch of directories and files that land on our clusters. We wanted to look at it as a logical table, which is what a metadata repository like HCatalog does, with the time at which a feed arrives as one of the partitions. We also want retention policies applied to this data, so that old data is removed. The second use case is that feeds configured in the platform should be archived at the end of their life, before they are removed from the cluster. The third is that a feed may be generated in one colo, one data center, and be required in another data center for processing or aggregation, so replication is a very important use case for us.

Once the feeds, meaning the data, are on the cluster, we need to let our processing sit on top of those feeds and consume them. In a classic scenario, logs flow into each of the data centers, get processed locally in those data centers, and a final summary is eventually consumed for reporting or feedback purposes. If data were unavailable in one data center, say due to an outage, it may still be available in the other data centers. In that scenario, the data available from the live data centers would be used for the aggregations, but the data from the inactive data center may eventually arrive, and we should be able to go and fill that gap as the data becomes available, in an eventual-consistency model.
So you could have processing that runs for a particular hour with partial data because some data is unavailable; the processing may complete, and only afterwards does the remaining data become available. And there are dependencies: one log may be used to process the data, producing a summary or some analytics, which may then be used for further analytics. So if the first feed is delayed, it causes cascading delays through the whole pipeline. What the Ivory system tries to do is identify the full lineage of the entire data flow, that is, which process depends on which feed, and if reprocessing is triggered on any of them, it automatically re-triggers the entire downstream pipeline and makes sure all the analyses are consistent with each other. That is the lineage use case, something we frequently run into, and Ivory takes care of that scenario.

The other use case this system tries to solve: typically we would do all the feed processing using a workflow engine, and sometimes retries are not handled properly. The Ivory system lets retry policies be associated with these feed processing pipelines and enforces those policies. And because it is aware of all the feeds on the cluster and all the processing that happens on them, it can provide better SLA alerting: if an SLA is going to be missed due to delays, or data is not available for a particular pipeline to kick off, we can generate alerts for operations to look into it and identify the root cause of the problem; and if pipelines are stalled because of job execution delays, even that can be tracked through the system.

We also want to integrate a metadata repository like HCatalog into our environment, which will allow our feeds to be registered with HCatalog so that consumers can consume the data without having to go to a specific directory in HDFS and access it directly. You then start looking at Hadoop more like a data warehouse, rather than just a data processing and storage cluster. Ivory is trying to find the gaps that make it difficult for us to reach that state, and to fill those gaps.

The architecture is fairly simple. Ivory allows users to configure three entities in the system: clusters, feeds, and processes. A cluster defines the infrastructure endpoints in a colo. A feed defines the schema, its location on HDFS, the retention policy, the late cutoff after which the data should be disregarded, the frequency of the feed, and so on. A process defines the input feeds it uses and the output feeds it generates. Together, this provides a complete lineage and dependency graph of all data on the cluster and how it is produced and consumed.
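To make that concrete, here is a minimal, hypothetical sketch of how declaring inputs and outputs yields a lineage graph. The element names, attribute names, and feed names are illustrative assumptions, not Ivory's actual schema:

```xml
<!-- Two hypothetical process definitions. Because each process declares the
     feeds it consumes and produces, the system can infer that "daily-rollup"
     depends on "enrich-clicks" and re-trigger it if upstream data is redone. -->
<process name="enrich-clicks">
    <inputs>
        <input name="raw" feed="raw-clicks"/>
    </inputs>
    <outputs>
        <output name="enriched" feed="enriched-clicks"/>
    </outputs>
</process>

<process name="daily-rollup">
    <inputs>
        <input name="clicks" feed="enriched-clicks"/>
    </inputs>
    <outputs>
        <output name="summary" feed="click-summary"/>
    </outputs>
</process>
```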
So, through the Ivory system, it is possible to figure out who produced a piece of data, which processes consumed it, what data those processes produce, and who in turn consumes that data. You can actually build the complete lineage of the data. If there is a problem in one of the processing pipelines, for instance, Ivory can immediately figure out which dependent processes are stalled because of it, so we can take corrective action and also unblock the original pipeline.

Second, the Ivory system, through its REST API, can talk to the underlying workflow engine and get the status of each of the instances that run for a particular feed or process. For example, if a feed is configured in the system with a replication requirement to copy its data from one cluster to another, Ivory will automatically trigger a replication instance every hour, and each replication instance can be tracked through the Ivory system. Process executions can likewise be tracked based on the frequency of the process: a process, like a feed, can be defined to run at a periodic rate and can depend on multiple feeds, and those process instances can be tracked as well.

What Ivory does is not much of the heavy lifting. All it does is maintain state about the dependencies and hook itself into the workflow execution path so that it tracks the end of each workflow. The workflow itself is executed by the scheduler, which is pluggable. At InMobi we use Oozie for workflow execution, but the Ivory system is built so that a different workflow engine of choice can be plugged in. Once the workflow engine completes its activity, we interject that execution, and a message flows through our messaging service back to Ivory and on to the consumers. That message flow lets us figure out what happened to a particular workflow or process instance: manage retries, detect whether data was only partially processed and should be processed again, identify the execution time, whether it is missing its SLA, and whether it will impact any downstream processing. All of that Ivory is able to identify by hooking itself into the workflow completion path; a sketch of that hook follows below.

Ultimately, the workflow scheduler talks to Hadoop to submit the jobs. Everything I spoke about earlier, retention, replication, archival, feed processing, is eventually scheduled as a workflow in the workflow scheduler. Ivory does not have a scheduler of its own and does not do any heavy lifting; it lets everything be processed by the scheduler over Hadoop, so everything is a MapReduce job at the end of the day. As I mentioned, you can pick your scheduler of choice; our current implementation is over Oozie, so it delegates all the scheduling and workflow management to Oozie in our case. All the activities doing real work, whether for feed management or process execution, are delegated to the workflow execution engine.
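As a rough illustration of the completion hook, here is a hypothetical Oozie workflow that wraps the user's workflow as a sub-workflow and then publishes a status message back over JMS. The notifier class name and the properties are invented for this sketch; this is one way such a hook could be wired, not Ivory's actual code:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="ivory-wrapper">
    <start to="user-workflow"/>

    <!-- Run the user's own workflow untouched. -->
    <action name="user-workflow">
        <sub-workflow>
            <app-path>${userWorkflowPath}</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="notify-success"/>
        <error to="notify-failure"/>
    </action>

    <!-- Post-processing hook: publish a JMS message so the management layer
         learns the outcome and can drive retries and SLA checks. -->
    <action name="notify-success">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.ivory.JmsNotifier</main-class>
            <arg>--instance=${wf:id()}</arg>
            <arg>--status=SUCCEEDED</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <action name="notify-failure">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.ivory.JmsNotifier</main-class>
            <arg>--instance=${wf:id()}</arg>
            <arg>--status=FAILED</arg>
        </java>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```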
The Ivory system offers a REST API over which consumers can query the status of the instances running in the system and can also manage the lifecycle of those instances: rerun, schedule, suspend, all the lifecycle operations on both process instances and feed instances. That is basically how you use Ivory.

So there are three basic entities in Ivory, cluster, feed, and process, and this is what you configure in the system. A cluster is the infrastructure component that you work with. It provides read, write, and workflow execution interfaces; essentially it gives you the endpoints of the environment that a particular Ivory instance is going to work against. It also lets you provide default job configurations that you do not want to repeat, so cluster-wide defaults for all jobs that run in that environment go into this configuration.

The feed definition has the frequency, the clusters on which the feed is valid, the partitions on the feed, the late cutoff after which the feed's data is no longer consumed by any of the consumers, the schema that will be used for registering with HCatalog, the location of the data on HDFS, and the ACLs: owner, group, and permissions. All of these together define the feed, and you define all the data on your cluster as one or more feed definitions.

A process then consumes one or more feed instances and produces one or more output feed instances. You define your workflow as part of the process definition, and the workflow gets executed whenever an instance is materialized. You can associate retry policies: when to retry, how to retry, how many times to retry the process if there are failures, and which kinds of failures should be handled in which way. You also specify, if data were to arrive out of order or late for this feed processing pipeline, how to handle that. All of these are extensible policies that you can state in the process definition; a sketch of how such policies might look appears below.

So through these three entities that you define, cluster, feed, and process, you articulate what data is stored on which cluster and how it gets processed and produced. All of it is written through these entities.

We have been using Ivory for a little while now; actually, it has been about a week since we launched it. It offers only these capabilities today, though much more is on the roadmap for us to build over the next quarter and beyond. In terms of functionality it provides all the entity management, feed retention, process lifecycle management and feed lifecycle management, a detailed dependency graph across the three entity types registered with Ivory, a CLI, and notifications through JMS. Those are the capabilities implemented in the version we have in production today.

This is an open source effort; everything happens publicly and we have put the code out there, so if anybody has similar problems to work on, it is open for collaboration.
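As a concrete illustration of the retry and late-arrival policies described above, here is a hypothetical snippet. The element names, policy names, and the duration notation are assumptions for illustration only, since the exact schema is not shown in the talk:

```xml
<process name="hourly-summary">
    <!-- ...cluster, frequency, inputs, and outputs elided... -->

    <!-- Retry a failed instance up to 3 times, 10 minutes apart. -->
    <retry policy="periodic" delay="minutes(10)" attempts="3"/>

    <!-- If an input feed arrives late, reprocess with a backoff delay
         until the feed's late cutoff passes. -->
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="rawClicks" workflow-path="/apps/clicks/recompute"/>
    </late-process>
</process>
```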
With that, I am open for questions. We have a couple of questions.

The first question was whether there is a monitoring or tracking system that can track the progress of data inside the workflow, that is, which node the data is at right now. So, as part of the status of a process instance, we provide the user all the actions that are executing, their current state, and along with that the HTTP URL to access the log. We also archive the log files for processing later, so they are permanently stored in the Ivory system. So yes, the Ivory system can tell you, down to the step, what is actually happening for running as well as completed workflows; it can provide you up to that level. And the workflow is initiated using an API call.

The next question: what if I have a use case like PageRank, which needs iterative MapReduce jobs? Is there a way to do that sort of thing on Ivory? That would need iterative cycles of the workflow. We have similar use cases: hour X+1's processing depends on hour X's data, so instance X is the input for instance X+1, and so on. That is a regular use case that we support in the Ivory system. You can configure a feed, and the feed can be both an input and an output of a process, which lets you create that cycle. The only thing is that you associate a frequency with the feed. Say the feed runs every 5 minutes: the 0th-minute file is available, you start processing it, you produce the 5th-minute file, and the 5th-minute file automatically becomes the input for the 10th-minute processing. As soon as the data becomes available, the next instance triggers; it will not trigger unless the previous instance is available, so the dependency is maintained. A sketch of such a self-referential feed follows below.

Follow-up: so for every iteration you make another HTTP API call? No, we create the workflow appropriately as required to execute this use case, schedule it with Oozie in our case, and let Oozie handle it. We are able to express all these requirements appropriately to Oozie. But Oozie only supports DAGs, right, one-way? The point is that each DAG completes and generates one workflow instance, and the next instance of the workflow depends on the previous instance. Each DAG completes, and the next DAG's execution depends on the previous DAG's. That way you maintain the DAG semantics and let Oozie execute it; each next instance builds on the previous one.
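Here is a minimal, hypothetical sketch of that iterative pattern: a feed that is both the output of one process instance and the input of the next. The instance expressions such as now(0,-5), meaning "the previous 5-minute instance", are assumed notation for illustration, not necessarily Ivory's actual syntax:

```xml
<!-- A 5-minute feed that feeds back into its own producer. -->
<feed name="page-rank">
    <frequency>minutes(5)</frequency>
    <!-- clusters, locations, and retention elided -->
</feed>

<process name="page-rank-iteration">
    <frequency>minutes(5)</frequency>
    <inputs>
        <!-- consume the previous instance of the same feed -->
        <input name="previousRank" feed="page-rank" instance="now(0,-5)"/>
    </inputs>
    <outputs>
        <!-- produce the current instance; it becomes the next run's input -->
        <output name="currentRank" feed="page-rank" instance="now(0,0)"/>
    </outputs>
</process>
```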
The next question was about the scheduler. I mentioned the scheduler is pluggable, so we can integrate with any scheduler; but Oozie and Azkaban, for instance, are very different, they are not on par, so how do we plan to integrate? The system is built to be open for extension. If there is a large capability gap in another scheduler, it becomes difficult for us to integrate it; we have to find a scheduler that provides most of the capabilities that we need, and Azkaban may fall fairly short. If there is a different scheduler that is very close to what Oozie can do, that is probably a good choice. The way our implementation works is that the Ivory system takes the process requirements and maps them to a workflow input that the workflow engine accepts. We have a feed mapper and a process mapper, which map the Ivory feed and process definitions into the corresponding input that the workflow engine understands. In the case of Azkaban, we would have to build a lot of the functionality that Oozie offers, for instance data-availability checks, which may be non-trivial and may make it very hard. If you have a similar or on-par workflow scheduler, you can probably integrate it. But as of now, we have found Oozie very suitable for most of our requirements, hence the choice of Oozie in our default implementation.

Someone asked whether we actually generate the Oozie workflow: yes, we do. And yes, I can put up the definition of the configuration; I will show it in a moment.

On archival: there can be multiple implementations for archival. In our case, we just archive to a box with large storage, and eventually that storage can be moved to tape or whatever. The way we implement it is that you provide a policy for the archival and an implementation for the archival, and it is extensible; we will have a default implementation, but most of the things we have are open for extension, nothing is finalized. So yes, you could absolutely write your own implementation. The only restriction is that we want a scalable system and need to be able to schedule the work, so it will ultimately run as a MapReduce job.

So let me put it up. This is how a cluster definition looks in Ivory. You define multiple interfaces and endpoints: in this case, read-only endpoints, write endpoints, execute endpoints, workflow endpoints, and messaging endpoints, plus which colo the cluster is in. The colo is very important for replication, so it is also a property of the cluster. Again, this will go through a lot of iteration in our next release, because what we have launched in production is a basic version.

This is how a feed looks. A feed has multiple partitions and can belong to a feed group. Then there is the frequency, the late cutoff period, which clusters the feed is available on and the time range for which it is valid on each cluster, the retention period for the feed, the location where it is present on HDFS, and the schema. For retention, there is an action, and we support multiple actions: delete will nuke the data; change-permissions will just change the permissions so the data is inaccessible but kept somewhere; and archival will move it to the archive. If the retention action is archival, there is a separate section in the feed, not shown here, which says how the archival has to be done.

Finally, the process definition. You define the cluster on which you want to execute, the frequency of the process, and the validity time periods. Then you define inputs and outputs, specify any properties you want passed to the workflow, and finally specify the workflow to execute. In this case the workflow engine is Oozie, and the user has written an Oozie workflow; that is, in fact, the only engine currently supported for workflow execution, but you could integrate, say, Spring Batch or whatever.
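Since the original slides are not reproduced in this transcript, the following are hypothetical reconstructions of roughly what the three entity definitions could look like, based only on the fields described above. Every element name, endpoint, and path here is an illustrative assumption:

```xml
<!-- Cluster: infrastructure endpoints for one colo. -->
<cluster name="primary" colo="us-east">
    <interfaces>
        <interface type="readonly"  endpoint="hftp://nn1:50070"/>
        <interface type="write"     endpoint="hdfs://nn1:8020"/>
        <interface type="execute"   endpoint="jt1:8021"/>
        <interface type="workflow"  endpoint="http://oozie1:11000/oozie"/>
        <interface type="messaging" endpoint="tcp://broker1:61616"/>
    </interfaces>
    <properties>
        <!-- cluster-wide default job configuration -->
        <property name="queueName" value="default"/>
    </properties>
</cluster>

<!-- Feed: a logical table over time-partitioned HDFS data. -->
<feed name="raw-clicks">
    <groups>clickstream</groups>
    <frequency>hours(1)</frequency>
    <late-arrival cut-off="hours(6)"/>
    <clusters>
        <cluster name="primary">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
    <ACL owner="etl" group="analytics" permission="0755"/>
    <schema location="/schemas/clicks.avsc" provider="avro"/>
</feed>

<!-- Process: consumes and produces feed instances via a black-box workflow. -->
<process name="hourly-click-summary">
    <cluster name="primary"/>
    <frequency>hours(1)</frequency>
    <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    <inputs>
        <input name="input" feed="raw-clicks"/>
    </inputs>
    <outputs>
        <output name="output" feed="click-summary"/>
    </outputs>
    <properties>
        <property name="reducers" value="16"/>
    </properties>
    <workflow engine="oozie" path="/apps/clicks/summarize-workflow.xml"/>
</process>
```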
Whatever you plug in, the workflow is just a DAG that needs to be executed, and Ivory will make sure it gets executed; that is how this will be taken care of. And finally, you have a retry policy in the process. There is also a late-processing section with a late cutoff, which is not in this example, which says: if input A is delayed, what do you do about it? So you have late policies that you can specify here as well. And to Ivory, the workflow itself is just a black box.

Question: so what do we give the system that gets converted into an Oozie workflow XML? Do you supply the input in some format from which the Oozie workflow is generated? This is all you give to the Ivory system; the only Oozie-specific thing is the workflow XML that you write yourself. The workflow XML works on this contract: you use variable names that are the same as the input and output names defined here, and those variables are the only things available to you. It is like writing the workflow as a function that takes arguments; the three arguments here are the names of the inputs, the properties, and the names of the outputs. Within those constraints, you can write whatever you want inside the workflow XML and Ivory will just execute it. The workflow XML is completely a black box as far as Ivory is concerned; it just makes sure the inputs, the outputs, and the properties are passed down to the workflow.

Question: so in a sense, if you were to change the workflow engine, all your existing workflow definitions would have to be ported to the new engine's workflow language as well; you are not capturing the workflow metadata in your own representation? We planned to, but we decided not to, because different workflow engines offer a wide range of capabilities. If we were to offer a workflow definition in our Ivory process, it would have to be a superset of everything so that we could talk to any engine, and if the engine in use did not support some of it, we would have unsupported operations. That would be fairly complex to manage. The questioner's point was that it need not be a superset: you can represent whatever Oozie constructs you want in your own representation, because otherwise you cannot replace the workflow engine so easily; if all your production is running on lots of Oozie workflows, there is no way to convert them to a new engine easily. I agree. Normally, the standard things you would execute on a Hadoop environment are captured in Oozie, and we could probably map to that. But the choice of going with Oozie for now was that we did not want to take on the burden of managing the workflow definition in our own system, so we thought of delegating and chose that model. If there turn out to be many use cases for working with different engines, especially if Ivory becomes widely adopted and people frequently want to switch from one scheduler to another, we will definitely have to implement that.

So in a sense, Ivory is another workflow engine, an abstract workflow engine, and a policy engine as well: a mix of both, with well-defined policies for the fixed things we know about, archival, retention, and so on, managing them and running everything mostly via Oozie. All right. Thanks.