Good afternoon everyone. I am Vimal Sharma, a software engineer at Hortonworks, and I will be talking today about Apache Atlas. A quick show of hands: how many of you have heard of Apache Hadoop, or at least used it? Yeah, so the majority. And how many of you have at least heard of Apache Atlas? Okay, that's good to know. So this is probably a good audience for me to introduce Apache Atlas and show how it works. With respect to Apache Atlas, I am a member of the project management committee and a committer on the project.

Some details about the project. Development on Apache Atlas started in late 2014, and the project entered the Apache Incubator in May 2015. Developers from organizations like IBM, Hortonworks, Aetna, Merck and Target are involved in the evolution of the project. There have been three releases in the past year: 0.7 in July 2016, 0.7.1 in January this year, and 0.8 in March this year. Atlas graduated to a top-level Apache project very recently, last month.

A high-level overview of the project. Apache Atlas is a governance and metadata management framework which was initially built for the Hadoop ecosystem, but it is generic and flexible enough that any arbitrary component can be modeled and its metadata captured. We can capture two types of metadata. The first kind is the metadata of a data asset, for instance a Hive table or an HBase column family. The second kind is the metadata of the processes occurring within or across those components. So if there is a CTAS (create-table-as-select) query in Hive, that can be captured in Atlas; and if there is a YARN job which picks up some data from HBase, does some processing and then dumps it into Kafka, these kinds of events can also be captured. Apart from metadata capture and visualization, Atlas also provides the facility to classify metadata entities using tags, which can be used in tandem with Apache Ranger to enforce security policies. Atlas has built-in support for many popular Hadoop components like Hive, Storm and Sqoop, and its architecture is flexible enough that any component can be modeled and captured.

Let's look at some use cases for the governance problem. The first use case is that of extract-transform-load (ETL) pipelines. If I am the owner of an ETL pipeline, how do I narrow down an upstream failure? Say my source dataset depends on an upstream ETL pipeline which fails, so there is a data quality issue or the data itself is missing. How do I debug these kinds of issues? Also, if there is a downstream ETL pipeline, a downstream dataset which derives from my computed dataset, and there is a failure in my pipeline, how do I alert the downstream owners so that they can take proper measures to correct the data quality issues? A visual lineage and a record of such dependencies would be very helpful, and Atlas does exactly that. The second use case is redundant processing. If I am a developer and I want to compute some information based on a source dataset, can I avoid redundant processing? If I have some mechanism to know whether that information is already available in one of the existing datasets, I can skip the computation. We will see that we can exploit the Atlas classification feature to know whether the information we need is already there. The third use case is compliance and security from a business point of view.
If we want to restrict access to sensitive information to some set of users, how do we enforce those policies? The standard solution would be to apply Ranger policies on each of these datasets, but that would be unwieldy: datasets across components can hold sensitive information, for example a Hive table as well as an HDFS file. So how do we make sure there is a single policy we can apply whose rules hold across components? The fourth use case is that of the cluster admin. Very often, Hadoop cluster admins need to clean up their cluster by removing dormant and unused datasets. If there is some mechanism to compute a relevance score for a dataset, low-relevance datasets can be archived or deleted from the cluster altogether. Using Atlas lineage diagrams and classification features, cluster admins can come up with such a relevance score, based on which they can decide whether to archive or keep a dataset.

This is the Atlas architecture. At the core of Atlas is its type system, which we will look at in more detail in further slides. Apart from the type system, there is the ingestion and export engine, which is responsible for ingesting metadata events and entities as well as exporting them. Atlas models the metadata as a graph and uses the Titan graph library to do this. The metadata is actually stored in HBase, and there is a Solr-based indexing engine which improves search over the metadata. As I mentioned earlier, Atlas has built-in support for components like Hive, Sqoop and Storm, and metadata events in these components are communicated to Atlas via a Kafka queue. Apart from this, Atlas publishes tag-addition and tag-deletion events on entities to another Kafka queue, to which Apache Ranger is a subscriber; Ranger uses these events to enforce the policies defined on a particular tag. On top of all this there is a REST API which can be leveraged for all of these operations: ingestion, export, registration of types, and search.

Cross-component lineage is one of the central features of Apache Atlas. Lineage is a visual diagram of the dependencies among datasets. If we have created an external table over some HDFS path, the lineage will look like the one shown in the diagram from the Atlas UI. Individual components like Hive or HDFS have their own metadata stores, and their logs can be used to investigate what events happened within that component. Where Atlas comes into the picture is events which cross components. Say there is a Spark job which picks up some data from HDFS, does some processing, and then dumps it into a Kafka topic. In this case, neither the Spark logs nor the HDFS logs alone would be enough to connect this event end to end. Atlas can be leveraged to define models for several components and capture all the events occurring across them.

So Ranger is a listener on the Kafka topic to which Atlas publishes tag-addition and tag-deletion entity events, and if we have defined security policies on top of those tags, they start applying to that particular dataset. So we have attribute-based policies rather than asset-based policies. We don't need to define policies individually for each dataset; we can define policies based on a tag, and whenever we attach that tag to a dataset in the Atlas UI, the corresponding policies start getting applied.
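Since every operation described above goes through the REST layer, here is a minimal sketch of what client calls might look like. This is an illustrative sketch rather than the talk's demo code: the host, credentials, and GUID are assumptions, and the v2 endpoints shown are from the Atlas 0.8-era REST API, so verify them against the docs for your version.

```python
# Hedged sketch: basic search and lineage over the Atlas v2 REST API.
# Host, credentials and the GUID below are placeholders/assumptions.
import requests

ATLAS = "http://localhost:21000/api/atlas"   # assumed Atlas server URL
AUTH = ("admin", "admin")                    # assumed demo credentials

# Full-text search (served by the Solr-backed index): every entity that
# mentions the keyword in any attribute comes back in the result list.
resp = requests.get(f"{ATLAS}/v2/search/basic",
                    params={"query": "payroll"}, auth=AUTH)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["typeName"], entity["attributes"].get("qualifiedName"))

# Lineage for one entity, addressed by its GUID (placeholder below);
# the response lists the upstream/downstream relations the UI draws.
guid = "00000000-0000-0000-0000-000000000000"
lineage = requests.get(f"{ATLAS}/v2/lineage/{guid}", auth=AUTH).json()
print(lineage.get("relations", []))
```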
The type system is central to Atlas; it is the skeleton of the metadata we want to store. The type system is analogous to object-oriented programming: in OOP we have the notion of a class, which can have attributes and superclasses. Similarly, a type in Atlas has attributes, a unique name, and a set of supertypes. An instance of a class is termed an object in OOP, and an instance of a type is called an entity in Atlas terminology. Attributes can have several properties: an attribute can be mandatory or optional; it can be unique, meaning it identifies the entity uniquely across the Atlas repository; it can be composite, meaning the lifetime of the attribute is controlled by the parent entity. An example would be Hive columns: Hive columns have no identity outside of the Hive table in which they reside, so if we delete the Hive table, the corresponding columns should be deleted as well. Reverse reference is another property, mainly used as a back pointer to the enclosing entity. Our Hive table example applies here as well: each column holds a back reference to its enclosing Hive table.

Atlas has a bunch of base types which are predefined for you and bootstrapped whenever the Atlas server starts. We will go through some of them. Referenceable is the type with a mandatory attribute named qualifiedName, a unique attribute which identifies a metadata entity uniquely across the Atlas store. Asset is used for entities which have some notion of ownership; it has a mandatory attribute name and optional attributes owner and description. DataSet derives from Referenceable and Asset, and it represents the data entities actually stored in Atlas; a Hive table, for instance, would be an instance of DataSet. Process also inherits from Referenceable and Asset, and it has optional attributes named inputs and outputs. Process is the type responsible for tracking lineage in the Atlas store: when we navigate to the Atlas UI, we see the set of input datasets that were used and the output datasets that were derived.

We will be modeling a Spark DataFrame type in our demo, so a little introduction to Spark first. In Spark there is the concept of an RDD, the basic unit of execution in the Spark framework, and DataFrames are a special kind of RDD which carries some notion of relational structure in its data: maybe a JSON file in which the data has structure, or a Hive table in which the information is related in some sense. So let's try to model the Spark DataFrame type. As outlined in the slide, the Spark DataFrame type will inherit from the DataSet type and will have additional attributes: source will be a mandatory attribute indicating which source the data is derived from, while destination and columns will be optional attributes. The DataFrame column is another type which also inherits from DataSet; it has a mandatory attribute type, indicating the kind of data stored (whether it is a string type or an integer type), a dataframe attribute which is a reference to the parent DataFrame, and an optional comment.
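To make the modeling discussion concrete, here is a minimal sketch of how the two types above might be registered. The JSON shape follows the Atlas v2 typedefs API (available from Atlas 0.8); the attribute names mirror the slide, but the exact payload, type names, host and credentials should be treated as assumptions to verify against your Atlas version.

```python
# Hedged sketch: registering the spark_dataframe model via the v2 REST API.
import requests

ATLAS = "http://localhost:21000/api/atlas"   # assumed Atlas server URL
AUTH = ("admin", "admin")                    # assumed demo credentials

typedefs = {
    "entityDefs": [
        {
            # spark_dataframe inherits qualifiedName/name/owner from DataSet.
            "name": "spark_dataframe",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "source", "typeName": "string",
                 "isOptional": False, "cardinality": "SINGLE"},
                {"name": "destination", "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"},
                {"name": "columns",
                 "typeName": "array<spark_dataframe_column>",
                 "isOptional": True, "cardinality": "LIST"},
            ],
        },
        {
            # Each column is itself a DataSet with a mandatory data type,
            # a reference back to its parent frame, and an optional comment.
            "name": "spark_dataframe_column",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "type", "typeName": "string",
                 "isOptional": False, "cardinality": "SINGLE"},
                {"name": "dataframe", "typeName": "spark_dataframe",
                 "isOptional": False, "cardinality": "SINGLE"},
                {"name": "comment", "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"},
            ],
        },
    ]
}

resp = requests.post(f"{ATLAS}/v2/types/typedefs", json=typedefs, auth=AUTH)
resp.raise_for_status()
```

Once the POST succeeds, entities of these types can be created, searched, and drawn in lineage diagrams just like the built-in Hive types.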
So as I mentioned earlier, Atlas stores metadata entities as a graph. This is a snapshot of the type we have just defined and its entities. Vertex one represents the Spark DataFrame type; vertex three represents a DataFrame entity, and it has edges to its type as well as to the columns it contains. Vertices four and five are the column vertices; they carry their attribute values and have edges to the DataFrame column type.

Let's consider a use case in which there is a salary-disbursement Spark process. The payroll details of all employees are in one HDFS path, which contains personal details like monthly salary, bank account number, name and so on. There is another HDFS path which contains the variable components, like bonus and stock options. The Spark process picks up data from these HDFS paths, does its processing, computes the monthly salary for all the employees, and finally dumps it into a Kafka topic from where the actual disbursement is done. So as you can see... not visible, right? Here I have declared the Spark DataFrame type. Is it visible? No? Okay, so let's assume that I have put the model in place for the Spark DataFrame and DataFrame column types, registered the entities using this piece of code, and linked them together using this other piece of code. Now, if we go to the Atlas UI and search for the process we registered, we can see the lineage has been created: these are the source HDFS paths; this is the process which picks up data from the HDFS paths; this is the Spark DataFrame; and this is another process which puts the data into the Kafka topic. We can go down and inspect other attributes like the columns, the qualifiedName and the source. This is the tags tab, which shows the list of tags attached to this particular entity, and alongside it there is the audit tab, which shows the operations performed on the entity.

We can go ahead and attach any tag to this entity. Say I attach an "expires_on" tag to this entity and set its value to the 1st of September. This information will be relayed to Apache Ranger, and if there is a policy in Ranger of the kind "all expired datasets should not be accessible; they should not be exposed to users and admins," it will start applying to this entity as well. Then there is the tags view, in which we can see all entities categorized by the tags attached to them. We can see the entities carrying the expires_on tag, or navigate to PII (Personally Identifiable Information) and see the list of entities attached to that tag.

There is also search capability in the Atlas UI. Basic search is backed by the Solr index: if we type in any keyword, all entities containing that keyword in any of their attributes are returned in the list of results. Along with this, there is advanced search, in which we can specify more complex predicates and search by the values of the attributes of an entity.
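For readers who could not see the demo code on the slides, here is a hedged reconstruction of the registration step: it creates the two source HDFS paths, the output DataFrame, and the Process entity whose inputs and outputs give Atlas the lineage edges. All names, paths, and the use of the built-in hdfs_path type are illustrative assumptions, and the v2 bulk-entity endpoint is from the 0.8-era API rather than the talk's exact demo.

```python
# Hedged sketch: registering the salary-disbursement demo entities and the
# Process that links them, so Atlas can render the lineage shown in the UI.
import requests

ATLAS = "http://localhost:21000/api/atlas"   # assumed Atlas server URL
AUTH = ("admin", "admin")                    # assumed demo credentials

def ref(type_name, qualified_name):
    # Reference an entity by its unique qualifiedName instead of a GUID.
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}

entities = {"entities": [
    {"typeName": "hdfs_path", "attributes": {
        "qualifiedName": "hdfs://demo/payroll", "name": "payroll",
        "path": "/payroll"}},
    {"typeName": "hdfs_path", "attributes": {
        "qualifiedName": "hdfs://demo/variable_pay", "name": "variable_pay",
        "path": "/variable_pay"}},
    {"typeName": "spark_dataframe", "attributes": {
        "qualifiedName": "spark://demo/monthly_salary",
        "name": "monthly_salary", "source": "hdfs"}},
    # The Process entity carries the lineage: its inputs appear on the
    # left of the diagram and its outputs on the right.
    {"typeName": "Process", "attributes": {
        "qualifiedName": "spark://demo/salary_disbursement",
        "name": "salary_disbursement",
        "inputs": [ref("hdfs_path", "hdfs://demo/payroll"),
                   ref("hdfs_path", "hdfs://demo/variable_pay")],
        "outputs": [ref("spark_dataframe", "spark://demo/monthly_salary")]}},
]}

resp = requests.post(f"{ATLAS}/v2/entity/bulk", json=entities, auth=AUTH)
resp.raise_for_status()

# Attaching a tag (a classification that must already be defined, e.g. an
# assumed "expires_on" trait) is a separate call against the entity's GUID:
# requests.post(f"{ATLAS}/v2/entity/guid/{guid}/classifications", auth=AUTH,
#               json=[{"typeName": "expires_on",
#                      "attributes": {"expiry_date": "2017-09-01"}}])
```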
Okay, so the roadmap for Apache Atlas. As I mentioned, there is built-in support for components like Hive, Storm and Sqoop; built-in support for other popular components like Spark, HBase and NiFi is on the roadmap. Along with this, there is column-level lineage. As of now, for Hive there is table-level lineage available, but going forward we will be adding support for column-level lineage as well. Say there is a query of the form "create table destination as select some operation on columns a and b from the source table": the value in column x is derived from the source columns a and b. If we navigate to the entity page for column x, we will see a lineage diagram in which a and b and the source table are on the left-hand side, with the operation in the middle, deriving the value for x. Apart from this, import and export of metadata is on the roadmap: if all the metadata in the Atlas repository can be exported in a well-known format, it can be consumed by third-party metadata tools; and vice versa, if there is metadata in other third-party tools, can it be ingested into Atlas in a seamless manner? There is also work on an open discovery framework which will give data scientists the ability to go to an entity page and compute basic metrics on the data behind the entity. Say there is a Hive table: I want to know the data quality score for that table, or basic statistical metrics like the mean and median, to help determine that score.

These are the details of the project: the project website, the dev and user mailing lists, and the list of JIRAs with open issues. Atlas has been gaining significant traction recently, mostly because enterprise governance is such a critical area, and various enterprises have been actively using Apache Atlas to govern their clusters. Apart from this, Atlas has a very rich code base in Java and Scala, so I would like to take this opportunity to invite potential contributors to check out the project page and the code, and see if any of the open items interest you. We would be happy to welcome you on board to start taking up issues and contributing to the project. I will close the presentation here and open the floor for questions from the audience.

So, is column-level lineage supported built-in only for Hive? I mean, is support for other hooks on the roadmap as well? Sorry, I didn't get the question. You were asking about column-level lineage, right? Yeah, it's only for Hive, and on the roadmap as well it's currently only for Hive. If we want it for other components, we would have to extend it ourselves? I mean, it's flexible enough that you can easily extend the model for any other component and start publishing data into Atlas. For Spark as well, you can put the model in place, and once the model is there, you can register the entities using the REST API in Atlas.

So on one hand, people who have administrative access for the cluster, you know, like a Hadoop cluster: we use Apache Ambari, say, for configuring a component, installing a component, and other things. Why was this developed as a separate component and not within Apache Ambari, as a data-governance feature? Can you repeat the question? See, we have Apache Ambari for administrative tasks like installing HBase, or adding a new region server or a new data node in the Hadoop cluster, right? So the people who have access to Apache Ambari, the admins of Hadoop, are the ones who are also going to apply data governance policies.
Structurally, fundamentally, Atlas can be different from Apache Ambari, but from a user's point of view, would it not have made sense for them to be developed within the same project? I mean, Atlas is an open-source project, so you can spin it up independently of Ambari, start registering metadata, and connect it to Ranger to impose policies on top of all the data that is there. It doesn't depend on Ambari in any sense. No, it's not about dependence; I'm asking whether it would have made sense to develop the features you've built in Atlas as part of Ambari itself. I don't see it; I mean, Ambari is meant for managing Hadoop components, right? Atlas was started as an independent effort, basically from an industry point of view, to govern metadata. And it's different: it's extensible and flexible, so that any arbitrary component, whether a Hadoop component or not, can be modeled using Atlas and its metadata captured.

For the demo that you showed, the salary processor: how can we provide details about the processor, like what kind of processing it is actually doing? Do we need to add those as attributes? Yeah, we would have to add attributes to the model we defined. This was a very simple model I used for the demo, but we can add all the attributes we want to capture: say, how much time the job took, or what kind of CPU power it consumed. We can capture all those attributes and register them with Atlas. Can that be part of the hooks that are developed; when the Spark hook comes, will it be part of it? Yeah, that depends on the developer. If you're putting the model in place, then you can put in triggers after your Spark job, so when the job finishes, all this metadata is captured and reported to Atlas. Okay, thanks. And I have another question. At present, when we manage through Ranger, we allow even the end users to manage the policies. Is the same possible in Atlas; can they define governance on the data? Yeah, so Atlas integrates with Ranger in the sense that you can define tags in Atlas, and the information about tag addition and deletion is communicated to Ranger. On that same tag we can define policies in Ranger, like access policies or data expiry policies; Atlas closely integrates with Ranger in that sense. Atlas is not the policy management framework; it is meant to capture metadata and then enable actions on top of the metadata. In an enterprise cluster, the admin will have the power to define tags and attach tags to particular datasets. There will be proper user management; not just any developer can go and attach secure tags to a particular dataset.

If anyone is trying to use this for an enterprise application, what would be the key considerations for its scalability and high availability? Atlas is highly available, and it's scalable in the sense that it uses HBase as its metadata repository, so it's highly scalable: you can put in as much data as you want. It has been tested on enterprise hardware clusters; we did make some performance optimizations when issues were reported, but it works well on enterprise Hadoop clusters holding a lot of data.

Let's say I were to develop a hook for Elasticsearch. Can you hear me? Yeah.
Let's say I were to develop a hook for Elasticsearch using this: would I have to fork your code base and add it, or are there interfaces I can implement so it starts working for Elasticsearch? How easy or difficult is it? So, you won't have to fiddle with the Atlas code base. You'll have to have an understanding of the component you are trying to model: which attributes you want to capture. Once that understanding is in place, you can define the model, which is a JSON document, and then register that model with Atlas using the Atlas REST API. After that, you can start capturing the metadata for that component.

More questions? From Hortonworks, there is this schema registry, right? So is there any... Hi, could you please stand while asking the question? Yeah. Is there any intersection in the features between that and this? If yes, why is it a separate project? Schema Registry is actually quite different from Atlas. It is mostly for streaming applications, where we want to define beforehand what kind of messages will be produced to and consumed from a topic. Atlas is more from a metadata point of view, not the actual data: what kind of events are happening within a component or across components in the enterprise cluster. Does that answer your question? So if I just have to store the schema of whatever message is coming in, I can do that using this one as well, right? I don't need Schema Registry? No, actually. Schema Registry is, as I said, more suited to streaming applications: you define a schema and register it in a database, the producers use that, and the Schema Registry engine makes sure that only data following the registered schemas is produced. Atlas is more about what is happening across components or within a component. Okay, you mean per message, in the case of Schema Registry? I'm sorry to cut you short, but you'll have to take the discussion offline. Thank you. Thanks, Vimal, for the great presentation. Thank you very much.