This talk has been moved to the Systems Engineering and Hardware channel. It is on data governance. The summary: in a world that has become increasingly dependent on organization-level data sharing and collaboration for decision making, it is of utmost importance that we ensure airtight security for our data. That's when Apache Ranger comes to the rescue. So I'm going to share the presentation and play the video.

Let me share a few things I learned about data governance from my time at Red Hat. A brief introduction about myself: I'm Durga, and I'm currently pursuing a master's in computer science at Northeastern University, Boston. A little bit more about me: I listen to a lot of music, mostly stuff from my home country, India, but if something really catches my ear, I'll just go for it. Something else I also love doing is cooking, and I'm highly passionate about harnessing AI to solve game puzzles. That's a lot of talk about me; let's move ahead.

The growing importance of data. Just a century back, if you can recall, the world was not as connected as it is today, so data was highly localized. This also meant that we were never able to see any global patterns or emerging trends. Sometimes, questions that would have been quite obvious on a simple examination of pooled data instead led to divergent conclusions. But this changed with the advent of the internet. Now even the vacuum cleaners in our homes are connected to the internet. Medical devices, opinion polls, ad viewership data: you name it, and we can do something with the data we get from them. This makes data a key player in our society and, more importantly, in any enterprise environment. Hence, we need to ensure that data is consistent and usable, that its integrity is not compromised, and that it is secure and available to the right players at the right time. This is crucial for shaping the direction in which any organization charts its course. Sadly, Dilbert's strategy doesn't help at all.

So how do we do it, then? Well, data governance is the answer. There are a lot of ways to define data governance; you can see one from the Data Governance Institute right on the screen. But for the sake of simplicity, I would say data governance is nothing but the process of ensuring that data is secure, correct, and available to the right players.

So where do we stand today in terms of data governance? We are at a crossroads of sorts in terms of the progress we have made in securing data. On one hand, we have laws like the GDPR and the California Consumer Privacy Act; on the other hand, we have highly controversial laws like the Indian Personal Data Protection Bill that could result in a high level of state surveillance. The introduction of the GDPR was a turning point in the history of data governance. Although it pertained to the EU countries, it sent ripples across the global community and set precedents for similar legislative frameworks across the world. These are the different types of data governance regulations that have been adopted across the globe.

A direct impact of the introduction of such regulations can be something as visible as the prompt you see on a website that lets you choose whether or not to accept cookies. This gives the end user a choice of what information about him or her becomes available to the organization behind the website. Looking at the larger picture, the GDPR and similar laws forbid disclosure of the identity of an individual user in many scenarios.
But for analytics and other purposes, we still need figures. This could be sales data, usage patterns of some feature, exercise data, et cetera. We use data governance features such as identity and access management, single sign-on, allow and deny lists, and so on to provide hierarchical access to the different players involved.

These are the most famous data breaches in the world. The cost of the Equifax data breach was around $147 million. Over the past years there have been many large data breaches; this is just a visualization showing the number of records affected by the five biggest ones. It goes without saying that a compromise of enterprise data does not just affect the privacy of its customers. It guarantees PR hell and a loss of credibility that sends shockwaves that bring down the valuation of the company, translating to a loss for investors and everyone associated with it. As of 2019, the average cost of a data breach was $3.92 million, or an average of $150 per record. So it is of utmost importance that we guard ourselves against such incidents.

Today, we are lucky to have some well-built data governance solutions out there that fit every single kind of business. Some of you might be familiar with these names, but for those who don't know them, we have solutions from Alation, Collibra, IBM's data governance offerings, and Talend, to name a few.

So how do we enforce effective data governance? We'll concentrate on some robust open source solutions that are accessible to everyone, mainly from Apache. These include Apache Ranger, Apache Knox, Apache Atlas, and Apache Sentry. Specifically, we will see how Apache Ranger and Apache Atlas can be used to build an effective data governance system for an enterprise. Nothing in depth, just a walkthrough of what these tools offer, so that we are more conscious of what's available out there.

Apache Atlas. So what is Apache Atlas? Apache Atlas is a framework that provides data governance capabilities for organizations to build a catalog of their data assets and classify them. It wouldn't be fair not to mention open metadata management when I'm talking about Apache Atlas, since Atlas is designed to exchange metadata with other tools and processes. The figure shows the overall architecture of Apache Atlas. What it basically does is capture data lineage across components, recording the downstream and upstream of data assets across the enterprise environment. It also enables classification of data assets using tags such as PII, meaning personally identifiable information, or an expires-on tag defining the expiry of a particular data asset. Apache Atlas also enables free-text search on metadata, and it has a lot of APIs for custom metadata ingestion. Overall, it acts as a metadata repository.

These are some of the different types of entities in Apache Atlas. I've listed just two out of the many: one is hive_db, and the other is hive_table. And here is some sample data that I have ingested into Apache Atlas.
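As an aside, ingestion like this doesn't have to go through the UI or the bundled hooks; Atlas exposes a v2 REST API. Here is a minimal sketch of creating and then searching for a Hive table entity with Python, assuming a local Atlas instance on the default port 21000; the credentials, the qualified name, and the attribute values are all illustrative, not taken from the demo.

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # default Atlas port; adjust to your setup
AUTH = ("admin", "admin")                      # illustrative credentials

# Create (or update) a hive_table entity. qualifiedName is the unique
# identifier Atlas expects for Hive entities: db.table@cluster.
entity = {
    "entity": {
        "typeName": "hive_table",
        "attributes": {
            "qualifiedName": "sample.customers@primary",  # hypothetical table
            "name": "customers",
            "owner": "durga",
        },
    }
}
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
resp.raise_for_status()
print(resp.json().get("guidAssignments"))

# Basic search over the metadata repository, the API behind the UI's search filter.
hits = requests.get(
    f"{ATLAS}/search/basic",
    params={"typeName": "hive_table", "query": "customers"},
    auth=AUTH,
).json()
for e in hits.get("entities", []):
    print(e["typeName"], e["attributes"].get("qualifiedName"))
```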
This is the level of detail that Apache Atlas provides when you're talking about entities. Everything you ingest into Apache Atlas becomes an entity, be it a database, a table, a column, whatever it is. The picture on the left is a hive_db entity, and you can see it lists the different properties, such as the name, who owns it, and all those details. The one on the right is a hive_table entity. You can see the different columns associated with that particular table, the different timestamps (when it was created and when it was last accessed), the owner, and parameters such as the number of rows, and so on. That is the level of detail Apache Atlas provides, and you can find these entities using the search filter in the Apache Atlas UI. It is, no doubt, a highly enriched UI where you can access the metadata and classify it according to your use cases.

Moving ahead: Apache Atlas classifications and data lineage. Why do we need to classify data in any organization? To put it in simple terms, classification eases the process of identifying, retrieving, and processing data for everyone who has to deal with it. If you take a look at the figure shown here, you can see that once we tag a source as personally identifiable information, the beauty of Apache Atlas is that it automatically propagates and enforces that tag on all the entities derived from the source. These classifications, or tags, are later used by Apache Ranger (or some other tool) to enforce certain policies for data security, for example, a policy that data tagged as personally identifiable information should be accessed only by such-and-such group of people. Such policies can be defined in Apache Ranger. In this way, Atlas defines the classifications and Ranger enforces the policies using those classifications. So we can classify and track the data flow and lineage all across the organization using Apache Atlas.

Another important feature of Apache Atlas is the glossary section. A glossary provides appropriate vocabularies for business users, and it allows terms to be related to each other and categorized so that they can be understood in different contexts. These terms can then be mapped to assets such as databases, tables, and columns. You can see one such example in the figure, where all the terms related to banking, such as checking account and savings account, are grouped under a banking category.

Moving on, let's discuss another tool that is predominantly used in any data governance system: Apache Ranger. Getting into the details, Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Ranger provides you with provisions to create services for specific Hadoop resources such as HDFS, HBase, and Hive, and to add access policies to those services via a centralized platform, which is nothing but the Ranger admin portal. You can also create tag-based services and add access policies to those services, as I mentioned when discussing Apache Atlas: you create the tags in Apache Atlas, and you create the rules in Apache Ranger to enforce security. There are many more capabilities that Ranger provides, like giving security administrators deep visibility into the Hadoop environment through a centralized audit location that tracks all access requests in real time and supports multiple audit destinations, including HDFS and Solr.
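To make that "Atlas defines the tags, Ranger enforces the rules" split concrete, here is a hedged sketch of attaching a PII classification to the table entity from the earlier sketch, via the Atlas v2 REST API. It assumes the PII classification type already exists in Atlas, and it reuses the illustrative host, credentials, and qualified name from before.

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # default Atlas port; adjust to your setup
AUTH = ("admin", "admin")                      # illustrative credentials

# Resolve the entity's GUID from its unique attribute (qualifiedName).
entity = requests.get(
    f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "sample.customers@primary"},
    auth=AUTH,
).json()["entity"]

# Attach the PII tag. propagate=True is what pushes the classification to
# downstream entities in the lineage graph, as in the propagation figure.
requests.post(
    f"{ATLAS}/entity/guid/{entity['guid']}/classifications",
    json=[{"typeName": "PII", "propagate": True}],
    auth=AUTH,
).raise_for_status()
```

A Ranger tag-based policy keyed on the PII tag would then govern every entity the tag propagates to, which is exactly the division of labor described above.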
So that's a lot of talk. Let's go ahead and have a quick look at the Ranger admin portal for better understanding. This is the Ranger admin portal UI, where you can see the different types of services that can be integrated with Ranger. These plugins are used to enforce security policies and define security frameworks across the Hadoop environment. A few that are listed are Hive, HBase, HDFS, and Knox. These are the different types of plugins that can be used to enforce security policies.

Here is a gist of how you define Ranger policies in the Apache Ranger Hive plugin. These are the four generic default policies that are created when you enable the Ranger Hive plugin, and they let you enforce security at different hierarchies. For example, you can enforce policies at the database, table, and column level, on Hive services, on user-defined functions, and across URLs. You can select specific databases, tables, columns, or Hive services, or apply a policy generically to all of them. Each policy also specifies conditions under which a particular user has certain permissions, for example, select permission on a table, or drop or create privileges. So that's just a gist of the policies in the Ranger Hive plugin.

Let's talk about the Ranger Hive plugin a bit more. The scope of this session does not extend to a discussion of all the features of Ranger, due to time constraints, so I have chosen one plugin that I am quite familiar with. Basically, what this plugin is meant for is providing authorization across Apache Hive in a more granular way. Enabling this plugin lets you grant or revoke access to various Hive components, and it also enables row-level policies for authorization, as I said before. I would like to demo this plugin in action, but before proceeding to the demo, I would like to mention the versions I have chosen for this particular implementation: Ranger 2.0 and Hive 3.1.2.

Now, let me show a small demo of this plugin and see how Ranger can provide authorization across Apache Hive. This is the login page for Apache Ranger, served on port 6080, and I'm going to log in with the credentials I've configured. As I mentioned before, I have installed both Ranger and Hive locally on my machine, and the versions are Ranger 2.0 and Hive 3.1.2. This is the Service Manager UI, where you can see the different types of services that can be configured as plugins and used to provide security policies on the Hadoop ecosystem. I have already created one Hive service for this demo, hive_dev. Once you click on its icon, you'll be able to view the service details configured for this particular user, including the JDBC URL with which you'll be able to interact with the Hive instance running locally. For this to work, we should have HiveServer2 up and running on our machine, and this is the UI portal for that; I have already started HiveServer2.

So that's how you create a service. After creating a service, you can see the four default generic policies that Ranger has created: one for the Hive service, one for database, table, and column, one for database along with user-defined functions, and the last one for URLs. You can see the status as Enabled for all four policies for this particular user. By default, for the very first time, Ranger does not have any deny conditions or deny policies enabled, so it grants this particular user all permissions across all these services.
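Everything done through the portal in this demo can also be scripted: Ranger exposes a public REST API on the same port 6080. As a hedged sketch (the service name, user, and credentials are illustrative, mirroring the demo setup rather than a verified configuration), creating a database/table/column policy that grants select and create, but deliberately not drop, might look like this:

```python
import requests

RANGER = "http://localhost:6080"  # Ranger admin portal, as in the demo
AUTH = ("admin", "admin")         # illustrative credentials

# A database/table/column policy like the one the Ranger UI creates:
# select and create on everything in the hive_dev service for one user,
# with drop deliberately left out.
policy = {
    "service": "hive_dev",            # the demo's Hive service (assumed name)
    "name": "no-drop-for-durga",      # hypothetical policy name
    "resources": {
        "database": {"values": ["*"]},
        "table":    {"values": ["*"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "users": ["durga"],           # illustrative user
        "accesses": [
            {"type": "select", "isAllowed": True},
            {"type": "create", "isAllowed": True},
            # "drop" omitted, mirroring what the demo does in the UI
        ],
    }],
}
r = requests.post(f"{RANGER}/service/public/v2/api/policy", json=policy, auth=AUTH)
r.raise_for_status()
print("created policy id:", r.json()["id"])
```

This is the scripted equivalent of unchecking the drop permission in the UI, which is exactly what the demo does next.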
In case you want to check, you can click on a policy's icon and see that all the permissions are enabled for this user and there are no deny conditions applied; it's the same for all four policies. So let's check in Hive whether we really have all the privileges. From my current working directory, I move to the directory where I have installed Hive, then into bin, and log into the Beeline shell. Once you're in, you connect to the JDBC URL configured in the Ranger service; basically it's the HiveServer2 URL, and the username is the one I used when configuring the Ranger setup. So we're connected to Apache Hive.

Now let's see what databases we currently have on our machine. We have three: db3, default, and sample. Now let's say I want to drop a database. I drop it, then check by running show databases, and you can see only two databases now, db3 and default; sample has been dropped. So basically we have the drop privilege. Let's create that one back: after running create database sample, you can see the database has been created. So this user has all the privileges.

Now, to see whether we can use Ranger to enforce security policies in Hive, we click on the edit policy icon and say I don't want to give this particular user the drop privilege. When you do this, you also have to uncheck "all", because that checkbox says this user has permission to do all operations. So I uncheck both drop and all, then click save. You do the same for all four policies: uncheck drop and all, then save. We're removing only the drop privilege; apart from drop, all the other permissions remain the same. This user should no longer be able to drop a database.

Let's see whether that is applied here. Running show databases, you can see three databases: default, sample, and db3. I try to drop db3, and see, a Ranger policy has been applied; you can see the exception thrown here, a HiveAccessControlException, permission denied. This particular user does not have the drop privilege. Let's check that all the other privileges still work. I create another database, create database db1, run show databases, and you can see db1 has been created. But if you want to drop any database, that's when you get this exception from Ranger saying you don't have permission to do this. That's how security policies are applied to Hive via Ranger.

Ranger can enforce many more policies of this type, not just create and drop. You can see a lot of permissions here: select for a read-only policy, or you can remove a user's permission to alter, or maybe lock or update. You can play around with all these permissions, and you can also define policies for groups, saying a particular group has access only to a certain database, or does not have drop or create privileges. That's how you define policies in Ranger, which can then be used to enforce security in Hive. So that was a quick demo of the Apache Ranger Hive plugin in action.
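For completeness, the Beeline session above could equally be scripted. Here is a minimal sketch using PyHive, assuming a local HiveServer2 on the default port 10000 with the Ranger Hive plugin enabled; the host, port, and username are illustrative, and the denial text comes from Ranger at runtime, not from this sketch.

```python
# pip install pyhive thrift thrift_sasl  (environment-dependent)
from pyhive import hive
from pyhive.exc import OperationalError

# Connect to a local HiveServer2; the username is what Ranger authorizes against.
conn = hive.Connection(host="localhost", port=10000, username="durga")
cur = conn.cursor()

cur.execute("SHOW DATABASES")
print([row[0] for row in cur.fetchall()])  # e.g. ['db3', 'default', 'sample']

cur.execute("CREATE DATABASE db1")  # still allowed: create was not revoked

try:
    cur.execute("DROP DATABASE db3")  # denied once the drop privilege is removed
except OperationalError as err:
    # Surfaces Ranger's HiveAccessControlException: permission denied
    print("Blocked by Ranger:", err)
```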
And, any questions? That was a very interesting presentation. So let's see, Ricardo, do you want to come on video and audio and answer any questions? I'm assuming Durga is still not here. I'm sorry. No problem, welcome, Ricardo. Find the button. Excellent, welcome, Durga. Do you want to come on video as well, Durga? Yep, just click share audio. There you go, I'll add you in. Thank you very much.

That was a very interesting presentation and a very nicely done video recording. Does anyone have questions, or do either of you, Durga or Ricardo, want to add anything to the presentation?

I don't have much to add. I would like to say thanks to Ricardo and my entire team for the knowledge I gained at Red Hat. I hadn't mentioned Ricardo at the start of the presentation, but without him, literally, none of this would have been possible: all the knowledge I gained about data governance and the importance of data. So yeah, officially, thank you so much, Ricardo, for all the mentoring throughout my time at Red Hat. And I don't know if Sharad is here, but if he is, he'll kill me if I don't mention his name. Thanks, Sharad, and thanks to my entire team: Alex, Anish, Malik, Sergio, and sorry to anybody I've missed.

We should say thank you, Durga. You did a very good job on it. Yeah, it was very interesting and very fun, so I definitely enjoyed it. Any questions from the audience? Please speak up. No questions means either people understood everything or they didn't understand anything. Well, we still have folks watching, so that's a good sign. Thank you all very much.