Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of DATAVERSITY. We are proud to produce this webinar series of Data Governance Case Studies for the Data Governance Professionals Organization. We'd like to thank you for joining today's DGPO webinar, Considerations in Governing an Internet of Things Data Lake, sponsored today by Collibra. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A section in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DGPO. Now, let me turn the webinar over to Barbara Deamer from the Data Governance Professionals Organization to introduce today's webinar and speaker. Barbara, hello and welcome.

Thank you. On behalf of the DGPO, I'd like to welcome everyone to our monthly webinar. Before we get started this afternoon, I just want to give a brief overview of the DGPO. The DGPO is a community of data governance professionals whose mission is to share knowledge, content, and best practices with its members. Toward that goal, the DGPO has been working on expanding the best practice information for the six areas you see in the graphic on the left-hand side of the slide. The DGPO sponsors a Data Governance Best Practice Award, which is judged based on responses to these six areas. The 2018 winner was Dun & Bradstreet, and the runners-up were Freddie Mac and the Arkansas State Insurance Department. Their submissions and others are documented on the members-only section of the DGPO website. To learn more about the DGPO, please visit our website at dgpo.org. Now I'd like to introduce Daniel Scholder, who represents the webinar sponsor, Collibra. Hello, welcome.

Thank you. I'm Daniel Scholder from Collibra. We're always thrilled to sponsor these DGPO webinars.
They're a terrific way to hear about what people are doing and how people are addressing some of the toughest challenges in data governance in the industry. We at Collibra are very much aware of how much value we can create if we can just put our data to use. IoT data is one of the critical pieces that we see coming up more and more within our customer base. The challenge of using it is something that has visibility across the organization; there's a tremendous amount of untapped value that most organizations have in their data, and governance is the key to unlocking that value. What we see today with the hundreds of organizations that we work with is that there's a division in the way that people think about using and managing and unlocking the value of data. In the orange picture here toward the top, you have the use of analytics, the use of new visualization techniques, the building of a lot of predictive models, the use of AI and machine learning, all sorts of different techniques, and a growing number of them are being used to analyze data. At the same time, we're also faced with a growing scope of data management challenges when we start to include all the different kinds of data sources, the new data lakes that people are building, things like IoT data and other types of sources, and artificial intelligence and machine learning are only likely to compound that problem by creating even more data. But there's often a gap in between those two things. People have the technology and the capability to analyze data, and they have data in the organization to be analyzed, but finding it, understanding it, and knowing that they can trust it is a great challenge, and that is the core of what governance provides.
The ability to find and understand our data: there's a huge percentage of people who say that they cannot find things, and in fact there's another statistic I've seen, I don't have it here, that shows that people report they have access to more data today but actually use it less, in part because they don't understand what it is that they've got and what it is that they can use. There's a lot about how to trust data: not only are we handling our data correctly, but is the data actually what we need, and in the condition we need it to be in? Do we have the right things to be basing critical decisions on? And risk management: it's very clear in the retail sector that brands are trusted with people's data, and anytime that trust is violated, it has a huge impact on what customers do and how they behave. So all of these things are part of a governance initiative. There's a huge set of capabilities that need to be built out, or can be built out, in order to facilitate that connection between those two pieces. The building out of a good data experience, the creation of a data catalog, and the implementation of appropriate governance in combination give us the ability to use the data effectively: to find the data, to know what it is, and to know that we can trust it, that it's the right data, that it's had the right degree of scrutiny applied to it to make sure that it is what we need it to be. And this is really what Collibra does. We are the leading data governance platform. We provide automation for your organization to be able to do all of those things at the scale that you need. And we believe that data governance is a huge part of what people need to do. We also know that knowledge about data governance is rather sparse. That's why we love to sponsor these types of webinars.
We also have something of our own we call Collibra University, which has a lot of information about the best practices that we see with data governance and how to do it. And we have a very active community of practice that can help people with data governance. So if anyone is interested in Collibra or in any of those other things, you're welcome to go see our website, and I encourage you to take a look at university.collibra.com as a great place to learn more about best practices for data governance. I'm very excited to hear about some of the experiences they've had with IoT and the data lake. And so I'll turn it back over to you, Shannon, or Barbara.

Thank you, Daniel. Before we go on, I would like to remind everyone that the recording for the webinar will be posted in the DGPO members-only section of the DGPO website in a few days. This afternoon, I am pleased to introduce Mariela Botea and Sunil Soares, who will focus on key considerations in governing an Internet of Things data lake. Mariela is a data governance advisor at ExxonMobil and a member of the Enterprise Data Enablement Team within the Data Science and Analytics Organization. Prior to this role, Mariela held various positions in the upstream and production business lines as an exploration geologist, always working with data of one sort or another. Sunil is the founder and managing partner of Information Asset, a consulting firm that specializes in helping organizations build out their data governance programs. Prior to this role, Sunil was the director of Information Governance at IBM. Mariela and Sunil, we look forward to your presentation.

Thank you, Barbara, and thank you, everybody, for logging into our webinar today. Our agenda today, and oops, sorry, now I have control. Here we go. Our agenda today: we will do a short introduction about ExxonMobil, and I'll look into our digital transformation vision for our IoT system.
I'll look at what has been accomplished so far on our roadmap. We will review the governance strategy around the data lake zones, roles and responsibilities, and policies, and then we'll go back to our roadmap and see what remains to be accomplished this year. Hopefully at the end, we'll have plenty of time for questions.

A little background: ExxonMobil is 135 years old and the largest publicly traded oil and gas company. We have over 71,000 employees operating across six continents in over 130 countries. Our estimated earnings in 2017 were $19.7 billion U.S. dollars. Our reported reserves are 91 billion barrels of oil equivalent. We produce 2.3 million barrels per day, Mobil 1 is our industry-leading synthetic motor oil, and there are 21,000 retail stores worldwide. The picture you see here is the Energy Cube at our Houston, Texas campus. This building houses all our training classes and auditoriums for campus-wide employee meetings, and it's also the welcome center for visitors to the campus. When you drive up, this is the first building you see coming into the ExxonMobil campus.

All right, so let's look at our digital transformation vision for our IoT system. ExxonMobil has hundreds, if not thousands, of IoT time series data sources. So let me give you a little background about this data lake. It's being set up for our manufacturing business line, which entails our plants and our refineries. This IoT data is the sensor readings from the plant and refinery machinery. The plan is not to have the data flow directly from the plant onto the lake; we will still pass it through our lab and PHD systems. But this IoT machinery reading data, along with reference data, makes up the foundational data within our data lake. With the establishment of the data lake and enabling advanced analytic capabilities, we've identified millions saved per year in soft credits.
Some of these benefits: more efficient access to data, enabling real-time investigation; improved integration of data, enabling data discovery; and knowledge retention, storing the results of the analysis for future discovery. As for our manufacturing time series data, some of the challenges that we've been facing with it are the slow rate of technology evolution. There has been no real step change in capability for this area since its original concept. All we've done is throw more computing horsepower at it: faster networks and CPUs, more memory and disk. And also, the data is stored in disparate repositories with custom data structures. We have over 40 plant historian systems for time series data at a corporate level at ExxonMobil. Some of the opportunities we identified with the establishment of the data lake to address these challenges: a single repository for most of the plant data. It would allow for access to a broader set of plant and manufacturing data and supply a foundation for analysis. So this becomes an analytics platform which will enable future workflows for centralized monitoring of the plant. So to summarize our case for action: currently we're spending the majority of our time gathering and manipulating the data to complete our analysis. Combining the data into a single repository with one data structure will allow us to spend more time acting and analyzing. Now we see that data management has been brought into the picture. That's because we want our data lake to remain a trusted source of data, and as we all know, trusted data is governed data. This is a high-level look at our data lake governance roadmap. Last year we focused on the solution identification phase and developed the architecture of the data lake and also completed the governance playbook and roadmap. Information Asset is one of the many consultants that ExxonMobil works with, and Sunil and his team were already in-house working on a related project.
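To make the "single repository with one data structure" idea concrete, here is a minimal Python sketch of normalizing records from disparate plant historians into one common time series structure. The field names and the sample record are entirely hypothetical, not ExxonMobil's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical unified record for one sensor reading; field names are
# illustrative only, not the real data lake schema.
@dataclass(frozen=True)
class SensorReading:
    site: str          # plant or refinery identifier
    historian: str     # which of the 40+ source historian systems it came from
    tag: str           # instrument/sensor tag name
    timestamp: datetime
    value: float
    unit: str          # should resolve to governed reference data, e.g. "BBL"

def normalize(raw: dict, site: str, historian: str) -> SensorReading:
    """Map one historian-specific record into the common structure."""
    return SensorReading(
        site=site,
        historian=historian,
        tag=raw["TagName"],
        timestamp=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        value=float(raw["val"]),
        unit=raw.get("eu", "unknown"),  # engineering units, if the source provides them
    )

reading = normalize(
    {"TagName": "FI-101", "ts": 1735689600, "val": "2.3", "eu": "BBL"},
    site="Beaumont", historian="PHD",
)
print(reading.tag, reading.value, reading.unit)  # FI-101 2.3 BBL
```

Each of the 40-plus historians would need its own small `normalize` adapter, but every consumer downstream then sees a single structure.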
We'll call it a data catalog. When I met him at the DGIQ conference in 2017, I approached him about helping with a data governance playbook for the data lake, and one of the requirements was that this playbook could then be applied, with a little bit of tweaking, to any other data lake we decide to stand up within the organization. So the rest of this presentation will review the governance around the architecture, the roles and responsibilities, and the governance policies. We will then come back to this roadmap, and I will share where we are today in the progress of the data lake and also some lessons learned.

So let's go into our data lake zones. The raw zone is our landing zone. The data ownership is the same as the source ownership. Security is limited to approved users and user groups. There may be some metadata already assigned to the data assets, but it has not yet been entered into our data catalog. Here the raw data is stored in the format in which it's received, but most importantly, this is where the data is profiled and classified prior to being ingested. The next three zones all pretty much have the same governance, so I put them all on one slide; I'm sorry if it's a little busy. After the data is profiled and classified in the raw zone, it moves to the source zone, and the source zone is where we validate the source data ingested into the lake. From the source zone the data moves into the transform zone. Here the data is transformed and may contain some business logic or common models. This data now becomes the single version of the truth in the data lake. From the transform zone the data moves into the publish zone. Here the data has been structured for consumption, and this zone only contains data that will be used by consumers and can be accessed by BI-type applications. The data ownership is no longer that of the source owner; now the data lake owner is accountable for the data. Security is limited to approved consumers.
Metadata, both business and technical, should be maintained and available to all consumers via the data catalog. Here lineage should also be captured. Reference data is now managed in the data lake, and rule-based data quality policies should be implemented on critical data elements and data sets. The analytic zone is a dedicated area for data analysts and data scientists to use for ad hoc data discovery and exploration. Here existing data is combined with new data sets without compromising the source data. Importantly, here security should be designed to ensure that the sandbox activities within the analytic zone do not adversely affect other data zones. The staging zone is an optional zone to be used when needed. This zone is used to store data during ETL job executions, thus the data storage is temporary. At the end of the process, the data should be cleared and removed. The archive zone will store data that should be retained due to any business requirements. Rules should be defined based on age and usage, and it's recommended that retention policies be automated. This zone is important because it allows us to address data retention requirements when needed, store multiple versions of data for tracking purposes, and also offload historical data that's no longer needed in any of the other zones.

Next we'll go into our governance strategy for roles and responsibilities, and you've gotten a sneak peek of that already. All right, so this is the data governance framework we decided to establish for our data lake. There is a data governance council made up of key stakeholders, and these individuals are responsible for the data governance monitoring and enforcement of policies. The next two are the business data domain owner and the business data domain custodian, and it's a mouthful. We put the word business there because we wanted to make sure that manufacturing realized that these two are business roles, not IT roles.
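Going back to the staging and archive zones for a moment: the age- and usage-based retention rules described there could be automated with something like the following Python sketch. The zone names follow the talk, but the retention durations are invented examples, not ExxonMobil policy:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative retention windows per zone; durations are made up.
RETENTION = {
    "staging": timedelta(days=1),        # cleared once ETL jobs complete
    "analytic": timedelta(days=90),      # sandbox experiments expire
    "archive": timedelta(days=7 * 365),  # long-term business retention
}

def should_purge(zone: str, last_used: datetime, now: Optional[datetime] = None) -> bool:
    """Age/usage rule: purge when data in a zone outlives its retention window."""
    now = now or datetime.now(timezone.utc)
    window = RETENTION.get(zone)
    return window is not None and (now - last_used) > window
```

A scheduled job could sweep each zone nightly with a check like this; zones without a configured window (raw, source, transform, publish) are never auto-purged.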
So the data domain owner is accountable for data decisions, driving improvements globally and cross-functionally, and the data domain custodian, who is a delegate of the data domain owner, is responsible for the data and process improvements globally and cross-functionally. The data stewards, your subject matter experts, can be either business or IT, and these are the subject matter expert contacts for functions or regions or sites. We decided to do it by function, or what we call capabilities within the manufacturing organization, and these individuals contribute to the data integrity, data rule definitions, data quality; it's a big role, and I didn't put it all in there because there wouldn't be enough room. The technical data stewards, that's the big data team. Those are the individuals responsible for the day-to-day management of the data ingestion process. So this data lake governance effort is a subset of a much larger effort to establish a governance framework for all of manufacturing, and we hope to use the momentum that we've created with the data lake, this very small portion (remember, they all say start small), as a launching pad to establish the overall manufacturing governance body.

So next we'll go into our data governance policies. The data ingestion process needs to be standardized, period. This helps with meeting business taxonomy and data catalog requirements. It controls the ingestion streams and meets all the regulatory, privacy, security, and legal hold requirements. What you see here is a sample data ingestion process. If the request is approved, it's checked for redundancies and identified for special tagging if necessary. The appropriate documentation is gathered, metadata is documented, and it is ensured that all onboarded metadata is then written into the data catalog. Then the appropriate security level is determined, and the data is scheduled to be ingested into the data lake.
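The ingestion steps just described (approval, redundancy check, special tagging, catalog registration, security level, scheduling) could be sketched as a simple Python function. Everything here is hypothetical: the request fields, the tag names, and the dict standing in for the data catalog:

```python
# A sketch of the standardized ingestion flow; all names are illustrative.
def process_ingestion_request(request: dict, catalog: dict) -> str:
    if not request.get("approved"):
        return "rejected"
    # Check for redundancy against data sets already onboarded
    if request["dataset"] in catalog:
        return "duplicate"
    # Identify any special tagging (privacy, legal hold, sensitivity)
    tags = [t for t in ("sensitive", "legal_hold") if request.get(t)]
    # Write the onboarding metadata into the data catalog
    catalog[request["dataset"]] = {"owner": request["owner"], "tags": tags}
    # Determine the appropriate security level, then schedule ingestion
    level = "restricted" if tags else "standard"
    return f"scheduled ({level})"

catalog: dict = {}
print(process_ingestion_request(
    {"approved": True, "dataset": "plant_sensor_feed", "owner": "mfg", "sensitive": True},
    catalog,
))  # scheduled (restricted)
```

The point of standardizing the flow is that every data set entering the lake passes the same gates, so nothing lands in the raw zone without an owner and catalog entry.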
Now, I'll tell you that ExxonMobil is working on an enterprise metadata strategy, and we are partnering with them in this effort as we put our ingestion policies in place. So this is still a work in progress for us. Data owners and data stewards need to ensure that the appropriate business and technical metadata is captured for critical data elements in the data catalog. Not every element is a critical data element. These are terms and concepts that are determined to be vital to the successful operation of an organization: things such as data used by senior management in critical decisions, data consumed in an operational context, or data from analytical models used to make daily decisions. Here we see an example of a critical data element's required metadata fields. We're still developing our list of required attributes as part of the enterprise metadata strategy I mentioned on the previous slide. The data catalog contains both business and technical metadata. It provides a window into the information landscape and serves as a central location for managing, understanding, and trusting the data. Here you can see we can also manage policies, data quality rules, and data audiences. Some of the benefits of a data catalog: you have a common vocabulary understood by all. It helps users find and use their data, enforces data policy rules, helps users understand the nuances of their data, and also minimizes redundant data ingestion. Reference data is relatively static. Some examples are country codes, units of measure, and company codes. The use of common reference data is essential for creating consistency and reducing redundancy. On the other hand, poor reference data can be a leading cause of errors in analysis, so it must be governed to maintain the mapping between values and data sets within the data lake. Data quality management includes the methods to measure and improve the quality and integrity of data.
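Governing the mapping between reference data values and data sets might look like the following Python sketch for units of measure (the BBL-for-barrels standardization comes up again in the Q&A). The mapping table and approved codes are hypothetical examples, not the actual governed reference data:

```python
# Hypothetical governed mapping: source-system variants resolve to a single
# approved code, e.g. BBL for barrels.
UOM_MAP = {
    "barrels": "BBL",
    "bbls": "BBL",
    "bbl": "BBL",
    "cubic feet": "CF",
}
APPROVED = set(UOM_MAP.values())

def standardize_unit(raw: str) -> str:
    """Resolve a raw unit string to its governed code, or flag it for a steward."""
    code = UOM_MAP.get(raw.strip().lower(), raw.strip().upper())
    if code not in APPROVED:
        raise ValueError(f"unmapped unit {raw!r}: route to the data steward for review")
    return code

print(standardize_unit("Barrels"))  # BBL
```

Unmapped values fail loudly rather than silently passing through, which is exactly the steward-review loop the governance policy calls for.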
To support data quality rules, it is critical to define the source of the data and how it will be managed and consumed within the data lake. As I mentioned earlier, data profiling and classifying is performed in the raw zone, but once the quality is determined, it's moved to the transform zone, where it becomes the source of truth for analytics and reporting. Restrictions to data through access controls should be present as the data moves across the data lake zones. Some of the basic security elements that we are using are role-based access (who can read what at what level), audit security (who touched what and who did what), and protection (how and when do we mask or encrypt our data). Since the data lake's goal is to maximize the ease of finding data and reduce the time it takes for insightful analysis, data can quickly accumulate, and some of it can become stale very quickly. Stale data can have serious implications on delivering insights, so a data lifecycle management strategy needs to be implemented to reduce the amount of redundant and obsolete data. It's important to remember that not all data in the data lake follows a single lifecycle. Retention periods and usage needs should be taken into consideration when developing a data lifecycle management strategy. And if you haven't noticed, these policies pretty much follow the policies that you can find in the DMBOK.

So on our next slide, now we get to the new stuff: what have we done so far? Where are we today? In 2018, we kicked off our practical design phase. I'm heavily involved with the socialization of the roles and responsibilities and the metadata strategy that I mentioned earlier. I consult with other individuals about the progression and finalization of the governance tools, developing the prototypes and the process guides. If you're familiar with a waterfall-type delivery, we will progress to gate 3 after completion of these deliverables at the end of the year.
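The rule-based data quality policies mentioned for critical data elements can be pictured as a small set of named checks run over a column of values. This is a minimal Python sketch; the rule names, the 99% completeness threshold, and the valid range are made-up examples:

```python
# Illustrative rule-based quality checks for one critical data element.
RULES = {
    "completeness": lambda vals: sum(v is not None for v in vals) / len(vals) >= 0.99,
    "valid_range": lambda vals: all(v is None or 0.0 <= v <= 10_000.0 for v in vals),
}

def run_quality_checks(values: list) -> list:
    """Return the names of the rules that a column of sensor values fails."""
    return [name for name, rule in RULES.items() if not rule(values)]

print(run_quality_checks([2.3, 2.4, None, 2.5]))  # ['completeness']
print(run_quality_checks([2.3, 2.4, 2.5, 2.6]))   # []
```

Checks like these would run in the raw zone during profiling, and only data that passes (or is remediated) moves on toward the transform zone.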
The manufacturing data lake is expected to go live approximately late 2019. So this is an ongoing effort that we have here. We have made a lot of progress, and we don't have it all figured out. We're learning as we go, and hopefully as other lakes are created within the company, they will be able to leverage the learnings that we've gone through so far. And there are a few smaller data lakes (this is a very big data lake) being put together right now that I'm consulting with. First of all, I would say devote more time to socializing your roles and responsibilities. The bigger you are, the more time you will need. The bigger the company, the bigger the organization you're working with, the more time you should allow yourself for the roles and responsibilities. I will tell you right now that I don't think we're finished socializing the roles and responsibilities. In June, when I gave this presentation at DGIQ, we were still struggling to have a data owner identified. I will say that since June (it's been a month), we have identified a data owner for the foundational data, and we've also identified a few other individuals, like a subject matter expert that can help us with identifying subject matter. That leads me into the next one: don't go at it alone. That's my number two. Bring in an experienced vendor or consultant to help you answer questions and give guidance, especially if you're bringing in new technology and you haven't done it before. Another big learning was to include a business contact in your governance activities from the very beginning. Since June, when I gave the presentation, we didn't have somebody, and just last week, we've actually been assigned an individual subject matter expert to help us with validating and mapping the reference data. Another one: communication. Establish a way to communicate with the management and those doing the work.
In our case, we put together a SharePoint site; there's the overall data lake project SharePoint, but we found that our governance information was being lost and inundated in there. So we ended up creating our own SharePoint that hopefully later on will become the SharePoint used by all of manufacturing when it comes to data governance. Another thing that we did is we created a Yammer group so that questions and answers are all in one single thread. We were having an issue where emails would go out and we'd be inundated with emails, and at the end, you have a huge email thread to try to follow through to be able to get to your answer. So the Yammer group has actually helped with that. And lastly, I will say it, and everybody else says it: start small. So we are trying to find an owner or subject matter experts to cover the entirety of the time series and reference data, so the use cases should hopefully go a little bit faster. And the data lake itself is a subset of manufacturing, for which we're also working on establishing an overall governance for that business line. So the momentum and the visibility that we're gaining with the data lake hopefully will carry on to the overall manufacturing governance when we start looking for individuals for roles and responsibilities. And at that point, like I mentioned earlier, it will be by capability. So again, that sort of subsets manufacturing a little bit, where it won't be such a large effort like trying to cover all of manufacturing; you're covering a capability of it. So I will stop there and see if there are any questions.

Thank you for your presentation. Sunil will pipe in to help with answering questions as needed.

That's because you've been doing such a good job, Mariela. Thank you so much.
So Mariela, we do have a couple of questions out there. I think one of the ones that came up a couple of different times is: what tool do you use for the data catalog?

I figured that was one of the questions that was going to come up. We're looking at Hadoop on-prem, and with Hadoop, or Hortonworks, comes all the regular software that comes with that, which is Ranger for security and, what's the other one? Atlas. Thank you. The other tool that we're looking at, of course, the other tool we have in-house is Collibra. So that's why we haven't finalized our tool selection yet. We just brought in a third tool which actually has gone through procurement, so I guess I can mention it now; I couldn't mention it at DGIQ because it hadn't finished going through procurement. But we've brought in EBX as well to look at. EBX is more of a master data management tool that we're using. Of course, APIs are big in there as well, and we use MuleSoft for that. Did I forget anything, Sunil? No, that's good.

Great. Thank you so much. Another question: for the model that you're using, who is responsible for the data quality profiling and remediation?

Like I said, we didn't have a business contact that had the responsibility to say yay or nay. We have some subject matter experts working with us, mainly IT, who are very knowledgeable of the business line, and they've been helping with that aspect of it. We just got assigned somebody with authority, like I mentioned in the presentation. I guess that's the word. We had business and IT folks working it, but nobody had the final authority. So now we have somebody who has final authority who can say, yes, we will use BBL for barrels instead of spelling out the word barrels, for setting up, for instance, our units of measure, making sure we have the right quality for that. That's what we're focusing on right now. In this area right now, there's reference data and IoT data that's coming in.
Hopefully that answers your question. It's a group effort from business and IT.

Excellent. Another question we had is: from a DG perspective, how do you think DG is really different at a large enterprise versus a smaller one? And also, kind of having to do with that, why do you think that this is a data lake versus just simply an enterprise data warehouse? What do you find are the major differences in your organization?

Yeah, we have data warehouses, and you've got to remember this is for the manufacturing business line, and for their use, the best result was a data lake, just because of the way the data is stored in the data lake versus the data warehouse. They needed IoT data, structured and unstructured, everything in there. We do have other business lines using data warehouses, and that's the right answer for them. We have several of those. In this case with the manufacturing data lake, since we started last year, we've noticed a couple of other data lakes popping up for much smaller data groups, and those are smaller efforts. As far as, could you repeat the first part of that, Barbara?

I think: how is enterprise data governance different from data lake governance, especially when it relates to the volume of data?

Do you want me to take a stab at that? If you are bringing data into the manufacturing services data lake, you still need to have data owners. You still need to know, for the Beaumont refinery, who owns blending data, for example, and who owns the reference data on barrels, whether it's BBL or not. That has little to do with the capacity, because if you are governing the data lake, you need those owners, and those owners might be at the enterprise level within manufacturing services. You might also have derived data within the data lake. You might create combinations of data, so then you would need owners for that derived data, and you would also need potentially somebody to decide who has access to the data in the lake.
I think when you start looking at the tools, you have to think through: if I am at the enterprise level, I want to go look at my definitions in a tool, in the data catalog. If I am in the data lake, I want to look at my definitions in Atlas. Maybe any definitions you create in your enterprise tool need to be sent into Hortonworks Atlas for easy access within the lake. Go ahead, Barbara.

I just wanted to mention that we are starting small. The data lake is the first thing we are focusing on with manufacturing. The next step is to look at overall manufacturing governance, because we don't have that in place right now. There is nothing to say that the data lake owner for the reference data won't be the same manufacturing overall enterprise owner for reference data across manufacturing. That's why we are hoping that the momentum from the data lake will move and continue on to the enterprise level of manufacturing.

Kind of a follow-up to that, then: is your data governance from a data lake perspective really tied into your corporate data governance program, or are you viewing it as the stepping stone to get a corporate data governance program?

Good question. We do not have a corporate data governance program. We are a big company, and those of you who know how big companies work know they are very siloed. The idea of a corporate data governance program has been talked about, but no, we do not have corporate data governance, and it probably won't go that way. I will say that we have a project in place called the financial transformation project, and that of course is focused on financial data, right? Vendor and all that kind of material masters and all that. That's probably the closest we could get to corporate governance, around financial data, and that's something we are working towards. When you think about upstream geoscience and engineering data, how much different that is from manufacturing data or downstream fuels and lubes data?
Yeah, I don't think it's going to happen anytime soon, but it's something that we have been talking about. Thank you.

Another one, following up on your choice of using Hadoop, and you mentioned, I believe, that you are using the on-premise solution: what were your considerations, and why did on-premise ultimately win out over the cloud?

You know, I wasn't really involved in those decisions when they were made, but I will say that ExxonMobil IT has recently passed a decree that going forward, all technology will be in the cloud; this particular data lake was put in place before that decree went out. Sunil, do you know why, or were you involved in any of that?

Yeah, I think it's also because of the security, right? That's always a concern.

Do you think there will be a move, particularly since your company has declared everything has to be in the cloud? Have you started a process where you might be moving from Hadoop on-premise to Hadoop in the cloud?

Not for this project. If anything, it will happen for any future data lakes. That's a particular project.

There was another question on tools, which was: what ETL tool do you use?

Oh, gosh, we have several homegrown ETL tools that we use. Off the top of my head, I don't remember. I'd have to look it up. I'm not going to put Sunil on the spot because I don't think he knows.

Informatica PowerCenter, I think there might be some BusinessObjects Data Services, and then in terms of moving data into the data lake itself, there might be Sqoop, right? That's not really, there's no transformation there. So those are the three, and maybe Hortonworks NiFi.

Yeah, here it is. Thank you. I just looked it up as well. Data Services, BODS, is one of them. SSIS and SAP Landscape Transformation are some of the ones that we use. And KIPCO, DV, is in there as well. Thank you, Sunil.

A lot of people are interested in tools. Yeah, I know.
Do you have a data classification or detection tool that's used to identify the sensitive data? Do you identify it prior to landing in the data lake, or maybe after? Do you have a methodology on that already? Yeah, you know, I'm not particularly working on that area, but yes, we do have sensitive data, although for this particular case there's not a lot of it. Machinery readings are not particularly sensitive, but what is sensitive we decided not to place in the data lake, and that data is already pre-tagged as sensitive, so we know what cannot go in there. That's interesting. Cool. Also, you said in the first part of the presentation you have the different silos of data, where you pass it from just having the source data, then you go into a second column, and it actually lands in that third column where the analytics can be performed. Do you have a data governance policy, or are you developing one, that would identify the standardization for those fields where people are getting their analytics from? Is there any part of a data governance plan over that standardization? That's the published zone. That's where individuals, of course it's limited, the access to that area would be limited to approved consumers. We haven't really started thinking about governance around that yet. It's like, hey, it's BI, we're doing analysis. So, yeah, thank you, that's something we should really start thinking about as well; it's not something that I've thought of at the moment. That's one of the things we're wrestling with right now. That's a challenge. Thanks for bringing that up, Barbara. I haven't really thought about it that much. People are supposed to have it, and then I guess there should be some kind of governance around it. Another person asked: if you're using Collibra, are you using the Collibra data catalog, and have you already implemented that?
You said there are a couple of different tools, but in particular the Collibra data catalog. We are using the Collibra data catalog. It's a marketplace, so we have an interface; we don't use the Collibra front end. We've created our own marketplace front end. You put in a search word and it brings up what is in the Collibra metadata, of course, showing you where you have data and how to access it. If you don't have access, it shows it to you anyway, and you can request access. I don't think we turned on the option in Collibra where it does the requesting for you; we still have it that way because it's still in alpha/beta. The enterprise marketplace should be completed by the end of August. We're still troubleshooting the data, and we're using the back end for the data catalog. Great. Another question. I'm not sure based on your previous answer, but does your data governance council have any connection to a center of excellence? Yes. The center of excellence would be the capabilities within manufacturing, and you have them all very involved. If you're talking just about the council, the business line representatives would be representatives of the capabilities: data owners, and even law would probably be a council member. At least they say they want to be. When I went through the packet with law for the presentation, they thought they would be included in the data governance council for the data lake, although we don't plan on having any data there that's high volatility or anything like that. Okay, thanks. Another question. In your model, who is responsible for data quality profiling and remediation? The individuals responsible for the profiling, modeling, and remediation are a combination of IT, meaning the application data leads and IT data leads, and the business subject matter experts. They're working on data quality. Another tool question. In your case, what tool did you use to manage your end-user access? That's Ranger, right, Sunil? Yes, it's Ranger in the lake.
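Ranger's tag-based authorization, which comes up again below, works roughly like this sketch: classifications (tags) are attached to assets, and access policies are written against the tags rather than against individual tables. The tags, groups, and policy shape here are illustrative, not Ranger's actual API:

```python
# Classifications attached to lake assets; an untagged asset carries no
# extra restriction in this sketch. Asset paths and tags are made up.
asset_tags = {
    "lake/source/historian": set(),
    "lake/source/hr_roster": {"sensitive", "GDPR"},
}

# Tag-based policy: which groups may read assets carrying a given tag.
tag_policies = {
    "sensitive": {"data-stewards"},
    "GDPR": {"data-stewards", "privacy-office"},
}

def can_read(asset, groups):
    """Allow access only if the user's groups satisfy every tag on the asset."""
    required = asset_tags[asset]
    return all(tag_policies[tag] & set(groups) for tag in required)

print(can_read("lake/source/historian", ["analysts"]))       # True: untagged
print(can_read("lake/source/hr_roster", ["analysts"]))       # False
print(can_read("lake/source/hr_roster", ["data-stewards"]))  # True
```

The point of writing policies against tags is that a new sensitive table needs only a classification, not a new policy.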
So you could potentially have tags that are created in Collibra. You could say, I want to tag all my sensitive data. I know this particular instance doesn't have sensitive data, but you could have a sensitive-data tag or a GDPR tag, and then you could have access policies based on those tags. Great, thank you. A couple of questions on your data governance council: what size is it, and how often does the data governance council meet? Good question. We haven't gotten that far yet; we're still struggling. We haven't gotten as far as identifying the data governance council members. We just kind of know who we want in there and what they would be responsible for. Like I said, we're still socializing it and struggling to get the actual data domain owner named. We have individuals sort of tagged for the slot, but they haven't actually said, yes, I will be accountable. The only individual that's been assigned so far would be the data steward for the foundational data, which is the combination of the IoT and reference data. And we're still socializing and getting commitment for the data domain owner and custodian. Like I said, we expected it not to take so long, but I have a feeling it will probably take until the end of the third quarter; hopefully we will have those individuals in place by then. Meanwhile, we'll also be working on the governance council members in parallel. We've just been focusing on one thing, and that's getting the owner in place. Great. How do you govern the latency to deliver the data while bringing it into the various zones? Or, how long does it take to go through those various steps to get it into the published zone? Good question. You have more experience with data lakes; I'm not familiar with that actual aspect of it. I think in the case of data lakes, there's less requirement around, and less concern about, the latency.
But in many other cases, I don't necessarily think people are expecting to have the data immediately, and the way you address some of those issues is you set up different zones. We talked about zones earlier, and what we've seen in some organizations is you might bring data into your landing zone, even if it includes sensitive data; after you land it and you stage it, you might have a governed zone, Mariela, where a small group of people have access to it, and it's relatively dirty but quite recent. Then once the data is cleaned, it's put into that gold zone, the analytics zone. So the way you might address latency, and even data quality, is by creating different zones. One of the things I will say is that I know that with the amount of time we're spending gathering and manipulating the data to do our analysis, right now the latency is weeks. With the data lake, they identified that it could be done in hours. For the latency, I guess anything is better than weeks. Even if the data is a few hours old, they'll still be very happy, and they also won't have to rummage through 40 different systems to gather the data that they need. Anything would be faster than what they have now. That's a good point. One of the questions also was that everyone struggles with funding data governance, basically. Everyone wants the trusted data. In your experience, do you think that by doing this data lake on manufacturing data, the strategy is basically to get people to understand the benefits of having data governance? What would you say is really your overarching strategy to hopefully establish data governance in a stronger form at ExxonMobil? As we have probably all experienced, it starts off as really a data quality initiative. We clean the data, it gets dirty again, we clean it again, and finally we realize the root cause, and governance is then necessary and it kicks in.
Like I said, we have siloed organizations, and some are more mature than others. One of our organizations, chemicals, is very mature. They have a lot of activities and realized, hey, we need governance to be able to maintain the quality of the data that we need. They actually have devoted an entire organization to governance and quality. We also have downstream fuels and lubes, and they follow the same format as chemicals. They are the most mature of the organizations at ExxonMobil, and they are the two organizations most successful with governance initiatives. That's because, in my opinion, they saw the need through data quality, they decided to go with governance, and it's being led internally. It's being led by the business line with help from IT, not the other way around. And they created an organization devoted especially to governance and quality maintenance. Somebody asked about corporate governance: we may not have corporate governance per se, but we're hoping that if we can get each business line to follow the same example as chemicals and downstream fuels and lubes, where they create an organization internally that is based on data governance and data quality, that would be a huge step forward for data governance at ExxonMobil. And for any company, really. Thank you. Just a couple more. When you're moving the data from one zone to the next, is the data deleted from the previous zone? Which particular zone do you store the original data in? Yeah, let me just go back a few slides, real fast. Oh, too fast. Okay. So, I believe that the original data is stored in the source zone. Correct, Sunil? Yes. There's a landing zone before that, and that's where it's validated; I mean, not validated, but that's where the data is profiled and classified, in the raw zone, and then it's stored in the source zone. Right, exactly. Yeah.
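The zone flow described here (landing, then raw for profiling and classification, then source, with transform and published zones downstream) can be sketched as a simple promotion pipeline with a distinct owner per zone. The owner names and the one-step promotion are illustrative assumptions, not the actual implementation:

```python
# Illustrative zone sequence and per-zone accountability: source systems
# keep ownership of source-zone data, while the lake owner is accountable
# once data has been enriched downstream.
ZONES = ["landing", "raw", "source", "transform", "published"]
OWNERS = {
    "landing": "ingestion team",
    "raw": "ingestion team",                   # profiling and classification happen here
    "source": "plant historian system owner",  # ownership unchanged from the source
    "transform": "data lake owner",            # enriched/combined data
    "published": "data lake owner",            # access limited to approved consumers
}

def promote(record, from_zone):
    """Move a record one zone forward; a copy also stays in the source zone."""
    i = ZONES.index(from_zone)
    if i == len(ZONES) - 1:
        raise ValueError("already in the published zone")
    to_zone = ZONES[i + 1]
    return dict(record, zone=to_zone, owner=OWNERS[to_zone])

rec = {"id": 1, "zone": "source", "owner": OWNERS["source"]}
rec = promote(rec, rec["zone"])
print(rec["zone"], "/", rec["owner"])  # transform / data lake owner
```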
And whether it gets deleted, I'm not familiar with that. Do you know, Sunil, as it moves from the source? I guess it's always in the source zone. Yes, it's always in the source zone, and the transform zone contains data from the source that's been enriched and combined with other data sets. Correct. Right, and then you can access that transformed data through the published zone. Which is why, Mariela, when you think about data governance at the enterprise versus data governance for the lake, you can probably have someone in manufacturing services own the first transformation, but you might need a new set of owners for the published zone. Yeah. Our plan is not to change the source ownership at all. Each one of our 40 plant historian systems has an owner, and we will keep it that way, but the data lake owner will be responsible for the transform zone data once it's been enriched and combined with other data. So, at different places throughout that process, would you have data quality applied in the transform zone and the published zone, or are you really looking at data quality from the source system, so that if the data quality is there, then the transformation is hopefully transforming the data correctly? Yeah, we expect the source data to already be of high quality. I'll tell you that the project doesn't have it in scope to be the quality police, so we're expecting the data to be of high quality. And the thing is, it's a double-edged sword: as you have your data lake and you're looking at your data in it, you may find issues with the source data, and that way you can tell the source system, you need to fix this, and then it will get moved back into the data lake source zone after it gets corrected. But as far as data quality in the transform zone, I think the data analysts and the data scientists are expecting it to already be of good quality coming in from the source.
I don't think data quality is in scope, right? But typically what happens in data lakes, when you're ingesting the data, is that you actually have processes, and you can use tooling to set up very simple business rules. If you're bringing in customer data, for example, you might say date of birth should not be null, or date of birth should be in this format, or people should not be more than 100 years of age, and then you have the ability to quarantine data that does not meet certain data quality rules, and you could potentially have a data steward go and look at that quarantined data. That would be one way to address it, right there in the raw zone. Great, thank you. I think we're about out of time, but I would like to remind everyone that the recording of the webinar will be posted in the DGPO members-only section at dgpo.org. Hope everyone has a great day, and we'll see you next month. Thanks, all. Thank you, everybody. Pleasure to be here.
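The ingest-time rules-and-quarantine pattern described in that final answer can be sketched as follows, using the example rules from the discussion (date of birth not null, age under 100). The field names, records, and cutoff date are illustrative:

```python
from datetime import date

def rules(record, today=date(2019, 7, 1)):
    """Return a list of rule violations for one record (empty means clean)."""
    problems = []
    dob = record.get("date_of_birth")
    if dob is None:
        problems.append("date_of_birth is null")
    elif (today - dob).days > 100 * 365:
        problems.append("age over 100 years")
    return problems

def ingest(records):
    """Route failing records to quarantine for a data steward to review."""
    clean, quarantine = [], []
    for r in records:
        issues = rules(r)
        (quarantine if issues else clean).append((r, issues))
    return clean, quarantine

batch = [
    {"id": 1, "date_of_birth": date(1980, 5, 1)},
    {"id": 2, "date_of_birth": None},
    {"id": 3, "date_of_birth": date(1900, 1, 1)},
]
clean, quarantine = ingest(batch)
print(len(clean), "clean /", len(quarantine), "quarantined")  # 1 clean / 2 quarantined
```

Applying checks like these in the raw zone keeps bad records out of the source zone while still leaving them visible for remediation.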