Yeah, well, we can certainly talk about it afterwards if we don't catch it during. Good afternoon, everyone, and good evening to Mandy, all the way from the UK. My name is Tirana Lurvasudev, I'm from IBM, working in the Global Chief Data Office, and I have a general interest in data governance, data monetization and so on. I'm very glad to be the moderator for this session. As everybody has agreed, data is the most valuable asset, and it becomes valuable only when it is properly managed, curated, governed and made available to the right people at the right time. That's what DataOps is all about, and we'll hear from the experts here. Also, data is now very much distributed: in this new IoT world, 80% of the data will be generated by devices, not by humans or transactions, so the distributed governance of data becomes all the more important, and there are a lot of challenges in terms of securing it and making sure it is trustworthy. So we have a very distinguished panel here. Let me first introduce them. To my right is David Radley. David is an Egeria maintainer at IBM's UK Hursley Lab. He has over 30 years of experience in IT, with at least 15 years in information management. In his role, David promotes and develops information architecture to interact with analytics and metadata-driven solutions. He spent a lot of time in the Apache Atlas community and is now a maintainer for Egeria. So a very experienced person. He also leads events in Egeria with the senior leadership team and is very active in the LF AI & Data Foundation. So welcome, David. Hello. And here we have Dan Wolfson, founder of Pragmatic Data Research Limited. Pragmatic Data Research Limited is a consultancy specializing in accelerating digital transformations through innovative data architectures and governance.
Dan retired from IBM as a Distinguished Engineer and Director and CTO in the Weather and Business Solutions group of IBM AI Applications, where he led applications of geospatial data and analytics and developed areas such as environmental intelligence, agriculture and utilities. He has over 35 years of experience in research and commercial distributed computing, ranging from transaction management to data-oriented systems. He has numerous papers to his credit and is co-author of Enterprise Master Data Management, Beyond Big Data and many others. He is also a member of the IBM Academy of Technology, an IBM Master Inventor with several patents to his credit, and has been recognized by the ACM as an ACM Distinguished Engineer. So welcome, Dan. Thank you. And we have a distinguished remote participant, Mandy Chessell, who joined us through WebEx. Mandy, special thanks to you: we had an emergency earlier with another panelist, and Mandy was really generous to join at this late hour, evening hours for her. Mandy also worked for IBM for 35 years, the last 15 years as an IBM Distinguished Engineer, and she's now one of the founders of Pragmatic Data Research Limited. Her focus has always been using and supporting open standards to achieve heterogeneous interoperability. She worked on the CORBA standards, as well as at OASIS and The Open Group. Mandy is also a leader of and contributor to the Egeria open source project. She's a Fellow of the Royal Academy of Engineering and is distinguished as the first woman to win the Royal Academy of Engineering Silver Medal. So welcome, Mandy, to this panel. And also, since we have only a few people here, five by my count, maybe we can promote you all to the panel: go around and introduce yourselves, and make this panel discussion more interactive.
So if you want to give a quick introduction of yourself and what your interests are, that will help the panelists answer the questions appropriately. My name is Sarah Griffin and I work in the Chief Technology and Innovation Office at Dell Technologies as an engineering technologist, but I was a math teacher for 10 years before joining Dell. Cool. Hello, my name is Marco. I'm a professor at Northern Arizona University, and I'm here because I'm proposing research that uses chatbots to improve the quality of the code written to analyze data, so I want to get some insights. I'm Kieran. I work at a consulting firm in DC doing AI and analytics work, and I just wanted to learn a little bit about what y'all are talking about. I'm Kelly. I'm a solutions architect at Databricks, so I work with Databricks customers to help design architectures for their data. I'm also the next speaker in this room, about data lakes, and I like learning about what everyone else is talking about, so looking forward to this. My name is Josh Mitchell. I'm a software engineer. I work for the Department of Defense, or rather the DOE, but on Department of Defense work, and I recently acquired a data team, so over maybe the last couple of years there's been this evolution we've been going through, and I'm very curious about strategies. Lawrence Hecht. I do a lot of things, but I also work on the Linux Foundation research team. I've been an advocate for open data forever, and I really want to move the ball forward on having industries collaborate more on metadata standards. Well, thank you all. So let's start with some of the basics, and I would like to hear from the experts here: in your view, what do you mean by DataOps? Let's start with David... sorry, Dan. So, to me... actually, let me back up a little bit. One of the jobs that I had was, not only was I CTO of an area, but I was also the development director for an area.
And I had data engineering teams under me, and we did everything from how do we find the data, to how do we manage the data, how do we build analytics over the data, how do we provide the analytics as products that we actually sold and people consumed. So if we think about that whole evolution, that whole life cycle, and how we manage that life cycle, to me that's a lot of what DataOps is really about: how can we bring some automation to that practice, how can we bring collaboration to that practice, and how can we be efficient about that practice. And there can be a lot of trickiness that comes into play here. It's not just about the technology. It's about the process. It's about the organizations. It's about the structure, about how we foster the right kinds of collaboration and communication, often through tools, but not exclusively through tools, in order to handle these kinds of things on a daily basis. And once you're through that life cycle and you're in operation, things happen. The world doesn't stay static. In my particular case, for example, we would pull in data from a number of satellite systems around the world, and we built products out of that geospatial data. It's terabytes, petabytes of data. Well, what happens when one of those satellite systems, one of those data delivery systems that one of the governments runs, fails? What do you do? How do you find the other data? That's part of the whole DataOps process: building the resiliency into your process to be able to continue to deliver your data, your service, to your consumers with the least amount of interruption, at a reasonable cost, and with happiness to everybody involved. Thank you, Dan. That gives a very good perspective on how we can bring that resilience into the ops itself. So, David, do you have anything more to add, or maybe refute any of this? I'm not going to refute any of it.
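[Editor's note] Dan's resiliency point, keeping data flowing when a primary delivery system fails, can be sketched as a simple failover loop. Everything below (source names, the fetch logic, the simulated outage) is an illustrative assumption, not a description of Dan's actual system.

```python
# Illustrative failover sketch: try each configured data source in
# priority order and return the first one that answers.

class SourceUnavailable(Exception):
    """Raised when a data delivery system cannot be reached."""
    pass

OUTAGES = {"gov-satellite-a"}  # simulate one government feed being down
SOURCES = ["gov-satellite-a", "gov-satellite-b", "commercial-mirror"]

def fetch_from(source, region):
    """Stand-in for a real download; sources listed in OUTAGES fail."""
    if source in OUTAGES:
        raise SourceUnavailable(source)
    return f"{region} observations via {source}"

def fetch_with_failover(region, sources=SOURCES):
    """Return data from the first available source, recording any failures."""
    failures = []
    for source in sources:
        try:
            return fetch_from(source, region), failures
        except SourceUnavailable:
            failures.append(source)
    raise RuntimeError(f"all sources down: {failures}")

data, skipped = fetch_with_failover("midwest")
```

The recorded failures (`skipped`) are what feeds the alerting that the panel discusses later: the service keeps running, but a human still learns that a source went dark.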
It's very reasonable, I think. The only additional thing that I would say is that the agile methodology seems very important to bring into use with data. We're looking at operationalizing data so that you get a handle on its quality, and this is all about the procedures and processes, as well as the technologies and the people, that come into play to make this happen. So there's a cultural element to this as well, and the processes whereby the people interact with it need to be agile so you can make quick changes. Automation hopefully solves a lot of the problems, but the people need to be involved as well to sort out those out-of-line situations. So finding the balance between how much you can automate and how dangerous it might be if you over-automate is quite an interesting situation. That's another good point, about scaling and how you can accelerate the whole process. Mandy, over to you. Do you want to bring in some perspectives here? I think those are two sets of important points. I'd also add that data on its own is not useful until it's processed. So one of the other things we need to think about is how the data, as it's being turned into a product, is then consumed by analytics and brought into production. We have DevOps, which is focused very much on that link between developers and operations teams, and then we can think of DataOps as bringing the data scientist's life cycle into this production development process. You've still got the distinct life cycles and distinct tools that are part of it, but it's about bringing those teams together so that they can do their thing with their tools, but also exchange; the result is a system that is a good balance of the data, the analytics and the rest of the systems that operate around it. Thank you, Mandy. You brought up the keyword DevOps, and that takes us to my next question.
David, how would you differentiate DevOps and DataOps: where do they match, and where don't they? DevOps is around how you deliver applications in the developer's world, and DataOps is the same emphasis around data, with data science, analytics, models and all of the processes around them. Both of them seem to have agility in common, the agile methods. So I think that's probably the way that I differentiate between them. Any other thoughts? Mandy, do you want to add on to that, and then I'll jump in afterwards. I think the aims are similar: to take some of the risk out of the handover of work from one team to another. But it involves very different tools and different skill sets and different types of professionals, so I think overall they need to operate together, but there will be different technology involved and also different cultures around the pieces that are coming together to build a system that combines data, analytics and traditional software. What I would add, the point I would really expand on, is the difference between data engineering and infrastructure engineering. Data engineering and data science, in my mind, are really a continuum, where most data scientists are doing some amount of data engineering and many data engineers are really doing some amount of data science. You're really working with and in the data, and it's not just about programming, it's not just about code artifacts; it's actually about looking at the end result of what's being produced and making sure that it's consumable by the users that need to consume it.
It's a little bit different from the way we think about DevOps, which is deploying large amounts of infrastructure and keeping it up and running. The equivalent there would be to say every application is tailored for a particular small set of users and we have to rewrite the application for each, whereas in DataOps, when we start to think about data as a product, we have to ask: are we producing something that is consumable? How do we need to transform it, and refine the data itself, in order to keep making it better and more usable by this particular community? So it's an ongoing process of refinement in many cases, as well as being able to deal with all the resiliency issues, the scalability issues and the cost issues. Okay, very good. We are all good at coming up with new terminologies, coining new ways of thinking and so on, so I just want to ask: how do DataOps and data governance fit together? Are they one and the same, or how much do they differ? I don't see them as exactly the same. Mandy, I think you had a term that you had been toying with around this. I'm not sure which term you're talking about. DataGovOps. Oh yeah. I mean, data governance is much more around the culture and the way the organization thinks, both strategically and operationally, so it's a much bigger deal. With DataOps we're looking at the production of data for certain tasks, so it needs to fit in the data governance program, probably in the same way as all the uses of data, but it provides a mechanism whereby some of the data governance program can be automated. You have a question, please. So, MLOps: I've written a lot, or somewhat, about MLOps, and about where CI/CD fits in DataOps; is that basically what we're talking about here? It can be related. When you want to do machine learning you need to be preparing very large, wide data sets, typically,
and the process of preparing those data sets is usually part of a DataOps operation. When we get into MLOps, it's not only about being able to build out those algorithms and do the training, it's also about how we transition from just the machine learning part, the model part, to something that's production-worthy, and there's traditionally a very close relationship there between DataOps and MLOps. And there are actually a couple of tricky points around that. One point, on the pure MLOps side, is that you want to look for bias, you want to look for other factors that could be influencing the result, and you want to make sure that the data you're using to do your training builds an unbiased model. The problem, when you start to look at scaling up, is that the data sets you have available may not be the same. So let me give you a concrete example. We built a model for predicting crop yield of a particular kind. There are a lot of different ways of doing crop yield, it doesn't really matter; but say I build a crop yield model in the small, with a couple of growers' worth of data, and they happen to be in the Midwest of the US, and I build this wonderful model that's superbly predictive for what they have. Now somebody from India comes and says, can I apply your model to my fields in India? Well, it's not going to work, because the data is completely different. With the data you're training against, you have to think about how to scale your data to cover the use cases you're trying to target. That's a scalability point where MLOps is often limited, and often doesn't, in my opinion, think about the restrictions and constraints that the training set imposes on the model result. Does that make sense?
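[Editor's note] Dan's training-set scope point can be made concrete as a simple guard: before serving a prediction, check that incoming feature values fall inside the ranges seen during training. The feature names, ranges and function names below are illustrative assumptions, not taken from any panelist's system.

```python
# Illustrative out-of-scope guard: only serve predictions for samples
# whose features lie within the ranges observed in the training set
# (here, imagined ranges from Midwest-US growers).

FEATURE_RANGES = {
    # feature -> (min, max) seen during training
    "temperature_c": (-10.0, 40.0),
    "ndvi": (0.2, 0.9),
    "rainfall_mm": (300.0, 1200.0),
}

def in_training_scope(sample: dict) -> bool:
    """Return True only if every feature lies inside its training range."""
    return all(
        lo <= sample[name] <= hi
        for name, (lo, hi) in FEATURE_RANGES.items()
    )

# A Midwest-like sample is in scope; a monsoon-climate sample is not,
# even though it supplies exactly the same features.
midwest = {"temperature_c": 22.0, "ndvi": 0.6, "rainfall_mm": 800.0}
india = {"temperature_c": 35.0, "ndvi": 0.7, "rainfall_mm": 2500.0}
```

Real systems use richer out-of-distribution checks than per-feature ranges, but the shape is the same: the data side (DataOps) records the scope, and the model side (MLOps) refuses or flags predictions outside it.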
If you're training a straight ML model, then your model is limited by the set of data that's available in the training. So if I'm training for North America and suddenly somebody gives me India, even though it might be the same factors, temperature and NDVI and how much water is available and the variety of crop and whatever else, even though the factors are all the same, the fact that it is a different part of the world means that the actual processes might be different, and that combination of factors, even though they're all the same elements of data, when I apply them between this part of the world and that part of the world, I get a different answer. The model is different. So when we do MLOps, and this is also how it relates to DataOps, we need to think about what the scope of the data set is, and acknowledge that the training we're doing is valid within the scope of the training set. If I try to scale it beyond what the training set constraints are, I can't guarantee that I'm going to produce a good answer, and that's a combination of MLOps and DataOps; that needs to be the interaction. Exactly, yes, that's part of the scale-up process. And there is one more aspect we can add: even if it is the same environment, there is model drift, as we call it, over time. One of the examples in our organization is that we used to build these models every month; ideally we should build them every day or every week, because the data is changing every day, even for the same environment. That is much different from software, where once it is built, the same level of accuracy is assured; but since the model is based on the data, and the data changes almost every moment, we had to refresh these models. So that's one of the aspects of MLOps, actually; of course there are other aspects. It's both MLOps and DataOps together; it's just which hat you want to put on. So that brings us to
the next question: where do we start with DataOps in an organization that wants to take that on? The way we've found this works quite well is: because we're talking about big ideas, often you can try to boil the ocean, think very large and get nowhere, and you can work for a long time on these sorts of projects, in the governance space, in the master data management space, and get nowhere and have many failed projects. The way that we've found is successful is to be very tight on what you want to achieve, have a very small gain showing the process that you want to prove, and have a very invested stakeholder in the business, such that after something like a small number of weeks or months, but no longer than that, say three to four months, they can show business value, and then they can say: we can move to phase two and get the next small chunk. You need these quick wins; otherwise people get bored, they won't realize all this wonderful value that we're talking about, it might not ever happen. So you need those quick wins along the way, which is where the agile side is very important. Working in that way allows you to start small, prove what you've got, and then build on it incrementally. Mandy, do you want to add anything? Where do we start, what helps in your organization, any advice? Yeah, I think it is important to start with something that's going to matter to a stakeholder, so it does need to solve a real problem. The other thing, though, is the danger of starting small and keeping small: you have no path to go beyond that first model. So you do also have to think about what the end goal is, what we are really trying to achieve, and then very carefully think about cultural impact, because often getting the technology right is the simple piece, but a lot of the time, people are...
restricted in what they can do by the shape of the organization, so they're in a box, but that box gives them security. As you start to change people's processes and procedures and tools, they feel insecure, so you have to have a program that's not only doing the technology and proving the business value, but taking people on that journey and showing them that they have a place in the future; otherwise you're really not going to get any cooperation in the process. Yeah, I think that's very true, and it's a really important point that when you make changes, you have to provide the safety and the culture around that to support those kinds of changes. When we step back and think about where to start, one of the important things, and we've all said this one way or the other, is to ask: where can I demonstrate some new value? If I have systems and they're running just fine, then is there value in trying to apply a new practice to those systems? If I'm building some new systems, or I'm doing some new innovation work, and I need a way to do that in a structured way that allows me to move quickly, and where I'm able to take some risk because it's new and I don't necessarily have to follow the same established processes, that's a good place to start. So: innovation, thinking through the innovation life cycle and how to do that. But again, coming back to a key point, you need to have the support of the organization and think that through, and you have to provide safety, and that's sometimes why it's good to do it in a project that is recognized as being intentionally higher risk, and higher reward; that's okay. Yeah. So what I heard from you both is that starting small is good as a proof of concept, but once you scale up, the problems can be different and maybe there are some risks involved. Yeah, that's true, but if we think about the innovation life cycle...
Let me just spend a minute talking about the innovation life cycle the way I think about it, because I think it's relevant. We start with an idea, we start with some incubation projects, we look to get some validation around those incubation projects, and then we start to scale them up in different ways, and that process may require re-engineering what we're doing. What works in the small isn't necessarily automatically going to work in the large, and so part of what you want to do is fail fast. You may have a great idea, and the incubation project says, you know what, it's not actually going to scale, and that's good: now you move on to the next incubation project. And the same thing with DataOps: if you start to apply it down at the incubation project level, it gives you a way to learn what's going to work for you and for your organization, and if it doesn't work, that's okay, move on, change and keep going. So it's about the innovation process as well, and the people involved, and giving them that normal way of thinking that failure is okay, we'll just do the next one. There's no lack of work, there's no lack of interesting ideas; we just need to keep moving forward. Yep, question. So, on scaling: one of the important parts of scaling, in order to prevent over-automating, I would think, is knowing when things are failing. Our team has been using Airflow in combination with Great Expectations; we're starting on that journey. In terms of automated data testing, from what I saw there wasn't a lot of emphasis placed on tooling for it. Great Expectations was a fantastic tool for that, but I didn't feel like there was a lot of support around it. Maybe I'm wrong, your face is saying that maybe that's not true, but can
you talk about the open source community, some of the bigger projects similar to that, and what those are? Yeah, so I guess I have two quick points. One point is that quality is in the eyes of the consumer, so what's good quality for one project and one use may not be good for another. While it's important to set a baseline, like "I don't want to accept any nulls of this kind" or whatever, that's generally very clear; but once you're past a lot of the basics, you're into more nuanced kinds of quality metrics. So that's just one quick thought. I haven't personally seen as much open source quality work around that; maybe Mandy, you have more experience there than I do. No, I would agree with that. I think there are a lot of people writing pipelines and a lot of people writing training tools for different types of analytics, but much less on the basic quality tools. The really good discovery and quality tools tend to be vendor products, maybe. I think we were discussing earlier, David, that this is one of the things that are missing. Often we are talking about some of these issues: is DataOps complete, or what are the major issues in process, technology and tools? I don't know if it's complete; I'm not quite sure what we're talking about with this. I know we were talking about how, if we automated everything and left it all up to the computer, we could end up in a big problem, like the algorithms in the bank that go sell, sell, sell, sell, and you haven't got any money left; you get one of those sorts of micro-crashes that happen a lot. So there's this balance between automating as much as you can, but no more, and having proper human oversight for those places where it's required, so that you have an element of control over what your processes are actually doing, and you notice when they go off the rails. So you need
the metrics in place, and the alerts in place, to spot when they're going off the rails, such that the people, the humans, can get involved and rectify them. The term is often management by exception: you set your thresholds and then look for where the exceptions are, figure out your comfort level for where your trigger points are when you do your statistical process control, and then turn that into a human action in many cases. Thank you. That covers some of the basics, although there is more to discuss there. Let's switch gears and bring the open source topic into DataOps: how can open source and open approaches help in DataOps? I'll say that I'm a maintainer on the Egeria project, and Egeria is a way of solving the problem where we have silos of data with varying quality, varying formats, different technologies, different types of data. Egeria allows you to map into a common, open set of types that represents these things, and in the Egeria ecosystem you can then get a view across your data. So instead of saying, give me all your Oracle databases, you could say, give me all your assets, or give me all your glossary terms, and that could come from all the various different places. So I see that as a coherent way of getting access to the data and being able to then govern and classify appropriately, first of all at those capability levels, the open type levels; but also, if you want to have a semantic layer on top and govern at the semantic layer, having glossary terms, so that you know what a customer is, and the various attributes that make up your customer in your glossary, such that you can map it to the database columns, the event fields, the API fields, all of which might represent, say, the same national insurance number of a customer. That seems a very solid way to underpin such an operation if you want to
be able to get a view across all of your data via the metadata. Egeria has very solid core layers, and we're building out various access points and ways of doing it for different tools, different use cases, different personas. There's a lot of scope for the community to come in, and for us to come together with your use cases: join us in Egeria and other projects. Another project in this area that seems to be doing a very similar thing is OpenLineage. When I say a similar thing, it's producing an open standard by which many projects can produce lineage, this idea of where things have come from, so you can look at a report and see: where did this actually come from, what systems did it flow through? That is very important for financial institutions; that's often the big driver for a lot of this activity. Mandy, anything you want to add? No, I agree with David's point that open source allows companies to work together on integration and standards that are difficult for a particular vendor to push and get adoption for across the industry. So I think we're good at standards, we're good at getting companies to collaborate on key points where they're willing to work together, and also at writing the code that everybody needs but that really doesn't differentiate in the market. For us, the metadata repository is very expensive to build and everybody needs one; open source is a good way to share the cost. Providing those connectors is another area. So I think we have a key role to play in getting very specialist tools to work together and operate as a coherent ecosystem. I guess the one thing that I would add, just to push that point a little further, is that everybody already has a whole ton of tools, and each of them is its own little silo, its own little island. What you can do with Egeria is link those together, and now you can share and provide visibility across those islands, form those bridges, and get more value out of the investments that you've
already made. So I think that's another important point. Yep. You mentioned OpenLineage; OpenBytes is related to OpenLineage, it's in the same category in the landscape right now. Why is it in the same category, what's the difference, is it related? I don't know anything about OpenBytes; do you know anything about OpenBytes? No, I've not heard of it. OpenLineage is a set of events, I think based on OpenTelemetry. It's basically providing a standard set of event types that processing engines can emit to say: this job ran, it worked on this data. That means you can have capture tools that watch how processing moves from one engine to another and actually trace the changes that are happening. Today, in terms of pipelines, how do things like Amundsen and DataHub integrate? So Amundsen and DataHub and a number of others can all integrate with Egeria as peers. That's really what Egeria enables: somebody that has Amundsen over here and DataHub over here and something else over there and some other vendor's product over here can link those together, and they can be peers. So if different teams prefer a particular tool, it's not a problem, because their contributions can be shared even with people using different tools. And we often see, for example, that Amundsen is geared very much to the data science community; DataHub is data science and a little bit more on the data engineering side, giving you a little bit more visibility; and you take some other tool, and they go for a slightly different audience. So different groups gravitate towards different tools, which is fine, but the critical missing piece is often the bridges that allow you to share and interchange that information. Bringing it back... sorry, please go ahead. What tips would you give someone without a lot of influence in a large company that's been very siloed, where there's a lot of talk about data
governance but there hasn't been a lot of movement? I see democratization of data as being core to this, but an organization that's already been siloed resists that idea, and usually it's for security reasons. How can the open source community help with the democratization idea while also promoting security around that data? Mandy, do you want to start, or do you want me to jump in? I can give you an example from Egeria. Egeria operates in a peer-to-peer way, so each silo can invest in their own tools, and in the way the protocol works, teams choose what they share and what they see, and the security is done right at the instance level. So you can have two data sets in the same catalog, one very secure and one very open, and that's come from working with lots of different companies in different industries with those types of requirements. And because the technology is freely and easily available, different teams can download it, try it, and build confidence in it without needing a sales team and a demo platform and special provisioning. So I think that familiarity and ease of validation helps different parts of the organization build trust in a particular solution. I was just thinking, we're right at the end here, so I wanted to thank you for coming, and I would like you to think about contributing to our communities, with Egeria and OpenLineage being the big two that we're involved with. Bring your use cases, and if you want to contribute, that would be fabulous: bring your use cases, bring your code contributions and your enthusiasm and help, because this is a together thing, it's an open source thing, all about community. It's not us versus another; we would like this to be an open source project that brings people together. It's all about the integration, and that's what we think is most important. So there is one question I just wanted to ask, a quick one-minute type question: what are the emerging trends in DataOps? I
just wanted to quickly go through that. I'm not sure we have time. I'll just say that the thing with DataOps is, again, to embrace the innovation life cycle and leverage the open tools and the open integration that we're focused on. Any one sentence? No, nothing more, not in that time. Yeah, there are a lot more topics to discuss. You've heard about data mesh, data fabric, distributed data; there are a lot more things to discuss, but time is limited. But again, as David mentioned, please join the community and work towards making the data secure and viable, and if you have a GitHub account, please star the repositories, because the Linux Foundation loves that. Yep, and we'll stick around and answer any other questions, and there's some swag. Yep, we've got hats for all the people who have asked questions; well, for everybody here, we've got hats and bottles for anybody that's interested, and stickers. If you're interested in any of those things, come up to the front afterwards. Thank you, my distinguished panelists: thank you Mandy, thank you Dan, thank you David, and thank you all for interactively asking questions and being participative in this session. Okay, thanks, thank you so much, thank you.