Ah, Buddhistius, thank you very much. It's an honor to be presenting here at BigThings, and always so good to be back. I want to show a QR code here: if you want to load up the slides, I keep them online. It's a slightly earlier version, so it won't be exactly what I have in my talk, but very close, if you want to follow some of the links. Now, this talk is about data governance, and I have definitely seen experts who feel that it's not only an important topic, it's one of the most important topics in the field, because if it's not done correctly, it holds back everything else we're doing. I'm not an expert in data governance. I've had many years in computing, and frankly, over that time, we weren't really sure whether data governance was just an aspiration, something wishful. But the fact is, I'm a practitioner in data science, and this has become a vital topic, so it should be something that I know about. The history of this talk is that my colleague Ben Lorica — who helped create, or helped lead, I should say, the O'Reilly AI Conference and also the Strata Data Conference — and I had been doing some surveys, and we found a great deal of interest in the field about data governance and closely adjacent topics. So, as an assignment from Ben, I did a report based on this, to try to understand what the issues are, who the vendors are, what's real, what's really driving it, where the open source standards are, et cetera. Now, there were three surveys that we did. Last year I showed, I think, at least one of these, but we've done three large surveys out in industry about the adoption of machine learning, and with that, some related areas: the adoption of cloud and how data infrastructure is evolving. These are all free — they're available for download from O'Reilly Media — and we were able to get a lot of information back about some of the current trends, especially in the enterprise.
So for instance, taking a look at what some of the top issues are, you can see it broken down here. What we tried to do with this study was to draw a contrast. We looked at companies that had been using machine learning in production for five years or more, and then we looked at companies that had not even started — trying to understand the contrast between those: what is different? We also looked at companies in the middle that are just adopting machine learning, and as they come into it, what kinds of challenges do they face? So in this contrast study, we're hoping to find out, for the companies that do this well, what their traits and characteristics are. A textbook definition of data governance is that it encompasses the people, processes, and information technology required to create consistent, proper handling of an organization's data across the business enterprise. Another definition I like is from Rob Carroll, and it says that even people with decades of data management experience tend to brush data governance under the rug, for two main reasons. One: traditionally it's been very hard, so we're not really sure whether a solution would work. And two: there's been a history of false starts — vendors come out and make claims, so it's usually met with a lot of suspicion. Now, one analogy might be that data governance is to your data assets as HR is to your people. And a real truism operating in this field is that companies come into this kind of work because of risk: because they have compliance requirements, or security risks, or other things which cost them. But once they begin these practices in earnest, what they tend to find is that there are efficiencies, there are upsides. By having efficiency in process, you get cost reductions, and you may even open up new business lines because of the insights that you gain.
So it's sort of like doing data science about your business. Now I'll quote a friend, Alistair Croll, who also does the Strata Conference. This is very much a generalization, but what we see geographically is that in East Asia, there's a lot of data, and there are concerns about whether the data is correct, because the governments are very much involved in the collection of data. In Europe, of course, we worry a lot about the storage and analysis of someone else's data; there are historical precedents that are very terrible, and we try to avoid repeating them. In North America, we tend to worry about the unintended consequences of data, because it's very litigious — there are a lot of attorneys, and a lot of legal action is what's driving it. But in reality, all three of these concerns are valid, and for good practice, you need to be able to put all three of them together. So looking geographically, we tried to see whether there are any differences. It turns out that in North America, data governance has slightly more momentum — there's more activity, but just slightly. It's pretty well distributed across these three global regions. I want to give some history about where this field came from, because I do believe it's very important. First off, in the first three decades, 1960 to 1990, there was some data modeling, but not really a lot in terms of policy or any kind of automated management — negligible amounts of data modeling. Then you get into the 1990s and you have the first generation; it's often called the enterprise repository era. And then more and more policies picked up. It went along slowly, as I mentioned before, until 2018. And in 2018, we had three major things happen. One: GDPR went into effect. Two: major issues came to light about what had happened in 2016 with Facebook and Cambridge Analytica. And third, which is not as widely known: 2018 was a major year for cyber attacks.
There were over 100 million people worldwide who had their personal information exposed online. The Wall Street Journal called it a global reckoning on data governance. So, to lead up to this: where did this come from, what's driving it, and what do you need to know? If you go back to the 1960s and 1970s, we had mainframes: a very large server, with terminal devices connected over proprietary wires — the networking was all very proprietary. Going into the 1980s, we saw an evolution to client-server, where you had different servers that could talk with each other, one being a client of the other. With that specialization, you began to have more databases and database servers, and a little bit of policy regarding data governance — but just a little bit. In the 1990s, really two things happened. On the one hand, internetworking became popularized: TCP/IP had been introduced in 1983, but by the 1990s you started to see commercial traction. And so you get three layers: a presentation layer, then the application logic layer — the business logic moved into the middle — and then the backend, which is typically where your databases were. This is what's sometimes called the three-tier model. What it meant was that the business rules that had operated in a client before, in the client-server architecture, were now moved back into the middle tier, so you had some consolidation where you could apply policies. On the other hand, the other big thing happening was the rise of data warehousing — enterprise data warehouses — and also the rise of business intelligence. Throughout the 1990s, these two things really took hold. They focused a lot of the data management inside the enterprise data warehouse, and that's really where the early data governance work was focused.
But you see the diagram is getting increasingly complex; it's beginning to spread out. Then we get into the 2000s, and this diagram — the typical architecture — becomes much more complex. In the mid-2000s, you have the introduction of cloud. You still have a data warehouse in the enterprise, you still have your business intelligence off of that, and you have some data governance policies applied to it, but you're also starting to have more and more data science kinds of activities emerge, and a lot of that is starting to move out into the cloud. So you start to see what Andrew Ng calls the virtuous cycle of data: from your web applications you collect log files, those log files are aggregated, you run machine learning over the aggregate data, and it feeds back as data products into your web applications. You get that virtuous cycle, that flywheel, which led to a lot of what we have now with machine learning. And this is what really led to the takeoff of big data — certainly we saw this in companies like Google and Amazon and others. And then the clients, the people using your data, are elsewhere: they're out beyond an edge cache, probably, and more and more they're on mobile devices. So this whole landscape is becoming more complex, much harder to govern, and more security threats are coming in. You move into the 2010s, and that virtuous cycle has really, really evolved. We now talk about data science, machine learning, cloud computing — these have all become accepted. And as it turns out, some of what you do with your analysis is coming out of other departments: your data science department is probably feeding back into business intelligence or other activities in the organization. So where you need to govern your data has changed; it's now in at least two or three places. Meanwhile, the whole area of mobile devices has just expanded. Then we bring it up to now.
And so now we live in a world that is arguably hybrid cloud or multi-cloud, and it's much more complex. Certainly we rely on doing a lot of cluster computing, using big data, doing a lot of data science work. There are probably still multiple data warehouses. Throughout this time, the companies involved have gone through lots of mergers and acquisitions, so there may be different types of IT policies floating around in different departments. There's a lot more emphasis on the security threats that can happen through mobile devices. There are also edge devices, IoT devices, and external data partners. So this has become a very complex landscape. And the problem is that the solutions that have evolved are very much point solutions: they are applied in one part of this landscape or another, but there's very little that ties them together. Along with this, we get the dual aspect of compliance. We have a lot of issues in terms of privacy, security, ethics, and other matters of compliance — and these are risks. So data governance is facing a very complex landscape. Part of the problem is that for a long time we had this idea that if you just put your data in a data warehouse and lock it down there, you can apply policies, so everything's good — we can keep the regulators happy. But the reality is that we moved the data out of the data warehouse. We moved it into data lakes; we moved it into other areas for your data science teams. And along with that, we had the Agile Manifesto, and the notion evolved that the legibility of a system equals the legibility of the data in that system — and it's not true. So let me ask a question: in the Agile Manifesto, how many times is the word "data" used? Just shout out your answer. More than two times — anybody? Okay. One or two times — anybody? Zero times — anybody? Okay, good.
Yeah — the Agile Manifesto mentions data zero times, because the idea was that the code was what's important. Let me move up here. The idea was that code is preeminent and data is secondary: you can have a database, you can have some sort of data model. And that is what really took hold in the mainstream over the last 20 years. Meanwhile, the companies that got the first-mover advantage — Google, Microsoft, Amazon, Apple, Facebook, et cetera — had a different view. Their view was that learning is preeminent, and that data is a competitive differentiator. Now, those two statements really don't reconcile; it's quite a divergence. But we're still stuck in this landscape that's very complex and becoming more complex — because of edge inference, streaming data, multi-cloud, a lot more security issues, et cetera. So it's a problem. This is the problem of data governance. But fortunately, there's a new role. It's called Chief Data Officer, and it'll fix everything. There are a lot of issues, but allegedly, in the corporate world, a CDO has all of these responsibilities. I think this really puts too much on the plate of the CDO, and better mechanisms have to evolve at a lower level. Okay, so to that point, we went out to research what is available. If you go to the slides — they'll be published on Twitter a little later, and I have them online — we created some Google Docs, some Google spreadsheets, that survey all the different vendors in this space, as well as the different open source projects available in this space, trying to compare their maturity and understand their relative features. Now, certainly there are the big four that have been around for a long time — IBM, SAS, Informatica, SAP — but there are a lot of other smaller firms coming up.
And if you look at the dates, it's interesting, because some of the new ones really came out in 2014 or so — in the past five years. So we have seen a lot of momentum for new solutions here, although again, they're still point solutions. But fortunately, there are standards emerging. If you work in science, you may have heard of the FAIR Guiding Principles — the FAIR data principles — from Mark Wilkinson, Carole Goble, and others. This is good as far as research is concerned. It definitely has excellent guidance about how to share data and what data providers need to do, and it's really driving toward reproducible research, which is important in science but also important in industry. For data science teams, you need to be able to make your results reproducible so that other departments can agree and you can get confirmation — but also just in terms of security, to make sure there weren't mistakes, that you aren't making decisions off the wrong data. There are also metadata standards, if you haven't worked in that area, such as DCAT, which describes very formally how to represent a data catalog — usually a core component of your data governance. There are related things like PROV-O and PAV and other standards. These have been around for a while, and they're actually used quite a lot — maybe not as visible to the end user, but they're under the hood. There's also a project called Egeria, now at the Linux Foundation via ODPi. IBM has really pushed this.
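To make DCAT a bit more concrete, here's a minimal sketch of a catalog entry expressed as JSON-LD, one common serialization for DCAT. The dataset name, URIs, and publisher here are hypothetical, invented purely for illustration; the `dcat:` and `dcterms:` terms come from the W3C DCAT vocabulary.

```python
import json

# Hypothetical catalog entry for illustration only. The @context maps short
# prefixes to the W3C DCAT and Dublin Core Terms vocabularies; everything
# else (URIs, titles, publisher) is made up for this sketch.
catalog_entry = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dcterms": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.com/dataset/wage-records",
    "@type": "dcat:Dataset",
    "dcterms:title": "Quarterly wage records",
    "dcterms:publisher": "Example Agency",
    "dcat:keyword": ["wages", "employment"],
    "dcat:distribution": {
        # a Distribution describes one concrete, downloadable form of the data
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://example.com/data/wage-records.csv",
        "dcat:mediaType": "text/csv",
    },
}

print(json.dumps(catalog_entry, indent=2))
```

The point of the formality is interoperability: because the vocabulary is standardized, any catalog tool that understands DCAT can index an entry like this without custom integration work.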
In particular, Mandy Chessell, who's at IBM UK, has an excellent paper called The Case for Open Metadata, and there have been implementations in Apache Atlas and others. The notion is that while we have point solutions to address data governance in a data lake, data governance on mobile devices, data governance for IoT, et cetera — while there are those point solutions — the important thing is that if you have standards for metadata exchange, then you can validate that these systems have the ability to talk with each other correctly. By having that kind of exchange, it's a little bit more of an internetworking model, and that makes the most sense for right now. As well, there are government initiatives in the US, and I've been involved in this over the past year — I've become much more involved in this area. I never thought I'd get involved in government, but apparently so. There was some legislation that went through the US Congress about evidence-based policymaking — and there were prior laws as well, regarding data quality — and that's led to a strategic initiative out of the federal government. There's a year-one action plan, and there have been executive orders about artificial intelligence. It's interesting: what it says is that the different agencies that are managing data need, as much as possible, to make that data available for other agencies or researchers to use. It's introducing many data science principles into government, at least in the US. Okay — and the other issue is that while we were doing all of this, machine learning became very popular, very valuable, very useful, and machine learning introduces a lot of changes into the software engineering life cycle. If you came out of more standard engineering and you're encountering machine learning for the first time, what you find is that these are stochastic systems.
Machine learning models are built in probabilistic ways, and when you have people on a board of directors or executive staff — maybe they have 30 years in IT, but they grew up thinking about things like Six Sigma — now they're encountering probabilistic systems where they have to embrace the uncertainty and really leverage it, as opposed to getting rid of it. That changes everything. There's a really great paper called Ground: A Data Context Service, by Joe Hellerstein and others out of Berkeley, from 2017, and it describes this landscape probably better than any other. But in the interest of time, let me move forward a bit. What I found while I was researching this is that there's a new category emerging, and it's not from the cloud providers; it's from a younger generation of companies. If we talk about Amazon, Google, Microsoft, Baidu, et cetera as the cloud providers, there's a younger generation: Lyft, Uber, LinkedIn, Stitch Fix, Netflix, Airbnb. What's interesting is that, going around, all of these companies have efforts around data governance. I've got a list here — actually, many of these are now open source, and they're beginning to talk with each other, but a year ago they probably didn't even know about each other. Typically what's happened is that the company recognizes it has compliance requirements, like GDPR, and — maybe reluctantly — goes and starts to build out its data catalog, tracks the usage of its data sets, and begins putting together a lot of metadata to describe data usage in the company. Each one of these projects is addressing the idea of building a knowledge graph out of that metadata about your data set usage.
And the interesting thing is that once you build that knowledge graph and build applications off of it, you'll typically see outcomes like better search and discovery within your data science teams, and certainly better responses for compliance. But you may even find cases where you can open up entirely new business lines, because now you have a global view of the data that's coming into and moving through your organization. Case in point: Lyft. I was talking to Mark Grover, who was the project lead on Amundsen — they recently made some announcements about Amundsen going open source — and what they saw was that data scientists were spending upwards of a third of their time just trying to understand which data sets they need to use, what the different columns within each of those data sets are, and how to use them. By introducing this knowledge graph of metadata about data sets, they really cut the time that data scientists have to spend exploring the data; now they can begin using it and learning from other teams. There's also a training aspect to this, obviously — onboarding. So definitely watch this space, because I think this is the most interesting and most immediate thing happening in data governance. Okay. Real briefly — I can talk about this later — one of the projects I'm on is led by Julia Lane, who has an office at the White House, and it's called Rich Context. We just did a conference in DC this last week, bringing a lot of people together. It's about a knowledge graph of metadata about data sets. And the data sets may come from many different directions: from federal agencies, like wage data or census data, or from state agencies, such as support for families or prison records or things like that. And it's only by being able to bring these different siloed data sets together and understand how they fit that we can really move forward.
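A metadata knowledge graph like the ones these projects build can be sketched, very minimally, as a set of typed edges between data sets, columns, owners, and consumers. Everything below — the table names, team names, and relation labels — is hypothetical, invented to illustrate the kind of discovery queries (which columns does this table have? what does this team own?) that tools like Amundsen answer at much larger scale.

```python
from collections import defaultdict

# Toy metadata knowledge graph: nodes are strings, edges are (relation, node)
# pairs. Real systems use a graph database; this just shows the shape of it.
class MetadataGraph:
    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self.edges[subject].add((relation, obj))
        # store an inverse edge too, so we can traverse in both directions
        self.edges[obj].add(("inverse:" + relation, subject))

    def neighbors(self, node, relation):
        return sorted(o for (r, o) in self.edges[node] if r == relation)

g = MetadataGraph()
g.add("fact_rides", "has_column", "fare_amount")
g.add("fact_rides", "owned_by", "team:marketplace")
g.add("fact_rides", "used_by", "dashboard:weekly_revenue")
g.add("dim_drivers", "owned_by", "team:marketplace")

# Discovery queries a data scientist might run:
print(g.neighbors("fact_rides", "has_column"))           # ['fare_amount']
print(g.neighbors("team:marketplace", "inverse:owned_by"))  # ['dim_drivers', 'fact_rides']
```

Once usage metadata accumulates in a structure like this, the same graph serves search, compliance reporting ("who touches this data set?"), and onboarding.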
And again, this speaks to the notion of evidence-based policymaking in the US. This is also being used in Germany; some of my colleagues there have a very similar kind of effort for their research data centers. Also, I've been involved over the past year with Project Jupyter features for Rich Context. These are just now getting announced and released, but there are several features. One is called a data explorer, where you can register data sets. Another is a metadata explorer, where once you've registered a data set, you can move through the knowledge graph of metadata about that data set and understand what the columns are — maybe get annotations that have come in from previous data scientists or researchers. There's also annotation and commenting, so that if you're working with a data set and you notice something is wrong, you can highlight it, and that will feed back through telemetry to the data stewards for the data set. And there's a telemetry package: within an enterprise context, if people are using Jupyter notebooks, registering data sets, and working with them, then there's privacy-preserving logging happening, collecting telemetry that feeds back in. These Jupyter features are part of what we're working on for the government project, but they're also very good for academia and research. Now, as far as the long-term outlook — I'll just wrap up here; I'd love to talk about this in more detail — speaking to the business issues: when we look at machine learning adoption in industry and we look at the challenges, there's a difference between the companies that have a very mature practice and are doing well versus the ones that aren't even getting started. If I were to take a survival analysis view of this and see where they run into problems, the first problem for the companies that aren't getting started yet is that they're typically buried in tech debt.
Their data infrastructure needs so much work. Number two, they have a company culture, from the top, that does not recognize the need for data science and doesn't understand machine learning — they've come out of, like I say, Six Sigma, and they don't understand probabilistic systems. And number three, there's a lack of people in product management who have the understanding to translate between machine learning applications and the business use cases. So for the companies that aren't getting started, those are the big three problems that block them. In the middle, the companies that are adopting have data quality issues — everybody does; we always struggle with this. They're struggling to hire enough of the right talented people — we're all struggling with this. And thirdly, there are compliance issues, security and legal concerns, and other priorities that are competing. But once you get past all that, you get into an area of hyperparameter tuning — things that are not risks, things that are actually improving your business. I talk tomorrow about AutoML; we'll explore that there. What this leads to is a world that's very polarized. Okay, let me wrap this up. There's a dozen companies that have a first-mover advantage because they're cloud providers — or, like I said, the AI natives. They tend to go after the really easy problems that are very valuable. They do some work on hard problems, but they tend to avoid having a lot of people involved; they lean heavily on automation. So I think in the enterprise there are a lot of opportunities for things like human-in-the-loop, which makes the best of both your people and your automation. But if I look at the other two categories below, it's roughly 50-50: the companies that are really moving forward with machine learning versus the companies that aren't getting started.
And so what I showed in the previous slide about where the hazards are really applies here. The point being: 50% of enterprises are still struggling to understand the business use case for machine learning; they're buried in tech debt; management doesn't understand and doesn't buy in. That's how this world is split right now. And what we're seeing over 2019 is a very surprising amount of investment from that second category, the adopters, and the top category, versus the laggards. The amount of investment in machine learning has accelerated, and so the gap between these categories is also accelerating. One of the biggest factors in this is just how much the hardware is changing. I did work in neural networks back in the '80s, and we had really great software going around — we just didn't have the hardware that would run it. It wasn't until we got to 2012 that we had breakthroughs in neural networks, because finally we had the hardware. And you see this — I don't know if you've seen the Cerebras wafer-scale engine, but they have chips with memory that's 40,000 times faster. When you look at the largest GPU, it has 21.1 billion transistors, but the Cerebras deep learning chips have 1.2 trillion. It's just orders of magnitude larger and faster. So the hardware landscape is changing drastically. I'm going to scoot through this. Just one final point: there's a great report from Berkeley that described cloud computing early on, and it really set the tone for the next five years — it predicted a lot of what would happen in terms of the rollout of cloud and what the issues were. They did a follow-up ten years later, earlier this year. So I recommend — again, as a contrast study — looking at both of those papers if you want to understand this. Here they're talking about the contemporary patterns. The previous paper they did was very prescient, and this one, too, I think is holding up quite well. Eric Jonas is the lead on this paper.
It was done with Dave Patterson, and they have some surprising ideas — namely, that a lot of the services are moving up the stack, and we're probably going to be moving more and more toward, not necessarily serverless, but a model where you don't really care about the individual servers; you're just getting a high level of service. This does lead to more lock-in, but importantly, it decouples computation and storage. And that's interesting, because it runs contrary to the original design hypothesis behind systems like Hadoop and Spark. Okay, I guess I'm on time. I could talk about a lot more, but I will be having an office hour session — Meet the Expert — immediately after this, and I look forward to talking with you all if you have any questions. Here are links, and I will have the slides online. Thank you very much.