Hello, I'm Gavin Mendel-Gleason. I'm the CTO of TerminusDB, and I want to talk to you about collaboration for structured data. First I'll give a brief outline of the talk: the motivation, why we need structured data in the first place, the problems that data management presents for teams, and some of the challenges. Why can't I just use Git for my structured data? Why can't I just use my database? And then I want to propose a solution: a distributed data collaboration approach using revision control.

Okay, so first, why structured data in the first place? It's important to look back at why we have structured data and how it's of use to us before we can understand how to use it in a modern pipeline. Data is core. Structured data gives us the ability to surface the data that we need from our enterprise, or from whatever project we're working on, and ultimately to give the right data and the right information to people so that they can act on it. That's really what gives data its value in the first place.

The software that we use to manipulate data may be giving the data to other software. For instance, once we structure the data, we might clean it, we might use it in some sort of machine learning or artificial intelligence pipeline, and then we have enrichment. In an artificial intelligence pipeline, you'll be enriching the data with categorization, feature selection, or some other kind of output that adds more information. It could be some sort of statistical analysis, or it could even be human beings doing the enriching.

The structured data might also be utilized by software for display. We might want to make graphs and visualizations, publish it to the web, or publish it internally in some sort of dashboard or something along those lines. Giving a structure to the data allows us to do all of that more effectively. And usually, in a real process that we're using in practice, both of these aspects will be required of the data.

Currently, data analytics and data science teams need good structured data in order to make the right decisions. If you want to beat the competition in terms of what you know about something, or the insights that you have about something, then you really have to be at the forefront of utilizing that data effectively. But there are a number of serious problems in the way that we're currently doing it. Currently about 80% of the time that data scientists spend goes into curation, and often this curation is performed repeatedly without any proper pipelining. So we have type description and structure being imposed repeatedly, in an ad hoc fashion, multiple times. I've seen this a lot in practice. People send around a CSV, they ingest the CSV, they figure out the meanings of the various columns, they cast the different columns to the appropriate data types, they use the result in some sort of machine learning process to get some valuable feature out of it, and then they dump that feature set to another CSV which is then ingested again downstream. And that whole process is repeated: the information is continuously re-ingested and the type description happens over and over. This creates a lot of possibility for error.
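To make that concrete, here's a minimal sketch of that ad hoc ingest-and-cast step, assuming pandas and a hypothetical customers.csv with made-up column names. Every hop in the pipeline tends to repeat something like this rather than capture it once:

```python
# A minimal sketch (hypothetical file and column names) of the ad hoc
# ingest-and-cast step that gets repeated at every hop of the pipeline.
import pandas as pd

# Everything arrives as strings; the "schema" lives in someone's head.
df = pd.read_csv("customers.csv", dtype=str)

# Re-impose types by hand, again.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["lifetime_value"] = pd.to_numeric(df["lifetime_value"], errors="coerce")
df["country_code"] = df["country_code"].str.upper()

# ... run a model, derive features ...

# Then dump the result back to a typeless CSV, so the next person
# downstream has to rediscover and recast all of this from scratch.
df.to_csv("customers_with_features.csv", index=False)
```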
A lot of the time you can end up with different interpretations of the data that are frankly wrong. So it's important to have metadata to understand what we are recording, why, and what type it is: at the very least the type, but more information is better. And less time is spent on this metadata than really makes sense. We need to be spending more time on it, because if 80% of your time is going into curation, you don't want to be duplicating that effort; you need to be utilizing it effectively.

Curation and editing to obtain high quality data is therefore also critical. The cleaning stages tend to be iterative: you ingest the data, you find out that there's something wrong with it, you have to go back and change it. And you need a way to make that change so that other people in the pipeline, or who are ingesting the information, also get those changes. Structured data makes that editing safer, because you won't change something in an inconsistent way: if it's a date field, you will add a date to it. So if you have structured data with constraints on the types of information it contains, that helps to avoid problems. If you have links between different kinds of objects in your data, you also need constraints around referential integrity. Databases provide this; CSVs and Excel don't really. Well, you can to some extent in Excel, but not at all in CSVs: it has to be imposed as a practice rather than something that can be checked by machine.

It's also possible to create automatic curation interfaces. If you have enough metadata about the data supplied, so you know what the type is and what the field means, you can give information and context to that field such that it can be ingested automatically into an interface and then described to the user, so the user knows what that field is, what they're editing, and why.

Then the publishing of the data requires easy access and sufficient information so that you can surface the information in the appropriate way. Query languages are key here; they give us the discoverability we need. We need to be able to say: I want information about item 1368. Or maybe you don't know the number, you only know the name, and you want to find all the items that have a similar name; then you get that item, and you have information about it for surfacing in a web page on Amazon.com or something along those lines. Having structured information there gives you the ability to automatically construct the appropriate published output. We can display the information as graphs, charts, forms, web pages, all kinds of different things, as long as we have enough data, and metadata about that data, to surface it appropriately.
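As a small illustration of what that structure buys you, here's a sketch using Python's standard-library sqlite3 module. The item and category tables and the values are hypothetical, but it shows typed fields, referential integrity, and the kind of discovery query just described:

```python
# A minimal sketch: a schema with types and referential integrity, plus a
# query to discover items by name rather than by id. Hypothetical tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE category (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE item (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        listed_on   DATE,                -- a date field stays a date field
        category_id INTEGER NOT NULL REFERENCES category(id)
    )""")
conn.execute("INSERT INTO category VALUES (1, 'Books')")
conn.execute("INSERT INTO item VALUES (1368, 'Discrete Mathematics', '2021-01-15', 1)")

# Discoverability: I only know part of the name, not the id.
for row in conn.execute(
        "SELECT id, name, listed_on FROM item WHERE name LIKE ?", ("%Math%",)):
    print(row)

# Referential integrity: this insert would fail, because category 99 doesn't exist.
# conn.execute("INSERT INTO item VALUES (2, 'Mystery', '2021-02-01', 99)")
```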
The problem we're facing, though, is that currently we're still in the dark ages, and I both want to convince you that this is true and then suggest some ways we might get out of this problem. Whether we're talking about structured data, unstructured data, or semi-structured data, as you might call a CSV file, we're doing it wrong at the moment. It's not the right way to do it. Major aspects of current data pipelines do work, for some value of work, otherwise we wouldn't be doing it. We are getting value from our data as it stands, and it does help; we are getting valuable insights.

However, the situation is very awkward, and we're imposing long-term costs on the agility of businesses and of pipelines, because they're so ad hoc and because we don't have good pipelining solutions at the moment. Getting the structured data pipeline right is going to be the critical aspect of success for businesses going forward. The businesses that are more agile will be able to move as market concerns shift and new insights come in; they'll be able to act on them, and they'll be able to continually iterate and update their process so that they're always at the forefront and don't lag behind. That agility is going to prove very important. And even though there are some upfront costs to structuring your data appropriately, over time the amortized cost is going to be much lower.

Core enterprise data at the moment is still maintained in Excel files and CSVs to an incredible extent. We'll talk about databases in a minute, but there's a reason the data is in Excel files and CSVs. Some people think, oh, it's just stupidity. It's not entirely that. There is perhaps some stupidity involved in some of these decisions, but there are also good reasons.

Currently the distribution model for the data is really email and Slack. People take Excel files and CSVs and email them to each other, Slack them to each other, or maybe more sophisticated users use secure copy or something along those lines. Users then repeatedly clean, reparse, and recast to obtain structure, particularly in the case of CSVs; it's a little better with Excel. And we have enormous problems with versioning. Because we're distributing these in an ad hoc way via email and Slack, we end up with multiple edits of the same version, and Excel's linear versioning doesn't really account for this very well, so you have to manually merge them by figuring out whether foo1.3.final2.xls was modified and whether final3 came after it or is related in some other way. This works if your team is small. If the data science team is sufficiently small, you can manage it, but it really turns into a mess with larger teams. We've seen the problems this can cause in larger teams in practice, and they can be really disastrous: changes that are not understood somehow end up distributed across the team over time, and important information or data gets lost. Sometimes that can be very costly.

There's also the problem that the metadata is largely non-existent. With CSVs, you don't even know what the type of a column is meant to be without some sort of external information. Some more sophisticated actors will keep a metadata CSV file containing information about the main CSV to help with processing, but then you somehow have to manage the pipeline of keeping those two things in sync. With Excel you're a little better off: you have more structure, you can have formatting, formulas, and type information. That's an improvement, but you still have very little information about the meaning of, say, a column. Why is the third column an integer? What does it actually represent? You want to be able to record more information than you can fit in a column header. And this actually turns out to be really important.
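To illustrate the sidecar-metadata workaround mentioned above, here's a minimal sketch with hypothetical sales.csv and sales_metadata.csv files, assuming pandas. Notice that nothing enforces that the two files stay in sync:

```python
# A sketch of the "metadata CSV" workaround: a hypothetical sidecar file
# describing each column, which then has to be kept in sync with the data
# by hand.
import pandas as pd

# sales_metadata.csv (hypothetical contents):
#   column,type,description
#   order_id,integer,Internal order identifier
#   order_date,date,Date the order was placed (UTC)
#   amount,decimal,Order total in EUR
meta = pd.read_csv("sales_metadata.csv").set_index("column")

df = pd.read_csv("sales.csv", dtype=str)
for column, spec in meta.iterrows():
    if spec["type"] == "integer":
        df[column] = pd.to_numeric(df[column], errors="coerce").astype("Int64")
    elif spec["type"] == "decimal":
        df[column] = pd.to_numeric(df[column], errors="coerce")
    elif spec["type"] == "date":
        df[column] = pd.to_datetime(df[column], errors="coerce")

# Nothing stops sales.csv changing while sales_metadata.csv doesn't.
```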
Now, there are ways to deal with this problem of distribution, versioning, and collaboration using some of the existing tools people already have. If you use Git for CSVs, you're way better off. For most use cases you'll be better off keeping the data in Git and using that as the distribution model rather than Slack or something like that. But it's still quite awkward for data. Some of the problems: Git stores changes as lines of text, which is not necessarily the right granularity for looking at changes to data. It also doesn't scale brilliantly on data; it doesn't go up to enormous sizes without some effort, it's often quite awkward once you do reach those scales, and there are ultimately limits on the scale it can provide. What we really need is something like Git that exposes structure naturally: something that has the structure of the data and the type information, can do versioning on all of that, can do versioning on the metadata, and can give you discoverability, which is something Git doesn't do. These are not easy to get as features out of Git.

We also have databases, and some people who have been listening to this will say: okay, why don't you just do it properly and put it in a database? Well, yes. There are big advantages to using a real, proper database. You get schema control, you get referential integrity, you can add constraints, you can have type information. And, although it's massively underutilized, you can generally add comments on columns and tables as well, something that is surprisingly unused in practice.

But databases are also very awkward, and that's why we still have CSVs and Excel. There are problems with databases that make them difficult to use in practice. You can add revision control features to a database, and this can help overcome some of the problems we have with collaboration on databases. But it involves a lot of mucking around with the structure: you have to bake it into your schema somehow if you're not using a tool that is built around revision control in the first place. This might not seem like too bad a problem at first, and it's often much better than not doing it, or than one of the ad hoc CSV approaches. But it also imposes costs in terms of agility. Changes to the schema structure over time become very difficult to manage, because you always have to design and engineer things very carefully to make sure the version control stays consistent. You also have to hand-roll your distribution model.
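To show what "baking it into your schema" tends to look like, here's a minimal sketch of one common hand-rolled pattern, using sqlite3 with hypothetical table and column names: rows are never updated in place; instead every row carries validity columns, and each change closes the old row and opens a new one.

```python
# A sketch of hand-rolled revision control in a schema. Every table needs
# this treatment, every query has to know about it, and any schema change
# has to be engineered around it. Names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE price_history (
        item_id    INTEGER NOT NULL,
        price      REAL    NOT NULL,
        author     TEXT    NOT NULL,
        valid_from TEXT    NOT NULL,
        valid_to   TEXT            -- NULL means "current version"
    )""")

conn.execute("INSERT INTO price_history VALUES (1368, 9.99, 'alice', '2021-01-01', NULL)")

# An "update" is now two statements instead of one.
conn.execute("UPDATE price_history SET valid_to = '2021-06-01' "
             "WHERE item_id = 1368 AND valid_to IS NULL")
conn.execute("INSERT INTO price_history VALUES (1368, 11.99, 'bob', '2021-06-01', NULL)")

# The current state is a query over history, and every consumer must know that.
current = conn.execute(
    "SELECT item_id, price FROM price_history WHERE valid_to IS NULL").fetchall()
print(current)
```

That extra machinery in every table and every query is exactly where the agility cost described above comes from.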
The current way of using databases also creates antagonistic classes, and this is something we see very commonly in large enterprises. You have an antagonistic rather than a cooperative relationship between the various people inside your organization, and this imposes a high cost on collaboration. You have gatekeepers, often the database administrators, though it depends on the engineering culture and how things are structured; sometimes it sits inside engineering itself. But somebody ends up as a gatekeeper, and they want to stop the engineering team from wrecking the core database, because the core database holds important information that should not be lost.

And in order to stop them, you end up imposing barriers to entry for these people. We don't want them to destroy the central core, so we don't let them edit it; you have to go through Jim to get anything into the database. That means the experimenters keep their data local and don't share it, because of the cost of dealing with that hard-line gatekeeper, who's a real pain in the arse. Now, the gatekeeper isn't doing it just to be mean; he's doing it because you really don't want to destroy all of the centralized data. And the experimenters aren't keeping their data local because they want to keep it from being shared; they're doing so because it's hard to get past Jim, who's trying to stop things going down the tubes.

In addition to this, metadata is still very poorly maintained, even in well-designed cases. We've all seen a column in a database named something like QV_IPS_RQ and had no idea what it means. Now, some of this is the fault of the designer, who didn't impose the appropriate foreign key and almost certainly didn't put a comment on the column. But there's also something about the structure of the databases we currently have that privileges treating columns as just data types, without thinking about their metadata and how important it is going forward. So I think this is something that needs to be looked at as well.

Managing data appropriately in a team really means collaborating, and that is going to be the thing that makes it possible. If you want to overcome the CSV problem, the Excel problem, and the database problem, the intersection of those problems is collaboration. Software solved this with Git. It has been an absolute sea change, it has really revolutionized the software business, and we're still playing catch-up in data. How did they do it? Well, they have provenance: revisions and authorship are core to the way information about source code is maintained. We also have safety: the ability to roll back and the ability to branch, and this reduces the cost of collaborating. If I want to change something, you don't have Jim saying no, because he's not that afraid: you put it into the dev branch, and if Gavin does something totally ridiculous, we can always roll it back and it won't be the end of the world. That safety really means there's a lower barrier to entry for collaboration.

We also have the ability to impose quality restrictions this way, without imposing big barriers to collaboration. We have CI/CD pipelines. We allow people to put things in dev, but dev is probably not where our production software comes out of. You probably have one, two, maybe even three different branches with cascading tests, and at each stage we move things from one to the other. We might make some edits along the way and perform various kinds of tests, possibly including human tests, some kind of acceptance testing, certainly some kind of unit testing, and then it goes into production. That's really not being done in the same way with data, even though data is so critical to get right, and we don't want to be editing production databases that are used for surfacing information either to humans or to machines. So we need this CI/CD type approach in data.
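As a sketch of what a CI/CD-style gate for data might look like, here's a minimal validation script. The file name and the rules are hypothetical; the point is that a candidate branch of the data only gets promoted if the checks pass:

```python
# A sketch of an automated check sitting between a dev branch of the data
# and production: promotion is blocked unless the data passes. Hypothetical
# file name and rules; pandas is assumed to be available.
import sys
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    errors = []
    if df["order_id"].duplicated().any():
        errors.append("duplicate order_id values")
    if df["amount"].isna().any() or (df["amount"] < 0).any():
        errors.append("missing or negative amounts")
    if not pd.to_datetime(df["order_date"], errors="coerce").notna().all():
        errors.append("unparseable order_date values")
    return errors

if __name__ == "__main__":
    problems = validate(pd.read_csv("candidate_orders.csv"))
    if problems:
        print("refusing to promote:", "; ".join(problems))
        sys.exit(1)   # the pipeline stops here instead of touching production
    print("checks passed, safe to merge towards production")
```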
The last thing is about distribution. Distribution management: you don't want to be Slacking Excel files around; you need to have the data to hand easily, and I need to be able to share whatever changes I've made. So if you went back and built a tool for collaboration on structured data, what would we be talking about? I think the answer is that we need databases that are designed for distributed collaboration, and I think this will become a bigger part of the database market over the next five to ten years.

So what's involved here? Well, obviously the reasons we're not just using Git: discoverability and schema. We need to be able to query, we need easy retrieval, we need programmatic update, and not just one big monster CSV file. We need the structure and type of entities, we need referential integrity, and we don't want to be constantly casting. That means we have to have real schemas, with real information and metadata about that information. And when we have a schema, we have to pay attention to schema migration, so we also need to be dealing with that problem.

Then there's revision control. We have to be scalable to typical data set sizes. The revision control software we have at the moment is not necessarily the greatest at going up to the kinds of data set sizes we're talking about. Now, in practice it turns out that most data sets are smaller than two gigabytes. Many people think they need to go straight to terabytes, but often that's not the case; often you have smaller data sets than that. So it might be possible to do it in Git, but you still won't get any of that structured information. We should be trying to increase that headroom to typical data set sizes for the 98%, so probably under 100 gigabytes; you want to be able to store that feasibly, in a scalable manner.

Provenance is also going to be critical. We need authorship, commit time, all of the things that Git gives us that we love. We want to be able to do the same pipelining controls; we want modern CI/CD architectures to work on our data. And we also need safety: the ability to branch, roll back, and allow testing and experiment.

But the core is really collaboration. This is the thing that is not properly solved by CSVs, by Excel, or by databases. Modifications have to be safe and fearless. The revision control aspect is not just an afterthought; it's a requirement if you're going to enable collaboration on teams in a distributed fashion. And we want to be distributed. We want to be able to push, pull, and clone. We want everybody to be able to make changes easily and then share those changes, and that will remove the need for Slack and email. You don't Slack or email people your source code anymore; it's very rare. It used to be the case that you would email your source code around if you wanted to share it, but now that we have things like GitHub, it's just not necessary. The cost of entry is so low that it's easier to just use a proper revision control system. We also need to have our data where we have the problem: if I have an ML processing server, I need to be able to get the relevant data to that ML server. So things like clone, push, and pull become really important in those respects.

But in order to do that, you need something like what Git has, where you have delta calculation, so that we only send around the information that actually needs to be sent when there's an update. And this change management then gives us the capacity to do sophisticated merges. Merge is really where the ability to do truly distributed data management will come from. Currently that works quite well in code, although sometimes merges can be painful. That's also going to be true of data: they will be painful, but it's cheaper than not doing it in the long term. We also want to be able to do this safely, so we can collaborate and share information without worrying that it's going to be disclosed. For data this is even more important than for code. A lot of code is now open source, and you can still have a private repository; I think people's expectation is that source code has a slightly lower value in and of itself than data does. With data there are all kinds of privacy concerns, and a lot of the gold is in the data, so you want to be able to encrypt it effectively. So we want these deltas to be encryptable, and it should be possible to send them in an end-to-end encrypted manner.
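Here's a toy sketch of that idea: compute a delta between two snapshots of a dataset and encrypt it before it travels. The record format is deliberately simplified, and the third-party cryptography package is assumed to be installed; a real system would compute deltas over typed, schema-checked changes rather than raw dictionaries.

```python
# A toy sketch: delta between two snapshots, encrypted before transfer.
import json
from cryptography.fernet import Fernet

old = {"1368": {"name": "Discrete Mathematics", "price": 9.99}}
new = {"1368": {"name": "Discrete Mathematics", "price": 11.99},
       "1401": {"name": "Category Theory", "price": 24.50}}

delta = {
    "added":   {k: v for k, v in new.items() if k not in old},
    "removed": {k: v for k, v in old.items() if k not in new},
    "changed": {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]},
}

key = Fernet.generate_key()          # shared between the collaborating peers
token = Fernet(key).encrypt(json.dumps(delta).encode("utf-8"))

# Only the (much smaller, unreadable-in-transit) delta is sent, not the dataset.
received = json.loads(Fernet(key).decrypt(token))
print(received["changed"])
```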
So there are multiple solutions to these problems, from custom tool chains through to a number of open source databases, and I'll just name some of them here. DVC is worth a look if you're working with data. There are Dolt and DoltHub, a SQL-based revision control database and its hosting platform. And then I'm CTO of TerminusDB, which is again an open source solution. It's GPL, and all of our front end and client code is under the Apache license, so you're free to connect to it and you have a lot of freedom in how you use it. You can come to terminusdb.com. We're a sponsor of the Linux conference, and I'd very much appreciate it if you came to our stand; we'd be happy to go through TerminusDB, or what the options are in terms of CI/CD pipelines for data. Thank you very much.