From Cambridge, Massachusetts, it's theCUBE, covering the MIT Chief Data Officer and Information Quality Symposium 2019, brought to you by SiliconANGLE Media.

Hi everybody, welcome back to Cambridge, Massachusetts. This is theCUBE, the leader in tech coverage. I'm Dave Vellante with my co-host, Paul Gillin, and we're here covering the MIT Chief Data Officer and Information Quality Conference, hashtag MITCDOIQ. Lisa Ehrlinger is here. She's a senior researcher at Johannes Kepler University in Linz, Austria, and the Software Competence Center in Hagenberg. Lisa, thanks for coming on theCUBE. Great to see you.

Thanks for having me. It's great to be here.

You're welcome. So on Friday you're going to lay out the results of the study. It's a study of data quality tools, kind of the long tail of tools, some of the ones that may not have made the Gartner Magic Quadrant and other studies. Talk about the study and why it was initiated.

Okay, so the main motivation for this study was actually a very practical one, because at our department at Johannes Kepler University in Linz we have many projects with companies from different domains, like the steel industry, the financial sector, and also a focus on the automotive industry. We have more than 20 years of experience with these companies in this department, and what recurred was the fact that we spent the majority of time in such big data projects on data quality measurement and improvement tasks. So at some point we thought, okay, what possibilities are there to automate these tasks, and what tools are out there on the market to automate them? That was the motivation. Companies also ask us, do you have any suggestions for which tool performs best in this or that domain? And I think this study answers some questions that have not been answered so far in this level of detail.
For example, the Gartner Magic Quadrant for Data Quality Tools is pretty interesting, but it's very high level and focuses on some global vendors; it does not look at the specific measurement functionalities.

Yeah, you have to have a certain number of customers or revenue to get into the Magic Quadrant, so there's a long tail that they don't cover. But talk a little bit more about the methodology. Did you get hands-on, or was it more investigating what the capabilities of the tools were and talking to customers? How did you come to your conclusions?

We actually approached this from a very scientific side. We conducted a systematic search of which tools are out there on the market; not only commercial tools but also open source tools were included. And I think this gives a really nice digest of the market from different perspectives, because we also include some tools that have not been investigated by Gartner, for example MobyDQ or Apache Griffin, which has really nice monitoring capabilities but of course lacks some other features of the comprehensive tools.

So was the goal of the methodology largely to capture a feature-function analysis and be able to compare that: binary, do you have it or not, how robust is it, and to develop a common taxonomy across all these tools? Is that what you did?

So we came up with a very detailed requirements catalog, which is divided into three parts. The first focuses on data profiling, to get a first insight into data quality. The second is data quality management in terms of dimensions, metrics, and rules. And the third part is dedicated to data quality monitoring over time. For all three categories, we came up with different case studies on a test database, and then we looked: does this tool support this feature? Yes,
no, or partially, and if partially, to what extent? So I think, especially in the partial assessments, we go into a lot of detail in our survey, which is already available online on arXiv. The preliminary results are already online.

How do you find it? Where is it available?

On arXiv.

arXiv? What's the URL?

arXiv.org, yeah. There is an ID; I don't have it with me currently, but I can send it to you afterwards.

Yeah, maybe you can post that.

Yeah, we can post it afterwards.

I was amazed: you tested 667 tools. I would have expected there would be 30 or 40. What do all of these long-tail tools do? Are they specialized by industry or by function?

Oh, sorry, I think we have some confusion here, because we identified 667 tools out there on the market, but we narrowed this down, because, as you said, it's quite impossible to observe all those tools.

But the question still stands. What are these very small niche tools? What do they do?

So most of them are domain-specific. And I think this really highlights the very early, basic definition of data quality: data quality is defined as fitness for use. We can pretty much see that here, because we excluded the majority of these tools just because they assess one specific kind of data, and we really wanted to find tools that are generally applicable to different kinds of data: structured data, unstructured data, and so on. Most of these tools exist because someone said, we want to assess the quality of our, I don't know, geologic data or something like that.

To what extent did you consider non-technical factors, if at all? Pricing, complexity of downloading, whether a free version is available? Did you ignore those and just focus on feature function, or did those play a role?
So basically the focus was on feature function, but of course we had to contact customer support, especially for the commercial tools; we had to ask them to provide us with trial licenses, and there we received different levels of feedback from those companies. I think the most comprehensive study here is definitely the Gartner Magic Quadrant for Data Quality Tools, because they give a broad assessment. But what we also highlight in our study are companies that are very open and willing to support you. For example, with Informatica Data Quality we had really close interaction in terms of support, trial licenses, and also specific functionality. Our contact at Experian in France was also really helpful. Other companies, like IBM, focus on big vendors, and there we were not able to assess the tools, for example.

Okay, so the other difference from the Magic Quadrant is that you guys actually used the tools, played with them, experienced the customer experience firsthand.

Exactly, yeah.

Did you talk to customers as well? Because you were the customer; you had that experience.

Yes, I was the customer, but I was also happy to attend a data quality event in Vienna, and there I met some other customers who had experience with single tools. Not, of course, the wide range we observed, but it was interesting to get feedback on single tools and verify our results. And it matched pretty well.

How large was the team that ran the study?

Five people.

Five people. And how long did it take you to do it?

We performed the assessments for roughly one year. And I think that's a pretty long time, especially when you see how quickly the market responds, especially in the open source world, but nevertheless you need to make some cut. I think it's a very recent study now, and that was also the idea behind publishing the preliminary results now.

Were there any surprises in the results?
I think one of the main surprises was that we see definitely more potential for automation. But not only automation. I really enjoyed the keynote this morning saying that we need more automation, but at the same time we think there is also demand for more declaration. We observed some tools that say, yes, we apply machine learning, and then you look into the documentation and find no information about which algorithm, which parameters, which thresholds. Especially if you want to assess data quality, you really need to know which algorithm is used and how it is tuned, and give the user, who in most cases will be a person with a technical background, like a chief data officer, the possibility to tune these algorithms, to get reliable results, and to know what's going on and why particular records are selected, for example.

So now what? You're presenting the results here at this conference and at other conferences, and it's been a year. So what's the next wave? What's next for you?

Next, we're currently working on a project on a knowledge graph for data quality assessment, which should tackle two problems at once. The first is to come up with a semantic representation of the data landscape in your company: not only the data landscape itself in terms of gathering metadata, but also automatically annotating this data schema with data profiles. We have seen in the tools that there are a lot of capabilities for data profiling, but this is usually left to the user. Here we store the profiles centrally and allow the user to continuously verify whether newly incoming data adheres to this standard data profile. And I think this is definitely one step on the way to more automation.
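[Editor's note: the continuous verification Ehrlinger describes, checking newly incoming data against a centrally stored data profile, could be sketched roughly like this. This is a minimal illustration under assumed details, not her system; the profile fields, function names, and tolerance are invented for the example.]

```python
# Hypothetical sketch of profile-based data quality monitoring: derive a
# simple "standard" profile from reference data, then check whether a newly
# incoming batch still conforms to it. All names and thresholds are
# illustrative assumptions, not the authors' implementation.

def build_profile(values):
    """Derive a minimal numeric profile (value range and null rate) from reference data."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "null_rate": 1 - len(non_null) / len(values),
    }

def conforms(profile, new_values, null_tolerance=0.1):
    """Check whether a new batch stays within the profiled range and null rate."""
    non_null = [v for v in new_values if v is not None]
    null_rate = 1 - len(non_null) / len(new_values)
    in_range = all(profile["min"] <= v <= profile["max"] for v in non_null)
    return in_range and null_rate <= profile["null_rate"] + null_tolerance

reference = [10, 12, 11, None, 13]      # historical "good" data
profile = build_profile(reference)
print(conforms(profile, [10, 11, 12]))  # True: within the profiled range
print(conforms(profile, [10, 99]))      # False: 99 violates the profiled max
```

In a real system the profile would cover many more statistics (patterns, distributions, uniqueness) and be stored alongside the schema metadata, but the control flow, profile once, verify every incoming batch, is the automation step described above.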
And also, I think the best thing with this approach would be to overcome the very arduous way of coming up with all the single rules within a team. Instead, you present the data profile to the people involved in the data quality project, and then they can verify it and just update and refine it, but they have an automated basis presented to them.

All right, same team or new team?

Same team, yeah, we're continuing.

So, Lisa, thanks so much for coming on theCUBE and sharing the results. Good luck with your talk on Friday.

Thank you very much, thank you.

All right, and thank you for watching. Keep it right there, everybody. We'll be back with our next guest right after this short break from MIT CDO IQ. You're watching theCUBE.