Hello. Good afternoon, everyone. My name is Nacho Alvaro, and I'm here today with my colleague David Bordas. I'm the head of data technologies and analytics at Minsait, and David is one of our big data architects. We are here today to talk about data quality in big data environments. We will go through some of the points of the agenda: we will do an introduction to the key aspects of the data quality problems we are seeing in some of our customers, and then we will go deep and explain the real implementation of a data quality engine that we have done in one project. Just to spend a few seconds on the moment we are living in big data: we started a few years ago doing some lab testing at home, hiring people, testing new technologies, gathering information, and creating algorithms, and hopefully providing some results to the company. But today I think we are living in a completely different moment: the moment of scaling this to the whole enterprise. Some of the things we are encountering in our customers is that they are thinking, for instance, about how to scale and deploy platforms to almost the whole company and to new geographies. They are also thinking about how to replace older technology that has been in place for many years. And they are thinking about giving all their employees, and even their customers, some of the insights and the data that they are using in their operations. Just to spend a few seconds, again, on the evolution we went through in big data: we started building data lakes, we then faced some of the problems related to quality, security, and trust, and then we looked for use cases to find good insights and provide good results to the business. And now I think we are really doing the transformation, thanks to big data adoption. For doing this transformation, I really believe you need at least four things: strategy, governance, a strong culture, and embracing new intelligence and new technologies related to big data. Just to recap a little bit. Strategy: you must think about what you want to become in this digital era, and you also need to think about how you will go through all the steps in order to have an adoption plan that provides intelligence to the company. Governance: you need governance not only for data; you probably have in place right now more than 1,000 algorithms that you have to control because they are working in real time. You also need a flexible organization that adapts very well to changes, and the only way to do that, and for me this is the biggest challenge at this moment, is to have a strong culture. Only with a strong culture will you get all these people working together and reach a new level of capability, knowledge, and innovation. And I also think you need to embrace new technologies such as AI. AI can really change the world, but you have to put AI at the very core of your operations, at the moment of truth, when critical decisions about customers, operations, risk, and so on are being made. The point is that we see very often that data quality is not at the top of priorities in many companies. Maybe there is a kind of formula that can explain this quite easily: if you think about data quality, it acts as a multiplier of all the capabilities that you have to put together to master big data.
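Just as an illustration of that multiplier idea (this exact formula is not from the talk, only a sketch of the point being made):

$$\text{business impact} \;\approx\; \text{big data capabilities} \times \text{data quality}, \qquad \text{data quality} \in [0, 1]$$

With data quality at zero, the product is zero no matter how strong the rest of the capabilities are.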
If you don't have any data quality, you won't have any impact on your business. That's why we really think that data quality must be at the top of your priorities. Data quality in big data is something you have to approach in a slightly different way. You need to think about the many different signals that you can process from the digital life in order to have a single, unique view of the reality. With all these signals you can build a trusted view of that reality. Maybe you will use some redundant information, but it's something you need to do in order to be confident about that reality. We also found that many customers still think data quality is something more related to IT. I don't think that's the case: only business users have the knowledge to understand which data quality problems have to be solved and how to recognize a pattern that can be very dangerous, or that can bring a potential advantage to the business. Data quality tools are quite far behind big data tools. They are very oriented to structured information, but when you scale up, they don't work very well. So you have to consider doing some kind of ad hoc development and providing interactive tools so that the business users mentioned before can navigate through the data. Another thing: algorithms and machine learning will help to identify and fix potential problems in big data. If you think, for instance, about self-driving cars, most of the companies working in this domain are performing many tests, gathering information about the surroundings, and building good knowledge of the real situation in order to decide how the car must drive. But let's go back to the present and talk about a real implementation of a data quality engine in big data. Please, David.

Thank you, Nacho. Well, my part is much more technical. I'm going to explain how to create a data quality engine once you already have a data lake. At the end of the presentation, I will also show how to use the data quality module with your current informational systems or with the data warehouse you already have, in case you still don't have a data lake in your company. The steps we followed were the following. First, this is what we understand as a data lake: the data lake and its surroundings. In blue, you will find all the informational systems, the ETLs, the files, and the third-party systems, everything with data that sits upstream of our system. The integration layer, in yellow or orange, may be yours or may belong to another part of the company. In violet, the sandbox, the notebooks, and the marketplace, you will find what we understand as the analytics. In some data lakes you will not find a marketplace, but in Indra, in Minsait, we always work with that module: when we have algorithms in production, we prepare them and put them in a marketplace, like Google Play or the Apple Store, and then a customer or another department that already has the data lake can run the algorithm and use it. In pink, you will find the presentation layer: BI, some portals. And in green, which is the matter of today, you will find the data lake itself. You may or may not have a NoSQL database, and you will always have a governance and orchestration system. And today's presentation is about the ingestion and data quality system. This is where we started; most of the companies are at this point, in this situation.
We all started here in Spain, in all the banks, in all the public administrations, gathering information from informational systems and ETLs. That was the beginning. At that point we sometimes had to create normalization systems, consolidation systems, but in most cases the information we gathered was correct from the start and was easy to ingest into the data lake and put into production. The problem began when we started receiving files, third-party information, information from the internet, unstructured data, or data we didn't have time to functionally analyze before the ingestion. So we didn't know what we were receiving. Now I'm going to show you the real steps we followed when we created our data quality system. The first architecture is only a couple of ideas. I mean, if you use simple common sense, you will realize that if you want a data quality engine, you obviously need a place in which to store the rules. If you are doing big data, you will use a Spark engine and a place to store all the information that is common to everyone. The key point is the metadata and configurations. That is a system that can sit inside or outside the data lake and that stores all the configurations of the files, the rules, and which rule applies to which file and to which field within the file. All that information is in this metadata and configurations module. In the second version of the architecture, we realized we didn't want to depend on third parties; we didn't want anyone else to orchestrate our solution. So we started using Oozie, because we already had it in our Cloudera distribution. We also decided we didn't want to depend on files or on third parties in between to hand us the input and output information. The connection module and the publisher are, let's say, two sides of the same coin, because basically they use the same technologies and they do the same thing. But sometimes, as you all know, the APIs you use to connect to the source systems are not the APIs you use to send or publish the information. In the third version of the architecture (the previous one, by the way, went to production and worked), we realized that, as we all know, when we process a file with data quality checks, validations, and so on, a percentage of the information is not going to pass all the rules. At that point, we realized we had to have a graphical interface, made in Java, to manage the KOs, the records that didn't pass all the validations. We also wanted a system with audit, and a place with a graphical interface in which we could manage all the rules. In the previous version, a developer or an architect had to change the rules: if a new rule appeared, we had to program it. Right now, I think we have more than 300 kinds of rules that can receive parameters and expand into many more concrete rules. What you can see in the part below is the real first version of the workflow of this system. We also realized at some point that some of the data quality systems didn't have the ability to manage the workflow of the whole file and its transitions. Here you can see it: the first step is the connection module. The second one is the storage and the validation-only rules. The third one is the self-remediation rules; I will talk about them later, because our engine is also able to repair the data. The fourth one, the hand icon, is manual changes: in case you have validation rules and remediation rules and you still have KOs in your file, you can repair them by hand.
And the fifth one is the publication. Why did we choose to have those five points in our workflow? Because when we started receiving files, not all the publication services, I mean the people who were going to receive our data, were prepared for it. So we started ingesting information and applying everything, but we set the checkbox on the last step to not publish. That way we can manage the whole workflow of our solution. The fourth and last version of the architecture was extended with the discovery module. In most cases, when we receive a file or a communication from a third system, they send us a functional or business description, a functional design, or whatever. But sometimes we receive a file and they just say: OK, ingest the file and send it somewhere. For that discovery module we use a tool called Visual Analyzer, and we also have versions with R and Python for the open source flavor of the data quality system. For example, with that module, if you don't have any information about a column, but you see that its values match the regular expression of a passport, a car serial number, or whatever, the discovery module will tell you. So at least you can attach a few rules, some pre-checks, to the file in order to pre-process it and have a starting point. If we put some color on the diagram: as I said, the data discovery was Oracle at the beginning, and after that it was R and Python. The metadata and configurations we have in MongoDB, because the first client with this system already had MongoDB. And the governance portal is an open source portal implemented in Java. Now let's dig into the components of the solution. The metadata and configurations are what I told you before: the rules, which rules apply to which file, and so on, plus the workflow configuration and all that. As for the rules, after two years of projects I think we have almost every rule you can imagine. We have simple rules (this value must be this one, this variable must be that), simple validations in HQL, regular expressions, matching against master data. We can even invoke functions written in Java, R, or Python, or call external services from third systems. So I think we cover everything you could imagine, at least from a validation point of view. We also have self-remediation rules that can repair data. They go from very simple to very complex. For example, if you know someone's address and the city, maybe you can infer the postal code. That is a repair: the postal code is not in the input file, but I can infer it thanks to two other fields, in this case the address and the city. Finally, we also have translation rules, for the case in which you receive the same information from two systems. Imagine you have a SAP and an Oracle, and both send you a file about the citizens. In one file the gender is male/female; in the other it's zero and one; and the final system where you are going to publish expects F and M, neither female/male nor zero/one. We can translate those codes as well. This is not a translation system, but we can translate codes from one place to another. The storage system is simply a big data system, you know: raw, data, output. Some other presenters will say trusted, or other names, for the data part, but it's where the information is clean. Raw is the input, and the output is only used when the publisher needs a different format for the information.
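To make those pieces a bit more concrete, this is a minimal sketch of what one file's entry in the metadata and configurations store could look like. The field names, rule names, and values here are only illustrative assumptions, not the actual schema used in the project:

```python
# Hypothetical example of one file's document in the metadata and
# configurations store (e.g. MongoDB). All names and values are illustrative.
file_config = {
    "file_id": "citizens_daily",
    "source": "sftp://landing/citizens/*.csv",    # picked up by the connection module
    "fields": [
        {"name": "citizen_id",  "type": "string"},
        {"name": "gender",      "type": "string"},
        {"name": "address",     "type": "string"},
        {"name": "city",        "type": "string"},
        {"name": "postal_code", "type": "string"},
    ],
    # Parameterized validation rules: each entry points to a generic rule kind
    # and says which field it applies to and with which parameters.
    "validation_rules": [
        {"rule": "not_null", "field": "citizen_id"},
        {"rule": "regex",    "field": "postal_code", "params": {"pattern": r"^\d{5}$"}},
    ],
    # Self-remediation rules: repair a value from other fields in the record.
    "remediation_rules": [
        {"rule": "infer_postal_code", "inputs": ["address", "city"], "output": "postal_code"},
    ],
    # Translation rules: map source codes to the codes the target system expects.
    "translation_rules": [
        {"rule": "map_values", "field": "gender",
         "params": {"mapping": {"male": "M", "female": "F", "1": "M", "0": "F"}}},
    ],
    # Workflow checkboxes: which of the five steps to run for this file.
    "workflow": {"connect": True, "validate": True, "remediate": True,
                 "manual_review": True, "publish": False},  # e.g. don't publish yet
}
```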
The rules engine has three submodules. The biggest part is the batch engine, which runs with all the information, the biggest files we can process. The pre-validator is something we didn't find in any data quality system on the market. In most places, the data analysts, or whoever runs the data quality application, told us that in most cases, if you want to pre-check just one single quality rule on a file, you need to replay the whole file. For that case, we have a pre-validator. The pre-validator is not part of the engine as such, because it cannot validate or repair data. It only pre-validates online, in the governance portal: it shows you whether the rule is going to apply or not and what the result will be. And the real-time or on-demand engine is for the case in which, for example, you process a file of 10 million records and only 100 didn't pass. You can take those remaining records online, in the governance solution, in the Java part, correct them manually, or apply a new rule that you didn't apply at the beginning. Well, the connection and publishing modules are almost the same in terms of technologies; the only difference, as you can see, is that one publishes and the other connects to the information. The governance front end, as I said before, was created only for managing the OKs and the KOs, but we started giving it new dimensions: the ability to pre-validate and to manage the rules; it is now the complete front end of the solution. And the discovery module is only for the case in which you receive a file and you don't have information, or you have only a small number of rules and you want to check if there is anything you missed. Now I'm going to show you the first implementation, and you will see a lot of weird things, but it was the first implementation. That was the environment we had: this is the generic data lake diagram from the beginning, and that's what we had. The data lake was a Cloudera. As the NoSQL database, we had a MongoDB. The informational systems were SAP and Oracle, as you can see over there. We didn't have a marketplace nor a portal at the beginning. The BI was OBIEE. That was what we had the day we started the project. And that was the first version of the data quality engine. As you can see, we only processed files and information from the informational systems, the data warehouse we had. Mostly, the connections were through Oracle Managed File Transfer, or directly via Sqoop. We have other versions now with Spark, but that was the beginning. On the ingestion node, which is a node outside the data lake, we deployed the connection and publisher modules. And the weird thing for all of you might be that the rules and the files to be reviewed, that means the KOs, are in Oracle. That was because our first version of the data quality engine was deployed on Hive, and as you all know, Hive is fine for reading but not for writing: you can write if you move whole packages and CSV files behind the partitions, but if you want to change just one record, you need a relational database behind. So the first version was like this one. The capabilities of the first version were very simple. It could only work in batch; we didn't have an online version or any governance portal. As you can see, all the administration was done through Cloudera's own interface. And that was all. It worked. It was fine. It was very fast.
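Just to give an idea of what that first batch version did conceptually, reading from Hive, keeping the clean records in the lake and sending the KOs to a relational database for review, here is a minimal PySpark-style sketch. It is only an illustration of the idea, not the project's real code; the table names, the JDBC connection, and the single rule are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dq-batch-sketch")
         .enableHiveSupport()          # the first version worked on Hive tables
         .getOrCreate())

# Hypothetical raw table already ingested into the data lake.
raw = spark.table("raw.citizens_daily")

# One parameterized validation rule (regex on the postal code); mark failing rows as KO.
checked = raw.withColumn(
    "ko", ~F.coalesce(F.col("postal_code"), F.lit("")).rlike(r"^\d{5}$"))

ok_records = checked.filter(~F.col("ko")).drop("ko")
ko_records = checked.filter(F.col("ko")).drop("ko")

# Clean records stay in the lake: appending whole partitions to Hive is fine...
ok_records.write.mode("append").saveAsTable("data.citizens_daily")

# ...but records that need row-level edits go to a relational database,
# because rewriting a single record in Hive is not practical.
(ko_records.write.format("jdbc")
 .option("url", "jdbc:oracle:thin:@//dbhost:1521/DQ")   # illustrative connection only
 .option("dbtable", "DQ_REVIEW.CITIZENS_KO")
 .option("user", "dq_user")
 .option("password", "***")
 .mode("append")
 .save())
```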
But we knew we had to evolve the solution. The current version is this one. The connection module now adds Spark, and so does the publisher. The storage system, depending on the version, could be directly in HDFS, but right now we always put all the information in Cloudera. For the data discovery module, if you have Oracle or whatever discovery tool your company already has, you can use it; in most cases we use an open source version with R and Python, because we get the same results. The metadata and the configuration are still in MongoDB. We could move to Kudu, obviously, and we did in another version. And the governance and orchestration portal is built basically with Angular and Java. What are the new capabilities of the system? Well, now, besides validations and remediation, we can enrich the data and make advanced repairs. There is also the whole online part: the pre-validator, the on-demand launches, and the creation and management of new rules can all be done online, in the web portal. Besides that, the administration is complete now, and so is the audit. Do you remember all the steps of our workflow? We know exactly what we received, what the file was after the validation rules, and what the file was after the self-remediation rules. So we can go back to any point; we can audit everything. We also have quality statistics: how many records passed all the steps. We can even stop the publication; for example, if 20% of the records are KO, then don't publish. We can do that with the governance portal. Obviously, the creation and configuration of rules is also included in the new version. These are other versions we also have of the data quality system. We have versions in which everything is based on big data and there is no MongoDB. And we also have a version for the case in which your company doesn't have a full data lake: you can just add some Spark nodes and use the application, but run it with MongoDB as the NoSQL store or, well, with an Oracle, for example. In both cases, the important point is that the engine is developed in Spark. What are the results, and what future do we see? Well, the results are very important. The first file we processed with the first version of the data quality engine was compared with the previous ETL that was doing the same job. They needed, as you can see, five days for what we did in six minutes. So obviously, moving a data quality solution to big data is more than useful. Regarding the improvement in data quality, after five or ten files we realized that at least a third of any file is corrupted, has KOs, or simply doesn't pass all the rules. Thanks to the self-remediation system, the manual workflow, and all the tools we now provide, the quality of the files is almost 100%. We also have collateral benefits from the solution. First, thanks to the discovery module, we have less functional dependency, because we can try to process the files the moment they arrive, without waiting for the functional design. I'm not saying that we don't need functional analysis; we still need it, and we will for a long time. But in this case, if we receive a file without a shape, where we don't know the columns or the values, we can pass it through the discovery module and try to process it. The real-time pre-validations were something we didn't find in any other data quality system. And more than the new rules, as you can see there, the new types of processable files are something we didn't find anywhere. I mean, right now, as we have a module with MongoDB, we can also process a JSON or an XML, while all the data quality systems we found process only relational information, columnar data, CSVs.
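Coming back to the publication gate mentioned a moment ago (the "if 20% of the records are KO, don't publish" check), a minimal sketch of that kind of quality-statistics gate could look like this. The function name, the DataFrame, and the KO column are only illustrative; the 20% threshold is the one from the example above:

```python
from pyspark.sql import functions as F

def should_publish(checked_df, ko_column="ko", max_ko_ratio=0.20):
    """Compute the KO ratio of a processed file and decide whether the
    workflow is allowed to move on to the publication step."""
    stats = checked_df.agg(
        F.count("*").alias("total"),
        F.sum(F.col(ko_column).cast("int")).alias("kos")).first()
    total, kos = stats["total"], stats["kos"] or 0
    ko_ratio = kos / total if total else 0.0
    print(f"records={total} kos={kos} ko_ratio={ko_ratio:.1%}")
    return ko_ratio <= max_ko_ratio

# Example: reuse the 'checked' DataFrame from the previous sketch.
# if not should_publish(checked):
#     pass  # stop the workflow here instead of calling the publisher
```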
Well, the self-remediation module is common, all the data quality systems have one, but ours is better. And the workflow control allows us to stop the workflow at any time. What do we see in the future? Well, more data quality, if only because of the speed. More data quality and more ETLs moving to big data systems. And probably the next step is also the master data management systems moving to big data; I think that is the natural step. Well, the truth is that with all the data quality systems or engines we find on the market, you must pay per record, per file, or whatever, so for all the companies that already have a big data platform, implementing a system like this one will save them a lot of money. Besides, we see that we are finding new rules that no one expected: not functional rules, but real rules we find in the files. And, well, those are the takeaway points of this session. I think that we are ready for your questions. Thank you. Any questions? No questions? Well, no questions. Thank you.