So, I am Nitha Pandey and I am from the data team at Intuit. What I am going to talk about is MDM on Hadoop. We are building a solution which is primarily MDM for entity resolution, and I am going to share some experiences of building this particular solution on Hadoop.

Let me spend a few minutes setting the context in terms of what MDM is all about. We have all heard about master data management; it exists in all organizations in some shape or form. If I have to define it in a very simple way, it is about the data that is really critical for the organization: customer data, of course, and also other types of data like product data or campaign data. Since this data is so critical for the organization and it comes from various systems, you cleanse it, standardize it, and normalize it. You find dupes in it, and these dupes may not be matches based on exact or even close values; they could come from linkages and so on. You bring all these entities together, try to find the dupes in the data, and ultimately provide that critical data as a master list which everybody in the organization can access. Customer entities are the most common form, and here too I am going to talk primarily about customer entities or profiles that we cleanse, standardize, and make available to a lot of applications across the organization.

So, what triggered this? For a long time, organizations did not build these systems in-house; it is very common to buy external solutions like IBM Initiate or Informatica Master Data Management, get these tools off the shelf, and configure them. But there are scenarios that have triggered in-house development; we had some of them, and I see similar scenarios coming up in many other organizations: you already have a commercial system, but you are not really satisfied, so you consider going in-house.

Taking Intuit's example, we do have a customer MDM built with IBM Initiate, and it has been there for some time. What triggered this project was that the scope of customer data has grown well beyond what we were mastering earlier. For today's analytical requirements, we do not want to look only at the paid users who have done important transactions with us, but also at visitors, subscribers, free trialers, and different types of subscribers. Also, we do not want to look only at our direct customers: most of our customers are individuals and small businesses, and for many innovation projects where we experiment with data, we would even want to master our customers' customers, their employees, or individual accountants. We would rather master data for everybody in the ecosystem, so that it helps not just the analytical requirements but also different offerings. And when it comes to enrichment of the data, there is a lot of social data available today, so even if we know our customers only partially from the data we have, we can enrich it.

With all this, the scale of the data is definitely important. The traditional MDM solutions are RDBMS-based at their core, and the license costs are based on the number of records that you process. The cost definitely skyrockets if you go for a traditional solution at this scale.
The second requirement that we saw was real time. Earlier, we would use the MDM data for some marketing use cases in the back end. The requirements that are coming in today are: can we use the master data while the customer is online and use it for better recognition? Also, instead of getting the data at the back end and then doing the dedupe, is it possible to do the dedupe and standardization as and when we capture the data? Those are the real-time requirements that we have seen. Though we have not yet completely catered to the real-time requirements, we are definitely building the solution keeping these in mind.

And the last trigger is that, like most enterprises, we have a data hub where we are getting data from all the different offerings in one place, and that is HDFS. When you have the data on Hadoop, many times it makes sense to build these kinds of solutions on top of Hadoop, as opposed to the traditional solutions which sit on the RDBMS side. Some of the commercial solutions have support for Hadoop, but they are not built for Hadoop. So, with those triggers, we did start development of an in-house tool, which is about a year old now.

If you look at the different components in this system: first, you collect the data from different sources. The first step is to standardize the data, and the reason for that is the better you standardize, the better your next step, which is matching, will be. The next step is matching: for every profile or entity that you get, you look it up against the master, try to match it, and see if that entity already exists. If it exists, the next step is to reconcile it, or to create some kind of a golden record. The reason it is called a golden record is that it is built from multiple sources, and beyond that point it becomes the source of truth for that particular entity, to be used by several applications or fed back to the offerings. Once the data is available, you finally store it in some store to be used for various purposes. The typical use cases are: one, real time, where the offerings can just look up whether a profile exists and what we know about that particular entity; second, interactive exploration, where analysts take the data and join it to other datasets; and third, downstream applications that consume the data in batch.

Coming to how we solved it: these are the different components, and this is really for the entire batch pipeline. We built different components, all of them custom. For cleansing and standardization, we used a lot of open source: NLP libraries, a phone library, email validation, and so on, but ultimately we built a standardization library and a recognizer which recognizes what kind of entity this is, since this is customer-profile MDM. A lot of the cleansing was for PII: phone, email, addresses, etc. As I mentioned, we need this in real time as well as in batch, so we built it as a library so that Pig can call it through UDFs, and at the same time hosted it as a web service for real time, so that any offering which wants to start using it can discover it. Second, coming to the matching, which is the core part of this particular application.
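The talk does not show the standardization library itself, but as a rough illustration of the Pig-UDF wrapping described above, here is a minimal sketch. The class name and the digits-only normalization are illustrative placeholders; the real library described in the talk also handles emails, addresses, and entity recognition.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal sketch: a Pig UDF that delegates to a standardization routine,
// so the same code path can serve batch (Pig) and, when wrapped in a web
// service, real-time callers. Names and logic here are assumptions.
public class StandardizePhone extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String raw = input.get(0).toString();
        // Toy normalization: keep digits only. The actual library would
        // also validate country codes, formats, and so on.
        return raw.replaceAll("[^0-9]", "");
    }
}
```

In Pig Latin this would then be registered and applied per record, along the lines of `REGISTER standardizer.jar;` followed by `clean = FOREACH raw GENERATE StandardizePhone(phone);`, while the same Java library sits behind the real-time web service deployment.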
Since matching is the core, you would want to build the most accurate one, you would want it scalable, you would want it performant; but because it is the core, until you have it, you do not have the entire pipeline. So, in order to take a lean approach, we built a simple probabilistic matching, where we assign different weights to attributes, and we assign those weights based on heuristics. There is attribute-level matching for each profile, and then there is an aggregate profile-level match. The thing I want to mention here is that we started simple so that it got us going, but at the same time we wanted to keep moving towards more sophisticated algorithms, at least the clustering techniques that can be used to come up with unique profiles, using, say, Mahout or similar libraries; if we had done that from day one, it would have taken a long time. So, we built the matching framework in such a way that algorithms for the attributes or for the whole profile can be added later, and the framework can be configured to use those algorithms. This also helps us separate the data engineering of the matching framework from the data science part: though the algorithm we have in place is good enough to start with, the framework is ready for different, more complicated algorithms too. Again, we took the same approach of calling it as a library, because Pig runs the overall data pipeline.

After we master the data, we store it in HBase; the master data is stored in HBase. The features of HBase that helped us there: since the profiles are brought in from several different sources, a profile has a wide set of attributes, but at the same time the data is sparse, and HBase handles that well. Also, since profiles get updated, the versioning helps us there. And since it is HBase, we can make use of range scans, where we compare a single profile against a set of candidate profiles. We designed the row key in such a way that the relevant profiles are together, so we are able to do the range scans. Once the master data is finally stored in HBase, what helped us again is Pig's interoperability with HBase: not only can Pig look up records, it can use HBaseStorage to store the data directly. Then, for serving the data: the real-time lookup can be solved by using Solr on top of HBase, which we still have to build; on the batch and interactive-exploration side, we use a Hive external table over HBase, so that analysts or downstream applications are able to extract the data and join it to other datasets.

A little bit more on the matching techniques: if you want to use clustering algorithms, Mahout is one of the options we tried out; we also tried out MLlib on Spark. The issue there is that these expect vectorization, and at this point it is very costly to use term frequency for vectorization and so on. We are also looking at options where we would be able to do clustering directly on the text. These things all take time, and that is why we used the simple probabilistic algorithm first, which got us started, but it is not that scalable and is difficult to generalize, whereas the other approaches would be much more scalable for us.
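As a concrete reading of the attribute-weighted probabilistic matching described above, here is a minimal sketch. The attribute names, the weights, and the exact-match attribute scorer are illustrative assumptions; the talk's framework is designed so that more sophisticated per-attribute algorithms can be plugged in later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: attribute-weighted probabilistic matching. Each attribute gets
// a heuristic weight; the profile score is the weighted sum of per-attribute
// similarity scores. All names and numbers here are assumptions.
public class ProfileMatcher {
    private final Map<String, Double> weights = new LinkedHashMap<>();

    public ProfileMatcher() {
        weights.put("email", 0.4);
        weights.put("phone", 0.3);
        weights.put("name",  0.2);
        weights.put("zip",   0.1);
    }

    // Per-attribute similarity. Exact match is a placeholder; the framework
    // described in the talk lets a better algorithm be configured per attribute.
    double attributeScore(String a, String b) {
        if (a == null || b == null) return 0.0;
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Aggregate profile score in [0, 1], since the weights sum to 1.
    public double score(Map<String, String> p, Map<String, String> q) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            total += w.getValue()
                   * attributeScore(p.get(w.getKey()), q.get(w.getKey()));
        }
        return total;
    }
}
```

The aggregate score would then be compared against a use-case-specific threshold, as discussed in the Q&A later in this talk.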
Yeah, so once the data is mastered there are a lot of different use cases; dedupe is the core of it, and we can use a lot of different options for it, as I mentioned: the simple algorithm versus using Mahout or MLlib, or R with PMML to port the model in. That is something that will keep evolving. That's it.

Yes, one case I can think of is hierarchies, like if you have a parent business and multiple businesses under it. The other thing is, as you make your algorithm more accurate, you may find profiles that look different but are actually the same. But I think you are asking more about whether they can be explicitly linked? Yes, we in fact have a column called link ID in our schema in HBase, so that if there are, say, 100 outlets for a particular merchant, then all those hundred are linked to the parent merchant. So we do have the link.

Okay, really quick before we move on to the next question, just one quick announcement: we are going to have another really quick announcement after this talk, so make sure you stick around before you go to lunch; we have a really awesome message coming from some of our partners here. Also, as you are asking your questions, our sponsors over at BloomReach are going to be passing out some of their goodies; they have de-stress balls, so when you get stressed out from all of your hard work, make sure you use them. A big thanks to our sponsors over at BloomReach, and now let's let the questions continue.

Hi, I have a couple of questions. You are creating and building the platform over multiple customer sources, so how are you handling incremental updates? So, one factor is source precedence: we do take source precedence into consideration, because some sources are more reliable than others. The other is the last-updated time. Based on these, every run takes the multiple matched profiles and generates a new master record. The main thing is the source precedence: every attribute has a source precedence as well as a last-updated timestamp.

Every time, do you regenerate all of it? Yes, every time you match; but it is not for every record. Only records that have matched and have data updated after the earlier run are processed, so it is an incremental run.

So how are you planning to handle real-time updates to the master? The real-time update would be through a direct API. Today we have APIs to look up; similarly, we would have an API that writes directly to HBase, so every time you match, the existing HBase record itself is an input along with the incoming record, and then you construct the golden record again.
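The incremental-update answer above (source precedence first, then last-updated time) suggests an attribute-level survivorship rule roughly like the following sketch. The class and field names are assumptions for illustration, not from the talk.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: build the golden record attribute by attribute. For each attribute,
// keep the value from the most trusted source; break ties with the most
// recent update, as described in the Q&A. Names here are assumptions.
public class GoldenRecordBuilder {

    static class AttributeValue {
        final String value;
        final int sourcePrecedence; // lower = more trusted (assumed convention)
        final long lastUpdatedMs;   // epoch millis of the last update
        AttributeValue(String value, int sourcePrecedence, long lastUpdatedMs) {
            this.value = value;
            this.sourcePrecedence = sourcePrecedence;
            this.lastUpdatedMs = lastUpdatedMs;
        }
    }

    // candidates: attribute name -> values observed across matched profiles.
    public static Map<String, String> build(Map<String, List<AttributeValue>> candidates) {
        Map<String, String> golden = new HashMap<>();
        for (Map.Entry<String, List<AttributeValue>> e : candidates.entrySet()) {
            e.getValue().stream()
                .min(Comparator
                    .comparingInt((AttributeValue v) -> v.sourcePrecedence)
                    .thenComparingLong(v -> -v.lastUpdatedMs)) // newest wins ties
                .ifPresent(best -> golden.put(e.getKey(), best.value));
        }
        return golden;
    }
}
```

In an incremental run, the existing master record would simply contribute its attribute values as candidates alongside the incoming records, which matches the real-time API answer above.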
So, one last point: how will you search? Is this your row key design? Yeah, currently we have a row key design which is used for the batch process, but for real-time lookup that will not be sufficient; that means we would need Solr indexes and APIs on top of HBase, and later even the batch process can use them. Right now, the row key design is based very specifically on the geo data, so that business entities and individuals in the same locality are together. That is the row key design.

Okay, so we have explored that, but we have not completely built it. Right now what we have is the simple probabilistic algorithm, and we are looking at both of the other approaches. With whatever approach we have today, the vectorization really requires a lot of simplification; beyond that we can use multiple approaches, but we are not there yet.

So, can you talk about cleansing in real time? Yes. We have both requirements: one is batch, which is the requirement now, and what is also being asked for is, instead of the raw data entering the back-end system, can the data validation and standardization be done by the offering when the data is entered and captured? That would help the end users too, and clean data would enter the system. We have built a library which is being used in batch, and we recently made a web service deployment of the same library; that is a prototype we have built, and we are looking at whether we can continue to use it. As I mentioned, by building it as a library we have taken care of the fact that we can use it in batch, where we build the UDFs, and in real time, where we deploy it as a service.

We have time for one more question. The logging one? So, for the web services we have our own logging; Intuit has a logging standard that we follow. For our batch processes, since this is an ETL kind of framework, we log through our own ETL framework, and we do have a monitoring framework. All of that is already there for the other ETL processes and web service applications; we are just using it.

We have time for one more question, and it's going right over there. Hello. You said the data is very sparse; how is that a challenge for your probabilistic algorithm, and do you have any precision numbers on that? Yeah, so we have used the concept of a threshold here: whatever score we get, depending on the use case, people can choose a lower or higher threshold. Marketing, for example, does not really need fully confirmed matches, so they can go with a lower threshold, whereas some of the offerings go for a higher one. Based on the different thresholds, we compute different precision numbers; we have a test framework, and with any change to the algorithm, we go ahead and regenerate them. Overall, if we consider a threshold of somewhere around 0.8, the precision is somewhere around 0.7, and it goes higher as we move towards 0.9. That is where it is today: the probabilistic algorithm has given us acceptable precision and accuracy to get going, but that is definitely just the first version of it.

Okay, perfect, thank you very much. Everybody, give a warm round of applause. Thank you.
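To make the geo-based row key and range scans from the answers above concrete, here is a hedged sketch using the standard HBase client API. The table name, key layout, and locality prefix are illustrative assumptions, not details given in the talk.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a geo-prefixed row key so profiles in the same locality sort
// together, which makes candidate lookups a single contiguous range scan.
public class CandidateScan {

    // e.g. "94043|B|merchant-123": geo prefix, entity type, source id.
    static byte[] rowKey(String geoPrefix, String entityType, String id) {
        return Bytes.toBytes(geoPrefix + "|" + entityType + "|" + id);
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("master_profiles"))) {
            // Scan all candidate profiles in locality 94043: the stop row is
            // the prefix with its last byte incremented ('|' -> '}').
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("94043|"));
            scan.setStopRow(Bytes.toBytes("94043}"));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```

Because HBase keeps rows sorted by key, putting the locality prefix first means all candidate profiles for a locality land in one contiguous scan; that is what makes the batch matching lookups cheap, while real-time search by arbitrary attributes would need the Solr indexes the speaker mentions.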