As she mentioned, I'm Yagnik. I work at Shopify. For those who haven't heard of it, Shopify is not Spotify. There's a big difference. We don't do anything with music streaming. We're a commerce platform and we power nearly 165,000 stores at this point in time. So if you come to me and say, hey, you work at Spotify, how do I do this? I will punch you in the face. I'm kidding, not punch you, but I will be really mad.

So yeah, Shopify's rough stats right now: around 300 million unique visits every day, 8 billion dollars in GMV, 10,000-plus checkouts every minute. What that really means for me is around 50 terabytes of data coming in every day and around 80-plus servers in our Hadoop cluster handling that data load. Why am I giving you these numbers? Really just to tell you that I'm handling a decent amount of data. Would you call it big data or just data? It doesn't matter, but it's the amount of data that I have to play with and handle on a daily basis.

So by processing all this data and taking care of it, we ended up going through what I like to call the path to data enlightenment. The path to data enlightenment, in my view, is something that pretty much every startup goes through, from the very beginning right up to... I won't call it the end, but the way Google does it right now, and the way Shopify does it too. The point being, you start with something like querying your production slave. Usually it's your CEO or CTO going directly onto your production servers, making queries and randomly pulling data. And when they do that, it's all fine and good, because you're really small, you're a few people, and it doesn't really affect the site or the load or anything of that sort.

But once you start scaling up and you have a bunch of services, you have data coming from not just one place but from multiple places, then you start putting all this data into what I call a data dump. People have called it a data lake. I'm not sure if I should still call it a lake or a dump. But anyway, the data dump is essentially all your data going into one place. Usually for startups it's MySQL or some other database where you put in data from all the other databases. So for instance, we have Shopify Core, then we have other services such as billing or authentication, and all that data gets collected into one MySQL database where you can join all the data sets, which wasn't possible before because all those databases were separate.

But sooner or later you hit a dead end where you can't really query and do a lot with this data. And that's where Shopify ended up going to Vertica: we did the exact same thing we did with MySQL, but put it in Vertica to handle the kind of load that all our data sets would give us. That became our data warehouse. It's still a data dump, because it's still all our raw data going into one place, but it's on an OLAP database which can handle that kind of querying and load.

But a year back, Shopify learned that that wasn't enough. As we IPO'd, we realized that we actually couldn't trust our data at all. The problem became even worse when you have to give GMV numbers out in public and you have to make sure they don't change in history. You can't just go back in time and make them different. They have to stay the same.
And that's where, one or one and a half years back, Shopify decided to model its data sets, which is when we started using star schemas, if anyone's familiar with that, or dimensional modeling based on Kimball's methodology. The idea was that we need to be able to trust our data. When someone comes to me and asks, hey, what was the number for GMV yesterday? I should be able to confidently tell them, this is it. If they ask me how many users we lost, say, two weeks back, the answer shouldn't change when they ask me the same question two years later. And that's where the idea of building a data startup inside Shopify came in. And that's where we are.

I do want to point out that I put this on a graph of complexity versus confidence, and that was the key point of it. Initially, when you're querying the data yourself, your confidence level is low, because you're working on operational data, which is constantly changing. It doesn't store its history. It doesn't know anything about anything else. All it knows is its current state. As you move toward a modeled warehouse, what you're looking for is historical data. You're looking for changes in data. And more so, you want to be sure that the data is exactly how you expect it to be. Obviously, the complexity also increases as you go. Querying raw data is super easy, but building models and actually thinking about your data is a little harder than that. It took us a good year to figure it out and build it out for ourselves. So it does have higher complexity.

During this process, there were a lot of things we learned, the biggest being that going through these different phases of building data warehouses, building frameworks on top of them, reporting modules, yada, yada, yada, taught us a great deal about what's required when you're building any kind of data framework. The point being, these data frameworks all share the core idea of processing data, but they all need certain aspects which make them powerful enough, which make them production-ready, the same way we say in our SaaS business that any production-ready app needs things like uptime and monitoring. So taking all those concepts and applying them to the data framework stack, we ended up with a few of these. I'm going to talk about each one of them individually, but I wanted to list them so you know what's coming up. I'll be talking about metadata, and hopefully it's not going to be just "it's meta, go figure it out yourself." Then instrumentation, storage, orchestration, data development, and the build process. A couple of things I'll mention in the middle are work in progress: these are my ideas, or Shopify's ideas, but they're not hardened, and they're being battle-tested right now. Whether or not they're good enough, I can't really say, because we're still testing them, and I'm pretty open to discussing those ideas and getting your viewpoints. So if you have any questions, feel free to ask me about those.

Going ahead: metadata. I promise I won't make it very meta, so I'm going to talk about very specific things and give you exactly what we did, why we did it, and what the reasoning was behind it. The first thing was data schema.
When we started out, we had JSON, and we were dropping all these data sets into what people call data lakes. The idea is that when you put data in a data lake, you just drop the data and don't worry about the schema. What became really painful was that people started dropping all kinds of fields into the JSON blob, things that didn't even make sense, just because they could store them somewhere. So we started enforcing schemas. Every time you drop any kind of data set into HDFS, or whatever storage you use, you drop the schema with it, and that schema has all the fields at that current point in time and the types of those fields: whether it's an integer, decimal, datetime, money object, yada, yada, yada. And you store that alongside the data. I'm going to get into storage of this schema a little later, but that's the first part of the metadata: store the schema.

Along with the schema, we also started storing retention and privacy policies. Being a public company, we have to follow a lot of rules and regulations around what we can store and what we cannot. And if someone asks us to delete their data, how do we handle that? So every schema tells you whether this data could change over time because of a delete request, and how long this data is going to stay in HDFS before it gets purged permanently. It really depends on the data set and what kind of data you're dealing with. Something like logs will probably remain forever, whereas something like passwords is never going to come in at all, and other fields get redacted under the privacy policy.

Next up is operational metadata. I listed these as fields, because these are the things we started storing, but what I want to focus on is the reasoning behind them. What came up quite often in our past data systems, up until the modeled data warehouse, was that when people were looking at reports, they would ask: is this actually up to date? How many records did you load last time? How are you doing these things? They also started asking questions like, where are you getting your data from? Can I trust this data? And then when our engineers and analysts came around, they were like, I want to know which code actually generated this data set; how do I find that out? And finally, because we have multiple data products, which data product is actually dropping this data set? The operational metadata was exactly those questions being answered by the data and by the system. The idea was that you'd record which source system the data came from, when it was built, how many input and output records there were, and which point in the code you could go back to and rebuild that exact same data set, knowing that it would actually build up and give you the exact same results.

Next up is business metadata, which is definitions, questions, sources, and owners. This is purely for business users. It has nothing to do with engineers and analysts. For all the reports we had, you could just go to the portal and ask questions, and it would tell you which reports match your specific question. The point is to be smart about it: instead of a new manager coming in and asking, do I look for GMV, or do I look for products, or do I look for shops, they could go to our portal and just ask in natural language.
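To make that concrete, here's a rough sketch of what storing a schema, retention hints, and operational metadata alongside a data set could look like. This is not Shopify's actual format; the dataset, field names, retention keys, and the git SHA are all invented for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical illustration: an Avro-style schema for an orders snapshot,
# with retention/privacy hints and the operational metadata dropped
# alongside each load. Names and values are made up for this example.
dataset_schema = {
    "type": "record",
    "name": "orders_snapshot",
    "fields": [
        {"name": "order_id",    "type": "long"},
        {"name": "shop_id",     "type": "long"},
        {"name": "total_price", "type": {"type": "bytes", "logicalType": "decimal",
                                         "precision": 20, "scale": 2}},
        {"name": "created_at",  "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
    # Retention / privacy hints carried with the schema.
    "retention_days": 365 * 7,            # purge from HDFS after this
    "may_change_on_delete_request": True, # delete requests can rewrite history
}

operational_metadata = {
    "source_system": "shopify_core",               # which system produced the data
    "built_at": datetime.now(timezone.utc).isoformat(),
    "input_records": 1_250_000,
    "output_records": 1_249_300,
    "git_sha": "abc1234",                          # code version that can rebuild this load
    "data_product": "starscream",
}

# Drop both next to the data files so downstream tooling can read them.
print(json.dumps({"schema": dataset_schema, "ops": operational_metadata}, indent=2))
```

The point is simply that the schema, the retention/privacy policy, and the load-time facts travel together with the data, so a tool like Overseer can answer the "is it up to date, can I trust it, which code built it" questions directly from metadata.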
I want to see how much that shop sold in the last 10 years, and it gets converted to the exact thing, or the exact report, that they need. So that's the other kind of metadata we started storing.

What all this metadata was actually meant for, in the end, was my happiness. I'd say Shopify is pretty strong on developer happiness. They care about my happiness, and my happiness is this one chart. This is what we call Overseer. The point of its existence is to tell me what data sets exist, what their dependencies are, what they depend on, whether they passed or failed in their last run, whether they actually loaded any data, and how late or behind the data could be. This one thing tells me everything I need to know about the system at any given point in time, and all that metadata feeds into Overseer, which is this graph. As you can see, I've pointed at one of the green dots, which means that job passed; it's channel transition facts, and it shows me a bunch of the metadata I need to know at that point in time, but it could be expanded into a lot more kinds of analysis. Taking it a step further, we also use it for impact analysis: if you make any change in a schema or in any kind of metadata, you know exactly which reports it will end up impacting, and in our case we break CI and don't let the developer or analyst merge in any change that would cause a breaking change in something else. So that's the point of metadata.

Next up, I've got a single slide for instrumentation. As an engineer, it's critical for me to know how my systems are running. And as data developers, we're pretty used to hard drive and network failures, a bunch of issues at the hardware level. So what we do at Shopify is instrument literally every little detail of the hardware that's running. We know how much network IO is happening, how much disk and memory IO is happening, whether something is dead. Along with that, we also have service instrumentation. When I say service instrumentation, it essentially means how a service is behaving. In our case, we use Redshift for loading our data sets. So is Redshift behaving properly? How do its metrics look? Is it getting hammered too hard? We don't have direct access to the boxes, so we have to work with APIs. But the point is, every single service in your system needs to be instrumented. And finally, job instrumentation. Job instrumentation can be seen as the metadata, just observed in real time. It's not dropped at load time; it's shown to you while it's happening. This is what one of our graphs looks like; this one is operational or system instrumentation. Job instrumentation would tell you more about which jobs are currently running, how many records they've loaded so far, how many records are remaining, and how they're performing overall. So that's instrumentation.

Next up is storage. This was a huge debate at Shopify. Initially we started with Vertica and putting everything in SQL, and suddenly we moved to JSON blobs. When we moved to JSON blobs and started using them with Spark, it was pretty good. As a developer, I could see my data. But it was really painful for the servers themselves, mostly because we were loading all that data and parsing it again and again, which made it really heavy.
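Going back to the job instrumentation point for a second, here's a minimal sketch of what emitting those real-time job metrics could look like, assuming a statsd-style client (the `statsd` PyPI package here; Datadog's client is similar). The metric names and the job itself are made up for illustration.

```python
from statsd import StatsClient  # assumes the `statsd` PyPI package is installed

# Hypothetical sketch: emit counters and gauges while a load runs so a
# dashboard can show records loaded and job duration in real time.
stats = StatsClient("localhost", 8125, prefix="starscream.channel_transition_facts")

def run_load(batches):
    loaded = 0
    with stats.timer("job.duration"):            # total wall-clock time of the job
        for batch in batches:
            # ... actual transform/load work would happen here ...
            loaded += len(batch)
            stats.incr("records.loaded", len(batch))     # counter per batch
            stats.gauge("records.loaded_total", loaded)  # running total for dashboards
    stats.incr("job.success")

run_load([[1, 2, 3], [4, 5]])
```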
So what I'm going to talk about for storage is what you should be thinking about when you're choosing your storage format. In my case, I'd recommend Avro or Parquet, and I'm going to tell you why.

First up, you want something that's shareable across applications. What that means is that different kinds of data applications should be able to both load and drop that data. Avro is a standard format; Parquet is a little less standard than Avro, but there are some pretty major differences to consider. The next thing is being fast to parse. JSON is super slow to parse when you're loading from disk every single time, and it takes over your network bandwidth when you're moving it around. Avro, on the other hand, is binary encoded, which is nice because it's super tiny, and when compressed it can easily handle your load. The next thing is that the wrapper should also worry about schema. I talked about schema earlier, but how do you actually store it? My recommendation is to store the schema inside the wrapper, which is what Avro does: for every single file, it stores the records, and it also stores the schema that all those records conform to.

Next would be schema evolution. Sorry, I am getting a little thirsty. The point with schema evolution is that your data is going to change. Every time you load new data, every time you have a migration, every time a developer decides to change an event, your data is going to look different. It might be something as trivial as adding a field, or something a little more complex like changing a data type. Your schema library should be able to handle that kind of evolution and go from one version to the other without making all the downstream jobs break. Schema evolution is something Avro supports. Other libraries like Thrift and Protobuf also support it, but in Hadoop land, Avro is probably the best choice.

So what are your options? Avro, Thrift, and Protobuf are all row-based formats, and then Parquet and ORC are columnar. The idea behind columnar formats is that it's much easier to run a query on them: the query won't load the whole data set, it only loads the columns it needs, and is therefore a little nicer to your machines and your memory. What I would recommend is: please, for the love of God, don't use JSON, CSV, or plain text files. Initially it might seem fine, but as you scale your systems, they don't scale with you, and you need something that scales as your data grows.

Next up is orchestration. By orchestration I mostly mean running workflows. Oozie is probably the standard right now, but Shopify has had multiple experiences: we went from Luigi to Azkaban, and now we're going to Oozie. The biggest point is being able to schedule things when data changes, which I believe Luigi supports in an iffy manner, Azkaban doesn't, and Oozie does. When you're picking anything for orchestration, you should be thinking about whether it does event-based or time-based scheduling. With event-based, the event could be data being dropped, or something coming from a message bus. In our case, sometimes it's time-based; for example, our CFO looks for his reports at 7 in the morning every single day.
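To illustrate the "schema travels with the data" and schema evolution points before moving on, here's a small sketch using the fastavro library. The PageView record and its fields are invented; the mechanics shown (writer's schema embedded in the container file, a newer reader schema applied on read) are standard Avro behavior.

```python
from io import BytesIO
from fastavro import writer, reader, parse_schema  # assumes the fastavro package

# Version 1 of a hypothetical schema.
v1 = parse_schema({
    "type": "record", "name": "PageView",
    "fields": [{"name": "shop_id", "type": "long"},
               {"name": "path", "type": "string"}],
})

# Version 2 adds a field with a default: a compatible evolution.
v2 = parse_schema({
    "type": "record", "name": "PageView",
    "fields": [{"name": "shop_id", "type": "long"},
               {"name": "path", "type": "string"},
               {"name": "referrer", "type": ["null", "string"], "default": None}],
})

buf = BytesIO()
writer(buf, v1, [{"shop_id": 42, "path": "/products/1"}])  # schema is stored inside the file
buf.seek(0)

# Old data read with the new schema: downstream jobs keep working.
for record in reader(buf, reader_schema=v2):
    print(record)  # {'shop_id': 42, 'path': '/products/1', 'referrer': None}
```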
So every single day there's a job at 12 which makes sure that all the data from yesterday is ready for him to look at at 7 in the morning. So support for both of them is extremely important.

Next would be notifications. For notifications, what you're looking for is essentially that whenever something goes wrong or starts going haywire, the system tells the person responsible to fix it or look at it. Most workflow tools now have some kind of notification or failure email, but if you could have something smarter, like "I'm not loading enough records," that would be even more awesome. I don't believe anything out there does that yet, but for failure notifications, pretty much all of them do. And lastly, metrics and logs. When we were initially looking at things, no one really cared about metrics and logs that much. They were like, we'll worry about it when we need it. But logs are critical when you're building data sets and data products. Knowing how much data you're loading, knowing what is happening in your current data product or application, is critical to any data engineer, and when you're picking an orchestration system you should think about both metrics and logs and whether it supports them or not.

This is probably my favorite slide, because it's what I care most deeply about. It's small, it's just one slide, but it matters the most to me. When you're building a data product, don't just care about your data; care about who's going to work with it, who's going to be building on it, both analysts and engineers, and how they're going to work with it. In our case we built a framework called Starscream, and Starscream has a bunch of things in it which make for happy developers, or at least I hope they're happy. One of them is impact analysis, which is what I showed you earlier. Overseer is an impact analysis tool: it reads all the metadata, knows when things are going to break, and notifies you before you break them, as opposed to most data products, where you find out you broke something only after you broke it. So have something that gives you impact analysis and tells you on CI whether a change is safe.

Next up is CI. I'm giving the example of Shopify and Starscream. We run on Spark, and every single transformation that we do on our data is actually tested in continuous integration. We test whether the data is going to pass or not, whether it's allowed or not, whether it meets our standards or not, and that should be part of your data product.

The next one is probably the hardest thing that came out of our experience. Every time a data engineer was building something, they would either work locally with a very small subset of data, which is not representative of the actual data profile, or they would work on the whole Hadoop cluster and take over resources, because they wanted to run some job over every single page view that happened on Shopify. What we realized was that if we could somehow deterministically bring down a subset of data and let developers use that, they could just set a percentage and we would cull the data for them. The idea is that you give it a percentage, and what you get is data pulled down to your local machine with the actual profile of what's on the production servers.
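As a sketch of this culling idea, deterministic, representative sampling could look something like the following in plain PySpark. This is not the actual Starscream code; the paths, the column names, and the choice of stratifying on the page path are assumptions for illustration (reading Avro also assumes the spark-avro package is available).

```python
from pyspark.sql import SparkSession

# Rough sketch: deterministically pull down a small, representative slice of a
# huge data set so an analyst can develop locally. Names and paths are made up.
spark = SparkSession.builder.appName("cull_pageviews").getOrCreate()
pageviews = spark.read.format("avro").load("hdfs:///data/raw/pageviews")  # hypothetical path

def cull(df, key_col, fraction, seed=42):
    """Stratified sample by key_col so the skew of the original data is preserved.

    Using a fixed seed keeps the sample deterministic from run to run.
    """
    keys = [row[key_col] for row in df.select(key_col).distinct().collect()]
    fractions = {k: fraction for k in keys}
    return df.sampleBy(key_col, fractions, seed=seed)

one_percent = cull(pageviews, key_col="path", fraction=0.01)
one_percent.write.mode("overwrite").format("avro").save("hdfs:///data/culled/pageviews/p01")
```

A daily job like this could materialize the 1%, 10%, 25%, and 50% slices so developers only ever download the size they need.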
So for example, if I have the pageview fact, the pageviews of Shopify, and they're highly biased towards certain pages, then the culled data should also be biased in the same way. The way we do this is we run a job which creates culled data sets at various ratios for you, and you can just download them. It goes over the whole data set, gets its profile, what every field looks like and how often each value is seen, and then takes certain ratios, say 1, 10, 50 percent, and gets those ready in HDFS every day, so that you can just download that data set and build with it.

A step further with culling: you never work with a single data set, you're always joining different data sets. So what you can do, a step ahead, is give it a job; it knows all the job's dependencies and gets all of those data sets in a culled manner that will still join the way you expect. That's the next thing we did: you give it, say, a fact-building job about admin pageviews, which requires pageviews, user details, and shop details, and culling will go into every single data set, look at how they join, and pull them down to your local machine for your developer to develop against.

After that comes reconciliation. In my experience at Shopify, every time we had data products we would just change things, and the data would change with them. But once we started modeling our data and went public, we couldn't just change our data hoping that nothing else would change. We needed to be sure that any time you make a code change, it won't cause a change in the data sets that isn't expected. Someone on the finance team joked about it and called it RDD, which is an abstraction in Spark, but from our perspective it means reconciliation-driven development. The idea is that every time you make a code change and push it to GitHub, a bot will run your job and run the same job on master, and check that the data changed only where you expected it to. In our case the command looks like: reconcile job one, and I expect only column one to change. It will build both data sets, compare them completely, and make sure that only that one column changed.

And finally, error reporting. At Shopify we decided to differentiate between filters and rejects. I'm going to go into it a little. This might be a bit of a dicey topic, because there were strong arguments internally, and I'm guessing there will be strong arguments here too. In our data sets, you can not only filter your data, you can also reject it. A reject is data I know is bad, so I get rid of it early on; a filter is just data I'm getting rid of without recording a reason. Rejects are actually stored separately, with the job recording the reason it didn't like that data and got rid of it. Some people have argued that one or the other is not required. We can talk more about it if you feel strongly about it, but that's what we do right now.

Next up is the build process. This is work in progress, we're still trying to figure it out, but I was mentioning to someone when I was talking about presenting here that incremental versus full builds is something that's very much being discussed in the data ecosystem right now.
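Before getting into builds, here's a rough sketch of the "reconcile job one, expect only column one to change" check described above, in plain PySpark. The paths, the join key, and the comparison strategy are all hypothetical; the real bot builds the branch and master outputs itself and compares them.

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("reconcile").getOrCreate()

def reconcile(master_df: DataFrame, branch_df: DataFrame, key: str, expected_changes: set):
    """Fail if any column other than `expected_changes` differs between master and branch."""
    common = [c for c in master_df.columns if c in branch_df.columns]
    joined = (master_df.select(common).alias("m")
              .join(branch_df.select(common).alias("b"), on=key, how="full_outer"))
    unexpected = []
    for col in common:
        if col == key or col in expected_changes:
            continue
        # Null-safe comparison: count rows where the column differs.
        diffs = joined.filter(f"NOT (m.{col} <=> b.{col})").count()
        if diffs:
            unexpected.append((col, diffs))
    assert not unexpected, f"Unexpected column changes: {unexpected}"

# Hypothetical outputs of the same job built from master and from the branch.
master = spark.read.parquet("hdfs:///warehouse/master/order_facts")
branch = spark.read.parquet("hdfs:///warehouse/branch/order_facts")
reconcile(master, branch, key="order_id", expected_changes={"total_price"})
```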
When I talk about incremental, what I mean is you only look at the data that has changed since the last time you looked, so it's only the diff logs or change sets, whereas with a full build you can assume that every time you process your data you get the whole data set and you look at it completely. One thing that we definitely learned is: don't do your aggregations as part of your incremental jobs. If you're only looking at diff logs, don't try to be smart and say, I'll just add to the total, or I'll just take the mean, when you're aggregating your data or taking sums. It's a pretty hard problem, and if you make a mistake you need a rollback strategy, which is not straightforward in incremental jobs. I would suggest having a separate job that only thinks about aggregating. So you have one job that does your incremental part, which is cleaning and conforming the data, and then you have your aggregation part, which just goes over the whole data set and aggregates it, which is really straightforward.

And then finally MOTM, which is what we call min of the maxes. Whenever you're working with a bunch of data sources, you should worry about how far ahead each of those data sources is compared to the others. For instance, at Shopify we have orders, and each order has many line items, but they could both come into Hadoop at different points in time, so you could theoretically have an order which doesn't have all its line items yet. It's basically a consistency problem. What we do about it is make sure that you only look at the min of the maxes: you only process up to the timestamp which is the minimum of all the max values across your data sets. Say, for instance, orders are present up to yesterday and line items are only present up to the day before. Then your incremental job only processes up to the day before, because the rest of the data is actually missing. This is still work in progress, so I don't have a lot more to add on the build process, but it's still building up, I guess.

And finally, these are things I haven't really talked about, and we're still trying to figure them out. One is data movement: how data moves from one place to another and how it should be stored. We're using HDFS right now, but it might be a better idea to think about another way of storing these data sets, or of moving them from the source, which is Sqoop, or in our case a custom-built solution. People have talked about Kafka and streaming, but I believe data movement right now is highly dependent on the company you're in, what your SLAs are, and how early you want that data. Second is data quality. Personally, I feel you end up building multiple frameworks for data quality that evolve with the company. We haven't come up with a solution that would be ideal for a data quality framework that can also evolve as your data products evolve. I'd love to hear your ideas if you have any. And finally, partitioning. That's again something that's left to the company and how you query your data. You could partition on datetimes, but in our case we also partition on shops. That's very specific to the kind of queries you're running.

That's it. At this point I'll open up the floor for questions, anything I can answer.
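As a sketch of the min-of-the-maxes watermark, the incremental cut-off could be computed something like this. The table paths and the loaded_at column are invented for illustration; the point is just that you only process up to the timestamp every input has reached.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("motm").getOrCreate()

# Hypothetical raw feeds that land in Hadoop independently.
orders     = spark.read.parquet("hdfs:///data/raw/orders")
line_items = spark.read.parquet("hdfs:///data/raw/line_items")

def max_loaded_at(df, ts_col="loaded_at"):
    """Latest timestamp this source has reached."""
    return df.agg(F.max(ts_col).alias("max_ts")).first()["max_ts"]

# Min of each source's max timestamp = how far we can safely process,
# so an order never shows up without its line items.
watermark = min(max_loaded_at(orders), max_loaded_at(line_items))

orders_inc     = orders.filter(F.col("loaded_at") <= watermark)
line_items_inc = line_items.filter(F.col("loaded_at") <= watermark)
order_facts = orders_inc.join(line_items_inc, on="order_id", how="left")
```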
You talked about storage, meaning how you guys are storing the data. You talked about Oozie, no, not Oozie, you talked about Avro, Thrift, or other protocols...

Sorry, I'm sorry, I didn't catch that last part. Hello? Yeah.

So you talked about storage, like you guys are using Avro or other storage protocols to store the data. Are you talking in the context of HDFS, or in terms of some NoSQL store?

HDFS, mostly.

So you mean you're storing the data as files in Avro form, and Avro is actually doing the marshalling for you?

Yeah.

Okay. So you're not using any NoSQL to query the data. It's more that you always batch process your data to get meaningful information out of it, right?

Only recently we started using Cassandra to store data that's needed for incremental jobs. So if you're doing a large join, we look it up in Cassandra, but all that data is also present in HDFS in case you're doing a batch processing job. So we do use NoSQL in our stack, but it's not primary; it's secondary to HDFS.

Okay, and how does the performance differ? As you suggested, never store the data in CSV or JSON format, which is readable; sometimes you just need to look at the data. But if you're using these protocols that encode and decode the data to make it smaller and faster to process, isn't it a challenge that you always need to process the data just to look into it?

That was exactly the reason we selected JSON initially, developer happiness: we needed to look into the data itself. But what we realized was, first, that Hue lets you look at Avro data, and second, that the performance implications are extremely high. I don't have the graph up, but the memory difference between a Spark job that loads JSON versus Avro is significant, and the amount of time it takes to move it from one box to another over the network is also significantly different.

That's a significant improvement you saw. And in terms of orchestration, you talked about Oozie, so it's more about combining jobs; if you have a big task to do, it's more about job orchestration, right?

Yeah.

Okay, thank you.

Any other questions? So you talked about representative data, can you explain more? Sorry, representative data? Yeah, I mean that representative set from the actual set: if you have a very big data set, you want a representative set from that, right?

Right, culling. Okay, let me use the example of page views. Say, for instance, you have a hundred million page views happening in an hour. In our case it's actually a lot more, but in general page views is a pretty big data set, and you can't expect your analysts to always hit production machines and run their jobs on the whole data set just to see if something is working.
So what we realized was that what we needed to give an analyst or an engineer is the ability to get that same data set locally, but it can't be the whole data set. I can't ask them to download a terabyte every day; what I want them to download is a gigabyte, but that gigabyte should be very much representative of the data. And when I say representative, the profile should match: the number of null values, how often a page is seen, those should still be relatively similar, or the probability of them being different should be very low. The idea with culling was that you deterministically figure out how the data looks and then give it a percentage. So as an analyst I could say, I'm going to work with 10% of this data, I give it 10% and cull it, and it will go look at the whole page views data set, look at its profile, select rows that match the criteria, and then download them to the local machine. We run this job on a daily basis for our critical data sets, things like GMV or orders, which are pretty big, and we store 1% of the data, 10%, 25%, 50%, on a daily basis. An analyst can start by working with just 1%, build their job, go from that to testing it on 25%, making sure they caught all their assumptions, test it on 50%, and then we let them run it on production to make sure they caught every edge case that possibly existed in that data set and all their assumptions are still valid. So that's the idea behind culling.

Is this offline processing? I mean, how much time does it take to do the culling?

It depends on the data set. Something like page views would probably take a few hours, whereas something like shops would probably happen in 15-20 minutes.

Okay, thanks.

Hey, you were talking about continuous integration. I just wanted to know what kind of tools you're using and what exactly happens underneath, since we're talking about big data: how exactly do you measure the code quality for a commit, for example, how does it break, and things like that?

At Shopify we built something called Starscream, a framework I can't go into much detail about because it's not open source, but the idea is that Starscream is used to build dimensionally modeled data, or star schemas. It's based on Spark, so we took all the Spark primitives and made mocks out of them, and every single transformation you do on your data set has a corresponding test. You have your fixtures, which are your test data, and then you apply those transformations to make sure the data is actually correct.
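Since Starscream itself isn't open source, here's only a rough approximation of that fixture-plus-transformation testing style, written with plain PySpark and pytest. The transformation, column names, and fixture rows are invented for illustration.

```python
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def build_order_facts(orders_df):
    """The transformation under test: keep paid orders and add a net_total column."""
    return (orders_df
            .filter(F.col("financial_status") == "paid")
            .withColumn("net_total", F.col("total_price") - F.col("total_discounts")))

@pytest.fixture(scope="session")
def spark():
    # A small local Spark session stands in for the real cluster on CI.
    return SparkSession.builder.master("local[2]").appName("starscream-style-test").getOrCreate()

def test_build_order_facts(spark):
    # Fixture data: one paid order, one refunded order.
    fixture = spark.createDataFrame(
        [(1, "paid", 100.0, 10.0), (2, "refunded", 50.0, 0.0)],
        ["order_id", "financial_status", "total_price", "total_discounts"],
    )
    result = build_order_facts(fixture).collect()
    assert len(result) == 1                 # refunded order filtered out
    assert result[0]["net_total"] == 90.0   # 100 - 10
```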
The CI tool runs that against the mocks and also runs it on an actual Spark cluster with test data, to make sure the transformations and the assumptions are correct and still hold true when someone changes the code. And then, finally, it does impact analysis: every time you push a branch to GitHub and CI runs, it sends a request to Overseer, which is the impact analysis tool that knows about all the data sets, all the jobs, and all the schemas, and it checks whether anywhere in the DAG you're breaking an assumption someone else made, and then tells you: this is broken, you cannot merge it in. That goes all the way to Tableau, which is our reporting tool, so if I change the name of a field that's being used in Tableau, that will not get merged in, because it will be red on CI. Those are the two primary things CI does for us: impact analysis and code coverage. Does that answer your question?

Yeah, thank you.

Excuse me, over here. You mentioned culling, right? So basically you have, let's say, terabytes of data and you want to give only gigabytes of data to your analysts so they can explore it in a better way. How are you giving this freedom to analysts, are you giving it through Redshift or something else? And is it a daily job or a weekly job?

We do a daily job right now for most of our critical data sets. Our analysts are also allowed to run the culling job for a specific data set that isn't already present for them. So you can run it yourself, or you can pick one of the data sets that's already present on HDFS. It's stored as a file on HDFS; you have to download it, but it's significantly smaller.

And for storage, which technology are you using?

Sorry, for downloading?

For storage, yeah, for storage.

Just Hadoop commands, like hadoop fs.

So, command line. Okay.

Yeah, hi, I have a question. I'm sorry, where are you? Yeah. So I have a question: how do you handle inconsistent data? Whatever format you store the data in, the data is continuously changing, so how do you deal with the inconsistency of that data?
Something like that happened with our Kafka data source, where the data was constantly changing with event streams. Say, for instance, you have an event for admin, admin being all of the Shopify admin, and all the events coming in had different profiles attached to them. What we realized was that just having an open firehose was a bad idea, and we moved to curated data sets: every single event stream is a single type of event, and the schema only changes about as often as your database would change its schema, so only once in a while. We manage that with Avro schema evolution. Avro allows you to change the schema and tells you whether the change is backward or forward compatible. In Kafka's case, for example, you have your Avro schema, and if you make a change to that schema that's not forward compatible, it will actually break things: CI will break and won't let you merge that change in, because you broke something. But if it's compatible, it will just let you have it. Adding a field is always a compatible change, whereas changing a data type is an incompatible change in most cases. So that's how we did it. Thanks.

We have time for only one more question.

Hey, hi. I have basically two very quick questions, specifically from a monitoring point of view, to see if everything is working fine and things like that. Do you have an in-house custom solution which you have built for monitoring?

Do I have an in-house what?

For monitoring, an in-house solution.

We use Datadog, which is Graphite-based, for monitoring.

So that's for real-time monitoring?

That is right. Along with that, we use Splunk for our log monitoring. Our code base sends events to those things.

Okay, and for visualizations?

Again Datadog; they provide a visualization library.

Thanks a lot.

Last question, two more minutes.

Hello, hi, this is Praveen. I want to understand: did you ever face a compliance problem where you store PII information in different geographical locations, and did you use Hadoop Federation?

Sorry, can you repeat that?

You store PII information, so did you use the concept of Hadoop Federation or some other methodology?

We have been looking at it, but we haven't implemented anything with Hadoop Federation; we just handle it at the time of loading. Instead of Sqoop, we have a poor man's Sqoop which we call Longboat, which we built in-house, so PII control happens at that layer itself. Right now we don't use anything on the Hadoop side.

Okay, but how do you manage to use that data and show it across the whole geography?

We have a central service for PII which tells us what rules are applicable for private data and whether that data has been removed or not, so we're already aware of deletes in our data sets. The things we need to care about are things like a phone number which needs to be redacted: you're not allowed to store the whole phone number, you're only allowed to store the area code. So we have a separate service which manages the different rules and knows about different kinds of PII data. We hit that service and ask whether this field needs to be PII compliant, and if yes, what kind of redacting logic we need to apply. You give it the field, it applies the redacting logic, and then you store it into HDFS. Does that make sense?

Yeah, I got it. I...

I feel really
sorry, but I'm going to have to cut you off there; please take this offline.