Hello, everybody, and welcome to the next talk in the power room for data science. I'm delighted to be introducing Fracti Yadav, who is going to be talking to us about designing your data lake to cope with changing schemas. Hello, Fracti, and welcome to EuroPython.

Hi, Ali. Thank you for the warm welcome.

Where are you calling in from today?

India.

Oh, wow. I'm in the UK, so quite a difference. I'm personally really excited for this talk, so without further ado, I will let you take over.

Sure. Thank you. Hello, everyone. Thank you for coming, and welcome to this talk, entitled Data Lake Design for Schema Evolution. My name is Fracti Yadav, and here is a little about me before we hop into the talk. I am a senior data engineer working at Episodes on the NLP team. Episodes as a company provides simplified healthcare solutions to healthcare payers and provider organizations, and we are building an NLP engine that can simplify medical coding in the healthcare domain. I personally love to architect software and data handling pipelines, and I strive to build the simplest architecture designs, because, as they say, building a simple architecture design is really a difficult job. My areas of expertise and key skills are Python, cloud, containers, MLOps, automation, and everything that has to do with data engineering. If after the talk you want to connect with me, you can ping me on LinkedIn; my username is Fracti Yadav. Like I said, I like to keep it simple, and my username goes the same way.

So without wasting any more time, let's move to the agenda. Today's agenda is to understand how we can build a data lake. We will start with what a data lake is, what its main characteristics are, and what the important components are when you are building one. After that, we will move to the challenging problem that comes with a data lake: schema evolution.

So what is a data lake? A data lake is a place where you can collect everything, everything as in all the data that you want to collect. It's a centralized repository where data can flow in from multiple sources. That data can be structured or unstructured, but it's a central place that allows you to store all of it.

Now why do we need a data lake? In organizations with multiple sources and streams of data flowing in, what is required is a single source of truth. When multiple teams access the same piece of data, it should look the same to all of them under the same conditions it was created for. For that purpose we need a single source of truth, and that requirement is what leads to the data lake.

Next, let's correlate the data lake with a real-life situation. Imagine a large body of water, with the water in its natural state. Many brands can use that water, process it, and produce packaged water bottles. These brands can put their own labels on it and advertise different qualities of the water depending on their processing steps.
So what is happening is that multiple brands are using the same lake, the same body of water, to produce their bottles; their processing steps are different and hence their outputs are different, but what stays the same is the lake they consume, and that lake is in its natural state. The same goes for a data lake. In the data lake, the data is stored in a very raw state, and anyone who wants to consume it can process it further to get value out of it. This analogy was given by James Dixon. I particularly like it, and whenever I have to explain a data lake to someone, I bring it up.

Okay, next let's understand the characteristics of a data lake. A data lake starts from the need to collect data in one place, so it should give you the flexibility to collect everything in one place. Once you have the data in one place, different personas will want to dive into it at different granularity levels. For example, a business user would like aggregated data to build their dashboards, while a team of engineers would need the data to train their ML models, and for that they would need to dig deep into the data lake. So your data lake should provide data at various granularity levels.

Once people know that the data they want is present in the data lake at the required granularity level, the next thing they will do is use their tools of choice: business users will use dashboarding tools, and engineers might use SQL or Spark clusters to process the data. So to make use of the data, your data lake should provide flexible access, so that every team can use its own choice of tools and make sense of the data. That covers how we store it, what kind of access people need, and how they actually access the data.

Next, as business cases evolve, the data structure evolves too. The schema of the data you are ingesting into your data lake will change over time. That means you have to make your data lake future-proof: when the data structure changes, the base design and base architecture of your data lake should not be disturbed. Design it in a way that whenever the data schema changes, the data lake can adopt those changes on the fly. These are a few key characteristics to keep in mind when you are designing a data lake.

Next in line we have the important components of a data lake. I have listed the four components that are a must when you are designing one: the first is storage, the second is search, the third is process and use, and the fourth is security. We will go through each component one by one.

First is storage. The storage platform you use for your data lake should allow you to store anything, and as business use cases evolve, the data will also grow, so it should be scalable on demand. If today I suddenly have a huge amount of data to ingest, I should not have to make preparations beforehand; the storage solution should scale on demand. The third point is a really interesting one: keep compute and storage independent of each other.
When we use Apache Hadoop for storage, compute and storage are quite tightly coupled: if you want to increase storage, you also have to increase compute, and there is a lot of maintenance involved. So when you are designing your data lake, make sure that storage and compute are independent of each other, so that each can scale on its own. The data you will be storing is object-like, so make sure your storage solution supports object storage.

The next point is taking care of storage classes. As the data grows, data that was stored, say, five years ago may now be infrequently accessed, while the data I am storing today is frequently accessed. You should differentiate the storage tier of frequently accessed data from that of infrequently accessed data, because you pay the storage provider for it. Build your data lake and its storage classes so that whenever you see that some data will be accessed infrequently, you can move it to an infrequent-access or archive tier where the cost is lower. I will show a small sketch of what this can look like a bit further down.

The next is versioning. We have multiple data streams ingesting into the data lake, so there is a chance that these streams overwrite each other's data. For that reason we should have versioning enabled at the data lake layer, so that we know which version was overwritten by which stream. These are the storage-layer specifications for your data lake. At Episodes, we use AWS S3 as our storage solution, and S3 provides all of the mentioned features at a reasonable cost.

The next component is search. We have all these sources pushing data into the data lake, so the next step is to build a robust data catalog that tells the users and teams what kind of data lives in the data lake. We should enable data discovery mechanisms so that people know the metadata of the data being stored. Every day new data is added to the data lake, so make sure you have crawlers crawling the new data being pushed, extracting the metadata, and updating the catalog. Once this metadata is available in your data catalog, the next step is to build an API that exposes it to the teams. Suppose a business user wants to know whether data is available from, say, January to March 2021; they can simply query the metadata to learn the size of the data and how many files there are. That is why we need a strong search mechanism for the data lake.

Now, once users have the metadata available, the next thing they want is to use the data and build solutions with it. Like I said, business users will have one tool set and engineers will have another, so your data lake should give you the flexibility to use different tools; for example, you can read the data with Spark. Whatever format you write your data into the data lake in, make sure it is readable through different tools.
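Going back to the storage points for a moment, here is a minimal boto3 sketch of what versioning and storage-class lifecycle rules could look like on S3. The bucket name, prefix, and transition days are made-up values for illustration, not something from the talk.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-raw"  # hypothetical bucket name

# Enable object versioning so overlapping ingestion streams never silently
# overwrite each other: every overwrite keeps the previous object version.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Move objects to cheaper storage classes as they become infrequently accessed.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "inference/"},  # hypothetical prefix
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```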
Business users will need it for dashboarding and engineers will need it for processing clusters, so there should be easy connections provided at the data lake layer.

Now, with so much data flowing in, we have to make sure the data is secure. In our case, Episodes deals with healthcare data, so we are under HIPAA compliance, and for us it is essential that the data is encrypted both at rest and in transit. Also, when multiple teams access the data lake, we have to make sure we are granting access through role-based access: a role is assigned to each team, and they can access only a certain amount or pattern of data. Network security has to be built in so that your data is not exposed to the world. And whenever teams access the data, you would prefer to record those API calls so that you have the usage patterns of your data lake: you can see that a team is using this much data, or whether someone is accessing data they don't need. Just to keep that usage history, record the API calls.

Now let's move to the main challenge associated with data lakes, which is the schema evolution problem. First, why do we have to design for schema evolution at all? As business cases evolve, the data associated with the products they offer also changes, and that leads to schema changes. As business use cases evolve, the schema evolves.

Second, a key feature of a data lake is that it should give you schema on read. By schema on read, I mean that the data is written with a particular schema, but the user should have the flexibility to use their own schema to read it; they should not be bound to the exact schema the data was written with. Your data lake should provide that flexibility and ability, and if the schema keeps changing, giving that flexibility is a challenge. There is a short sketch of this idea just after this part.

The third is having a schema registry. If there are multiple versions of a schema, we need to record those versions, and to do that we need a schema registry that keeps track of every version as it changes. The fourth is that when a new version is introduced, data reads and writes should remain unaffected: if a new schema version is introduced into the data lake, it should never happen that someone who wants to read older data, written with an older schema, can no longer read it. These are a few reasons why we have to design specifically for schema evolution, because your data will definitely evolve as your business use cases evolve.

Here we will discuss the problem statement and the challenges we faced at Episodes, and the kind of solution we built to solve them. As I mentioned, I am part of the NLP team, and we are building an NLP engine to simplify medical coding in the healthcare domain. For us, the data of interest is the output of our NLP engine. Episodes' machine learning and NLP engine processes millions of pages of medical documents every year, and the amount of data we process increases year by year, no doubt.
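To make the schema-on-read idea concrete, here is a minimal PySpark sketch: the reader supplies its own schema at read time instead of being bound to whatever schema the data was written with. The field names and the S3 path are hypothetical, purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The reader chooses the shape of the data at read time: we pick out just the
# two fields we care about, regardless of what else the raw JSON files contain.
read_schema = StructType([
    StructField("document_id", StringType(), True),
    StructField("predicted_code", StringType(), True),
])

# Hypothetical path into the raw zone of the data lake.
df = spark.read.schema(read_schema).json("s3a://my-data-lake-raw/inference/")
df.show()
```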
Coming back to our NLP engine: under it we have up to 15 machine learning and deep learning models working together to produce the whole inference output. By inference output, I mean the full prediction that is made on the medical documents. With so many models involved in the inference, the output that is generated is a complex nested JSON series. If you have worked with nested JSON, you will already have guessed how difficult it is to maintain that schema as it keeps evolving. For us the problem was the same: whenever there was an update to our NLP engine, the inference data structure evolved with it.

The challenge we faced while maintaining the data schema of this evolving NLP engine was that, as the data grew in size and complexity, we had to store it and at the same time make it searchable for our team members. Suppose someone on my team wants to analyse how our NLP engine is performing; they would go back and query this JSON. If the version has changed, they would ideally have to write two pieces of code dealing with two different schemas of the same JSON. That was the problem. We had to make it searchable somehow, and whatever solution we built had to keep schema compatibility, versioning, and data integrity intact. To keep track of this complex nested JSON data, we had to make sure that whatever solution we built, the schemas would keep propagating along with it.

The solution we came up with was to serialize this JSON data using Apache Avro. What is Apache Avro? Avro is an open-source project that provides data serialization and data exchange services. Data serialization means turning data into a binary, bytes format. Apache Avro gives you that facility, and since it converts your data into a binary format, it also gives you the advantage of compact storage, because converting from JSON to Avro reduces the size of the data.

Avro stores both the data definition and the data together in one message or file. So if I convert one JSON file into Avro format, the Avro file stores the data in binary form and also stores the data definition. The data definition is stored in JSON format, so whenever you look at an Avro file you can clearly see the schema it was written with. Avro by itself handles schema changes such as missing fields, adding a new field, or changing the name of one of the fields; these are a few changes Avro takes care of by default. Avro has APIs for multiple programming languages; for Python you can try converting your JSONs to Avro using fastavro, the Python package for Avro.

For us, the requirement was both backward and forward evolution. How does Avro provide backward compatibility? If you have a new schema, you can use the new schema to read data that was created using previous schemas. In terms of forward compatibility, an older schema can be used to read data that was created using newer schemas.
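As a rough illustration of what that conversion looks like in Python, here is a minimal fastavro sketch. The record layout and file name are invented stand-ins for the much larger nested inference output, not our actual schema.

```python
import fastavro

# A tiny stand-in for a (much bigger) nested inference output.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Inference",
    "fields": [
        {"name": "document_id", "type": "string"},
        {"name": "predictions", "type": {
            "type": "array",
            "items": {
                "type": "record",
                "name": "Prediction",
                "fields": [
                    {"name": "code", "type": "string"},
                    {"name": "score", "type": "double"},
                ],
            },
        }},
    ],
})

records = [
    {"document_id": "doc-001",
     "predictions": [{"code": "E11.9", "score": 0.97}]},
]

# The Avro container file stores the records in binary form together with the
# schema (as JSON), so the file is self-describing.
with open("inference.avro", "wb") as out:
    fastavro.writer(out, schema, records)

with open("inference.avro", "rb") as fo:
    for rec in fastavro.reader(fo):
        print(rec)
```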
In the next slide we will look at an example that shows how we can actually use schemas interchangeably. Avro has very simple integration with dynamic languages, and since it stores both the data definition and the data together, it makes the data fully self-describing: you don't need anything else if you have the Avro file in front of you, you can make sense of the data directly from it.

Here is an example of how schema evolution works. Suppose I have a Student record where, as of now, I am storing the student name, which is of type string; date of birth, again a string; age, of type integer; and id, of type integer. This is version one of my schema. In version two, I decide I no longer need the age field, so I remove it, and I also realize that my id field can be either an integer or a string. For the id type we can define a union data type; Avro supports union types, meaning the value can be either one type or another. In our case the id can be an integer or a string; I could even have defined it as integer, string, or null, and Avro provides that functionality too.

Now, if we use the version two schema to read data that was written with the version one schema, it will read the id as an integer, since version two declares that it can be either, and it will simply skip the age field. So if you are reading it with PySpark, it will skip the age column even when reading old data, since we have not mentioned the age field in the new schema. Similarly, if in version three of the schema you decide that date of birth is not required and remove it, you can still read the version one data with version three; it will then also skip the date of birth column for you. That is how Avro takes care of data type changes and removed fields, and if I add more fields, that can also be taken care of.

There is also one more interesting feature, which is aliases. How does an alias help? Suppose my field named id is now changed to id_. In my version two I can declare that this id_ field has an alias of id. So even though version two names the field id_, when reading the older data my new version knows that id_ has the alias id, and it can read the older data's id values under the column id_.

That is how Avro is helping us keep up with the schema evolution of our JSON series. What we do is take the whole JSON output of our inference engine, convert that output into Avro format, and dump the Avro directly into the data lake. We do not make any changes to the JSON we receive; we just dump it directly in Avro format. So the data is still in its raw form, it has just changed shape so that it is easily searchable.
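Here is a minimal fastavro sketch of that student example as I described it: data is written with the version one schema and read back with version two as the reader schema, so the age field is skipped and the id resolves into the union. The field names mirror the slide; the file name and sample values are made up.

```python
import fastavro

# Version one: the schema the data was originally written with.
schema_v1 = fastavro.parse_schema({
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "date_of_birth", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "id", "type": "int"},
    ],
})

# Version two drops age and widens id to a union of int and string.
schema_v2 = fastavro.parse_schema({
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "date_of_birth", "type": "string"},
        {"name": "id", "type": ["int", "string"]},
    ],
})

with open("students.avro", "wb") as out:
    fastavro.writer(out, schema_v1, [
        {"name": "Asha", "date_of_birth": "1999-04-12", "age": 22, "id": 42},
    ])

# Reading v1 data with the v2 reader schema: age is skipped, id still reads as 42.
with open("students.avro", "rb") as fo:
    for rec in fastavro.reader(fo, reader_schema=schema_v2):
        print(rec)
```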
Now, with this Avro, you can use PySpark to query the data directly. In our use cases, when we have to publish dashboards to business users, we then convert it further into Parquet for the frequently accessed data. That is how the Avro gets consumed further into the data layers of our organization. So that was all about how we built our data lake and how we overcame our JSON schema problem. Thank you.

Thank you so much for that, Fracti. We do not as yet have any questions in the chat. Oh no, we've just got one here: you mentioned that the evolving schema versions are searchable for your colleagues. What do you use to search or visualize the schemas, in the sense of a data catalog?

Okay, so we are using the Confluent Schema Registry; there is a schema registry for Avro, and we are using it to store the schema versions for our data.

Just checking the chat to see if the person who asked that question has a follow-up. Yes. Otherwise, thank you so much for that. As somebody who used to be in data engineering, I find it particularly interesting, and it's addressing a really, really real problem. So, yeah.

Yeah, data engineering is fun. Thank you very much.
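As a footnote to that answer, registering an Avro schema version with the Confluent Schema Registry from Python might look roughly like the sketch below, using the confluent-kafka client; the registry URL and the subject name are placeholders, not values from the talk.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Hypothetical registry URL and subject name.
client = SchemaRegistryClient({"url": "http://schema-registry.internal:8081"})

schema_str = """
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": ["int", "string"]}
  ]
}
"""

# Each call with a changed definition registers a new version under the subject.
schema_id = client.register_schema("student-value", Schema(schema_str, schema_type="AVRO"))
print(schema_id)

# List the versions recorded so far for that subject.
print(client.get_versions("student-value"))
```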