Hello everyone. It's great to be here with you today. I'm Arakashinsky and I work at Allegro, the biggest Polish e-commerce platform, a place where businesses and customers can buy and sell almost anything. Today I would like to talk with you about data structures with Avro. So let's start.

A short agenda for today. First, what is Avro? I would like to give you a brief introduction to the topic. Then, how we are using Avro in our data platform. Then I would like to share three things we initially got wrong and learned how to manage. Finally, two use cases where Avro structures are really useful in our data stack.

So, what is Avro? Avro is a data serialization system and an Apache project, and its purpose is to give information a structure. First of all, with Avro you define a schema for your information, so you can say "this is a string field, this is an integer field". We will see an example later, but it is a schema in the same sense we know it from databases or from XML.

Once we have a schema, we can start thinking about a data contract. That contract is between whoever produces Avro messages and whoever consumes them, so it is the baseline that connects both sides of the pipeline; we will show this pipeline later. And if we are talking about a data contract, we have to define the schema somehow. In Avro this is quite easy, because schemas are defined in JSON. JSON is popular, everyone can write a structure in JSON, and it is easy to maintain, to review, and so on.

What's more: schema evolution. Over time, when your project goes live and your requirements, for example from the business side, keep growing, you have to manage the evolution of your schema somehow, and this is natively supported in Avro.

Another very important thing about this data format is that Avro is widely adopted across the big data ecosystem. To name just a few tools: native Hadoop, Hive with its SQL, Spark, Presto. With so many tools supporting it, it is quite easy to interchange data in Avro format across many systems; we will see an example of this later. Also important: Avro is exchanged between endpoints in binary form, and for long-term storage it is also stored as binary, so you reduce the size of the frames on the wire, which matters a lot for today's applications.

I said before that an Avro schema is quite easy to maintain and to describe, and here is an example of what an Avro schema looks like. The most important thing is that the schema is divided into two sections. The first part, at the top of the schema, where we have type, name, namespace and doc, is something like a fully qualified name for the schema: it is the main identifier of the Avro schema. We use the name and the namespace to make schemas unique across our whole data ecosystem. The second part of the schema is strictly related to the fields we want to define in our data structure. In this example we have two fields: a field "id" of type integer and a field "username" of type string. What's more, every field in an Avro schema can additionally be described with a documentation field, which is very important; we will show how we use this in later stages of this presentation.
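To make this concrete, here is a minimal sketch, in Java with Avro's standard Schema.Parser, of what a schema like the one on the slide could look like. The field names id and username come from the talk; the record name, namespace and doc texts are only illustrative.

```java
import org.apache.avro.Schema;

// A minimal sketch of the schema described on the slide; names other than the
// "id" and "username" fields are made up for the example.
public class UserSchemaExample {
    public static void main(String[] args) {
        String schemaJson = """
            {
              "type": "record",
              "name": "User",
              "namespace": "com.example.events",
              "doc": "Example user record",
              "fields": [
                {"name": "id", "type": "int", "doc": "Internal user identifier"},
                {"name": "username", "type": "string", "doc": "Login chosen by the user"}
              ]
            }""";

        Schema schema = new Schema.Parser().parse(schemaJson);
        // type + name + namespace form the schema's unique, fully qualified identifier
        System.out.println(schema.getFullName());
        // the fields, each with its type and documentation
        System.out.println(schema.getFields());
    }
}
```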
That is a basic Avro schema and, briefly, what we can do with Avro. So, how are we using Avro in our data stack?

This is a big overview of our data platform, and there is nothing unusual in this picture. We have a messaging platform at the center of the communication. It is based on Hermes; Hermes is a front end and back end for Kafka, which sits underneath. Kafka you probably know: it is a classic pub-sub solution, data is stored in topics, and information flows through those topics between whoever produces a message and whoever wants to consume it. So we have the messaging platform in the middle, at the heart of the data platform. From there we pull the data into our internal Hadoop cluster. On the other hand, all messages also go to the world of microservices. So the same data, produced at this point as Avro or JSON, is exchanged between at least two endpoints: one endpoint is the Hadoop cluster, the second one is the microservices world.

And how does it work, where is Avro actually used? There is a very important block on this diagram called the schema registry. It is the place where we put all the schemas our developers create in order to produce messages on the stack. Whenever we want to push new messages, we first register the schema in the schema registry; we will come back to this part of the graph later. A user can then send messages to the messaging platform either as Avro or as JSON; that is the user's decision.

So where is the whole magic related to Avro and the schema registry? The magic is that whenever a user tries to send a message to the messaging platform, whether in Avro or in JSON format, the message is validated on arrival by Hermes against the schema the user had to register earlier, during application development. So any message that lands on the messaging platform must have been validated against a schema previously registered in the schema registry. The schema itself is also checked; that is something we will discuss later. That is the big overview. Oh, and there is also BigQuery on the screen; we will talk about that in the use cases section.

So that is the big picture of our architecture. Now let's see how the microservices world can benefit from Avro schemas. As we mentioned at the beginning, every message must be validated against a schema held in the schema registry, which is a simple web application serving schemas over HTTP. Whenever a message arrives at the data platform, at this point, it is validated, and it must be correct from the schema's perspective. Once the message has arrived at the messaging platform, how do consumers get it and stay sure that it is correct and that they are reading exactly what the producer wrote? All microservices on the consumer side, over here (I hope it is visible to you), also use the schema registry to fetch the schema, so there is no need to exchange any additional information about the data structure.
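The talk does not show Hermes' internal validation code, but as a rough approximation of the idea described above, a service holding the registered schema can simply try to decode the incoming JSON payload against it with the plain Avro Java API and reject the message if decoding fails. The schemaJson and messageJson parameters are assumed inputs here.

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class MessageValidation {

    // schemaJson: the schema previously registered in the schema registry
    // messageJson: the event a producer is trying to publish
    static GenericRecord validate(String schemaJson, String messageJson) throws IOException {
        Schema schema = new Schema.Parser().parse(schemaJson);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        // Decoding against the schema fails fast (throws) when the message
        // does not match the registered data contract.
        return reader.read(null, DecoderFactory.get().jsonDecoder(schema, messageJson));
    }
}
```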
So, when a developer creates an application and wants to use data from the messaging platform, the first thing they do is pull the schema from the schema registry and build the application on top of that schema; this plugs easily into the Java ecosystem. After that, developers just do what they have to do; they do not need to wonder what the data structure looks like, how to struggle with it, or how the producer described this message. A schema defined once during development is therefore used in at least two places (we will bring in a third in a minute): first at the moment of message production, on the producer side, and then, on the other, left-hand side of the graph, whenever the message is consumed, where the same schema is used again to read the message and its content. That is the microservices world with our schemas and how it works.

But as this graph showed, there is also an analytics world, and this is where analytics starts to benefit from Avro schemas. We had to evolve away from the JSON world: almost three years ago, every message in our data platform was stored as JSON. Then we decided to move on, and here is how that transformation pays off for our analytics.

This is an example of the historical and the current state of data structures in our Hadoop platform: two table definitions from the Hive metastore. The first definition is based on JSON, so the whole message is a plain string; each message is stored as JSON in a single string column of the table. In the lower part of the screen you have a table based on the Avro definition. So the schema a developer creates is not only used to produce and consume messages; it is also used in the analytics world to communicate what the data looks like. This is one of the biggest benefits of Avro: it is not just for Java applications, it also drives the permanent storage in the Hadoop platform and in BigQuery, which we will talk about later.

I don't know if you remember the documentation field from the Avro schema example. In the third and fourth columns there is a comment field, and this is exactly where that documentation ends up. So when analysts go to explore the data, they now have not just structures and field names; they do not need to guess what the author was trying to say, because they have an explicit data schema with types, field names and documentation for each field. This is where data exploration becomes much easier.

Previously we saw how the table definitions look; this example shows how the analysts' work changes after switching from the unstructured, JSON world to Avro. In the first example we had to use special functions on the Hive side to extract fields from the JSON stored on HDFS. In the lower example we have the same structure, but held in Avro format. As you can see (and sorry for the delay, we are back online), this is the difference between data held as JSON strings and data held in Avro format. I hope the change is quite visible.
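Coming back to the producer and consumer sides for a moment: the talk does not show the application code, but a minimal sketch of that round trip with Avro's generic Java API could look like the following, with the producer serializing to compact binary and the consumer reading it back using nothing more than the schema fetched from the registry. All names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        // In the platform described in the talk this schema would be fetched
        // from the schema registry; it is inlined here to keep the sketch runnable.
        Schema schema = new Schema.Parser().parse("""
            {"type": "record", "name": "User", "namespace": "com.example.events",
             "fields": [{"name": "id", "type": "int"},
                        {"name": "username", "type": "string"}]}""");

        // Producer side: build a record and encode it to compact binary.
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 42);
        user.put("username", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Consumer side: the same schema is enough to read the bytes back;
        // no structural information travels with the payload itself.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord readBack = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(readBack);
    }
}
```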
It simply reduces the time it takes to get to your data, and faster data exploration means less time to get to real insight from that data. But how do you achieve this? With tools that are fully available in the open-source big data stack. The whole magic is here: this is the one statement that brings all of it into your data ecosystem. It is a table definition for Hive, for the Hive metastore, and we do two things explicitly in it. First, we tell Hive that we want to store our data as Avro. Second, and very importantly, we tell it that the Avro schema for this data is held in an HTTP service somewhere in our stack. And that is all. When you are already producing messages in Avro, this is a very straightforward way to get your data held like this across the whole ecosystem. If you want to take one thing away from this presentation, this is probably the most important one: the Hive metastore is the place where you can define one table definition based on one Avro schema, and all the magic is done for you.

What's more, the Hive metastore is a very popular data catalog for the whole big data ecosystem, so this table definition is not only used by Hive; it is also used by Spark and by almost all of the Hadoop tools in a modern stack. It is also compatible with Presto and Apache Ignite, and you can probably even connect the Hive metastore with Elasticsearch or Druid. So the Hive metastore is a very powerful tool, and this is one example of how you can benefit from a central metadata catalog combined with data stored in Avro structures.

Everything looks simple on the slides, but there are a few things you have to keep in mind, and this is what we had to learn during the whole struggle with this data transformation. One of them was to create our own set of schema validators. They are embedded in our internal schema registry: small components responsible for repairing the imperfections of the ecosystem. Whenever a user tries to register a new schema, those validators are run against the developer's schema, and the schema is checked to see whether it is suitable for our ecosystem.

What do we check? First of all, documentation: whether the documentation field is filled in for every field. This helps not only developers but also analysts during data exploration. We also check the subject name; this protects the uniqueness of schemas across the whole ecosystem. We have around ten or twelve validators in production; these are just examples. A third, strictly technical validator checks how enums are extended: you may extend an enum only by adding new symbol definitions, never by changing the middle of it. Avro stores enum symbols by their position, as integers, in the binary representation, so if you put something into the middle of an enum definition, the meaning of data already written can change over time; new symbols may only be appended at the end of the list.
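The real validators live inside Allegro's internal schema registry and are not shown in the talk; as a minimal sketch of the two checks just described (every field documented, enums extended only by appending), something like this could be run whenever a new schema version is registered:

```java
import java.util.List;
import org.apache.avro.Schema;

// Sketch only: the validator set described in the talk is internal and not shown.
public class SchemaChecks {

    // Every field of a record schema must carry a non-empty "doc",
    // so analysts can explore the data without guessing.
    static void requireDocumentation(Schema recordSchema) {
        for (Schema.Field field : recordSchema.getFields()) {
            if (field.doc() == null || field.doc().trim().isEmpty()) {
                throw new IllegalArgumentException("Missing doc for field: " + field.name());
            }
        }
    }

    // Enum symbols are encoded by position, so a new schema version
    // may only append symbols at the end of the list.
    static void requireEnumAppendOnly(Schema oldEnum, Schema newEnum) {
        List<String> oldSymbols = oldEnum.getEnumSymbols();
        List<String> newSymbols = newEnum.getEnumSymbols();
        boolean appendOnly = newSymbols.size() >= oldSymbols.size()
                && newSymbols.subList(0, oldSymbols.size()).equals(oldSymbols);
        if (!appendOnly) {
            throw new IllegalArgumentException("Enum symbols may only be added at the end");
        }
    }
}
```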
So that is an example of our validators.

The second lesson we had to learn: give tools to your developers. This was missing from our stack at the beginning, and we had to provide at least a few things; the most important ones are on the screen. First of all, there is a converter between Java classes and Avro schemas that works in both directions. When a developer is building something, they can first describe the data definition as a Java class, convert it to an Avro schema, and put that schema into the schema registry. On the other hand, when a consumer wants to read some data, they fetch the Avro schema from the schema registry, generate a Java class from it, and plug it into their Java application. The whole work is done in one place, and everyone talks about the same data definition; this is what we can start to call a data contract, the thing we mentioned in the first slides.

There is also no better way to show someone how things work than by example, so we created example services for our developers: a full example application showing how to work with Avro, how to write a schema, and how to deploy that kind of application to production. It was very useful for our developers. And, probably most obvious but not always done: give people a place where they can exchange information about what they are doing. For us as data engineers it was also the place where we learned about a lot of corner cases, and thanks to that we could prepare validators for unusual situations and keep data flowing properly across all the tools and places of our data platform.

The third lesson: prepare for exceptions. In the very first draft of our solution we treated all data and all schemas with the same policies, but in the real world that does not hold. We spent a lot of time creating exception policies in many places of the ecosystem. For example, some users do not want to store their data centrally on HDFS; they just want to push data through the messaging platform, and for such users not all of the validators make sense. This consumed a lot of our time, and it is something we had to learn over time.

OK, so now we know a few of the things we learned. But where is the real benefit of these data structures, of all this work of managing data structures in our stack? The first example is related to GDPR. You are probably familiar with this European Union law on data privacy. One of the biggest concerns for us was the right to be forgotten. This right allows a user to say: "I would like to be deleted from all your systems; I do not want you to store any data about me." It was our task, as the data engineering team, to solve this problem on the Hadoop platform, and Avro was where we could manage it. How did we achieve this? You probably remember the example Avro schema from the first slides; this is a slightly changed version of it. We decided that, since our developers already describe their data, why not use that information: extend the Avro schema and add custom fields describing personal data.
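The talk does not show the exact form of these extensions; one plausible sketch is to attach a custom attribute to the relevant fields, since Avro keeps unknown field attributes available as properties after parsing. The attribute name user_data_type and the values USER_ID and USER_NAME follow the names used in the talk, but their exact spelling and placement are assumptions.

```java
import org.apache.avro.Schema;

// Sketch of a schema annotated with personal-data markers; the attribute
// format is an assumption, only the idea comes from the talk.
public class PersonalDataFields {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse("""
            {"type": "record", "name": "User", "namespace": "com.example.events",
             "fields": [
               {"name": "id", "type": "int", "user_data_type": "USER_ID"},
               {"name": "username", "type": "string", "user_data_type": "USER_NAME"},
               {"name": "eventType", "type": "string"}
             ]}""");

        // Custom attributes survive parsing as field properties, so a downstream
        // job can discover which columns hold personal data purely from the definition.
        for (Schema.Field field : schema.getFields()) {
            String personalDataType = field.getProp("user_data_type");
            if (personalDataType != null) {
                System.out.println(field.name() + " holds " + personalDataType);
            }
        }
    }
}
```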
Please keep in mind that this is only the data definition, not the data storage. So, at the level of the data definition, we added extra attributes that say what kind of private, personal information is stored in a given topic or data set. We added an attribute called user data type; in this example we have two values, user ID and user name.

And how do we use this to remove all personal data for a specific user from our ecosystem? We index this data. Our data sets are partitioned by two factors: first the Kafka topic, which is represented as a table in Hive, and second the date. So we go through all the data in the ecosystem, and thanks to the user data type attribute we know where our developers put each piece of personal information. We index and collect all of this in a Hive table which we call the GDPR index table. Whenever a user comes to us and says "please remove all my data from your ecosystem", we just look that user up in the index. We search for the user ID, and because we built the index beforehand we know where all the information related to that user is held. Based on that, we run a job which finds those records directly in storage and simply clears those fields. That is how we used Avro to satisfy GDPR for our company, and it is one example of where we benefit from Avro.

The second example is a bit simpler, but very important. We are trying to use public cloud providers to expand our capabilities, especially in data analytics, and we decided to use Google Cloud Platform and BigQuery for this task. So where does Avro help us talk to BigQuery? There is a very straightforward diagram for this. On the right side of the screen is our internal Hadoop cluster with the data; you know this picture with the data definition in Hive, where the columns are defined by the Avro schema and we have types, documentation and descriptions. How do we transfer all of that knowledge to BigQuery, to give analysts a really good place to work? We copy the raw Avro files to Cloud Storage, and then we load this data into BigQuery. On the BigQuery side, on the left of the screen, we get exactly the same representation as in our internal, on-premises Hadoop cluster. So in this scenario Avro enabled us to offer the same user experience, in terms of data structures, in our own data center and in BigQuery. To use BigQuery, our analysts basically only need to change where the data is read from, and the SQL is usually very similar between our Hive or Spark SQL and BigQuery. The structures known from our data center are the same structures on the BigQuery side. It was a huge game changer for adopting BigQuery, and the whole Avro and data structures part was, I must say, a huge benefit in expanding to BigQuery and using that additional capacity for data computation.
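The talk only says that the raw Avro files are copied to Cloud Storage and then loaded into BigQuery. As a hedged sketch of that second step, assuming the google-cloud-bigquery Java client (bucket, dataset and table names are hypothetical):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sketch only: loads Avro files already copied to Cloud Storage into a BigQuery table.
public class LoadAvroIntoBigQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        LoadJobConfiguration load = LoadJobConfiguration
                .newBuilder(TableId.of("analytics", "user_events"),
                            "gs://example-bucket/user_events/2019-05-01/*.avro")
                // BigQuery reads the schema embedded in the Avro files, so the
                // table structure mirrors the Hive definition on the internal cluster.
                .setFormatOptions(FormatOptions.avro())
                .build();

        Job job = bigquery.create(JobInfo.of(load)).waitFor();
        System.out.println("Load finished: " + job.getStatus());
    }
}
```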
So, I think it's time to ask the question from the title of this presentation: is it worth it to fight for Avro? Yes, but I think we have to widen that answer. It is worth investing not only in Avro but in any kind of data structures. Later today and tomorrow you will hear a lot about machine learning, data analytics, data science and so on, and using data structures, named structures with fixed types that are easily consumed by the users of your ecosystem, will be a huge enabler for machine learning, data analysis and data science work. So in my opinion, yes, it is worth moving to the world of data structures. Thank you, that's all from me. If you have any questions, I'm here for you.

Hi, can you hear me? So, did you investigate any columnar formats, RC-style formats, rather than a row format like Avro, which is not so good in performance when it comes to analytics?

Great question. Avro isn't the best-performing data format ever; you're absolutely right. But there is a balance between performance and wide adoption across all the tools we are using, and that was the main reason we chose Avro: there is a schema, and there is wide adoption across all the tools. We know the performance isn't great, but we accept this, and when we need higher performance on the analytics side we convert the Avro data to, for example, Parquet, depending on the use case. And because we have the Avro schema, it is far easier to convert the data structure into a different one. That was also one of the reasons we chose Avro: it is our base form of the data, and then we can transform it into whatever we want to boost our analytics.

Hello. Which software do you use for the schema registry, or did you build it yourselves?

We have our own implementation of the schema registry, but there is an open-source tool, written by Confluent as I remember, and you can freely use it to hold your schemas.

What about Hermes? Could it be a substitute for Kafka Connect?

Yes, it might be a substitute for that solution. Hermes is our tool and it is open-sourced, so you can look online at how it works. We probably created Hermes before Kafka Connect existed, which is probably why we have our own solution to cover everything related to Kafka. The most important thing about Hermes in our platform is that the whole communication is based on HTTP. It is fairly easy to use, so we just hide Kafka underneath Hermes; that is why we use Hermes in this solution.

Hello. Based on your experience, when you are defining a schema, do you prefer defining the schema from scratch or using Java class converters?

It depends on the use case. Developers who use Java in their daily work usually use the converters. I'm not a Java guy, so I define schemas myself; it depends what you prefer.

If you had to recommend one way? In some places they recommend defining the schema from scratch, because some Java converters may not do things well.

Yes. Converting between a schema and a Java class in both directions is not a trivial case, as you mentioned, and we have a custom converter that is aware of our ecosystem and of the data structures we want in production. If you're asking me for a recommendation: if you're a Java developer, use converters from the stack or build one yourself.
When you're someone who works with data structures in the Hadoop ecosystem or the analytics world, you will probably prefer to define the schema yourself. Thank you. OK, if you'd like, you can ask me anything over these two days. Thank you for your time. Bye.