So thank you, and hi everyone, thanks for coming to our talk. We are super happy to be here, for the first time in Madrid and for the first time at the Big Things conference, and we are also super happy to talk about one of my favorite topics: data ingestion. I hope you will enjoy it. But before starting the talk, let me show you the problem we are trying to tackle.

Let's imagine a data project where you have to do some kind of data analysis, or where you have to train some kind of model. What is the most fundamental thing you need to start the project? Can you guess? This is a data conference, so: data, you are right. If you have data, that's half the success; you are in a very good position. After that, when you want to do some kind of analysis or modeling, you usually have to transform your data into a shape you can use for your chart or for training your model, and to do that, you typically write an ETL job.

So this is the nice ETL job I wrote. Usually I take some sample data and try to prepare for all of the edge cases based on that sample, so my ETL job can process it. So far so good: everything was fine for my sample data set. Now I try to do a backfill, because I need more data to be able to do any meaningful analysis. What usually happens? You run the backfill for one day, and as you can see here, you are well prepared: the data is in the correct format, everything is fine. When you run it for a week, some of the data may not be quite in the shape you were expecting. Say you get an integer where you expected a string: it's not quite right, but you can still process it, you're still fine. But what usually happens when you run over a longer time range is that you get data like this one, which your ETL job simply cannot process.

So what do you do? You have a deadline, or maybe because of this kind of data your pipeline gets clogged and you get an alert; the data engineering team wakes up early in the morning and has to fix it. What do people usually do? They just hack a fix in. That works once, but any time you have to work with this data set again, you will suffer the same issue. That's why, in our project, we tried to eliminate, or at least minimize, this kind of issue.

But first things first, let me ask you: who is familiar with Prezi here? Okay, a bunch of people. Just a few words about us: we have around 300 employees at Prezi, with offices in Budapest, San Francisco, and Riga. We have more than 100 million registered users, and so far over 3.5 billion Prezi views. And Prezi is not just presentation software anymore: our portfolio also includes an infographic tool called Infogram, and last week we announced Prezi Video, which tries to combine streaming video with your presentation. But this is a data conference, so let's move on to our data.
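To make that failure mode concrete, here is a minimal sketch of the problem; this is not our actual pipeline, and the field names and records are made up:

```python
# Minimal sketch of the brittle-ETL problem; hypothetical field names and data.

sample_day = [{"user_id": "42", "duration_sec": "31"}]                        # clean sample
backfill_week = sample_day + [{"user_id": 43, "duration_sec": "12"}]          # an int sneaks in, still survivable
backfill_month = backfill_week + [{"user_id": "44", "duration_sec": "n/a"}]   # unparseable value

def etl(records):
    total = 0
    for rec in records:
        # Written against the clean sample: assumes duration_sec is a numeric string.
        total += int(rec["duration_sec"])
    return total

print(etl(sample_day))      # fine
print(etl(backfill_week))   # still fine: the unexpected int was in another field
print(etl(backfill_month))  # ValueError: invalid literal for int() with base 10: 'n/a'
```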
At a high level, we have around two petabytes of clickstream data in our data lake, and this number grows by about one terabyte per day. As for how our data team looks: basically we have a data science and analytics team and a data engineering team, and that's all. We try to keep the data open, so in the company basically anybody can use and access our data platform, and as you can see, a bunch of people are using it: security, PMs, finance, developers, analysts; basically the whole company. And now I would like to pass the clicker.

Thanks so much. So, as he just said, everyone in the company uses data today, but I think Prezi, throughout its history, has always been keen on capturing data. When I joined, I arrived at a certain point in Prezi's data history. At the beginning it was a bit more, let's say, self-made: every developer who was building a new feature logged the things they believed were important. It wasn't in a structured format, and it was not perfect. By the time I joined, which was about three years ago, we had switched from unstructured to structured data, because there was more need for data, and we believed we had reached a more advanced level. What I would actually like to show you today is how we got there: what our starting point was, and how we progressed from the point where we believed that the way we collected data was not helping us with our needs anymore.

Here are the main challenges we had to face at that point in time, which was about one and a half years ago. Our approach back then was to log everything we had in the product: every user interaction, and also technical logs about what we were serving the user. Everything was logged. The thing was, as you can imagine, from an analyst's point of view it was very hard every day to find where some piece of information was. Of course, with this approach of logging everything, not everything was relevant or useful for your analysis. We had a log catalog, a catalog of everything we were capturing in the product, and unfortunately this catalog was not much in line with reality, because it often happened that somebody under time pressure logged a new event that was never registered in the catalog, so nobody knew about it; or some of the things in our dictionary didn't even exist in real life anymore, because in the end some feature had changed and nothing was there. So you can imagine the pain for us analysts, always trying to find the exact thing at the right time. One example: once we were trying to define a new error message for an event. It was an attribute of an event, and I was trying to understand which kind of error I could log to it. We had something like 27 error messages already on the list. So what should I log there?
Should I log a string or an integer to try to capture what the feature was trying to do? So yeah, this was not a very good position to be in. On top of this, not only did we log everything and everything was messy, but sometimes we also needed to see our events happening in real time. For example, when a new feature was released, we wanted to check in real time what was happening to those users. At the time, everything we shipped went to a central machine, so not all parts of the company had the ability to access this data: you had to query it via bash on the central machine, and not everyone could do that.

We also had a big lack of quality checks. As Thomas mentioned, sometimes things changed in one of the events: one of the attributes would switch, say, from being an integer to a string, and then, when we were summing things up, all the strings were left out, so our insights were not accurate, and of course people were a bit unhappy about the situation. And this was actually the less bad case: your ETL pipeline doesn't break and you can still do some analysis, but things were not perfect, actually far from perfect. In other cases the ETL broke completely, and we were stuck first fixing it, and only then getting things right and doing the analysis. So everything came down to the fact that before delivering an insight, as analysts we always had to do a million checks on our data: find the data, see if the data was right, transform it. Basically everything was about cleaning data back then. That is something every analyst has to do from time to time, but we always tried to reduce this time. So what did this mean for us?
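Here is a tiny illustration of that silent failure mode, with made-up numbers:

```python
# Silent failure: an attribute that switched from int to string upstream.
# Made-up records; this mirrors the summing problem described above.
views = [10, 25, "7", 3]   # "7" arrived after the type change

# A defensive sum that skips non-ints: nothing crashes, but the insight is wrong.
total = sum(v for v in views if isinstance(v, int))
print(total)   # 38 instead of the true 45, and nobody is alerted
```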
I remember one time we had a meeting where many important stakeholders were invited, for example our head of product, and we had to explain why we couldn't deliver the insights they were asking for. That was when we decided that we needed a change, and a change that happened both bottom-up and top-down: from the bottom, because we analysts had every interest in delivering something accurate and valuable for the whole company; but also from the top management, because we wanted their support, and they were all on board, since of course they wanted to benefit from the power of the data behind the hundred million users that we have.

So this new project started, and the project was about the birth of data: we wanted to completely reshape the processes and how we capture data. As you can imagine, it's a very, let's say, extensive project, and it can be very painful as well, because we had to go and ask each and every one of our dev teams to rethink how they log information and how they capture events. For them this basically meant scrapping a lot of the things they had done in the past and producing a completely new way of instrumenting our code base. But luckily, we again had support from the top on this.

The new project took place under the name Glass Box. Already with the name we wanted to express this clarity, this ability to look inside the box of our data transformation and understand what's happening, but also to have clarity on the data itself, in a way that everyone can look into the box: everyone from any business background in the company, product managers included, could go and get the data they need.

These were the clear project goals we listed back then. On one hand, we wanted to reduce the time to insight; we had gotten into this unbearable position where, to deliver something, we really needed to rethink it twice or three times. We also wanted to improve the quality and consistency of our logs. For example, since the product runs on different platforms, so I can launch my prezi on my phone, in the desktop app, or in the browser, we were logging events everywhere with different names.
So there was no consistency at all. If we wanted, for example, to establish that a user clicked on "next" in their presentation, we needed a unique name for that event, which we never had; we wanted enough consistency across platforms. Then, of course, as I said, we wanted to empower more parts of the organization to self-serve on the data, which was not possible at the time, because only the analysts had, let's say, the coding capabilities to take the data, transform it, look at it, and provide insights. For this we also introduced some new tooling, which we will speak about a bit later.

As members of the analyst team, we wanted to achieve a shift in our focus. We wanted to avoid the situation where we logged a lot of garbage, because out of garbage only garbage can come; so we wanted to avoid logging something wrong at the birth of it. Then, of course, we wanted to spend much less time on cleaning, on ETL modifications, and on new ETLs trying to fit all the edge cases we had. We were often in the situation where we didn't trust our own analysis and had to think twice or three times about how we had developed an insight and whether we were looking at the right thing. With all this, we hoped to free more of our time to deliver more valuable things to the whole company: more deep-dive analysis, more modeling, hopefully something that creates more value for the whole business.

At this step we introduced a new process called log management. Log management is another catalog, different of course from the previous log catalog, but this new catalog now plays a central role in everything we do about capturing events at Prezi. It lives under the principle of an enforced review process. This means that nobody in the company, no developer, is able to implement any new event if it is not approved and put into log management first; the catalog plays the role of referee in this process. Another process we introduced, to bring transparency: if I am a product person and I want to introduce a new feature, then I need to have clearly in mind, and be able to visualize, what the feature does, that is, what the user does to achieve something in the product. For example, we have a new feature which is an image search. The product manager responsible for this image search has to have a clear way to represent what the user does to achieve the goal of inserting an image they searched for on the web. For that, they build a visual representation we call a click flow, and the whole company can access these. We also introduced something called job story loops.
So we changed the approach of the analysis a bit as well. Every user has a series of steps they have to complete to achieve success with a certain feature. You can imagine this is important also when you do an A/B test, or when you do some improvement of a feature. It's a more user-based approach: it doesn't count how many times something happens, but whether the user is successful at doing something or not. With this in mind, all the new instrumentation we put in place was always aimed at capturing the information that shows whether the user achieved the goal.

Another thing we introduced is a better naming convention for our events, and a different way to track user intent compared to what we call technical events. For example, the editor of a presentation can be opened from many parts of the product. When the user opens it, we capture this intention with an event named "open editor". When we have actually served the fully loaded editor, we fire another event, this time a technical event, which is "loaded editor". You can see already how we name these differently, one in the present tense and one in the past tense. This helps, for example, when you have to make calculations about how much time passes between these two events, say if you want to know the distribution by browser because you want to optimize for a certain browser. Another thing you can achieve with this is that if you want to build, I don't know, an active-time calculation based on user events, then you can use only the user events, because that is the time the user actually spends with the product, and you can exclude all the other, technical events.

Another thing we wanted to achieve, again, was consistency: every event carries the same universal information on every platform. And another challenge we had to face, as Thomas mentioned, is that we have now become a multi-product company. This means we had to find a way to be consistent within a product, but also to be able to scale this enforced review process across products; we had to find solutions that achieve both goals. And again, each event needs an approval, which introduced a sort of gamified approach: we have a group of admins whom everybody has to approach very reverentially to ask for approval. But more generally, the goal is to encourage discussion about the best way to capture the information you need, and not to log everything again.
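To show why the intent/technical split pays off, here is a toy calculation of editor load time per browser; the event names follow the convention above, while the users, browsers, and timestamps are invented:

```python
# Toy illustration of pairing an intent event with its technical counterpart.
# Event names follow the talk's convention; the records themselves are invented.
from collections import defaultdict

events = [
    {"user": "u1", "name": "open editor",   "browser": "firefox", "ts": 100.0},
    {"user": "u1", "name": "loaded editor", "browser": "firefox", "ts": 101.8},
    {"user": "u2", "name": "open editor",   "browser": "chrome",  "ts": 200.0},
    {"user": "u2", "name": "loaded editor", "browser": "chrome",  "ts": 200.9},
]

# Pair each user's "open editor" with the following "loaded editor"
# and collect the load times per browser.
opens, load_times = {}, defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    if e["name"] == "open editor":
        opens[e["user"]] = e["ts"]
    elif e["name"] == "loaded editor" and e["user"] in opens:
        load_times[e["browser"]].append(e["ts"] - opens.pop(e["user"]))

for browser, times in load_times.items():
    print(browser, sum(times) / len(times))   # average editor load time per browser
```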
So this is the process part; I'll pass it back to Thomas now.

So, yeah. I will go a bit more into the technical details, because it's nice that the analysts are now happy, but we have data engineers, and I don't want to be woken up when we have an issue. So far you have seen that we have log management, which is nice, but on its own, if you don't enforce what's in log management, it isn't worth anything. So we started thinking about what to do and how we could enforce it, and we figured out that using log management we could generate some kind of schemas, which we could use later on to generate log lines and even to validate these events or log lines.

We were thinking about what kind of format we would like to support, and we decided to go with the Avro schema definition. Why did we go with Avro? Because it's pretty popular in the data world, so all the tooling supports it. An Avro schema is pretty simple, so it's very easy to generate and very easy to work with. Also, schema compatibility is part of the format, which means you can check it, and it is well defined what it means when there is a breaking change between two schemas.

So what we do: from log management, we generate a schema, which we basically load into Confluent Schema Registry. We were looking for a schema registry which is standard and which we could use for this use case, and this is an open source tool. It supports validating schema compatibility, we can easily store schemas in it, and it has a REST API, so it was very easy to work with.

So far I have been talking about one schema, but in reality we wanted to make sure that, whatever platform your event comes from, we enforce the platform-specific attributes. You can imagine that if you are using Prezi from mobile or desktop, it makes sense to have a device ID which identifies your device, but if you are coming from the web, there are other properties, like user agent information, that we want to capture, and we want to enforce that the web captures them.

How do we do that? We came up with the idea of generating platform-specific strict schemas: for every platform we generate a different schema, where we make sure all the platform-specific properties are enforced. Here is how it looks. As you can see on the screen, there is the platform type, which is an enum. Here the platform type is "desktop", which means that if you try to send in an event from the desktop which claims to be, say, "mobile", we would reject it. And as you can see, there is the device ID, which in this schema is mandatory; if this schema were for the web, this property wouldn't be there at all. So that's nice: we have platform-specific attributes, and we generate a strict schema for all of the platforms.
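As a rough sketch, a generated strict schema for the desktop platform could look something like this; Avro schemas are JSON, shown here as a Python dict, and the field names are illustrative rather than the exact schema generated from log management:

```python
# Illustrative strict Avro schema for the desktop platform.
# Field names are examples, not the exact generated schema.
import json

desktop_strict_schema = {
    "type": "record",
    "name": "OpenEditor",
    "fields": [
        # The enum pins the platform: an event claiming "mobile" is rejected.
        {"name": "platform_type",
         "type": {"type": "enum", "name": "Platform", "symbols": ["desktop"]}},
        # Mandatory on desktop; the web strict schema would not have it at all.
        {"name": "device_id", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "server_time", "type": "long"},
    ],
}
print(json.dumps(desktop_strict_schema, indent=2))
```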
But in the end we want to store everything under one event type, and we don't want to carry all of these strict schemas everywhere, so we came up with another concept, called the loose schema, which is the superset of all the strict schemas: whatever is mandatory at the platform level in the strict schema becomes optional in the loose schema. Also, as you can see, what was an enum earlier is now a string. Why did we decide on that? Because if you work with Avro, you know that adding or removing an enum value can break your compatibility. So we make sure we enforce the enum at the platform level, but we don't enforce those enums in the loose schema. And as you can see, the device ID is still there in the loose schema, but as a nullable value, so if the event came from the web, in the end it will be an event where the device ID is null. We also decided that the platform-specific properties have no compatibility requirements, so you can break a strict schema at any time, because we generate both the strict and the loose schemas. Because of that, we only check validity and compatibility at the loose schema level: since we generate all of these schemas from the same source, if the loose schema is fully compatible, we can assume the platform-specific ones will be too.

So now we have schemas in our schema registry, but as you heard earlier, we could convince the management; the harder thing is to convince your engineers to use the proper events. So we tried to make it as easy as possible. What we did: we reached out to the different platform teams and tried to find one champion on each who would basically build an SDK for us. What does that mean? They built an SDK which goes to the Confluent Schema Registry, gets the platform-specific schema, and starts to generate code from it. Let's say you want to capture the event "insert image". As a developer, after the SDK has generated the code, you will have an insert-image event method, and it will have only the parameters needed for that specific event, because all the platform-specific properties, and everything else an event needs, are captured by the SDK. We made SDKs for all the platforms: Mac, iOS, Windows, Android, and the web as well.

So we have the schemas and the generated code, and we are capturing the events from the different devices, but we need to send them back to us and store them somehow. How do we do that? The clients send JSON messages to a logging endpoint, where we enrich the messages: we add things like server time or user ID, because we don't really trust the client, so those have to be added later on our side. Then we send these JSON messages into a Logstash. If you don't know Logstash: it's a very basic tool where you can get data from various sources, do some kind of transformation if you want, and send it out to various other destinations. It's mostly used for sending data into Elasticsearch, but it was pretty useful for us because it's very easy to work with and very easy to extend. We added a so-called Avro codec, a feature to be able to validate the incoming messages against the strict and the loose schemas. So we get a JSON message, we validate it against the strict schema, and then we convert it into Avro using the loose schema, and that will be the end format. And as you can see on the top left, there is a server that skips the logging endpoint: if the message comes from some kind of back-end server, we don't force it to go through our logging endpoint, because we can trust how they enrich the message.
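Going back to the schema pair for a moment: the loose counterpart of the strict desktop schema sketched earlier might look roughly like this (again illustrative, not our exact schema):

```python
# Illustrative loose schema: the superset all events are stored under.
# Compare with the strict desktop schema sketched earlier; field names are examples.
loose_schema = {
    "type": "record",
    "name": "OpenEditor",
    "fields": [
        # The enum is relaxed to a plain string, so adding or removing a
        # platform value never breaks compatibility at the loose level.
        {"name": "platform_type", "type": "string"},
        # Mandatory on desktop, absent on web: nullable here, defaulting to null.
        {"name": "device_id", "type": ["null", "string"], "default": None},
        {"name": "user_id", "type": "string"},
        {"name": "server_time", "type": "long"},
    ],
}
```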
After that, we send these messages to Kibana, or rather to our Elasticsearch, because we want to have some kind of live stream of the incoming events. This is short-term storage, mostly for when people are interested in what the currently incoming logs are, want to do some basic analysis and see what's happening, or just want to check why a schema validation failed; they can check it in Kibana. We also send these messages to our Kafka: if the validation succeeds, we put the message into a log stream whose topic name is the event type; if the validation fails, we put it into an error topic. So we make sure our event streams are validated.

But it's not that useful to have the information only in Kafka; you want to store these events somehow. So we move the data into S3, and for moving the messages from Kafka to S3 we use an open source project from LinkedIn called Apache Gobblin. And we don't just blindly store the data on S3; we also do some transformation there. First of all, we flatten the event: if you have embedded properties, like the "body.event_name" you can see here, it will be changed to "body__event_name", because there are storages and back ends which can't really handle nested formats. Then we remove the PII. For example, we take the IP address and do a GeoIP lookup, so we enrich the messages with the GeoIP information but remove the IP address itself; that's one example. Then we wait for a while, and we do a deduplication and compaction on this data. What does that mean? If we get an event multiple times, we drop the duplicates. And if you have worked with Kafka and with storing data from Kafka to S3 or HDFS or wherever, you know how these tools usually work: they write down one file per partition per topic, which is not very useful if you later want to query it, because fewer, larger files perform better on S3 and in these storages.

We wait for a couple of hours, just to make sure all the events have come in, but if we still see that some log is late, say by one day, then we put it into a separate folder as a late event, because in that case we have to decide case by case what to do with it: if the data was important, we might need to rerun the whole data pipeline for that day, or maybe we just decide to drop it. Then we store the data in S3 in a partitioned format, as you can see, and we also register it into the Hive metastore, to be able to query it with Presto or whatever you want to use. As you can see, we used daily and hourly partitions; we have since gotten rid of the hourly ones, because it was too much hassle and too many files, so we just decided, okay, let's go only with the daily partitions. And when you have the data on S3 and in the Hive metastore, we can provide two things we never had before: we use Presto, and analysts, or really anybody, can use Zeppelin for ad hoc analysis; and we use Indicative as a funneling tool, which we'll talk more about.
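As a rough sketch of the flatten-and-deduplicate step (this is not Gobblin's actual code, just the same idea in plain Python, with a hypothetical event_id field as the dedup key):

```python
# Rough sketch of the flattening + deduplication described above.
# Not Gobblin code; "event_id" is a hypothetical deduplication key.

def flatten(event, prefix=""):
    """Turn nested keys like body.event_name into body__event_name."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def dedup(events, key_field="event_id"):
    """Keep only the first occurrence of each event."""
    seen, unique = set(), []
    for e in events:
        if e[key_field] not in seen:
            seen.add(e[key_field])
            unique.append(e)
    return unique

raw = [
    {"event_id": 1, "body": {"event_name": "open editor"}},
    {"event_id": 1, "body": {"event_name": "open editor"}},  # duplicate delivery
]
print([flatten(e) for e in dedup(raw)])
# [{'event_id': 1, 'body__event_name': 'open editor'}]
```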
Yeah, thanks. So, to go to the results part of our project: after this whole big change, we were finally able to see some changes in the way we work as well. For sure we reduced our time to insight, roughly by a factor of two, or something like that; we estimated it based on our Jira cards. Before, we were always going over the planned amount of time we wanted to spend on an analysis, but now we make it on time, sometimes even earlier. So this is something we can see from that point of view, and of course we no longer have to waste time checking whether the data is correct and where the data is.

Also, I can see that in the company we have about three times more users self-serving data: in a company of 300, I can now count at least 100 users serving themselves, whereas before, all the power was centered on our team and on a few other people, let's say, who had some coding skill to get some analysis out of the data. And as I wrote, we have almost no cleaning time. This is true, though of course there are other processes we are still going through, for example changing some of our core data sets to use the new logs; that is another part of the log management process that is still ongoing.

As a member of the data science and analytics team, what we achieved was, as I said, that the time spent on cleaning went down, and of course we don't have interrupts now. For example, with many more people able to self-serve data, they don't come to us and say, oh, can I please know what the distribution of browsers is on this particular event or in this particular process; they can find it out themselves, because with the new structure and the new tools that we have in place, the answer to their question is just a couple of clicks away. And then, of course, we managed to shift to a more strategic role, where we could kick off more data-driven projects, initiatives coming from us and not only from the product side.

And, as I mentioned, for the whole company the benefit is that we now have Kibana, so we don't need to go to bash and query the live logs. Everyone can go in there, for example when a feature is released, and check how many users have used that feature and whether the logs are correct. When we released the new Prezi Video product, for example, we were able to pull, almost in real time, the list of new videos created, and everybody in the company could consult this list. We still use Zeppelin as our notebook; there, some coding skill is still needed. We use it for our analysis, we can plug in packages and everything, so it's a good tool for us. Sometimes we also share a notebook with stakeholders: they just need to run it through, and they have a refreshed version of the data in there. And then we use Indicative, which is another third-party tool, nowadays used very much in the company. It's a super easy way for a product manager or any other business stakeholder to put together a user flow, a kind of funnel; this is one example that we put together of how it looks. They can understand drop-offs in the job story loops we were speaking about before. So if they want to optimize a certain feature, they can, for example, look at their A/B test in Indicative as well, and check which variant performed better. So yeah, this is what we could achieve for the whole company, and why
this was beneficial. Also, of course, we increased our data transparency, because now, with everything logged consistently across platforms, there are way fewer questions coming up, even from the non-data stakeholders, like: is this the real event I'm looking for? For example, we have descriptions as metadata on the events, so when they go into Indicative, they always have a description as part of the event, so they always know what they're looking at. And of course, as I said, we decreased a lot the time spent on catching bugs: if something happens, we have input checks and output checks, so we can go and raise it to the relevant team. And usually we don't allow something wrong to happen, because of log management: if an event has approval, then it most probably comes in correctly, in the right format. And there are a couple of next steps, which Thomas will talk about.

Oh, yeah. So we can't stop the work on this project, because, as you have heard, we still have things to do: not the whole product is instrumented yet, and we would like to extend it and really get rid of the old event collection, the old data ingestion pipeline. So far we can't, and it's a pain for us now to support both platforms, but because everybody sees the benefit, we do hope that soon we can move away from the old one and fully move to this new data ingestion platform. And we don't just want to support analytical events; we would like to use the same platform for different use cases as well. For example, we can already see the infra team collecting events from the CI system, which is not connected to our product at all, but people have started to use it more and more for internal tools as well, to analyze those products too. And of course we have Kafka, and we have very clean data events, so we would like to use it more for real-time triggers: even triggering things in the product, so if you do something, we can, I don't know, do something based on the incoming event in real time. So this platform opened up a bunch of new use cases for us.

And this is where we are, and we are happy to hear any questions you have. Thank you. Thanks. Any questions? Were we clear enough? If there is no question but you still want to ask anything, we will be at the party for sure, so maybe next to a beer. Feel free to ask really anything, we are happy to help, or feel free to challenge us as well. So, thank you. Thanks.