All right, so we're going to talk to you about data and big data; you know, we are at Big Data Spain. We're going to talk about two things: one is more about culture and methodologies, how to work in data teams, and the other is more technical, about technology and implementation. My name is Jose Manuel Navarro, I'm the CTO of urbanData Analytics, and this is Luis, my teammate and lead data engineer. I will do the culture part and he will do the technology one. So first, some context about ourselves, our company and our status; then the culture solution; then the technology we are implementing; and finally the end.

So, who is urbanData Analytics? We are a company that builds tools and data to improve the decision-making process of buying your home, so you don't screw it up as I did in my previous life. You know, buying a house is an important decision: do it well, for your own future. We build tools like these: we make a lot of maps using a lot of technologies, we do data engineering, we do a lot of things, all related to the real estate industry. We try to be at the very top of the analytics pyramid (descriptive, diagnostic, predictive and prescriptive) and deliver prescription to our customers, our users. That prescription looks like: should I buy this house, or should I rent it?

Okay, what does our data look like? Our data is arranged along three dimensions, three axes. The first one is where: where is the data located? Where is the apartment you want to buy, where is the flat you want to rent? In this dimension we have different administrative units (the country, the state, the city, the neighborhood), so we can go up and down along it. The second dimension is when: when does the operation happen? When do you want to buy, when do you want to rent, and for how long, maybe years, maybe months? This dimension can also be arranged in different units: you can make an analysis by year, by quarter, by month. And the last one is what: what do you want to buy? Do you want an apartment? Do you want a garage? What do you want to sell? And, for the same typology of asset, what does it look like: is it a small two-bedroom? So one of the typical questions we want to answer is this: how many two-bedroom apartments were sold last month near Plaza Mayor? In that question we have the three dimensions: what you want, where you want it and when you want it. Okay, that's the context.
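To make that three-axis question concrete, here is a minimal sketch of how it could be answered with pandas. This is not our actual code; the file and column names (asset_type, bedrooms, neighborhood, sale_date) are hypothetical, just to show the what / where / when filters:

```python
import pandas as pd

# Hypothetical transactions table, one column per axis of the question.
transactions = pd.read_csv("sales.csv", parse_dates=["sale_date"])

sold = transactions[
    (transactions["asset_type"] == "apartment")                      # what
    & (transactions["bedrooms"] == 2)                                # what (typology)
    & (transactions["neighborhood"] == "Plaza Mayor")                # where
    & (transactions["sale_date"].dt.strftime("%Y-%m") == "2018-10")  # when
]
print(len(sold))  # how many two-bedroom apartments were sold that month
```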
And what does our team look like? Our team is very small, we are a 15-person company, but we have a wide range of profiles, very different people working in our company: domain experts, business analysts, marketers, data scientists, GIS engineers and software engineers. So for a very small company, this is a very diverse ecosystem. And on a typical day in our company you can hear something like: "Okay guys, please send this data to the user." That's a fair request, but depending on who hears it, it gets interpreted in different ways. The domain expert interprets it as: okay, I'm going to prepare a PowerPoint presentation, I will go to their offices and present it to them, and the deliverable will be something like this. The business analyst, for the same question, has a different understanding: I will create an Excel file and I will send it by email, and it looks like this. The marketer says: I will send a newsletter, that would be great, and he sends the newsletter. The data scientist says: wait, I'm going to prepare a notebook, I will do my analysis and send the results, and he prepares all that stuff. Then the GIS engineer says: hey, I have a map, we are going to present that data on a map, that will be great, and they send the map. And finally the software engineer says: I've got an API, just send them the API, and the deliverable will be something like this.

So for the same question we get very different answers, and we have a big Tower of Babel: we sometimes don't understand each other, we sometimes have a different understanding of the same fact, of the same question. All of the answers are valid, none of them is invalid; it's just a different interpretation of the question. But for us this is an issue, and a challenge to solve.

So what did we have at the beginning of this process? We had a lack of understanding: same question, different interpretations. Everything we did was ad hoc: we created analyses again and again and again for different customers and use cases, and we reused almost nothing. We struggled with repeatability: we didn't manage to reuse one analysis for the next contract. The results were inconsistent: we would try to do the same thing the next week and the result wasn't consistent. And everything was manually verified, so we spent a lot of hours checking and checking and checking our results to make sure they were valid. All of this leads to high response times, frequent errors and a lot of hours spent on the work. The net result: it is very difficult to scale. So we had a very big problem and we needed to solve it.

The revelation for us was these two books. The one on the left is written by some real industry gurus: Michael Stonebraker is the creator of Postgres, so that's enough for me, and the other authors were at Facebook, where they created the data platform, so I think they know what they are talking about. For us it was a revelation because it gave us a new point to move towards. That point is a new buzzword in the industry called DataOps. You know, buzzwords are great, but there is a lot of meaning behind some of them. DataOps is a new methodology to implement in data teams. Some definitions: Wikipedia says that DataOps is an automated, process-oriented methodology used by analytics and data teams to improve the quality and reduce the cycle time of data analytics. That sounds great and fits our challenges very well. A different definition: DataOps is an approach to building a self-service data platform. That's also great, because we receive a lot of internal requests asking for data, asking for analyses, asking for things, and we want to be able to answer: okay, do it yourself. And another one: DataOps is a methodology consisting of people, processes and tools, three things aligned in the same direction, for an enterprise to rapidly, repeatedly and reliably deliver production-ready data. That's a very nice definition. DataOps is something that is still being created by the industry; it's not a single thing, and there are different interpretations of what DataOps is.

The main ideas for us, after reading these books, are: everything automated, or most things automated; everything rapid, putting real effort into speed; quality and reliability, making sure you can have confidence in your data and your processes; everything repeatable, because you cannot repeat the whole process by hand over and over again; and finally self-service: don't be a blocker, don't require others to ask you for things. That's the point: we want it all, and we want it now. It's very difficult for a data department or a data team to have all of these things (repeatability, quality, speed, everything automated, and in a very short time), but that is the approach DataOps aims for.

DataOps comes from joining three different disciplines or worlds of the industry: the data world, software development and infrastructure. As you can probably guess, DataOps comes from DevOps. In the intersections between those worlds live other disciplines and best practices, like business intelligence, big data and DevOps, and DataOps tries to gather everything together, take the best practices from all of them, and put them into one unified methodology for data analytics teams and their challenges. So DataOps sits in the middle of everything.

So, going to the point: the key practices. You have the culture, and then you need concrete practices to implement in your teams. These are the practices; we'll go through all of them very quickly. First, productize all your processing. You cannot write one piece of code for one analysis and then, the following week, write the same code again. For instance, when you do an analysis, don't hard-code the parameters, the input values: always write abstract, generic code that asks for parameters. Very basic, right? But a lot of analyses are still done the other way. The goal is to reduce your code: if you write three lines of code, make sure those lines can be reused, don't throw them away.
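As a minimal sketch of what "productized" code looks like, assuming a toy average-price analysis (the file and column names are hypothetical): every input value arrives as a parameter, so the same script serves the next contract too.

```python
# Sketch of the "productize" idea: the analysis takes parameters instead
# of hard-coded constants, so it can be reused for any city and year.
import argparse

import pandas as pd


def average_price(data_path: str, city: str, year: int) -> float:
    """Generic analysis: works for any city/year, not just one contract."""
    df = pd.read_csv(data_path, parse_dates=["sale_date"])
    subset = df[(df["city"] == city) & (df["sale_date"].dt.year == year)]
    return subset["price"].mean()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True)
    parser.add_argument("--city", required=True)
    parser.add_argument("--year", type=int, required=True)
    args = parser.parse_args()
    print(average_price(args.data, args.city, args.year))
```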
Second: test your data. Very basic as well. Our analyses, our jobs, are mostly transformations: we take some input data, transform it, and produce some output, whether that output is other data, visualizations or whatever. Test the input data and test the output data. That's very basic, it's common sense. And also test the code: use unit testing, integration testing, whatever, but make sure your code does what it is meant to do. In a real job you don't do only one transformation, you have a pipeline of transformations, a chain. Again: test the input data, test the output data and, in the middle, the intermediate steps, so you make sure every step is going right. And finally test your code, make sure it does what it needs to do. Run your tests frequently and automatically: you shouldn't need to push a button, and you shouldn't need someone to remember "hey, I have to run my tests". Put it on a platform and run them constantly; we have it on Jenkins.

Third key practice: measure your process. At every step of your pipeline you are doing a lot of things: you are reading some data, you are producing other data. Collect metrics at all of those steps. A very basic metric is the row count: just count the rows at the beginning and at the end of the transformation, and store it. Once the metrics are stored, you can visualize trends, or spot a subtle change in the data, or values out of the expected range. That is a very basic way to detect problems early.
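To make these two practices concrete: first, a minimal sketch of testing the input and output of one transformation. The schema and the sanity limits here are hypothetical.

```python
# Sketch of "test your data": assert properties of the input before the
# transformation and of the output after it.
import pandas as pd


def check_input(df: pd.DataFrame) -> None:
    assert {"price", "surface", "city"} <= set(df.columns), "missing columns"
    assert len(df) > 0, "empty input"
    assert df["price"].notna().all(), "null prices in input"


def check_output(df: pd.DataFrame) -> None:
    # Values outside a sane range usually mean a broken upstream step.
    assert (df["price_per_m2"] > 0).all()
    assert (df["price_per_m2"] < 50_000).all()


def transform(df: pd.DataFrame) -> pd.DataFrame:
    check_input(df)
    out = df.assign(price_per_m2=df["price"] / df["surface"])
    check_output(out)
    return out
```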
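And a minimal sketch of the row-count metric, assuming the metrics are simply emitted as log lines that something else collects and stores:

```python
# Sketch of "measure your process": record the row count before and
# after every step, so trends and sudden drops become visible.
import json
import time


def with_row_count_metric(step_name, step_fn, df):
    """Run one pipeline step and emit a simple metric for it."""
    rows_in = len(df)
    result = step_fn(df)
    metric = {
        "step": step_name,
        "timestamp": time.time(),
        "rows_in": rows_in,
        "rows_out": len(result),
    }
    # In a real pipeline this would go to a metrics store; a log line
    # shipped somewhere is enough to start with.
    print(json.dumps(metric))
    return result
```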
Next practice: version control. Very, very basic, but you would be surprised how many data scientists and data teams still don't use version control. For instance, you still get emails with data and source code attached. Oh my god. Put it in the repo, make a commit and send me the commit. Represent everything as code, as much as possible. Represent data modifications as code: if you represent a data modification as code, instead of just running the SQL in the database, you can track that modification, check whether it has been applied or not, build new modifications on top, and so on. Represent schema modifications too. Represent the infrastructure: how many servers do I need to run this pipeline? "Maybe that guy knows." No: go to the source code and check it. And represent sample datasets as well. "Hey guys, I have to do a demo, I need some sample data, can you give it to me?" No, you are self-service: go to the repo and get the dataset. All of this also enables parallel development, which is very basic too: branching and merging work to your advantage.

Next: support several environments. Also common sense. My god, who has only one database, only the production database? We had that. And who does this: you are in a rush, someone says "hey guys, this data is wrong, fix it right now", and you go and run SQL straight in production? We did that. Not anymore. Create a data migration, apply it first in your development environment and check that everything is all right; then apply it to the pre-production environment and check that everything is all right; and only then run it in the production environment. We cannot screw things up in production; that's why you need several environments. So data migrations are very basic, and you need, again, tests to check that every data migration is consistent across environments. And you need to replicate those environments: if you have a very nice environment in production, but pre-production is different and development is different, you have a mess and nothing is consistent. So replicate the data, the structure, the schema, everything, across environments.
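A minimal sketch of the migration idea: numbered SQL files in the repo, applied in order, with a bookkeeping table so each environment knows what has already run. The table and connection details are hypothetical, and in a real project a tool like Alembic or Flyway does this job.

```python
# Sketch of "represent data modifications as code": versioned .sql files
# tracked in git, applied in order, recorded per environment.
import pathlib

import psycopg2


def apply_migrations(conn, migrations_dir="migrations"):
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
        )
        cur.execute("SELECT name FROM applied_migrations")
        done = {row[0] for row in cur.fetchall()}
        for path in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
            if path.name in done:
                continue  # already applied in this environment
            cur.execute(path.read_text())
            cur.execute(
                "INSERT INTO applied_migrations (name) VALUES (%s)", (path.name,)
            )
    conn.commit()
```

The same script runs unchanged in development, pre-production and production; only the connection it receives changes.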
Next practice: containerization. Since Docker, everything has changed a lot, and now we have a way to move processing, whole programs, across environments. A container, as you know, is an atomic unit of execution with all its dependencies (libraries, binaries, resources, everything you need to run a process, a pipeline or whatever) inside the container, so you can move it from one environment to another. In our case, if you think about it, an ETL is just a chain of programs: each program receives an input, makes some transformations and produces an output, and that output is the input of the next step. In our case, those steps are Docker containers.

Next: deploy frequently and without fear. You probably know places like this: just one deploy per month. That means a lot of changes landing on the first day of the month, and a lot of risk too: you are probably going to break a lot of things, and you are probably going to spend a lot of time tracking those errors. Okay, let's go to one deploy per week. Not bad: you will have more control, smaller deploys, fewer errors and less risk. But go to the really radical place, which is one deploy, or more than one, per day. When you finish a task, verify it and deploy it to production. That's the continuous deployment approach.

And the last practice: democratize your data. As developers, or as data departments, we usually behave like waiters: someone needs data, asks for it, and we, as waiters, go and bring the data to that person. But that doesn't scale: if a hundred people are asking for data, how many waiters do you need? A lot of waiters. So take a different approach, go self-service: build tools, build dashboards, build back-office environments, build whatever, just make sure your data is self-service, so that anyone in the company can consume it without asking for it.
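A minimal sketch of the self-service idea with Flask; the endpoint, the file and the parameters are hypothetical, and the point is only that nobody has to ask a waiter:

```python
# Sketch of self-service data access: a small internal API anyone in the
# company can query, instead of a person preparing files on request.
from flask import Flask, jsonify, request
import pandas as pd

app = Flask(__name__)
prices = pd.read_csv("prices.csv")  # in reality, a database


@app.route("/prices")
def get_prices():
    city = request.args.get("city")
    subset = prices[prices["city"] == city] if city else prices
    return jsonify(subset.to_dict(orient="records"))


if __name__ == "__main__":
    app.run()
```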
All of this together is what DataOps means, and we built a data platform precisely to implement all of these ideas on our infrastructure. So Luis is going to talk to you about this data platform.

Okay, thank you. So, as my colleague said, we had a lot of problems inside our company and we tried to solve them using the DataOps ideas. Before talking about what we are doing now, I want to tell you a story about one day in our company. This is Teodoro, and he wants to be a developer. Hello, Teodoro. Good job, very fancy. Okay, he says today is Thursday. No, he starts badly: today is Wednesday. But fine, let's say it's Thursday, and we need to load the data into our system. We have providers, and we need to load their data into our system. So let's do it, let's see what happens. First, download the files by hand, just as he is. Next, we need to execute a script, by hand, and this script is ad hoc: it transforms the files into something we don't really know, and the results need to be moved to another folder on another server, once again by hand. Then we execute our ETL to make the transformations and load the data into our system, with Django. I don't like it. Why are we using Django to make the transformations instead of scripts? Okay, we don't know, but we continue. And we wait. One hour, two hours. What is happening? I'm bored. Half a day waiting, and then it fails. Why? Why is it failing? I don't know. I'm going crazy. It's failing, and we wasted four hours of our lives waiting for a script that fails. Game over. This guy has to start again and find out what is failing, and we are losing our time.

So, what problems do we have? Everything is manual. We have a person pressing a button every day to see what is happening: I do one step, press the button for the next one; the step finishes, press the button again. So if something is wrong, maybe we can react, but this person spends all his time monitoring the system to check that everything is fine. He is not really working; he is only monitoring and pressing buttons. Another problem: the ETL works in batch. That is not really a problem by itself, but it becomes one when you need two days to have a complete dataset to process. If we spend two days waiting for the data and then one day processing it, we waste a lot of time waiting for the result. That is not fine, and we thought: okay, we need to change this and build something based on streaming, but we will get to that later. It is also not multiprocess. We don't have only one provider, we have multiple providers, and it took a minimum of four hours per provider. Maybe we need one week to have all the data; if something fails, maybe two weeks. So we could waste a month getting all the data in. That is wrong. And the big problem: Django. Django is not executing, let's say, Python code. Django, Python? No: it is executing SQL scripts, and we have no rollbacks. So if one script fails, what happens? It continues. It fails and it continues with the next script, with incorrect values. And what is worse: those invalid values get loaded into our system, so we end up working with incorrect values. We are dead. And of course we have no traceability, no isolation, nothing atomic: we don't know what is happening, we cannot roll back, we don't know anything. And the other real problem, and this is serious: if we execute the script twice on the same file, we may get different results. That makes no sense; the process is not even idempotent. So we needed to change this, and we needed to do it now.

This is the checklist my colleague showed before, and we failed on every point. Our goal is to fix all eight of them: we need to change, we need to improve, and we need to be proud of our job. Okay, one spoiler, and if you prefer, close your eyes and don't listen: we are not going to talk about disruptive technologies or fancy ideas. We are only going to talk about what we are doing to improve our day-to-day. The only thing we want is to be proud of our job and be able to say: okay, this is running, it is working, and it is cool.

The idea is this; it is an ETL. We have sources from providers, we read them, and we store the files in our data lake. Right now we have a data lake; two months ago we had nothing, we had all the files on our personal computers. Amazing. Then we make some transformations and we publish the results through an API that can be consumed by everyone, because we want to democratize the data. If a data analyst says "I want the data", if a salesperson says "I want the data", if anyone says "I want data": I don't want to create a PDF, I don't want to create an Excel, I don't want to create whatever. I want to say: okay, here is the API, you can consume the data and do what you want with it.

In a bit more detail: we have multiple providers and we capture the data they give us. Some providers are external and some are ourselves. We take each source and load it into our system using loaders. We have one loader per provider, because each provider gives us the information in a different way, so we need to standardize the data into a common schema. Then we have something interesting, the watchdogs. The watchdogs give us the traceability that the old system didn't have: if something is wrong, if we need to send alerts, metrics or alarms about the data we are receiving, the watchdog does it. And then the final step is the transformation: we transform the data into what we want and store it in our system. Another important thing: all of these steps (the loader, the watchdog and the transformation) are designed as mini ETLs. The big ETL is made of small ETLs. The loader is an ETL in its own right, and it is containerized in Docker; the watchdog, again, is a Docker container; and the transform also runs inside Docker. So we can move them between environments, test them in a different environment, and put them in production with one click. It is very easy.

Let's look at the loaders more specifically. Loaders simply read from the providers and hand us the data. We used to run all of this on instances on our own server, which is not scalable, and we had a lot of problems. So we said: okay, we need to behave like a real company, we are going to use the cloud, and we are using Google Cloud. And what does the loader do? It takes the data, puts it into Google Cloud Storage, which is our data lake, and then sends a Pub/Sub message. Pub/Sub is publish/subscribe: essentially a queue with workers, asynchronous messaging that decouples the pieces of the system from each other.
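A minimal sketch of that last step of a loader, using the Google Cloud client libraries; the bucket, project and topic names are hypothetical:

```python
# Sketch of a loader's final step: upload the standardized file to the
# data lake and announce it on Pub/Sub for the next stage to pick up.
from google.cloud import pubsub_v1, storage


def announce_file(local_path: str, object_name: str) -> None:
    # 1. Upload the file to the data lake (Cloud Storage).
    bucket = storage.Client().bucket("my-data-lake")
    bucket.blob(object_name).upload_from_filename(local_path)

    # 2. Tell the rest of the system, asynchronously, that it is there.
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("my-project", "new-files")
    publisher.publish(topic, data=object_name.encode("utf-8"))
```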
We are using this instance again In our server so it's not a scalable. We have a lot of problems. Oh, we say, okay, we need to be there Are really company so we're going to use cloud. We are using a Google cloud and the loader. What does it all at the loader? So with the data and put it in goal your story it's our first satellite and then send a poops of message okay, poops of these Publication subscriptions, so it's very see a queue a queue worker and for that we have a something like I like to match that is Asynchronous as synchronous messaging that synchronizes in the city with the system So then we have the watchdog watchdog has a set give us the traceability and we have something that it is a physical process control Now we have a statistic honest about what is the quality of the code of the data we receive? We can create metrics. We can check the data if it's valid if some provider give us an empty CBS we can say okay something is wrong of course If some value could be a string or an int and it's a float, okay This room we can put alarms and also we have a QA system. So QA system So we can see if some of our values of QA Are in correct or we need to modify something we can modify it Listen the fly and also need a step as we are working right now with Datasets in CBS This part convert these files into strings. So right now instead of process everything on bats. We say okay. We convert the strings and we are transforming it by demand and This is the final implementation. So we have the loaders running inside a Airflow container. So their flow is Helping us to schedule all the loaders because each provider give us the data. Maybe once a day, maybe once a week, maybe once a month so Maybe we can put it. Okay, we create a code, you open a server and run this. Okay. No, it's not scalable So we use Airflow using Google Compose that's it a service and software Google Cloud and everything is fine and It's writing using pandas. So they're always very fancy. It's a panda with a tree. So it's like The statue in install. So I don't know and okay in this one write the data in Google Store it and send a post message and then we have Google Functions. It's like I don't know if someone have used Amazon is like the Amazon Landers. So this function if See that we have a message in the PubSoup. At the moment create a new docker with the Watchdog, create this information in there in the PubSoup execute all the metrics and Create the new strings to be processed in the net step. And in this step, okay, we create metrics, we create QAs, we create a lot of messages What to do with that? So we send it to To elastic and Kibana to be processed by our data analytics and our for me for example and since something is wrong, something is working fine and Make any changes we need it and Also a service. So it's policy to another API and it's so one want to see it. Okay. Use the API, please Okay, and the transformation maybe it's Where we're trying to do something different So this is the part like it's we use in the yango. Hey, no now we are using Apache Bin running over Google data flow and Okay, Apache Bin. I'm sorry about Kaba for me is very boring. So we're trying to use Spotify Steel that is writing in a Scala and It's it's it's very fancy. So if you never tried I Said please try it still because it's cool and then okay, this is Something that maybe we need to duplicate but we store All the data in postgres. So here we have a button like so right now. We are working on streams. 
The loader writes the data into Google Cloud Storage and sends a Pub/Sub message, and then we have Google Cloud Functions; if you have used Amazon, they are like the AWS Lambdas. When the function sees a message on the Pub/Sub topic, it immediately spins up a new Docker container with the watchdog, passes it the information from the message, executes all the metrics, and creates the new streams to be processed in the next step. In this step we create metrics, we create QA checks, we create a lot of messages. What do we do with all of that? We send it to Elasticsearch and Kibana, to be inspected by our data analysts, and by me for example, to see whether something is wrong or everything is working fine, and to make whatever changes we need. And it is also a service: it is published through another API, so if someone wants to see it: use the API, please.

And the transformation is maybe where we are trying to do something different. This is the part where we used Django; now we are using Apache Beam running over Google Cloud Dataflow. About Apache Beam: I'm sorry, but Java, for me, is very boring, so we are using Spotify's Scio, which is written in Scala, and it is very nice. If you have never tried it, please try Scio, because it is cool. Then, and this is something we may need to rethink, we store all the data in Postgres, and here we have a bottleneck: right now we are working with streams and everything is very fast, but we are not able to write into Postgres at that speed. So we had to create some windowed processing, windows in Scio, and every ten minutes do a bulk of insert statements into Postgres. It works fine, but maybe we could be faster.
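Before the results, two sketches of these pieces. First, the Pub/Sub-triggered side: a minimal background Cloud Function, where run_watchdog is a hypothetical placeholder for launching the watchdog container.

```python
# Sketch of the Pub/Sub-triggered piece: a background Cloud Function
# that receives the "new file" message and kicks off the watchdog step.
import base64


def run_watchdog(object_name: str) -> None:
    # Placeholder: the real step launches the watchdog container for
    # this file (row counts, range checks, QA rules, alerts).
    print(f"watchdog started for {object_name}")


def on_new_file(event, context):
    """Entry point for a background Cloud Function (Pub/Sub trigger)."""
    object_name = base64.b64decode(event["data"]).decode("utf-8")
    run_watchdog(object_name)
```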
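And second, the windowing idea. Our real pipeline is written with Scio in Scala, but the same shape can be sketched with Beam's Python SDK; the topic name and the bulk-insert helper are hypothetical.

```python
# Sketch of windowed writes: group the stream into 10-minute windows and
# write each window to Postgres as one bulk insert instead of row by row.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


def write_batch(rows):
    # Placeholder: issue one bulk INSERT with every row in the window.
    print(f"writing {len(rows)} rows")


options = PipelineOptions(streaming=True)  # runs on Dataflow in reality

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadStream" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/rows")
        | "TenMinuteWindows" >> beam.WindowInto(window.FixedWindows(10 * 60))
        | "CollectWindow" >> beam.CombineGlobally(
            beam.combiners.ToListCombineFn()
        ).without_defaults()
        | "BulkInsert" >> beam.Map(write_batch)
    )
```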
We also have an API on top of the data in Postgres, and we hope everyone uses this API. With this API and with Postgres, everyone can consume the data: they can use CARTO to see all the data on a map, they can use Tableau to explore it, they can use Jupyter to build some analytics, or whatever program they want.

The result: the old ETL took four hours to process our data. And this is the important part. I say four hours, okay, four hours, but what was the size of the data? Maybe you think it was big, right? One gig. One gig, four hours. That is not big data. So we had to improve it, because if we ever need to process more data, again, we are dead. And it was four hours per provider. Horrible. Now the whole process takes thirteen minutes: download the file, put the file into the cloud, everything. We are doing better: right now we spend 87.5% less time than with the old ETL, and I really believe that is a good improvement.

And if we go back through all the DataOps practices we listed: productize your processing? Yes, we are reducing and reusing all the code we have. Test your data? Yes: when we make a modification, we run the tests on the code and on the data to check that everything is fine. Version control? Yes, we are using it. Support several environments? Right now we have development, production and another test environment. Containerize? Yes, Docker is cool. Deploy frequently? Yes: if we make a modification, we test it, and if everything is fine we can put it in production. Democratized data access? Again, we have our APIs, and anyone who wants the data can use the API however they want. Perfect. So let me show you the checklist again: two months later, we pass the exam. As we said before, in our company we had a lack of understanding, everything was ad hoc, it was difficult to scale; it was horrible. And right now I believe we have a common understanding; we have a real product to consume; high reusability; the product is consistent; everything is automated and we validate everything we do; low response times, so if someone asks us for the data for something, we can have it in minutes, not days or weeks; far fewer errors, with no human steps required, everything automated; and we can scale.

So, very quickly, just as a summary. Data analytics is a complex, still-maturing discipline, and DataOps is a great framework to bring a bit of order into this chaos: it gives you a set of steps and best practices to guide you towards the goal. Your process is more important than your tools: as you have seen, we are not using fancy tools, just regular big data tools, but the process is the key here; having a common, well-known process inside the company was key for us. Use software engineering best practices, because they are the base, the foundation of everything. And finally, put a DataOps engineer in your life: everything can be done, everything can be implemented, when you have a data engineering team, or in this case a DataOps engineering team.

And that's all. Thank you very much. And please remember to rate the talk in the web application: just tap the star button. Thank you very much. Any questions? I'm not sure if we have enough time for questions. Okay, later then. Thank you.