Now that the sound issues are fixed, we can start with how the data is created in OpenShift. OpenShift clusters that are connected to the Insights program send archives with data to Red Hat, where they get processed. The data is analyzed so that customers can be alerted to problems on their clusters, and the whole flow from ingestion to analysis is quite a process. My part of it is the data processing, which is basically owned by the Connected Customer Experience team. If you are interested in the other parts, you might visit other talks by my colleagues at this conference and you'll definitely learn more about the whole process and the whole setup.

When we talk about data processing, some numbers: we work with about 250,000 archives coming in from the clusters. The archives contain various data about the clusters, and we extract what we need and process it. The data we keep adds up to roughly 100 terabytes, on the order of hundreds of millions of records. The numbers are never exact, because we process the data incrementally, but it gives you an idea of the scale we are dealing with.

Before I continue, let me ask: who of you is a data engineer? Thanks. Is there a data scientist in the room? Thanks. Who actually operates data pipelines? Thanks. And is there any Tekton contributor here, somebody who works on it? Thanks. It's very useful for me to know who I'm talking to.

The ingestion part of the platform is message based, and the processing of the data is connected to that. At the end of this funnel is our data lake, where the data lands and is stored, and that's the part I'm going to talk about today.
This is the high-level architecture. When you want to process data like this, regardless of the exact use case, you need a pipeline manager. A pipeline manager is the tool that drives the pipeline processing: it orchestrates the tasks, perhaps solves the concurrency between the tasks, and so on. There are really a lot of solutions on the market. There are plenty of open-source solutions, and they differ in many things: they differ in the platforms they integrate with, and they differ in the complexity of the tools. Some of the managers are really lightweight and just orchestrate a pipeline description; some are really complex, full frameworks that help you with creating the pipelines visually and so on. On the bottom of the slide there is a list of some well-known pipeline managers. The most often used are Airflow, Argo CD or Argo Workflows, Tekton, Jenkins; you probably know some of these.

Why am I talking about Tekton? As a data engineer, you usually don't have a choice of your tools, for two reasons. One is that when you are processing big data, the tooling is expensive to maintain. Companies usually have some tools already hosted or bought, or there are some available solutions, so you basically pick what is available in your environment. The second reason is that the other option would be to run the tools on your own, and data engineers usually don't have the skills to run such complex tools in production quality. They also don't have the time, because they focus on the data, not on the tools. That's the situation where we are. We currently operate our pipelines on Argo Workflows, and it's a self-maintained system. It was not easy to set up with the skill set that we have in the team, but it works just fine. Now we face a challenge: we need to replicate our environment to multiple namespaces in the cluster, and it's not trivial for us to set up and maintain multiple Argo instances. At the same time, we noticed that in the namespaces that we use in the OpenShift cluster we have OpenShift Pipelines. OpenShift Pipelines is basically branded Tekton provided with OpenShift. So we were thinking whether Tekton could be the tool we switch to, because it's provided to us and it's maintained by somebody else. That could be a solution for our problem with the workflow manager.

A few facts about Tekton. Tekton is Kubernetes native; that's the most important one, because we want the manager to be stable and run smoothly in an OpenShift environment, and this satisfies that. It's open source, which is great, and there is an established open source community, because the Tekton project has been there since 2018, if I'm not mistaken, so there is a lot of experience behind it. The project has excellent documentation, it's easy to get started, and it's not very complex. It looks like it could be the tool for the job.

Just to clarify the vocabulary that's used in Tekton: they have pipelines, and a pipeline is composed of tasks. Every task is mapped to a pod; each task runs as a single pod in the cluster. That's important to understand. The tasks are composed of steps, and every step is mapped to a container, so you can have multiple containers within the pod. You define your tasks with this kind of structure.

We were looking for the common use cases for Tekton on the internet. It presents itself as a solution for CI/CD; CI/CD pipelines are the most common pipelines run with the tool. The above diagram is a typical structure of a CI/CD pipeline.
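To make that pipeline/task/step vocabulary concrete, here is a minimal illustrative sketch, not taken from the talk: Python dicts mirroring the Tekton YAML structure, with made-up names, dumped with PyYAML.

```python
# Illustrative only: a minimal Tekton Task and Pipeline expressed as Python
# dicts and dumped to YAML. The names (hello-task, hello-pipeline) are made up.
import yaml

task = {
    "apiVersion": "tekton.dev/v1beta1",
    "kind": "Task",
    "metadata": {"name": "hello-task"},
    "spec": {
        # Each step becomes one container inside the task's pod.
        "steps": [
            {"name": "greet",
             "image": "registry.access.redhat.com/ubi9/ubi-minimal",
             "script": "echo hello"},
        ]
    },
}

pipeline = {
    "apiVersion": "tekton.dev/v1beta1",
    "kind": "Pipeline",
    "metadata": {"name": "hello-pipeline"},
    "spec": {
        # Each task in the pipeline runs as a single pod in the cluster.
        "tasks": [
            {"name": "first", "taskRef": {"name": "hello-task"}},
            {"name": "second", "taskRef": {"name": "hello-task"},
             "runAfter": ["first"]},  # an ordering edge in the pipeline graph
        ]
    },
}

print(yaml.safe_dump_all([task, pipeline], sort_keys=False))
```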
Tekton is also often used for machine learning, for training models; that's the bottom diagram, which usually runs one task after another. The question ahead of us was whether Tekton would work with our pipelines. This is basically our daily pipeline. Not the whole thing, just a part of it, but you can get the idea of how complex it is compared to the usual CI/CD pipeline. And we were wondering, because running the pipeline once is one thing, but running it on a day-to-day basis with all the things around it might be a big challenge for the tool.

We identified some key areas where we wanted to do experiments to understand whether Tekton would make it with our pipelines. These are the areas we identified and wanted to cover in our spikes, or experiments. This slide is a spoiler, because the colors show how it went: the green ones went fine, we solved what we wanted; the orange ones worked as well, with some compromises and workarounds; the red one is not a blocker, but it needs some more understanding.

If I go from the top: time-based execution. It's not typical for CI/CD pipelines to be triggered by a cron job. There is a somewhat complex way to do it, but it's possible with the Tekton tooling: you define an event listener, and there is a mechanism of triggers where you can transform the parameters and send them to the pipeline. In the end, we run the pipelines with a curl command in a cron job; it just sends events to the listener, and that triggers the pipelines.

Then we were wondering how the observability in the tool would be, because the amount of tasks is huge, and if you want to quickly navigate to a task that is causing issues, it might not be easy. It was surprisingly easy. Tekton has an excellent CLI; it's really easy to use, and the output is sometimes more comprehensive than the UI. The UI is integrated into the OpenShift console, so you have everything at hand, and the user experience is really good. It's quite easy to navigate even the complex pipelines and the structure inside.

The third thing we investigated was performance, and we found out that the performance is really comparable to the Argo Workflows setup that we are using now; there were no significant differences. So this is also something that we don't have to worry about.
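To make the time-based execution described above concrete, here is a hedged sketch: instead of curl, a small Python script POSTs an event to the Tekton EventListener service. The URL and payload fields are hypothetical and depend on the TriggerBinding you define; this is not the talk's actual script.

```python
# Illustrative sketch of cron-based triggering: POST an event to a Tekton
# EventListener service from inside the cluster. Names are hypothetical.
import datetime
import requests

EVENT_LISTENER_URL = "http://el-daily-pipeline.pipelines-ns.svc.cluster.local:8080"

def trigger_daily_pipeline(day: datetime.date) -> None:
    # A TriggerBinding extracts fields from this JSON body and passes them
    # to the created PipelineRun as parameters.
    payload = {"pipeline": "daily", "date": day.isoformat()}
    response = requests.post(EVENT_LISTENER_URL, json=payload, timeout=30)
    response.raise_for_status()

if __name__ == "__main__":
    # Run from a Kubernetes CronJob, e.g. schedule: "0 2 * * *".
    trigger_daily_pipeline(datetime.date.today())
```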
Before I jump to the orange section, I would like to share how we structure our tasks, because it's important for some of the workarounds that we had to do to bypass the limitations we hit. I need to stress that the limitations we found are not Tekton's fault; it's just that we are not using Tekton for what it is intended for. We are stretching Tekton's features in a direction that was not intended.

These are our best practices for the tasks. They developed over the time that we've run the pipelines, and they proved to be a useful set of rules when designing tasks. From the top: the task, and this is really important, needs to be idempotent. That means that when you run the task multiple times on the same data, you need to end up with the same results in the data lake or in the databases. That's important because you often need to rerun the pipeline: for example, you are missing some data, something changed, and you need to rerun the whole processing, and you don't want your data to be corrupted by that operation. This is a really important part of the sustainability of the pipelines.

The next one is the ability to run for past days. We need to deal with historical data because we do time analysis of the data, so the tasks need to understand the date for which they are run. It's important especially for backfills of the data, or when there are outages and we need to add the missing data to the history. This is also a key feature.

Single responsibility: this is very useful. In our case it means that one task equals one table in the data lake. It's not good practice to do more things in a task, because then it gets more complex when you want to handle your tasks in a generic way, for example when you want to generate the pipelines. The tasks need to be as uniform as possible. For the same reason, no data sharing between tasks: a task can only take data from an external source or the data lake, and store data to the data lake. No exceptions allowed, like storing data to some temporary storage that's mounted to the pod and remounted somewhere else; such hacks would be very easy to do.

The last two practices are about the organization of the code. From a programmer's perspective it's probably not a great practice, but for the maintenance of the pipeline code it proved practical. We have our tasks defined in Python, one task is one class, and the class includes all the metadata needed for building documentation, for building the pipeline definitions, and basically for navigating in the tests. The logic is bundled together with the metadata in one file. If you need to change something, you know the exact place to look, and it also makes reviews of the merge requests easier, because you can easily see that nothing was left out. Before, when we had these things split, it happened that, for example, the documentation wasn't updated or the metadata didn't match, so this works for us. Obviously, the execute method of the task class needs to be unified so that you can operate on the tasks generically. If this is satisfied, you can easily work with the list of tasks and use it for multiple purposes: building the documentation, building the graphs for the pipelines, and so on. If all the tasks are uniform, it's easy.

Now let's jump to the orange part, the other challenges. The first one was loop support. In Argo templates, we use loops quite heavily; you can imagine a set of tasks that you need to run multiple times with different parameters. That's what loops are for, and there is no such thing in Tekton. We often use this when we parallelize processing of big datasets: we split them into smaller parts and run them in parallel. It's also used in backfills, where we tend to loop over the dates and run the whole pipeline for every day. This is an essential feature for us, and it was the most challenging part of the experiments, because it forced us to change the strategy of how we build the pipelines. The solution we found was to create our own model for the pipeline: we build the pipeline in the model, and then we have a compiler that transforms the pipeline into the YAML that's understood by Tekton. During the transformation we basically expand the loops and render all the iterations explicitly in the YAML. The downside of this solution is that we can't have dynamic iterations; we need to know the number of iterations before the pipeline is run. That can be a limitation for some use cases, but in our pipelines this pattern could be avoided, so it's no blocker for us.
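Here is a minimal sketch of the pattern just described, with hypothetical names: one Python class per task (metadata and logic together), a uniform execute() signature, and compile-time expansion of loops into explicit task instances. This is an illustration of the approach, not the team's actual code.

```python
# Sketch: task model with metadata + logic in one class, plus compile-time
# loop expansion, since Tekton itself has no loop construct.
import datetime
from dataclasses import dataclass, field

class Task:
    """Base class: every task is idempotent and runs for an explicit date."""
    name: str = "base"
    description: str = ""   # used when generating documentation
    target_table: str = ""  # single responsibility: one task, one table

    def execute(self, day: datetime.date) -> None:
        raise NotImplementedError

class SystemsDaily(Task):
    name = "systems-daily"
    description = "Aggregate daily system counts into the data lake."
    target_table = "systems_daily"

    def execute(self, day: datetime.date) -> None:
        # Read from the lake, write back to the lake; no pod-local state,
        # and re-running for the same day must yield the same result.
        ...

@dataclass
class Loop:
    """A loop in our model; the compiler expands it explicitly."""
    task: Task
    params: list  # iterations must be known before the run starts

@dataclass
class Pipeline:
    name: str
    items: list = field(default_factory=list)  # Tasks and Loops

def compile_to_tasks(pipeline: Pipeline) -> list[tuple[str, Task]]:
    """Flatten the model: each loop iteration becomes an explicit task
    entry, which a renderer then turns into Tekton YAML."""
    expanded = []
    for item in pipeline.items:
        if isinstance(item, Loop):
            for i, _param in enumerate(item.params):
                expanded.append((f"{item.task.name}-{i}", item.task))
        else:
            expanded.append((item.name, item))
    return expanded
```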
But this is probably the biggest limitation of Tekton that we found. There is something called matrix, a new feature in alpha stage in Tekton, that could somehow fix the problem, but it's not production ready and it's not in OpenShift Pipelines.

The next area is backfills. This is the most complex area, because if you imagine the backfill as a huge pipeline, where you have a pipeline run for every day, you can easily get to a pipeline that has tons of tasks. This is a real challenge for the environment that you run in, because it's really demanding on resources, and it's also a challenge for the tool, because you stretch it to its limits. The first limitation of Tekton we hit in this area was loop support; we already discussed that. Another is that a pipeline cannot be composed of other pipelines. That makes defining the pipelines difficult, but we overcame this with our Python model, where we allow it, and then we have transformation code that again expands the pipeline into a flat view where everything is explicit. The transformation of our model does the work for Tekton.

Another limitation is the default time limit in Tekton. By default, Tekton tends to kill every pipeline after one hour of running. That's not very practical for backfills, because they often run for several hours. We found out that this limit can be overridden only when the pipeline is triggered from the command line, which basically ruled out the possibility of running the backfills from the UI. We usually run the backfills from the CLI anyway, so again no blocker, but it's something to be aware of.

What was a problem is control over garbage collection of the finished pods. In OpenShift namespaces there is a quota on how many pods you can consume, and it includes the finished pods that are no longer running. With backfills you easily exceed these quotas, because there are thousands of pods that get created. Tekton has some kind of garbage collection, but you need cluster-level permissions, which we don't have, and it doesn't have the granularity that we need: we want tasks that finish successfully to be removed in a short time, and failed ones to be kept for 24 hours so they can be reviewed and handled by the person who has the watch duty. We ended up with our own solution: we created a simple script that deletes the finished runs. We run it in a cron job, so it removes unwanted objects from the namespace and frees the resources.
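Here is a hedged sketch of such a cleanup script, run from a CronJob: it deletes finished PipelineRuns older than 24 hours so their pods stop counting against the namespace quota. The namespace name and the flat 24-hour retention are illustrative assumptions, not the talk's exact policy.

```python
# Sketch: delete finished Tekton PipelineRuns older than 24 hours using the
# Kubernetes Python client. Namespace and retention policy are hypothetical.
import datetime
from kubernetes import client, config

NAMESPACE = "data-pipelines"
GROUP, VERSION, PLURAL = "tekton.dev", "v1beta1", "pipelineruns"
KEEP = datetime.timedelta(hours=24)

def cleanup_finished_runs() -> None:
    config.load_incluster_config()  # assumes it runs inside the cluster
    api = client.CustomObjectsApi()
    runs = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)
    now = datetime.datetime.now(datetime.timezone.utc)
    for run in runs["items"]:
        completion = run.get("status", {}).get("completionTime")
        if not completion:
            continue  # still running, leave it alone
        finished_at = datetime.datetime.fromisoformat(
            completion.replace("Z", "+00:00"))
        if now - finished_at > KEEP:
            api.delete_namespaced_custom_object(
                GROUP, VERSION, NAMESPACE, PLURAL,
                run["metadata"]["name"])

if __name__ == "__main__":
    cleanup_finished_runs()
```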
The only red one is retries; we didn't find any workaround for it. Retries are useful when you are processing data from external sources. When there are outages in the sources, for example restarts of services and so on, you don't want all your pipelines to be killed because of that. It's useful when you can define on the task a retry with some delay: you try again, say three times, and only if it fails three times does the whole pipeline fail. Tekton doesn't allow us to set the delay, and the delay is important, because you need to give the environment you depend on some time to change before you retry. Basically, we need more testing to see if this affects how we operate the pipelines on a daily basis.

Setting limits on memory and CPU: this is also important, because some of our tasks require 20 gigabytes of memory for processing the data, so we need to carefully define the limits for every task so that everything fits in the environment. We have 64 gigabytes of memory in total, so it's difficult to squeeze everything in. This is an important feature for us, and Tekton supports it, but the limits cannot be passed as a parameter to the task, so you can't have a generic task that you reuse in multiple places with different memory limits. This was solved again by our model: we define every task explicitly, and we had to ditch the reusable-task pattern in the pipeline.

The last one is day-to-day pipeline handling. We were wondering how hard the life of the person who takes care of the pipelines would be when they want to watch everything that's happening. The limitations we found were these. The long-running pipelines and the garbage collection we already discussed. It's also not possible to pause a pipeline execution. Sometimes it's important to pause a pipeline that's resource hungry, to leave some room for another one to finish in time. This is not possible, and we resolved it with a simple script: when you stop a pipeline, what has finished stays recorded in Tekton, and our script produces a new pipeline that contains only the tasks that were not run in the previous run. That's our workaround. Smart reprise of a failed pipeline is a similar thing: if a pipeline fails somewhere in the middle, you fix the problem and you want to re-run it from that point. That's not possible with Tekton; you need to run the whole pipeline again, which is a waste of resources. We used the same script to create a new pipeline that contains only the remaining tasks.

So we found solutions for these problems, and it seems that it could work. What I want to stress is that we were quite surprised by the user experience of Tekton with these huge pipelines: it's easy to see in the UI what's running, what's failing, in which state the pipelines are, and it's easy to navigate in running pipelines that have plenty of tasks and find what you need.

That's basically it about our experiments. Now we are at the stage where we need to evaluate our findings and decide whether we want to switch to Tekton or not. If we decide to switch, the next step will be to create an open source project where we bundle all the workarounds and hacks in one place. We would like this project to serve as a boilerplate for anybody who wants to create a copy of our environment within, let's say, one day, to be able to start easily. So we want to put the scripts and the workarounds in there, together with documentation and some guides on how to set everything up. That's our plan. That's it for my talk, thank you, and now there is some time for your questions. We have four minutes.

OK. Did you consider giving feedback to them? Because the software is useful, but all the limitations you mentioned are things they would want to hear about. So the question is whether we gave the feedback to the Tekton community. We didn't do it yet, though most of the limitations that we found are already known and discussed in the Tekton community; in the forums there are solutions or discussions in progress, so we didn't find anything extraordinary that would need that attention.

OK. Would you consider contributing to Tekton, maybe some of those limitations? That might be an option: if we decide to go with Tekton, we might join the community and try to affect which features have priority. You may want to repeat the question. Did everybody hear it? It's for the stream. So the question was whether we were considering joining the Tekton community. Sorry for that. OK.
I think we took a similar journey a couple of years ago, when we took one of our traditional services and moved it into Tekton, and one thing we experienced as a challenge was how to properly test the tasks and pipelines across different teams. Have you done any research in this area? So the question was how we do testing of the pipelines and the tasks. The tasks are defined in Python, so it's easy to do testing for them, and we have them quite well covered; some were even written test-first, so the coverage is good. The problem is the whole pipeline. There we trust Tekton to be able to process the definition as it is written, and we will probably add tests for the transformation of our pipeline model into the Tekton code, because that's probably the area where issues can happen, so we have to cover that. OK, I think that's it, because we are out of time. Thank you for your attention and see you next time.

Excuse me everybody, we're about to begin. Welcome to the talk about exploratory data analysis techniques and how it works. Hello everyone, I'm Aksha Burke, I'm a software engineer at Red Hat, working in data science, and this is my first talk at this conference, the first time I'm presenting in front of such a huge audience, so I'm starting with whatever I have and will try to keep it as easy as possible.

This talk is about exploratory data analysis, the techniques that we usually follow in the data analysis part. Today's world is all about AI and machine learning; everyone just wants the magic, like the generative language models, ChatGPT and Bard, which are hugely popular. But everything starts with the data: data collection, then processing, then training the models, and so on. That's the whole journey, and I'm explaining the one part of the journey related to data, because if you want the output, you need data as the input for the whole journey. We will see how we can process that data, and how the data should be managed and processed so that models can adopt it and quickly perform the operations we want.

Here is the agenda for today's session. First we will introduce EDA, then the difference between EDA and data analysis, then the techniques: data cleaning, visualization, and some libraries; I'm mostly speaking about Python, which is the most popular language in data science. Then decision making and business strategies on top of that, and then a Q&A session. If there are any questions, please ask any time.

This is the simple picture: the raw data, exploratory data analysis in between, and the output. Before EDA, let's understand the raw data. Raw data comes in multiple types: text, images, and so on. For example, the large language models use data in the form of books, journals, articles, all that knowledge. So raw data comes in many forms; mostly we try to gather it in CSV files or text files. Whenever data is generated, it is not in a clean form that models can adopt, so that's the raw data collection part.
Nowadays huge amounts of data are generated, because we are all using the internet, and every click generates thousands of lines of data; there is no limit on data in today's world. But looking 15 or 20 years back, collecting data was a big challenge then; not today. So now we have huge data, but it is unstructured, and we need the data in a proper format so that we can understand patterns from it, get the analysis, and improve our decision making.

That's where exploratory data analysis comes in. It is the critical process of performing an initial investigation of the data to discover patterns and anomalies, to form hypotheses, and to check assumptions. It's good practice to understand the data first and then gather as many insights from it as possible. It's an approach to extract information from unfamiliar data and to summarize the main characteristics of the data. Many people underestimate the importance of data preparation and data exploration, but it is an important step: if we have well-defined, structured data, the projects that use the data to train their models will have an easy time, and the output will be there in minimal time. In the case of exploratory data analysis, the output is clean data, extracted patterns, and statistical input for the next steps.

I have the next slide here because EDA versus data analysis is kind of a confusing thing for everyone, but I think this table will help. The goal of data analysis is to answer questions, make predictions, and gain a better understanding; in EDA it is just to gain insight into the meaning of the data and collect such information from it, nothing about model training, nothing much about machine learning. As for timing, EDA happens early in the data science process, the initial part, and the data analysis comes after that, so we can say that EDA is a part of data analysis: data analysis is the big tree, and EDA is one part of it. As for formality, EDA is less formal and data analysis is more formal. The methods are largely the same; all the methods used in EDA are also used in data analysis, but in EDA, visualization gets much more focus, because many times it happens that there is no need for machine learning at all: we can get the answer just from analyzing the data, and visualization solves our problem. Visualization is most important because, as everyone knows, our mind captures a visualization faster than reading or hearing something, so it solves most of the problem here. Then in the data analysis part come statistics, modeling, machine learning, and all the steps: model training, testing, then gathering and providing the output to check and test our data and get results.
And we still need to evaluate the result, calculating its accuracy, how accurately it performed. That's everything that differentiates data analysis and EDA. Is there any question till now?

So, going ahead, first we will look at the data cleaning techniques. There are several techniques, but today I just want to highlight a few that are often missed but are quite important at this stage. The first one is handling missing data. Let me give an example: we have a bunch of data collected in a spreadsheet, a CSV file, and suppose some data is missing in a particular column. When we analyze that, it will keep us from the expected accuracy in that area. So how can we handle this? The first step is to identify the missing values; in Python there are functions for exactly that. We use the pandas library, which is the popular Python library for EDA as well: we create a data frame, and against the data frame we check whether there are any missing values in a particular column or row using these functions. Once we've found missing data, we can either delete the particular rows or fill in the values. The simplest method is the dropna function, which simply deletes all the rows or columns containing missing data; it's that simple in Python. The next option is imputation, replacing the missing data: as I said, a few columns and rows have missing data, and we can fill that in with the mean, or sometimes using the standard deviation, anything like that; it all depends on the data we are working with. And modeling can help to handle missing data as well: there are models that handle missing data very well, so you don't always need to go that deep into the raw cleaning.
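As a minimal pandas sketch of the missing-data handling just described; the file name and column name are made up for illustration:

```python
# Identify missing values, then either drop or impute them.
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical input file

# Step 1: identify missing values per column.
print(df.isnull().sum())

# Option A: drop every row that contains a missing value.
cleaned = df.dropna()

# Option B: impute instead of dropping, e.g. fill a numeric column
# with its mean (a median or a constant work the same way).
df["age"] = df["age"].fillna(df["age"].mean())
```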
The next technique is handling the outliers. The first step is to identify the outliers, and it is a very important thing. Suppose we have age data and we want to process data for an age group of, say, 10 to 20 year olds, and suppose there are records with ages of 60 and 70. Those are the outliers: we are processing only that age group, but there is some data that doesn't correlate with our expectations. That matters, because if our problem needs to perform operations on that age group and we take a mean of the data, any value of 60 or 70 in there will break the accuracy very quickly. To identify outliers there are methods like the z-score, and visualizing the data using box plots or scatter plots; the scatter plot is very popular for this, because you just see which points lie outside what's expected. Then comes evaluating those outliers: once we have identified them, we determine whether they are valid data points. Sometimes there was an error when generating or collecting the data, so we check for that, or we figure out why the data is wrong in that case. Then we can remove the outliers from the data set, which leaves a gap in the data that can be filled like a missing value, or we can transform the outliers, replacing them with the average or whatever correlated value we expect in that situation. That's it about the outliers.
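Here is a small sketch of the z-score approach on the age example above: flag values far from the mean, then either drop or cap them. The data and the three-standard-deviations threshold are illustrative conventions, not from the talk.

```python
# Z-score based outlier detection on synthetic age data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = np.r_[rng.normal(15, 2, 200), [60, 70]]  # two planted outliers
df = pd.DataFrame({"age": ages})

zscore = (df["age"] - df["age"].mean()) / df["age"].std()
print(df[np.abs(zscore) > 3])  # inspect first: error, or valid but extreme?

# Remove the outliers...
trimmed = df[np.abs(zscore) <= 3]
# ...or transform them, e.g. cap values at the 5th/95th percentile.
capped = df["age"].clip(df["age"].quantile(0.05), df["age"].quantile(0.95))
```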
Moving on, we have a few more techniques in data cleaning. The next one is data standardization and normalization. Standardization is needed because we want the data in a standard format; we mostly work with the standard deviation, mean, median, those things, and we need the data in a particular range. For example, if there is a column that should be either true or false, it should be in a standard format, either true or false and nothing else, and we can check that. Tools like scikit-learn's StandardScaler can simplify this process; scikit-learn is a very popular machine learning library and it helps with the data cleaning and pre-processing part as well.

The next one is normalization. This technique works like a min-max approach within a specific range. Suppose there is data with huge numbers, from 0 up to millions; it's not practical to work with a few numbers that are out of our range. So we normalize to the scale of 0 to 1, or to whatever specific range we want. That way the data is easily readable, and it doesn't change anything in the processing for the model; we are only changing the range of the data. There is a MinMaxScaler tool in scikit-learn that helps to normalize the data. Mathematically, with large-scale data we end up with decimal points and fractions that are quite hard to work with, so the 0-to-1 scale is very popular in data analysis; there are even theorems showing that 0-to-1 normalization is very helpful and works well with whatever data analysis we are doing. That's the data standardization and normalization.

The next one is handling duplicate data. Just as we've seen with missing data, outliers, and standardization and normalization, duplicates need handling too. While collecting data, multiple entries can be duplicated; if we have checks in place during collection, it's OK, but duplicate data won't help to improve the accuracy or the output. So how do we handle it? We check for all the duplicated records with the duplicated function against the pandas data frame, so all the duplicated data is identified, and then it's a standard process to remove it; there is nothing else to do with duplicates, no standardization or normalization, we just need to remove them to get things done. Removing is also simple, with the drop_duplicates function of the pandas library run against the data frame. I am specifically talking about Python here; there are lots of other tools and software that help with data cleaning, and models that help in that way, but if we understand how it works from scratch, the rest will be easy to understand.

So these are the four techniques; they are the basic ones, but important in the data cleaning process. There are a few more data cleaning techniques, like maintaining data integrity, encoding categorical variables, data integration, and documenting the data cleaning steps. Documenting is one more important thing, I will say: whatever we are doing, we try to document it in a general way, so that whoever comes next can adopt those things very quickly. It is a good step when we perform steps on a data set, and data cleaning may need to be done in various passes, multiple times; it's never sure we will get the data clean in one pass, so we need to perform various techniques on it, with understanding.
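A short sketch of the standardization, normalization, and deduplication steps just described, using the scikit-learn scalers and pandas functions named above; the data is made up:

```python
# Standardize, normalize, and deduplicate a small illustrative data frame.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"income": [28_000, 54_000, 31_000, 54_000, 1_200_000],
                   "age": [25, 41, 30, 41, 38]})

# Standardization: rescale to zero mean and unit variance.
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Normalization: min-max rescale into the 0-to-1 range.
df["income_01"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Duplicates: identify them, then drop them.
print(df.duplicated().sum())
df = df.drop_duplicates()
```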
Moving ahead, I would like to show the data visualization techniques; I will explain a few that are important at this step and very easy to understand as well. The first one, I already mentioned scatter plots: scatter plots are mainly used for detecting outliers. In this graph, showing cost against weight, the points look fairly centered; if we draw a line through them it's quite linear, so it doesn't look like there are outliers or anything like that, and we can easily conclude that the data is good to go and that linear models can be used on this type of data, for example when choosing which regression technique to use. Scatter plots are widely used for the relationship between two continuous variables; if the data is linear we can see it, and if some point lies far off, it's an outlier and we need to apply the outlier techniques to it.

The next one is bar charts. I know everyone has used bar charts many times, in Excel sheets and elsewhere. I created some children's data here that represents children in primary school and secondary school, with the colors showing how many children fall into each category. I don't need to expand much more on that; it's the magic of visualization.

Going on to the histogram: it is a graphical representation of the distribution of a continuous variable. The data is distributed into bins, and the chart displays the frequency or count of values falling within each bin. It helps to see the shape, central tendency, and spread of the data, and it's useful for identifying skewness as well as outliers. The histogram is also one of the most popular visualization techniques and it is very helpful.

There are lots of visualization techniques nowadays; with tools like Power BI and similar, it is very easy to produce them, but we need to know which techniques exist and which one suits which data well. The line plot helps to visualize trends over time or a continuous sequence, so time-sensitive data can be plotted using a line plot. The box plot provides a summary of the distribution of the data: it displays the minimum, the first quartile, the median, the third quartile, the maximum, and any outliers, so box plots help to show spread, skewness, central tendency, and the outliers of the data. Heat maps, as everyone knows, are very popular; for example, weather data can be visualized using a heat map with different color shades. A heat map is a color-coded visualization of the intensity or magnitude of a variable across multiple categories and dimensions; heat maps are effective for showing correlations, patterns, and clustering in large data sets, and they are commonly used for analyzing data matrices and multivariate data. There are types of data, univariate, bivariate, multivariate: univariate is simple data, multivariate contains multiple types of data, and those are easily visualized using heat maps. Then the pie charts, which everyone also knows: they show the proportion or percentage of the different categories within the data set, visualizing the composition and the relative sizes of the different categories, though they are limited to such numeric proportions.
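Before moving on to the remaining techniques, here is a minimal matplotlib sketch of several of the plot types mentioned so far (scatter, bar, histogram, box plot, heat map) on made-up data, just to show how little code each one takes:

```python
# One figure, five of the plot types discussed above, synthetic data.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
fig, axes = plt.subplots(1, 5, figsize=(18, 3))

weights = rng.normal(70, 10, 100)
costs = weights * 3 + rng.normal(0, 10, 100)
axes[0].scatter(weights, costs)          # relationship of two variables
axes[0].set_title("scatter")

axes[1].bar(["primary", "secondary"], [120, 85])  # category comparison
axes[1].set_title("bar")

axes[2].hist(weights, bins=15)           # distribution of one variable
axes[2].set_title("histogram")

axes[3].boxplot(weights)                 # quartiles, median, outliers
axes[3].set_title("box")

axes[4].imshow(rng.random((7, 24)), aspect="auto")  # e.g. weather intensity
axes[4].set_title("heat map")

plt.tight_layout()
plt.show()
```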
Then geographical maps: as the name says, we can visualize geographical data, showing distributions across regions, locations, or countries, and patterns specific to whatever data we are working with, like population or anything else we want to highlight. Then there are interactive visualizations: as I said, tools like Power BI, and libraries like Plotly; Matplotlib is the simple one, but Plotly provides interactive visualization, and that's the next level. All these techniques give a much more detailed representation of the data provided, and all of them facilitate a better understanding of the data set; visualization is the most important part of this area, as I said.

Going ahead with the libraries mostly used in this area. Pandas is the first and foremost, the most popular one: we collect data from multiple sources into pandas, which stores it in a data frame, and the rich function set of the pandas library helps to process it; the data cleaning is done with it and much more, because it has data frames and series, and it is used mainly in the pre-processing part, the initial data work in the EDA process. NumPy is the mathematical computation library in Python: it provides data structures, mathematical functions, and operations that we can perform on arrays, matrices, and pandas data frames. Matplotlib is the visualization library in Python, one of the most popular ones, and it has a variety of basic plots: line plots, scatter plots, bar plots, histograms, and many more, so visualizing data distributions, relationships, and patterns during EDA is easy with Matplotlib. There is Seaborn, which is a high-level data visualization library. Then Plotly is the interactive one, and scikit-learn is the machine learning library that helps to pre-process the data as well. Those are the libraries.

Then decision making and business strategies: identifying data-driven opportunities, validating assumptions, those things. Since we still have some time before the Q&A session, let me just review what decision making and business strategy gain from EDA: validate assumptions, mitigate risk, optimize resource allocation, improve customer understanding, enhance product development, optimize operations and processes, and monitor key performance. We can get all these things collectively from the whole EDA process. So yeah, going to the Q&A session: any questions? No questions? No questions online as well? Thank you so much.

Last time I did that... no, no, it's terrifying. Should I give you an opening, or what do you do at Microsoft? I can do it myself. Well, we are all here now. I am David Duncan, I am a partner solutions architecture manager, which means I am on an alliance team and I spend a lot of time working on Red Hat solutions. Just look to the right. Sure, yeah, that is a sweet spot. David Duncan, partner solutions architect; I work as a manager now, and I have been working on Red Hat solutions on top of AWS for the last 8 years, and before that I worked on Red Hat as a software development engineer.
I wanted to talk a little bit about machine learning ops, not because I want to spend a whole lot of time teaching you about ops, because I think there is a fair amount of understanding of machine learning and operations at Red Hat; almost every engineer here is participating in an operations space, doing testing and building test methodologies almost every day. But I think there are very important key understandings that we have in design that are important for success. Like I said, I am an architect, and that is going to become painfully clear; I put that up on the slides to identify where you are going to see me, my talk, and the things that are associated with it. I put a lot of time and effort into determining where the target requirements are and then trying to figure out how we are going to build product technology to match them, but I do that in the context of the cloud, and I think the cloud is a very important part of machine learning. We have all been inundated with conversations around generative AI, at least I have, and I think my team is sick of hearing "generative AI" twice in one day. I wanted to talk more about designing for success: what the key components are, how to isolate what needs to be done, and maybe this will help you, as engineers, or data or machine learning engineers, or practitioners, to find a way to structure your ops, structure the configuration, and then look at some of the tools that are available to you. I will get into this in a little bit, but my experience has been one in which design has led to product designation, so I spend a lot of time talking about where those products landed and a little bit about the development cycles and what made that go.

At Amazon there is a practice we call Well-Architected, and I was part of the team that designed this as well, but it's a big team; this is clearly not a one-person job. In the early days we were just trying to figure out how we would do five pillars; we had four pillars when we started, we got five, now there are six, super exciting, and it continues to grow. But I'm constantly looking at where things focus in terms of operational excellence, in terms of cost and sustainability, how all that effort comes together and where it comes through. What we do at Amazon that I really like, and I'm going to get to some of the things I really like about the design principles we had in the open source community too, especially around open infrastructure, which we have some people representing here in the front, is what we call the lens. The lens is a way of taking the challenges that you face in any one particular component of infrastructure development, or your line of business, and really focusing down on the core decision practices that make that happen. On the Red Hat side, what I really enjoy, and what I've talked about a lot with other people who are involved in this, some of whom were on stage this morning for the keynote, are things like architectural decision records: creating that infrastructure as code, or the decisions as code, with a certain amount of historicity. That's an important component of what you do in your operations; if you don't have that, then you're due to repeat those same decisions, and that is a very complex and complicated problem. Sorry, I thought I had a bit of a pathway slide here.
So wherever you land in the industry, whether you're in gaming, or doing large-scale databases, point of sale, wherever you're collecting information from, doing whatever it is you want to do, whether that's some really basic recommendation engine or some highly structured investigation of unstructured data that may take you in whatever direction, there are lots of places to start. I personally started on the very far side of there, in the HPC lens. The HPC lens was a place where we created a structure for things that was very siloed: you have a front-end node, that front-end node communicates simple chunks of data out to a large group of nodes, and they return it to the front-end node. We don't do that in machine learning; machine learning is typically done in the same way that we do Hadoop, in a way that is opportunistic now, and the opportunistic modes are some of the things that I think are really, really interesting, what makes this happen, what makes me excited about this.

So as you build that structure, as you build that pathway, each one of these ends up having a lens; you end up having a way to do this. In Red Hat there is a solutions collection, and each one of those solutions collections is pretty easy to review, and there are people here, doing that research at universities, who are contributing to those in an open source way; I highly recommend that you spend some time reading and taking some notes there. And then there is open infrastructure that you can participate in today, to experiment and learn and build out a structure of practice that is similar to what was spoken about this morning in the keynote, which is that open source software as a service, and machine learning itself, has a lot of component parts that need to be extremely well documented and well understood. Part of that is tools, part of that is finding out what it is that we are doing in our documentation, and some of it is just basically the paperwork to get things done.

But ultimately it comes together in an operations model. This is sort of the crux of what I think is really important for us: we have to figure out a way to ensure that we have a data processing model, and then we need to ensure that we have a way to build a continuous cycle of this. I guess I could have done this in a circle, but I didn't. Really, the focus that we have, machine learning ops, kind of starts when you collect data; these other two components are a really nice thing to have, but they don't necessarily fall straight into the operational models. This is usually what you get handed as an engineer and have to continue forward with, so I'm fine with leaving that out; we are not going to talk about business, I can't do it.

But I can talk about what I think is the continuous cycle here, so I guess I did make a circle, and I put it on this slide. Each one of the steps that goes into this process has to be clearly defined, and this is something that I think is really important: that we find the parts of this structure in each one of the phases of the business, so that you can clearly define them.
I put in here that we were talking about this all the way to the edge, and backwards too, all the way from visualization back to the edge, trying to find ways to tune back into whatever it is that you are evaluating, whether that's the hum of a giant tank of oil or the position of a series of ships. It doesn't really matter; each one of these component parts is really important. Anomaly detection is one of the things that I've worked on, and I think it is a really important part of that edge computing process: how do I get this data over here, which data do I transfer, how important is it, what is the clear path for doing that. Thinking about things from this perspective, and listening to customers talk to me about this, I spent a lot of time on what the first component of this was going to be. A lot of that came from determining that most people were just looking for a model; they weren't thinking to themselves, oh, I'm going to go make a model. They got one, and they want to really implement it; if you don't know what it does and you just need to have something that works, you have to figure that out in this process, and this is messy. So this is not exactly where you want to be; this is not where you're doing ops, this is where you're trying to figure out whether or not you have a structure, and this is essentially the first part of your machine learning experience, not where you're thinking about how you're going to tune the configuration.

Good news: once you've done this a couple of times, you'll start to have an operational phase, and this part will look a lot like that little black dot in the middle. You'll have an understanding of how you're going to create all of these essential silos in which you can now start to create your infrastructure. Your infrastructure model, as you see pipelines and how that configuration comes together, is really important, and then this first phase, the pilot phase, becomes a lot less messy. For each one of the next iterations you now have this machine learning lifecycle model, where you've already got the decision records, you already have all these things in place, and you can start to push them into the end of play. And this is the great part about the conversation we're having here: when you look at how much of this effort actually goes into production, these are kind of small numbers, but they're still consistent; I did a little background checking, and these are still good. You're still making big decisions; you're doing a lot of transitioning from one model to the next to try to determine exactly what you're going to get out of it, and I have lots of fun stories about making those decisions, making changes in the model, retraining that model, and then determining whether or not the aspects of that determination were in fact valid. And explainability is one of those things where it's almost guaranteed that you will have a lot of trouble in this space if you do not already have a process and pipeline that makes it work.

So I started off, like I said, thinking about this in that context, and I'm highly opinionated, I'm just going to tell you right now, so take this with a grain of salt; I'm truly biased when it comes to what these operations look like in real life, how they in fact get structured, because I started from the simple fact that I work on a team where the center of my universe is actually ROSA. So I live in a place where this, to me, represents 7 years of my life: making this a product.
Turning OpenShift into a service on AWS started off by creating incremental changes that people could use in the context of previous versions of OpenShift. This didn't appear for us until OpenShift 4, but on 3 I wrote a scale-out model, because the problem with the operations for building out a structure for cloud on OpenShift was that no one had thought about how the scaling would work. Scaling out was super easy, you just buy another node, and that was a fun part of that experience, but a very deep-pocket kind of experience; the goal was to scale in: I'm not using it today, so how far can I scale down? My favorite support question, the emergency support call that you get on GitHub, was from a customer who said, I've scaled down to one node and I don't seem to be able to scale back up. Well, that is true: etcd is not going to let you do that anymore. I'm sorry, but the good news is you have a CloudFormation template that will deploy it all over again, and you don't have to create the auto scaling groups that are allowing you to do that scale-in and scale-out. And we did that. The structure of the design we did was to create an auto scaling hook configuration; the auto scaling hooks would do all of the node setup, and then, when you wanted to scale down, the hooks allowed you to tear down the node in a proper way in the old OpenShift Container Platform model. This was the start of getting the people who were making the business decisions very excited about the fact that they had an incorporated model, and as a foundational layer that was the thing that created the excitement that made two business decisions, one on the Red Hat side and one on the AWS side, that brought us a service that would in fact support doing a lot of that work.

Now, everywhere you go, that product design has become a really important part of how we build this concept of two platforms talking to each other and in fact being integrated together. Being integrated together means that you have to solve other problems on top of the problems that you have for just creating a container-based platform. We have many different selections: there are now many different ways to deploy OpenShift on AWS, so suddenly you've got a big choice to make on how this is going to happen, who's going to manage it, what's going to be going on, and then you've got these other things you can do, the vanilla options. What am I getting myself into? I've got a practice that has to make all these decisions, and that's where the decision records come in: you can go back through those, you can make a single decision, and then you can make another one based on that. In fact, there are some blueprints in the Open Infrastructure that talk about some of those failures, the failed decisions, or the advantage of one option over another in terms of choosing one open source project or platform component over another, and then relating those back to best practice. That's what those Well-Architected lenses and the components you find in there are for: it's a forcing function to ask questions like, is this sustainable, does this apply operational excellence, am I getting the scalability I expected to get? Using those techniques, and the technologies associated with that kind of question-and-answer experience, is a great way to form that body of practice. So take those six phases and the operations, and then look at how you're going to put that together, and you'll start to see that there are going to be some big holes. So you need an inventory.
So once I looked in and said, OK, ROSA is on the way, what else are we going to need? Well, it turns out you need a way to build out this infrastructure that is consistent with your expectations, and I chose Ansible. An obvious choice, right? I know, I know, it's food on my table, but it's a personal favorite, and I can make that case outside of my own experience, because there are some things I think are important, and they are supported by creating enthusiasm for that same kind of technology. Identifying that as something really important made me keep talking to our CloudFormation team and say: hey, I really like what I can do with Ansible, and I don't like writing these templates and then not being able to iterate very well or find ways to do more minimal tests. That team decided that they didn't like having just one way to do it either, and that they wanted to incorporate that strategy into what they were doing, and that's how the Cloud Control API was born: out of that iteration, out of us talking back and forth (sketched below). And it didn't just happen for Ansible. Tools were changing: Chef was going away, Puppet was going away, nobody wants to use those, or the server model doesn't work for us, and it certainly doesn't work for machine learning, which I'll get into. We were looking for an easy way to do the discovery for the resources, to build out those best practices in an iterative model, and in a way that was easily consumable. So Ansible becomes an easy place for us all to encode opinionated decisions. On the other hand, it travels well, and yes, this is David's opinionated inventory. When you get out to the edge you're going to use different collections, there's going to be a different experience, but it's all going to be part of your execution environment, and the support infrastructure is covered.

Originally there was nothing on AWS except the Red Hat solution; the only thing that ran on Amazon in 2006 was a Red Hat box, that was it. Once we started to look at what we were doing with data, there were a lot of places for us to go with that, and some of the other things became obvious. We still need high availability; high availability configurations were absolutely necessary. And we still need a way to manage data at the edge. There are lots of times where you're collecting a lot of information and you need to put that information into a large-scale data lake, and we found there were some options that Microsoft SQL gave us for collecting that data at the edge. It made it very easy for us to have simple communications over VPN, collect small amounts of data, and then pass that data into buckets associated with long-term storage, and that has driven a lot of variation in the way that we build things.

So, looking at an example of how we build a pipeline: we start with code that includes the infrastructure as code, and OpenShift kind of stays in the middle. In the early days of machine learning, for the data scientists and the people who were actually crunching the algorithms and trying to determine whether or not they were fully functional, we had this tool called the Deep Learning AMI. The Deep Learning AMI was exactly what I didn't want it to be, but it works, it seems to work, everybody loves it. It was basically a collection of every scientific library you can think of for Python, Ruby and R, dumped into one machine image with all the NVIDIA CUDA drivers and everything.
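Since the Cloud Control API came up a moment ago: the point of it is one uniform create/read/update/delete surface over CloudFormation resource types, which is what made it attractive for tooling like Ansible. A minimal sketch of what that looks like from boto3; the bucket name is hypothetical:

    import boto3

    cloudcontrol = boto3.client("cloudcontrol")

    # Discover existing resources of a given type through one generic call...
    buckets = cloudcontrol.list_resources(TypeName="AWS::S3::Bucket")
    for resource in buckets["ResourceDescriptions"]:
        print(resource["Identifier"])

    # ...and read any one of them back through the same uniform interface.
    detail = cloudcontrol.get_resource(
        TypeName="AWS::S3::Bucket", Identifier="my-data-bucket"
    )
    print(detail["ResourceDescription"]["Properties"])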
But component parts alone don't make a pipeline. In this case, this is outside of the pilot phase; these are the things you end up doing to get structured data, and this represents, for me, a data practitioner. If you look at the way this works, you'll see that up in the far right corner I've got Amazon Athena, which is a way of building a SQL-style analysis over the contents of an S3 bucket: if you have structured or mildly structured data, you can create a configuration and get something back (sketched below). Then AWS Glue is a way to catalog and tag data so that it goes from an unstructured to a more structured model, and you can take those artifacts, get the trained model from that experience, and push it toward production. Once you've got an approval, your approval process to transfer that over on the top, you can move it into your production cluster. The artifacts can come with you, your trained model can come with you, you can pull the batch production data from there, and then use that model in a way that is consistent with your business requirements, your recommendation engine or whatever. That whole process is exactly what I think of when we bring this out into experience.

And that's why there are tons of tools for doing this. There's the single-node Outpost, an edge solution we can use in the context of Greengrass; for the AWS experience there is a small method for doing basically serverless functions at the edge, and you can collect that information and stick it into S3. This is the simplified way; as your data engineers dig into this more and more, you may find that the basic tools no longer work, and you'll create a more bespoke model that is consistent with building out on the ROSA experience, where the open data tools will take you through that whole process.

So that, effectively, is everything I have to talk about today, and to say that this is a great opportunity to talk about machine learning ops, what makes it work, and to see a bit of an example of what it looks like from an architecture perspective. Does anybody have any questions?

You briefly mentioned monitoring and model drift. In your experience, how difficult was it to set up both the monitoring and the model drift handling, dealing with each retraining?

Well, for me model drift is less of a concern, because obviously I'm more focused on operations than on the models themselves. But what I see from my friends is that the training happens very consistently once you put this into operation: you've got the Ansible models building up the systems, then you're doing your basic training, and then you've got that model in connection with your data. And I didn't mention it, but public data sets are a huge help in terms of training, and they quite commonly come across as an S3 bucket. I would use those in the context of OpenShift: I'll just grab one and put the assignment for the public data set in an Open Data Hub workspace. Building out the YAML structure for that and making it work reduces the amount of effort necessary for a retrain.
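For the Athena step mentioned above, the mechanics are roughly this; a hedged sketch where the database, table, and bucket names are all hypothetical:

    import boto3

    athena = boto3.client("athena")

    # Run a SQL-style query directly over files sitting in an S3 bucket.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT sensor_id, avg(reading) AS avg_reading "
            "FROM edge_readings GROUP BY sensor_id"
        ),
        QueryExecutionContext={"Database": "edge_data"},        # hypothetical
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    # Poll get_query_execution() with this id until the query succeeds,
    # then fetch rows with get_query_results().
    print(response["QueryExecutionId"])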
And that's what I was talking about: the pilot phase is messy, and there will be much gnashing of teeth while you try to push it all back together, which is, again, unlike the training itself, which is incredibly reproducible, because you just pick a model and use it over and over again. But you know, if there's a security vulnerability, or a modification you really want to take advantage of, then you have to switch your entire environment. Right, so, any other questions? This session is still running. I'm sorry.

So, model serving is obviously a big part of operating models, all the inference part. A couple of years back there was a myriad of young projects trying to solve that, starting from just wrapping a model in a Flask application deployed as a container, then TF Serving and some other projects. I wonder if we are now converging to a more cohesive, consistent deployment model when it comes to the inference part, and maybe: what is Amazon doing on their back ends for their AI services? Is it always just homegrown depending on the use case, or is there some convergence?

Well, there is always an opinionated convergence. If you look at Red Hat, I'll start right there: look at how Red Hat does this in the Open Data Hub. You have all the tools that are associated there, and then you have a way to support the pilot phase in Jupyter notebooks, create that configuration, and maintain those models in specific containers so that you can have that container space. With Amazon, the way they do that is they create SageMaker, they use some very opinionated models, and they say: you, Mr. Customer, can use all of these exactly the way they exist today, and maybe later make a decision to use a different one, interchange them. But in my opinion, part of what we did, like building a RHEL workstation on top of AWS, was to create spaces where, if you knew what you wanted to do in that messy phase, you could have immediate access to the right kind of hardware from a Red Hat instance, on other platforms that you might want to use.

I think we're out of time. One more minute. Another question, then.

You did mention SageMaker, and I've been using it for a while. I wonder: is it primarily intended for training models, or do you see any other settings for it?

Well, OK, talking about it from the perspective of the Amazon products: SageMaker, I think, is really meant to leverage models that are already there, and then you have the ability to do some additional training on some of the existing models, and you can add your own if you want to. In that process there is what they call machine learning ops, basically a service for machine learning ops that is available there. For me, obviously, that's not where my work goes; my work extends into how I can make this work on the Red Hat OpenShift Data Science model, which gives you a similar kind of environment but done in a very open source way that is achievable on other platforms. So if you have a hybrid scenario and you want the same experience from your on-premises to your public cloud experience, I would say you can do that in the context of OpenShift Data Science or Open Data Hub, and SageMaker would give you sort of similar tools in that sense. If I wanted to go build a cluster and then use Dask to handle a complex problem across a suite of systems, I can do that as easily in the context of an OpenShift cluster on AWS as I can building it out in terms of machine learning ops with SageMaker and SageMaker notebooks.
Well, it was a very nice talk; if you have any further questions, I think he'd be more than glad to talk about it all day. Just look for him.

Hey everyone, I started with jokes already, so now it's time to get serious. Welcome! First of all, it's my first time here at the conference in Brno, and in the Czech Republic; so far, thumbs up for everything, thank you for making it like that. I will try to make the next 25 to 35 minutes interesting for you, because we are now in the post-lunch, sleepy, low-energy part of the day, so I'll try to make this interesting, I promise. If you want to make it interactive, you can just ask your questions; we will answer all of them at the end, there will be a slide for that. But we also need to get our blood flowing. So, is there anyone in the audience that considers themselves to be a data engineer, or someone that works with ETL, with data processing, with whatever? Just don't be shy. A few of you. I'll count myself in as well... well, I'm lying actually, but I want to provide an example for you guys. OK.

But did you also hear that data is the new oil? Kind of, yes, no? What about this one: data is the new gold? I heard it, and I'm not even a data engineer, but I agree with both of those statements, because they are true. And when we talk about ETL, the E in ETL stands for extract, and it means getting something out; in our case it means getting the gold out of the ground, and that is hard work, absolutely. I haven't tried it, but I heard dwarfs talking about it. Because first of all you need to know where the gold is, and even if you know where the gold is, you need the permit to dig. Well, there could also be dragons guarding it; you can just ask Bilbo Baggins about that, for sure. But knowing where to dig is only half the story, because the other half is getting the permit, and usually, in my own country, you need to bribe a few people, you need to, you know, grease things up and whatnot. Yes, that is how it works. In any case, the scenario that we will go through today goes exactly like that: you know there is gold in those hills, and you have the permit, but the local clerk that gave you the permit was just a little bit on the bottle, and the permit says you need to dig from your home. What?
Well, exactly: we are going to dig for gold from our own home. And speaking about home, my professional home is one of the largest fashion companies that you've never heard of, called Bestseller, but I'm hopeful that you've heard of at least a few of its brands, such as Jack & Jones or Vero Moda or Only, or about 18 of the other ones. As I said, my name is Yvica, and I'm definitely another dwarf, although my Dutch colleagues might disagree with that. I work as a data platform team lead at said company, Bestseller, and what we do is essentially provide the analytics platform to our colleagues and to the stakeholders, so that they can make business decisions and also be informed about those decisions. Some of those decisions are: do I buy this shirt, is it going to sell, how many of those do I need, when do I show them on the website, when do I take them down; and you do all of that 9 months in advance, because you don't buy it today for tomorrow, you buy it for the next season. So there are challenges in running this platform and providing these insights, and one of those challenges is definitely around data governance and data residency. The scenario that I want to share with you today is about how you can use Apache Airflow and ECS Anywhere to solve those challenges.

But we first need to understand what data residency is. What does it mean? I'm not a native English speaker, you can probably hear that, so I reached for the dictionary. There it is: residency is the fact of living somewhere, where you reside. In my example, I'm living in the Netherlands; that's where I'm a resident. But also, the fact that I'm allowed to live there is regulated by law, and I pay taxes there, which is again regulated by law. Which brings us to the next term: governance. What does it mean to govern something, a society, a country, an organization? Well, it means coming up with laws that are enforced within that organization or country.

Now that we understand what the problem is and what residency and governance are, let's also see how "the company that makes everything" acquiring a competitor in the US turns into problems with paying Christmas bonuses. And it's not actually what you think: they need to pay Christmas bonuses, but they cannot do it. In the world of big fish acquiring small fish, our own big fish, Acme, acquired their US competitor called Cyberline Systems. Nothing wrong with that, right? Companies buy companies. But if we consider that these two companies are in two different geographic regions of the world, with different governing bodies and different laws around data processing, that's when things start to get interesting, or nasty, depending on who you ask. Because we absolutely know that there is valuable data hosted somewhere in the US, where Cyberline Systems is, and the leadership of Acme would give pretty much anything to get insights from that data; but their ETL tool of choice is hosted somewhere in the EU. That's a problem, because you are not allowed by law to process US citizen data outside of the US. So you have to process the data in the US, and that's the goal, and that is the challenge that we have, because we need to get to that goal from our own home. Clear? We get it?
Essentially, we are in this situation architecturally: Acme, an existing company, has their infrastructure hosted somewhere in the EU. They have a container management platform, they also have an ETL platform, and what they recently bought is two data centers, one of them in New York, the other one in San Francisco. It is of course technically possible to load the data that is somewhere remote, but we are not allowed to do this.

This is where I would ask you to pick up your phone, scan this, and tell me how you would approach the problem, how you would solve it. I'll give you a few moments to enter your answers... OK, I was in the blue half a few months ago... yes, a pretty even distribution, I would say. Thank you, we will make it even more even. That's how it becomes a game; there is a whole concept of gamification and everything, and this is how we gamify. OK, but whatever the results are, I am going to tell you anyway that there is no wrong answer, so let's just continue.

Roughly a third of you said: let's do a site-to-site VPN. Right, a valid solution, there is nothing wrong with it, until we put together the complexity. You need skilled people to set it up. Because this is now a critical part of your infrastructure, you need to make sure there are two redundant links. You need people on call to babysit it when it needs to be babysat, and we are talking about two geographic locations, so different time zones, so you need people around the clock, and you need to pay them to pick up the PagerDuty calls. Not ideal.

Another option could be to run the ETL tool of choice per location. It's very easy technically; most of these tools are containerized, so you just run them in all locations. But what if you have 17 locations? What if you have 25, or 250? Who is going to manage that? Who is going to manage access to all of those tools, because they are now separate tools? How do you manage the permissions that these tools have over the data you want them to process? You have dependencies between jobs in location 1 and location 17; one of them fails, how do you reconcile logs in multiple locations? It simply becomes an operational nightmare very quickly, so let's not do this.

Maybe it is safer to cry in a corner. But you know who did well while everyone else was crying in a corner? The people who were selling shovels during the gold rush. I don't have any shovels on me, but I would like to introduce you to two shovels, which are Apache Airflow and AWS ECS, and let's talk about them briefly. By the way, anyone in the room using either of these tools? OK, any happy users of those tools? Yes, yes, yes, that's what I want to see.

OK, AWS ECS is the Elastic Container Service, and it is essentially a container management platform. Some call it a poor man's Kubernetes; maybe it is, maybe it isn't, I don't know. But what I do know is that it allows you to run containers, to scale them, to make sure that they keep running, and it does the job right in most cases; it is far from perfect, but it works. What sells it is the integration into the entire AWS ecosystem: if you need to handle permissions for those containers, it's IAM; if you need to change firewall rules for those containers, it's VPC, again on the AWS side. The way we use ECS within Bestseller is to run services 24x7, mainly the services that make or break our business, so things like refund services and order management systems, essentially anything that can make or break our business. But we also run scheduled jobs, and one-off containers once a month when we need something to be done, and so on.
Those processes are still important; they can be waited on, but they are still business critical. One thing that ECS doesn't do, though, is container orchestration: it's not going to wait for something to happen before running your container. So how do you orchestrate this? Well, that's where Apache Airflow comes in, because it is an open source platform for orchestrating and developing workflows, specifically workflows that have dependencies between them. What it does is allow you to create workflows called DAGs through Python code, which means that you can version control them, you can collaborate on them with your teammates, and you get all the good stuff that comes with that. But you can also create intricate dependencies between the jobs in your workflow, and you can create dependencies between the workflows themselves.

Look at a very simple example: making a pizza. I could guess you did it at least once in your life, so how does it go? You can do the toppings, prepare the base, and then bake. But you can also do the toppings and the base in parallel: your partner can do one, you do the other, chopping, taking care of everything that falls on the floor, etc. So you can do it in parallel, and in the end you will make a pizza, but you need both of these processes to finish first before you put them together. So you can visualize these jobs, or these tasks, as workflows, the DAGs, and then you can configure dependencies between them: which comes first, which runs in parallel, and so on (there's a sketch of this right after this part). And keep in mind Airflow is not imposing these dependencies; they come from the real world. You cannot make pizza without the dough and the toppings. You can without, what's it called, pineapple.

The real power of Airflow is that it comes with batteries included, and these batteries have been contributed by the community, because it is an open source tool; there are more than 10 million downloads per month, so you could say it's pretty popular. Big companies, big players such as Adobe and IBM and Airbnb and LinkedIn, they all use it, because it is a good tool. But let's talk about these batteries. One of the first batteries that comes included with Airflow is called operators. So what do they do? Operate, right? Not the way doctors operate; they operate on something else. As you can see in this example, we are using the Redshift operator, and what it allows you to do is provide a SQL query which is going to be executed on a Redshift cluster, without you having to know too much about how to do that thing. That's what operators do: they do the heavy lifting for us. Makes sense? OK.
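To make the pizza example concrete, here is roughly what that workflow looks like as Airflow code; a minimal sketch assuming a recent Airflow 2.x, with EmptyOperator standing in for the real work:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(dag_id="make_pizza", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        prepare_base = EmptyOperator(task_id="prepare_base")
        prepare_toppings = EmptyOperator(task_id="prepare_toppings")
        assemble = EmptyOperator(task_id="assemble")
        bake = EmptyOperator(task_id="bake")

        # Base and toppings can run in parallel; both must finish before assembly.
        [prepare_base, prepare_toppings] >> assemble >> bake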
So, enough about ECS and Airflow; let's talk about the features that make them work together, and make them work well together. The first one is called ECS Anywhere, and it is essentially a feature that allows you to reuse hardware that you already have, including Raspberry Pis, to extend your cluster and to run containers on those machines. The next one would be Airflow's ECS operator, and I think you can pretty much guess what this guy does: it allows you to run containers on ECS. You already see how they work together? Yes? No? Well, I'm here to tell you.

OK, so with these two, essentially, what we can do is orchestrate containers on our existing infrastructure using this guy, the ECS operator for Airflow, and what it does is run a container somewhere on ECS. Looking at the final solution architecture, this is what we have, this is what we ended up with when we said: OK, let's use ECS and Airflow together. There is a lot going on in this diagram, right? So let's go through it. This is the existing container management platform, the existing ECS, and it is somewhere in the EU. We have it, it's up and running, it's making money, it's all good. What we also have is the existing ETL solution, again located somewhere in the EU; it's also running, it's providing value. But then we also have two data centers that are remote and that are untouchable for us up to this point, one of them in New York. We can extend the existing container management platform to use this data center as well, by using ECS Anywhere, and we can do the same for the New York... for the San Francisco office, sorry.

So now we understand what the final architecture looks like; let's see it in practice. This would be the demo time, and experience has taught me that doing live demos is a no-no, the demo gods are not favorable. So I will just show you screenshots of what I did; it's easier, and you have to trust me, that's why it's easy, right? Cool. If you don't trust me, everything that I'm showing here is available in a repository; it's everything as code, so you can just run this on your own. Prove me wrong, please.

OK, so the thing we are going to do first is just look at the ETL scripts that we have, because they are the basis of everything after. We have two of them. The first one is the bonus calculation script; we want to give our people the Christmas bonus that they earned working hard for us, and the script itself is fairly simple: it just loads a CSV file, does some data processing, then saves the result into another CSV file and uploads it into AWS S3 (sketched below). The second script that we have is the tax calculation, and this is the one that calculates company taxes based on the offices that they have, lease contracts, whatever, it's too much for me to go into; and it must also include these new branch offices in the US. It is very similar to the first ETL script: it loads CSV data, processes it, uploads the results back into S3.

So let's verify that they work. This guy said he doesn't trust me, so I'm going to show a screenshot to prove him wrong. I can just run the script like this, with the valid inputs that it needs, and a few seconds later, if I list the contents of this S3 bucket (you can see I did it this morning), sure enough, it works, the result file is there. The same goes for the bonus calculation: it's pretty much the same deal, same steps, same target, and we get the results. Nice. And since we saw that these scripts work standalone, now would be the time to containerize them, because remember, we need to run these scripts as containers, they need to be scheduled somehow. Dockerfiles, docker build, docker push, boring, boring. The only thing you need to know is that they build and I pushed the container where it needs to be.
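The talk only shows screenshots of those scripts, but based on the description (load a CSV, process it, upload the result to S3), such an ETL script might look roughly like this; the column names, bucket, and bonus rule are all made up for illustration:

    import boto3
    import pandas as pd

    def run_bonus_etl(in_path="employees.csv", out_path="bonuses.csv"):
        df = pd.read_csv(in_path)

        # "Some data processing": a made-up bonus rule, purely illustrative.
        df["bonus"] = df["salary"] * 0.05

        df[["employee_id", "bonus"]].to_csv(out_path, index=False)

        # Upload only the computed results, not the raw employee data.
        boto3.client("s3").upload_file(out_path, "acme-etl-results", out_path)

    if __name__ == "__main__":
        run_bonus_etl()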
Because the next step is to actually use that container in an architecture like this one. What I have running right now is two sets of resources. The orange area is the resources running in AWS, provisioned through the same code that I mentioned, but we also have the blue area here, and no, I didn't buy any property in New York or San Francisco; I just spun up two VMs, they're good enough. What I also did is extend the existing container platform to use these virtual machines, running on my machine, as an external cluster. Makes sense? Clear? OK.

Why two? Because we have two ETL scripts, one for the Christmas bonus, one for the tax calculation, and because each of these machines has access to a different data set: one host has access only to employee data, while the other host only has access to property data, lease contracts and similar. Which means that our infrastructure is now ready, so let's run a container and see if it works. Here I'm just using the AWS CLI, and essentially telling it: hey, go run this thing on this cluster, and please use the external part of the cluster, which is the two VMs that I spoke about. And sure enough, looking at the logs, it works; it does something, it doesn't complain at me, that's good enough. But you can also notice that it runs on one of these machines: it runs on the bonus host, as I call it.

So what would happen if I do the same thing again? Remember, we have two machines. Would it still run on the same one, or would it use a different machine? I see a lot of people yawning, so let's do some stretching again: who of you thinks that the next time I run the container it is going to run on the same machine, and who thinks it is going to run on a different machine? You probably noticed I raised both hands; both are true, and that's thanks to the round-robin algorithm and the 50% chance of running on either machine, because we have two of them. This is the situation that we end up with, because the task can run on both of them, and it's a no-no; the hosts are not wrong, it's the data set that they have access to that is wrong. The tax ETL script can land on the bonus host and say: no can do, boss, I don't have access to these files.

So how do we approach this? Tags? Tags, as in the real metadata tags, or some other type of tags? It says tags, OK. Well, you are not wrong, my friend, thank you for speaking up. This here represents the algorithm behind how ECS schedules a task, how it chooses where to run your container. As you can see, it first looks at CPU and memory, meaning: hey, can I run this container with this much memory? Then it looks at the location, instance type, etc., meaning you can tell your container: hey, please run on instances that have access to a GPU, if you need a GPU; or: hey, please run in this location, because that's where my other infrastructure is. But what you can also use is something called custom attributes, or custom placement constraints, and as the name suggests, they are completely custom, meaning that you can tag your container instances with whatever you wish and then use those tags when you run your container. This is something we can use to tell each of these ETL scripts to run here, or there, or nowhere, because you can screw up these tags, these attributes, and then the container will simply not be scheduled. This is how the custom attributes are set: what we have, essentially, is a custom attribute called purpose with the value of bonus, and the same goes for the other one, purpose with the value of tax, and, as you can see, two different container instances. So now they are completely separate. What we can also do, when we run the container, when we try to schedule it, is say: hey, please include this placement constraint and only run this container on this specific set of machines, on this one machine.
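In boto3 terms, the two pieces just described look roughly like this: tag a container instance with a custom attribute, then constrain the task to it. Cluster name, ARN, and values are hypothetical:

    import boto3

    ecs = boto3.client("ecs")

    # 1. Tag one of the external VMs with the custom attribute purpose=bonus.
    ecs.put_attributes(
        cluster="acme-cluster",
        attributes=[{
            "name": "purpose",
            "value": "bonus",
            "targetType": "container-instance",
            "targetId": "arn:aws:ecs:eu-west-1:123456789012:container-instance/abc",
        }],
    )

    # 2. Run the bonus ETL task only on instances carrying that attribute.
    ecs.run_task(
        cluster="acme-cluster",
        taskDefinition="bonus-etl",
        launchType="EXTERNAL",  # the ECS Anywhere part of the cluster
        placementConstraints=[
            {"type": "memberOf", "expression": "attribute:purpose == bonus"}
        ],
    )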
And of course, by using these task placement constraints and the custom attributes that we just set, we can now guarantee that the proper scheduling is going to happen, such as this one. This is the last step, but also the most complicated one, and now it's time to schedule these ETL containers on our very specific infrastructure using Airflow, and using everything that we mentioned up to this point. So what I did is I went ahead and created two DAGs in Airflow, one of them for the tax calculation, and they are pretty similar in code. I did so by using this guy that we already met, like 10 minutes ago: the ECS run-task operator. To schedule these containers we can use code like this, and if you have a keen eye, you can also notice that it's very similar to how we ran the container using the AWS CLI. That's simply because we provide the name of the cluster where we want this thing to run; we also specify which task definition, or which container definition, to use, and where to run it, meaning external instances or internal. But we can also provide the placement constraint, saying: hey dude, this is the Christmas bonus calculation ETL script, please run it on instances whose attribute "purpose" equals "bonus"; and ECS is going to respect this. Now the container itself has been scheduled, and it ran, as you can see from the logs. We can also see that the exit code is 0. What does 0 mean? Can you... It was successful.
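The DAG just described might look roughly like this; a sketch, not the speaker's exact code. Names are hypothetical, and the exact import path and argument names of the ECS operator vary between versions of the Amazon provider package:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

    with DAG(dag_id="bonus_calculation", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        run_bonus_etl = EcsRunTaskOperator(
            task_id="run_bonus_etl",
            cluster="acme-cluster",          # where to run it
            task_definition="bonus-etl",     # which container definition to use
            launch_type="EXTERNAL",          # the ECS Anywhere instances
            overrides={},                    # no container overrides needed
            placement_constraints=[
                {"type": "memberOf", "expression": "attribute:purpose == bonus"}
            ],
        )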
But what the hell happened just now? Do we know? OK, let's walk through it a bit. The first thing that happened is that Airflow, as a tool, ran a DAG; it scheduled this DAG, and the DAG itself started a container on ECS. ECS then said: OK, this guy wants to schedule a container on this specific machine type in this specific location, and I can do this. So ECS schedules the container in the remote data center, on the remote machine that we have with access to the proper data. That's when the container executes: it starts the ETL script, it chugs along, processes the file, and in the end it uploads the results back to Europe. And this is fine, because these are just the results; it didn't upload any of the data on US citizens, it just uploaded a bunch of IDs and numbers, so this is fine, we are still compliant.

And because time is running out and we are having fun (we are having fun, right? yes, thank you), it's time to recap. We saw that these ETL processes can be pretty different: they can be simple, they can be complex, they can be blue, white, red, manually scheduled, etc. But what they are is necessary, because data comes in many shapes and forms and needs to be transformed and processed; you know this better than I do. What we also saw is that ECS is the right tool for the job, but it does come with tradeoffs. We also saw that Airflow is the right tool for the job, but it needs a bit of a boost from a container technology that started more than 10 years ago, proving that Docker is still valuable, right? If you use it in the right way. And in the end, something that I learned in my career along the way is that there are technical solutions to technical problems, but there are also non-technical solutions to technical problems, and vice versa; and the sooner you get this, the sooner you can advance your career and become the famous person that you can be. With this, I would very much like to thank you for your time and for listening to me. If you have any questions you can shout them out here, but you can also use our good friend Slido and ask any questions that you may have. OK, so we still have some time; please ask those questions before people start pouring in again. If you have them, don't just catch me around, ask now.

The processed data that you upload back: is it just scrubbed output of the container? No, it is a CSV file that is created by the container. And does that happen automatically, or is it part of the job, the migration from S3 to your native storage? It is part of the ETL script. Essentially, the ETL script is completely custom; it can be whatever. This is just a silly example of calculating Christmas bonuses, so it can be whichever you like.

There was another question here. Yes, please. I thought you would raise your hand, maybe I'm wrong. OK, yeah, please go ahead. You were mentioning the placement constraints, and I was thinking: so basically it is Airflow that does the scheduling here, while Kubernetes has taints and tolerations. Is it basically duplicating the same thing? OK, so the question is about custom task placement constraints, Airflow scheduling them, and whether that duplicates what the scheduler already does, since it already knows which workflows are supposed to run where. The reason it's set up this way is that Airflow is not tied to Kubernetes; it can work with Kubernetes as well, but in this case it was working with AWS. And yes, it is duplicating some of the work, and it is encroaching on the infrastructure part by scheduling this, but Airflow is the orchestrator; you are giving it the power to do so. And also operators: without operators you would have to write everything on your own, while inside Airflow it's just a few strings and everything is done. In this very specific example, yes, the operator itself is simple and we are just using the task; but the real power of this is that, because Airflow is written in Python and those workflows are written in Python... maybe you are not someone that does Python. If you prefer Rust or whatever else, the containers allow you to do it; you are not locked into Python.

I think we are out of time. We still have 2 minutes? OK, please, 2 more minutes, the session is still running. Is there another solution that might have done a similar job, and what would the technologies be? So the question is: have we considered using other solutions, such as Lambdas, to achieve the same task? And yes, it is a very valid question, but you have to think again about whether the tool has everything you need. Lambda is completely custom; it is the code that you put into it that is valuable. And Lambda also has limitations, such as how much it can scale in terms of memory, but also time limitations, because you can only run for 15 minutes; and what if we have a 10 GB CSV file that will take 3 hours to process? Also, the reason why we are using Airflow is because it is the ETL tool of choice and it comes with batteries included, meaning we don't have to write the CSV processing logic ourselves; it does it for us.

Please, go ahead. So I was just wondering about the experience: I guess you are running your own Airflow instance, and what was the experience? Is it maintenance heavy, maintenance free?
Depends on how you look at it. So the question is: are we running our own Airflow instance, and is it difficult to manage it, or is it difficult to just keep it running? That's why I said both. We have the old Airflow, and it is not updated very much, because we are bad at doing that; it's not Airflow, it's us, like in every relationship. But we also have another deployment of Airflow which is completely managed by AWS, and that one is much easier to work with, but also costs 5 to 6 times more. For now, if there are no more questions: thank you very much again for attending.

What I am working on at Red Hat is a team called Connected Customer Experience. We are collecting the health data from the OpenShift clusters of our customers, if they are OK with that, and we do, or try to do, some magic, and then provide value in terms of a better user experience for the customers. It might be services they can use, it might be improving the product based on that data, and so on. So we are basically focusing on phase 2 of this strategy, and what I spend most of my time on, and in detail what I do, is babysit problematic clusters; and by clusters I mean OpenShift instances, because if an OpenShift instance is happy, I don't see it, it's boring. We are interested in the cases where the clusters have some problems, so basically we are looking at problematic clusters and trying to reason about them, and the rest of this presentation will be tightly coupled with this. I am not a data scientist by training; I do data science by accident, maybe, or by necessity, because without it you can't do much in this business. I have an interesting relationship with data science, love-hate: I love doing data science, data science doesn't always love me back, but we are very good together.

One of the problems of OpenShift, when it comes to looking at problems or understanding problems, is the underlying Prometheus telemetry and alerting, and the distributed nature of the cluster, where each component has its own definitions of what a problem is. The trouble is that when there is some central issue, a root cause, in the cluster, multiple components start complaining about it. So what you end up having is a timeline where you have multiple alerts triggering at the same time: there are maybe 20 different alerts that we have seen around 9 a.m., but the cluster did not have 20 problems, maybe 2, at most 3. The problems are different from the signal that you are getting. So the question is: can we do something to reason better and get closer to the root cause? Which I believe I described on this next slide: basically, we try to group these signals that we have about the cluster into related things, and ideally also be able to reason about the cause and the consequence of these problems. And that's why this talk is about correlation and causation, because it's tightly coupled with this particular problem. We will now move away a bit from OpenShift itself, and I will talk about this problem in broader terms.

One thing, when it comes to the grouping: many times you start thinking about clustering, and people that have tried or seen anything in this space, that's the first thing that they suggest; of course it's clustering algorithms, just do that. And the question is really: should I go down this path or should I not? So we tried that. We tried some clustering algorithms: we did our embedding thing, we did principal component analysis, we did dimensionality reduction, all sorts.
And it kind of works. One of the problems that we had there: these are basically different symptoms, and if they land together, it somehow means that they were related or happened at the same time. The problem with this approach is that it's hard to interpret. You can feel that they are probably related, but there are limits to reasoning about it. That's why, in machine learning, it's often mentioned that there is a problem with explainability; that's one of the reasons: it does something, but you can't tell much about how it does it. So we can also try a different approach, and that's what this talk will be about. You have all this fancy machine learning, deep learning, artificial intelligence stuff, but there is still this thing called statistics that has been around for some time, and I don't think we should forget about it. So we will be taking a bit more statistical approach to trying to find these relations.

I believe that's the end of my slides, and now I will switch over; and that's how we know that there is some data science going on, because all of a sudden there is a Jupyter notebook, so you can be sure that there is some data science going on. There might be some other things going on in a Jupyter notebook, but we also have data, so there is something with data; and in the next cell we also have some LaTeX equations, so that is science. So in the data we have science; we are doing data science.

Before we jump to some data, a small magic trick. I never rehearsed this, so it might not work at all. I have here my kid's bag, with some toys that he keeps in it, and I have a special capability called a chromatic ear, which means that I can hear color. I have never tried it; I just have a feeling that I know how to do it. So I randomly choose some item and... listen... I think it's red. It's red! I haven't looked at it, I just heard it. So how did I do that? "You've got only red objects in the bag." Oh, I forgot to show you. OK, I have different colors in here; I am doing this for the first time. Of course there is a trick, and it's a really silly one: the cube is the only shape that's red. You can't find a cube in any other color. But you would not be able to do this trick if I gave you this bag, because you would have no information about the statistics and the probabilities inside. And that's what this is going to be about: being able to do this kind of thing.

It's a nice segue to something called Bayesian probability theory, which is all about these things. If I use this notation here (we will be seeing it a lot), it is the probability of one thing given that the other thing is true. In this particular case, it was the probability of the object being red given that the object is a cube: it's one, 100%. But we can also do it the other way around: if I have red objects, what's the probability of one being a cube? I have three red cubes and one red sphere, so it's three quarters. One misleading thing about Bayesian theory is that the probability of one thing given the other is different when you switch the direction; and I will keep using this bag as the visual representation of that, because it makes it easier.
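In code, the bag trick is just counting; a tiny sketch with a made-up toy list, showing that P(red | cube) and P(cube | red) are different things:

    toys = [
        {"shape": "cube",    "color": "red"},
        {"shape": "cube",    "color": "red"},
        {"shape": "cube",    "color": "red"},
        {"shape": "sphere",  "color": "red"},
        {"shape": "sphere",  "color": "blue"},
        {"shape": "pyramid", "color": "green"},
    ]

    def p(event, given=lambda t: True):
        pool = [t for t in toys if given(t)]
        return sum(1 for t in pool if event(t)) / len(pool)

    def is_red(t):  return t["color"] == "red"
    def is_cube(t): return t["shape"] == "cube"

    print(p(is_red, given=is_cube))  # P(red | cube) = 1.0, every cube is red
    print(p(is_cube, given=is_red))  # P(cube | red) = 0.75, 3 cubes vs 1 sphere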
So, we are not going to talk about objects and colors; we will be talking about symptoms. I mentioned the health data in OpenShift: a symptom is basically some property that a cluster has or doesn't have. But I will not talk about OpenShift here, I have to say, because it doesn't tell you much; I will talk about something closer to humans, which is human diseases. Sorry about that, I was not able to come up with anything more positive.

So we have a data set of patients here that I just created for this occasion. The way to interpret it: I have, for example, 10 patients that have a positive flu test and have this set of symptoms, fever, cough. We also have some lucky patients that have flu but no symptoms; we just happened to test them positive, but they are completely fine, spreading the thing around. And I think I included one unfortunate person who had chickenpox, with all the chickenpox symptoms, but who had also broken his leg, so double bad luck there; but I included them just to make sure that we could have multiple reasons for the symptoms, because you want to be able to separate that. So here we can reason about the consequences of flu, like fever. How it relates to OpenShift is that in the OpenShift world, when you have just a stream of alerts, you don't have any metadata saying "DNS is down, you can expect this component or this container to be broken"; we don't have that metadata. The task is: can we find these relations based on the statistical data that we have, on our data set with our patients? How can we start reasoning about the relations between different symptoms?

We will be using Bayesian theory here, looking at the probability of one thing while the other thing is happening. First we will calculate the probability of having fever when you have flu. It is not surprising that when you have flu, there is a high chance that you have fever in that situation. To compute that on our data set, we basically just need the number of patients that have both fever and flu (in the toy example, that's the red cubes); that is the numerator. And the denominator is the whole set, all the patients that have flu; that is the red set as a whole. So it's nothing more than counting patients, and we get about 60% on this data set. That seems high, but we can't tell for sure yet whether they are related. Why is that? We don't know the probability of fever in the whole population. In this case it would probably be intuitive to say that only a few percent of the population has fever; but if fever were defined with a threshold of 36 degrees, then basically everyone would have it, and it would be useless for our analysis. So it is still good to calculate the probability of fever by itself, and then we can compare these values.

What we did here is something that is called the likelihood ratio. At the top we still have the probability of fever given flu, the 60%, and we compare it to the probability of having fever in the whole population. We get some number, and if it's higher than one, that is a sign that it's significant: there is some relation, there is some correlation. The higher the number is, the more these things are correlated. So that's good.
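As a sketch of the same computation on a made-up patient list (my numbers, not the ones on the slide), the likelihood ratio is just counts and divisions:

    patients = (
        [{"flu": True,  "fever": True}]  * 10   # flu with fever
      + [{"flu": True,  "fever": False}] * 7    # flu, no fever
      + [{"flu": False, "fever": True}]  * 8    # fever from other causes
      + [{"flu": False, "fever": False}] * 50   # everyone else
    )

    def prob(symptom, given=None):
        pool = [x for x in patients if given is None or x[given]]
        return sum(1 for x in pool if x[symptom]) / len(pool)

    p_fever_given_flu = prob("fever", given="flu")   # ~0.59
    p_fever = prob("fever")                          # ~0.24

    # Likelihood ratio > 1 suggests flu and fever are correlated.
    print(p_fever_given_flu / p_fever)               # ~2.45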
As I mentioned, though, in the OpenShift world we don't really have the information about relations, about what's causing what; we can just see the alerts of the individual components. So is there some way to establish a direction? What if we switch the order? Right now we look at the probability of fever when you have flu and compare it to the whole population; we can also try to switch it around, taking the probability of flu while you have fever, just for fun. Again, nothing new; it's really almost the same as the first one, we just switched the order, which basically means we just switched the denominator, and so the result is different as well. There is still a high probability of having flu while you have fever, nothing surprising about that, and we can again compare it to the whole-population probability of flu, which is this ratio. So before we had 0.6 over 0.23, and now we have 0.47 over 0.18, and the question is: what do these come out to? It's almost the same number; the tiny difference is just because I was rounding while doing this. Basically these two values are the same. We took different directions, flu versus fever and fever versus flu, we even had different probabilities going in, but when we calculated the ratio, we ended up with the same number. For me that was surprising at first; then I worked through the Bayesian theory of how this works, and it has to be this way. So the conclusion is that these numbers are symmetric, and they don't give us any indication of the causality, for exactly this reason. When people say correlation is not causation, one of the reasons is that many times, when you calculate these correlations, they are symmetric; you can't tell anything more about direction.

Sorry, just a quick question: is it always going to be true that the numbers are the same when you switch them, or is it because of the data? It's always that way; you can derive it just from Bayesian theory. I don't have enough time to do it here, but it's fun, for sure.

So the symmetry works against us here, but we can compare different things. I will repeat what we compared before: having flu given fever, against having flu without any other condition. We can also formulate it differently: compare the probability that I have flu while I have fever against the probability that I have flu while I do not have fever. That's the difference; I don't compare against the whole population, I compare against flu-without-fever. This is something that I learned later is called relative risk; at first I was just making up names, but if you search for it, it's relative risk. So: the probability of flu while I have fever, versus the probability of flu while I don't have fever, and you calculate that. Flu with no fever is again not that high a number, because there are many patients with other symptoms who have neither flu nor fever; having flu without fever basically means you need to have some other disease, and those are also common. We ended up with 0.9. And we can do the same the other way around: we compare the probability of fever given flu against fever given no flu, and we end up with a bit different number. So basically this is the equivalent of our first situation, where we were comparing two probabilities; only now we condition on fever versus no fever, and the same for flu. Again I have different ratios there, and the question is: will they be the same? If they were, I would probably not be giving this presentation.
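Relative risk on the same made-up patient list: conditioning on "fever" versus "no fever" instead of comparing against the whole population. Unlike the likelihood ratio, the two directions are not forced to agree:

    def p_given(symptom, cond, value):
        pool = [x for x in patients if x[cond] == value]
        return sum(1 for x in pool if x[symptom]) / len(pool)

    # P(flu | fever) / P(flu | no fever)
    rr_flu_vs_fever = p_given("flu", "fever", True) / p_given("flu", "fever", False)
    # P(fever | flu) / P(fever | no flu)
    rr_fever_vs_flu = p_given("fever", "flu", True) / p_given("fever", "flu", False)

    # The two numbers differ (~4.5 vs ~4.3 on this data), and that asymmetry
    # is what gives a hint about direction.
    print(rr_flu_vs_fever, rr_fever_vs_flu)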
They are different. Now we see that flu-given-fever has 5, and fever-given-flu has 4, so it tells us something. This is basically where I started thinking: can we use this to actually indicate the causality of the problem? So that's pretty much it for the theory, and the assumption is: the higher the score, the more likely it is that the symptom in the first position is causing the whole thing. I also prepared a function that calculates these values in a table, so that you have all the numbers in one place, and we see that flu-given-fever has a higher relative risk than fever-given-flu, so we would indicate that flu is causing the fever. I can also show this for the likelihood ratio; you've seen this 2.58 number, so we can compare.

So how does this actually work? Well, it's nice to have some synthetic data set where it just works; so if you want to understand how it works, I have here a simplified data set where I still have patients: 10 patients with flu and fever, 10 patients with flu and nothing else, 10 patients with fever and nothing else, and no other patients. Would it work in this case? It doesn't, because it can't. And why doesn't it? The relative risk even comes out low, and the same, for flu-given-fever and fever-given-flu; the symptoms are just called this way, but the data are not really real, so it will not give us the correct answers. One thing that this approach really needs is quite good data. First of all, you need some other examples that involve neither flu nor fever; so let me include some patients that don't have these symptoms, and now we start seeing the correlation between flu and fever, because otherwise it can't reason about the probabilities: it knows nothing about how common fever or flu is. The other thing that we also need is some imbalance. Right now it's still showing similar correlations in both directions, because we have 10 patients with flu and fever and 10 patients with flu and nothing at all, so we can't reason much. But we can increase the number of patients with flu and fever, and we can also include other reasons for the fever than flu, say 50 of them, and now it starts showing up: now we see that flu-given-fever is much higher than fever-given-flu. You need other reasons for the fever to be present in the data set to be able to reason about this. It's not that different from how all these fancy large language models work; they need a lot of data, a lot of examples, to be able to deduce the probabilities. It's much more simplified here, but you still need data. On the other hand, you can also use this as a test to tell whether the data you have are really useful or not: garbage in, garbage out, and you can't do much about it.

So that was some experimentation with the data. Now, can we do it for the whole data set? I was comparing just the flu and fever; next I want to see the holistic picture, so I have prepared this table where you can see all the relations, right?
It's clear? Oh, you are data scientists... OK, let's do something better. This is basically the graph representation of that. We ignore the low values: we just don't put a line between, say, headache and broken leg, because the relative risk was below one, or below the threshold, and we include the lines between the rest. And you already see some components in this graph, and you can apply some graph theory there: rash, chickenpox and fever could be considered as one common thing, these related groups of things; in graph theory I learned it's called a clique or something like that, these components. Similarly, fever, flu, cough, sore throat and headache are somehow related, together with tiredness; and on the broken-legs-and-bones side we have this whole component that is related to pain, because I set the data set up to be this way. But so far you see no arrows there; I was still doing only the undirected part. We can also draw the arrows, the directions, and now we have directions, where the direction means we think one thing is the cause of the other, so the symptom is pointing to the cause. And you can see that chickenpox and flu here are the only nodes that have only incoming arrows; everything points at them, which is, in my opinion, right. I'm not saying it's perfect: it still relies on how good the data you have are, and all these other things; things are not that simple when it comes to combining multiple symptoms and differentiating them. But there are some properties of it that are much nicer, such as: it doesn't require any GPUs (no GPUs were harmed), though you can still run it on a GPU if you can't wait five minutes and want it done in one. And if you compare it to the machine learning algorithms, you can really reason about it; it's still just numbers that you can put some meaning behind, you can iterate fast, and you can understand how good your data are. I mentioned the cons as well.
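A sketch of how those arrows can be derived, reusing the helpers from the earlier snippets: compute the relative risk in both directions for every pair of symptoms and point the edge from the symptom toward the likelier cause. This is my own reconstruction of the idea, not the speaker's exact code:

    from itertools import combinations

    def rr(a, b):
        # Relative risk of symptom a, conditioned on having symptom b or not.
        return p_given(a, b, True) / p_given(a, b, False)

    THRESHOLD = 1.0
    edges = []
    for a, b in combinations(["flu", "fever"], 2):  # extend with more symptoms
        forward, backward = rr(a, b), rr(b, a)
        if max(forward, backward) > THRESHOLD:
            # A higher score in the "a given b" direction marks a as the cause,
            # so the edge runs from the symptom b to the cause a.
            edges.append((b, a) if forward > backward else (a, b))

    print(edges)  # e.g. [('fever', 'flu')]: fever points at flu as the cause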
We have a cluster which fired multiple different alerts. We don't really want to reason much about each of them individually, so we can do time grouping, which is just grouping things together based on the time they started happening, plus additional contextual grouping, and I will highlight this particular group. When we did the time grouping, we had these etcd-members-down alerts and a bazillion other alerts, but there was some noise introduced there, because some other incident was happening at the same time. When I include the additional contextual grouping, it's now split into two things: we still have the etcd members down, but the pod security violation, which was not related, is now separate; just from the statistics we had, we could separate those two. So that's basically the area of applying this approach to the real world, not just playing around. Thank you for the question.

I'm an engineer at Red Hat, currently on the Debezium project, and I'm going to introduce you to the Debezium project. We will first have a brief look at what Change Data Capture is; then, of course, we will focus on the Debezium project; from there, because we also cover some non-relational databases, we will talk about Change Data Capture for MongoDB specifically; and after that we will have a look at the possible use cases for Change Data Capture. At the end there will be a place for questions and hopefully some answers.

So what is Change Data Capture? I very much like this definition: Change Data Capture is a process of recognizing changes in a data system and delivering them to some downstream consumers, so that these consumers can take actions based on these changes. This diagram portrays one of the variations, where Change Data Capture is based on mining the transaction log. Is there somebody here who doesn't know what a transaction log is? Just for the remote audience: a transaction log is a sort of canonical source of truth for the database, a thing which records all the operations and transactions, which can then be used, for example, to recover the database when something nasty happens to it. We have some users executing all your regular create, update and delete operations and modifying your data; then we have a CDC platform which reads the database's transaction log and emits events about these changes to any possible downstream consumers, which can then, for example, store your events as data in some data lake or data warehouse, or perform some calculations on top of them.

Besides this transaction log mining approach there are some alternatives: mainly, we can implement CDC based on queries or based on database triggers. However, these approaches have some disadvantages. For query-based Change Data Capture, one of the obvious disadvantages is that you have to constantly execute your queries, which puts additional strain on your database, and you are not able to capture all the changes: how do you actually write a query which captures deleted data? The answer is, you don't, you can't. With triggers you can recognize all the changes to your data, but you get a very database-specific implementation, and you are also limited by the expressiveness of the procedural language supported by your database, so you might have issues getting the events out of the database.
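A minimal sketch of why the query-based approach falls short, under assumed table and column names (orders, updated_at) and a print stub for the downstream consumer; polling re-runs the query every round, and a delete is invisible to it:

```python
import sqlite3

def emit_event(row):
    print("change event:", row)  # stand-in for a real downstream consumer

last_seen = ""  # high-water mark remembered between polling rounds

def poll_changes(conn: sqlite3.Connection):
    """One round of query-based CDC against a hypothetical orders table.
    Every round re-runs the query (constant extra load on the database),
    and a row DELETED since the previous round simply no longer matches
    anything: the query cannot return what is gone."""
    global last_seen
    rows = conn.execute(
        "SELECT id, data, updated_at FROM orders"
        " WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for _id, data, updated_at in rows:
        emit_event((_id, data, updated_at))
        last_seen = updated_at
```

Log mining avoids both problems: the transaction log already records every operation, deletes included, without extra queries.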
The most common use cases for Change Data Capture include, but are not limited to, things like data replication, cache invalidation, or search index updates; you can build other things on top of it as well. Last but not least, you can use CDC with microservices to implement some architectural patterns, as we'll see at the end of the presentation; one of these things is reliable data exchange between microservices.

This brings us to the Debezium project. So what is Debezium? Debezium is an open-source, comprehensive Change Data Capture platform which covers multiple databases. We have an active and large community, because the project has been around for a while, and it has already been used in some relatively large-scale deployments by our community.

When I say that we support multiple databases, we have to talk about the Debezium connectors, and these connectors can be divided into three different groups, so to speak. First, there are the core connectors: the source of these connectors is part of the main Debezium repository under the Debezium organization, and these connectors are developed, maintained and supported by the core Debezium team. Then we have community connectors: again, these connectors are housed under the Debezium organization umbrella on GitHub, but they live in separate repositories and are developed by our community contributors; here we have Db2, Cassandra, Vitess, and our newest addition, Cloud Spanner. The third group is, for me, probably the most interesting one: the independent connectors. These are connectors built by completely third parties, different companies outside of the Debezium team and Red Hat, built on top of our common CDC connector framework. To me these are showcases that the project has taken on a life of its own and has now spread beyond the boundaries of our core team and Red Hat in general.

So how can you run these connectors? First and foremost, our database connectors are actual implementations of Kafka source connectors. Do we have somebody here who doesn't have any experience with Kafka? Kafka is, let's say, an event messaging platform, and there is a component called Kafka Connect which provides the ability to run services, or connectors, which get data either into Kafka or out of Kafka; the ones getting data into Kafka are called source connectors. So Debezium connectors are Kafka Connect source connectors.

The other option is the embedded Debezium engine. The Debezium engine actually started as a test dependency, because we needed a means to prove that the connectors work, but as it usually goes with good testing tools, it started being used in production by some of our community members, so we decided to improve it, and we've actually built another runtime around this Debezium engine; that's how Debezium Server was born. Debezium Server fulfils pretty much the same role as Kafka Connect does, but it's a simpler, more lightweight runtime, and it's really meant for different data sink architectures where you either don't want to or can't use Apache Kafka. Besides being able to store these events in Kafka, we support many other sinks: you can get the events into Amazon Kinesis or Google Pub/Sub, or simply send them over HTTP. The newest addition to this ecosystem is the Debezium operator, which was just released in its first preview version; you can use it to easily deploy Debezium Server on top of Kubernetes.
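Since the connectors ultimately land change events in Kafka topics, here is a sketch of what consuming them could look like; the topic name, broker address, and kafka-python client are assumptions for illustration, not part of the talk:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Topic name follows the usual <prefix>.<database>.<collection> convention
# of Debezium source connectors; prefix and broker address are made up here.
consumer = KafkaConsumer(
    "shop.inventory.orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for record in consumer:
    if record.value is None:
        continue  # tombstone record that follows a delete
    payload = record.value.get("payload", record.value)
    # op is c(reate), u(pdate), d(elete), or r(ead) for snapshot events
    print(payload.get("op"), payload.get("after"))
```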
Now to the third important part of this talk: we also need to say what MongoDB is, because that's going to be the database of choice for today. MongoDB is a document-oriented NoSQL database. NoSQL means that it doesn't deal with relations the way, for example, MySQL would, and document-oriented means that the data is actually shaped as documents. Other than that, the structure of the database is pretty similar: you have the database itself; a collection is the alternative to what a table is in a relational database; and a document is the counterpart to a row in that table (there is a short sketch of this mapping below). The documents are stored in a format called BSON, which is a binary representation of JSON. However, it's not just binary JSON: BSON goes beyond what regular JSON is; for example, there are additional data types supported in BSON that you wouldn't find in JSON. MongoDB also has native support for observing changes to the data.

Before we actually get to extracting the data, we need to talk about how the database can be run. First and foremost, there is the standalone deployment. That's not really meant for production; it's recommended just for testing and development environments, and since there isn't any way to extract data changes from standalone deployments, we don't care about those. That brings us to the replica set, which is the basic unit of a MongoDB deployment; it provides basic data replication and HA features. And if you want actual horizontal scaling of your data, you can use MongoDB sharded cluster deployments.

The primary deployment type is a replica set, as I already said, and it's a typical primary-secondary topology. We have a client, and first of all, all the writes are always directed to the primary, which holds the data, and the data is replicated to the secondaries; the secondaries can also take some of the load off the primary if you allow reads from the secondaries. That's the basic topology. If the primary goes down, the secondaries enter an election phase and elect a new primary among themselves. This works because there is a heartbeat going on between all the replica set nodes, which means that every node in a replica set is aware of the existence of all the others.

The data replication part is the core of what we need to know for Debezium. Here we can see some data copying going on, through the so-called operations log. At the beginning I talked about databases having a transaction log, and this is the MongoDB equivalent of it. So what is the operations log, or oplog for short? In MongoDB it's a capped collection holding a rolling record of data modifications. What does that mean? First, as I said, a collection is the equivalent of what a table would be in a relational database. Capped means that there are some restrictions on top of it, and the oplog specifically can be restricted in two ways: there is a maximum configurable size of the oplog, and there is also a configurable time retention in hours. If the oplog grows beyond the maximum size, or a record stays in the oplog for longer than the maximum retention time, entries start dropping out of the oplog, and that means those changes are lost. Long story short, the oplog keeps all the changes done to the database for a specific period of time, so depending on the configured size and retention of the oplog, that's how far back in time you can go to replay all the changes which went through the primary.
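Going back to the vocabulary mapping above (database, collection, document), a quick pymongo sketch of what that looks like in practice; the connection string and names are placeholders:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # placeholder address
db = client["shop"]          # database
orders = db["orders"]        # collection: the counterpart of a table

# A document is the counterpart of a row, but it nests freely, and BSON
# round-trips extra types (ObjectId, dates, binary) that plain JSON lacks.
inserted = orders.insert_one({
    "customer": "alice",
    "lines": [{"sku": "ab-123", "qty": 2}],  # nesting instead of a join table
})
print(orders.find_one({"_id": inserted.inserted_id}))
```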
And this oplog is actually the means by which data replication is achieved in a replica set deployment of MongoDB. If a secondary replica went down, and the primary had an oplog configured to retain data for, say, two hours, then as long as the secondary comes back up in less than two hours, it can just replay all the missed changes. If the secondary was down for longer than two hours, it has to discard all its data and copy everything from the beginning.

For sharded clusters, this is where it starts getting complicated. We first have to add a component, mongos, which is the Mongo router that decides which shard a request should go to, and then we have the separate shards. But first, what sharding actually achieves is a means to horizontally scale your data: if you have, for example, a customers collection, then a portion of the data is distributed to each of your shards, and the Mongo router is the only part which actually knows about all the specific shards. It's sort of a glue layer on top of regular replica sets, deciding where a write should be routed, how to merge results, or how to perform sorting across the shards. The shards themselves are individual replica sets with individual operation logs, so there is no single oplog spanning the whole cluster the way one exists inside a replica set.

Getting the changes out of this oplog can be tricky sometimes, so since version 4, MongoDB provides an abstraction called change streams. This is a feature where you can subscribe, or open a change stream, against the database, and the database itself will push information about the changes, in the form of change stream documents, to the client. This is the feature Debezium actually employs to extract the changes from MongoDB. You can start the change stream at some operation time, which again can go only as far back as the maximum retention of the oplog; change streams are still just an abstraction on top of the oplog. When you get a change stream document from the stream, you can use its id as a resume token, to start receiving the changes that happened after a certain operation in the past.

How does the entire change extraction process work with Debezium? In Debezium we have two phases. The first phase is called snapshot: at the beginning we query and transfer all your existing data into events, to get a starting point. But before we do that, the very first thing we do is obtain a resume token; this marks the point in time where we started the connector, and from there we will later stream all the changes. Before we start streaming, though, we transfer all the currently persisted data into events. Once this initial data copy phase is done, we switch into what we call the streaming phase, where we use the previously obtained resume token and start streaming the changes. With each change document we receive from the open change stream, we do two things: first, we transform it into an event and deliver it into the sink, which could be Kafka, Redis, HTTP, Google Pub/Sub, or any other messaging system we support; second, we store the resume token for this particular change into our offset storage, so that if there is an outage, for example the connection to the database is lost, we can resume streaming from that point in time. I mentioned change streams being an abstraction on top of the operations log, so this offset can be at most as old as the maximum retention of the oplog that backs the change streams.
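A sketch of the change stream and resume token mechanics just described, using pymongo's real watch() API; the connection string, database and collection names are placeholders, and Debezium would keep the token in its offset storage rather than a local variable:

```python
from pymongo import MongoClient

# Change streams require a replica set; address and names are placeholders.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["shop"]["orders"]

resume_token = None  # Debezium persists this in its offset storage

# resume_after=None starts from "now"; passing a saved token resumes right
# after the last processed change, as long as that point in time is still
# covered by the oplog retention window.
with orders.watch(resume_after=resume_token) as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))
        resume_token = stream.resume_token  # save after every delivered event
```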
So if you, for example, have an oplog which retains changes for two hours and the connector is down for more than two hours, then when you restart the connector, depending on the configuration, the entire process will start from the beginning and you will again have to undergo the initial copy of the data, because otherwise you would be risking losing data, and that is something we cannot allow: we want to guarantee the opposite, that you don't lose events.

It gets a bit more complicated when we get to sharding, because in this case we have two options for extracting the data. In both cases we are opening change streams, but in the so-called replica_set connection mode, what we do, in order to achieve higher throughput and higher performance, is open an individual connection to each of the shards in the sharded cluster. This has the advantage of larger throughput, because we stream changes from each of these shards separately, but it also means that we can never guarantee, for example, the order of changes performed across these shards. What we always guarantee is that for one specific document we are observing, all the changes will arrive in the right order: creation always comes first, then the modifications in the right order, and if you delete the document, the delete will always be at the end. (Is there a question? OK.) So we can guarantee the order of these changes for one particular document, but we cannot guarantee the order across documents, because there is no way for the reader to know whether a change in shard two occurred before or after a change in shard one.

We can have these guarantees in the other connection mode for sharded clusters, which is called sharded. In this case we open the change stream against the mongos router, which glues everything together, because the mongos router in turn opens change streams to each of the shards and performs the synchronization as it should be. The price for it is some overhead, and some delay before you actually get these changes from all of those shards.

That was all theory, but let's also look at how we could approach some practical problems using Debezium and MongoDB. Imagine we have a simple order service which needs to do two things: store an order, and send a message to another service called the shipment service, and we need to do all of that reliably. What's the problem? Dual writes are prone to inconsistencies. So if we want to guarantee that we do both of these things, what's the first solution that comes to your mind? Any ideas? At the same time? Yeah, not exactly at the same time, but sort of both or none: you have to guarantee that both succeed. A distributed transaction with two-phase commit would be an option, but you may not want to use distributed transactions, and if the system where you send the message is, for example, Apache Kafka, it may not even support them. So how do you actually ensure that if you store the data in the database, the message is always delivered as well? The solution is something called the Outbox Pattern, and it starts with eliminating one of these external systems: since the database belongs to the service, the system we eliminate from the storing process is the message delivery.
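To make the dual-write hazard concrete before moving on, a tiny sketch; the collection, producer, and topic name are placeholders, not the talk's code:

```python
def place_order_naively(orders, producer, order):
    """The dual write described above: `orders` is a MongoDB collection,
    `producer` a Kafka producer, "shipments" the topic, all placeholders."""
    orders.insert_one(order)           # write 1: the database
    # A crash right here leaves a stored order the shipment service will
    # never hear about; flipping the two writes instead risks a shipment
    # message for an order that was never stored. No transaction spans
    # both systems, so one of the two failure modes always remains.
    producer.send("shipments", order)  # write 2: the message broker
```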
So instead of directly sending the message, we save the message into an outbox collection together with the original order, and we do both of these inserts in a single multi-document transaction. That eliminates the external system from the write path, but we still have to send the message to Apache Kafka, so we need to introduce a sort of worker process responsible for the delivery of these messages. The way it looks is: we have the order service, which stores both the order and the message, and then we need some thingy which will read these messages and guarantee that they get delivered into Apache Kafka. So what is the process for this worker thingy to actually send a message? First it needs to poll the collection and check whether there are any unprocessed messages; it gets a message, then it sends the message, and then it marks this specific message in the outbox collection as processed.

What are the disadvantages of such an approach? Yes: you cannot guarantee that you do this in near real time, and you are again polling the collection, so you have a problem similar to every query-based change data capture approach. And you can't really get around the fact that, in order to reliably mark a message as processed, you would again have to do both things in one transaction, the sending of the message and the marking as processed, and that cannot be solved. So the best we can do is an at-least-once guarantee, meaning a message may be processed more than once. And last but not least, if you are the developer who is supposed to implement this pattern, it requires a lot of boilerplate: you've gone from just sending a message to implementing an entire worker process which has to query a database, send messages, and ensure it does all of that reliably. That seems like a lot of work.

However, this is exactly where Debezium can help you, because it gets rid of the entire need to implement this worker process responsible for delivering these messages. You just point Debezium at your database, and it completely takes over the role of the worker process: it observes any changes by opening a change stream against this outbox collection, and it emits these messages to the desired Kafka topic. There are several mandatory but configurable parts which these messages need to have: we need something called the aggregate type, which is just there to distinguish different message types; we need some ID of the object related to this message; and you need to provide some payload, which is any data (or no data) you wish to send with this message. Then there are additional fields which you can add to enrich the message; for example, in this case I have a field called type which distinguishes the type of the event.

Let's demonstrate with a simple demo. First of all, I have a local Kubernetes cluster running, where I have a few things: I have Kafka deployed through the Strimzi operator; I have a tooling image which provides things like kafkacat, so we can actually have a look inside a Kafka topic; I have a Mongo database running; and I also have the Debezium operator. The first thing I'm going to do is deploy Debezium, and while that happens we can have a look at the configuration. This is the configuration for Debezium, and we can see three important parts. First we have the sink configuration portion, which says that we will be sending these events to a particular Kafka; if we wanted to, we could replace this Kafka configuration with, for example, Google Pub/Sub.
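The write side of the pattern, mirrored here in Python rather than the demo's JavaScript (a sketch: the connection string and collection names are placeholders, and the field names match the aggregate type / aggregate id / type / payload parts listed above; adjust them to your event router configuration):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["shop"]

def place_order(order: dict):
    """Outbox write side: one multi-document transaction stores the order
    and its outbox message; Debezium takes care of the delivery."""
    with client.start_session() as session:
        with session.start_transaction():
            result = db["orders"].insert_one(order, session=session)
            db["outbox"].insert_one({
                "aggregatetype": "order",  # demo routes this to events.order
                "aggregateid": str(result.inserted_id),
                "type": "OrderCreated",    # extra field, surfaced as a header
                "payload": order,
            }, session=session)
```

Either both inserts commit or neither does, which is exactly what removes the dual-write inconsistency.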
Then we have the source configuration, which unsurprisingly configures a MongoDB connector, because we want to extract these changes from MongoDB. And the most important part for us is this transform part, because the outbox pattern support is implemented as a data transformation. It's again really simple: it just says that we want to use this Mongo outbox event router; this is the configuration of the topic to which we want to send these messages, which in our case will be "events." followed by a name that we configure as the aggregate type; this additional configuration just says where the type field should be placed; and then we want to unwrap our data, to get rid of the envelope format which we use for other messages.

In the meantime, it looks like Debezium has actually started, so first we will simulate what a service would do with the data. We will open our Mongo connection and just demonstrate that there is no outbox collection yet: it's not there. And I have a simple JavaScript snippet that does a simple simulation: first we open a MongoDB session, and then, in a transaction, we create a new order plus an outbox message which contains this order as its payload, and we store both of them in the transaction. So now we've stored it, and we can have a look into Kafka and see whether we've actually received these messages. I will just connect to Kafka: I want consumer mode, I want it unbuffered, I want it to go from the beginning, and I want the topic named events.order; and just to have it look nice, we will format the output. We can see that we've received this message; in this case we are showing just the payload, but if we looked at the entire Kafka message, including the message envelope, we would get a slightly different output, and there you can see that in the headers we also have this OrderCreated value for the type field. If we now cancel the same order (I will cheat a bit and copy it from here), you can see that we get another message, and in this case the payload is empty, because we haven't stored anything; we are just canceling the order.

So, the takeaways from this: CDC is quite a useful tool in event-driven architectures; Debezium is a project that provides the means to do change data capture; and the outbox pattern provides an efficient means for microservices to reliably exchange data, as an alternative to distributed transactions and two-phase commits. If you have some questions, I will try to answer them. Still at-least-once delivery, though, right? Yes. Anything else? Although specifically with Kafka, and this is something Vojta here would know more about, there is now the possibility to have exactly-once, right?
But that's specific to Kafka, because Kafka is the one doing the deduplication. If there are no more questions, thank you for your attention; if you want to ask something off the record, I will stay around for a while.

Hi everyone. I'm a principal software engineer, and today we're going to talk about how we wanted to implement a machine learning model and ended up with a simpler statistical model. This is the agenda for today's talk: we're going to cover some possible challenges you might face when training a model, and then, even after you have a model trained, you still have a long way to go to bring that model to your product. We'll come back to this slide at the end of the talk.

If you've been following AI news and papers over the last five years, you might have noticed that there is a lot of information about the algorithms themselves, but way less about the data; only in the last year has the emphasis actually shifted toward the data side of the models, and there is still a lot of catching up to do there. So today we want to talk about how the same algorithms perform differently with different data. We are not going to talk about the algorithms themselves, because there is a lot of information about them on the internet; we want to focus on the challenges that most people would face trying to bring this to production, and we're going to tell you about our experience.

To give you a bit of context: we are part of OpenShift, which is an enterprise Kubernetes solution by Red Hat; you may know about it, it's basically for operating containers. What we do is get a bunch of data from our customers' OpenShift clusters, process that data, and generate a couple of graphs, notifications, emails, and recommendations for the customers to better manage their clusters. The project we participated in, and the one we're going to talk about today, is update risk prediction: it basically predicts whether an OpenShift update will break the cluster or not.

Before starting with engineering the model, you first need to make sure that you understand the problem you are going to solve; otherwise you will be completely lost. So let's talk about the problem statement first. First you need to identify what actual problem your model will try to solve. Operating a system has a known cost, but there is an even higher cost in getting that system out of a degraded state, because you have emergencies and you have users pressuring you, so it's usually very expensive, both in time and money.

These are some examples of real-life failures: imagine having to deal with a water leak on a Monday morning when you need to go to work, or an airplane engine failure during takeoff. Another example (I'm not sure I found the right icon, but this was supposed to be a tsunami or an earthquake in a highly populated area), and this one is a bridge collapse, because there were either too many cars or it was unmaintained. All of these can be mitigated or monitored. For example, airplanes usually have regular inspections of everything, like engines and landing gear; the same goes for bridges. And for earthquakes and tsunamis we have monitoring, so scientists or weather agencies will send out a warning when they detect some shakes. It's way cheaper to try to prevent a failure than to deal with the outcome of that failure; it can be
expensive in money, and as we've seen with the airplane, the earthquake, and even the bridge collapse examples, it can cost lives. We are not dealing with the human-life factor, but we do try to save money for our customers.

Some examples closer to home, since this is DevConf: imagine a service failing to serve requests because it's out of resources; you can mitigate that by setting up monitoring that automatically scales it up and down, or you can just plan the capacity upfront. Or you try to deploy a new version of your software, and late in the process you notice that there are some failed dependencies; you can use tests, and tune the tests to detect that sooner in CI/CD, or you can use a staging environment, or you can use all of them if you really want to prevent these failures.

Let's think about modern cars. Modern cars are a combination of hardware and software, increasingly so, electric cars especially. They have hundreds of sensors measuring everything, and those cars send a lot of telemetry to the manufacturer. As examples of those sensors: they can measure the state of the battery, the state of the brakes, the tires, whether there are any oil leaks, what the oil level is, and so on; there is a lot of information.

We probably have people from different countries here, and probably in every country there is a car checkup, where you bring your car to some government-approved agency, they check it, and they tell you whether you are allowed to go on the streets or not; usually once a year or once every four years, depending on the age of the car, at least in Spain. The way that car checkup goes is: you take half a day off work, you drive there, you wait there for two hours (true story), then you go through the checkup, and at the end of it the engineers say, well, your brakes are not really working, you have a problem with the discs, or you have bald tires. After that you take your car to the mechanic, you fix it, and then you go for that checkup again. So you have days of your time wasted that could have been prevented if there was proper monitoring of the car.

So what if we could do something different? What if your car were able to tell you whether it's going to fail the checkup, and what you need to fix before you actually go? Or, closer to our example: remember, our application is that we want to predict upgrade failures for our software, because after an upgrade has failed, it's usually an emergency, especially if it affects customer workloads, and a lot of people need to jump in and spend a lot of time. So what if we could predict those upgrade failures?