Hello, good morning, welcome to this presentation about batch processing pipelines on OpenShift. My name is Matij Maciowski, I work for Red Hat in the Connected Customer Experience team, and I would like to share with you how we process data from OpenShift in OpenShift.

So first of all, what do we do? OpenShift clusters that are subscribed to the Insights program send some health data to Red Hat. The data is processed and analyzed, and the results are returned back to the customers in the form of recommendations on how to fix or update their environments, to prevent or fix issues that they might be experiencing. We also use the data internally and process it further: for example, OpenShift engineering uses it to improve the product, and various support groups use it to help customers fix their problems more efficiently.

I will talk just about the processing for internal use; the rest of the whole process is basically owned by the Connected Customer Experience team. If you are interested in the other parts, you might visit the other talks by my colleagues at this conference, and you'll learn more about the whole process and the whole setup.

Since this is a talk about data processing, I would also like to introduce our data. On ingress, we get the health data from the clusters, about 250,000 archives per day. The archives are basically a tarball of files, mostly JSON. We receive those on ingress and process them, and the results are stored in our data lake, where we have more than 100 tables and produce more than 100 million rows every day. It's actually a lot more, because we process some data multiple times, but this gives you a rough understanding of the amount of data that we deal with. Besides the data lake, which is there for further analysis, we also build and maintain some dashboards where we surface the data for various groups within Red Hat, and we try to present the data in a comprehensible way so that they can use it correctly.

So now it's time for my questions. First, who has some experience with data processing pipelines, and who has none? OK, thank you. Does anybody use Tekton here? Yeah, thank you. And the last question: is there any Tekton contributor in here, somebody who develops Tekton? OK, thank you. So let's go on.

The mandatory agenda. I don't want to cover the basic things that are covered by the usual tutorials. I would like to focus more on the things that are often not mentioned there: how you should structure your pipeline tasks so that they are easy to maintain and sustainable in the long term. We'll cover some basic topics that will be needed for the talk, we'll also go through some information about Tekton, and the rest of the talk will be about what we do to build sustainable pipelines for the job that we do.

The first thing that should be clarified is the kinds of pipelines. There are two basic types of data processing pipelines, and as I will be talking just about batch pipelines, I'd like to explain the difference between the two approaches.

First, there are streaming pipelines, where the architecture is built around messaging. There are workers that are running the whole time, listening on the messaging queues, and the processing is triggered by the incoming data, the incoming message. The workers usually process a single entity at a time; they don't see the bigger picture. The messaging is what glues the workers together: a worker takes a message, processes it, and the result goes to another queue or to storage somewhere. This is a great architecture for real-time processing, usually on the ingress of data, where you process the data as it comes in. For that, it's the right architecture for the job.
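To make the streaming pattern concrete, here is a minimal sketch of such a worker in Python. This is an illustration, not our code: an in-memory queue stands in for a real message broker, and the processing logic and payload shape are placeholders.

```python
# Minimal sketch of a streaming worker: it runs the whole time, processes
# one message at a time, and the result goes to the next queue. queue.Queue
# stands in for a real message broker; the payload shape is hypothetical.
import queue

incoming: queue.Queue = queue.Queue()
outgoing: queue.Queue = queue.Queue()

def process(message: dict) -> dict:
    # The worker sees a single entity, never the whole data set.
    return {"cluster_id": message["cluster_id"], "status": "processed"}

def worker() -> None:
    while True:                            # listens on the queue forever
        message = incoming.get()           # processing is triggered by data
        outgoing.put(process(message))     # messaging glues workers together
        incoming.task_done()
```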
The second type of pipelines, the one that we will talk about and the one that we use for data processing in my team, is batch pipelines. The architecture is a bit different. It's built around a pipeline manager, which has the description of the pipelines. A pipeline is a series of tasks with a defined order of execution, and the pipeline manager knows which tasks to process, in which order, and how. The pipeline is triggered by some event, often time-based; for example, we use scheduled runs of pipelines. A task inside the pipeline runs just for the time when it processes the data; the pipeline manager then shuts the task down and runs something else. The tasks have access to the data lake and usually process the data sets as a whole. This is a great architecture when you need to process data, do some predictions, or build reports in multiple steps, so in that case this is what you should use.

A bit more about pipeline managers. The pipeline manager is basically the heart of a batch processing pipeline, and its job is to run the tasks in the right order, to solve the concurrency between the tasks, and so on. There are really a lot of solutions on the market, plenty of them open source, and they differ in many things: in the platforms they integrate with and in the complexity of the tool. Some of the managers are really lightweight and just orchestrate a pipeline description; some are really complex, full frameworks that help you with creating the pipelines visually and so on. At the bottom of the slide there is a list of some well-known pipeline managers. The most often used are Airflow, Argo CD or Argo Workflows, Tekton, and Jenkins; you probably know some of these.
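To illustrate the batch model before we move on, here is a minimal Python sketch of a pipeline as tasks with a defined order of execution, which is essentially what every manager on that list works with. The tiny model and the names are hypothetical, not any manager's real API.

```python
# A pipeline is a set of tasks with a defined execution order; the manager's
# core job is to resolve that order and run each task only when its
# predecessors are done. All names here are illustrative.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Task:
    name: str
    after: list[str] = field(default_factory=list)  # tasks that must finish first

def execution_order(tasks: list[Task]) -> list[str]:
    """Resolve the order in which a pipeline manager may start the tasks."""
    graph = {task.name: set(task.after) for task in tasks}
    return list(TopologicalSorter(graph).static_order())

pipeline = [
    Task("ingest"),
    Task("transform", after=["ingest"]),
    Task("report", after=["transform"]),
]
print(execution_order(pipeline))  # ['ingest', 'transform', 'report']
```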
So why do I talk about Tekton? As a data engineer, you usually don't have a choice of your tools, for two reasons. One is that when you are processing big data, the tooling is expensive to maintain, and companies usually have some tools already hosted or bought, or there are some solutions available, so you basically pick what is available in your environment. The second reason is that the other option would be to run the tools on your own, and data engineers usually don't have the skills to run such complex tools in production quality. They also don't have the time, because they focus on the data, not on the tools.

And that's the situation where we are. We currently operate our pipelines on Argo Workflows, and it's a self-maintained system. It was not easy to set up with the skill set that we have in the team, but it works just fine. Now we face a challenge: we need to replicate our environment across multiple namespaces in the cluster, and it's not trivial for us to set up and maintain multiple Argo instances. At the same time, we noticed that in the namespaces that we use in the OpenShift cluster, we have OpenShift Pipelines preinstalled. OpenShift Pipelines is basically branded Tekton provided with OpenShift. So we were thinking whether Tekton could be the tool that we switch to, because it's provided to us and maintained by somebody else, and that could be a solution to our problem with the workflow managers.

A few facts about Tekton. Tekton is Kubernetes-native; that's the most important one, because we want the manager to be stable and run smoothly in the OpenShift environment, and this is satisfied. It's open source, which is great, and there is a well-established open source community, because the Tekton project has been around since 2018, if I'm not mistaken, so there is a lot of experience behind it. The project has excellent documentation, so it's easy to get started, and it's not very complex. It looks like it could be the tool for the job.

Just to clarify the vocabulary used in Tekton: they have pipelines, and a pipeline is composed of tasks. Every task maps to an OpenShift pod, so one task runs on a single pod in the cluster; that's important to understand. The tasks are composed of steps, and every step is mapped to a container. So you can have multiple containers within the pod, and you define your tasks with this kind of structure.

We looked at the common use cases for Tekton on the internet, and it presents itself as a solution for CI/CD. CI/CD pipelines are the most common pipelines run with the tool; the upper diagram is a typical structure of a CI/CD pipeline. Tekton is also often used in machine learning for the training of models; that's the bottom diagram, which usually runs one task after another.

The question ahead of us was whether Tekton would work with our pipelines. This is basically our daily pipeline. Not the whole one, just part of it, but you can get the idea of how complex it is compared to the usual CI/CD pipeline. We were wondering, because running the pipeline once is one thing, but running it on a day-to-day basis with everything around it might be a big challenge for the tool. So we identified some key areas where we wanted to do some experiments, to understand whether Tekton would cope with our pipelines.

These are the areas that we identified and wanted to cover in our spikes, or experiments. And this is a spoiler, because the colors show how it went. The green ones went fine: we solved what we wanted. The orange ones worked as well, but with some compromises and workarounds. The red one is not a blocker, but it needs some more investigation.

If I go from the top: time-based execution. It's not typical for CI/CD pipelines to be triggered by a cron job. It's a bit complex, but it's possible with the Tekton tooling: you can define an event listener, and there is a mechanism of triggers where you can transform the parameters and send them to the pipeline. In the end, we run the pipelines with a curl command in a cron job; it just sends an event to the listener and the pipeline triggers. So that works.
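For illustration, this is roughly what such a cron job does, sketched in Python instead of curl. The listener URL and the payload field are hypothetical; the real payload shape depends on how your TriggerBinding maps the event to pipeline parameters.

```python
# Sketch of the time-based trigger: a scheduled job POSTs an event to the
# Tekton EventListener service, and the trigger starts the pipeline.
# The URL and the payload are assumptions for this example.
import json
import urllib.request
from datetime import date

LISTENER_URL = "http://el-daily-pipeline.my-namespace.svc:8080"  # hypothetical

payload = json.dumps({"run_date": date.today().isoformat()}).encode()
request = urllib.request.Request(
    LISTENER_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # any 2xx status means the listener took the event
```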
We were also wondering how the log observability in the tool would be, because the amount of tasks is huge, and if you want to quickly navigate to a task that is causing issues, it might not be easy. It was surprisingly easy. Tekton has an excellent CLI that's really easy to use, the output is sometimes more comprehensive than the UI, and the UI is well integrated into the OpenShift console. So you have everything at hand, the user experience is really good, and it's quite easy to navigate even the complex pipelines and the structure inside them.

The third thing that we investigated was performance, and we found out that the performance is really comparable to the Argo Workflows that we are using now; there were no significant differences. So this is also something that we don't have to worry about.

Before I jump to the orange section, I would like to share how we structure our tasks, because it's important for some of the workarounds that we had to do to bypass the limitations of the tool. And I need to stress that the limitations we found are not Tekton's fault or anything like that; it's just that we are not using Tekton for what it is intended for. We are stretching Tekton's features in a direction that was not intended.

So these are our best practices for the tasks. They developed over the time that we've run the pipelines, and they proved to be a useful set of rules when designing tasks.

From the top: the task needs to be idempotent, and that's really important. That means that when you run the task multiple times on the same data, you need to end up with the same results in the data lake or in the databases at the end. That's important because you often need to rerun the pipeline; for example, some data is missing or something changed and you need to rerun the whole processing, and you don't want your data to be screwed up by this operation. So this is a really important part of the sustainability of the pipelines.

Next one: the ability to run for past days. We need to deal with historical data because we do some time-based analysis, so the tasks need to understand the date for which they are run. It's especially important for backfills, or when there are outages and we need to add the missing data to the history; this is also a key feature.

Single responsibility: this is very useful. In our case it means that one task equals one table in the data lake. It's not a good practice to do more things in a task, because then it gets more complex when you want to handle your tasks in a general way, for example when you want to generate the pipelines. The tasks need to be as uniform as possible.

For the same reason, no data sharing between tasks. That means that a task can only take data from an external source or the data lake, and store data to the data lake. No exceptions are allowed, like storing data to some temporary storage that's mounted to the pod and remounted somewhere else. These hacks will screw up your pipelines very easily.

The last two practices, well, from a programmer's perspective they are probably not great practice, but for the maintenance of the task and pipeline code they proved to be practical, and they are about the organization of the code. We have our tasks defined in Python, and one task is one class, and it includes all the metadata that is needed for building the documentation, for building the pipeline definitions, and basically for navigating in the tasks. So the logic is bundled together with the metadata, in one file. If you need to change something, you have the exact place where to look. It's also easy to do reviews on the merge request, because you can easily see that nothing was left out. Before, when we had these things split, it happened that, for example, the documentation wasn't updated and didn't match the data. So this worked for us. And obviously the execute method of the task class needs to be unified, so that you can operate with the tasks in a general way. If this is satisfied, you can easily work with the list of tasks and use them for multiple purposes: for example, building the documentation, building the graphs for the pipelines, and so on. If all the tasks are uniform, it's easy.
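As a minimal sketch of this task pattern, here is what such a class could look like; the names and metadata fields are illustrative, and the data lake calls are placeholders, not our real code.

```python
# One task, one class, one table: metadata lives next to the logic, and the
# unified execute(run_date) lets us handle all tasks generically. Everything
# below is a reduced illustration of the pattern.
from datetime import date

class ClusterHealthTask:
    """Builds exactly one table in the data lake (single responsibility)."""

    name = "cluster_health"
    target_table = "data_lake.cluster_health"   # one task == one table
    docs = "Daily health summary per cluster."  # reused to build documentation

    def execute(self, run_date: date) -> None:
        # The date is a parameter, so the task can run for past days
        # (backfills). Overwriting the day's partition keeps reruns
        # idempotent: the same input date always ends in the same rows.
        rows = self.read_from_data_lake(run_date)   # data lake in...
        self.overwrite_partition(run_date, rows)    # ...data lake out

    def read_from_data_lake(self, run_date: date) -> list[dict]:
        return []  # placeholder for the real query

    def overwrite_partition(self, run_date: date, rows: list[dict]) -> None:
        pass  # placeholder for the real write
```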
So now let's jump to the orange part, the other challenges. The first one was loop support. In the Argo templates we are using now, we use loops quite heavily: imagine a set of tasks that you need to run multiple times with different parameters. That's what the loops are for, and there is no such thing in Tekton. We often use this when we parallelize the processing of big data sets: we split them into smaller parts and run them in parallel. It's also used in backfills, where we tend to loop over the dates and run the whole pipeline for every day. So this is an essential feature, and it was the most challenging part of the experiments, because it forced us to change the strategy of how we build the pipelines.

The solution that we found was to create our own model for the pipeline: we build the pipeline in the model, and then we have a, let's say, compiler that can transform the definition of the pipeline into the YAML that's understood by Tekton. During the transformation we basically expand the loop and render all the iterations explicitly in the YAML. The downside of this solution is that we can't have dynamic iterations: we need to set the number of iterations before the pipeline is run, which can be a limitation for some use cases. In our pipelines this pattern could be avoided, so it's not a blocker for us, but it's the biggest limitation of the technology that we found. There is something called Matrix, a new alpha-stage feature in Tekton, that could somehow fix the problem, but it's not production-ready and it's not in OpenShift Pipelines yet.
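A heavily simplified sketch of that expansion step, with hypothetical names; a real PipelineRun needs far more than this, but it shows how a loop in our model becomes explicit iterations in the output.

```python
# Tekton has no loops, so the "compiler" renders every iteration explicitly
# before the run. This also shows the downside: the values must be known
# up front, so there are no dynamic iterations.
from dataclasses import dataclass

@dataclass
class Loop:
    task_name: str
    param: str
    values: list[str]  # fixed before the pipeline is run

def expand(loop: Loop) -> list[dict]:
    """Render one explicit Tekton-style task entry per iteration."""
    return [
        {
            "name": f"{loop.task_name}-{index}",
            "params": [{"name": loop.param, "value": value}],
        }
        for index, value in enumerate(loop.values)
    ]

backfill = Loop("process-day", "run_date", ["2024-01-01", "2024-01-02"])
for entry in expand(backfill):
    print(entry)  # two explicit tasks instead of one looped task
```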
The next one: backfills. This is the most complex area, because if you imagine the backfill as a huge pipeline, with a pipeline run for every day, you easily get to a pipeline that has thousands of tasks. This is a real challenge for the environment you run in, because it's really demanding on resources, and also a challenge for the tool, because you stretch it to its limits. The limitations of Tekton that we hit in this area were, first, loop support, which we already discussed. Then, there is no pipeline nesting, meaning a pipeline composed of other pipelines. That makes defining the pipelines difficult, but we overcame it with our Python model, where we allow nesting, and then the transformation code again expands the pipeline into a flat view where everything is explicitly spelled out. So the transformation of our model does the work for Tekton.

Another limitation is the default time limit in Tekton. By default, Tekton tends to kill every pipeline after one hour of running. That's not very practical for the backfills, because they often run for several hours. And we found out that this limit can be overridden only when the pipeline is triggered from the command line, which basically rules out the possibility of running the backfills from the UI. We usually run the backfills from the CLI anyway, so again, no blocker, but it's something to be aware of.

What was a problem is control over the garbage collection of the finished pods. In the OpenShift namespaces there is a quota on how many pods you can consume, and it includes the finished pods that are no longer running. You easily exceed these quotas with the backfills, because there are thousands of pods that need to be created. Tekton has some kind of garbage collection, but you need cluster-level permissions, which we don't have, and it doesn't have the granularity that we need: we want tasks that finish successfully removed in a short time, while tasks that fail we keep for 24 hours, so that they can be reviewed and acted on by the person who has watch duty over the pipeline. So we ended up with our own solution: we created a simple script that removes the finished pods we no longer need, and we run it in a cron job, like every hour or so, so that it removes the unwanted pods from the namespace and frees the resources.

OK, retries, that's the only red one; we didn't find a workaround for it. Retries are useful when you are processing data from external sources and there are some, I don't know, outages in the sources, for example restarts of services and so on. You don't want to kill all your pipelines because of these. So it's useful to define on the task a retry with some delay: you try again, say, three times, and if it fails all three times, the whole pipeline fails. Tekton doesn't allow us to set the delay. That is important, because you need to give your retry some time; you need something to change in the environment that you depend on. So we basically need more testing to see whether this affects how we run the pipelines on a daily basis.

Setting limits on memory and CPU: this is also important, because some of our tasks need more than 20 gigabytes of memory for processing the data, so we need to carefully define the limits for every task so that they fit in the environment. We have 64 gigabytes of memory in total, so it's difficult to squeeze everything in. This is an important feature for us, and Tekton supports it. The only issue is that the limits cannot be passed as a parameter to the task, so you can't have generic tasks that you reuse in multiple places with different memory limits, for example. This was solved again by our model: we define every task in the pipeline specifically and have no reusable tasks, so we would have had to ditch that pattern in the pipeline anyway.

And the last one is the day-to-day life with the pipelines. We were wondering how hard the life of the person who takes care of the pipelines would be when they want to watch everything that's happening. The limitations that we found were these. The long-running pipelines we've already discussed, and garbage collection was also touched on. Then, it's not possible to pause pipeline executions. Sometimes it's important to pause a pipeline that's resource-hungry, to leave some room for another one that's more important to finish in time. This is not possible, and it was resolved by a simple script: if you stop a pipeline, the record of what ran stays for a while in Tekton, so we have a script that basically produces a new pipeline containing only the tasks that were not run in the previous run. So that's the workaround; it's a bit hacky, but it sort of works.
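The idea behind that script, sketched with hypothetical structures; the real one reads the finished tasks from the previous run instead of taking them as an argument. The same trick serves the rerun case described next.

```python
# Build a new pipeline that contains only the tasks that did not succeed in
# the previous run, dropping satisfied dependencies so it stays valid.

def reduced_pipeline(all_tasks: list[dict], succeeded: set[str]) -> list[dict]:
    """Keep only the tasks that still need to run."""
    remaining = [t for t in all_tasks if t["name"] not in succeeded]
    for task in remaining:
        task["after"] = [d for d in task.get("after", []) if d not in succeeded]
    return remaining

tasks = [
    {"name": "ingest", "after": []},
    {"name": "transform", "after": ["ingest"]},
    {"name": "report", "after": ["transform"]},
]
print(reduced_pipeline(tasks, succeeded={"ingest"}))
# -> only 'transform' and 'report', with 'transform' no longer waiting
```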
A smart rerun of a failed pipeline is a similar thing. If a pipeline fails somewhere in the middle, you fix the problem and you want to run it again from that point. That's not possible with Tekton; you need to run the whole pipeline, which is a waste of resources. So again, we use the same script to create a new pipeline that contains only the tasks that didn't run in the previous run.

So we found solutions for these problems, and it seems that it could work. What I want to stress is that we were quite surprised by the user experience of Tekton: even with these huge pipelines it's easy to navigate in the UI, it's easy to see what's running, what's failing, and in which state the pipelines are, and it's easy to navigate the runs that have plenty of tasks and find what you need.

So that's basically it about our experiment. Now we are at the stage where we need to evaluate our findings and decide whether we want to switch to Tekton or not. If we decide to switch, the next step will be to create some, hopefully open source, project where we bundle all the workarounds and hacks in one place. We would like this project to serve as a boilerplate for anybody who wants to create a copy of our environment within, let's say, one day, to be able to start easily. So we want to put the scripts and the workarounds in there, all with documentation and some guides on how to set everything up. That's our plan.

That's it for my talk. Thank you, and now there's some time for your questions. We have, I guess, four minutes.

[Audience] Did you give this feedback to the Tekton community? All the limitations you mentioned are ones we would actually expect.

The question was whether we gave the feedback to the Tekton community. We didn't do it yet, though most of the limitations that we found are already known and discussed in the Tekton community forums. There are some solutions or discussions in progress, so we didn't find anything new that would need their attention.

[Audience] A bit more ambitious: are you considering contributing to Tekton to address some of those limitations?

That might be an option. If we decide to go with Tekton, we might join the community to affect which features have priority. I should repeat the question for the stream: the question was whether we are considering joining the Tekton community.

[Audience] I think we took a similar journey a couple of years ago, when we took our service and moved it onto Tekton. One thing that brought a challenge was how to properly test the tasks and the pipeline at different levels. Have you done any research in this area as well?

The question was how we do testing of the pipelines and the tasks. The tasks are defined in Python, so it's easy to do testing for them, and we have them quite well covered; some even have custom-developed tests, so the coverage is good. The problem is with the whole pipeline. There we trust Tekton to be able to process the description as it is defined, and we'll probably add tests for the transformation of the pipeline model to the Tekton code. That's probably the area where issues can happen, so we'll have to cover that as well.

I think that's it, because we are out of time. Thank you for your attention.