Hello, everybody, and welcome to yet another OpenShift Commons briefing. Today we're going to tell you how to automate and scale your data pipelines the cloud-native way. Guillaume Moutier from Red Hat will be reintroducing himself and telling you a little bit about himself, and then giving us a deep dive into our data pipelines initiative. So Guillaume, please take it away. There will be live Q&A at the end of this; I will get the slides from him, and we will post it all on blog.openshift.com and on YouTube, as usual. Take it away, Guillaume.

Okay, thank you, Diane. Hi, everyone. I'm Guillaume Moutier, greetings from Canada. I'm a technical evangelist at Red Hat in the storage business unit, and I work mostly on data, not storage itself, but data: the way you consume it, the way you move it around, especially in the AI/ML field. Today, though, we're going to look at a standard data pipeline and the way you can automate it and scale it automatically the cloud-native way. So let's get started.

First, to set the stage, I want to step back a little and look at what the characteristics of a cloud-native platform can be. Here I'm listing the things that are most important to me, but what you must never forget is why you are doing things, and that is the business outcomes. What are we trying to achieve when we implement these kinds of architectures? For me, the most important things are speed, efficiency, and foremost, adaptability. We know by now that technology moves fast, real fast, and our businesses and organizations have to be able to handle that kind of change. Adaptability is, and has been, my main concern throughout my career. And now we have the tools and the technology to achieve these business goals.

So let's take a look at what I would call a legacy data pipeline architecture. I call it legacy, but I know for sure that for most organizations it's still the standard way of doing things. We're looking at architectures that are very tightly coupled and not easily scalable. Take a very basic application where a user saves a file to some storage so it can be processed by an application. The way it works for most applications is that there is some storage mounted on a server; it can be a shared folder or something like that. The file is sent to the storage, which has to be mounted on the application server, over CIFS or iSCSI, but in any case through some kind of hard connection between the storage and the application server. And then it's consumed by an application, let's say some Java application.

The problem with this architecture is, first, that everything has to sit close together because of this mounting requirement: you cannot mount a CIFS share across thousands of kilometers, it doesn't work well. There's also a scalability problem. With this type of connection, if I want to stand up another application server, for example because I want to scale my application's capabilities, it has to have exactly the same configuration as my first server: exactly the same storage connection, exactly the same mount point and behavior. That's fine if you have one or two servers, but if you have tens or hundreds of them, that's a burden you have to carry. Now let's look at a more cloud-native way to do these kinds of things.
Well, we can think of an application, again, where a user just sends a file, but this time to object storage, and here it's a fully disconnected mode. Consuming object storage is just an HTTP connection: it's only a PUT or a GET, and then that's it, you're finished. There is no remaining connection between my user application and the object storage. The same goes for your data processing functions: they can consume this storage directly, as they need it, which means they can be wherever you want, and scaling them is much easier to do (there's a small code sketch of this pattern a little further down).

And now we have what I would call intelligent storage. The latest releases of Ceph include bucket notifications. That means that whenever something happens in the object storage, it can send a notification, let's say, to a Kafka bus, which will itself trigger some data processing function. I like to put Kafka in the middle of these kinds of architectures because it can act in two different ways. First, as a buffer: let's say my data processing function is not there, or not ready yet. The notifications keep coming into the Kafka bus, and when the function is ready, the notifications are consumed from the topic and the function can carry out its processing. But Kafka can also act as a hub for all those notifications. We can imagine different processing functions, maybe in different places, different data centers, performing different operations, but all feeding from the same topic.

So we'll try to do this for real: we'll build an application that works like this. For this demo I took the example of ACH payments. For those of you who are not in the United States, ACH can be seen as a kind of electronic check, electronic payments. It can be a customer paying a service provider, an employer depositing payroll money into your checking account, all those kinds of things happening electronically. For my demo I will implement this very basic pipeline where someone buys something from a merchant and an electronic payment happens. The way it works is that the transaction is sent to the bank of the merchant, and this bank produces what is called an ACH file. It's a standard file, we'll come to it in a minute. That file is sent to the Federal Reserve, where it is processed and made available to the receiving bank, which is the bank of the customer. The receiving bank is the one that processes the transaction and debits the account of the customer. That's the basic process of ACH, and as a reference, here is the ACH file itself. It's a very old-fashioned way of describing transactions: the first line gives information about the bank itself and some basic information about the company, the second line gives more details about the company, and then you have all the transaction records with the different customers, the amount of money they have spent, and the receiving bank each transaction should be sent to.

This is how I have implemented it inside OpenShift. I have some kind of generator, we'll come to it, that generates fake transactions and sends those files into an object storage bucket. Each new file then triggers a notification that is sent to the Kafka bus.
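As an aside, to make that "it's only a PUT or a GET" point concrete, here is a minimal sketch of both sides in Python with boto3. The endpoint, credentials, and bucket names are hypothetical placeholders, not the ones from the demo.

```python
import boto3

# Minimal sketch of the disconnected pattern against an S3-compatible
# endpoint such as a Ceph RADOS Gateway. Endpoint, credentials, and
# bucket names are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="DEMO_ACCESS_KEY",
    aws_secret_access_key="DEMO_SECRET_KEY",
)

# Producer side: a single HTTP PUT, and then the exchange is over.
# No mount point, no persistent link between application and storage.
s3.upload_file("transactions-0001.ach", "incoming-files", "transactions-0001.ach")

# Consumer side: the symmetric HTTP GET, from wherever the function runs.
s3.download_file("incoming-files", "transactions-0001.ach", "/tmp/transactions-0001.ach")
```

Because each call is a self-contained HTTP request, scaling out the consumers is just running more copies of the same code; nothing has to be mounted or configured on the host.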
For the on-demand processing, I will be using Knative Eventing and Knative Serving. That's the way in Kubernetes and in OpenShift to create on-demand pods, on-demand functions. So I have a service that listens for Kafka events and then spins up a deployment of the container that will process the file, process the transactions. What it does here is create an ACH file for the transactions and send it to the bank of the merchant. For this I have a few buckets: I do my demo with seven different banks, so seven different buckets to which the files are sent, depending on the merchant sending the file.

At the origin bank, those files are processed. Basically, the function looks at all the transactions and creates new ACH files, this time addressed to the destination bank, the receiving bank. All those files are created and deposited into different buckets, this time buckets belonging to the receiving banks, where they are processed. The standard process would be to look at each transaction and debit the account of the customer. What we will do in this demo is simply look at the amounts processed and sum them up, just to see how many transactions were processed and how much money went through this whole pipeline.

To implement this, I need a few things. First, some Kafka topics to be able to send my notifications: here at the bottom you can see the ODFI topic and the RDFI topic, where the notifications will be sent. Then I have some buckets that I have created in my storage; here are all the buckets that I have. And don't worry, you will have access to the code and everything needed to reproduce the demo, so I won't go into too many details right now.

Then we program the bucket notifications themselves. The way it's done in Ceph, in RHCS, Red Hat Ceph Storage, is that you create a topic that points to your Kafka endpoint. So here I create a topic with the name RDFI and point it to my Kafka cluster. Then, for each bucket, I use a reference to the topic I just created: it's a simple PUT request against the name of your bucket (this screenshot is from an old demo, so the name should be different here), with the notification verb and the configuration of the topic I want to use in Kafka (there's a sketch of both calls just after this part).

Finally, before we go on to the live demo, this is the transaction job. It triggers a container that generates our transactions, and it runs 60 times with a parallelism of five. In Kubernetes Job terms, that's 60 completions processed five at a time, which means I can create five files at a time inside my OpenShift cluster.

So let's go, let's do this. Here I am in my project. I can see that I have three pods, the Knative pods, the OpenShift Serverless pods, that are listening for events. In OpenShift Serverless I also have three different services, which will split the ACH files or process them, and they are ready. But you see, the services are ready, yet there are no pods running: right now we are scaled to zero. So let's create these transactions. Here I will use the exact same file I just showed you, and now it's being put into motion. We can see that we have five containers being created, based on the transaction container image that I designed, and they will begin to create new transactions.
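Here is that sketch of the two configuration calls. Ceph's RADOS Gateway exposes an SNS-compatible API for notification topics, so the standard boto3 clients can drive it; the endpoint, credentials, bucket, and Kafka service names below are hypothetical placeholders.

```python
import boto3

RGW = "http://rgw.example.com:8080"  # hypothetical RGW endpoint
CREDS = dict(
    aws_access_key_id="DEMO_ACCESS_KEY",
    aws_secret_access_key="DEMO_SECRET_KEY",
)

# 1. Create a notification topic on the gateway that pushes to Kafka.
#    Ceph implements an SNS-compatible API, so the boto3 SNS client works.
sns = boto3.client("sns", endpoint_url=RGW, region_name="default", **CREDS)
topic = sns.create_topic(
    Name="RDFI",
    Attributes={"push-endpoint": "kafka://my-cluster-kafka-bootstrap:9092"},
)

# 2. Attach the topic to a bucket: this is the "simple PUT with the
#    notification verb" on the bucket, wrapped by boto3.
s3 = boto3.client("s3", endpoint_url=RGW, **CREDS)
s3.put_bucket_notification_configuration(
    Bucket="rdfi-bank-1",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "Id": "rdfi-new-files",
                "TopicArn": topic["TopicArn"],
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```

Once this is in place, every object created in the bucket produces a message on the RDFI Kafka topic, which is exactly what the Knative event source is listening for.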
And as new transaction files are created, they trigger containers. They trigger the ODFI split, which looks at the ACH files, splits them, and puts them into the right buckets. They also trigger the RDFI split, which looks inside each ACH file and splits the transactions to send them to the receiving banks. And then the RDFI process, which processes the transactions themselves (there's a sketch of this processing logic a little further down).

It's easier to follow with a live view like this: here is a dashboard where I have my pipeline as a graph. We can see that we have already generated 15 different transaction files, 16 now. So far, 16 have been processed and dispatched to the different banks of origin, and we have processed eight of them: those files are split for the different receiving banks and sent to the receiving banks' buckets, where they are processed. So far we have processed 75 of those. Of course, there are many more files at this stage, because we take each originating file and split it, sending each transaction to its own receiving bank.

We can see, as the process goes on, that the CPU usage is increasing: of course, we are spinning up more pods as we need them. RAM usage is going up as well. There is some lag here on the deployments, but it should catch up in a few seconds. And we can see here the value of the transactions that have been processed so far; it's going up, and we are now at about $9 million. What I generate here is a random number of transactions, between 300 and 500 of them for each file, and each amount is between $1 and $2,000. That's the kind of transactions I'm generating.

And here we can see the different deployments that we now have; we are up to 15 pods. We have five instances of the create-transaction pod, which is the maximum parallelism I authorized for it. We have, of course, my listeners for the Kafka events, but the processing itself starts with the ODFI split, which is what's happening here at this point. It doesn't consume many resources, because it's only looking at the files and, depending on the bank, sending them to the different buckets. Not many resources involved, so there's only one deployment of this process. But if I look at the RDFI split, that's what's happening in this box: retrieving the file, splitting it into its different transactions, recreating new files, and sending them to the receiving banks' buckets. That consumes more resources, and that's why this serverless function has automatically been scaled to two deployments: that's what it needs to handle the traffic coming in. Same for the RDFI process: it looks at the files and processes them, adding up the amount of money all those transactions represent, and it also needs two pods to do the processing.

Now, what's happening here? We can see that we have reached the maximum number of files that we wanted to generate, 60, so our create-transaction pods have scaled down to zero, which is of course what we wanted. We have also reached 60 for the first step of processing, so those split pods should come down to zero in a few seconds as well; we can see we are already consuming a little less memory.
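Here is that sketch of what one of these processing functions can look like, in the spirit of the RDFI process: it consumes bucket notifications from Kafka, fetches each new file, and adds up the amounts. The topic, endpoints, and credentials are hypothetical placeholders; the event layout assumes Ceph's S3-compatible record format, and the NACHA column offsets are illustrative rather than authoritative.

```python
import json

import boto3
from kafka import KafkaConsumer  # kafka-python

# Hypothetical endpoint and credentials, not the demo's real ones.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="DEMO_ACCESS_KEY",
    aws_secret_access_key="DEMO_SECRET_KEY",
)

# Listen to the bucket notifications arriving on the Kafka topic.
consumer = KafkaConsumer(
    "RDFI",
    bootstrap_servers="my-cluster-kafka-bootstrap:9092",
    value_deserializer=lambda v: json.loads(v),
)

total = 0.0
for message in consumer:
    # Each notification says which bucket/object was just written,
    # assuming the S3-compatible "Records" event structure.
    for record in message.value.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
        for line in body.splitlines():
            if line.startswith("6"):  # NACHA entry detail record
                # Amount field, positions 30-39, expressed in cents
                # (offsets are illustrative; check the spec).
                total += int(line[29:39]) / 100
        print(f"processed {key}, running total ${total:,.2f}")
```

In the demo, this kind of logic runs behind a Knative service rather than as a long-lived consumer, which is what gives us the scale-out to two pods under load and the scale-down to zero that we just watched.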
So that's a neat way to demonstrate that, using only bucket notifications and serverless functions, you can fully automate your data pipelines. It doesn't require some application that orchestrates everything and takes care of everything: it's only a few configuration files that you put into motion, and that allows you to create these kinds of pipelines very simply.

Speaking of files, I will go back here. You will have all the code, all the different configuration files, container images, and things like that in this repo. In a few days I will also put up a full walkthrough so you can reproduce this demo. And of course, feel free to reach out for more information, or if you have questions or problems implementing these kinds of things; it will be a pleasure to answer them. And now I think we still have time for a few questions.

Absolutely. Let's see if we have... I don't see any questions in the chat, but I think we're all kind of totally loving the demo that you gave, Guillaume. So I'll open it up and see if people have any questions. I'm not seeing anything, which means you did a really thorough presentation. So thank you very much. The repo that you point out here on the demo page, that has everything in it to reproduce this demo?

Yes, there is everything. There is the container code to build the pods that create and process the transactions, there is the Kafka topic creation, there is... well, there is everything you need to start from scratch, that is, from a brand new OpenShift installation, and install everything you need.

Awesome. So we look forward to other people taking this for a test run, trying it, and demoing it. And I really appreciate you taking the time today, Guillaume, and look forward to having you back for new updates on this topic. So thanks again. For everybody who would like to re-watch this, it will be uploaded to the YouTube channel later today. I'll steal the slides from Guillaume shortly, link them up there as well, and put a blog post with some other resources up on blog.OpenShift.com, so look for that in the coming days. And we will continue to provide you with entertaining and educational briefings over the coming weeks to take the place of some of the conferences that have been canceled. Look for those on the events page at commons.OpenShift.org. So take care, everybody, and thank you very much.