Let's start. OK, so let's begin the presentation. My name is Guglielmo Iozzia, and it's a pleasure to be here. I work at Optum, which is an American company, part of UnitedHealth Group, and we are based in Dublin, Ireland. Also, if you want to get in touch for other stuff, like how to prepare homemade pizza or some information about deep learning, feel free to get in touch after.

So, here are some facts about the company, but due to the short time for this talk, we can engage on that afterwards. This is the summary for today. When people start to think about a new project with machine learning, deep learning, and artificial intelligence, with all their promised capabilities, they have very good plans: they have their questions, they know where the data are, they know how to build the model, but they forget to do proper planning about how to handle the data. That is a big challenge, along with the other challenges that I am going to present now, and I want to share my experience addressing them using just a couple of open source tools.

So, what does the edge landscape look like? Basically, every day you have new smart devices. This doesn't mean that in the future we will have smarter people, so that's another challenge anyway. You have to understand how to get data from those particular data sources, because you have different protocols, different ways of accessing the devices, different security matters, so it's a very broad universe. You also have to consider that you often have limited resources in terms of network or physical resources, which is something to take into account when doing data collection from the edge compared to traditional collection from data centers, clouds, and databases.

Your data change. Imagine you set up your streaming pipeline: you can't expect to receive the data always in the same way. The change could be in the structure of the data, it could be a semantic change, it could be a change in the software that a particular vendor ships for a specific device. You often don't have control over the data sources, so you have to keep in mind that you have to handle that drift yourself. And of course in production, in contexts like manufacturing, health care, or cybersecurity, you don't have a single edge pipeline; you can have 100,000, so you have to design something that is scalable. And last but not least, you then have to move this data into some centralized big data system, which could be on-prem or in the cloud, and there are other challenges involved there.

Typically, this is what I saw in many situations, many scenarios: at the end you have some infrastructure like this, where you depend on a particular vendor for the IoT gateway, you have different sensors attached to devices from different vendors, and then you have to rely on different commercial products or run your own code, which is hard to maintain because a change in the data source means a change in the code. This is a cost in terms of time, money, and sometimes disruption of the service.
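To make the drift and custom-code problem concrete, here is a minimal Python sketch; the payloads and field names are made up for illustration, and it shows roughly the kind of defensive code you end up writing and maintaining yourself when no tool manages drift for you:

```python
import json
from typing import Optional

# Hypothetical payloads: the device originally sent {"temp": 21.5}; after a
# vendor firmware update the field was renamed and gained a unit:
# {"temperature": {"value": 21.5, "unit": "C"}}.
OLD = '{"temp": 21.5}'
NEW = '{"temperature": {"value": 21.5, "unit": "C"}}'

def brittle_reading(payload: str) -> float:
    # The hand-rolled approach: raises a KeyError as soon as the
    # structure drifts, stopping the whole pipeline.
    return json.loads(payload)["temp"]

def tolerant_reading(payload: str) -> Optional[float]:
    # The drift-aware approach: accept both known shapes, and route
    # anything unrecognized aside for inspection instead of crashing.
    record = json.loads(payload)
    if "temp" in record:
        return float(record["temp"])
    temperature = record.get("temperature")
    if isinstance(temperature, dict) and "value" in temperature:
        return float(temperature["value"])
    print("unrecognized record, parking it for review:", record)
    return None

for payload in (OLD, NEW):
    print(tolerant_reading(payload))  # both print 21.5
```

Multiply this by every vendor, sensor, and destination in the picture above, and the maintenance cost becomes clear.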
On the other side, you then want to send the data, using different transmission protocols, to some other platform for historical retention or analysis, to a time series database if you have time series, or you want to consume this data in real time in a dashboard or some other app. And again, you have the same problem at the destination, because there you have to write your own custom code for the specific destination, you could have multiple situations there, and it's hard to maintain: you have to change code or rely on commercial or different open source products.

So the solution that I want to share, coming from my experience in data ingestion and data movement in general, but which applies also to the edge, is the following. Oh, there's another need to keep in mind: typically in this context you have much more streaming of data. Traditional ETL doesn't apply here, because you have the different phases of the ETL combined at the same time, in a streaming fashion.

So we ended up using just one or two open source projects to address all these challenges. The first one, which we used in IBM first and often now, is the StreamSets Data Collector. Basically, it's a tool that was born for data streaming, but you can set it up to work in a batch fashion as well if needed. Out of the box you have several facilities in terms of setting up rules and automatic data drift management, so you don't have to check logs and dashboards all the time, because the tool practically alerts you about any situation, depending on the rules you set up. And you do everything through a web UI.

Another outcome of adopting this tool is that anyone can use it, so you don't always have to rely on technical people. You could have operators coming from traditional operations, without a DevOps mindset; you could also sometimes have an executive who wants to have a look at how the data are moving, looking at the web UI and dragging and dropping things. It's easy for everyone. And of course you can handle different data serializations.

Let's skip this slide. Basically, this is how a pipeline looks in the web UI. As you can see (not going into the details, because we have just 15 minutes), you don't have to write any code or have coding skills to understand how to set up the blocks: the origins, destinations, processors, and executors. And the tool is ready to connect to different things that I'm pretty sure you have in your infrastructure: the major cloud platforms, relational SQL databases, streaming and messaging systems like Kafka or ActiveMQ, and some industrial protocols like OPC UA or MQTT.

You have a centralized service, which could be a single machine or a cluster depending on your installation, that doesn't require you to install anything on the devices or the destinations. It connects to the specific sources and destinations using the protocols available for them, and the security protocols required by each specific one. For example, with Kafka you can connect pipelines for consuming and producing using TLS and Kerberos, or whatever is specific to the particular source or destination.
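As a hedged illustration of what a downstream app consuming such a pipeline's output might look like, here is a minimal Python sketch using the kafka-python client; the broker address, topic name, and certificate path are placeholders, and it assumes the Data Collector is producing device records to that topic over TLS with Kerberos authentication:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Broker, topic, and CA file below are placeholders for illustration.
consumer = KafkaConsumer(
    "edge-sensor-data",
    bootstrap_servers=["broker1.example.com:9093"],
    security_protocol="SASL_SSL",           # TLS transport
    sasl_mechanism="GSSAPI",                 # Kerberos authentication
    sasl_kerberos_service_name="kafka",
    ssl_cafile="/etc/ssl/certs/ca.pem",
    value_deserializer=lambda b: b.decode("utf-8"),
    auto_offset_reset="latest",
)

for message in consumer:
    # Feed a live dashboard, alerting, or any other downstream app here.
    print(message.topic, message.offset, message.value)
```

The point is that the security setup lives in configuration at the endpoints; the Data Collector itself needs nothing installed on either side.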
This works fine in a lot of scenarios, but sometimes there is no way to connect directly to the devices, if you think of some manufacturing contexts. In that case, you need to integrate another open source project with this one, a child project called Data Collector Edge, which basically is an agent that this time you do have to deploy on the particular device. Then you can communicate in a bidirectional way between the device and the Data Collector and vice versa.

It's a single binary implemented in Go, so it's less than five megabytes, and that's the worst case scenario. Typically for Linux-based or Android devices it's no more than two or three megabytes, and each pipeline is a few kilobytes, because at the end it's just a JSON file. It's open source as well, with the same license as the Data Collector. It doesn't have any dependencies on whatever is in the device or the OS, nor on external IoT gateways. This is important because you can use the same agent on different devices from different vendors.

You can also perform some edge analytics there, where the data are generated. Sometimes you don't want to bring all of the noise in the data into your big data platform. You can do this because, following the same principles as the pipelines for the Data Collector, you can reduce and transform the data and trigger some machine learning things, which you could do with TensorFlow, or Deeplearning4j for Android, on that particular device. If the device runs an operating system belonging to one of those five families, you can deploy this binary and it works there.

It supports different messaging protocols, including MQTT and Kafka, which means the data you generate and transform on the device doesn't necessarily have to go to the Data Collector; it could go to something else in your infrastructure that supports those protocols. Of course, it can detect and handle data drift on the edge: if something changes, you are alerted about the change and can decide what to do with it, without stopping the pipeline. And on a single device you can have a single agent running multiple pipelines, so if you need to collect data from different sensors on the same device, you don't have to install different agents: just one agent and multiple pipelines. This is just to give an overview of how this works.

At the end, this is what we do for the Data Collector, and what you can do when also integrating the Data Collector Edge into the platform. Assuming you have some continuous delivery pipeline, you have people who implement the edge pipelines and the receiving pipelines (those that expect data from the edge, from the agent) using just the UI, so there is no coding involved there. Then you can put these under source control, because a pipeline at the end is a JSON file, and I'm assuming you know how to version a JSON file on GitHub or whatever you have in-house. Then, if you have some continuous integration server, you can automate the deployment of the agent and the pipelines the first time; if the agent is already there on your device, you just upgrade the pipelines, and you can run them automatically from Jenkins or from your CI server as well (see the sketch below).

And then you have two options. You can send the data to some messaging system (I put Kafka here because this is the reference messaging system in Optum, but it could be something else supported by the edge agent), or you can set up direct communication between the devices and the Data Collector. The first option is good when you don't have direct access from the devices to the network where the Data Collector service lives.
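As a rough sketch of that CI step, here is what a Jenkins job might run in Python to push a versioned pipeline JSON to an edge agent and start it. The host, port, pipeline ID, and REST paths are assumptions modeled on the SDC Edge REST API and should be checked against the version you run:

```python
import json
import requests  # pip install requests

# All values below are placeholders; the REST paths and the port are
# assumptions to verify against your SDC Edge version's documentation.
EDGE = "http://edge-device.example.com:18633"
PIPELINE_ID = "readSensorsToKafka"

# The pipeline is just a JSON file kept under source control.
with open("pipelines/readSensorsToKafka.json") as f:
    pipeline = json.load(f)

# Upload (or overwrite) the pipeline definition on the agent...
resp = requests.post(f"{EDGE}/rest/v1/pipeline/{PIPELINE_ID}", json=pipeline)
resp.raise_for_status()

# ...then start it and confirm it is running.
resp = requests.post(f"{EDGE}/rest/v1/pipeline/{PIPELINE_ID}/start")
resp.raise_for_status()

status = requests.get(f"{EDGE}/rest/v1/pipeline/{PIPELINE_ID}/status").json()
print("pipeline status:", status)
```

Because the agent is already on the device, a pipeline upgrade is just shipping a new JSON file, which is why versioning it in Git fits so naturally.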
Also in terms of setting up the security for this, you have that option. And finally, you can do some raw transformations on the data in the Data Collector itself, and then send the data to your cloud or on-prem platform for advanced processing, or for some machine learning or deep learning, whatever is in scope to address your questions.

If you remember the slide from a few minutes ago: adopting these two open source solutions, you can probably go from that situation to this one on the bottom, where you have the exact same layer on the device side and on the destination side, and in most cases this could also replace the IoT gateway. You can also skip the messaging system in the middle; that depends on your particular infrastructure and the security and network restrictions you have in house, but there are different combinations of this. At the end, you can understand that you don't have so much code to maintain, you know that whatever works for one device will work properly with another device, and you can define standards and have a smooth learning curve for people who need to implement pipelines and look at the data.

Before completing this conversation, I just want to point out what I really believe, not only for manufacturing but in general for data collection from the edge or other sources: have a look at the open source solutions, because there is a world of production-ready tools out there, including these two but not restricted to them. And invest in people, because at the end, whatever tool you use, whatever your business is, it's people who do things. So don't invest so much in some complex commercial solution that probably doesn't solve your problems; a software engineering and data science approach is the best way. Keep things simple, because things evolve in time, things change, and so you should be ready for changes. If you keep things simple and use standards, you will be ready to quickly address changes in data, technology, and problems, in an always-changing world.

I also put some links in the slides if you want to go to GitHub and get some stuff, but feel free to contact me after this talk. Thank you so much.