Our first speaker is Nicolo, so please add Nicolo. Yes, hi Nicolo. Hi Martin, how are you doing? I'm fine, and you? Pretty good. A little anxious, but I think it's because of the talk. Don't worry. So, where are you streaming from? Italy, Milan. Oh, wonderful. Actually, maybe I'll go to Milan next week. Oh, that's nice. Maybe it's still open, maybe a one-week holiday. So your talk is about Spark and Azure Databricks. Wonderful. Right. Yeah. Okay. So, can you see my slides? Now everything is perfect. Okay, I'll start then.

So, I'm Nicolo Giso, and the title of this presentation, as you can see in this first slide, is "From telemetry data to CSV with Python, Spark and Azure Databricks". In Tenova we are using this stack to transform the telemetry data we retrieve from the field, from our machines, into easily manageable CSV data files.

Before diving into the details, let me show you the agenda. First of all, I will give you a little bit of background about myself and Tenova. Then I will show you the main applications of the data we retrieve from the machines. Then I will present the architecture and the stack we apply to transform this telemetry data into CSV files, and we will look at some screenshots of code to see in detail what happens. Finally, I will close with the orchestration of these notebooks and a conclusion on our next steps.

So let me introduce myself. I'm Nicolo, from Italy, as I said before. I work for Tenova; there will be a slide about it in a couple of seconds. My role is data scientist, and my main interests are books, TV series and, of course, data. You shouldn't expect anything fancy from this presentation, since, as you can see, my PowerPoint level is pretty low. I started working at Tenova in 2017, and I support the Tenova digital team in the implementation of the industrial IoT platform. Today I'm really proud to be here at EuroPython to show you a part of this platform, which we implemented with Databricks.

Tenova is a Techint Group company, a worldwide partner for sustainable, innovative and reliable solutions in the metals and mining industries. In Tenova we design and develop solutions that help companies reduce costs, save energy, limit environmental impact and improve working conditions. Over the years, Tenova has engineered and installed at customer plants a huge number of machines and devices; here you can see some numbers and figures, just to give you an idea. In the context of Industry 4.0, we decided it was time to leverage all the data that those machines produce while working. So we devised the so-called Tenova industrial IoT platform, which, leveraging Microsoft Azure, allows us to collect data, analyze them, build machine learning models, and show the results and the monitoring of those variables to our customers.

The five pillars of our platform are: the field gateway, the actual physical device that collects the data from the plants; the Tenova analytics engine and the machine learning model repositories, which help us find new information and extract insightful knowledge from the big amount of data we collect; the Tenova digital portal, which allows us to configure the variables to collect; and finally Tenova WIDE, our proprietary dashboarding tool, which allows us to create dashboards showing the status of the machines and how the different variables move over time.
So basically we have the Tenova Edge, which can read data from any plant data source, be it a PLC or a SQL database, and can write the data from on-premise to our platform. The data transfer is secure by design, and this guarantees privacy and security for our customers.

There are mainly three applications of the data we retrieve from the field: monitoring, data analysis and machine learning models. For the definition of the monitoring dashboards we don't need to manipulate the data much, so the original format is okay. But for data analysis and for building machine learning models it is necessary to transform the files we retrieve from the field into a manageable format, especially for the data analysis carried out by our process engineers, who have all the knowledge about the process and can gain insights if the data is provided in a meaningful way.

To give you an idea, this is a sample of the file we retrieve. It is a JSON lines file in which each line is a record: a variable is defined by an item name, which you can see here (for instance, speed), and an item ID, which is unique; then we have its value and the time at which this value was recorded (a hypothetical sketch of such a file, and of reading it with Spark, follows below). That's the face of our engineers when they look at those data for the first time, so we immediately decided to find a way to format them properly.

Before doing that, we defined a set of requirements for the CSV files we would provide to our engineers. Each file should contain the data for a single device. Only one file per device per day should be available, so if for any reason the data is split into two different CSV files, they should be merged. Then we guarantee one row per second, which is a nice feature to have, especially in forward-filling scenarios. Finally, we want to have the midnight row. By the midnight-row concept we mean that the midnight row is present and carries the value from the previous day: if a variable didn't change value in, let's say, the last two hours of a day, we don't lose information, because we can fill with the last value recorded on the previous day.
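To make the input format concrete, here is a minimal, hypothetical sketch of such a file and of reading it with Spark. The field names, values and path are illustrative assumptions, not the actual Tenova schema:

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` already exists;
# getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Each line of the input file is one self-contained JSON record, e.g.
# (all field names and values here are made up for illustration):
#   {"itemName": "speed", "itemId": "42", "value": "12.3",
#    "utcTimestamp": "2021-07-26T10:15:02.123Z"}
# spark.read.json treats every line as a separate record by default.
raw_df = spark.read.json("/mnt/datalake/raw/telemetry/2021-07-26/")

raw_df.printSchema()
raw_df.show(5, truncate=False)
```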
So now let's see the architecture. This is the whole flow: we start from the JSON lines I showed you before, and then there is a series of eight Python Databricks notebooks that transform the data. The runtime is Spark 2.4. But this is the face of our intern looking at this architecture diagram, so I decided to make something that I hope is a little clearer.

The first notebook has the task of removing the information from the previous run; since this job runs daily, that means cleaning away all the information from the previous day. Then we have the first real notebook, the pivot notebook, which transforms the JSON lines into CSV with the variables as columns and the UTC timestamp as index. Then we have the merge notebook, which takes the data saved in the Azure Data Lake by the pivot notebook and merges them together, to have a unique CSV file for the same device on a given day. Then we save the result to the data lake, and the update-last-values notebook writes to a SQL database the last value found for each variable, into a table that can then be read by the fill-midnight notebook, which retrieves the last value of the previous day and fills the midnight timestamp. At this point the file we built is saved again to the Azure Data Lake, and the pad-timestamps notebook is just used to make sure that there is one line per second.

At this point the CSV is saved and the manipulation is over, but since we have the data ready in a tidy format, we decided to implement the compute-KPI notebook, which calculates basic KPIs such as mean, minimum and maximum and saves this information in a Cosmos DB collection that can later be retrieved by our dashboards. To close up, we have a sanity-check notebook that checks the logs for run errors and, through our Logic Apps, reports the status of the whole run via email.

So let's see a little bit of code. Before that, I want to highlight that there are variables like speed, which you see in lines 1 and 2, that change value within the same second (just the decimals differ between the two lines), and then there are variables like script image, which are complex variables, meaning that the value holds more than one single value, unlike the previous case. This is usually the result of a SQL query.

The core function of this notebook is the transform-dataframe function, which takes the DataFrame read via spark.read.json and selects some of the columns: it transforms the UTC timestamp, selects the item name and ID and, for the values, explodes and parses the value with a user-defined function, in order to have all the content readily available. At line 12 we create a new column called variable, in which we put together the item ID, the item name and, if necessary, the key as a suffix. At this point we select the UTC timestamp and the variable, and we are ready for the pivoting part.

We cannot directly pivot on our values, because we may have too many: it can be the case, as for the speed variable, that we have more than one value per second, and we want just one value. So we truncate the UTC timestamp to the second, and then we have to pivot. The thing is, we tested the aggregation with last many times, but it gave different results on the same data, because sometimes it didn't return the real last value. To overcome this issue we use the following trick: we build a window partitioned by variable and the truncated UTC timestamp, ordered by the UTC timestamp, and for each window we select just the first row. At this point we are ready to group by the truncated UTC timestamp, pivot on the variable and, finally, take the value; we used last, but we could use whatever aggregating function, since there is just one value left.
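As a rough sketch of this windowing trick (the actual notebook is only shown as screenshots in the talk, so the names df, variable, value and utc_timestamp are assumptions):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Truncate the timestamp to whole seconds: every (variable, second)
# bucket must end up with exactly one value.
df = df.withColumn("utc_ts_sec",
                   F.date_trunc("second", F.col("utc_timestamp")))

# Window over (variable, truncated timestamp). Ordering by the
# full-resolution timestamp descending and keeping row 1 selects the
# latest record of each second deterministically, unlike a bare
# last(), whose result can vary between runs because Spark gives no
# ordering guarantee inside an aggregation.
w = (Window.partitionBy("variable", "utc_ts_sec")
           .orderBy(F.col("utc_timestamp").desc()))

deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))

# Only one value per bucket is left, so any aggregate works;
# last() mirrors the talk.
pivoted = (deduped.groupBy("utc_ts_sec")
                  .pivot("variable")
                  .agg(F.last("value")))
```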
At this point we do just a little bit of renaming, and then the DataFrame is ready to be saved; the output is the one you can see. Then there is the merge notebook, which basically takes the files with the same name, from the same device, and concatenates them together. It also makes sure that there are no duplicate UTC timestamps: between one run and the next there can be a second in common, and we just have to merge that second into a single row, let's say.

Then the result is saved, and with the update-last-values notebook we retrieve, for each column, the last value and its timestamp. We do this in the generate-last-value-dict function, which takes all the columns of the DataFrame except, of course, the UTC timestamp, and builds a dictionary having as key the column name and as value the last non-null value of the column together with the timestamp at which that value was found (a sketch of this idea follows below). At this point we have to write to the SQL database to store the result we just found; the table is this one, with the columns timestamp, item ID, value and device. The first query, at line 2, is used to look for any existing entry with the same item ID and item name for the same day; if there is an entry with a timestamp that comes before the one we just found with the previous function, we apply an update to change the value and the timestamp; otherwise we insert the value we found into the table.

Then we arrive at the fill-midnight notebook, in which, using pandas read_sql, we read the table we saw before and retrieve, for each device, the item ID, the item name and the value for the day before, building a structure that helps us fill the midnight row. Indeed, we update the DataFrame, read through spark.read.csv, using the dictionary we built: if the midnight row exists, that is, there is a line whose hour, minute and second are all zeros, we fill the null values with the information retrieved before; otherwise we insert a new line with the values we just got from the table. Finally, there is the pad-timestamps notebook, whose only job is to make sure that there are no skipped seconds between two consecutive lines; that's done using Spark's range and a join (also sketched below).

So the data is now ready to be analyzed by our process engineers and our data scientists. But since we were already there, we decided to build another notebook that computes, on all the files available for the day, some basic KPIs, such as the mean, the median, the maximum or the minimum value, and stores this information, along with the item name and the item ID, as documents in a Cosmos DB, where they can later be retrieved for visualization.

To close up, I want to show you our way of using Databricks. We usually develop new features locally in Visual Studio Code, and when we are ready we push our changes to Azure DevOps; a release pipeline maps the branch structure to a folder structure, in which each branch corresponds to a folder. As you can see here, the main branch is mapped to the main folder and the dev branch to the dev folder. This allows us to experiment with new features without impacting the production scenario.
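Going back to the update-last-values step, here is a minimal sketch of the generate-last-value-dict idea under the same naming assumptions as above (the real function is only shown as a screenshot; this unoptimized version runs one small Spark job per column):

```python
from pyspark.sql import functions as F

def generate_last_value_dict(df, ts_col="utc_ts_sec"):
    """Map every non-timestamp column to its last non-null value
    and the timestamp at which that value was recorded."""
    result = {}
    for col in df.columns:
        if col == ts_col:
            continue
        # Latest row in which this variable is not null.
        row = (df.filter(F.col(col).isNotNull())
                 .orderBy(F.col(ts_col).desc())
                 .select(col, ts_col)
                 .first())
        if row is not None:
            result[col] = (row[col], row[ts_col])
    return result
```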
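And a sketch of the range-and-join idea behind the pad-timestamps notebook: generate one row per second for the whole day and left-join the day's data (pivoted, from the earlier sketch) onto it. The epoch-second bounds are illustrative:

```python
from pyspark.sql import functions as F

# Hypothetical UTC epoch-second bounds of the day being processed.
day_start, day_end = 1627257600, 1627344000

# spark.range yields one row per integer in [day_start, day_end);
# casting the long to timestamp interprets it as epoch seconds,
# giving exactly one row per second of the day.
seconds = (spark.range(day_start, day_end)
                .withColumn("utc_ts_sec", F.col("id").cast("timestamp"))
                .drop("id"))

# The left join keeps every second; seconds missing from the data
# come out as null rows, ready to be forward-filled downstream.
padded = (seconds.join(pivoted, on="utc_ts_sec", how="left")
                 .orderBy("utc_ts_sec"))
```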
To orchestrate all of these notebooks, at first we used Azure Databricks itself, because it is in the same environment as the notebooks, but its main drawback is little flexibility when thinking about future developments. So we decided to migrate to Azure Data Factory, which allows easy customization and, thanks to its UI, lets us better orchestrate all our steps. Furthermore, Azure Data Factory gives us some flexibility in replacing, maybe in the future, some notebooks with other services, because of performance or if we need something particular that is already implemented outside a notebook.

To conclude, what are our next steps to improve our solution? We want to upgrade in the next weeks to Spark 3, so we can fully leverage Koalas. Then we want to extend the computation of the KPIs, not just the basic ones but ad hoc KPIs for each device. And finally we want to improve the performance of the compute-KPI notebook, the last one I showed you, because at the moment it is our bottleneck. Thank you, that was my talk. If you have any questions, I'm here; otherwise you can grab a cup of coffee and get ready for the next talk.

Okay, thank you very much, Nicolo, for the talk. There are some questions; actually, they are popping in right now. Let's start with the first one: why do you rely on CSV as a data format, why not use something like Parquet? That's a really nice question. We started with CSV because at the beginning, as with all modular things, there was just the pivot notebook, so we just wanted to have a CSV. Later on we decided to add the other steps to, let's say, make our files more complete and refined, and we simply stayed with CSV for that reason. In the end the result needs to be a CSV anyway, but we should consider maybe using Parquet for the intermediate steps, with a CSV as the final result.

Okay, next question: what happens when there are multiple data rows per second; what decides which line will be kept and used? Okay, yeah, probably I didn't explain that quite well. Actually, when in the JSON lines there are two rows that differ just by a few milliseconds, we keep the latest one; using the trick I showed you, we are able to take the real last one.

The next question goes in the same direction: why did you decide to use the second as the smallest time unit and not something smaller? Because our idea was to target a specific scenario in which we analyze the data at the level of seconds. In some cases you need to look at the data at the millisecond or even nanosecond level, but for those we just look at the raw data on the machine, or anyway at the data recorded in the JSON file. This needed to be something easily manageable, and with variables having different polling frequencies it would just be a big mess; also, keeping all the milliseconds at every value change would make the files quite huge.

Okay, next question: do you have end-to-end tests to ensure your pipeline works as expected? I hoped not to receive this question. Actually, before moving to Spark 3 we probably have to add some tests, to be sure we don't break anything in production. We are investigating some solutions; I found something last week, right before the presentation, because I was thinking that maybe we should have something in place, but we still have to find it, and I'm open to suggestions, actually.

We still have time for more questions, so the next one is: how do you handle data that is received with latency, one second too late, while you are already recording the second ahead? So, basically, the data doesn't have to arrive exactly when it is recorded: it is a batch job that analyzes those data, so this problem doesn't exist at the moment. We just refer to the UTC timestamp in the
line, and we take that as the main source of truth. Okay, there are no more questions at the moment, but one comment, which is also important: very interesting, a lot to take away, thank you. Thank you, I think this is a perfect comment, thank you very much. Someone is typing in the last seconds of the match... it's actually also a comment, and I have to copy this one because it's a great comment... oh no, I can't copy-paste, that's horrible. So that's the final one: thank you for being here with me. I can't say it better, so thank you very much. Thank you. And if anybody has other questions, we can continue in the breakout room. Thank you very much. Now we have time for some ads.