Thank you very much. Before starting, I want to thank Paetano for inviting me to speak with you today, and I also want to thank Paul, Matteo, Amber and Kian for their support. So, let's begin, because I have quite a lot of stuff to share with you, so I will move quickly into today's topic.

In the next minutes I will try to talk to you about the difference between the data warehouse and the data lake: what we have to build, and why we have to build both a data warehouse and a data lake in an organization. To explain the data warehouse and the data lake, I will use a project that I have been working on since 2016-17. It is a European Union project, because I work for the European Union, and it is a labour market project. So I will start by showing you this project, and then I will go into detail on two different approaches, two different methodologies for analysing the data: the top-down approach and another approach, the bottom-up one. At the end, I will present three different technologies. I will not go into detail, but if you want to talk with me online afterwards, we will have plenty of time. These are the technologies that support us in building the data lake: Delta.io, Hudi and Iceberg.

Let me introduce myself. I am Mauro Pelucci — you can also pronounce my name the English way, that is okay, it is not a problem. In 2016 I started this project for the European Union, mainly for Eurostat, Cedefop and ETF. Eurostat, I think, is well known. Cedefop and ETF are two European agencies that work on vocational training, so they work on developing the vocational labour market in the European Union: Cedefop mainly for the countries that are part of the European Union, ETF mainly for the countries neighbouring the European Union, such as Morocco, Tunisia, Georgia and Ukraine. In my current job I lead global data science; what we do is quite particular, and I will show it in the next slide. I also teach at the University of Bergamo and at Milano-Bicocca. These are my contacts: if you want to reach me on Twitter or by email, feel free to write me, and also on LinkedIn. The presentation is public, so if you go to this GitHub repository you will find my slides; I will upload the latest version there after the conference, and you can use the slides as you like.

Let's start with today's topic. The project I want to present explores the labour market of the European Union over time. We started to explore the evolution of the labour market because, five or six years ago, the European Union understood that the labour market is in continuous evolution. Official data alone are not enough to explore it, so they asked us to build a data warehouse to study this evolution: the evolution of occupations, the evolution of skills, the impact of globalisation, and also the last two years, which have been terrible for the labour market because, you know, we were in a pandemic, we have the Ukrainian war, we have a lot of new regulations. All of these topics, obviously, have changed the labour market. So, how did we try to address these requests?
We tried to create a system that helps us understand the labour market — in other words, a data warehouse of the labour market. Obviously, we started from the official statistics on occupations and vacancies, but these official statistics are not enough. Why? Because this information arrives with a lot of delay, it is not fresh, and official statistics do not speak the language of the employer and the language of the employee, of course. So we turned to the web. In 2016, more or less, we started to collect data from the web — mainly online job advertisements, online job postings across the European Union — to build our data warehouse. And at the end we had our data warehouse. For me, it has been 20 years that I work with data warehouses, so it is not a new topic.

At the beginning we released analytics, mainly dashboards, and with these dashboards the European Union, training providers and universities can address some questions about the labour market. This is the link to the tool; it is a public tool that you can access, built on Tableau dashboards, and under the hood we have a data warehouse with the dimensions of the labour market.

What we do is quite simple. Here, as a reminder, I show how we collect the data through scraping. Mainly, we start by scraping data from the web: from online job portals, job boards, public websites, and the public employment services. Obviously, we take care of data quality. We created a big data lake with all the data we have and, on top of this data lake, we built the data warehouse. So, in the end, we have a data warehouse inside the organisation that provides data to our stakeholders. What we do, obviously, is move the data from the data lake to the data warehouse. We apply machine learning and AI techniques, because the data we collect from the online portals are quite heterogeneous, you know? Speaking about my own profession: you could call me a big data scientist, but in some organisations I am a data engineer — which, you understand, is very different — in some organisations I am a senior data scientist, in others a head of data science. All these job titles are quite similar, and it is difficult for a decision maker to analyse this information if we do not normalise the data. So what we do is: first we put everything in the data lake, we enrich and normalise the information there, and then we produce the data warehouse. This data warehouse is not public, of course, but all the national statistical institutes can access it to release statistics, information, reports and so on.

Under the hood we have a lot of technology, of course. Mainly we use Spark, PySpark, to process the information. We use a lot of machine learning to normalise the data. To build the data lake we do not use only one technology; we use a bunch of technologies, because for each question we need to choose the best technology. For example, we use AWS S3 to store the information, because we need to reduce the cost. We use Neo4j and Elasticsearch to release some data to our stakeholders. We use Tableau for presentation. In some cases we only use Python and matplotlib. It depends on the question.
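To make the normalisation step concrete, here is a minimal PySpark sketch of the idea — a lookup table standing in for what, in the real project, is a machine learning model per language. The DataFrame names, titles and the mapping are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch: map free-text job titles scraped from the web onto one
# canonical occupation before loading the records into the data warehouse.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("job-title-normalization").getOrCreate()

# Hypothetical raw postings as they arrive from scraping.
raw_postings = spark.createDataFrame(
    [("1", "Big Data Scientist"),
     ("2", "Senior Data Scientist"),
     ("3", "Head of Data Science")],
    ["posting_id", "job_title"],
)

# Hypothetical lookup table; in practice this step is a trained model,
# not a static dictionary.
title_map = spark.createDataFrame(
    [("big data scientist", "Data Scientist"),
     ("senior data scientist", "Data Scientist"),
     ("head of data science", "Data Science Manager")],
    ["raw_title", "occupation"],
)

normalized = (
    raw_postings
    .withColumn("raw_title", F.lower(F.trim(F.col("job_title"))))
    .join(title_map, on="raw_title", how="left")
    .fillna({"occupation": "UNKNOWN"})   # keep unmatched titles for later review
)

normalized.select("posting_id", "job_title", "occupation").show(truncate=False)
```

The point is only that the raw titles are preserved while a single normalised dimension is produced for the warehouse; the actual classification is far richer than a join.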
So, let's come back to the main point, because what we need to understand is why we need both the data lake and the data warehouse. When I teach at university, a lot of students search on the web, on Google, and conclude: okay, the data warehouse is dead, we do not need to build data warehouses any more. But really, we do need the data warehouse, because we need quality data. And we also need the data lake, because the data warehouse and the data lake answer different questions.

So, let's start with the topic. I have a question for you. It is not my question; I took it from a presentation by Cook in 2014, a TEDx talk you can find on the web. In your opinion, what is America's favorite type of pie? Apple pie. Thank you. Okay, thank you. It is the same answer as ten — as eight years ago, sorry. It is a beautiful question for me, because I asked you what the favorite type of pie is, and you applied in your mind the typical BI process, the business intelligence process. Let me recap the process that ran in your mind. Of course, you collect the data; in this case we usually use retail sales data to answer this question. In your mind, you apply techniques to organise and aggregate the information — we can call this bunch of steps ETL: extract, transform and load. You build a data warehouse in your mind with the list of pie types — apple pie and all the others — and then you query the warehouse and take the first one, apple pie, right? Okay. It is a typical BI process.

If you go back to 2004, Solomon described the BI process as the set of processes in an organisation that take care of the collection of the data; the cleaning and the improvement of the quality of this data; and the storage of this data in a data warehouse in order to provide answers to decision makers. That is the typical BI process. But this process usually — always — starts from a question. Starting from the question, we select the source system. For your question, we would select the data from an ERP, an enterprise resource planning system, to answer it. We collect the data, we store the data somewhere, we provide the data, and we build an application. It is a typical BI process.

And I reuse this process to deal with labour market data, because what I do for the European Union — of course not with types of pie, but with 100 terabytes of data — is the same. It is the same. I collect the data from online job websites; I try to address a lot of issues about the quality of the data that we collect; I store the data; and then I build something to provide the data to my stakeholders. Of course, my stakeholders are quite complicated, because a lot of them are politicians, but I try to simplify the dashboards for them, because they need this data to improve our training systems, our labour market policies and so on. So it is a typical top-down approach: we start from a question, we select the data to address this question, and then we build something to produce answers.

But now I need to return to the beginning, because the question was: what is America's favorite type of pie? And we need to understand the difference between "favorite" and retail sales data — because in our mind we run that process, but I do not know whether "favorite" is the same thing as retail sales data.
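As a toy illustration of that top-down flow, here is a minimal PySpark sketch of the pie question: extract the sales, aggregate by pie type, store the result as a small "warehouse" table, answer. The file name, schema and paths are hypothetical, not anything from the talk.

```python
# Toy version of the top-down BI flow: extract retail sales, transform
# (aggregate by pie type), load a tiny "warehouse" table, answer the question.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("favorite-pie").getOrCreate()

# Extract: hypothetical retail sales data, one row per pie sold.
sales = spark.read.csv("retail_sales.csv", header=True, inferSchema=True)

# Transform: aggregate units sold by pie type.
pie_ranking = (
    sales.groupBy("pie_type")
         .agg(F.sum("units_sold").alias("total_units"))
         .orderBy(F.desc("total_units"))
)

# Load: persist the aggregate, then answer the pre-defined question.
pie_ranking.write.mode("overwrite").parquet("warehouse/pie_ranking")
print(pie_ranking.first())   # with retail sales data, apple pie comes out on top
```

Note that the pipeline can only ever answer the question it was built around — which is exactly the limitation the talk turns to next.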
Speaking of which: one or two years ago, during the pandemic, I needed to buy a printer, because I have three children at home and I did not have one before, and I chose a printer — not the best one, but the one that was available. So, coming back to the question, I do not know whether "favorite" is the same as retail sales data. I do not know.

Of course, now we have a lot of new data and a lot of new questions. It is not like 20 years ago, when I started to work and I had only surveys, surveys, surveys. Now we have a lot of new questions coming from our decision makers, our colleagues, my wife, and so on. And we have a lot of new sources. Of course, we need to understand what the question is and which sources we can use. So we need to change the approach: not the top-down approach, starting from a question, but a bottom-up approach. Starting from a question is fine, because we can move quickly, but we also need to be able to address new questions. So we can use a bottom-up approach: we can store everything. Now we can store not only retail sales data, but also social media data, web data, data coming from Amazon and e-commerce. We can store everything, and then use machine learning, AI and pattern recognition to extract new insights for our stakeholders.

Let me give you an example. I started in 2016 to collect data about the labour market. My first question from the decision makers was: what are the top five occupations, the professions most requested in the European Union? Okay: I built a machine learning model to normalise the job titles. I had to build this model for 45 different languages, because in Europe there is not only English, but also Spanish, Italian, Portuguese, Basque, Catalan, Galician and so on. So it was not easy, but at that point we had our data warehouse. Last year, the same decision maker asked me: what is the demand for remote workers? A new, different question. Of course, using this approach, I went back to the raw data that I had stored over seven years of the project, and I found that remote working was almost absent in Europe before February 2020, and that now it is very much present. Or, speaking about the green transition: how do we measure the impact of the green economy on the labour market? Because we have the raw data, I can answer my decision maker without rebuilding the data warehouse, because the data warehouse is built with fixed dimensions — occupation, contract, working hours, employer, salary — and these are different questions.

Of course, if we use a bottom-up approach, we have a lot of challenges to address, starting with the use of machine learning. We need a way to integrate data, AI, machine learning and analysis tools. We need a way to bring in any data and any source, because online job postings are fine, but now we also have job platforms. I can collect data about the labour market from Facebook, Amazon, Twitter. I am also starting to watch Twitch, because it is now a channel for this. Three years ago, I found a job advertisement inside the HTML code of a company's website: if you went to the source code of the page, there was a job advertisement in there, for a technician of course. So when I collect data, I need to take care of the evolution of the sources. And, of course, that HTML code does not look like the job descriptions I usually find on, for example, Adecco or Randstad, the typical job boards. And then, of course, there is the question of who can access the data.
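A minimal sketch of how a new question like the remote-working one can be answered from raw data already sitting in the lake, assuming a hypothetical lake path, schema and a simple keyword heuristic — the real project uses trained models, not keyword matching.

```python
# Bottom-up sketch: answer a question that did not exist when the data were
# collected, by going back to the raw postings stored in the lake.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("remote-work-from-raw").getOrCreate()

# Raw postings stored "as is" over the years (hypothetical path and columns).
raw = spark.read.parquet("s3://jobs-lake/raw/postings/")

remote_pattern = "(?i)remote work|work from home|smart working|teleworking"

monthly_share = (
    raw.withColumn("is_remote", F.col("description").rlike(remote_pattern).cast("int"))
       .withColumn("month", F.date_trunc("month", F.col("published_at")))
       .groupBy("month")
       .agg(F.avg("is_remote").alias("remote_share"))  # share of postings mentioning remote work
       .orderBy("month")
)

monthly_share.show(50)
```

Because the raw descriptions were kept, the question can be answered retroactively for the whole history, which a warehouse with fixed dimensions could not do.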
When I provide the data to the decision makers, for example, they can only access the final data. I do not want them to access my kitchen, where I prepare the data. But when I have good data scientists, my colleagues, I do want them to access the data and play with it, because they can find new patterns, new metrics. Speaking about new metrics: one month ago, one of my colleagues started to play with the data using a new model — well, a model that is 40 years old, but new in this context. We applied a model coming from econometrics, called economic complexity, and we found something interesting. We presented it to my economist colleagues, it looks good, and we are working on it. So, I want the data scientists to have access to the raw data, but I do not want the decision makers to access the raw data, because they cannot handle the complexity of that data.

So, we need the data warehouse. We need the data warehouse because data warehouses are like the gauges we have in our organisation: we need these signals to understand the state of the organisation. Data warehouses are clean, integrated, subject-oriented. We do not keep raw data in the data warehouse; data warehouses are built to answer pre-defined questions. But of course we also need the data lake. Why? Because in the data lake we can find a lot of new insights and new questions that are currently unexplored. Of course, data lakes are more complex than data warehouses, because we need to prepare the environment to be machine-learning ready; we need to give access to the raw data; and we need to be schema-free — please, not schemaless, but schema-free, because every piece of information has a schema, sooner or later. So we need these characteristics, and we also need tools and techniques that support rapid change of the information.

I love this representation of the data lake. I found it on the web and it is beautiful to me — I do not know who prepared it — because it recaps my 20 years of experience in one slide. As you can see, we have data coming from ERP systems. We have the data lake, where we can store the raw data and where we run our processes to improve the quality of the data, and we have the good old data warehouse. And, as you can see, we have the access zone where the data are made available. So it is the same as the BI process I presented ten minutes ago. It is the same; only the name has changed: data lake. The name changed because the approach changed — we have a bottom-up approach. We do not start from a pre-defined question. Yes, the access zone serves some pre-defined questions, but we also let the data speak: we leave the data available to our data scientists and our data analysts.

So, speaking about the process, this is a recap of what we do in my organisation for the European Union. As you can see, we ingest the data from scraping. We have some stages that clean the data and apply preprocessing techniques. Then we apply a lot of machine learning to extract each piece of information for our questions: we extract the occupation, we extract the requested skills, and in total we have about 30 different dimensions, the classical dimensions of the labour market. And with that we build the lake. In the end, we have 100 terabytes of compressed columnar data.
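A compressed sketch of those stages, with hypothetical paths, a hypothetical schema, and a placeholder UDF standing in for the machine learning models that assign the labour-market dimensions.

```python
# Condensed sketch of the pipeline: ingest scraped pages, clean them, extract
# dimensions (here only "occupation" via a placeholder UDF), and land the
# result as compressed columnar files in the lake.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("postings-pipeline").getOrCreate()

# Stage 1 - ingest: raw scraped pages (hypothetical location and schema).
scraped = spark.read.json("s3://jobs-lake/ingest/scraped/")

# Stage 2 - cleaning / preprocessing: drop empty pages, strip whitespace.
cleaned = (
    scraped.filter(F.col("description").isNotNull())
           .withColumn("description", F.trim(F.col("description")))
)

# Stage 3 - information extraction: placeholder for the ML models that assign
# the ~30 labour-market dimensions (occupation, skills, contract, ...).
@F.udf(returnType=StringType())
def classify_occupation(title):
    return "Data Scientist" if title and "data" in title.lower() else "Other"

enriched = cleaned.withColumn("occupation", classify_occupation(F.col("title")))

# Stage 4 - build the lake: compressed, partitioned columnar storage.
(enriched.write
         .mode("append")
         .partitionBy("country", "occupation")
         .option("compression", "snappy")
         .parquet("s3://jobs-lake/clean/postings/"))
```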
It is difficult to handle this, also because, for example, every six months the occupation code list changes, so we need to reprocess all the data. For this we use a particular technology that I will present in a minute. Of course, this area is available only to me and my data scientists. Then, every month, we copy the data — really, we copy and filter the data: only the data that pass the quality checks are moved to the data warehouse. And that is not the end, because before the data are made available to data analysts, decision makers and also citizens, we apply some validation rules. Why validation rules? Because in this case we do not control the sources of the data: we do not control the 35,000 different websites that send us the data. So, every month, we have an AWS batch process that applies statistical tests, distribution checks and so on to the data. If we detect an issue, we stop the release of the data to the citizens, we inspect the data, we decide whether the change is acceptable or not, and only then do we release the data.

So, speaking about technology: the data warehouse and the data lake that we build are based mainly on these two technologies, Delta Lake (Delta.io) and Apache Hudi. Delta Lake is a technology built around Spark and mainly backed by Databricks. Apache Hudi is an open-source project — Delta.io is open source too, but with a different governance — and it is mainly maintained by Uber and the Apache Foundation. We use these two technologies to build our data lake. And there is also Iceberg. I am not using Iceberg at the moment, but it is a beautiful tool, so I want to present it in one slide as well. Iceberg is essentially the same kind of technology, of course with a different approach, coming from Netflix and AWS.

So, why are we using Delta.io? Because we needed a library that helps us merge data — mainly upsert data — since every month we may reprocess our entire data lake. In Delta.io we found a tool integrated with Apache Spark, and it was easy to integrate on AWS, because we use AWS EMR to process, classify and release the data. So we found in Delta.io a beautiful technology that helps us merge the data. Of course, Delta.io also supports time travel, schema enforcement and changes to the data inside a partition, but what I was really looking for, years ago, was a technology that helps with upserts — insert, update and delete — because, you know, the classical file formats, like Avro, Parquet, ORC, do not support merges and updates.

Apache Hudi is the other technology we use. Why? Because with Delta.io we had some issues with bulk processing of the data — it is quite slow for big bulk jobs — and because the maintenance tasks, the compaction of the tables and the vacuuming of the tables, are not automatic: with Delta.io we needed separate processes for the maintenance and had to take care of this ourselves. With Apache Hudi, last year, we simplified this approach, because Apache Hudi has automatic compaction and offers different views on the same data: a read-optimized view, with higher data latency but lower cost, and a real-time view, with lower latency but higher cost. With these different views on the data, we reduced the processing cost and the processing time — we are about ten times faster now.
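To make the upsert point concrete, here is a minimal Delta Lake sketch of the kind of monthly merge described above — something plain Parquet files cannot do. The table paths, the key column and the session configuration are illustrative assumptions, not the project's actual jobs.

```python
# Minimal sketch of a monthly merge (upsert) with Delta Lake.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("monthly-merge")
    # Delta needs its extensions registered on a plain Spark/EMR cluster.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Reprocessed postings for the month (hypothetical source path).
updates = spark.read.parquet("s3://jobs-lake/clean/postings/month=2022-09/")

target = DeltaTable.forPath(spark, "s3://jobs-warehouse/delta/job_postings")

# Upsert: update postings we already know, insert the new ones.
(target.alias("t")
       .merge(updates.alias("u"), "t.posting_id = u.posting_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```

With Apache Hudi the same intent is expressed as an upsert write (setting `hoodie.datasource.write.operation` to `upsert` on a MERGE_ON_READ table), which is what produces the read-optimized and real-time views mentioned above.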
Why? Because Apache Hudi is designed for batch ingestion. In our case it is important to be able to reprocess the data every month or every two months with one process, applying all the machine learning models to 100 terabytes of data. More or less, in our data warehouse right now, we have 400 million job postings, and we also keep the HTML, the description and so on. With this optimisation for batch ingestion we simplified the processing a lot, and all the compaction and vacuuming of the tables is built in. Apache Iceberg is another good technology; I will mention only one point about Apache Iceberg: you do not have the dependency on Spark, so you can also use other engines.

Speaking about the main differences, I recap them in this slide, so you can look at it if you want. Just one point, because I have 20 years of experience: keep an eye on the contributions, because when you choose an open-source solution you also need to take care of the support. This is why we started with Delta.io four years ago, as you can see. The main issue with Delta.io, in my opinion, is that it is mainly supported by Databricks. It is good, but you have only one company driving its evolution. Second, we are now moving to Apache Hudi. Why? Because it is well integrated with AWS, there is a lot of support, and there are a lot of contributors. You know how it is: when I talk with my younger colleagues, they want to use the latest technology. But data lakes and data warehouses are long-term decisions; you have to take a long view on what you are building, because you cannot change the data format every two months. I chose Delta.io four years ago because it was the only one really ready for production. Now I am moving to Apache Hudi because, in my opinion, it is the future: with more contributors, it is better. So, thank you to everyone. If you have any questions, please use the mic right here.

So, thanks, Mauro. Very impressive setup you have there. We have time for five minutes of questions. If there are any in the hall, please come to the mic, and if we have any remote questions, you will let me know.

Great talk, thank you. I have a question. You said that you have two types of data: you have census-like data, which is very strong, high-value, and then you have all this web data. And you mentioned that you use statistical tests to clean your data before you publish it in the data warehouse. Can you go a little bit more into detail on how you use the first kind of data to clean the second?

Okay, thank you for the question. So, about the data and about the quality process. When we built our data lake, it was important, of course, to store everything. So, in our data lake we have two main areas. We have the raw data area, where we store everything as-is, and then we have the cleaned data, not yet ready for the citizens, but ready for the data scientists. What do I mean? In this area we apply some classical processing techniques. Mainly I apply two techniques, because the main issue that we have with job data is duplication: I can find the same page many times on the same website or across different websites. So the first thing we apply is a model that works like a spam model — like your email spam filter — which, for each page we collect, decides whether the page is a real job posting or a fake job posting.
Of course, we do not throw the fake postings away: we remove them from the data warehouse, but we store them in another table, because, for example, three years ago one European agency asked me for a report about the informal economy, and in that case I used the fake job postings to answer the question. So, first of all we use the spam model. Second, we use the deduplication model. The deduplication model is built with a sort of fuzzy matching between the HTML code and the job description, and a second deduplication model applies machine learning, because in some cases we have the same job posting published in different languages, for example. So, just to give you a number: starting from four million job postings collected in the European Union, only 100 are really job postings. So.

Thank you for the description of your process and for your talk. I had a question in terms of data quality. Could you comment on the differences between... Obviously, if you have sampling biases or other biases within your data, that is going to affect the insights you get from your analysis. Obviously, the bottom-up, data-lake approach helps you to be less biased in your postings, your hypotheses and so on.

Yeah, beautiful question, thank you. It is very difficult for me to address it fully, so I will try in a few minutes and then we can chat. Why? Because, speaking about the classical data that we get from a survey or from a company, we have an idea of the population. If I ask Paul how many participants there are in this room, it is easy: we have a sensor at the door, let's say, and we have an idea of the entire population. Speaking about job data, we do not have an idea of the total number of job postings in the European Union; we do not have the real number. So, usually, when a decision maker asks me, "Okay, do you have all the data, all the job postings?" — no, because I have only the web job postings. "But do you have all the web job postings?" — I do not know, because I collect data from 35,000 different websites, but I do not know how many websites with job postings there are in Europe. So it could be good coverage or bad coverage, I do not know. So, usually, we compare our data — offline, in this case, not online — with the classical statistics. Of course, speaking about statistics, we have different types of data, because in our case we have flow data, while classical statistics are, more or less, stock data. So we use techniques to try to build a sort of representativeness of the web data that we have in comparison with the overall labour market. It is something we are still trying to address.

Thanks a lot, Mauro. We have to go to the next talk, so please put your hands together for this very interesting talk.