Hello. Hi. Yes. How are you? Hello, Fabiane. How is everyone? Yeah, we're good. I guess everybody's super excited to have you here, because we have a lot of people. If you see, we have almost 6,500 people joining the session today. Wow, that's a lot. So yeah, you have a big audience. And be aware that right now we have two Brazilians on the stage, so it's a very nice opportunity. Fabiane is a Brazilian friend of mine, and I'm Brazilian too. Besides being a Java Champion, she's one of the world's top experts in data science and data engineering, and I'm pretty sure she has a lot to share with us. Her company, I think, is one of the largest in its line of business in Latin America. So I don't want to spend more time, because, Fabiane, we're eager to hear from you.

Great. It's always good to see Edson. I miss him in Brazil now that he's moving to the US. Let me share my screen here so you can see my slides, and I can see there is another Brazilian in the chat. I'm going to put it in presentation mode, so I will not be able to see the chat. So, Edson, if you see something that I have to answer, just say something. Okay. No worries.

Okay, so thanks a lot for the invitation. I'm here in Brazil, as Edson said. My name is Fabiane. I'm a long, long, long time Java developer, and I've been working in data science for eight years, more or less. I started working with data science when this field was actually starting, so I guess over these years I saw everything. In the beginning, we would do everything, from installing a cluster to creating the machine learning algorithms. Now the field is a little more specialized. But we learned a lot over the years, and I guess the idea of this session is to share with you a little bit of what data engineering is and how Java developers can help this field take data science to the next level.

Right. So for those in the conference who are not familiar with data science, a data science project usually follows, more or less, a pipeline like this one. It's called a data science pipeline. Usually you have several sources of data. It can be your transactional information systems, it can be log files from another system, or it can even be third-party data that you need to create your machine learning models. This is what we call raw data, because it's data that hasn't been transformed yet; you have to bring it into your pipeline to be able to process it. One very important part comes after you actually gather your data, and you have to understand that getting these sources of data is not easy; it usually involves several scripts, API calls, or things like that. But once you have your data somewhere, you have to do lots of cleaning and transformation, right? Cleaning, because usually there is a lot of bad data; transformation, because data is usually in a format that's not the one you want. And after you have your data cleaned and transformed, what you have to do is what data scientists call feature engineering. Feature engineering is another set of transformations of the data, but it uses well-defined, well-known algorithms to transform the data and make it more suitable for the machine learning algorithms. A typical example of feature engineering is taking text values and transforming them into numbers. After you do this, usually you have to do some kind of data augmentation to get additional data to run your model.
And then you can actually do some artificial intelligence and create your models. The model part is where you give data to an algorithm, train the models, and then get some insights. And in the end you usually have some kind of visualization to be able to see what you actually discovered from applying artificial intelligence and machine learning to your data, right? So the fun part is doing the models and the insights. This is the machine learning part, the artificial intelligence part, or it can be done with just statistics. When you think about a data science project, you usually think that you're going to do only this part. But actually the other part, the data engineering part, is where the biggest share of the work is. In fact, we believe that 90% of the work in a data science project is spent on data engineering tasks. So it's a lot. When you think about machine learning and data science, you usually only think about the fun part, but 90% of the work is the groundwork: getting the data, transforming it, and everything else.

In the last years, machine learning and artificial intelligence evolved a lot. We have lots of algorithms, and we have cloud platforms that provide all the algorithms you need; you almost never have to develop a new algorithm anymore. However, in data engineering, we did not evolve as fast as in machine learning. A while ago, I went to GitHub and did a search for machine learning projects, and you can see that there are almost 200,000 projects about machine learning. But if you look for data engineering, feature engineering, or data lake, which are terms more common in data engineering, you find a lot fewer projects. So if 90% of the time is spent on data engineering, why are we not spending as much time building data engineering tools? What happened is that right now we are at a moment in data science where we need better data, and we need to improve data engineering in order to be able to fulfill the promise of data science. I know several companies that, for example, hired 100 data scientists and then found out that they just didn't have enough tools and data for them to be able to do the job. Someone told me a few months later: I hired 100 data scientists, but maybe I have work for 10 of them. So what we need right now is to improve data engineering to be able to do more with data science.

And as Java developers, we have a great opportunity here, because what is missing for doing better data engineering is basically better tools and better architectures. And if there's something that we know how to do, it's building better architectures and better tools. So if you see a pipeline like this, as Java developers, I'm sure you can think of several ways of implementing one of these, and it's not hard, right? You're going to write a script to retrieve the data, then maybe a piece of software that loops over the data and applies several validations and then transformations. And then maybe you get data from a third party to do the augmentation and merge it with the data you have. So it's not hard to do a pipeline like this. The problem is how to do a pipeline like this at scale, right? Because usually you are not going to do just one; you're going to do several of these pipelines. Just to give you an example, in my company we process 3.5 billion new records per day and we run 4,000 model pipelines per day. So imagine doing this by hand; it's not going to scale, right?
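To make that concrete, here is a minimal sketch, in plain Java, of the kind of hand-rolled pipeline described above: read raw data, validate it, transform it, write it back. The file names and validation rules are hypothetical. It works fine for one small dataset, which is exactly why it is tempting, and exactly why it does not scale to thousands of pipelines and billions of records per day.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// A hand-rolled pipeline: read raw CSV, drop bad rows, transform, write out.
// Fine for one small file; painful to repeat for thousands of daily pipelines.
public class NaivePipeline {
    public static void main(String[] args) throws IOException {
        List<String> raw = Files.readAllLines(Paths.get("raw_customers.csv")); // hypothetical input

        List<String> cleaned = raw.stream()
                .map(line -> line.split(","))
                .filter(cols -> cols.length == 3)                  // validation: expect 3 columns
                .filter(cols -> cols[2].matches("\\d+(\\.\\d+)?")) // validation: amount must be numeric
                .map(cols -> cols[0].trim().toUpperCase() + ","    // transformation: normalize id
                           + cols[1].trim() + ","
                           + Double.parseDouble(cols[2]))          // transformation: parse amount
                .collect(Collectors.toList());

        Files.write(Paths.get("clean_customers.csv"), cleaned);    // hypothetical output
    }
}
```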
So usually when you are training to be a data scientist and you do online training, for example, there are several good courses out there, but this training usually happens in what I call the data science fiction world. In the fiction world, when you are learning, it seems that there's no big data and there's no legacy code. It seems that when you finish your experiments and you have a model, the job's done. And it seems that there's always a data lake with all the data you need. Of course, this is not true, and in the real world things are a little bit different.

First, you have lots of data. It's not just data that is going to fit on your machine; usually you have lots of data that needs a cluster to process, or even data that is sensitive, so you can't just run it on your machine. You have to have some kind of infrastructure to run in the cloud. And you have a lot of legacy code, probably Java code, right? And why is this important? Because in your company you probably have lots of libraries that you built over time that can help with several activities, including cleaning data. For example, if you have a company identifier with some kind of validation that you already programmed, it makes sense to be able to use that same code in your data science projects, right? And not think of a data science project as a totally independent project that's not going to use your legacy code.

In the real world, we also need tools to experiment and to test. For experimentation and testing, we have tools that are very well known in data science, but we have very few tools for deploying and cataloging machine learning models. This is very important as you start investing more and more in machine learning in your company. We need tools to catalog models and to build a data lake. Many companies create a data lake just by creating a bucket in some cloud storage, but it's not just that. You need some sort of catalog to know where the data came from, what license you have for using that data, when it was updated, and so on, right? A data lake is a lot more than just a bucket somewhere with the data you need. And you need security, so only people that are allowed to see the data are going to see it, and so on. And in the real world, we need good architecture practices to build scalable solutions to process data and train models. With the amount of data you have, it's not something you can just train on your machine, or on a machine that you create in the cloud and then turn off. Usually you need some kind of cluster that is able to deal with large volumes of data, right?

So for Java, we have a tool that probably most of you know already, called Apache Spark. There are other similar tools in Java, but Spark is probably the most used tool for processing large volumes of data. Spark is a framework that is very fast and lets you process data in a distributed way. Usually you have an Apache Spark cluster, and Spark deals with all the complexity of dividing the data, processing it, and so on. And inside the Spark ecosystem, there are two tools that can help Java developers a lot. One is called Spark SQL and the other is called MLlib. MLlib is a machine learning library that has several algorithms; probably most of the algorithms you're going to need are already implemented in MLlib, and you can use it from Java.
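As a rough idea of what this looks like from plain Java, here is a minimal sketch using Spark's Java API: create a session, read a file into a DataFrame, filter it, and write the result back. The paths and the column name are hypothetical; the point is that the same code runs unchanged whether the cluster has two machines or four hundred.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class FirstSparkJob {
    public static void main(String[] args) {
        // The session is the entry point; the master URL comes from the cluster configuration.
        SparkSession spark = SparkSession.builder()
                .appName("first-data-pipeline")
                .getOrCreate();

        // Read raw data into a DataFrame (Dataset<Row>).
        Dataset<Row> records = spark.read()
                .option("header", "true")
                .csv("s3://my-data-lake/raw/transactions.csv");    // hypothetical path

        // Keep only the rows we care about; Spark distributes this work across the cluster.
        Dataset<Row> filtered = records.filter(col("amount").gt(0));

        // Write the result back to the lake.
        filtered.write().parquet("s3://my-data-lake/clean/transactions"); // hypothetical path

        spark.stop();
    }
}
```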
So we talked a lot about data lakes when doing data science, but there's something else we need to talk about: a code lake. That means having your code available to be used in your distributed processing, so you can create those data science pipelines in a more scalable and faster way. With Spark SQL, we as Java developers have a good opportunity to do that. So this is an example of Spark SQL code. This is Scala code, but you can write more or less the same thing with Java as well. Here, in a single line, I'm just reading a file; it's very intuitive. Then I get these records, which is what Spark SQL calls a data frame, I can filter the data frame by some condition, and then I can write the data back to a new file. This is just for you to understand how Spark SQL works. You can see that the code is very simple to understand; it's very easy to do file handling and things like that. And since Spark SQL runs over Spark, when you execute this code, you are actually executing distributed code. This is going to run over a cluster that can have two machines, or maybe a hundred machines, or four hundred machines. It doesn't matter; you can scale the cluster as your data increases as well.

But better than using Spark SQL is using Spark SQL with your own code, and this is how you create your code lake. What I'm doing here is creating a new instance of this class, SparkGeohash. This is a class that I created, and I can register it; it's going to be a user-defined function in Spark SQL. I register the function and then I can use it inside Spark SQL. So my legacy code can be encapsulated in a function like this, and I can actually call my legacy code from Spark SQL. And this code is going to run on a cluster, with distributed processing, scalability, and so on. So it's a great way of using your legacy Java code to help you create your data science pipelines.

Following this strategy, there are other nice things that you can do. You can create a plugin architecture using Spark SQL and create things like semantic data types, transformations, aggregations, and other functions. Semantic data types are very interesting for data science, because they open up a lot of possibilities that are not so obvious. In Spark SQL you have several data types, but they are common data types like string, double, long, and so on. When you read a data store, like this file here, you can tell Spark that you are going to use this schema, and you say what data you have and what type each field has. So I'm saying here, for example, that the first field is a string, and this one is a double, right? When I read a file passing a schema, what Spark SQL does is ignore data that is not compatible with the types informed in the schema. This allows you to do the first phase of cleaning the data. If you say that a field is going to have doubles and the incoming field has a string, that data is going to be ignored, and you have your data automatically cleaned, right? But with Spark SQL you can also create your own data types. So in this example here, I created a data type called social security number, right? If I have a social security number data type and I implement validation in it, I can automatically validate whenever a value that is not a social security number comes in. So it's another level of cleaning that I can do.
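A minimal Java sketch of that idea follows: legacy code (a hypothetical LegacyGeoHash class standing in for whatever your company already has) is registered as a Spark SQL user-defined function and then called from an ordinary SQL query, so it runs distributed over the cluster. The table, columns, and paths are made up for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class CodeLakeExample {

    // Stand-in for existing, battle-tested company code (the "code lake").
    static class LegacyGeoHash {
        static String encode(double lat, double lon) {
            // ... the real implementation lives in your legacy library ...
            return String.format("%.4f:%.4f", lat, lon);
        }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("code-lake").getOrCreate();

        // Wrap the legacy method as a Spark SQL user-defined function.
        spark.udf().register("geohash",
                (UDF2<Double, Double, String>) LegacyGeoHash::encode,
                DataTypes.StringType);

        spark.read().parquet("s3://my-data-lake/clean/locations")          // hypothetical path
             .createOrReplaceTempView("locations");

        // The legacy code now runs distributed, inside an ordinary SQL query.
        Dataset<Row> enriched = spark.sql(
                "SELECT id, geohash(latitude, longitude) AS cell FROM locations");

        enriched.write().parquet("s3://my-data-lake/features/locations");  // hypothetical path
    }
}
```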
So here's an example. It's the same example as before, but instead of saying that the field SSN is a string, I'm saying it's some other type, the type that I created, which means that when Spark SQL reads this file, if one of the values is not a valid social security number, that data is going to be ignored. So it's another level of data cleaning you can do using legacy code, using Java, and using Spark SQL. Once you have semantic data types, there are lots of things you can do that can be a game changer in data science. First, once you have these types defined, you can have type detectors, because one of the biggest problems in data engineering, once you have the data, is to understand what that data really is. If you have type detectors, you can detect whether a column is a string, or a double, or a phone number, or an email, and this can bring more intelligence to your system. It also helps a lot with privacy and GDPR, for example. If you have a field whose semantics say it's an email, you know that an email is sensitive information, so you can, for example, automatically anonymize that information. And you get automatic validation, and in the future you could have some sort of automatic feature engineering; that's a more advanced topic, but it can be done as well if you have data types with semantics.

Just to show you how to create functions: you can have several functions, transformations, and aggregations created with Spark SQL, and here's an example of how to create one of these functions. You can see that I'm calling a Java class called GeoHash, that's my legacy code, and I'm just encapsulating it in a Spark SQL function. Once I do this, I can, like I showed before, call this function as part of my data science pipelines and automatically clean and transform the data. So it's a very important way of using your Java legacy code.

Right. So once you have all the infrastructure and architecture in place to process the data using Spark, and maybe using your legacy code, you can give your data scientists a platform where they can do experimentation, right? But there's actually a gap between experimentation and going to production. The difference is that when data scientists are experimenting, they are usually using samples, so they are not going to experiment with big data; usually they want a sample of your data to do the experimentation. And usually the samples are not readily available; usually you have to compute those as well. They typically use a tool called a data science notebook; the best-known one is Jupyter. Jupyter and other tools like it allow you to write code and execute it as you write, so you can do very fast and interactive experimentation. When you go to production, things are a lot different. First, you are going to use big data and not just samples. You need to have some form of scheduling system so you can schedule your pipelines to be executed in batch, right? Because you're not going to build a data science pipeline that runs just once; usually you have to run it multiple times, at least every time you have new data. And you need some sort of logging system so you can understand what happened with the pipeline.
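One way to get a similar effect with just the standard Spark Java API is to combine an explicit schema, which drops rows that cannot be cast to the declared types, with a validation UDF that plays the role of the social security number type and drops rows whose SSN field has the wrong shape. This is a hedged sketch of that approach, not the speaker's exact implementation; the regex, column names, and paths are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class SsnCleaningExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ssn-cleaning").getOrCreate();

        // Explicit schema: rows that cannot be cast to these types are dropped up front.
        StructType schema = new StructType()
                .add("name", DataTypes.StringType)
                .add("ssn", DataTypes.StringType)
                .add("salary", DataTypes.DoubleType);

        Dataset<Row> people = spark.read()
                .schema(schema)
                .option("mode", "DROPMALFORMED")                   // first pass of cleaning
                .csv("s3://my-data-lake/raw/people.csv");          // hypothetical path

        // Second pass: a validation UDF standing in for a "social security number" semantic type.
        spark.udf().register("is_valid_ssn",
                (UDF1<String, Boolean>) s -> s != null && s.matches("\\d{3}-\\d{2}-\\d{4}"),
                DataTypes.BooleanType);

        Dataset<Row> valid = people.filter(callUDF("is_valid_ssn", col("ssn")));

        valid.write().parquet("s3://my-data-lake/clean/people");   // hypothetical path
    }
}
```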
So one architecture that has been used by some companies, and that we use in our company as well (and I know that Netflix has an approach pretty much like this too), is built around these notebooks. The notebooks are usually created with a tool like this one; this is called Spark Notebook, but it's very similar to Jupyter, which is the better-known notebook tool. You write code in it, and if you can connect the notebook to your code lake and to your data lake, you can give data scientists a tool that's very interesting for doing experimentation. But usually, when companies use notebooks, they have to reprogram the experimentation done there in software that's going to run in batch, right? So one approach that several companies are using right now is to take the same code that you created during your experimentation phase and run that same code in production. In order to do that, you need to have these notebooks with some form of parameters, and a scheduler. If you have this, you can have an architecture more or less like this one. You need a scheduler; in Java you can use Quartz or any other library that allows you to schedule jobs. Then, when there's a pipeline to be executed, this pipeline goes to a queue. Then you have to have some form of lock to acquire the data from the data lake. Then you call a pipeline executor, which is another piece of software that you write, and this software sends parameters to the notebook you wrote. And this notebook is going to be executed over the Spark cluster. The interesting part is that the execution of the pipeline, or the notebook, can be used as the log itself, because when you execute each of the cells in the pipeline, these notebook tools usually also save the output of the execution. So you can use the executed notebook as your logging system. It's a very clever way of executing Spark pipelines in batch. If you are interested in this architecture, I can send you articles about how to do it. It's very efficient and very elegant.

There's just one more thing we need to talk about: model deployment. So far we've talked about how to clean and transform the data and make pipelines that are going to run on a cluster. But what about model deployment? Models are the result of training in machine learning. Usually, when you are creating a model, you give data to a machine learning algorithm and this algorithm produces a model. A model is a piece of code that has the result of the training inside. What happens is that, as more and more decisions about our lives are made by artificial intelligence, we need to have some way of tracking these models. We need to know which data was used to produce a model, what algorithm was used, and when each version was created. You need to keep the past versions, because if you made a decision based on a past version of a model, it's very important to know how that model was built, right? This is going to become even more important in the following years, as more and more regulations come into place that don't allow these models to be just black boxes, right? So in order to do model deployment, we need to talk not about data engineering but about machine learning engineering. Machine learning engineering is another field that's not very well developed yet, but I'm sure that Java developers can help a lot in this field.
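For the scheduler piece, here is a minimal sketch with Quartz, the Java library mentioned in the talk: a daily trigger fires a job that would hand the notebook name and run parameters over to the pipeline executor. The PipelineExecutor call, notebook path, and schedule are hypothetical placeholders for the architecture described above.

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class PipelineScheduler {

    // Quartz job that hands one pipeline run to the (hypothetical) executor, which
    // sends the parameters to the notebook and runs it on the Spark cluster.
    public static class RunPipelineJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            String notebook = context.getMergedJobDataMap().getString("notebook");
            String runDate  = context.getMergedJobDataMap().getString("runDate");
            System.out.println("Queueing notebook " + notebook + " for " + runDate);
            // PipelineExecutor.submit(notebook, runDate);  // hypothetical executor call
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(RunPipelineJob.class)
                .withIdentity("churn-pipeline")
                .usingJobData("notebook", "notebooks/churn_features.ipynb")  // hypothetical
                .usingJobData("runDate", "yesterday")
                .build();

        // Run the pipeline every day at 02:00, once the new raw data has landed.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("daily-churn-run")
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(2, 0))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```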
So for doing machine learning engineering, you need to have a model catalog. You need to know how the model was created, what parameters were used, what data was used, and what license you had for the data you used. You need some sort of model versioning and some way of executing models. We as Java developers are very used to dealing with different versions, but, believe it or not, in the model world this is still not very well solved; there is no standard tool for that or anything like that. In Spark MLlib, there's a concept called Pipelines, MLlib Pipelines, that can help a lot when programming machine learning pipelines in Java. So here is an example. This first part here is the feature engineering part. Then you create a machine learning algorithm; this one is a logistic regression, a very well-known machine learning algorithm. And then you put everything in a pipeline, which is pretty much a list of things that have to be executed. Then you can train the model and save the model to a file. Later, you can read this model back and apply it as part of your Spark SQL pipeline. This way, you have the whole life cycle of a data science project created in Java and with scalability. So it's a very powerful tool, and we really need more architecture and more machine learning engineering in this field, and I'm sure that the experience we have as Java developers can help a lot. Okay, so I think I used all my time. If you want more information or you want to discuss, this is my Twitter handle here. And if we have time for a few questions, Edson, it's all yours.

All right, Fabiane, thank you for this amazing session. I'm pretty sure everybody learned a lot about data science; I'm not from this area myself, so it was super interesting to learn more about it. Unfortunately, we don't have time for questions. We got a bit of delay when we started the track, and nobody posted questions here in the chat either. But we have a lot of Brazilian flags; you see, you have a Brazilian audience here. So, Fabiane, I would like to thank you very much. Again, it's always awesome to have you here. It's unfortunate that we won't be able to see each other in person, but we can make it up next year, maybe. Maybe next year. So thanks a lot. And if you want to continue the discussion, just reach out to Fabiane on Twitter. Thanks a lot. Bye, everyone. Bye.
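To make the MLlib Pipelines part of the talk concrete, here is a minimal Java sketch of that flow: a feature-engineering stage, a logistic regression, a pipeline that chains them, training, saving the model, and loading it back later to score new data. The column names and storage paths are hypothetical.

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.io.IOException;

public class ChurnModelPipeline {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder().appName("churn-model").getOrCreate();

        // Training data produced by the earlier data engineering pipelines (hypothetical path).
        Dataset<Row> training = spark.read().parquet("s3://my-data-lake/features/churn");

        // Feature engineering stage: combine raw columns into a single feature vector.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"age", "monthly_spend", "tenure_months"})
                .setOutputCol("features");

        // A well-known algorithm; "label" is the column we want to predict.
        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("label")
                .setFeaturesCol("features");

        // The pipeline is essentially an ordered list of stages to execute.
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{assembler, lr});

        // Train and persist the resulting model, including the feature engineering step.
        PipelineModel model = pipeline.fit(training);
        model.write().overwrite().save("s3://my-models/churn/v1");   // hypothetical path

        // Later, in another batch job, load the model back and score new data.
        PipelineModel loaded = PipelineModel.load("s3://my-models/churn/v1");
        Dataset<Row> scored = loaded.transform(
                spark.read().parquet("s3://my-data-lake/features/new"));
        scored.select("prediction").show();
    }
}
```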