Thank you very much. So today, as mentioned, I'm going to give a brief presentation about industrial, production-ready machine learning, and more specifically about horizontally scalable data pipelines. It may be worth covering a bit about myself first. I am the CTO at Exponential Technologies, Chairman at the Institute for Ethical Machine Learning, and a member of several groups. I also lead an engineering team at a product-based B2B company.

Today I'm going to cover an overview of, and the caveats of, scaling data pipelines, including machine learning pipelines; Airflow as well as its components, including things like Celery, the machine learning models and algorithms you would use, et cetera; then the difference between two topics that often get confused or mixed up, since they sit on a spectrum and blend into each other, namely machine learning pipelines and data pipelines; and finally an overview of Airflow with a use case. This is a very broad, big-picture talk, so I do recommend doing a deep dive on the technologies themselves afterwards.

We're going to be learning by example, and what better way than actually building a startup? We're going to go full hype today. We're going to jump on the hype train and build a large-scale crypto analysis platform. We're going to do heavy compute, data analysis, transform, fetch, and we're going to go deep, running predictions on LSTMs, because why not? And, putting ourselves back at the end of 2017, we'll ask the question: can we survive the crypto craze? The data set is all historical data for the top 100 cryptocurrencies, going all the way back from the beginning up to September 2017. It is over 500,000 prices. We have one interface that loads all of the cryptocurrencies into a pandas DataFrame, and another that actually triggers the distributed predictions. The code and the slides can be found on my GitHub, so please do feel free to check them out. And let's do this.

So, the early crypto beginnings. The Crypto ML team managed to obtain access to a unique data set, which allowed them to build their initial prototype, and that allowed them to secure VC money. It was so accurate, it was insane. And now they needed to figure out: what is machine learning? Really quickly, they found some tutorials, and they saw that it is basically automatic learning from data examples, to predict an output based on an input. For example, telling whether a shape is a square or a triangle. This is, of course, learning from some example data. More specifically, in this example, think of a two-axis plot where the y-axis is the perimeter and the x-axis is the area. This is a feature space, and we would see all of our data scattered around it. In the case of classification, we want to learn a function that divides, and optimizes the division of, these two classes, so that we can make a prediction for any new data point. That function is y = mx + b, where x is the input, and m and b are the weight and bias that we want to learn and tweak. Once we have the resulting function and we give it a new input, we are able to predict. The difference from traditional rule-based programming is that we let the machine do the learning. We give it a few examples, and it tries to find the best line. We give it more.
It gets better more and more, until, by minimizing a cost function, we find the best local or global optimum: in this case, the function that we can use to divide the two classes and predict any new, unseen inputs.

Then the Crypto ML devs asked themselves: well, can I use this for my time series data? And the answer is: not yet, because processing sequential data requires a different approach. This is where we introduce sequential models. These are models that try to predict a new time step based on the examples they have been given. It's a similar approach, ultimately the same basis; it just uses the cost function in a different way. The hello world of sequential models is linear regression, and this is what we would initially use to predict our cryptocurrency prices. However, if we only used this, we would end up extrapolating and seeing that Bitcoin is going to be worth billions in the next two years. Not that people didn't do exactly that, but we're not going to. Of course, for the Crypto ML team, going full hype, this wasn't enough: they wanted to use deep neural networks, more specifically recurrent neural networks. In this case, we're going to be using LSTMs. I'm not going to dive into the details, as I found out a few minutes ago that my talk is only 25 minutes; if you want that section, you can check out the videos of my other talks on YouTube.

Well, that's great. Now we know conceptually what we need to do, but how do we put it into practice? Of course, the Crypto ML guys found the concept of machine learning pipelines. They had been copy-pasting a lot of code from Stack Overflow, but they realized they actually needed to understand how it works, rather than just throwing different amounts of data at it and hoping it would converge into something valuable. To do it properly, they saw they needed to build a more mature infrastructure around their machine learning. They found that, from a very generic, very abstract perspective, machine learning pipelines break into two different workflows: model training, or creation, which is actually producing the function I mentioned before; and then using that model to predict unseen data. The first part, learning the function, breaks down into finding your training data and test data; transforming this data so that it can be used as input to the machine learning model you chose; actually training the model; and, once you're happy with the training across your iterations, persisting that model. Once you persist the model, you can take any unseen data, transform it into the same representation, run it through your model to obtain a prediction, and get the results.

It's important to mention that the machine learning pipeline contains the two steps that are probably the most important parts of this process: transforming your data into features, and training the model itself. For the first, you need to focus a lot on your feature space. That's what they found out: you need to find new ways to represent the data you're bringing in. In the example I gave previously, the squares and triangles, the features were area and perimeter; those were what we decided to represent the shapes with.
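To make that concrete, here is a minimal sketch of the shapes example, with invented numbers rather than anything from the talk's codebase: a scikit-learn classifier learning a dividing line in the area/perimeter feature space.

```python
# Minimal sketch of the shapes example (invented data, not the talk's code):
# learn a line that divides squares from triangles in area/perimeter space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Fake feature vectors: [area, perimeter] for each shape
squares = rng.normal(loc=[4.0, 8.0], scale=0.4, size=(50, 2))
triangles = rng.normal(loc=[2.0, 6.5], scale=0.4, size=(50, 2))

X = np.vstack([squares, triangles])
y = np.array([0] * 50 + [1] * 50)  # 0 = square, 1 = triangle

# Fit: learn the weights (m) and bias (b) of the dividing line
clf = LogisticRegression().fit(X, y)

# Predict an unseen shape from its area and perimeter
print(clf.predict([[3.9, 8.1]]))  # -> [0], i.e. "square"
```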
But we can actually think of more features to extract from that, such as, for example, color, number of corners, et cetera. The second thing is the actual training that the model requires: making sure you have a representative amount of data, and not just that, but ensuring that your model is the right one for the complexity of the problem you're facing. For example, if you have something as complex and crazy as a set of cryptocurrency data, or text, it may be the case that no matter whether you have a million examples, you would never be able to learn the abstractions of language itself from just the text you're using. That's why you would sometimes benefit from things like word2vec, which is already trained on the whole of Wikipedia to produce its word vectors. You also need to make sure that the data you have is the right one, and that you pick the relevant type of model for your prediction; you don't want to use one meant for sequential prediction, or classification, or clustering, in the wrong use case.

And now we're machine learning experts, so you can collect your certificates after the talk. These certificates are valid on your LinkedIn profile, at any non-tech meetups or parties, and in any tweets you send. You're officially an expert, and you can voice your opinion in all the ways that you want.

But seriously, now it's time to build a pipeline. For linear regression, what would this look like? Using a simple scikit-learn implementation, we take the transformed data, in this case the prices and the times; we select the model we want to use, in this case a linear regression model; we train our model on the data we currently have; and we choose a set of unseen data points that we can then predict for the future. In this case, we're predicting 10 steps into the future using a linear function. And then we can use it: we run our linear model and predict the next 10 time steps.

In the case of the recurrent neural network, it's very, very similar. The way the snippet does it is by taking a sliding window: it trains the recurrent network by feeding in 50 steps of the time series data and trying to predict the next one, then sliding the window along to keep training the model. Whenever we want to run the prediction, we take the first prediction, then use the next sliding window to get the next one, et cetera, et cetera. And we follow the exact same steps: get the data, select the model, train the model, choose how many unseen data points you want, and predict the unseen data. It's not too different; this is something you would tend to do very, very often. Of course, it's very simplified, and the code to build the LSTM you can find in the codebase as well; I break it down a bit more in the longer talk.

And if you actually run it, you get to see that it works. When I say it works, I don't mean that you should take this and bet money on a cryptocurrency, because I don't think that would work. I mean that it runs; I think that's the more accurate way to put it. It's also important to note that in this example, we're doing the training and the prediction in the same function.
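The real code is in the repo, but here is a rough sketch of the two patterns just described, with training and prediction together in one place; the fake price series and helper names are mine, not the talk's.

```python
# Sketch of the linear-regression pipeline plus the 50-step sliding
# windows used for the RNN (fake data, assumed names).
import numpy as np
from sklearn.linear_model import LinearRegression

prices = np.cumsum(np.random.randn(500)) + 100  # stand-in price series
times = np.arange(len(prices)).reshape(-1, 1)   # time step as the only feature

# Get the data, select the model, train the model
model = LinearRegression().fit(times, prices)

# Choose unseen data points: the next 10 time steps, then predict them
future = np.arange(len(prices), len(prices) + 10).reshape(-1, 1)
print(model.predict(future))  # linear extrapolation, 10 steps ahead

# Sliding windows for the recurrent network: feed in 50 steps,
# predict the 51st, then slide the window along by one
def sliding_windows(series, window=50):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

X_rnn, y_rnn = sliding_windows(prices)
print(X_rnn.shape, y_rnn.shape)  # (450, 50) (450,)
```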
As I covered previously, you would more often than not actually separate these. You would train, iterate, and once you're happy, persist the model and deploy it, and then run predictions in production on your infrastructure; the environments would be completely different.

And do not underestimate the hidden complexities. The work only really starts once you have a chosen model, or even a persisted model. There is a lot of complexity, all the way from staging and deploying machine learning models, to storing and standardizing your repositories of training and test data for reuse, to abstracting the interfaces of your machine learning pipeline so that you can, say, use different libraries. There is also distributing load across infrastructure, minimizing idle resource time, node failure backup strategies, versioning of your models, and storing a representation of the features you chose together with the data used for training, so that you can compare accuracy. There are a lot of things beyond the actual working, accurate machine learning model.

Now that the Crypto ML team have their own machine learning, deep learning pipeline, they asked themselves: are we done then? The answer is no; the fun is just getting started. After Crypto ML started using deep learning, TechCrunch and Mashable and all of the above wrote like a million articles, and the team was featured in all of them, just because of the stuff they were doing. Their user base exploded. Now they have tons of users coming in every day, each running several machine learning algorithms concurrently. They tried getting larger and larger single servers on AWS, but their costs just went insane, eating all of the VC money they had raised. They really should have seen this coming: machine learning is well known for being compute-heavy, but people forget how memory-heavy it is too. I mean, Microsoft just released one-terabyte-memory servers, and they're actually in use. It's also very hard to keep scaling up to larger instances, compared to just using smaller but distributed instances. And having to do everything in one node gives you a single point of failure. So it is time to go distributed.

And here we introduce Celery. Who here has actually heard about Celery? Oh, amazing, I could just skip these slides. Both? Okay, if you raised your hand for the actual food, fine. Celery, as you may know, is a distributed asynchronous task queue for Python. It's beautifully simple to use and to get started with on a not-yet-celerized project. It uses a producer-consumer architecture: you have a bunch of producers that say "this task needs to be done", and a bunch of workers, or consumers, that are continuously listening to a queue, in this case RabbitMQ, and take the tasks and execute them. Many people think it's really hard, and certainly the Crypto ML devs thought it was, but it's actually not that hard to get started. In this case, we're going to take the function we created for the RNN prediction and just celerize it. What happens when we celerize it? We first create the Celery object, which connects to the queue, to RabbitMQ; we literally just point it at the URL that has been exposed. We use the decorator, app.task, which basically says: this task is distributed, and it will be picked up by a worker listening for this specific type of task. Once you have that, the only thing you need to make sure of is that all of your inputs and outputs, all of your parameters and return values, are serializable. In this case, I'm doing a very basic serialization using pickle dumps and loads, on the return value and on the parameters. And the last thing is just to run the workers: you can run multiple workers very easily on any servers, scale them very easily, and see the logs for each one of them.
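A minimal sketch of what that worker side might look like; the module name, broker URL, and the trivial stand-in "prediction" are assumptions for illustration, not the talk's actual code.

```python
# tasks.py -- sketch of the celerized prediction task (assumed names).
import pickle
from celery import Celery

# Point the Celery object at the exposed RabbitMQ URL; rpc:// is used
# as a result backend so producers can fetch results back.
app = Celery("crypto", broker="amqp://localhost//", backend="rpc://")

# Celery defaults to JSON serialization, which can't carry pickled
# bytes, so this sketch allows pickle for task payloads and results.
app.conf.update(task_serializer="pickle",
                result_serializer="pickle",
                accept_content=["pickle"])

@app.task
def deep_predict(serialized_df):
    df = pickle.loads(serialized_df)   # deserialize the input DataFrame
    prediction = df["price"].iloc[-1]  # stand-in for the real RNN prediction
    return pickle.dumps(prediction)    # keep the return value serializable
```

The workers themselves are then just processes you start on any server, for example with `celery -A tasks worker --loglevel=info`, and you can run as many of them as you need.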
So now we just need to create the producer, following the same recipe. The code in this case basically goes through all of the cryptocurrency data frames, iterating through them and running the prediction for each of them, and then iterates through the results again and prints them. The only thing we need to do here is import the function we just created, the one we already declared as distributed, and instead of calling it in the usual way, we call it using .delay. There are other ways you can run it as well, but in this case we're also making sure the parameters are serialized. When we get to the results and the printing of the results, we want to print them sequentially, so we use .get, which basically waits until the task finishes. Normally you wouldn't really do this, for obvious reasons: you would want to trigger the jobs to do their own thing and probably store the output in some database, so that you avoid blocking on the tasks you have. And then you just run the producer. There could be many things creating jobs to put on the queue, with the consumers just listening to it. You can use external tools like Flower to visualize all of the workers that are currently executing; it's a very nice tool, and there are other alternatives, or you can just hit the API to see what is currently running. The next thing is simply to run more producers and consumers. So in this case, the Crypto ML team just celerized their deep predict functions and got them running.
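A sketch of that producer side, with crypto_dataframes standing in for the interface that loads the per-currency DataFrames:

```python
# producer.py -- sketch of the producer side (assumed names and data).
import pickle
import pandas as pd
from tasks import deep_predict

# Stand-in for the interface that loads all of the cryptocurrencies
crypto_dataframes = [pd.DataFrame({"price": [1.0, 2.0, 3.0]}),
                     pd.DataFrame({"price": [10.0, 9.5, 9.9]})]

# .delay publishes each task to the queue instead of running it locally
results = [deep_predict.delay(pickle.dumps(df)) for df in crypto_dataframes]

# .get blocks until each task finishes: fine for printing sequentially
# in a demo, but in production you'd have workers write to a database
for res in results:
    print(pickle.loads(res.get()))
```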
Yes! So now we have a machine learning pipeline and we're distributed. Surely we're done? Can we pack our bags? Well, not yet. This is, again, not the beginning, but it's certainly not the end. And that's where we jump into smart data pipelines. The Crypto ML team now has an exponentially increasing number of internal and external use cases, and their data pipeline is getting unmanageable. They also realized that the machine learning pipeline itself is just the tip of the iceberg: they forgot it's not the only thing required in terms of the data and ETL work you normally need in production. On the growing data flow complexity: they saw a growing need to pull data from different sources, and a growing need to preprocess and postprocess the data itself, even beyond and before the ML pipeline. And as complexity increases, the dependencies between tasks increase as well; if a task fails, you wouldn't want to execute the next one, for example. And some of these tasks need to be triggered chronologically.

Data pipelines can get very, very complex, and having celerized tasks run by Linux cron jobs gets really, really messy. Nobody wants to be there. You really want to get away from that, literally a huge, messy, terrible crontab file, to something more sophisticated that you can visualize, track, debug, et cetera.

But before jumping in, I want to clarify a distinction: the distinction between the terms data pipeline and machine learning pipeline, with a breakdown of definitions. This is not concrete; it's more of a spectrum. From my perspective, machine learning pipelines are in a way a subset of data pipelines, in the sense that data pipelines consist of ETL workflows, and running a prediction on a specific dataset could be one of those ETL executions. In a very, very broad sense, a data pipeline consists of: you take data from somewhere, you do something with the data, and you optionally put the results somewhere else. Maybe you don't put them anywhere, but that's roughly what it boils down to. It also often encompasses the concepts of scalability, monitoring, latency, versioning, and testing, and enough complexity that we do need some tools to help us. Fortunately, many people have the same problem, which is in a way great. And this is where we introduce Airflow, which I like to call the Swiss Army knife of data pipelines.

But before I jump into what Airflow is, I want to cover what Airflow is not, because I think that's also very important. I see a lot of people trying to use Airflow in use cases where, not that they shouldn't, but it's probably easier if they don't. First, Airflow is far from perfect. There are a lot of data pipeline frameworks you can use, which I'm going to cover; Airflow is not perfect, but from the research I've done in choosing frameworks, it is definitely the best one out there. It's also not a Lambda-style function-as-a-service framework, although it could be programmed as one. It's not a machine learning pipeline: it's not going to provide you with everything you need to version and deploy your models, or to run ten different sets of parameters and see which gives you the best accuracy. You could do that, but it's only one of the use cases you might use it for, and there are other, more specific tools. It's not extremely mature: it's in Apache incubation, so you need to be aware of that, and it's going through a big revamp. With this big revamp, you need to make sure that whatever you build and deploy in production with Airflow, you can migrate; and one of the beautiful things about Airflow is that it's very modular. It's also not a data streaming solution; however, you could still augment it with some sort of external data streaming or even an HDFS-based solution like Spark. People often ask: why would you use Airflow instead of Spark? You would never use Airflow instead of Spark. You leave your Spark where it is, and your Airflow may trigger some jobs on it. One is not the other, and one cannot replace the other. Well, maybe it could, but, yeah.
Maybe we can have a discussion in the pub about that. So let's now dive deep into Airflow. Airflow, in brief, is a data pipeline framework written in Python, with an active community and a UI for management. At Airflow's core are DAGs: directed acyclic graphs. These are basically ETL operations that are executed in a specific order, where each one only executes if the previous one succeeded. DAGs are defined programmatically: you specify the name of the DAG, the start date, and the schedule, which is in cron format; then you define the ETL operations, in Python or whatever else; and then you define the order in which they're executed. In the case of a very simple graph, you would define it as operator one, then operator two.

You can then see all of the DAGs you've defined in the list overview, together with all of the information about executions: which ones failed, which ones are currently being executed, et cetera. In these views, Airflow doesn't have the most beautiful design, but it's very functional. You can also open a detailed DAG view, where you see a specific DAG with all of its operations and all of its executions, in this case one per date; this one probably got executed once a day, and you can see that each run succeeded.

Operators are very easy to define. In this case, we're defining a crypto prediction where you get the data, run the prediction, and store it. We're using the Python operator specifically, which just says: wrap this Python function and run it when its turn comes. So when you define your DAG, it can run the fetch of the data, then run the actual prediction, and then maybe run some other operator that puts the results into some other system, or sends an email. Airflow also provides default operators. One of them, for example, is the sensor operator, which, when triggered, polls repeatedly until it returns true, and you can set how long it waits between polls.

You can also pass data downstream in your DAGs. Whenever you run something in an operator, you can return a value, and whatever you return, you can retrieve downstream. This is very useful for things like triggering DAGs that need to hold a specific ID, or passing other IDs along. It's also worth mentioning that Airflow has a database behind it. It uses Postgres, though it can use any SQL-based database. You shouldn't pass really massive things downstream, because the value gets inserted into the database and then retrieved back out again, which is not very efficient; you would normally just pass simple things like IDs. In this case, what we're doing is creating a push function and a pull function: the push basically returns a list, and the pull uses the XCom objects, as they're called, to retrieve whatever data was passed from upstream. Then we just define those two operators and say the push task goes before the pull task. You can also visualize the executions with their parameters, and the logs; a lot of this you can do straight from the UI. And the cool thing about Airflow is that it's very modular, which is great, because you can separate the definition of your operators from the definition of the DAGs themselves.
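As a sketch of those pieces together, written against the 1.x-era Airflow API that was current at the time of the talk; the task names and values are hypothetical.

```python
# dags/crypto_prediction.py -- sketch of a simple DAG with XCom passing.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def fetch_data(**context):
    # whatever a task returns is pushed to XCom; keep it small, since
    # it round-trips through Airflow's metadata database
    return [100.0, 101.5, 99.8]  # stand-in for the fetched prices

def run_prediction(**context):
    prices = context["ti"].xcom_pull(task_ids="fetch_data")
    print("predicting on", prices)  # stand-in for the real prediction

dag = DAG("crypto_prediction",
          start_date=datetime(2018, 1, 1),
          schedule_interval="0 0 * * *")  # cron format: daily at midnight

fetch = PythonOperator(task_id="fetch_data",
                       python_callable=fetch_data,
                       provide_context=True, dag=dag)

predict = PythonOperator(task_id="run_prediction",
                         python_callable=run_prediction,
                         provide_context=True, dag=dag)

fetch >> predict  # run_prediction only runs if fetch_data succeeded
```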
You also have things like sub-DAGs, so you can reuse the graphs you've defined. And it's very extensible: you can create your own operators, hooks, executors, et cetera. If that's not awesome enough, you can use different backends to run all of this on. Celery is currently the most common one, but I assume you can also use things like Kubernetes, using the native Kubernetes communication to send out the tasks, et cetera.

For Crypto ML, what they wanted to do was pull some crypto data every day and transform and standardize it; when it's ready, trigger a prediction for each cryptocurrency; then store that prediction in the database and, based on some rules, trigger a trade. If we break this down into Airflow terminology: for each individual cryptocurrency update, we would have an operator that transforms the data, then another that sends the data into the crypto engine, then a sensor that polls the engine until it's done. Once it's done, we branch off, where one operator stores the data and the other one, based on some rules, would trigger a trade. And for all of the crypto jobs, we would have some scheduled job that runs, say, every day; it would pull, for example, 20 cryptocurrency data sets, and for each one of them it would trigger a sub-DAG, in this case a DAG. Again, Airflow is not very mature: to trigger dynamic DAGs, you have to do it with the session API, which is not great. So this is the summary, the same thing I just said: the operators, the sensor, and the branch.

Success! So now the crypto guys managed to set up their pipelines. Do check out the Apache Airflow project; I would recommend getting started by just reading the documentation. There are some alternatives, like Luigi, Pinball, and Seldon Core, and other tools that are not really similar and would be used in completely different use cases, like Dask or Apache Kafka. And then some special mentions like Docker and Kubernetes, for which you can find implementations in the codebase.

And the cryptocurrency guys managed to sort it out. They survived; they're probably millionaires now, living on their own island. As for us, we got our overview of data pipelines, the difference between machine learning pipelines and data pipelines, and an overview of Airflow with a use case. Again, the code is in the repo, so please do feel free to check it out. And with that said, feel free to contact me. Thank you very much; I hope that was informative.