Hello everyone. Today we are here to talk about how to build machine learning pipelines with different open-source technologies: Apache Beam, Apache Flink, TensorFlow and Hopsworks. Hopsworks is the open-source platform for data-intensive AI that we work on. First we will cover what machine learning pipelines are, with a very simple example; what Hopsworks is; what Apache Beam is and how it works with Flink; how we can build machine learning pipelines on top of all this with TensorFlow Extended (we will explain later what that is); and we will finish with a short demo.

So, what are machine learning pipelines? Here we can see that we essentially apply transformations on data. We get raw data from different sources - it can be the internet, IoT devices, transactions, many different things - and we need to ingest it into the platform of your choice. There you ingest, and then do some data preparation, some validation and transformation. For all this you typically need CPU compute power. Then, when you get to the machine learning part, you typically need GPU compute power. And after you train your model, you need to be able to serve it in production, so that different clients can actually use your models.

Now, one way to do all this would be to have a single pipeline where all the stages live within the same processing framework. There is a project in the Apache Spark community called Hydrogen which tries to do that; it hasn't been released yet, because it really is a complex problem to keep everything in the same framework. The other way is to have some distributed storage where you separate the data preparation and transformation part from the training part. What we do is use the concept of a feature store: a place to manage your feature data. You do your feature engineering on one end, backed by a very scalable distributed file system which is part of the platform, and you can get your feature data out of the feature store in different formats, whichever way you want to consume it, and then you continue with your training and serving. (There is a small illustrative sketch of this right below.)

So that's a very gentle introduction. Now, what is Hopsworks? Hopsworks is an open-source project. It has been going on for a few years now, and this is a short timeline of what we have been doing so far. One of the most recent additions is the feature store, and what we are also showing here today: how to run TensorFlow Extended on Hopsworks with Apache Beam and Flink. Hopsworks, as we said, is a platform for doing data-intensive AI: on one hand you have all the different data sources that you bring into the platform, you apply your transformations, you do your machine learning and deep learning, and the output is different applications and APIs to access all this data and functionality.
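As a concrete aside on the feature store just mentioned, here is a minimal sketch of pulling feature data out of the Hopsworks feature store from Python, loosely based on the hops-util-py library; the exact function names and the feature names are assumptions for illustration, not verbatim API:

```python
# Hypothetical sketch: reading feature data from the Hopsworks feature store.
# Module and function names follow hops-util-py conventions but are assumptions.
from hops import featurestore

# Join a few engineered features into a single training dataframe.
# ("trip_distance", "fare_amount", "tip_amount" are made-up feature names.)
df = featurestore.get_features(["trip_distance", "fare_amount", "tip_amount"])

# Materialize it in a format the training job understands, e.g. TFRecords.
featurestore.create_training_dataset(df, "taxi_training_set", data_format="tfrecords")
```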
Now, if we zoom in a little, we can see the different technologies that we use to implement this. At the bottom layer we have HopsFS, our distributed file system, and the unique thing about all these technologies that we integrate is that it's not just a bundle of technologies: we keep the metadata of all of them in the same metastore, in our file system. On top of that we support streaming with Beam, Spark and Flink; you can also do batch with the same technologies; there is the feature store, which is built in-house; and different distributed machine learning and deep learning technologies - really any Python library that you might want to use, but the main ones are TensorFlow and PyTorch, as we have been seeing. Then you have your model and you can serve it, and you can do that with Kubernetes in an elastic way: you ask for more or fewer containers, and the platform manages this automatically for you.

All this comes with a nice web UI that you access from your browser, but the principle is that everything needs to be accessible via a well-documented REST API. These are some of the resources of the platform: you can create projects (kind of like a workspace, you can imagine it); datasets, where you actually store your data; jobs, your compute jobs - Spark, Flink, Beam; the feature store; etc. It is very important to have this well documented and to follow best practices, to allow users to actually use your APIs from different clients.

So that was Hopsworks, generically speaking the platform that we are building. And now the main part of the talk starts, about Beam. Apache Beam is a framework for doing distributed processing. It is a programming model built on top of Apache Spark, Flink and different runners. What it really gives you is a way to write your distributed processing programs in one language of your choice, and then Apache Beam will manage to execute your code on different runners, whichever runner you have in your platform. In our case that is Apache Flink, because of the on-prem solutions it is the more mature one for Apache Beam; if you are on GCP you would probably use Dataflow. The idea with the portable runner is that you implement your program, say in Python; this program is converted into the pipeline that will be executed later by the runner. You can choose Apache Flink, Dataflow, Spark; there are a few others, like Samza I think. The actual execution is then pushed down to each machine. So the idea is that you can distribute your execution, you have a scalable engine: you are not restricted to your laptop, you can take your code and distribute it on a cluster. (A minimal sketch of what submitting through the portable runner looks like follows below.)

Now, this needs a lot of manual labor to get running, and a lot of things need to be done to have something that fits your production needs. So what we have been working on is how to offer Beam as a service, let's say, in Hopsworks. What does that mean? It means we try to make it easier for people to start implementing and running a Beam program: you can develop different Beam pipelines in Python from your Jupyter notebooks. The portable runner that I just described is a fairly recent development in Beam, and a very important one, because portability was part of the Beam concept from its inception, but only recently has there been real improvement and progress on it.
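To make the portable runner concrete, here is a minimal sketch; the job service endpoint is an assumption for illustration, and in practice it would point at whatever job service fronts your Flink cluster:

```python
# Minimal Beam pipeline submitted through the portability framework.
# Assumes a Beam job service listening at localhost:8099 in front of
# a Flink cluster; the endpoint is a placeholder.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["to", "be", "or", "not", "to", "be"])
     | beam.combiners.Count.PerElement()
     | beam.Map(print))
```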
So you are not restricted to Java: you can write Python, and that is very important for machine learning, because you can really use all these Python libraries for machine learning on top of a distributed framework that gives you different runners - and that framework is Beam. You can develop your Beam pipelines in Python from within Jupyter notebooks, so you have this familiar notebook environment. Now, to run something as a service you need some tooling to simplify deployment and execution. That means that the platform where you run Beam as a service will manage deploying artifacts, collecting logs and all that; we will see more details later on. Then you need to manage the lifecycle of the Beam portability job service - again, we will describe it later, but these are different components of the Beam portability framework that you need to manage either on your own, or you have a service do it for you. Logging, as we said. And then the important part: if you have a Python program, it will be pushed down to the different executors by Beam, but there you need to have your Python environment already available, either on the machine or in a container - we will see which the trade-offs are. In Hopsworks we provide at least Flink, and Spark is a work in progress, because these are the main runners if you do not run Beam on Dataflow - and Flink especially is the one with the more complete functionality, the way to run Beam on-prem, let's say, or if you are not on Dataflow.

Now, we have developed some APIs, listed here, that help with all this distribution of artifacts, logging, et cetera. These libraries, hops-util-py and hops-util, for Python and Java respectively, simplify deployment by doing the following. They set all the security configuration parameters, because we have this security model - we will explain it a bit later - that uses certificates; these APIs come with some default configuration and provide getters and setters for all your security configs. We provide service discovery: when you are in the platform you have other services running, not just Beam and Flink - we have Kafka for ingestion, you have Elasticsearch available for free-text search on your logs, and all these things. You need to know where these services are located if you want to access them from within your Beam program, so through these APIs you get back the details of where these services are running. There are helper methods for the Hopsworks REST API, so from within your Beam program you might want to do something more Hopsworks-specific, like run another job, upload a file, create a dataset, something like that. And a very interesting ML experiments API: we will show later how you can run machine learning experiments on Hopsworks, and with TFX, using these helper APIs that we have built. (A small sketch of the helpers follows below.)
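A hedged sketch of what these helpers look like in practice, following hops-util-py conventions; treat the exact function names as assumptions rather than guaranteed API:

```python
# Sketch of service discovery and security helpers, in the style of
# hops-util-py; function names are assumptions for illustration.
from hops import hdfs, kafka, tls

# Service discovery: where are the Kafka brokers in this cluster?
brokers = kafka.get_broker_endpoints()

# Security: per-project TLS key/trust stores materialized for this job.
keystore = tls.get_key_store()
truststore = tls.get_trust_store()

# Hopsworks-specific helper: the project's root path on HopsFS.
project_root = hdfs.project_path()
```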
Now, the question we posed before: which one is better? If you want to run a Python program in a distributed way with the Beam portability framework, you really have two options, either the process one or the Docker one. This really boils down to how you want to manage your infrastructure and which trade-offs you are willing to take, depending on the use case. If you go with Docker, what you need to do is build your image with all your dependencies, or download them later when the container is actually deployed; if you make changes to your Python environment, you need to update or modify your image, and you need additional infrastructure components to deploy the images and all that. The process one runs natively on the file system and you skip all that, but you still need to keep your Python environments consistent across all the different machines - and Python is notoriously hard to get right: when you uninstall and install different libraries, you get dependency conflicts and all that. So we went with the process one, it fit our model better, but it really is a matter of how you want to manage your infrastructure. (The sketch at the end of this part shows how the two options differ at the configuration level.)

Now, how we solve the Python problem of having the same environment on all the machines of the cluster: we manage the environment with Anaconda, so each workspace has its own Python environment, and it is fully self-service. You go into the UI - we will show it later - you search from the UI: I want to install this library, like Beam (Beam also has a Python library), and it will go, search for the library and install it. We have some services running in the background that replicate and install the environments and make them available across the cluster.

Now, this was the infrastructure part. If you want to actually start a Jupyter notebook, you go to the Jupyter dashboard in Hopsworks and you can select different settings, for example whether you want to run pure Python or Spark. I will go a bit quickly through the screenshots, as we also have a demo later on. As I said, with Jupyter and JupyterLab in Hopsworks you can execute a Beam Python pipeline, and that can happen in many different ways. You can run your pipeline, as we see here in the example - yes, the letters are a bit small - in a pure Python kernel, or, what we do, we use Spark to orchestrate the execution of the Python programs in the back; this is transparent to the user, the only thing they need to do is select the PySpark kernel in Jupyter. If they go with Python, the Python kernel will run in a Docker container managed by Kubernetes, which is part of the Hopsworks installation, so you do not need to manage that.

An important aspect of this is that you use notebooks to actually develop your Beam program. That is good for development, but what do you do when you want to deploy this in production and run the notebook as a core component of your production pipeline? The first way is that your data scientists use Jupyter for interactive programming in Python: you start Jupyter, and we go through a Livy server, a service we have in Hopsworks, to actually start your notebook. The other way to orchestrate all this is Apache Airflow as an orchestrator, which does the following: you define which notebooks you want to run periodically, you define your directed acyclic graph - you say I want to run this notebook after that notebook - and you define your dependencies. The notebook will automatically be converted by Hopsworks into a Python program, and then it will follow the same route, let's say, of materializing all the certificates, all the security secrets and everything, and it will be deployed on the cluster. Both the notebooks and the Python programs converted from them are stored on the distributed file system that we described before.
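Coming back to the Docker-versus-process choice for a moment: at the Beam configuration level, the difference looks roughly like this. The image name and boot command are placeholders, not values from the talk:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Option 1: Docker - each SDK worker runs in a container built with your deps.
docker_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=DOCKER",
    "--environment_config=myrepo/beam-python-env:latest",  # placeholder image
])

# Option 2: Process - workers run natively; the Python environment must
# already be consistent on every machine (what Hopsworks does with Anaconda).
process_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=PROCESS",
    '--environment_config={"command": "/path/to/sdk_worker_boot"}',  # placeholder
])
```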
So: consistent metadata, and we also keep the notebooks there - everything under the same roof. Now, if we dig a bit deeper into the portability framework of Beam: this image is taken from the link over there, and it is a very good figure if you want to understand how the portability framework works. What happens is that you have your Python program developed in Beam, you have your Beam pipeline, and then you need to run this job service. This is a standalone service that you need to start and attach to a Flink cluster, and you need to have started the Flink cluster beforehand. This is the way you would run Beam on Flink on your laptop, let's say. Then you need a staging location - here we see DFS, S3, whatever; it really depends on you - where Beam stores the artifacts, the state and the logs of the job service, and also of all the task managers in Flink.

Now, to simplify this and have something "as a service", as we said: we have Hopsworks at the top, which is the entry point to this architecture, and we have HopsFS as the backend storage, the distributed file system where we store the artifacts. We automate the process of starting the runner and the job service, and we have compiled the job service with the dependencies that come from HopsFS, so they are compatible and the job service is able to talk to HopsFS. To help do all this, as we said, we have these helper APIs that we ship to the Flink cluster. Now, the important part is that you need to start all of this: you need to build the job service and start it, and you also need to start the Flink cluster. Again we have helper methods, written in Python: you just do "from hops import beam as hops_beam", and then "hops_beam.start()". What this does is start the Flink cluster automatically - there are some default values in the start call - and also start the job service that you see in the middle of the screen. The API then attaches the job service to this particular Flink cluster; that is the important part, because then the job service can send the tasks to be computed on the Flink cluster. There is also a method, get_portable_runner_config, which returns some default portability configuration; you can add your own and override the defaults, but these are the ones that come out of the box in Hopsworks. (See the sketch right below.) And the last part is that the SDK workers - the actual workers that will run your code on the Flink cluster - have access to the Python environment that, as we said before, has been created and synchronized across the machines by Hopsworks itself.

This is a screenshot of the Apache Flink integration in Hopsworks. You have your workspace - it is called "big things" - and this is actually the history server of Flink; here you can see a bunch of jobs that I have run, most of them succeeded, a couple of them failed, and when you run a Beam pipeline you are able to view everything from this dashboard. Now, the API in a bit more detail: you define your runner name - this is really the name of the Flink cluster that you are deploying, and you can override it - and then the default configs that Beam can set for Flink.
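Putting that together, a hedged sketch of what starting the runner and job service looks like with these helpers; the calls below are the ones named in the talk, while the details of their optional parameters are assumptions:

```python
# Sketch based on the Hopsworks helper API described in the talk.
from hops import beam as hops_beam

# Starts a Flink cluster and the Beam job service, and attaches the job
# service to that cluster. The start() call accepts overrides for things
# like job manager heap, number of task managers and slots (the exact
# keyword names are platform-specific and omitted here).
hops_beam.start()

# Default portability configuration for the runner; entries can be
# overridden before submitting the pipeline.
config = hops_beam.get_portable_runner_config()
```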
Now, if you have a different runner, like Spark or Samza, this would be different: it is the same start method but with different parameters for every runner. For now you can say: I want this heap size for the Flink job manager; this number of task managers, which you can increase if you want to scale out; more memory; the slots - this is Flink-specific, the number of slots in Flink; and "kill runner". Kill runner means that when you actually stop your Jupyter notebook, when your Beam pipeline is finished, there is an option to also shut down the actual runner - because we try to make it transparent to the users: you should not really need to care about the runner, whether it is Flink or Spark or whatever; the framework should be able to handle that, start the runner and shut it down gracefully, and collect all the logs and everything. Optionally, you can start the Beam job server yourself - it is done automatically by the start method, but if you want some more control over it, this is the API.

Now, logging is very important, because with all these different frameworks and services you have logs scattered kind of everywhere. You get logs from the Flink job manager and the task managers, which in our case run on YARN; you get logs from the Beam job service itself, which sends the pipeline - down to the actual graph, down to Flink, the tasks. There are two ways to run this in Hopsworks, local mode and cluster mode, so you might get your logs in your Jupyter directory, or somewhere on a machine in the cluster where the PySpark container is running. You also get important logs from the actual SDK worker: your Python program runs in the worker, so if something goes wrong you really need to go deep and see, OK, my Python program failed because a dependency was missing, or a segmentation fault, which you might also get with all this transferring back and forth, pickling and unpickling, et cetera. To collect all of these we use the Elasticsearch-Logstash-Kibana stack, the ELK stack: we collect the logs, we aggregate them into the workspace - we have an index per workspace, kind of - and it looks like this: from within your workspace you open Kibana and you can see the logs. For example, these are the logs of the Beam job server; it says OK, manifest at the manifest location, remove directory, and all that.

Now, about security in Hopsworks: all the services - the storage, the execution - use TLS certificates to talk to each other, and also for clients to talk to these services. This system is inspired by a Netflix system called BLESS, which was open-sourced a couple of years ago, and internally we use short-lived TLS certificates everywhere. For users to actually access the platform, you can log in with your username and password, or LDAP, with support for Kerberos - that is to enter the platform, and then everything is automated.

Now, about the machine learning part. So, we did Beam: we showed how to run your programs and how we set everything up. There is a very famous paper by Google on the hidden technical debt in machine learning systems. Of all this that you see here - I hope you can see it, because here I lost the monitor actually - the actual machine learning, the training, is this white box in the middle; all the rest is infrastructure and development related. You need to do all of it in order to do the central part properly, and there has been a lot of progress in the community lately.
There are different frameworks that help data scientists and engineers automate these processes and this whole pipeline, and one of the most popular ones is TensorFlow Extended, or TFX. As we see here - this is taken from TensorFlow's webpage - it provides a set of components, and with these components you can do data validation. That means: OK, I got my data, and at scale - I run this on Beam - I want to run a program over these terabytes of data; I want to get statistics on my data; I want to see whether I have a lot of nulls, or some values might be empty, things like these. Then you can run the data validation component. Then you can run TensorFlow Transform, so you can transform the values - if a feature is numerical, let's say, you can compact it into a different range - and then you actually go to the training part; TFX of course uses TensorFlow to do the training. After that, there is TensorFlow Model Analysis: you trained your model, but is it performing well, is it valid, is it correct, do I need to retrain, maybe I made a mistake? TensorFlow Model Analysis helps you with that, with these two components, Evaluator and ModelValidator. After you are sure that your model is correct, you can push it to TensorFlow Serving, so your clients can actually start making requests against the model.

Now, this is the pipeline of TFX - how does it actually run? TensorFlow Data Validation and Transform: you write a program in Python, of course, and then we utilize Beam on Flink to run it. The same goes for TensorFlow Model Analysis: you have all the data in the distributed file system, you have your Python code that uses the TensorFlow Model Analysis component, and then again Beam on Flink. For the training part in Hopsworks you can use whatever library, but here, in this case with TFX, you will use TensorFlow, and you can run it on GPUs: in Hopsworks you can do, for example, distributed training with GPUs, and the allocation of GPUs is managed automatically by the platform. And TensorFlow Serving, as I described before, uses Kubernetes with Docker to make it elastic. (A small sketch of the data validation step follows below.)
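As a concrete illustration of the data validation step, a minimal TensorFlow Data Validation sketch; the CSV path is a placeholder. Under the hood TFDV expresses this computation as a Beam pipeline, which is exactly what gets executed on Flink through the portability framework:

```python
# Minimal TensorFlow Data Validation sketch; "data/taxi_train.csv" is a
# placeholder path. TFDV builds a Beam pipeline under the hood.
import tensorflow_data_validation as tfdv

# Compute descriptive statistics over the training data at scale.
train_stats = tfdv.generate_statistics_from_csv("data/taxi_train.csv")

# Infer a schema (types, domains, presence) from those statistics.
schema = tfdv.infer_schema(train_stats)

# Validate the data against the schema to catch anomalies early.
anomalies = tfdv.validate_statistics(train_stats, schema)
tfdv.display_anomalies(anomalies)
```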
Now, when you do your training, you need to be able to track your experiments. That means you have your notebook, you have your data - you can get it from the feature store, or you can get it from this TFX pipeline after the transform part - and when you do your training you want to manage the metadata of your experiment, in case you want to go back in time and actually run it again. This is tracked by all the helper libraries that I described before, and shown in the UI in this manner. There is integration with TensorBoard: let's say you ran an experiment a week ago and you want to view it - we kind of bundle all the experiment artifacts, everything that you need to know about this particular experiment, the code, and you can have versioned data from the feature store, and you can open TensorBoard and view your past experiments.

Now, about orchestration: in Hopsworks we use Apache Airflow, and we provide it kind of as a service, which means that when you are in your workspace, everything is set up for you to actually use Airflow in a multi-tenant way. Multi-tenant means you cannot access services outside of your workspace; you can invite other members into your workspace, but you cannot manipulate other people's workspaces. The same goes for Airflow: it is part of your workspace, your project. Then, what you would typically do is write, or develop via the UI, your Airflow DAG, which in the end is a Python program - that is what it is - and here is an example of it. This is the DAG that I will show later in the demo to actually run the TFX pipeline. We have one notebook per TFX component, and in Airflow you schedule it: you say, OK, I want to run it periodically - per week, per day, at this time, and all these things; it is quite a powerful tool. This is how the pipeline looks in the tree view of Airflow, and then you have different tools to actually monitor the progress and everything. And this is the view where you have all the jobs: we call them jobs; each notebook is converted into a Python program and runs on Beam and Flink, so each notebook here corresponds to a part of our pipeline. That is one way to do it - there are many, many ways - but this is a basic example: you have compute statistics, infer schema, compare statistics, et cetera.

Now, to put it all together, there is this nice architecture. In the beginning we had kind of a skeleton, but now we have put in all the components that we described. So: we ingest data - on the platform you can use Apache Kafka, for example - and then you do your data preparation; in this example we go with Beam on the Flink runner, and the feature store or TFX Transform to transform your data, and then you can store it in the distributed file system. Then you continue with the training, actually the interesting part, and when you have the model you want to analyze it - again Beam, model analysis, on Flink - and then serving. The interesting thing in serving, what you can do for online serving, is that we track all the inference requests that go to model serving, and then you can have a streaming job, let's say, that gets all the inference requests in real time and monitors the performance of the model. So then you can decide: OK, I want to retrain, I got new data, or my model is not performing very well. And all of that is backed by this metadata store.

Now, the demo. Let's see... OK, can you see? OK. So, this instance is running on AWS right now; we will go here, "big things", and let's start. The example is based on the Chicago Taxi dataset - it is on the TensorFlow TFX GitHub, so we got it from there and then we developed the notebooks here. So if we open here... yes, let's first do a simple one, word count. This is an example of what you could do before TFX, something even simpler, a word count, let's say, on Beam. We have this "from hops import beam as hops_beam", then "hops_beam.start()" - here, for example, I wanted to override the default memory and set 8 gigs for it. We will go back and see that these are the jobs, this is the runner, and now it is initializing: it created the Flink cluster description for me, it submitted it on the resource manager, and it is starting to run. So it is running now; we can click here, and then hopefully... yes, this is the Flink dashboard. If we go back to our nice notebook, here we defined the word count workflow: we just import this stuff, and then here is the actual DoFn that we have in Beam, and here we provide the arguments. This now gets the default Beam pipeline options that come from Hopsworks, and you can override them - for example the SDK worker script, which is the actual script that invokes the worker Python process, and all the ports and everything. (A minimal version of such a word count is sketched below.)
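For reference, the classic Beam word count that this notebook is a variant of - a minimal sketch with placeholder input and output paths, not the exact notebook code:

```python
# Minimal Beam word count, a sketch of what the demo notebook runs;
# "resources/kinglear.txt" and the output prefix are placeholder paths.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read" >> beam.io.ReadFromText("resources/kinglear.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Count" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
     | "Write" >> beam.io.WriteToText("resources/wordcount_out"))
```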
So, what we should see here... yes, now the request went to the Flink cluster: with the Beam portability framework it was submitted on the Flink cluster, which is now computing a set of tasks. In total here it has to compute 15 tasks, and as we can see it has already done 7. It takes a bit of time, and because I also want to do the TFX example, I will cancel this so we can continue with TFX, which is more interesting. But just to show you the output in the end: because I ran it before, it should be in "resources" - here is the Shakespeare output. So we got Shakespeare text, we did some word count, and this is the output we would get if I let the Flink job finish.

Now, for the more interesting part. There are, let's say, two ways to run the notebooks here. There is one big notebook that runs the entire pipeline in an interactive way, and this is mostly done for development. We did this a while ago, but TFX also recently released their own interactive notebook. So here, what you do is again start the runner - the runner is already started, so it should be fast; it returns back the job server parameters, like the port, the host, where the log is. Then we do the imports, we download the input dataset, we clean up the previous run, these are our parameters, and then we start the actual pipeline. Here it computes the train stats, the data validation part, so let's do this... yes, train stats... hmm, a "name" option error - OK, let's run it again; I think I need to restart it. Yeah, OK. So that is the kind of interactive way; it takes quite a while to actually run, it is a lot of tasks. If we scroll down - let's keep it running and I can show you - then we have to infer the schema of the data that you have ingested, since you want to continue with the training later on; you check the eval data, not just the train data, for different errors. And then, if we scroll down, you freeze the schema: you say, OK, this is my data, now I am ready to transform it. Then you do this part with the preprocess inputs, and if we continue, we compute statistics over the transformed data - we need this later on in the pipeline when we also go to model analysis. Here we actually define the experiment, the training code, and here we submit it with the run-local-experiment call. And this is the part where you run TFMA to compute the metrics - you run TFMA, as we can see, as a Beam pipeline - then you read the data from the file system, or you might have it in memory (here it is an interactive notebook), and then you run the visualization. That takes a while; let's see if it actually... I have the output already from a previous run, I can show... a "name" undefined error - OK, never mind.

So let's go here, the last thing: the jobs. What we want to do: we have defined a set of jobs; each job is a notebook, and the notebooks are here in JupyterLab. If we go to "pipeline" we can see the notebooks - these notebooks are the same as in the interactive one, but we split them into different steps so we can schedule them with Airflow. So we can see here it is exactly the same, but we can go to Airflow: this is our DAG, and here you set up some properties, like the project name and all that, and then this is the actual DAG: I want to run compute statistics; after that I want to infer the schema; compare the statistics; freeze the schema; preprocess the inputs; transform the actual data; do the training; and then model analysis. (A hedged sketch of such a DAG follows below.)
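A hedged sketch of what such an Airflow DAG can look like, using vanilla Airflow operators; the DAG id, step names and the launch helper are illustrative stand-ins - in Hopsworks the steps would trigger platform jobs, for example via its REST API:

```python
# Hypothetical sketch of the demo's DAG structure using plain Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def launch_job(job_name):
    # Placeholder: in Hopsworks this would call the jobs REST endpoint.
    print(f"launching {job_name}")

with DAG("tfx_taxi_pipeline", start_date=datetime(2019, 1, 1),
         schedule_interval="@weekly") as dag:
    steps = ["compute_statistics", "infer_schema", "compare_statistics",
             "freeze_schema", "preprocess_inputs", "transform_data",
             "train_model", "model_analysis"]
    tasks = [PythonOperator(task_id=s,
                            python_callable=launch_job,
                            op_kwargs={"job_name": s}) for s in steps]
    # Chain the steps linearly: each notebook runs after the previous one.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```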
This is actually the lower-level API of TFX. You can also run a notebook - I can show you this one here - which uses the higher-level components of TFX, and with those you can actually build a pipeline not just with Airflow: you can run this pipeline with the BeamDagRunner that TFX provides. It is a nice way in TFX to skip the more complicated services like Airflow, if you want, and go just with the Beam runner; it will execute the DAG for you.

Now, let's go back to Airflow and let's trigger it - yes, I am sure. If we go to the graph view, it will start; as you can see, it started now. This is the DAG: compute statistics and everything. It started with the first one, and it will run all the jobs - each step is one job, one notebook, everything in the distributed file system. Now, this takes a while; let's see if something happens. So here I ran the DAG, and you see it already started running the first step of the pipeline, but we have the results already, so we do not need to wait. So here are the visualizations: we have this notebook to visualize the stats, we read the stats from the file system, so we can do this... "no such file" - OK, it is because I ran this example, which wipes the visualizations, kind of, so we would have to wait; let me see if I have other ones. OK, so that will continue, and you can also monitor the progress from here. That's it. If we go back, and datasets - let's see the last thing - "taxi data" is just the input data, and then, as the last thing, the resources: pipelines and resources, last modified. As the stats are being computed they are persisted here; this is the actual file that was computed.

OK, so let's go back to the presentation, and conclusions. Hopsworks 1.0 is the first on-prem, horizontally scalable platform to support the Beam and Flink portability framework, which is very important for machine learning because of the Python support it brings: you can really run Python on all these runners. Our future work is to finish the integration with the Spark runner, as we have support for Spark as well; to integrate the feature store with TFX, which would be a really nice thing to have; and, as a last thing, to export metrics of the Flink runner to InfluxDB and visualize with Grafana the different metrics that the actual execution engine exposes. Contributors, and how to get started: you can get a free account on Hopsworks to try it out on AWS, or you can deploy it on GCP as well. Thank you.