Introductions: I am Marceni Chernoff. You've probably seen that I started as a co-organizer of this meetup, together with Han Su — those of you who know Han, he unfortunately had to leave to pick up his kids. So Han and I are looking for speakers for the next meetup. We are also looking for the next venue sponsor, so if your company wishes to host this very honorable crowd, come talk to us for a few minutes; we're very happy to talk to you. I'm from Databricks; I started three months ago. We are actively hiring, so if you're looking for additional challenges in the AI and big data space, let me know and I will introduce you to people, and we can start talking about that. I have been based in Singapore since 2011. I live in Simei, where I'm a grassroots organization member. I also cycle, and I know a few people came from NTU today, so I'm by chance sometimes dropping by. Originally I'm from Russia, from St. Petersburg. There are only two seasons in St. Petersburg — this applies if you ever plan to go to Russia: there is July, and there is winter. So go in July. That is the one question everyone asks me: "I wish to go to Russia." Go in July, not any other time. And I'm a very, very big fan of kaya toast — as a Singaporean at heart, I really enjoy the cuisine here. I'm currently in the partner solutions architecture role at Databricks and I cover the entire region here. My degree is in engineering — a master's degree in radio links and television. I also worked with turbo codes; those are like a parallel branch to deep learning, networks dedicated to working at low signal-to-noise ratios. I previously worked for Standard Chartered Bank in a cloud and DevOps environment. Before that I was with EMC Isilon — this is how I got to meet Spark for the first time, working with structured data and Hadoop — and before that with a few other vendors.
Now I wanted to set the stage for this talk with this slide, which comes from a paper Google published in 2015 called "Hidden Technical Debt in Machine Learning Systems." You see the greenish dot in the middle that says "ML code." This is the proportion of time Google's engineers in 2015 — and I think still today — spend on developing the machine learning code, the deep learning code, versus everything else. What are the other things? What are the largest boxes? Take a look: data collection, serving infrastructure, monitoring, process management, tooling. So the actual black box of the hyped words — artificial intelligence, deep learning, ML — that's a fraction of the time. Most of the time, Google themselves spend preparing the data, munging the data, merging the data. That's the reality of today. So the current state of things is that we live in a very siloed world, and you've seen it in William's presentation, the previous speaker, as well. There were multiple systems involved in data preparation; they had, I believe, a predominantly Google Cloud Platform based solution. But you may pick any of the databases, any of the storage systems as an example: they're not really closely connected with the deep learning libraries. The deep learning libraries that are able to create the valuable outputs for the business do not connect directly to those sources. So there are different types of systems, with silos in between them. A good example is TensorFlow, which only recently started to connect to external sources; we'll speak about that in a minute. But the technology also impacts the way that we work: by having those silos, we then organize different teams, and when we organize different teams, the teams have to start to communicate with one another in some standard manner.
So they create procedures, and procedures gradually slow businesses down. With more and more procedures, businesses have to go through more and more approvals, reviews and exceptions, so the extraction of value from the data becomes slower and slower and slower. The handoffs between the teams on this slide are usually JIRAs. And remember the slide that William presented — I really liked it — with the Twitter quote saying that it only takes several hours to create the model and several months to actually deploy it. Because you need to talk to people, create JIRAs, approve, sign off, and then do the pull requests and actually commit it to master. Also, by having more and more silos, you lose the ability to analyze changes and track differences. There were a lot of questions today like: how do you see what changed in the model compared to what changed in the data? How do you track all that? Well, with multiple systems, multiple silos, multiple teams, you completely lose this visibility — and you definitely have different infrastructure. With only the five systems I counted in the previous speaker's presentation, you can still kind of live. But in production — for example, on-premise or heterogeneous environments — you may end up with 15 or 20 of them. Meanwhile the data keeps growing, and the variability of the data grows in kinds, velocities and types. The idea behind Project Hydrogen, which landed in Apache Spark version 2.4, is to start unifying these two worlds together — to start breaking down the silos between big data processing and machine learning, with all the connectors that Spark has: 50-plus quite stable and fast connectors to different storage types, including streaming and batch.
You can even connect to Elasticsearch — someone asked me earlier in the day, can we use Elasticsearch? Yes, you can actually connect Spark to Elastic and query it as you wish, with all the capabilities that Spark already has. There are just a few things that needed to be solved to unify this into a total platform capable of doing both distributed learning AI and the data prep. Let me skip ahead a little bit. This is the history of how a lot of projects that we're all acquainted with emerged and developed over the last several years. On the left-hand side: where Spark started was MapReduce in memory — just speeding up MapReduce, avoiding usage of the disk, and being able to process large sets of data in a distributed fashion really quickly. It started with the RDD API. It was a very, very new API, and no one really knew at first how to quickly write code for it. Then Spark started to adopt more libraries and APIs to simplify its usage. First, the APIs in Python, Java and R appeared. Then there was a new concept called the DataFrame, and it made things very easy: it was the unification, the single common denominator that allowed people with different skills from different backgrounds, working in one team, to use the same execution platform. A person with a Python, Java or R background is now able to use Spark, and they can teach one another and share skills. That gradually went into the production shipping code, the one you probably use right now. It was then extended by Structured Streaming.
One other interface available in Spark from very early days was the SQL interface. You can query Spark just as if you would query a regular ANSI SQL database. And then on top of SQL, the Structured Streaming API appeared, which takes the same approach but applies it to micro-batches: instead of querying huge data sets, you split the stream into small, tiny pieces and look at each of them within, say, a one-second micro-batch interval. This is how you work with streaming data in Spark. At the same time, in the AI and ML community, there was the same kind of development. First there were foundational projects like pandas and NumPy. On top of those, performance optimization projects appeared, like XGBoost and glmnet. Then there was a fundamental big development by the team behind scikit-learn. And slowly, gradually, we arrived at TensorFlow's beginnings. TensorFlow was a little too low-level for many data scientists, so Keras started to simplify the way TensorFlow models are run: it allowed you to create the layers of a neural network in a much easier, much simpler way, while using TensorFlow behind the scenes. So you can see it's all about converging. And there are now side projects in TensorFlow to access data directly: tf.data and tf.Transform. It all converges towards simplification and ease of use. And what's next — the big double question mark — we don't know; time will tell. But it looks like it all converges into something that would be the best of the two worlds together, and the communities are starting to speak to each other. That's something we're observing.
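To make those two interfaces concrete, here is a minimal PySpark sketch of the batch SQL interface and a micro-batched stream. This is my sketch, not a slide from the talk: the broker address and topic name are placeholders, and the streaming half needs the spark-sql-kafka connector on the cluster to actually run.

```python
# Sketch only: broker and topic are placeholders; the streaming part
# requires the spark-sql-kafka connector package on the cluster.

def kafka_options(servers, topic):
    """Build the option map for Spark's Kafka source."""
    return {"kafka.bootstrap.servers": servers, "subscribe": topic}

def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-and-streaming").getOrCreate()

    # 1) Batch: query Spark as if it were a regular ANSI SQL database.
    spark.range(100).createOrReplaceTempView("events")
    spark.sql("SELECT count(*) AS n FROM events WHERE id % 2 = 0").show()

    # 2) Streaming: the same relational operations, replayed per micro-batch.
    stream = (spark.readStream.format("kafka")
              .options(**kafka_options("broker:9092", "events"))  # placeholders
              .load())
    (stream.groupBy("key").count()
     .writeStream.outputMode("complete").format("console")
     .trigger(processingTime="1 second")  # the micro-batch interval
     .start().awaitTermination())

if __name__ == "__main__":
    main()
```

The point of the sketch is that the streaming query is written with exactly the same relational operations as the batch one; only the trigger turns it into a sequence of one-second micro-batches.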
The reason these communities need to speak is that, as you know, Spark is very popular for big data processing, for ETL — very distributed, we're speaking about hundreds of thousands of cores participating in some jobs that large Databricks customers run. We speak about processing large amounts of data, which is very beneficial for DL, for distributed learning. This picture is actually coming from a research presentation from Stanford, and you can see that the effort you put into feeding your neural net with more and more data actually pays off in the performance of the model. The more data you can afford to pass to the model, the more it pays off in the model's performance and the model's accuracy. There have been a lot of projects trying to blend the two, and they are not entirely unique to the Apache Spark community. For example, Intel had BigDL, which tried to do distributed deep learning using Intel CPUs and leverage them better. There were several projects from Yahoo — CaffeOnSpark, TensorFlowOnSpark — and you may actually find a lot of semi-orphaned packages on spark-packages.org trying to do the same thing, for example running TensorFlow around Spark in a distributed manner, where people basically gave up after the first several commits. So definitely take a look at all these projects, if you have an interest in how they developed and where they stopped. But what we're really interested in is the use cases to be solved, the user stories, and they come from a data scientist's perspective.
A data scientist wants to build a pipeline that takes training events from a production data warehouse — where the downstream data lake ends up — and trains the model in parallel, and also applies the already trained model to extract value from the same production data: basically, starts inferring in parallel on the distributed event stream that is already running there. You've seen in the previous speaker's presentation a whole deployment on top of Kubernetes that they have to work with, starting from a completely different source of data, working with different siloed systems — it all ends up in Kubernetes in the end. That is an example of how segregated the different infrastructure layers in that environment are. And the users want simplification: the user story is to have one system that allows you to do both of these things. When we speak about distributed training, this is a reinforcement of the same user story. We want to load the data from multiple different data sources, and Apache Spark is one of the best solutions for this user story. At the same time, we want to fit the model, train it, and do it in a distributed way. There are multiple frameworks using GPUs right now — for example, Horovod for distributed TensorFlow — and they give us the model that we want to productionize. And when we have a productionized model, what we want is a live stream of data that we connect the model to. But as I mentioned, there is a disconnect: for example, Horovod does not allow you to read directly from Kafka.
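As a rough illustration of the training half of this user story, here is a hedged sketch of a Horovod-with-Keras training script. The two-layer model, the learning-rate scaling and the single epoch are my illustrative choices, not the exact pipeline described; it requires TensorFlow and Horovod, and is normally launched across hosts with mpirun (or HorovodRunner on Databricks).

```python
# Sketch only: model shape, learning-rate convention and epoch count are
# illustrative; requires TensorFlow + Horovod, launched via MPI.

def scaled_lr(base_lr, num_workers):
    """Common Horovod convention: scale the learning rate by worker count."""
    return base_lr * num_workers

def train():
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # rank/size come from the MPI launch environment

    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    # Wrap the optimizer so gradients are averaged across all workers.
    opt = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(scaled_lr(0.01, hvd.size())))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    # Start all workers from identical weights (broadcast from rank 0).
    model.fit(x, y, batch_size=64, epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])

if __name__ == "__main__":
    train()
```

Note what is missing: the script assumes the data is already on every host — there is no native connection to Kafka or to the warehouse, which is exactly the disconnect being described.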
So you have to create some sort of interim sink, some sort of interim load step, and orchestrate this connectivity with scripting, additional engineering solutions, CI/CD pipelines that are not essential to extracting your features. The result is the glue code that we also saw in the previous presentation. Quite a lot of orchestrating, non-value-adding projects emerge, and in larger organizations they are sometimes maintained by the same data engineering team, sometimes outsourced — but that connection is right now not realized natively in one platform. The reason we were not able to realize both working with big data and distributed training on Spark before version 2.4 is simply the way tasks are scheduled by Spark's native scheduler. Spark is known for its so-called embarrassingly parallel task scheduling: when we have a big data set, we partition it into multiple partitions, and we let each individual compute unit of the Spark cluster work on its own portion. If a particular worker — they're called workers — fails for some reason, its task gets restarted, and that does not impact the other tasks. That's the idea behind embarrassingly parallel scheduling on top of Apache Spark. Now, what is needed in distributed training is a little bit different. To illustrate a task failure in our first system, the embarrassingly parallel one: we have a green task that crashed, and it didn't impact anything, because the system was designed from day one to tolerate the failure of individual partitions.
In a distributed training system, you actually have to restart all the tasks together, because they rely on one another significantly — to communicate their progress, to shuffle data between them, to update each other on the stages of the job. You cannot restart only one task in a distributed TensorFlow system, for example; you need to restart all of them. And that implies that we need a different type of scheduling. The solution is something introduced in Apache Spark 2.4, several months ago, called barrier execution mode. If you go to the Apache Spark homepage, you can always find the release notes, and this is the top feature of the release: it says that this release adds a barrier execution mode for better integration with deep learning frameworks, and introduces a number of built-in higher-order functions. But not only that — there is also an improvement in Spark 2.4 that allows user-defined functions to run in a much faster way, labeled under the epic called optimized data exchange. And last but not least — right now only available in Spark's own standalone scheduler — there is a new way to communicate the capabilities of the workers, the compute elements in the Spark cluster, so that a task you schedule on the cluster is aware of the underlying hardware: aware, for example, that there is a GPU of a certain type, and so forth. These incremental improvements in 2.4, as a group, are sometimes referred to as Project Hydrogen. I will demonstrate a little bit more of the barrier execution mode, but the idea is that in scheduling theory there is something called gang scheduling.
Gang scheduling, gang execution, is like this: either the whole gang — you know, like adversaries going off to do something really criminal together — goes, or no one does. That's the idea behind the barrier execution mode. For the realization, you can take a deeper read of the epic; there's a big JIRA that went into the release notes. The idea is to allow all the tasks in Spark, from some point, to start simultaneously, wait for all of them to complete, and then get back to the embarrassingly parallel, regular Spark mentality. From some point onwards you just say: from here, start all tasks together, watch how this set of procedures executes, and once they're all completed, come back to regular scheduling. It also lets you establish a context so that other jobs are aware of what's actually scheduled inside this barrier. I think it will become a little clearer when we get to the demo. The API for barrier mode is so far realized in Scala and Python — there is nothing like it in SQL, so if you want to call a user-defined function or something like that across the data set in this mode, you're still bound to Scala and Python for now. Inside the barrier for a particular RDD, you have some additional functions you can call: for example, you can ask for the IP addresses of the workers currently executing the tasks in parallel — something like creating a cluster within a cluster that only does the distributed gang execution job. One important example where this matters is MPI.
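A minimal sketch of this barrier API, as it appears in PySpark 2.4 — the actual training step is elided, and the two-partition split mirrors the demo later in the talk:

```python
# Sketch of PySpark 2.4's barrier execution mode: all tasks in the
# stage start together and are restarted together on failure.

def hosts_from_addresses(addresses):
    """Strip ports from the 'host:port' addresses the barrier context reports."""
    return [a.split(":")[0] for a in addresses]

def run_barrier_stage():
    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    sc = (SparkSession.builder.appName("barrier-demo")
          .getOrCreate().sparkContext)

    def train_partition(iterator):
        ctx = BarrierTaskContext.get()
        # Addresses of *all* tasks in this stage -- the "cluster inside
        # the cluster" you would hand to an MPI/TensorFlow launcher.
        hosts = hosts_from_addresses(
            [info.address for info in ctx.getTaskInfos()])
        ctx.barrier()  # block until every task has reached this point
        # ...run one distributed training step against `hosts` here...
        yield (ctx.partitionId(), hosts)

    rdd = sc.parallelize(range(8), numSlices=2)
    # rdd.barrier() marks the stage: tasks launch simultaneously.
    return rdd.barrier().mapPartitions(train_partition).collect()

if __name__ == "__main__":
    print(run_barrier_stage())
```

The two pieces that matter are `rdd.barrier()`, which switches this one stage into gang scheduling, and `BarrierTaskContext`, which gives every task the addresses of its peers and a synchronization point.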
MPI is how you start distributed learning in TensorFlow, for example: you have one single master host, and you designate it — it may not be aligned with the regular master host, the so-called driver, in Spark; it could actually be quite different. To be able to execute the TensorFlow job, you need to know, for example, the IP addresses of the hosts participating in the distributed TensorFlow job. So you are able to pull the information about that cluster-inside-the-cluster through the context within the barrier, and then launch TensorFlow in the usual MPI way. So far, as I mentioned, this is only in the standalone scheduler, but you may see there are already several tickets, several initiatives, to talk to the other schedulers Spark runs on — YARN, Kubernetes, Mesos — to allow this information to be cascaded back into the Spark environment and create parity in Project Hydrogen. Speaking about optimized data exchange: you have data in a particular Spark type, usually represented by the DataFrame API. You read the data with all those 50 connectors into the DataFrame API, and it's very convenient to work with — there are very simple functions to change the data, to expand it, to enrich it, to extract features out of it. But most deep learning frameworks operate on different data types, usually native to Python; they are not aware of the DataFrame format. One thing you could certainly do is just write it out, wait for the write operation to complete, and then read it back — which is a waste of time and energy, because Spark operates very fast and is optimized to work completely in memory without touching the disk.
So that's not a good idea. The usual approach is to apply the transformation to a particular piece of data — for example, one row out of a huge set — and pass it over to Python via a call to a user-defined function. Calls to TensorFlow from Spark, for example, are also done against the data using user-defined functions. UDFs were introduced quite a few years ago — I think over two years ago, around Spark 2.2 — and the problem with them is the latency of taking a single row of a huge data set, converting it, working on it in the Python world, and then returning it to Spark, to then, for example, write it out with all the possibilities Spark has, or act on the data as if it were streaming data. When we speak about latency, the measurement that was done estimated that 90% to 98% of the time goes to the exchange of data formats versus the actual execution of the Python code. That is the cost of the difference between the data types inside Spark natively and the world of Python, where most of the distributed learning libraries operate. Project Hydrogen's optimization for this problem is to introduce vectorized data exchange. The idea is to send a column — similar to a column in a database — so the user-defined function is executed against a set of columns. Why not send an entire column of the data set, which has already been prepared, featurized, and sits in the memory of the Apache Spark cluster, into the Python UDF, wait for it to return, and merge it back? The behind-the-scenes realization of this initiative was done using Arrow.
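Side by side, the row-at-a-time UDF and the vectorized pandas UDF look like this sketch — the column name and the plus-one function are illustrative, echoing the benchmark lambdas mentioned later:

```python
# Sketch: the same plus-one logic as a row-at-a-time UDF and as a
# vectorized pandas UDF (whole columns travel to Python via Arrow).

def plus_one(v):
    """Works element-wise on a whole pandas Series (and on plain numbers)."""
    return v + 1

def run_vectorized():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, udf, PandasUDFType

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "x")

    slow = udf(plus_one, "long")         # one Python call per row
    fast = pandas_udf(plus_one, "long",  # one Python call per Arrow batch
                      PandasUDFType.SCALAR)
    return df.select(slow("x").alias("slow"), fast("x").alias("fast"))

if __name__ == "__main__":
    run_vectorized().show(5)
```

The Python function is identical in both cases; only the wrapper changes how the data crosses the Spark-to-Python boundary.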
The community behind Apache Arrow actually already has a lot of different transformations, and they established a framework for other Apache projects to start using it. So the Spark project partnered with them and leveraged the transformations Arrow has behind the scenes to speed things up. When you call toPandas or use a pandas user-defined function, Spark handles the transformation for you in that columnar manner, so you're not wasting time — and when I say Spark, I mean Spark 2.4: Project Hydrogen handles it for you transparently, all behind the scenes. Reusing the protocols already established by the Arrow community, plus some additional engineering done by Apache Spark committers, allowed the calls to user-defined functions to improve something like 200- or 300-fold. You can see the comparison between a row-at-a-time call to a user-defined function versus a call to a pandas UDF in the vectorized, columnar format. Different lambda functions were used — like plus one, or subtracting the mean of a single column, for example. So now I want to switch to a very quick demonstration of the barrier mode; I think it's one of the very important aspects of 2.4. I have a couple of notebooks here, and I also want to show you guys — is this UI familiar to you? Can you raise your hands, who's seen the Databricks UI before? I know this is an Apache Spark meetup, so — who's seen it? Yay, quite a number of people. That's pretty cool, thank you. For those that didn't raise a hand: at Databricks, we ship the Unified Analytics Platform, which only works in the cloud — you cannot install Databricks on-premise. The idea behind it is a collaborative notebook space, a very, very powerful way for people to work on code together.
Behind the scenes there is its own cloud-native, cloud-optimized scheduler and the Apache Spark core, with additional optimizations to work with connectors — native Microsoft Azure storage types, native AWS storage types — and additional performance optimizations. That's a ten-second explanation of what Databricks is, for those not aware. So you see the notebooks here. When you go into the notebooks, you have to attach them, as usual — like in Zeppelin notebooks; can you raise your hands? Yeah? OK, some of you use it. You attach the notebooks to the cluster so you can execute jobs — you connect the code to execution on top of a particular cluster by attaching them. So I have a cluster here — at this point, this is a native AWS environment — and it has four worker nodes; each one has one GPU and 60 gigabytes of memory. The notebooks attached to it — let's go into the first one, distributed training — execute some Spark code to do distributed learning using Horovod. This is the regular MNIST example. Anyone remember MNIST? The digit recognition samples — sorry, it's the first time I'm at this meetup, so I'm trying to gauge the audience's awareness. That's good. The idea is we have a set of digits that have already been captured as pixels: like in a picture, you have every black-and-white dot, and you can create a long list of rows in a very large database to represent the graphical data set. Then you can pass it to Horovod to train against the label set and then output the predictions. So we have a table already. Let me show you the table with the digits, so you trust me and don't just take my word for granted. If we go into the Horovod hydrogen MNIST table, this is what every individual digit looks like: the zeros are the black dots, and then there are the other shades of gray.
So if we start the job and train the Horovod model using all four nodes, we can just run them all. What happens here: we read the entire table into a Spark DataFrame, and that DataFrame is then repartitioned into two partitions. In Spark, you take a huge data set and partition it into smaller chunks; each chunk is the actual scope of work for an individual Spark executor, the compute unit. So right now, I expect two executors to actually work on the data set. I write it out into an intermediate format here, and then I execute the function against that data set, working on two hosts. So let's see — we need to take a look at this stage. We have job stage one, which just kicked off; I'm waiting for the second stage. It might take a little longer, but that doesn't matter: what I want to do in the meantime is use the same cluster for something else in parallel. I'm not using any barrier execution mode here; I'm just trying to infer with the model I already have, and it's hidden in one of these functions, predict using GPU. I want to infer the model against a live stream of data — in this case simulated — against the column with the features, and create another column in the same DataFrame called prediction: what this handwritten digit actually means. Then I'll call an action to display it. What I'm ideally looking for is a two-column table with the originally seen label and, based on the features column, the prediction that the real-time machine learning algorithm assigns to it. So if I kick it off by running it all — and here in these utils, all of these smaller functions are hidden — I'm expecting the stream to actually fail. Let's see. And the reason for that is that I'm trying to schedule over four — there are four nodes in this cluster, four GPUs.
And as I'm trying to schedule — ah, this took a little bit longer than I anticipated; the training is complete. I've run out of time on it, I guess. Let me actually stop this job and just rerun it for you for a second — that happens sometimes when you're doing a live demo, and there's a real cluster behind it. I think what happened is the previous job finished much earlier than I anticipated in the GPU cycle. So I'll clear it all and restart it in a second — don't just take my word for it. The idea is that the job, as I anticipated, has to fail here, because it would try to use all the GPUs available in this cluster — there are four of them — to run this test, and two of them would already be occupied running the previous notebook. So let me clear it all up. There is a very helpful function here, a kill-all, just in case there is something still running on the cluster. Yeah, so let's kick this job off once more: doing the prep work, reading in the table with the digits, repartitioning it, writing it into Parquet format and then reading the same Parquet data back with the — oh yeah, there we go. I think what happened before is just that I was a little too quick. So you can see the usual startup sequence of TensorFlow here; it has a lot of details inside that we will not go into, but basically right now I'm training the model on the four-node cluster. And if we look at the event timeline here and enable zooming — these are the two workers that were selected; they run the two partitions, repartitioned into two, against the GPUs.
So if we go back to the model inference without barrier mode execution, and we just say: run on all available nodes, without any awareness of the barrier mode, run this code and infer the existing model against the live stream of data — what it will try to do is infer also on the nodes where the GPU is already occupied; it tries to schedule four tasks, and at this point it should fail. I want it to fail. Yay — this is the opposite of the usual demo expectation, where you really don't want things to fail. It fails with the anticipated GPU-occupied error. The idea is that we're actually not able to complete the task Spark wants, because under the hood the GPU is already busy. So let's stop it all, go back to this job, clear it all, and clear any possibly running MPI jobs by killing them all. Now, if we repeat the same experiment with the barrier mode, the only thing that was added is this: it says, repartition the same data set into two, but create a barrier for these tasks — and distribute the information that you are using these two hosts for the distributed deep learning scenario to anyone who is interested in scheduling jobs in parallel. So it's a cluster inside the cluster. Let's run this and take a look at the execution. You can see that because there are two partitions in the training set, right now there are two tasks going on. And if I run the model inference that is aware of the barrier context, I actually don't do anything — Spark does it for me. Once I created the barrier for the two tasks in my previous parallel job, the other jobs that I schedule on this cluster will be aware of which nodes are currently occupying the GPUs. So when I'm using the different notebook, in this case, I'm still doing the same thing.
I'm reading the stream, and for every micro-batch I read, I'm inferring with the UDF. So now I don't want it to fail. You can see in this notebook that executors number two and number eight — and number nine and number ten — are different executors that were picked up. So although, in Spark's mind, we had all the nodes available for scheduling, we only scheduled two, because we were aware that there is a barrier concurrently doing something on the GPU-enabled hosts. That's the idea of the barrier mode. And it just appeared — it's been like two months; they released 2.4 in November. The first talks about it — and I'm not the author of this presentation — were actually by the committers, in late summer. So I just wanted to introduce this concept of barrier execution to you. This is the beginning of a very big progression, because you're now able to mix and match the power of Spark with the nature of distributed deep learning. And the last bit is accelerator-aware scheduling. It's still in discussion — you can see it's marked as pending vote here. The idea behind it is that you can specify a particular type of GPU, or certain capabilities of the host, for the actions you execute on the Spark cluster, and it will execute those tasks only on the hosts that conform to the criteria. But again, it's scheduling, so it's very dependent on the scheduler: there is work going on to talk to the Mesos community, the Kubernetes community and the YARN community to actually make it happen, to bring this awareness back into Spark. As for the timeline: right now we have the barrier execution mode, and it's available straight away — you can also play around with it on the Databricks platform.
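For a sense of what "the task is aware of the underlying hardware" looks like in practice, here is a sketch of the kind of resource configuration being proposed. These property names are not from the pending vote being described; they are the form the feature later took when it shipped in Spark 3.0, so treat them as an after-the-fact illustration:

```properties
# Ask for one GPU per executor and one GPU per task (Spark 3.0 syntax).
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=1
# Script on each worker that reports the addresses of its GPUs.
spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh
```

With settings like these, the scheduler only places a task on a host whose discovered resources satisfy the request, which is exactly the criteria-matching described above.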
The next big step will be to integrate more and more functions into the barrier execution mode to make it easier, for example, to pull out the additional context information. And then, going forward in Spark 3.0, there will be the optimized data exchange and the completion of accelerator-aware scheduling. So that's the nature of Project Hydrogen. It shows you why we are actually rebranding this from just an Apache Spark meetup to a Spark and AI meetup: it's now becoming a single unified platform that merges the best of the two worlds together. I know I was running a little short on time — we were actually supposed to end five minutes ago — but still, if you guys have any questions, please raise your hand; I'll be happy to address those.