Introductions: I am Marceni Chernoff. You've probably seen that I started as a co-organizer of this meetup, together with Han Su; those of you who know Han, he unfortunately had to leave to pick up his kids. Han and I are looking for speakers for the next meetup, and we are also looking for the next venue sponsor, so if your company would like to host this very honorable crowd, come talk to us for a few minutes afterwards. I'm from Databricks; I started three months ago, and we are actively hiring, so if you're looking for new challenges in the AI and big data space, let me know and I'll introduce you to people so we can start talking about that. I've been based in Singapore since 2011. I live in Simei and I'm a grassroots organization member there. I also cycle around, and I know a few people came from NTU today, so I sometimes drop by there by chance. Originally I'm from Russia, from St. Petersburg. There are only two seasons in St. Petersburg, and if you ever plan to go to Russia this applies there too: there is July, and there is winter. So go in July. That's the one question everyone asks me, "I want to go to Russia", and the answer is: go in July, not any other time. I'm also a very, very big fan of kaya toast; like a Singaporean at heart, I really enjoy the cuisine here. I'm currently in the Partner Solutions Architect role at Databricks and I cover the entire region. My degree is in engineering, a master's degree in radio links and television, and I also worked with turbo codes, the error-correction codes dedicated to operating at low signal-to-noise ratios, in some ways a parallel cousin of deep learning networks. I previously worked for Standard Chartered Bank in their cloud and demo environment, and before that I was with EMC Isilon, which is how I met Spark for the first time, working with structured data and Hadoop, and before that with a few other vendors. Now, to set the stage for this talk, I'm showing you a slide from the paper Google published in 2015, called "Hidden Technical Debt in Machine Learning Systems". You see the small greenish box in the middle labelled "ML code". That is the proportion of time Google's engineers in 2015, and I think still today, spend on developing the machine learning code, the deep learning code, versus everything else. What is everything else? Look at the largest boxes: data collection, serving infrastructure, monitoring, process management tools. So the actual buzzword part, the artificial intelligence, deep learning, ML, is a fraction of the time. Most of the time, even at Google, goes into preparing the data, munging the data, merging the data. That's the reality today. And the current state of things is that we live in a very siloed world; you saw it in the previous speaker's presentation as well. There were multiple systems involved in data preparation; I believe they had a predominantly Google Cloud Platform based solution, as the previous speaker shared, but you can pick any database, any storage system as an example: they are not directly connected to the deep learning libraries. The deep learning libraries that can create valuable outputs for the business do not connect directly to those sources. So there are different types of systems, and there are silos in between them.
A good example is TensorFlow, which only recently started to connect to external sources; we'll talk about that in a minute. The technology also shapes the way we work. Because of those silos, we organize separate teams, and once we have separate teams, the teams have to communicate with one another in some standard manner, so they create procedures. Procedures gradually slow businesses down, and with more and more procedures, businesses need more and more approvals, reviews and exceptions, so extracting value from the data becomes slower and slower. The handoffs between the teams on this slide usually happen through Jira tickets. And remember the slide William presented, the one I really liked, with the Twitter quote: it only takes a few hours to create the model and several months to actually deploy it, because you need to talk to people, create Jira tickets, get approvals and sign-offs, and only then do the pull requests and commit to master. Also, with more and more silos you lose the ability to analyze changes and track differences. There were a lot of questions today along the lines of: how do you see what changed in the model compared to what changed in the data, how do you track all that? With multiple systems, multiple silos, multiple teams, you completely lose that visibility, and you definitely end up with different infrastructure everywhere. With only the five systems I counted in the previous speaker's presentation you can still kind of live, but in production, for example in on-premise or heterogeneous environments, you may end up with 15 or 20 of them. Meanwhile the data keeps growing, and the variety of data grows in kinds, velocities and types. The idea behind Project Hydrogen, whose first pieces shipped in Apache Spark 2.4, is to start unifying these two worlds, to start breaking down the silos between big data processing and machine learning. Spark already has 50-plus quite stable and fast connectors to different storage types, both streaming and batch. You can even connect to Elasticsearch: someone asked me earlier today whether you can use Elasticsearch, and yes, you can connect Spark to Elasticsearch and query it however you wish (a minimal sketch of that follows right after this paragraph). With all the capabilities Spark already has, there were just a few things left to solve to turn it into a platform capable of doing distributed learning, AI, and the data prep together. There are a couple of slides here that I'll move through quickly; I think this is the history of how a lot of projects we're all acquainted with emerged and developed over the last several years. On the left-hand side you can see where the Spark community started: Spark began as MapReduce in memory, just speeding MapReduce up by avoiding the disk and being able to process large data sets in a distributed fashion really quickly. It started with the RDD API, which was a very new API, and at first nobody really knew how to write code for it quickly. Then it started to adopt more libraries and APIs to simplify the usage of Spark, so first the APIs in Python, Java and R appeared.
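As an aside on that Elasticsearch point, here is roughly what reading an index into a Spark DataFrame looks like with the elasticsearch-hadoop connector; this is a minimal sketch that assumes the connector jar is available on the cluster, and the host, port and index name are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read an Elasticsearch index as a DataFrame via the elasticsearch-hadoop connector.
    # "localhost", "9200" and "my_index" are placeholders for a real cluster and index.
    df = (spark.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "localhost")
          .option("es.port", "9200")
          .load("my_index"))

    df.printSchema()                           # the index mapping becomes a regular Spark schema
    df.createOrReplaceTempView("es_docs")      # and can then be queried with plain SQL
    spark.sql("SELECT COUNT(*) FROM es_docs").show()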
Then a new concept appeared called the DataFrame, and it made things very easy: it was the unifying piece, the single common denominator that let people with different skills and backgrounds work in one team on the same execution platform. A person with a Python background, a Java background or an R background can now use Spark, teach one another, and share the skills. In Spark's history that gradually went into the production-ready code that ships today, the one you probably use right now. It was then extended by Structured Streaming. Another interface that was available in Spark from very early days is the SQL interface: you can query Spark just as you would query a regular ANSI-SQL database. On top of that, the Structured Streaming API appeared, which applies the same approach to micro-batches: instead of querying one huge data set, you split the stream into small, tiny pieces and process each within, say, a one-second batch interval, the micro-batching interval. That is how you work with streaming data in Spark (a minimal sketch of this follows right after this paragraph). At the same time, the AI and ML community went through the same kind of development. First there were foundational projects like Pandas and NumPy; on top of those, performance-oriented projects appeared, like XGBoost and glmnet. Then there was a fundamental, big development by the team behind scikit-learn, and slowly, gradually, we arrived at TensorFlow's beginnings. TensorFlow was a little too low-level for many data scientists, so Keras started to simplify the way TensorFlow models are built: it lets you create the layers of a neural network in a much easier, much simpler way, while still using TensorFlow behind the scenes. So you can see it's all about converging, and there are now side projects within TensorFlow, tf.data and tf.transform, that deal with accessing and preparing the data. It all converges towards simplification and ease of use. And what's next, the big double question mark? We don't know, and time is short, but it looks like it all converges into something that takes the best of the two worlds together, and the communities are starting to talk to each other; that's something we're observing. The reason these communities need to talk is that, as you know, Spark is very popular for big data processing, for ETL, for very distributed work; we're speaking about hundreds of thousands of cores participating in some of the jobs that the largest Databricks customers run. We're speaking about processing large amounts of data, which is very beneficial for DL, for distributed learning. This picture comes from a research presentation from Stanford, and you can see that the effort you put into feeding your neural net with more and more data actually pays off: the more data you can afford to train on, the better the model's overall performance and accuracy. So there have been a lot of projects trying to blend the two, and they are not entirely unique to the Apache Spark community.
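To make the micro-batch idea concrete, here is a minimal Structured Streaming sketch; it uses the built-in "rate" source as a stand-in for a real stream, and the one-second trigger is just an illustrative choice:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The "rate" source generates (timestamp, value) rows and stands in for a real stream.
    events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

    # Same DataFrame API as batch: group the unbounded stream and count per bucket.
    counts = events.groupBy((events["value"] % 10).alias("bucket")).count()

    # Each micro-batch is processed on the chosen trigger interval and printed to the console.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .trigger(processingTime="1 second")
             .start())
    # query.awaitTermination()  # block until the stream is stopped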
For example, Intel had BigDL, which tried to do distributed deep learning on Intel CPUs and leverage them better. There were several projects from Yahoo, CaffeOnSpark and TensorFlowOnSpark, and you can find a lot of semi-orphaned packages on spark-packages.org that try to do the same things, for example run TensorFlow on Spark in a distributed manner, where people basically gave up after the first several commits. Definitely take a look at those projects if you're interested in how they developed and where they stopped. But what we're really interested in is two use cases, two user stories, and they come from a data scientist's perspective. A data scientist wants to build a pipeline that takes training events from a production data warehouse, where the downstream data lake ends up, and trains a model in parallel; and also wants to apply that already-trained model to extract value from the same production data, basically inferring in parallel against the distributed event stream that is already running. You saw in the previous speaker's presentation that there's a whole deployment on top of Kubernetes they have to work with: starting from a completely different data source, working through different siloed systems, it all ends up in Kubernetes in the end. That's an example of how segregated the infrastructure layers in such an environment are. The user wants simplification, so the user story is to have one system that can do both of these things. When we speak about distributed training, it's a reinforcement of the same user story: we want to load data of many different types, and Apache Spark is one of the best solutions for that part. Then we want to fit and train the model in a distributed way, and there are multiple frameworks using GPUs for that, for example Horovod for distributed TensorFlow. They give us the model that we want to productionize, and once we have a productionized model, we want to connect it to a live stream of data. But as I mentioned, there is a disconnect: Horovod, for example, does not let you read directly from Kafka. So you have to create some sort of interim sink, some interim load step, and orchestrate that connectivity with scripting, additional engineering, CI/CD pipelines that are not essential to extracting your features. The afterthought of all that is the glue code we also saw in the previous presentation: quite a lot of orchestration-only, non-value-adding projects emerge, and in larger organizations they are maintained by engineering, sometimes by the same data engineering team, sometimes outsourced, but that connection is currently not realized natively in one platform. The reason we could not really do both, the big data work and the distributed training, on Spark before version 2.4 is simply the way tasks are scheduled by Spark's native scheduler. Spark is known for its so-called embarrassingly parallel task scheduling: when we have a big data set, we split it into multiple partitions and let each individual compute unit of the Spark cluster work on its own portion.
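As a tiny PySpark sketch of that embarrassingly parallel model (nothing here is specific to any demo; it just shows each task working on its own partition, independently of the others):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Split one big data set into 8 partitions; each Spark task gets exactly one of them.
    rdd = spark.sparkContext.parallelize(range(100000), numSlices=8)

    def summarize(partition):
        # Runs independently per partition: no communication with the other tasks,
        # so if this one task fails, only this task is retried.
        values = list(partition)
        yield (len(values), sum(values))

    print(rdd.mapPartitions(summarize).collect())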
And if that particular worker, they're called workers, fails for some reason, its task is restarted without impacting the other tasks. That's the idea behind embarrassingly parallel scheduling on Apache Spark. What's needed in distributed training is a little different. To illustrate a task failure in the first kind of system, the embarrassingly parallel one: a green task crashes and it doesn't impact anything, because the system was designed from day one to tolerate individual task failures. In a distributed training system you actually have to restart all the tasks together, because they rely heavily on one another to communicate intermediate results, to shuffle data between them, to update each other on the stages of the job. You cannot restart just one task in a distributed-learning, distributed-TensorFlow system, for example; you need to restart all of them, and that implies a different kind of scheduling. The solution is something introduced in Apache Spark 2.4 a few months ago, called barrier execution mode. If you go to the Apache Spark home page you'll find the release notes, and this is the top feature of the release: it adds a barrier execution mode for better integration with deep learning frameworks, and introduces a number of additional built-in higher-order functions. Not only that: there's also an improvement in Spark 2.4 that makes user-defined functions much faster, tracked under the epic called optimized data exchange. And last but not least, there's a new way, currently available only in Spark's own standalone scheduler, to communicate the capabilities of the workers, the compute elements of the Spark cluster, so that a task you schedule on the cluster is aware of the underlying hardware, for example that there is a GPU of a certain type, and so on. These incremental improvements in 2.4 as a group are sometimes referred to as Project Hydrogen. I'll demonstrate the barrier execution mode in a bit more detail, but the idea is that in scheduling theory there's something called gang scheduling, gang execution: either the whole gang goes and does something together, like adversaries committing a crime together, or no one does. That's the idea behind barrier execution mode. For the implementation details you can read through the epic, the big JIRA that went into the release notes. The idea is to allow Spark, from some point in the job, to start all the tasks simultaneously, wait for all of them to complete, and then get back to the embarrassingly parallel mode. They still operate with the regular Spark mentality, but from some point onwards you say: from here, start all tasks together, treat this set of operations as one unit, and once they're all completed, come back to the regular scheduling. It also establishes a context so that other jobs can be aware of what's actually scheduled inside the barrier. I think it will become a little clearer when we get to the demo, but the API for barrier mode is so far available in Scala and Python.
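Here is a minimal PySpark sketch of that API; the actual training logic is omitted and the MPI/Horovod launch is only indicated by a comment, since the launch command depends on the framework, and it assumes a cluster (or local master) with at least as many task slots as partitions:

    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def train_partition(iterator):
        ctx = BarrierTaskContext.get()
        ctx.barrier()                                           # wait until every task in the stage has started
        hosts = [info.address for info in ctx.getTaskInfos()]   # addresses of all tasks in this barrier stage
        # ... here you would launch MPI / Horovod / distributed TensorFlow against `hosts` ...
        yield (ctx.partitionId(), hosts)

    rdd = spark.sparkContext.parallelize(range(8), numSlices=2)
    # .barrier() makes the whole stage start together, and fail or retry together.
    print(rdd.barrier().mapPartitions(train_partition).collect())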
There is nothing like that in SQL: if you want to call a user-defined function or something like it across the data set in barrier mode, you're still bound to Scala and Python for now. Inside the barrier context for a particular RDD you get some additional functions you can call; for example, you can ask for the IP addresses of the workers that are currently executing the tasks in parallel. It's like creating a cluster within the cluster that only runs the distributed, gang-executed job. I'll demo this a little later; let me skip this slide. One place this matters is MPI: MPI is how distributed training is typically launched in TensorFlow, for example. You have one single master host that you designate, and it may not be aligned with Spark's regular master host, the so-called driver; it can be quite different. To execute the TensorFlow job you need to know, for example, the IP addresses of the hosts participating in the distributed TensorFlow job. You can pull that information about the cluster-inside-the-cluster from the context within the barrier, and then launch TensorFlow in the usual MPI way. So far, as I mentioned, this is only in the standalone scheduler, but you can see there are already several tickets, several initiatives, to talk to the other schedulers Spark runs on, YARN, Kubernetes, Mesos, to cascade the same information back into the Spark environment and create parity within Project Hydrogen. Now, about optimized data exchange. In Spark your data is usually represented through the DataFrame API: you read with any of those 50 connectors into a DataFrame, and it's very convenient to work with, with simple functions to change the data, expand it, enrich it, extract features out of it. But most deep learning frameworks operate on different data types, usually native to Python, and they are not aware of the DataFrame format. One thing you could do is simply write the data out, wait for the write to complete, and read it back, but that's a waste of time and energy, because Spark is optimized to work entirely in memory without touching disk. So that's not a good approach. The usual approach is to apply a transformation to a particular piece of data, for example one row out of a huge data set, and pass it over to Python through a user-defined function; calls to TensorFlow from Spark are also made by invoking the data through user-defined functions. The problem with row-at-a-time user-defined functions, which have been in Spark for years, is the latency: you take a single row of a huge data set, convert it, work on it in the Python world, return it to Spark, and only then, for example, write it out with all the capabilities Spark has, or treat the data as a stream.
When we speak about latency, the measurements that were done estimated that roughly 90 to 98 percent of the time goes into exchanging data formats rather than into the actual execution of the Python code. That's just the cost of converting between Spark's native internal data types and the Python world where most of the distributed learning libraries operate. Project Hydrogen's answer to this problem is to introduce vectorized data exchange, and the idea is to send a whole column at a time, similar to a column in a columnar database. The user-defined function is executed against a set of columns anyway, so why not take the entire column out of the data set that has already been prepared and featurized and sits in the memory of the Apache Spark cluster, send that entire column into the Python UDF, wait for it to return, and merge it back. The behind-the-scenes realization of this was done using Arrow. The community behind Apache Arrow already has a lot of these transformations and has established a framework for other Apache projects to use them, so the Spark project partnered with them and leverages Arrow's transformations behind the scenes to speed things up. So when you call toPandas or use a Pandas UDF, Spark handles the exchange for you transparently in that columnar manner, and whenever I say Spark here I mean Spark 2.4 with Project Hydrogen: it all happens behind the scenes. Reusing the protocols already established by the Arrow community, plus some additional engineering by the Apache Spark committers, improved the calls to user-defined functions enormously, on the order of a couple of hundred fold in the comparison shown here between a row-at-a-time call to a UDF and a call to a Pandas UDF with the vectorized, columnar format. Different lambda functions were used in the benchmark, for example "plus one" or "subtract mean" applied to a single column.
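As a minimal sketch of the two styles being compared, with the "plus one" function mirroring the simplest benchmark case (it assumes pyarrow is installed; nothing else about the benchmark setup is reproduced):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, pandas_udf, PandasUDFType
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000).selectExpr("cast(id as double) as x")

    # Row-at-a-time UDF: every single value crosses the JVM <-> Python boundary on its own.
    plus_one_row = udf(lambda x: x + 1.0, DoubleType())

    # Vectorized (Pandas) UDF: whole columns travel via Arrow and arrive as pandas Series.
    @pandas_udf(DoubleType(), PandasUDFType.SCALAR)
    def plus_one_vectorized(v):
        return v + 1.0

    df.select(plus_one_row("x"), plus_one_vectorized("x")).show(3)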
So now I want to switch to a very quick demonstration of the barrier mode; I think it's one of the most important aspects of what's in 2.4. I have a couple of notebooks here, and first, is this UI familiar to you? Can you raise your hands, those who've seen the Databricks UI before? I know this is an Apache Spark meetup, so... OK, quite a number of people, that's pretty cool, thank you. For those who didn't raise a hand, the idea is that we at Databricks ship the Unified Analytics Platform, which only works in the cloud, so you cannot install Databricks on premise. It's a collaborative notebook space, a very powerful way for people to work on code together, and behind the scenes there is its own cloud-native, cloud-optimized scheduler on an Apache Spark core, with additional optimizations for connectors, for native Microsoft Azure storage types, for native AWS storage types, and additional performance optimizations. That's the ten-second explanation of what Databricks is, for those who weren't aware. So you see the notebooks here; as usual you have to attach them to a cluster, like Zeppelin notebooks, and can you raise your hands if you use Zeppelin? OK, some of you do. You attach the notebooks to the cluster so you can execute jobs on it; you connect the code to execution on a particular cluster by attaching. I have a cluster here, a native AWS environment, with four worker nodes, each with one GPU and 60 gigabytes of memory, and the notebooks are attached to it. Let's go into the first one, "distributed training", which executes Spark code to do distributed learning using Horovod. This is the regular MNIST example; does anyone remember MNIST, the digit-recognition samples? OK, sorry, it's the first time I'm at this meetup, so I'm trying to gauge the level of MNIST awareness, that's good. The idea is that we have a set of digits already captured as pixels, like in a picture: every black-and-white dot becomes a value, and you can build a long list of rows in a very large table to represent the graphical data set, then pass it to Horovod to train against the label set, train the model, and then output the predictions. We already have that table; if we go into the Horovod hydrogen MNIST table, so you don't just take my word for it, this is what every individual digit looks like: the zeros are the black dots, and the other values are the other shades of grey. Now let's start the job and train the Horovod model using all four nodes; let me just run it all. What happens here: we read the entire table into a Spark DataFrame, and that DataFrame is then repartitioned into two partitions. In Spark, when you have a huge data set, you partition it into smaller chunks, and each chunk is the actual scope of work for an individual Spark executor, the compute unit, so right now I expect two executors to actually work on the data set. I write it out into an intermediate format here, and then I execute the training function against that data set, running on two hosts.
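For orientation, the prep step just described boils down to roughly this; the table name and output path are placeholders, and the Horovod training call itself is hidden in the notebook's helper functions, so it is only indicated by a comment:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the pixel/label table and give two executors one partition of work each.
    digits = spark.table("mnist_train").repartition(2)   # "mnist_train" is a placeholder name

    # Write the prepared data to an intermediate Parquet location that the training code reads back.
    digits.write.mode("overwrite").parquet("/tmp/hydrogen_mnist_prepared")

    # ... the notebook then calls its (hidden) Horovod training helper against these two partitions ...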
So let's see; we need to take a look at this stage. OK, we have job stage one, which has just kicked off and is waiting for the second stage. It might take a little longer, but it actually doesn't matter if I run something else at the same time: what I want to do is use the same cluster for something in parallel. I'm not using any barrier execution mode here; I'm just trying to infer with the model I already have, and it's hidden in one of these functions, "predict using GPU". I want to infer the model against a live stream of data, in this case simulated, using the features column, and create another column in the same data frame called "prediction", the digit that the model recognizes, and then call an action to display it. What I'm ideally looking for is a two-column table: the label as originally seen, and then, based on the features column, the prediction that the real-time machine learning algorithm applies to it. So I kick it off by running it all, and all of these smaller functions are hidden away in the utils. I'm expecting the stream to actually fail, and the reason is that I'm trying to schedule four tasks: there are four nodes in this cluster, four GPUs, and I'm trying to schedule... yeah, it took a little longer than I anticipated, and this one has already completed. I guess I took a bit too long; let me actually stop this job and just rerun it for you. That's what happens when you do a live demo sometimes, and that's a real cluster behind it. I think what happened is that the previous job finished much earlier than I anticipated in the GPU cycle, so let me clear it all and restart it in a second, so you don't have to take my word for it. The idea is that this job, as I anticipated, has to fail here, because it will try to use all the GPUs available in this cluster, all four of them, to run this test, and two of them would actually already be occupied running the previous notebooks. So let me clear it all up; there's a very helpful function called the "kill all" API, just in case something is still running on the cluster. OK, let's kick this job off once more: doing the prep work, reading in the table with the digits, repartitioning them, writing into the Parquet format, and then reading that Parquet data back for the model. Oh, there we go; I think what happened before was just that I was a little too quick. You can see the usual TensorFlow startup sequence here; it has a lot of detail inside that we won't go into, but basically right now I'm training the model on the four-node cluster, and if we look at the event timeline here, zooming in, these are the two workers that were selected: they run the two partitions, repartition(2), against the GPUs. So now, if we go back to the model inference without barrier execution, and we just say: run on all available nodes, without any awareness of the barrier, run this code and infer the already-trained model against the live stream of data, what it will try to do is infer also on the nodes whose GPUs are already
occupied, because it tries to schedule four tasks, and at this point it should fail. I wanted it to fail, yay! This is sort of the opposite of the usual live-demo expectations, where you really don't want things to fail. It fails with the anticipated GPU-occupied error, and the idea is that we're simply not able to complete the task Spark wants because, under the hood, the GPU is already busy. So let's stop it all; let's actually go back to this job, clear it all, and clear any possibly running MPI jobs by killing them all. Now let's repeat the same experiment with the barrier mode. The only thing that has been added is this: repartition the same data set into two, but create a barrier for these tasks, and distribute the information that these two hosts are being used for the distributed deep learning scenario to anyone who wants to schedule jobs in parallel; it's a cluster inside the cluster. Let's run this and take a look at the execution. You can see that, because there are two partitions in the training set, there are two tasks going on, and if I now run the model inference that is aware of the barrier context, I actually don't have to do anything extra: Spark does it for me. Once I created the barrier around the two tasks in my previous parallel job, the other jobs I schedule on this cluster are aware which nodes currently have their GPUs occupied. In this other notebook I'm still doing the same thing: I read the stream, and for every micro-batch I read, I infer using the UDF, and now I don't want it to fail. You can see that executors number two and number eight in this notebook, and numbers nine and ten, are the different executors that were picked up. So although in Spark's mind all the nodes were available for scheduling, we only scheduled onto two, because we were aware there is a barrier concurrently doing something on the GPU-enabled hosts. That's the idea of the barrier mode, and it only just appeared: they released 2.4 in November, and the first talks about it, and I'm not the author of this presentation, were given by the committers in late summer. This concept of barrier execution is really the beginning of a much bigger progression, because you're now able to mix and match the power of Spark with the nature of distributed deep learning. The last bit is accelerator-aware scheduling. It's still under discussion; you can see it's marked as pending vote here. The idea is that you can specify a particular type of GPU, or some capability of the host, for the actions you execute on the Spark cluster, and Spark will execute those tasks only on hosts that conform to the criteria. But again, it's scheduling, so it depends heavily on the underlying scheduler, and there is work going on with the Mesos community, the Kubernetes community and the YARN community to actually make it happen, to make this awareness available back in Spark. As for the timeline, right now we have the barrier execution mode, and it's available straight away; you can also play around with it on the Databricks platform. The next big steps will be to integrate more and more functions into the barrier execution mode to make it easier to use,
for example to pull out the additional context information, and then, going forward in Spark 3.0, there will be the optimized data exchange and the completion of accelerator-aware scheduling. That's the nature of Project Hydrogen, and it shows you why we're rebranding this from just an Apache Spark meetup to a Spark and AI meetup: it's all converging into a single unified platform that merges the best of the two worlds together. I know I was running a little short on time, and we were actually supposed to end five minutes ago, but if you have any questions, please raise your hand and I'll be happy to address them.