Good afternoon. I think this is the last session, and the Julia talk is on in the other room, so I can understand if the audience is small. This talk is about a project I did a while ago, and it is more of a demonstration of how a system can evolve in Haskell. I was quite fond of Apache Spark, and I wanted to recreate some of its sophistication with the Haskell type system. The reason I started doing this was partly my coursework, but I also wanted to use part of it for a day-to-day problem. In particular, we applied it to a geometry application: you have a very large ship and you want to find out how many compartments intersect with each other, because the model we got from our CAD modeller wasn't very good, and we always wanted a clean model. Doing that on a small model of a ship was fine, but when it came to models with about 65 compartments and 18,000 plates (these elements are called plates), about 45,000 elements in total, it suddenly blew up. Spark was out of the question because our ecosystem was such that we couldn't run Java or a JVM runtime. So we wanted a solution that would let us do things quickly, not too sophisticated, but enough to solve our problem. That is the basic background.

So this is inspired by Apache Spark, but only loosely based on it. Like Spark it is distributed and it does in-memory computation, but we also had cases where we wanted to store a lot of files on disk, which typically happens with many CAD models. I don't know how many of you have worked with CAD models; my background is computer-aided design, mostly geometry, so the problems we get are a little different, because we have to get into space partitioning and that kind of thing. We had a lot of files on disk, so although it is an in-memory computation system, we also ended up doing some computation that reads files from the disk. You will find the repository linked there.

The way I want to structure this is: first, build some understanding of Spark — not that we don't know Spark, just to understand why these systems behave in a certain way — and then look at some of the implementations that already exist in Haskell. In fact, there is already an implementation that plugs into Spark, so why should I build my own?
But as I said, this is a side project, so that question is one I don't really want to ask myself. I wanted to demonstrate how the system was first built and how it evolved into something more elaborate. Along the way I more or less accidentally discovered that I could use the monadic parallelization pattern, and that is how the implementation stands, although it's not complete yet. So what I want to do is: describe the elements of the Spark subsystem, describe the Haskell machinery needed to achieve the same thing, and then show how we can take that as a base model and build on top of it to create something more elaborate. That's the basic agenda. Please feel free to stop me at any point, because I guess this will be a longish session, and I do want to jump into code from time to time. In a way I also want to expose myself, because I would like to invite a lot of questions about how things could be done and how they could have been better, so please treat it like a discussion; I would welcome that.

My target audience is beginner to intermediate: someone who knows about Haskell, has worked with it for some time, and would like to understand the Haskell ecosystem around distributed or Cloud Haskell well enough to build applications on top of it. And it's not a walk in the park. Normally I write pure code — I enjoy writing pure code — and suddenly when it gets to distributed Haskell that is no longer true. I encounter many difficulties and have to work around them, and some of the things that have to be done look like hacks, but that is how they are for now. Recently I was at the Haskell Exchange, and similar questions were asked there: how can we support polymorphic types in a better way, or more-than-rank-1 types in a better way? Probably we will get better answers over time.

Understanding Apache Spark: the reason I like Apache Spark is actually very simple. Anyone who has tried to build a distributed system of any kind knows that creating such a system is extremely hard. It's very difficult to get it right the first time; it is very difficult even if you try to write specifications in a protocol-ish way; and if you want to implement something like consensus, it's extremely difficult. So why do I like Spark? Because it allows me to specify my problem without understanding the complexities of the distributed system at a very low level. It hides a lot of things from me, and still it allows me to specify jobs in a linear way and run my calculations. Spark has a lot of sophistication built in, such as recovery and fault tolerance: if you start it on a Mesos worker and that worker is shut down or becomes unavailable for some reason, it recreates the whole hierarchy again. Those kinds of things make it a perfect candidate as a model to replicate. From that perspective, I only want to get into the basics of RDDs at a very general level, so that we understand what's going on.

OK, so if you look at a Spark program, it presents you with a very simple DSL. Can you read it? OK, right — the code is not doing much.
So it should be OK. The model is very simple: you create an RDD from existing data, and you keep building on top of it in terms of map, reduce, join, sample, union — those kinds of operations. It creates a DAG, a directed acyclic graph, in which you have something at the root, and at every stage you add something on top of it; that is how you specify your job. The advantage of such a system is that it's easy: even for a layman it's very easy to understand what's going on. You can see that it is trying to read some lines from a data source, then filtering only the errors, and then zooming in on only the kinds of errors you are interested in. Another thing to remember is that the whole pipeline is specified but not executed immediately: unless and until you do some kind of collect or cache, or force the calculation by some other means, it is not going to evaluate itself — something Haskell can do very well, so that's good. Because of these kinds of pipelines, it's very easy to create these kinds of jobs. (A tiny illustrative sketch of this idea appears at the end of this section.)

Before getting into actual implementations, let's try to understand more about data and its dependencies. We have some pipeline steps, for example map and filter, where the dependency is direct. When I say the dependency is direct, it means: this is my data, and as you can see it is partitioned into different partitions, but each child partition maps to one corresponding parent partition when you apply a map or a filter. I don't have to leave my base partition; it's a one-to-one correspondence between child and parent. Sorry about that. This is called a direct dependency. It doesn't really matter where the partitions are: they can be on the same node or machine, or on different machines. Spark will try to optimize the co-location of these partitions, but it will not guarantee it; nevertheless, the dependency is direct.

With a co-located dependency, things look a bit different. You might have partitions, but these partitions are replicated across different nodes: this is one node, this is another node, and both contain similar partitions. If you want to compute something, you have to contact both nodes and their partitions, so your dependency is not only on the parent partition but on partitions located across nodes. That is called a co-located dependency.

Then, when we try to do something like groupBy or reduce, we have to operate on all the keys, and the keys might not be partitioned neatly onto one node, so it results in what is called a shuffle dependency, because in the worst case you have to go back and forth between all the nodes. That is the wide, or shuffle, dependency. So Spark distinguishes between direct, co-located, and shuffle dependencies. All of this comes from the original Spark paper, which you must have read, of course.
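As an aside, here is a tiny, purely illustrative Haskell sketch of that "describe the pipeline now, evaluate only on collect" idea. This is neither Spark's API nor the library from this talk; the `Pipeline` type, `collect`, and `errors` are hypothetical names.

```haskell
{-# LANGUAGE GADTs #-}
import Data.List (isPrefixOf)

-- A pipeline is just a description (a small DAG); nothing runs when we
-- build it with Source / Map / Filter.
data Pipeline a where
  Source :: [a] -> Pipeline a
  Map    :: (a -> b) -> Pipeline a -> Pipeline b
  Filter :: (a -> Bool) -> Pipeline a -> Pipeline a

-- Only 'collect' walks the description and actually computes the result.
collect :: Pipeline a -> [a]
collect (Source xs)  = xs
collect (Map f p)    = map f (collect p)
collect (Filter p q) = filter p (collect q)

-- The classic Spark example: read log lines, keep only the errors.
errors :: [String] -> Pipeline String
errors logLines = Filter ("ERROR" `isPrefixOf`) (Source logLines)
```

The point of the sketch is only the separation between describing the job and running it; distribution, partitioning, and dependencies are what the rest of the talk adds.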
What Spark tries to do is optimize the data transfer and the movement of partitions between nodes so that it can do the job efficiently. Not only that, it also maintains the lineage of all the partitions. This example is again taken from the Spark paper and shows the PageRank algorithm: there are various stages in the algorithm, and those are maintained, but at the same time their dependencies are also maintained. The reason the dependencies are maintained is that if for some reason a node becomes unavailable and a particular partition is lost, then by looking at the lineage it's possible to recreate it. That's why this kind of lineage is kept in the Spark ecosystem, and so some kinds of failures are handled — if you have a truly catastrophic failure it will report an error anyway. If you look at the overall picture, you can see that things are done in stages: this is stage one, in which a groupBy operation is done; this is a simple map; then a union is done; and finally there is a join, which results in these three nodes and together represents the data.

Now the main aim is to somehow replicate these features using Haskell, using distributed Haskell. There are already existing systems. This is one that came out recently, and you can actually go and look at it. It is interesting because at its core it has inline-java — a component that they say is ready to be split out, though I haven't seen that yet. It wraps itself in a jar file as a Spark job, so you can actually submit it with spark-submit, and the API you write in Haskell closely mirrors — in fact is almost the same as — the Spark syntax you saw a few slides ago. So it's definitely worth a look. I haven't run it, frankly, but the blog post this link points to is amazing, the way they have explained it. These are the same people who work on distributed Haskell — no wonder it came from them — and I also wonder why they didn't base it on distributed Haskell; instead they pushed it into and wrapped it inside a Spark jar file. inline-java is very similar to inline-c and the other inline libraries, so this is a very good candidate to look at. In fact, they also have an R binding for Haskell, so not only can you do Spark programming, you can also do R programming from Haskell.

This is another library you will find on Hackage. They are a little different from distributed Haskell in how they approach it, but at the core the concepts are very similar: how to transfer code from one node to another. I had some help from one of the team members of this project with a few things related to asynchronous process creation, and they also have a paper coming up which should be published at the next ICFP, or this ICFP, I'm not sure — worth a look as well.

And now we come to the implementation on which we are going to demonstrate our code. This is distributed Haskell, also called Cloud Haskell. They specifically use static pointers and a remote table, and we'll see what those are. They can of course handle monomorphic types, but it is also possible to handle rank-1 polymorphic types, and for that they use closures; we'll see what these are and why they are required. The development, however, is not very active.
They have been on version 0.6 for a long time, and the last release came out in February; the GitHub repository is not very active, but I guess they are busy doing a lot of other things. In fact, they initially created this remote-table approach, and that inspired the GHC developers to implement static pointers inside GHC. If you are interested, search for Cloud Haskell and you can find Simon Peyton Jones's GHC wiki page, where he describes the why and how of static pointers.

Static pointers are core to distributed Haskell, and this is where the question arises: how can we send code over the wire? How does it happen in Java, or Scala, or another JVM-based language? We know that Scala objects are serializable — case objects and case classes are directly serializable — and it's possible to serialize a closure as well, so we can take the closure object, ship it to another location, and recreate it there. How can I do that in Haskell? It's pure; I don't know how to do it. I can't, because at runtime all the type information is gone, so it's very hard to imagine how one could go about it. That is why the StaticPointers extension was added to GHC (in GHC 7.10). It's an interesting one: thanks to it, it is possible to create a function, create a static pointer to it, take it to the other side, and call exactly the same function there.

To be able to use a static pointer in Haskell, you have to enable the StaticPointers extension, which you put at the top of the module. This is the basic usage: suppose you have a function called square whose job is just to square an integer. You can use the static keyword, pass the function to it, and you get a static pointer; you can then get information about that pointer, and it will give you the module name, the function name, and more — the whole lineage, or provenance, of that function is stored in it. At the same time you can dereference that pointer, apply an argument to it, and get the result back. So it's possible to take a function — pure or impure, it doesn't matter — create a static pointer, dereference it, and call it.

So far so good. But how can I use this to take my function, created here in this environment, to some other location — transfer it over the wire and recreate that environment over there? The static pointer by itself will not do that work; it is not sufficient. (I have only shown the function declarations here.) If you look at the static pointer, it provides information about the package, the module in which it lives, the name, and some other details, but the most important thing it gives you is a key, and this key provides a way to serialize this function or data to another location and recreate it. The static key is nothing but a fingerprint, which is defined in core GHC, and this fingerprint is a kind of hash. Whenever you write static and give it some function, the function is hashed, and that hash is stored in a table kept in the same module; the table contains this hash.
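Putting that basic usage into a small sketch, assuming GHC 7.10 or later with the StaticPointers extension (module and identifier names here are mine):

```haskell
{-# LANGUAGE StaticPointers #-}
module StaticDemo where

import GHC.StaticPtr (StaticPtr, deRefStaticPtr, staticPtrInfo)

square :: Int -> Int
square x = x * x

-- 'static' turns a statically-resolvable expression into a StaticPtr;
-- its key (a fingerprint) is what eventually travels over the wire.
squarePtr :: StaticPtr (Int -> Int)
squarePtr = static square

main :: IO ()
main = do
  print (staticPtrInfo squarePtr)     -- package, module, source location
  print (deRefStaticPtr squarePtr 5)  -- dereference and apply: 25
```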
So now I have a way to obtain the fingerprint. The fingerprint is just a hash, so I can pass it over the wire and recreate the function on the other side: if I have a static key, I take it to another place where the same things exist and check whether that static key is present there; if so, I can get the function back. In the current version this lookup is defined in an unsafe way — I haven't looked yet, but in 8.0 they said there might be a safe version; the implementation I have uses the unsafe one. So now, if I have two connected nodes, I can take the static key, send it over the wire, look up whether that key exists on the other side, get the function back, dereference it, apply arguments, and resume my calculations over there.

(Audience question.) Yes — both nodes have to have the same basic substrate. It is possible to build on top of it and then transfer the whole thing to another node — that is what we are going to look at — but the substrate has to be the same; you cannot change it. What you can change is the context, and that is what is called a closure.

The static key by itself gets you to the point where you can recover the function on the other side, but that is not sufficient: you not only have to recover the function, you also have to take the data from one place and apply it to the function in the same way you would apply it here. That's what the closure is for — after all, in Scala or Java you do the same thing: when you serialize something, you serialize not only the object but also the closure around it.

Now I can define something like this. This Static is actually one of the types in distributed Haskell; it is a wrapper over the StaticPointers extension that GHC has and over the remote-table concept that distributed Haskell has (we'll see what that remote table is). It's enough to read it as "the static of a function". Suppose you have a function, and some data that you have also turned into a static pointer; then you can compose them and get a Static back, which gives you a way to compose things. If you have two functions and create two statics — say b -> c and a -> b, just like function composition in pure Haskell — you can compose the static functions, get a resulting static function, and send that over to the other side. At the same time, this by itself is still not sufficient, because we are still only talking about the function and recreating it on the other side, not about how we transfer the data along with it. That is what the closure does. In distributed Haskell we build a Closure around the Static, and the basic definition is very simple: you have a Static of a decoder, which takes some binary representation of the data and produces the data type, plus the encoded data itself. So this is the static function you want to apply, and this is the encoded data you want to decode. Suppose you have a lot of data which you encode into a binary string; you then create a closure from a Static of ByteString -> a and build the Closure around it.
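The corresponding pieces in the distributed-static package look roughly like this; the signatures are from memory, so check Control.Distributed.Static for the authoritative versions (the two helper names are mine):

```haskell
import Control.Distributed.Static (Static, Closure, staticCompose, closure)
import qualified Data.ByteString.Lazy as BSL

-- Compose two statically-known functions; the result can be resolved on
-- another node through the remote table, just like its parts.
composeRemote :: Static (b -> c) -> Static (a -> b) -> Static (a -> c)
composeRemote = staticCompose

-- A Closure pairs a static decoder with an encoded payload: the payload
-- travels as bytes, and the decoder is looked up by its fingerprint/key.
payloadClosure :: Static (BSL.ByteString -> a) -> BSL.ByteString -> Closure a
payloadClosure = closure
```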
So, given such a static decoder and the encoded data it has to decode, you get a Closure, and closures can in turn be composed in the same way you compose statics. Do let me know if you have any questions.

(Audience question.) Yes — basically a Static is a fingerprint, a hash, and that hash points to a table inside the runtime system you have running. On top of it you add the ByteString, which is a binary string encoded from some data — you might have a tree, or a lot of integers, anything — and then you put the whole thing together so that you can pass it across, recreate it, apply the function, and run it over there. That's right.

(Audience question.) Yes — no, this is the decoder. When the ByteString is taken to the other side, to a different machine, and the Static is resolved, you apply the ByteString to the function and get the value of type a back. Since the Static is represented as a fingerprint, what you are transferring is a byte string plus a byte string; you take them to the other side, resolve the fingerprint into a function, apply it, and get a value of type a back.

(Audience question.) No — since we have to transfer the fingerprint, and the fingerprint has to be present on the other side, the substrate, as I said, has to be common. It's not possible to ship a new code base on the fly. Yes, it is one of the pain points, but that's how it is. Sometimes simple things are easy, but because Haskell has full static types and the types are erased at runtime, you cannot do certain things easily; you need an elaborate mechanism for this.

But these mechanisms also have limitations, which we are going to see. In Haskell we always deal with type classes. The square function we saw operated on integers — what if we have a function that takes a type-class constraint? Say I have a number I want to square; if I take the function to the other side, the type-class information is going to be lost. This is because of the way Haskell is implemented: the type class is converted to a dictionary, and that dictionary is passed to the function. Since we are not converting the dictionary into any encoded form, the information is simply lost. So if we transfer this function, it will not work on the other side, because there is no way for it to know that this function has a Num constraint. So we still have a problem.

And what we want to do is serialize data, and the way to serialize data is to declare an instance of the Binary type class, which lets me serialize a value of type a to a ByteString and get it back. I don't want to write a binary conversion for every type by hand; I just want to use the Binary class. So even with static pointers and closures, I cannot simply take my type, put it on the other side, and get it back — I still have to do a few things.
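For completeness, a minimal sketch of getting such a Binary instance for your own data; the Plate type is just an example from the CAD setting, not anything from the repository:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Binary (Binary, encode, decode)
import GHC.Generics (Generic)

-- A user-defined type we want to send over the wire.
data Plate = Plate { plateId :: Int, area :: Double }
  deriving (Show, Generic)

-- With a Generic instance, the binary package fills in put/get for us.
instance Binary Plate

roundTrip :: Plate -> Plate
roundTrip = decode . encode   -- encode to a lazy ByteString and back
```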
Taking a monomorphic type to another location is easy — taking an integer is easy, taking a ByteString is easy — but taking a type-class constraint across requires more effort. That is where, even in Cloud Haskell, a trick is applied which is called the dictionary trick, and surprisingly it is essentially the same kind of dictionary the Haskell runtime creates when you use a type-class instance and passes to the function. Suppose I want to encode the dictionary for a type class — say my function takes arguments that must be ordered. What I do is create a GADT in which I force the type a to be Ord. It is essentially a phantom type: no data of that type appears in the constructor, but because I declared that it must be an instance of Ord, Haskell captures that dictionary and encloses it in the value. When I rebuild that Ord-dictionary value on the other side, it brings the Ord constraint back into scope. So I hide the Ord dictionary inside a value and reveal it again on the other side. (There is a small sketch of this at the end of this section.)

Taking the same square example, which works on numeric types: you would create a numeric dictionary whose constructor forces the type a to be Num. Most of the time you want Typeable and Binary, and those dictionaries are already provided; if you want to support your own type classes, you have to do something like this yourself. Sometimes you simply have to: for example, if I want to implement the reduce step, my keys have to be ordered, so I have to tell the runtime that and encode an Ord dictionary into it. It is a bit of a pain — a sophisticated pain, perhaps — but this is how it is.

And there is still a problem: you still cannot take a polymorphic type to the other side. For that there is the concept of a remote table. I have an example of it, which you can read, I suppose. Distributed Haskell gives me a Template Haskell macro with which I can register my polymorphic function, and it creates a string dictionary of types. There is a library called rank1dynamic — so these are rank-1 types. If you have a function from a to b, it converts it into an "any to any" representation, and since there is no other constraint, as long as the types unify it should be all right; if you have a type-class constraint, you would already have encoded that into a type dictionary, and the rank1dynamic library is able to resolve that as well.

So there are two ways of getting your functions to the other side: for monomorphic types, use the StaticPointers extension; for polymorphic types, use the remote table. You have to ensure that the same remote tables exist on all the nodes. Fortunately, when you create nodes and start the transport layer with the remote table, it is effectively the same on all the other nodes as well; once you have it on one node, you can have it on the others. That was the core part: how to get your code onto other nodes.
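Here is that small sketch of the dictionary trick. The names OrdDict and sortRemote are mine; distributed-process ships a similar SerializableDict for the Serializable constraint.

```haskell
{-# LANGUAGE GADTs #-}
import Data.List (sort)

-- The constructor stores no data, but building it captures the Ord
-- dictionary; pattern matching on it brings Ord back into scope.
data OrdDict a where
  OrdDict :: Ord a => OrdDict a

-- On the receiving side we can use the constraint again, even though
-- the type of the list says nothing about Ord.
sortRemote :: OrdDict a -> [a] -> [a]
sortRemote OrdDict xs = sort xs
```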
Now we can talk about how to manage things on multiple machines, once transferring code from one side to another is taken care of. There is a logical compartmentalization called a node, and you can have multiple nodes on the same machine. It's a bit similar to Akka: you can have separate actor systems either in the same OS process or as different OS processes on the same machine; here that unit is called a node. Nodes act as logical partitions between subsystems, and a node is a container — a kind of owner of everything inside it, so if you terminate a particular node, everything inside it terminates too. The unit of execution in distributed Haskell is the process: a node can contain multiple processes, and when they communicate, they communicate using binary serialization over the wire. You can either send a request and receive a typed message, or you can establish a channel; the channel is safer and faster because of the way it is implemented — it uses transactional memory (STM) to transfer a message from one process to another. If you just send a message and specify the address — suppose I want to send a message from process two to process three and I just send it — then it has to do a lookup, resolve which process it should go to, and only then does it reach the process. Fortunately there is a faster method, the channel: you can establish a typed channel between two processes, and that speeds up the transfer of data between them. So now we have nodes, which are logical containers, and processes, which live in nodes, and as I said they can talk to each other using these basic communication mechanisms. STM is at the heart of it, but it is buried deep enough that you rarely have to look at its implementation.

If you are on the same node, you don't need a closure: you just start a process. Inside the same node you simply spawn a process and that's it; only when you have to create a process on another node do you have to do it explicitly with a closure. Indeed there are two versions of creating a process: spawnLocal, which creates a process on the same node, and spawn, which is the remote version — and of course for spawn you need a closure. (A tiny sketch of both appears at the end of this section.)

Akka has a receive method that takes a partial function, and you can concatenate partial functions, and that's it. Processes here work a little differently — it is more like a general version of become in Akka, more of a finite state machine; that's the way to look at it. The moment you receive a message, you always get a chance to transform the process's state into another state, and that is a good way of designing it, because you can treat your process as a finite state machine. Once you have a process, you can also link to it to monitor it, and whenever the process is killed, becomes unreachable, or you want to cancel it after some time, you can handle that through linking.

So now we have all the machinery to build our own implementation, and we can go about it.
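Here is that tiny sketch of spawnLocal versus spawn, assuming distributed-process; node and transport setup, and registering the generated remote table with the node, are omitted:

```haskell
{-# LANGUAGE TemplateHaskell #-}
module SpawnDemo where

import Control.Distributed.Process
import Control.Distributed.Process.Closure (remotable, mkClosure)

worker :: Int -> Process ()
worker n = say ("processing chunk " ++ show n)

-- Template Haskell registers 'worker' in this module's remote table.
remotable ['worker]

demo :: NodeId -> Process ()
demo other = do
  _ <- spawnLocal (worker 1)                          -- same node: no closure
  _ <- spawn other ($(mkClosure 'worker) (2 :: Int))  -- remote node: closure
  return ()
```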
What I created is a special kind of process with a fixed life cycle that serves only a few purposes: it is created, it is live, it stores something, it expects that at some stage somebody will take that data, and once someone takes the data it terminates itself. The whole purpose is to live temporarily on some node; you can create multiple such partitions and communicate with those processes. This is again a phantom type: the type parameter doesn't appear in the constructor at all, but it lets me fetch typed data from these processes. This is the basic building block used when creating the system, in which we have a lot of partitions and transfer data from one partition to another. A "blocks" value is essentially a set of partitions, so you can create a special block that can talk to many blocks at the same time. This gives a nice way of implementing union — union has no cost: I just map to different partitions of the same type, and the type system only allows me to combine blocks of the same type. This is how the data is transferred: each partition is really nothing but a block, and once its data is transferred, the old block terminates and becomes another block. When I transfer the closure, the closure is applied and the result turns into another block.

There are some sundry things for management. You have to have a master node, on which you get the result, and all the slave nodes are used to create partitions, transfer data, and do the processing. You need a remote table so that you know you are running the same things on all the nodes. The strategy is not used right now — I found the strategy, the way I implemented it, a little ineffective — but the initial intention was to have different execution strategies.

The definition of RDD itself now becomes very simple: as long as you have data that can be serialized, you can create an RDD, and the main function the type class needs is flow, which says how that data can move to some block B. We can use this definition to create a composable stack of RDDs. When the whole system is run — this is a very small toy program — you say how many partitions you want; if you don't specify any, all the nodes are used for partitions, and if you give a number, only that many nodes are used to store them. If you remember what the Spark DSL looked like, this looks similar, although you now have to provide extra dictionary types so that the system knows how to serialize things on the other side. You seed the data, you map it, you provide the reduce step, and only when you collect are all the transformations actually executed. Again, you need the dictionary trick to provide the qualified type information here.
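To recap those shapes in code: this is an illustrative sketch only, not necessarily the exact definitions from the repository (Block, flow, and RDD here are my reconstruction of what was described).

```haskell
import Control.Distributed.Process (Process, ProcessId)
import Control.Distributed.Process.Serializable (Serializable)

-- A phantom-typed handle to a temporary storage process: the type
-- parameter is never stored, but it prevents fetching a block of Ints
-- as though it held Strings.
newtype Block a = Block ProcessId

-- An RDD is anything whose (serializable) data knows how to "flow"
-- into a block living on some node.
class RDD r where
  flow :: Serializable a => r a -> Block a -> Process ()
```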
Any questions? Actually, I think it's best if I just run the tests; they are very small, but they exemplify the system really well. This shows the whole flow: how you create multiple nodes, create the data — I'm specifying a job with a hundred random numbers — and then do reductions over some partitions, and it works well. When I ran the intersection problem that I talked about at the start, I got about a five to six times speed-up, which was good enough for me.

Coming back to the implementation: the implementation now follows the type-class definition. You have a seed, in which you say how many partitions you need (which actually comes from the context); you need a closure of the data you want to push; and you need a dictionary for what you want to push. Similarly you can create a map; map looks a little different because now there are three types involved: the base is an RDD that gives you B, you want to produce C, so you need a closure of B -> C. When you have the base, you apply the closure and you get values of type C. What you are providing is the way to compose things together; this block is transferred to another location, and that is how the mapping operation happens on the other side.

The implementation itself is very simple: you get all the blocks that represent the data of the base RDD, you map over those blocks to send the closure to some nodes, you spawn a process over there, and you just wait for all those processes. All the processes that are created are the ones you are interested in: the operations are now scheduled, they run on the other systems, and you just wait for them to complete. In the meantime, if any of them fails, then since the master node links to those processes, you get a notification. There is no recovery: I just get a notification and then I stop, that's it. Sometimes it hangs too, but most of the time it goes through cleanly — at least on Linux and Mac it works well; on Windows I had a lot of issues that I wasn't able to solve.

Reduction was a problem, because I have to pass in the Ord dictionary and I also need the Serializable dictionary, and there is no simpler way of doing it than having all of them together here. Then there is the partitioning function: in Spark the partitioning is more or less automatic — even if you don't specify one, it will pick one depending on the type you are dealing with — but here we have to provide it. Probably we could infer one for some simple types, but I don't think it's possible in general, so it's better to specify one here.

Reduction typically happens in two steps, not one. There was a paper by Ralf Lämmel on Google's MapReduce model in which he laid out how the reduce step can be expressed with pure functional languages. He essentially advocated a two-step process: in the first step you create partitions and segregate the data into multiple parts; in the second step, using all those partitions on all those nodes, you take the data and do the actual reduction. So the first step is partitioning, the second is the actual reduce. Shuffling happens in those two places, and that way the processing is reduced, because the partitioning has already happened and I know exactly which partition to go to to receive the data.
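The partitioning function you pass in can be as simple as a hash-based partitioner. A minimal sketch, using the hashable package (not necessarily what the repository does):

```haskell
import Data.Hashable (Hashable, hash)

-- Map a key to one of n partitions; every node computes the same
-- assignment, so each reducer knows exactly which partition to fetch.
partitionOf :: Hashable k => Int -> k -> Int
partitionOf n key = hash key `mod` n
```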
So this is the implementation. I roll over all the slaves and spawn the first stage — again you can see the stage-one closure being passed — and I wait for all those process IDs. The second stage is the shuffling stage, where the combiner function, which combines two values together, is passed, and again it is scheduled on those machines. The execution uses a simple balancer, a round-robin kind of thing: if one task is scheduled on one node, the next task is scheduled on the next node, and so on. Direct dependencies are always localized. I think it's fairly naive the way it is done now, but it gives good results for the purpose I was running the job for. So the execution strategy was actually simple, and it tried to distribute work equally across whatever nodes were present.

The important thing, though, was the concept of a block, which I found really useful, although other things could improve. For example, the process could represent a storage block more generally: I can modify my block process so that it doesn't only store data in memory — it doesn't matter if the data lives on disk or on HDFS, as long as I can retrieve it, depending on the scheduler or if I want to implement some LRU strategy. So that was good, but there were a lot of limitations, and there is still a lot to improve. Controlling the life of a process was a problem, because the process, being an IO process, would sometimes die, and although I would get a notification, it was difficult to recover from that. One thing that has been solved — I'm told by the maintainers — is that on the local node no serialization happens; the data is handed directly to the other process even if you spawn it with a closure, so it has that kind of optimization.

But now I wanted to go further, and that is where the next part comes in. It is still under development, but it is very interesting to look at, because right now I have a specification, yet the moment I specify a map job, all of my processing is defined immediately — it pins down how things will run, and I have no control on top of that. So the main thing I wanted to do was separate the plan from the execution, and that is where a continuation monad comes in. Using a continuation monad, at every planning step — suppose I create a seed, then a map — I convert that step into something I can hand to a scheduler, so that the scheduler can run the job asynchronously. This library was helpful here: it was created by Simon Marlow, and there is a beautiful paper about it. It provides the Par monad, and the Par monad is essentially a continuation monad in which threads talk to each other through IVars, through shared memory. An IVar has two states, empty and full; once it is full, the dependency kicks in via the scheduler and the next step can run. It's very interesting to look at, because the monad-par library makes the computation deterministic — although once you bring in the IO monad, even with all that determinism, you cannot really say it is fully deterministic. That was the reference point I used to convert my execution into an asynchronous execution across multiple nodes.
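A small example of that IVar style with the monad-par library (Control.Monad.Par): fork fills IVars, get blocks until they are full, and the scheduler orders the work from those dependencies.

```haskell
import Control.Monad.Par (runPar, new, put, get, fork)

-- Sum a list in two halves in parallel; 'put' fills each IVar once,
-- 'get' waits for it, and the overall result is deterministic.
parSum :: [Int] -> Int
parSum xs = runPar $ do
  let (as, bs) = splitAt (length xs `div` 2) xs
  i <- new
  j <- new
  fork (put i (sum as))
  fork (put j (sum bs))
  a <- get i
  b <- get j
  return (a + b)
```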
For example, if some node is taking longer to run a process, I should be able to steal the work given to that node and put it on another one. On the slide is the Fibonacci parallelization example, just to give an idea: the i and j in this continuation monad are nothing but IVars; when you evaluate the values of this function you put them inside the IVars, and then you can get those values out and use them elsewhere. In the same way, I wanted to create a semantics for distributing work across the cluster. The rules are simple: always have some unevaluated computation and pass it around; ensure that the result of that computation is not required immediately; and make the result shareable so that other components of the program can use it. This looks like a future, right? A future does the same thing: you have some unevaluated computation whose result you want at some point in the future. So now I have a plan and I can fork: I create a fork, this is the child plan — a child process — this is the parent, and that itself creates a plan; when I'm done with a task I create a new task and put the result back into the IVar. But how can we create an IVar that is valid across the network?

Before going into that: I can also simplify my RDD definition by using an algebraic data type. Seed just becomes data, so as long as I have something serializable I can create an RDD, and I can specify all the different RDD variants in the same definition. This looks very similar to the RDD in Scala, because that also has a finite number of methods — sample, join, union, and so on. Now I can take this RDD, put it inside a continuation monad, and create a step, a plan: I have a function that takes the RDD and creates the plan out of it. Once I have the plan, I can use the scheduler to create the execution mechanism that runs on different machines. So not only do I have a separation between execution and scheduling, I also have the freedom to choose my own scheduler — I could choose to reuse my old scheduler, for example.

What I have to do is make the IVar network-aware. The IVar is no longer an IVar created on this machine; it can be somewhere on the network, and its semantics make it look like an IVar: I can send it a message, and if the value is not there it tells me it is empty; if the value has been filled in, it tells me the process ID where the value lives, and then I can go and fetch those values into my function. So if I have a procedure that takes a value and creates a process, I can translate it into a networked-IVar version of the same thing and build a closure around it. So, going from a simple process that takes a closure to an IVar-based process: transforming this function into this function gives me the freedom to spawn a function anywhere on a node and get the data from anywhere on the network. Now I can cycle over all my nodes and have a control node from which I keep scheduling these processes. And this should give me freedom — "should", because it's not done yet; only the first step is done, so I cannot really tell how useful it is. Basically my IVar becomes a networked IVar: it has the same semantics as an IVar, but it resides on the network, and it points to the process where the data is actually held.
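An illustrative sketch of what such a networked-IVar process could look like (all names here are hypothetical, not the repository's API): the fill action supplies the value once, and every later request — a ProcessId asking for the value — gets the value sent back.

```haskell
import Control.Monad (forever)
import Control.Distributed.Process
import Control.Distributed.Process.Serializable (Serializable)

-- A tiny "networked IVar": it is empty until the fill action produces a
-- value, after which it serves that value to every process that asks.
netIVar :: Serializable a => Process a -> Process ()
netIVar fill = do
  value <- fill                    -- e.g. 'expect': the "put" that fills it
  forever $ do
    requester <- expect            -- each request is the asker's ProcessId
    send (requester :: ProcessId) value
```

The handle to such a process is what would travel around the cluster; polling it tells you whether it is still empty or where the value now lives.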
So basically I don't want to copy the data immediately; the networked IVar can simply hold a reference to whatever I want. (Audience question about failures and redundancy.) That did cross my mind, but I haven't thought it through yet; it would have to be handled, perhaps by keeping redundant copies, but I don't have an answer right now. At least in the good case, this should theoretically work. The scheduler itself is actually very simple: I can have a scheduler that runs tasks on the same node — once one task is run it schedules the next, and the next — or I can have a work-stealing pattern that schedules work across all the nodes, and if some node is heavily loaded, it takes some load off it and puts it on another. That implementation I took directly from the monad-par library; I wanted to use it unchanged, but because the IVars are now networked that isn't possible directly, so that work is still ongoing.

Finally, these are the references you can go and look at.

(Audience question.) Yes — I have thought about using Template Haskell for that, but I haven't given much thought to whether it would work or not; frankly, I'm not sure. Haxl is not doing that, right? Haxl is solving the N+1 query problem in a deterministic way. I don't know, frankly. Last week I was at the Haskell Exchange, and Simon Peyton Jones explained how some type-representation information could be put inside Core as a trusted part, and probably that could solve some of the issues, but I would still say it would not solve all of them, because we would still have problems with polymorphic types. So that's not there yet.

(Audience question.) Yes — actually, it's a reference to a process, right? The only thing you are serializing is a process ID; there is no closure present, so you'll get an error — it won't be spawned, and in fact you get the error immediately. So you always have to bracket it so that you can handle that exception. And the same code you are running on one machine has to run on the other — the data and the composition of functions. You could do that, yes. Maybe the way is to send some Template Haskell code and reify it at the other end, and probably you would get something similar, but I don't know anything about Template Haskell; I have never used it — somehow I'm afraid of it.

(Audience question.) No, it runs in its own monad and always creates its own bindings. Overriding a binding would mean it has a somewhat different execution engine going on. Yes. In fact, with the StaticPointers extension you cannot even run such code in the interpreter.

(Audience question.) Yes — I guess when you do spark-submit, the jar file is co-located on every node and loaded there; I think that's the same idea as sending Template Haskell code and loading it. The earlier company I worked for had a proprietary Haskell-like language, and we used to compile it into bytecode, and the whole bytecode would be serialized. It was very strict, but Haskell-like. You might have heard of it — Mu, at Standard Chartered Bank; the compiler was written by Lennart Augustsson. It always runs the bytecode, and the good thing was that you could generate bindings to other languages, like JavaScript, and once the C++ API was linked in you could access it through the shared interface. That was really good. No, it wasn't lazy at all; it wasn't even recursive — it didn't support recursion. Any other questions?
So do have a look at my repository, and I invite you to participate, because this is my side project. I used it once to solve the intersection-of-objects problem, and also to schedule a few things on different machines, but not in production, although someday I would like to — it's an interesting problem to solve and I would like to use it for that. But being a side project, it tends to stay that way. OK, thanks.