Good morning. Please welcome our next speaker, Honza Kral, who will be talking about designing a Python interface.

This presentation is about a research and development project that we are working on with David, with students of our laboratory, and with a third partner, Rafael Ijeske. The agenda of this presentation is: the motivation behind the whole idea of the Embedded Flexible Language, our objectives, the programming model of EFL, the execution model, the implementation, and then the focus of this lecture, which is how we can implement parallel design patterns in EFL, followed by conclusions and further work.

First, the motivation. We know that today there is huge heterogeneity and incompatibility among parallel programming platforms: MPI, OpenMP, Python threads, Python multiprocessing, etc. There is a need for a common approach that makes it easier to program parallel systems and frees programmers from each platform's technical intricacies.

Our objectives. A major objective has been, and continues to be, to develop a straightforward language which implements that common approach and allows implicit instead of explicit parallel programming, making it easier for the programmer to program parallel systems. This should allow flexible computation, which we will talk about further on, in which sequential and parallel executions produce identical, deterministic results. It doesn't matter whether we run the same program sequentially or in parallel; we ensure that the result values will be the same. The execution is completely deterministic. To facilitate this, a deterministic parallel programming tool has been developed, and its name is Embedded Flexible Language, EFL.

The programming model of EFL. Suppose we have a sequential program, and we decide that, to get better performance, we want to parallelize part of the code.
Then, in those parts of the code that the programmer wants to parallelize, we embed blocks of EFL, as you see here in the slide, and the sequential parts of the program are written in the host language: maybe Python, maybe C, maybe any programming language. The parts of the program which are to be executed in parallel are written as embedded EFL code. The EFL syntax is C style. Why? Because we wanted something universal, so that the same embedded language could be used with any host language, and the most widely used syntax in programming is the C style. Because of that we decided that the EFL syntax would be C style and host-language independent. The semantics of EFL is deterministic, like in functional programming. Memory management we leave to the host language. The semantics is implemented by translating embedded EFL blocks of code into parallel code of the host language.

Then what are the principles of the EFL programming model that ensure deterministic parallelization? First, the programmer should call only pure functions, functions that don't have side effects, ensuring the functional programming concept of referential transparency. Second, variables used inside EFL blocks may be of two kinds only, in or out, but not in-out: variables that we only read from or only write to, but not both. And third, the important concept of once-only assignment, which is connected to the principle that variables may be only in or out.

Then the execution model of EFL. The key aspect of the EFL execution model is that any parallel or sequential execution order of a program written according to the EFL programming model will yield identical, deterministic values. Because of that flexibility of execution orders, we call our execution model flexible computation. We will now try to understand why we need once-only assignment.
You see here in this code example that we initialize x and y with one and three, and then here is a block with three lines: x receives the value of f(a), y the value of f(b), and x equals x plus f(c). If we execute this code sequentially, line one, then line two, then line three, the final values are x is f(a) + f(c) and y is f(b). But in a parallel execution the lines can be executed in any order, and if the order is three, two, one, we get a completely different value than in the sequential execution: x ends up as just f(a), because the update from line three is overwritten by line one. So allowing x to be both in and out leads to nondeterministic results. If we don't allow in-out, and allow only in or out, the same code is written this way, where x is only an out variable and y is also an out variable, and then the sequential execution and any parallel execution will give exactly the same values. Once-only assignment prevents the nondeterministic results that we had in the non-EFL code.

Then how do we implement the idea of EFL? With the EFL pre-compiler. Let's see two views here. In the implementer's view, he writes the syntax and semantics of EFL. We used a tool called JavaCC, the Java Compiler Compiler. From the syntax and semantics, JavaCC generates a platform-specific EFL pre-compiler. Until now we have two pre-compilers, and we will talk about them in later slides. In the programmer's view, he writes EFL-based host-language source code; that code is translated by the pre-compiler into parallelized host-language code, and then the host language's runtime platform runs that parallelized code. Our approach is that, given a specific EFL pre-compiler, we can write with EFL in any language, but our implementations until now are for Python.
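Returning to the once-only assignment example described above, the effect can be reproduced in plain Python by running the statements in every possible order. This is only an illustrative sketch, not EFL code: `f`, the values of `a`, `b`, `c`, the `run` and `distinct_results` helpers, and the particular once-only rewrite (`x = f(a) + f(c)`) are assumptions made for the demonstration, since the slide itself is not part of the transcript.

```python
from itertools import permutations

def f(v):
    # Hypothetical pure function standing in for the f on the slide.
    return v * 10

a, b, c = 1, 2, 3  # hypothetical "in" values

def run(statements, order):
    """Execute the statements in the given order on a fresh environment."""
    env = {"x": 1, "y": 3}  # initial values mentioned in the talk
    for i in order:
        statements[i](env)
    return env

# In-out version: the third statement both reads and writes x.
in_out = [
    lambda env: env.update(x=f(a)),
    lambda env: env.update(y=f(b)),
    lambda env: env.update(x=env["x"] + f(c)),
]

# Once-only version (assumed rewrite): each out variable is assigned
# exactly once, and only in variables are read.
once_only = [
    lambda env: env.update(x=f(a) + f(c)),
    lambda env: env.update(y=f(b)),
]

def distinct_results(statements):
    """Collect the final (x, y) pairs over all execution orders."""
    results = set()
    for order in permutations(range(len(statements))):
        env = run(statements, order)
        results.add((env["x"], env["y"]))
    return results

in_out_results = distinct_results(in_out)
once_only_results = distinct_results(once_only)
print("in-out:", in_out_results)
print("once-only:", once_only_results)
```

With these assumed values, the in-out version yields two different final states depending on the order, while the once-only version yields a single state for every order.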
First, we implemented the pre-compiler for the pool module of Python's multiprocessing package. What are the characteristics here? A Pool object is a collection of a fixed number of child processes; the number of child processes defaults to the number of cores in the computer. The Pool object mechanism serves as the scheduler of the parallel execution, and this Python module has built-in map functionality, whose importance we will see later. We modified the pool module to allow an unlimited hierarchy of non-daemonic processes, because the original Python Pool generates only daemonic processes, and that constrains the hierarchy, or nesting, of parallelism. The pool-based scheduling management is the element that allows us to manage all the scheduling of the parallel execution.

The second implementation is an MPI version, based on a module developed at the University of Montreal called DTM, which is an element inside a package of evolutionary algorithms they developed called DEAP. DTM is a Python module written using the mpi4py module; it is a layer over mpi4py. DTM allows EFL implicit parallel programming at a level of abstraction similar to what we get with the Python multiprocessing module. Even the syntax in DTM is very similar to the syntax in the multiprocessing module. DTM also has map functionality, and here the number of child processes defaults to the number of cores in the computer cluster on which we run MPI. A scheduling mechanism is also built into DTM.
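To make the pool characteristics above concrete, here is a minimal sketch of the standard `multiprocessing` pool with its built-in map. It is not the output of the EFL pre-compiler and does not include the non-daemonic modification the speaker describes; the `parallel_map` helper is a hypothetical name, and the sketch assumes a POSIX system where the "fork" start method is available.

```python
import math
import multiprocessing as mp

def parallel_map(func, seq_in, processes=None):
    """Apply func to every element of seq_in using a pool of child processes."""
    # Assumption: a POSIX system where the "fork" start method exists,
    # so this sketch runs without an `if __name__ == "__main__":` guard.
    ctx = mp.get_context("fork")
    # processes=None lets the pool choose its default size (the CPU count).
    pool = ctx.Pool(processes=processes)
    try:
        # map_async submits all subtasks and returns immediately;
        # get() blocks until every subtask has finished.
        async_result = pool.map_async(func, seq_in)
        map_out = async_result.get()
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return map_out

roots = parallel_map(math.sqrt, [1.0, 4.0, 9.0, 16.0])
print(roots)
```

The output sequence keeps the order of the input sequence even though the elements may be processed by different child processes. On platforms without "fork", the same calls work with the default context as long as the top-level code is placed under a `__main__` guard.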
Now we will focus on how we implement the parallel design patterns in EFL. We will talk about the implementation of the fork-join pattern, and these are the constructs in EFL that allow the implementation of this pattern; the master-worker pattern, with the for block; the map pattern, for which we have three alternatives here, map loop, loop and for; the reduce pattern; and the filter pattern. We will also see a construct that may be very useful but is not connected to any specific pattern, the if block.

Then the fork-join pattern. Maybe some of you know the idea of fork-join. We have here a program; up to this point it has a sequential control flow. Here there is a fork that generates n child tasks. The parent task waits until all the child tasks have finished, then it joins the partial results of the computation of all the child tasks, and then it continues its sequential control flow.

The first EFL construct that allows us to implement the fork-join pattern very easily is the assignment block. In the assignment block we have n assignments, and the right-hand sides of all the assignments are executed in parallel. When all these child tasks finish their job, all the variables receive their values, and then the program continues after the end of the EFL block. We have two examples. In the first example we have my value one equals five and my value two equals f of five. We decided that if an assignment block contains this kind of assignment, of simply a value, the assignment will not generate a child task, since that would be a waste of time; this kind of assignment is executed sequentially, as in any sequential program. And here, because we have a call to a function, this call is executed in parallel. Here we have another example, calling two compute-intensive functions. In this case both run in parallel; both generate a child
task, and when both return, my value one and my value two receive their values, and then the program continues.

Another option for fork-join is the pif block. Here we have, like in C or in Python, if, else if, etc. All the boolean expressions are evaluated in parallel, and the first one, in sequential order, that is true launches the body of its option. But there may be a problem with pif: if we have a case like this and a is zero, this may provoke a divide-by-zero exception. The danger of pif is that the programmer may not be aware that one of the conditions here, which are all evaluated in parallel, may provoke an exception.

Master-worker pattern. We implement the master-worker pattern with the for construct. The for construct looks like a regular for in C, but every instance of the body of the for is executed in parallel. Suppose we have m processors, or m cores, in the system, and n is the number of tasks that are generated. When n is greater than m, the scheduling built into the pool module of multiprocessing and into the task manager of DTM implicitly implements the master-worker pattern: at every moment we will have m processes running, and all the others are waiting to be executed. If n equals m, the pattern is actually a fork-join pattern, which is thus also implemented by the for construct.

Map pattern. The idea of map is that we have an input sequence, maybe a list, maybe a tuple, maybe an array in other languages, of length n. We have a function that is applied to all n elements of the input sequence, and the map generates an output sequence of exactly the same length, whose elements are the results of applying the function to the corresponding elements of the input sequence. Here we have the syntax of map loop: map loop receives a function and the input
sequence.

Another construct that also implements the map pattern, with a completely different kind of execution, is the loop block. In the loop block we have here a label, which is like the name of a function, and here there is a recursive call to the label. The loop block then recursively generates n instances of the call to the CPU-intensive function, according to the value of i. So here too we will have n tasks that run in parallel, but the launching is like a recursive call.

Now we will see how the EFL pre-compiler actually works. Say we have here a program that receives a list of numbers and returns the square root of every one of those numbers. The paramap function, which receives the input sequence, generates an output sequence that is the result of the map loop on seq in (it should be seq here, not seq in) and the map func; then the result is printed and the program ends. Implicitly, you see, we have parallelized the n calls of map func.

This is how it looks after translation with the pre-compiler; this example is with the multiprocessing pre-compiler. You see here we have paramap. Here, where the original EFL block starts, a pool object from our non-daemonic pool module is created, a multiprocessing manager is created, and a manager queue is created. Then, from the pool, the asynchronous map method is called with map func and the input sequence. After all the subtasks have been launched, the get method waits until all the subtasks have ended, and map out receives the output sequence. Then we close the pool, we join all the subtasks that were created, and we return map out. You see, the programmer doesn't have to deal with all this complex multiprocessing code; he wrote it very simply, as in a
sequential program.

Reduce pattern. The reduce pattern works as you see here in the slide. We have an input sequence and a function, which should be an associative function, like plus or multiply. Every two adjacent elements are passed to the function, here and here, and we get a temporary sequence of the results of the first level. Then every pair of those is passed to the function again, and so on, until we receive the reduced value, the result of the whole reduce. For that we have a construct called log loop. It receives the input sequence and the associative function, which needs to take two parameters to be able to do what you see here in this picture. Here we have the algorithm of the log loop; in the questions, if you like, we can analyze the algorithm.

Now here you have an example. We have a list of eight numbers, and we call log loop with l and the add method of Python's int class. What happens when it runs? One plus two is three, three plus four is seven, etc. In the next round, three and seven is 10, 11 and 15 is 26, and then we have the result, 10 plus 26. That is the way log loop works. But we can also use the add of the list class or the add of the string class, and so we can do reduce with every kind of value.

The filter pattern. The filter pattern is implemented using map. The map function essentially uses a boolean function: every element for which the boolean function is true is returned unchanged, and for all others the map function returns None. The result, in map out, is that in all the places in the sequence where the boolean function was true the element is passed to the output, and all the others are None, and
then you see that after the EFL block we have a list comprehension that gets rid of all the elements in map out that are None, so in seq out we have only those for which the boolean function is true. In that way we can implement the filter pattern using EFL.

The if block is like a sequential if: in this case, unlike in pif, every boolean expression is evaluated sequentially, and the body of the first one that is true is executed. But the whole EFL block is executed in parallel with the rest of the program.

Now we will see two EFL programming examples. The first uses an assignment block and an if block. You see here that in the first block we have a and b, which for that block are out variables that will hold the result of f on x and the result of g on x. Then we leave that block, and in the next block we can use a and b as in variables: in the second block we can read from a and from b, because there a and b are in variables, not out variables.

The second example also shows how easily we can implement the nesting pattern. Nesting means that we have n child tasks, and every child task also launches parallel tasks. We have here a 2D matrix and a vector. In the main function we scan the matrix by rows with the for inside the EFL block, and every row is passed to mult to compute, also in parallel, the product of every element of the row with the corresponding element of the vector. So we see that we have a nesting of instances of the master-worker pattern: in main the for loop iterates over the rows of the matrix, and in the nested one it iterates over the items of the row.

Conclusions. Two EFL pre-compilers were implemented, safe and efficient parallelism has been made possible by the EFL framework, and parallel
design patterns have been shown to be implementable using EFL. In the electronics department of our college, two students built this cluster of 64 Raspberry Pi processors, and after we go back home we will try to test EFL scalability with that 64-node Raspberry Pi cluster, with every one of them running the MPI version of EFL.

Further work. We are developing an EFL curriculum to teach how to implement serial and parallel algorithms, flexible algorithms, with EFL. We also want to implement concurrent data structures using EFL, and to answer the question: are purely functional data structures EFL compatible? And in a joint project with Professor Miroslav Popovich from Serbia, we are rewriting an algorithm for predicting the structure of proteins, called deep sum, using EFL and software transactional memory.

And then the invitation. Like in the lightning talks here, we invite you to join us, to collaborate with us in redesigning EFL with Python-like syntax, and so make happy the whole crowd here at the Python conference by having an EFL version that is completely, purely Python. We also want to implement new versions of the EFL pre-compilers for other parallel programming platforms and other host programming languages, and to implement the whole basic kernel of parallel design patterns using EFL. We also have to answer the question: are there patterns that cannot be implemented within the EFL framework? We have to research that.

To end: our laboratory is the FlexComp Lab. Here you have the website; you are invited to visit it and download the EFL installation kit. The faculty in the laboratory are me, David Diane, who is here in the audience, Dr. Raphael Jeheskel and Dr.
Simon Mizrahi from the electronics department, whose students built that Raspberry Pi cluster. And those are our students; some of them have already graduated. Elad and Moshe and Bosny Levy and Naaman are finishing the development of the MPI version of the pre-compiler. Miroslav Popovich from the University of Novi Sad is our European research partner. And here you have a picture of our campus, the Lev Academic Center, which is also called the Jerusalem College of Technology. Thank you very much. Questions?

Thank you. Yes, we have time for a few questions.

Questions? Nobody wants to collaborate with us on our project?

Yes. So I was wondering: I know you use C, but it seems like a hardware description language might be a better syntax for the EFL parts. Do you know of any use of something more like a hardware description language, or did you just use C because that's more familiar to computer programmers?

We developed EFL especially because we ourselves felt that it was not easy to program directly, explicitly, with the parallel tools that exist in Python or in other languages. So if we design a layer over the parallel tool of the specific language, it will make it easier for the programmer to program. We can ask whether the code generated by the pre-compiler will be as efficient as if we wrote it by hand. That is the same question as at the end of the 1960s, when people wrote operating systems in assembly, and Ritchie and Kernighan at Bell Laboratories developed the C language and argued that we can implement an operating system in a high-level language. And that is what we have today: nobody writes an operating system in assembly. Here too, we would like to allow everybody to write parallel code very easily, and so utilize all the computing power that we have now in multicore machines and clusters of computers. Did I answer your question?
OK. Any other questions?

Yes. I didn't quite understand whether there is already some sort of Python module, in some beta version, or whether you are going to start implementing it from scratch. What is the current state?

We can consider that idea, but our idea, at least for now, is to have a language that is embedded in a host language and uses all the possibilities of the platform. Maybe in the future we will think about transforming EFL into a Python module. That is what you are asking, right? That EFL would be a module in Python? We may think about that, but as I said in the beginning, our idea is to have a universal solution for parallel programming, anywhere, in any language. But we can think about the possibility that EFL will be a module for Python.

So we have time for one more question. If there are no more questions, please thank our speaker again.

If you want to talk with me after the lecture, please, I am here.