My name is Paritosh Pandya. Welcome to this session. Our speaker for the next talk is Soham Chakraborty, an assistant professor at IIT Delhi who recently returned to the country — certainly a find for us. Soham did his master's at IIT Kharagpur and then his PhD at the Max Planck Institute; before starting his PhD he worked for several years in industry, so he brings both perspectives. He works on the very interesting area of weak memory models in real programming languages and real compilers, and on finding solutions to the problems there. I'm sure we can look forward to a very good talk. Welcome.

Thank you, Professor Pandya, for your kind introduction. Today I'll talk about the work I did during my PhD with my advisor, Dr. Viktor Vafeiadis, on validating optimizations of concurrent C/C++ programs. Let's start with an example concurrent program. I'm not using the exact C/C++ syntax but a pseudocode version of it. In this program X and Y are shared integer variables initialized to 0, and the program has two concurrently running threads. The first thread sets Y to 4 and then X to 1. The other thread, when X reads a non-zero value, reads Y into the variable R. A programmer writing this program expects that when X reads the value 1, R will hold the value 4 at the end of the execution. However, note that there is a data race on the accesses of X, and as a result, according to C semantics, this program's behavior is undefined. The definition of a race is two concurrent accesses to the same location where at least one of the accesses is a write operation. So in this case C says that even if X reads the value 1, it is still possible that R is not 4, and hence the program is wrong.
To write such a program correctly, with defined semantics, the C/C++ 2011 specification introduced atomic accesses. With these primitives, an integer location is declared as an atomic integer, and such locations are not accessed directly but through specialized functions such as atomic store and atomic load. For example, the atomic store takes, along with the memory location and the value to be written, another parameter called a memory order; in this case it uses memory_order_release for the store operation. On the load side, the atomic load similarly takes the memory location and a memory order, here memory_order_acquire. C/C++11 defines various rules for these atomic accesses. First, a data race on atomic accesses is allowed and has defined behavior. Second, for a store with memory_order_release, the compiler and hardware must ensure that all accesses before the release have completed when the release store takes place. Symmetrically, for an atomic load with memory_order_acquire, the compiler and the underlying hardware must ensure that the load finishes before any access that comes after it. Third, when an atomic acquire load reads from a release store, it establishes a synchronization between the two accesses, which means there is a happens-before relation between the release store and the acquire load.
As a result, everything before the release store happens before the store, the store happens before the acquire load, and the acquire load happens before everything after it. In particular, the store Y = 4 and the read of Y are no longer concurrent, because there is a happens-before relation between them. The program therefore has defined behavior, and in this case the read of Y returns 4 as the outcome. So this program has defined semantics and gives the programmer the intended meaning. Since the full syntax is quite clumsy, from now on I will use a simplified syntax where the release and acquire memory orders are attached directly to the accesses.

Now let us modify this program a bit: initially the location R holds the value 4. At the end of this program, if the acquire load of X reads 0, then R stays 4; otherwise R reads from the first thread's write of Y, where it also gets the value 4. So in this program R always ends up with the value 4. Now suppose a naive compiler compiles this program and, finding that the accesses in the first thread are independent, reorders them. Once they are reordered, we can find an execution where the release store to X takes place first, then the acquire load reads it, and then Y is read before Y = 4 has taken place, so it reads the initial value 0. This introduces a new outcome in the target program that was not there in the source program, so the transformation is wrong; moreover, it has introduced a data race on the location Y, which is also wrong. The key takeaway of this example is that optimizations which are correct for sequential programs may not always be correct for concurrent programs.
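The message-passing idiom above can be written in real C++11 syntax roughly as follows. This is a minimal sketch: the spin loop stands in for "when X reads a non-zero value", which makes the outcome deterministic, since under release/acquire semantics the read of y must return 4.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Shared state, as in the example: y is plain data, x is the flag.
std::atomic<int> x{0};
int y = 0;

void writer() {
    y = 4;                                  // plain write
    x.store(1, std::memory_order_release);  // release: y = 4 is visible first
}

int reader() {
    // Spin until the acquire load sees the release store.
    while (x.load(std::memory_order_acquire) == 0) { }
    // The release/acquire pair synchronizes, so y = 4 happens-before this read.
    return y;
}

int run_once() {
    x.store(0);
    y = 0;
    std::thread t(writer);
    int r = reader();
    t.join();
    return r;
}
```

Without the memory orders (plain non-atomic accesses), the same program would be racy and its behavior undefined, exactly as described above.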
What we require, then, is to carefully analyze concurrent programs as they are compiled. Consider a state-of-the-art compiler like LLVM, which takes a program, performs a series of optimizations, and finally generates target code. It is important to know whether the transformations LLVM performs — originally written for sequential programs and later adapted for concurrent programs — are done correctly. To address this problem we ask two questions: first, which transformations are allowed for a concurrent C/C++ program, and second, does a compiler perform only the correct optimizations? To address the first question we studied the C/C++11 relaxed memory model, and in a POPL'15 paper we came up with a set of transformations that are allowed in the C/C++11 memory model. We specifically looked at two types of transformations. One is reordering transformations, which optimizations often perform to improve temporal or spatial locality. The other is access elimination transformations, which are performed by, say, constant propagation, common subexpression elimination, and so on. By studying the model we came up with a list of transformations and proved their correctness. For example, as I said, if there is a release write, no operation before that release write may be moved after it. For an acquire read it is the opposite: no access after the acquire may be moved before it. Other accesses, for example the non-atomic accesses I showed, can be reordered freely. So, given two accesses to two different, independent locations, we can move a non-atomic access to after an acquire access, but we cannot move it to before one.
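These reordering rules can be summarized as a small predicate over adjacent access pairs: "a; b" (on distinct locations) may be swapped to "b; a" unless a is an acquire (b would move before the acquire) or b is a release (a would move after the release). This is my simplified sketch of the rules described above; it ignores same-location pairs, SC atomics, and the other cases the POPL'15 analysis covers.

```cpp
#include <cassert>

enum class Access { NonAtomic, Acquire, Release };

// May the adjacent pair "a; b" (on distinct locations) be reordered to "b; a"?
// "Roach motel" rule: nothing moves up past an acquire,
// nothing moves down past a release.
bool can_reorder(Access a, Access b) {
    if (a == Access::Acquire) return false;  // b would move before the acquire
    if (b == Access::Release) return false;  // a would move after the release
    return true;
}
```

For example, the predicate allows moving a non-atomic access to after an acquire, but not to before it, matching the rule stated in the talk.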
Similarly, we studied which elimination transformations are correct. For example, if there are two read operations on the same location, reading values into t and t', the second load can be eliminated and the value read into t reused. We did the same for read-after-write: if a value is written to a location and then read back, we can simply do constant propagation, and this is safe; we proved all of these results. Finally, the overwritten-write case: if there are two write operations to the same location, the first can be eliminated safely without introducing any new program behavior.

Once we had this set of transformations, in the CGO'16 work we checked whether a compiler like LLVM indeed follows this set of safe transformations when performing optimizations. To check that, we take a number of example programs, compile them with LLVM, and then check whether the transformations were done correctly. Here is one such program, which we compiled with LLVM at -O3; let me explain it. The first thread remains as before: X and Y are shared variables and the code is as in the earlier example. In the second thread there are two accesses of Y. First we set f = false, and then there is a conditional access of Y; since f is false, this access of Y never takes place. The second access of Y takes place when the acquire read of X returns a non-zero value, which is only possible if it reads from the release write; in that case Y is read, as in the earlier program, from the write Y = 4. So in this program B always ends up with the value 4: either the read goes through the synchronization, or otherwise B is 4 anyway.
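A C++11 rendering of this test program might look as follows. This is a sketch reconstructed from the description: the variable names a and b are my guesses, and the spin loop replaces the conditional on the acquire read so that the outcome is deterministic. In the source program, b must be 4 in every execution.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
int y = 0;

void thread1() {
    y = 4;
    x.store(1, std::memory_order_release);
}

int thread2() {
    bool f = false;
    int a = 0;
    if (f) a = y;   // dead in the source: f is always false here
    (void)a;
    // Spin instead of "if (x.load(acquire) != 0)" so the result is fixed.
    while (x.load(std::memory_order_acquire) == 0) { }
    int b = y;      // synchronized read: must see y = 4
    return b;
}

int run_once() {
    x.store(0);
    y = 0;
    std::thread t(thread1);
    int b = thread2();
    t.join();
    return b;
}

// The miscompilation reported in the talk happened in two steps:
//   (1) speculative load: hoist "a = y" above the branch, i.e. s = y; if (f) a = s;
//   (2) redundant-load elimination: reuse s for b, i.e. b = s;
// After both steps, b = 0 becomes observable, a behavior the source forbids.
```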
We take this program, compile it with LLVM, and obtain a target program in which B = 0 is a possible outcome. Let us look at what happens in the target. The read of Y is hoisted out of the conditional by a speculative load, and its value is stored in a location S; the compiler then finds the second read operation redundant, eliminates it, and uses the value of S directly for B. Now we can get an execution where Y is first read as 0, then the other thread runs, and then the acquire read of X reads from the release write; but since S supplies the value of B, B gets the value 0. As a result, the LLVM compilation introduces a new behavior, which is wrong in terms of C11 semantics.

Why does this bug happen? To understand that, we look at the intermediate steps of the compilation and find that it happens in two steps. In the first step, as I said, the read is speculatively moved out of the conditional; in the second step the compiler removes the second read operation and reuses the value read in the first. The C11 model tells us which of these two transformations is correct and which is wrong. According to C11, the first transformation is wrong: earlier, the first access of Y could never happen because it was under a false condition, but the speculative load has moved the access out of the conditional. This introduces a data race, and introducing a data race means introducing undefined behavior into the program. C11 says the second transformation is correct, for the following reason: there are two conditional accesses of Y.
If these two reads of Y are guaranteed to see the same value, then eliminating the second read of Y is justified. If they can see different values, then the write from which the second read takes its value is in a data race with the first access of Y; the source program of this transformation is then racy, and for a racy source program any transformation is justified. We reported this bug to the LLVM community, but they differed from this interpretation. According to LLVM, introducing a racy read is justified, and a program with a read-write race has defined semantics: the read simply returns an arbitrary value, what they call an undefined (undef) value. So according to LLVM, hoisting the access out of the conditional is justified, because S is used only under the condition, and if f is false the value is never used in this program. Since the overall transformation is still wrong, LLVM instead blames the second transformation: in the intermediate program there are two accesses of Y, the first of which yields an undef value while the second read of Y was reading a properly synchronized value, and replacing a synchronized value with an undef value is incorrect. When we asked LLVM to fix this problem, they restricted the second transformation but allowed the first transformation to continue.

To catch such errors we developed a translation validator, which checks automatically whether a compilation is correct. We reduce the problem of validating a transformation to checking whether the target program can be generated from the given concurrent C/C++ program using only a combination of correct reordering and correct elimination transformations.
As with the C11 model, for the LLVM model we also identified a set of correct transformations, and based on that we developed our validator. To do this automatically we built two components. One is a test-case generator, which generates arbitrary concurrent C/C++ program snippets that we compile with the compiler. We then keep LLVM's intermediate representations throughout the compilation and run the validator to compare successive intermediate representations, checking whether each transformation has been done correctly. If so, the validator reports correct; otherwise it shows where the compilation went wrong. Using this approach we exposed a number of compiler bugs in LLVM 3.6; we reported them to the community, who acknowledged most of the bugs and fixed them in the next version of LLVM.

The validator compares two programs using one of two approaches. The first we call compiler-independent matching: it treats the compiler as a complete black box, never looking at what the compiler is doing internally, and accesses only the intermediate representations. This technique can be used with other compilers and transformations as well. The other approach we call the metadata-based approach, which uses a particular feature of the LLVM compiler called metadata. LLVM uses metadata to attach information to instructions in the intermediate representation — it is used, for example, for debugging — with the rule that metadata cannot affect or influence optimization decisions. We use this metadata to annotate the shared memory accesses, and later use those annotations as a witness of the compiler's transformations.
These annotations let us validate the compiler's transformations. This approach is more or less LLVM-specific, but if other compilers introduce such a concept, it should be possible to port it to them as well.

Now, how do we do the compiler-independent matching? As the first step, we identify the corresponding program paths in the source and the target program. Then, for each corresponding path pair, we find the access sequences — the shared memory accesses that take place along each path — and we compute the deletability of the accesses. Deletability says whether removing a particular access from a particular sequence is correct. Finally, we match the access sequences of the source and the target and check whether the matching has been done correctly.

Let us look at an example. Consider a straight-line program where X, V, Z are shared locations and S1, S2, S4 are thread-local locations, and assume that all the thread-local values used here are also used later in the program. We run the deletability analysis on this sequence. The first access of X is non-deletable: it has to remain when we transform the program, because the value of S1 is used later. The second access of X can be deleted, because instead of S2 we can use S1 directly. Then we check the deletability of the synchronization access, the acquire of Z. We do not allow deleting synchronization accesses like acquires or releases — and LLVM does not delete them either — so it has to stay. For the write operations we go the other way around, from the bottom towards the top. Y = 2 has to be preserved, because it is the final value of Y in the program.
The earlier write of Y can be deleted because it is overwritten, and the access of V provides the final value of V, so it has to stay. Once we have marked these accesses — for instance, the first access of X appears here and S1 is used later; I am showing only the accesses, not the later uses — note that even if S2 is used later, its access can still be deleted, because S1 can be used in its place. Now suppose we get a target access sequence in which some of these accesses have been deleted and some moved around. We want to match the two access sequences. First we match the synchronization accesses, the acquires; then we match the writes from bottom to top: we match Y = 2, then the access of V. Once the write operations are matched, we match the read operations from top to bottom: we match the access of X with T1, and since all the accesses in the target are now matched, the matching stops there. At that point we check whether the matching is correct, by checking two things. First, are the unmatched source accesses deletable? Here the second read of X and the first write of Y are unmatched, but since they are deletable accesses, these deletions are considered correct. Second, are the reorderings implied by the matching allowed? Here the access of V has moved to after the acquire of Z, and moving any access to after an acquire is correct. As a result, we declare that this matching is correct.

Now let us look at what happens when the program has control flow. In that case we find the corresponding path pairs, as I mentioned earlier.
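The deletability marking just described can be sketched as a pass over an access sequence: a read is deletable if the same location was read or written earlier with no intervening synchronization; a write is deletable if it is overwritten later with no intervening synchronization or read of that location. This is my simplified sketch and omits many side conditions of the actual validator.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { Read, Write, Acquire, Release };

struct Access { Kind kind; std::string loc; };

// Mark which accesses in a straight-line sequence may be deleted.
std::vector<bool> deletable(const std::vector<Access>& seq) {
    std::vector<bool> del(seq.size(), false);
    // Reads, top to bottom: deletable if the location was read or written
    // earlier with no synchronization access in between.
    for (size_t i = 0; i < seq.size(); ++i) {
        if (seq[i].kind != Kind::Read) continue;
        for (size_t j = i; j-- > 0; ) {
            Kind k = seq[j].kind;
            if (k == Kind::Acquire || k == Kind::Release) break;
            if (seq[j].loc == seq[i].loc) { del[i] = true; break; }
        }
    }
    // Writes, bottom to top: deletable if overwritten later with no
    // synchronization and no read of the same location in between.
    for (size_t i = seq.size(); i-- > 0; ) {
        if (seq[i].kind != Kind::Write) continue;
        for (size_t j = i + 1; j < seq.size(); ++j) {
            Kind k = seq[j].kind;
            if (k == Kind::Acquire || k == Kind::Release) break;
            if (seq[j].loc == seq[i].loc) {
                if (seq[j].kind == Kind::Write) del[i] = true;
                break;  // a read of the location also blocks deletion
            }
        }
    }
    return del;
}
```

On the talk's example sequence — read X, read X, acquire Z, write Y, write V, write Y — the pass marks exactly the second read of X and the first write of Y as deletable.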
In this program A, B, C, D, E, F are the basic block names and F1, F2 are branch conditions; in the target program, say, the blocks D and E have been merged into block C. We then find the path pairs: the source path A, B, C, D, E, F (when both F1 and F2 are true) corresponds to the target path A, B, C, F, and similarly the source path where F2 is false also corresponds to A, B, C, F. The other corresponding path pairs arise when F1 is false: there the source path A, C, D, E, F is matched with A, G, C, F in the target program. Once we have these path pairs, for each of the four pairs we perform the same access-sequence matching, and if we find an error in at least one path pair, we report that the entire transformation is wrong. To match the path conditions we use the Z3 SMT solver.

When the program has a loop, our approach unrolls the loop body a fixed number of times and then performs the same path-pair discovery and matching analysis. Of course, in the presence of loops this is not fully sound, since I only do a bounded number of unrollings; but since translation validation in the presence of loops is undecidable, I resort to this kind of unsound but effective bug-finding approach.

Audience: How do you know which paths to pair up? Speaker: I have the conditions for each of the basic blocks. Audience: These are basic blocks — A, B, C, D? Speaker: Yes, they are basic blocks. Internally we do two things: one is matching the path conditions, and before that, to do it more efficiently —
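The path-pairing step can be sketched as follows. Here a path is represented just by its path condition, a set of branch-condition literals such as "F1" or "!F2", and a source path is paired with every target path whose condition it entails; entailment is checked by simple literal-set inclusion, standing in for the Z3 query the real tool uses. This is a toy model, not the validator's actual representation.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A path is represented by its path condition: the set of branch-condition
// literals (e.g. "F1", "!F2") that must hold for the path to be taken.
using Path = std::set<std::string>;

// Does source condition s entail target condition t? Here: every literal of
// t appears in s. The real validator discharges this query with Z3.
bool entails(const Path& s, const Path& t) {
    return std::includes(s.begin(), s.end(), t.begin(), t.end());
}

// Pair each source path with the target paths whose conditions it entails.
std::vector<std::pair<Path, Path>>
pair_paths(const std::vector<Path>& src, const std::vector<Path>& tgt) {
    std::vector<std::pair<Path, Path>> pairs;
    for (const Path& s : src)
        for (const Path& t : tgt)
            if (entails(s, t)) pairs.push_back({s, t});
    return pairs;
}
```

On the example above, both source paths with F1 true (whether F2 is true or false) pair with the single merged target path guarded by F1, matching the four path pairs described in the talk.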
Speaker: We find that in LLVM's transformations the names of the basic blocks are mostly preserved, because the transformations are small and often keep the block names. So that is one indicator: we can match two basic blocks by name if the conditions entering the blocks are the same. If the conditions do not match, we do not match those basic blocks. Audience: Changes may still be happening inside the blocks. Speaker: Yes, that is possible; those changes are captured when we collect the access sequences. Once we have the basic-block matching, for each path we again collect the access sequences. Audience: The point is that block B here and block B there may look very different. How do you know that this one has to be matched with that one? Speaker: The conditions incoming to B have to match as well. Audience: But the condition could also have changed. Speaker: Yes; if it changes, we do not match those blocks. Say I can still find a match between A and C; then, if the conditions for B do not match, I match A and C and find all the path pairs between A and C. We check all such path pairs, and in that case there may be some false positives.

To summarize our approach: we have seen that compiler optimizations of concurrent programs require careful analysis, and using our approach we found a number of compiler bugs that had crept into the LLVM compiler. We developed a validator, which was artifact-evaluated at CGO'16; if you are interested, you can try it — it is available online. To give some idea of future work: our approach checks the sequences of concurrent shared-memory accesses, but assumes that the purely thread-local transformations are correct.
So we need a sequential validator to check, alongside the concurrent accesses, whether the entire program transformation has been done correctly; integrating our validator with a sequential validator would probably be a good idea. Also, the way we presently handle loops is not very efficient. There is much traditional work on translation validation for loops, and borrowing those ideas would be interesting here. In addition, it would be possible to handle more advanced language features — for example arrays and pointers, which I presently do not handle in the test cases, because the goal is really to find compiler errors rather than to validate realistic code. If we want to handle realistic code with concurrency primitives, we need to support these extra features, and I believe SAT and SMT solvers will be very useful there. If you have ideas about this, please feel free to suggest and discuss — I will be happy to. With this I conclude my talk. Thank you for your attention.

Audience: Is there something specific here to the concurrent features of C/C++, or can this be used for other compiler optimizations? Speaker: This can be done for other languages too — say Java, which also has memory-access primitives. Audience: No, I am asking whether I could use this to verify traditional compiler passes, say loop splitting or code hoisting, which do not necessarily preserve the control-flow structure. Speaker: I think transformations like code hoisting are possible up to an extent: you would see that one access has moved outside the loop, and the matching should then reflect that the target has introduced an access where there was nothing in the source program.
Speaker: In a loop we conservatively consider whether the loop is taken or not taken; when the loop is not taken — Audience: The tool addresses those cases? Speaker: Yes. Audience: So it validates all the optimization passes? Speaker: Yes, it goes through the entire set of LLVM architecture-independent optimization passes and checks all of them. There are some restrictions in terms of the kinds of programs it can capture: the test cases I generate do not contain all the constructs that would trigger every loop optimization in LLVM, so certain cases are not exercised; beyond that, the validator can check whether the transformations are done correctly. Audience: But it is an undecidable problem, right? Speaker: Right — that is why I say the approach is unsound up to a point: in the presence of loops I do not look at the entire unfolding of the loop. When there is a loop, I do not guarantee that my validator can definitively say there is no error. Audience: So is it possible that your analysis says there is an error when there is none, or the other way around? Speaker: The other way around — it is an under-approximation in the presence of loops. But since I unroll a few times, and since this is done only for scalar variables, we should not be missing any new path on which the transformation goes wrong. We generally unroll loop bodies by a factor of 2 and check all the path combinations, which tells us the possible orders of the accesses; but I do not claim that the approach is sound.
Audience: In secure information-flow analysis you get leakage of information based on control-flow structure. I was wondering whether, every time you find some kind of race condition that causes problems, it would be a good generator of candidate information-flow leaks. Speaker: That is possible — given a suitable assignment of security levels to those variables, you would indeed always get a problem. Audience: So this approach can perhaps be leveraged for other kinds of properties as well. Speaker: For leak detection or similar, I would want to attach a data-flow analysis, which is not presently done here. As I said, if we had a sequential validator — a thread-local, data-flow-based analysis — then it would be possible to capture such properties too, and I believe it is feasible. Audience: How can you say you have no data-flow analysis? You said that if a value is used you will not delete the access, and delete it only if it is unused — that is some data-flow analysis. Speaker: Yes, but it is done in a very limited way. The programs are generated so that operations on thread-local variables happen immediately after they are read, and at the end the function returns something computed from those local variables, so the values are used; the programs are manufactured carefully so that this does not affect the actual concurrent accesses. But I agree that for realistic programs we should do better.

Moderator: There are no questions from that end, so let us thank the speaker for a lively talk and move on to the next talk.