Hello everyone. Good afternoon. I hope you're enjoying DEF CON as much as I am. My talk is about Androsia, which is a tool for securing data in process for your Android applications. If I had to explain what Androsia does in a couple of lines, it basically gives you the features of a proactive garbage collector. Garbage collectors are inherently lazy: they only kick in when your application is memory constrained, and they reclaim memory only from objects which are unreferenced. With Androsia, you will be able to reclaim memory from objects immediately after their last use. This is helpful because if you have a lot of sensitive information lying around in your heap memory, then as soon as it is last used, Androsia kicks in and replaces the memory contents with default values, so your sensitive data does not remain in memory anymore.

Before we begin, a little bit about myself. I'm Samit. I work for the product security team at Citrix. I'm a web and mobile application security enthusiast, and I've spoken at a bunch of conferences including Black Hat Asia, AppSec USA, c0c0n, CODE BLUE, IEEE SERVICES, MOBILESoft and CloudCom. There's a little bit of contact information here in case you want to reach me.

Starting off, the first question for you is: which one of these is the most difficult to protect? Number one, data at rest; number two, data in process; or number three, data in motion? I would argue that it's number two. Why? Because the market has the fewest solutions for number two. This is a very gray area. EMM providers do not really understand how to protect data in process. Data at rest is simple: you encrypt it with a cipher. Data in motion can be protected using TLS. Data in process, however, is still a mystery for most of us. This tool targets number two, and we'll see how. Let's go ahead.
Java, or Android itself, creates a lot of objects on the heap, right? Those objects may contain a lot of sensitive information, which may include your authentication credentials, authorization tokens, encryption and decryption keys, PINs, OTPs, and personally identifiable information. While the application is executing, all of these sit in your heap memory, and developers want to get rid of that sensitive information as soon as it is last used in the program. The myth is that the garbage collector will eventually collect it. But as I said, the garbage collector is very lazy. It will not collect your data until it has to, that is, until your application is memory constrained. And even when it does, its scope is very limited.

This is what is called an object tree on the heap. Garbage collectors do their collection using the mark-and-sweep algorithm. It starts from the root nodes, which are called GC roots, traverses the object tree from the GC roots using depth-first search, and keeps marking all the objects that are reachable. In the sweep phase of the collection process, the garbage collector finds all the objects on the heap which are unreachable, and it collects only those objects. When developers write code, they may forget to unreference some objects, such as the ones in the circle here. The garbage collector will not collect this set of objects, simply because they are reachable from the GC roots. These objects may contain sensitive information, and they will lie around for the whole lifetime of your application, until the application is terminated. This is where Androsia kicks in: it cleans out all the unused but reachable objects in your memory.
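As a small illustration of the "reachable but unused" problem just described, here is a minimal Java sketch. The names ReachableButUnused and SESSION_CACHE are made up for this example and do not come from Androsia. The wipe method mirrors the delete(0, length) call that the talk describes later; whether the backing array is physically overwritten depends on the runtime's StringBuilder implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: none of these names come from Androsia itself.
class ReachableButUnused {
    // A static field acts as a GC root, so everything it references stays reachable.
    static final Map<String, StringBuilder> SESSION_CACHE = new HashMap<>();

    static void login(String user, String password) {
        StringBuilder secret = new StringBuilder(password);
        // The program never reads this entry again, but it never removes it either.
        SESSION_CACHE.put(user, secret);
    }

    // Empty the builder in place instead of waiting for the collector.
    static void wipe(StringBuilder sb) {
        sb.delete(0, sb.length());
    }

    public static void main(String[] args) {
        login("alice", "hunter2");
        System.gc(); // only a hint; the entry is still reachable, so it cannot be reclaimed
        System.out.println("still readable: " + SESSION_CACHE.get("alice"));
        wipe(SESSION_CACHE.get("alice"));
        System.out.println("length after wipe: " + SESSION_CACHE.get("alice").length());
    }
}
```

The point of the sketch is only that reachability, not usefulness, decides what the collector may reclaim, which is why the clearing has to be done explicitly.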
To summarize what I've been talking about until now: reachable and unused StringBuilder objects may contain sensitive information. Throughout this presentation I'm going to talk specifically about StringBuilder objects as an example, but the framework is extensible to any other kind of object as well. So reachable and unused StringBuilder objects may contain sensitive information, and a heap dump or an application compromise will let an attacker reveal that information easily. So don't just rely on the garbage collector to collect it. Instead, destroy the critical data in your application by overwriting it.

With that in mind, we do have some solutions provided by Java libraries which allow you to destroy objects created on the heap. One of them is the KeyStore.PasswordProtection class, which provides a destroy method that can be used to destroy the object's content. But again, it is up to the developer to use this API. They may call it at a very late stage in the program, or they may not call it at all. Ideally we want to automate this process. We don't want to rely on the developer to destroy our data. Our data should be in our hands, not the developer's hands. This is where Androsia again automates the process and instruments code to destroy any object that holds critical information.

So how does Androsia help? Androsia uses static code analysis to determine the last use of objects at a whole-program level. To be specific, it uses a summary-based interprocedural data flow analysis. There are three key terms here: summary-based, interprocedural, and data flow analysis. The power of static code analysis is not just grepping data and telling you what an Android application's permissions are, or that kind of thing.
It comes from data flow analysis. Every statement in your program has some data to contribute. That data needs to be propagated across all the statements in your application, and every statement can add data to, or remove data from, the data flow sets as they pass through. That's how powerful data flow analysis is. It helps you perform any sort of data-related analysis, including taint analysis. That is where the power of static code analysis is best used.

What I mean by a summary-based interprocedural analysis is that this is a whole-program analysis. It does not just analyze one specific method. When a method is analyzed, the results can be cached in a summary. The reason is that a particular method, let's say foo, can be called from multiple locations. If you have already computed the summary for foo, you want to reuse it instead of computing it again and again whenever the method is called. So we use the summary as a cache for the analysis results of a specific method. And after we have determined the last usage point of a particular object, we instrument bytecode to clear its memory content.

To give you a very simple example, I have statements numbered 1 to 10 here. There is a definition of variable x at statement 3, and then a use of x at statement 5. Between statements 3 and 5, since the variable is defined and then used, I can say that the variable is live. Before statement 3, x was neither defined nor used, so it is said to be dead between 1 and 3. At 7, x gets redefined, and it is used again at 8. So between 7 and 8 the variable is live again. However, between 5 and 7 the variable is not live: it has been defined, but that value is not used again before the redefinition.
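The ten-statement example above can be replayed with a few lines of Java. This is only a sketch for straight-line code: it walks the statements from bottom to top, and the class name LivenessTimeline and the choice to report "live on entry" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LivenessTimeline {
    // Statement numbers at which x is defined or used, as in the example above.
    static final int[] DEFS = {3, 7};
    static final int[] USES = {5, 8};

    static boolean contains(int[] xs, int v) {
        for (int x : xs) {
            if (x == v) return true;
        }
        return false;
    }

    // Returns the statements (1..n) on whose entry x is live, for straight-line code.
    static List<Integer> liveOnEntry(int n) {
        List<Integer> live = new ArrayList<>();
        boolean liveBelow = false;                      // nothing is live after the last statement
        for (int s = n; s >= 1; s--) {                  // walk from bottom to top
            boolean liveHere = liveBelow;
            if (contains(DEFS, s)) liveHere = false;    // a definition ends the span above it
            if (contains(USES, s)) liveHere = true;     // a use makes x live on entry
            if (liveHere) live.add(0, s);               // keep the result in ascending order
            liveBelow = liveHere;
        }
        return live;
    }

    public static void main(String[] args) {
        System.out.println(liveOnEntry(10)); // prints [4, 5, 8]
    }
}
```

Statements 4 and 5 correspond to the span from the definition at 3 to the use at 5, and statement 8 to the span from 7 to 8; everything else is dead.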
This is just to give you an idea of what the liveness of a variable looks like. We're going to use it further ahead when we see how Androsia really works, how it computes the liveness sets, and how it then infers the last usage point. Basically, a variable is live within the span between its definition and its use.

This is a snapshot of a heap dump taken using the Eclipse Memory Analyzer Tool. I had some sample code with a static field called staticSecret that contained a password, and it shows up in the heap dump. After the instrumentation, staticSecret does not contain any password at all, because Androsia has successfully removed the password from heap memory.

To give you an overview of where Androsia fits in: a user can provide an application or source code to the Androsia server. The Androsia server then unpacks the application to Dalvik bytecode, and the Dalvik bytecode gets converted into Jimple code. Jimple is an intermediate representation. Just like Smali, it is an intermediate representation, and it takes the good from both worlds: the good from Dalvik bytecode as well as the good from Java as a high-level language. Jimple is basically Java made simple, so they call it Jimple. Jimple has only 15 different kinds of statements. Java has a lot of complicated structures and different kinds of statements, and even Dalvik bytecode has around 200 different opcodes, so it is easier to analyze when you have just 15-odd statements to play around with. This Jimple code is then analyzed by the analysis that we have embedded. Then the code instrumentation happens, and we convert the Jimple code back to Dalvik bytecode.
And then the Dalvik bytecode is packaged again and signed, or we could just provide the analysis results to the user. So this is the entire flow of how you will eventually use Androsia.

An important fact I should also mention here is that the framework called Soot, which is the backbone of Androsia, can be used to perform any sort of data flow analysis, not just the analysis that I have done. I'm going to talk about Soot next; you can build your own data flow analysis on top of it. Soot is basically a framework for static analysis of Java bytecode, and it can be used to implement your own analyses as well. It provides a three-address code representation called Jimple, and it looks something like this. For example, I have an if statement here. In any line of Jimple code you will not see more than three operands; here you have r1, null and label0. So it is very simple to read. It's not like Java, where you can have a lot of complicated structures in one statement. This diagram shows that the Soot framework takes in Java source code or class files and converts them into the Jimple three-address intermediate representation. Then you can perform your analysis or optimization and convert it back to class files.

As I mentioned, we are going to do an interprocedural analysis; Soot also gives you the ability to perform intraprocedural analysis. Until recently, Soot was missing a Dalvik-to-Jimple transformation module. That void has now been filled by a plugin called Dexpler, which lets you transform Dalvik bytecode to Jimple code, so it is easier to analyze any Android application. There's one more tool that I want to talk about here, which is called FlowDroid.
FlowDroid allows you to generate a dummy main method, because unlike Java applications, Android applications do not have a main method. To start your data flow analysis you need a starting point, and this dummy main method acts as that starting point. The dummy main method connects all the Android lifecycle callbacks. Since Android applications are event driven, they can be invoked through any callback method, so there is no specific method where you can say, this is where my starting point will be. FlowDroid plugs every Android lifecycle callback into your starting method, which is the dummy main method. That way you can start your data flow analysis from a single point.

Objects can exist in various scopes. They can exist as local variables inside a method. For example, I have the variables x, y and z here, which eventually hold StringBuilder objects, and their scope is limited to this method foo only. They are not used anywhere outside this method. This is a simple example that will be a walkthrough example throughout the presentation, so it will keep recurring. I have a variable x here, which I'm instantiating with a StringBuilder object, so x refers to the StringBuilder which contains the value "secret". y refers to the StringBuilder which contains the value "password". And there's some logic here, which I've added just for the sake of explaining things.

So the scope could be that of a local variable, or your objects could exist as static fields of a class. That is completely different, because the scope of a static field is the entire program, not just a specific method.
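Since the slide code is not part of this transcript, here is a rough Java reconstruction of the two scopes mentioned so far. The bodies of foo and bar are assumptions for illustration; only the names x, y, z, foo, bar and MyClass and the values "secret" and "password" follow the talk.

```java
class MyClass {
    // Static field: reachable through the class name from anywhere in the program.
    static StringBuilder x = new StringBuilder("secret");
}

class ScopeSketch {
    // Local variables: x, y and z exist only while foo is executing.
    static int foo() {
        StringBuilder x = new StringBuilder("secret");
        StringBuilder y = new StringBuilder("password");
        StringBuilder z = new StringBuilder();
        if (y.length() < x.length()) {   // placeholder logic, not the speaker's code
            z.append(y);
        } else {
            z.append(y).reverse();
        }
        return z.length();
    }

    // Any other method can reach the static field via MyClass.x.
    static int bar() {
        return MyClass.x.length();
    }

    public static void main(String[] args) {
        System.out.println(foo() + " " + bar());
    }
}
```

The locals become candidates for clearing as soon as foo is done with them, whereas the static field has to be tracked across every method that can touch it.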
So the value x can be accessed from anywhere else using the class name, which is MyClass. This is how you access the variable x, and this access can happen in any other method, which is bar in my case. So the scope becomes the entire program. Or you can have objects as instance fields. Here you can see that I have a private access modifier on the instance field x inside the MyInstanceFieldSB class. The scope of this variable x is now limited to the scope of an object of this class, MyInstanceFieldSB.

Coming to the demo, I have three classes here. One of them is MainActivity, where my control begins, and this is where the secret is defined. Then I'm calling the useStaticField method. The useStaticField method uses staticSecret and then calls the bar method, and inside bar I'm using staticSecret again. If I asked you to point out the last use of staticSecret, it would be inside the bar method, right here. So the instrumentation should happen immediately after the System.out.println statement for staticSecret. That's correct. But there can be loops as well. This statement could be inside a for loop, and then the instrumentation point will not be immediately after the staticSecret statement. It will be right before the return statement of the bar method. If there's a loop outside the call to the bar method, then your instrumentation will happen here. Why is that? Because if you instrument within the loop, the logic of the application will break, and you don't want that to happen.

We'll see a demo of this on the next slide. This is the same code which I showed you on the last slide. I have a StringBuilder object being created which contains the value "password", and then I'm calling the useStaticField method here.
Now we'll see the definition of useStaticField. It uses staticSecret and calls the bar method. Inside the bar method we again use staticSecret, and then we have two print statements which print "hello" and "bye". So the instrumentation point should be just after the last use of staticSecret, which is on line 20. Let's run the tool on this.

This is what the dummy main method looks like; you see it in the stack trace that gets printed out. Right now the output format is set to Jimple so that we can see where the instrumentation is happening. You can also set it to Dalvik bytecode, and then you get a dex file as output. All right, we're in the onCreate method, but we need to go to the bar method. So, inside the bar method. Can we pause the video, please? You'll see that right here you have the println statement for staticSecret. This is the staticSecret reference, which is in r2, and you're printing the value of r2. Lines 30 to 34 are the code that Androsia has instrumented. What I'm doing here is getting a reference to the staticSecret field, calculating its length, and using the delete API from zero to that length to clear its memory content. So the instrumentation point right now is right after the println statement for staticSecret, which is the correct instrumentation point. Can we resume the video, please?

Now we're going to uncomment the for loop here. The instrumentation point will now not be right after the staticSecret statement; it will be right before the "bye" statement here, the System.out.println for "bye".
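To make the two placements concrete, here is a hedged Java sketch of what the instrumented logic amounts to at source level. Androsia itself works on Jimple, so this is not its output; the class and method names below are invented, and only the three steps (get the field, take its length, call delete from zero to that length) follow the demo.

```java
import java.util.ArrayList;
import java.util.List;

class HoistedClear {
    static StringBuilder staticSecret;

    // Misplaced clear inside the loop: the second iteration reads an emptied secret.
    static List<String> clearInsideLoop() {
        staticSecret = new StringBuilder("password");
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            seen.add(staticSecret.toString());               // use of the secret
            staticSecret.delete(0, staticSecret.length());   // clears too early
        }
        return seen;
    }

    // Clear placed after the loop: every iteration sees the real value.
    static List<String> clearAfterLoop() {
        staticSecret = new StringBuilder("password");
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            seen.add(staticSecret.toString());
        }
        StringBuilder ref = staticSecret;   // 1. get a reference to the static field
        int len = ref.length();             // 2. compute its length
        ref.delete(0, len);                 // 3. delete from 0 to length
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(clearInsideLoop()); // prints [password, ]
        System.out.println(clearAfterLoop());  // prints [password, password]
    }
}
```

In the first variant the program logic changes, which is exactly the breakage the tool has to avoid; in the second the secret is still emptied, just one statement later.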
That's because you don't want to instrument within the loop: in the next iteration of the loop your reset value would be used, which would break the logic of the code. So this time the instrumentation, which is on lines 31 to 33, happens right before the print statement for "bye", which is the right instrumentation point. Androsia is smart enough to figure out that there is a loop and hoist the instrumentation point outside it. That's the main idea I want to convey here, and it holds for any level of nested for loops. You can have the bar method being called from a for loop like this, and then the instrumentation point moves to just before the return of the useStaticField method. The return statement will be somewhere here in the Jimple code, and the instrumentation point will be right before it. This is useStaticField, and right before its return statement you'll find the instrumented code, from line 73 to 77. This is the part which gets instrumented just before the return statement. So you see that the instrumentation points change according to the loops you have in the code. And it need not be because of nested loops; it may also be because of recursion in a specific method. Androsia will identify whether there is recursion or a loop and change the instrumentation points accordingly.

We've seen how Androsia performs the instrumentation, but we don't yet know the logic behind it, the algorithm that is running behind Androsia. To get to that, let's first identify what information a single statement can give us. Every statement has some data or other to add, right? And the data we are concerned with here is called liveness data.
The next few slides are going to talk about the definition of live variable analysis and how we compute the summary for every method, which is going to look something like this. It is a set of two-tuples: the first element is the variable, and the second is its last usage point. So the summary of the foo method will tell you that the variable x was last used in the statement "if y.length is less than x.length". For every variable you will have one such two-tuple, and the summary tells you which variables there are and what the last usage point of each one is inside the method.

To compute the summary, we have a two-step process. First we compute the definition and use sets for every statement. Then, using the definition and use sets, we compute the live variable sets for every statement, LV entry and LV exit. Using the LV entry and exit sets, we infer the last usage point of a local variable or static field reference within that method, and we store that as the summary. Once we have summaries for every method analyzed by Androsia, we compute the last usage point of a static field reference considering the whole program, not just one method. The summary tells us the last usage point of a particular variable within a method, but it does not tell us where the last use actually happens across the entire program. So the summaries need to be combined and propagated across your program to infer that.

To start with the definition: live variable analysis determines, for each statement, which variables may have a subsequent use prior to their next definition. This is what we learned from the diagram we saw earlier, right? We had a definition and a use of variable x; within that span the variable was live, and after that it was dead.
That is exactly what this statement is telling us. In this particular code, the last usage point of variable x is this blue statement, number four. The last usage points of y are statements five and six, because they are two different statements within the if clause: either this one executes or that one executes, depending on the result of the if condition. And the last use of z is statement number seven. This is very intuitive, right? We can see it. But how do you determine it in an automated way? That's the question, and it's not as easy as it seems. We're going to run through this same example in the next few slides and figure out how to determine the last usage points of x, y and z.

The last usage point of a variable is also the last statement where that variable was live. x is last live here, and this is also the last usage point of x. We're going to use this fact to determine the last usage point of every variable.

Talking about the def-use sets: the definition set contains all the variables defined in a particular statement. In statement number one, x is being defined, so for statement one you'll have the entry x. The use set is the set of variables that are used in a statement. If we run through this example and compute the def-use sets, they look something like this. We start from the bottom. z is being used here, so it goes in the use column for statement seven, and x is being defined here, so it goes in the definition column for statement seven. If we keep going like this, we eventually populate the entire table. The direction in which we propagate the data flow facts is from bottom to top, the reverse of the program's execution order.
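As a toy illustration of filling in the definition and use columns, the Java sketch below extracts both sets from simplified statement strings. The statement texts are a guess at the slide's example (the transcript only fixes which variables each statement defines or uses), and the one-letter-variable convention and the class name DefUse are assumptions; a real implementation would read these sets off Jimple statements instead of parsing text.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefUse {
    final String def;        // variable defined by the statement, or null if none
    final Set<String> uses;  // variables read by the statement

    DefUse(String def, Set<String> uses) {
        this.def = def;
        this.uses = uses;
    }

    // Toy convention: variables are single lowercase letters.
    private static final Pattern VAR = Pattern.compile("\\b[a-z]\\b");

    static DefUse of(String stmt) {
        int eq = stmt.indexOf(" = ");
        String def = eq >= 0 ? stmt.substring(0, eq).trim() : null;
        String rhs = eq >= 0 ? stmt.substring(eq + 3) : stmt;
        Set<String> uses = new LinkedHashSet<>();
        Matcher m = VAR.matcher(rhs);
        while (m.find()) {
            uses.add(m.group());
        }
        return new DefUse(def, uses);
    }

    public static void main(String[] args) {
        String[] stmts = {
            "x = new StringBuilder", "y = new StringBuilder", "z = new StringBuilder",
            "if y.length < x.length", "z = y.reverse", "z = y.append", "x = z.append"
        };
        for (int i = stmts.length - 1; i >= 0; i--) {   // bottom to top, as in the talk
            DefUse du = DefUse.of(stmts[i]);
            System.out.println((i + 1) + ": def=" + du.def + " use=" + du.uses);
        }
    }
}
```

For the last statement this yields x in the definition column and z in the use column, matching the table described above.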
If I compute the LV exit of statement six, which is the last statement in this code, it is going to be the empty set (∅), because no variable is used after statement six. This has to be the empty set: no variable can be used after the last statement in the code. So this is the initial point for the algorithm. Once you have LV exit of six, you use the def-use sets of six to compute LV entry of six. Then you compute LV exit of five, then LV entry of five, then LV exit of four and LV entry of four. When you reach a point where there are two branches, you merge the results from LV entry of three and LV entry of four using a union operator, and that result goes into LV exit of two. Using LV exit of two, we compute LV entry of two, and so on. That's up next: we're going to take the same example we discussed for the def-use sets and compute these things.

If I had to give you a mathematical representation (and nothing is possible here without math, because data flow analysis needs math backing it up for a proof of correctness), it looks like this. The LV exit of a statement L is the empty set if L is the last statement of the body, which is what we discussed on the last slide. Otherwise it is the union of the LV entry sets of the successors of L; so if you have a merge point, you take a union over all the entry sets to compute LV exit of L. Once we have LV exit of L, we plug that value in here together with the definition and use sets we already computed for statement L: LV entry of L is LV exit of L minus the definition set of L, union the use set of L. Let's see this in action. We have the def-use columns populated.
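The two equations just described, LVexit(l) = ∅ for the last statement and otherwise the union of LVentry over the successors of l, and LVentry(l) = (LVexit(l) \ def(l)) ∪ use(l), can be run directly on the seven-statement example. The Java sketch below is an illustration under assumptions, not Androsia's code: the def, use and successor tables are reconstructed from what the talk says about each statement, and a single bottom-to-top pass is used because this example has no loops (code with loops needs iteration until the sets stop changing).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class LiveVariables {
    // Statements 1..7 (index 0 unused). Statement 4 is the if; 5 and 6 are its branches.
    static final String[] DEF = {null, "x", "y", "z", null, "z", "z", "x"};
    static final String[][] USE = {{}, {}, {}, {}, {"x", "y"}, {"y"}, {"y"}, {"z"}};
    static final int[][] SUCC = {{}, {2}, {3}, {4}, {5, 6}, {7}, {7}, {}};

    static final List<Set<String>> entry = new ArrayList<>();
    static final List<Set<String>> exit = new ArrayList<>();

    static void analyze() {
        entry.clear();
        exit.clear();
        for (int i = 0; i < DEF.length; i++) {
            entry.add(new TreeSet<>());
            exit.add(new TreeSet<>());
        }
        for (int s = DEF.length - 1; s >= 1; s--) {          // bottom to top
            Set<String> out = exit.get(s);
            for (int t : SUCC[s]) {
                out.addAll(entry.get(t));                    // union over successors; empty for the last statement
            }
            Set<String> in = entry.get(s);
            in.addAll(out);
            if (DEF[s] != null) in.remove(DEF[s]);           // LVexit minus def
            in.addAll(Arrays.asList(USE[s]));                // union use
        }
    }

    // A variable that is live on entry to s but not on exit from s is last used at s.
    static Map<String, List<Integer>> lastUses() {
        analyze();
        Map<String, List<Integer>> result = new TreeMap<>();
        for (int s = 1; s < DEF.length; s++) {
            for (String v : entry.get(s)) {
                if (!exit.get(s).contains(v)) {
                    result.computeIfAbsent(v, k -> new ArrayList<>()).add(s);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lastUses()); // prints {x=[4], y=[5, 6], z=[7]}
    }
}
```

The result agrees with the last usage points read off the slide earlier: statement four for x, five and six for y, and seven for z.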
We know that the LV exit of statement seven is the empty set, because this is the last statement. Now we use this formula to compute LV entry of seven: the empty set minus {x}, union {z}, which gives you {z}. The LV entry of seven then forms the LV exit of six, so LV exit of six is {z}. And {z} minus {z}, union {y}, gives you {y}, so you have the LV entry of statement six as well. You can keep populating the table iteratively like this until you have the entire table.

The interesting part to note is when a variable disappears between the entry set and the exit set in the live variable table. In the fourth statement, x disappears from the entry set to the exit set. In the fifth and sixth statements, y disappears. And in the seventh statement, z disappears. Those statement numbers are the last usage points of the corresponding variables. That's what we were after, and we've got it now. We store these last usage point results as the summary of the foo method, and we use this summary in other parts of the program whenever foo gets invoked from there. We don't want to compute it again and again.

Now let's say there is a method foo which calls bar, and bar calls baz. And let's assume we've already computed summaries for each of these individual methods, foo, bar and baz. The summary of baz tells me that the static field reference (SFR) was last used at C4. The summary of bar tells me that the SFR was used at B5, and there was no use of the SFR in foo. What I'm going to do now is propagate the summaries across these methods in reverse topological order.
This is how the analysis runs. When you reach statement C4, you see that the SFR is being used. So I create a data structure with the value (baz, SFR, C4), which tells me that the last use of the SFR happens at statement C4 in the baz method. Once I've completed the analysis for baz, I analyze the bar method. The analysis happens in reverse topological order: foo calls bar and bar calls baz, so baz, the last method to be called, is analyzed first, bar is analyzed second, and then you analyze foo.

Here we see that the SFR is being used at B5, so the LV entry set will contain (B5, SFR). When we reach the call statement for baz, we pull in the summary of the baz method from here and check whether the LV exit set of B3 already contains a use of the SFR, and in our case it does. So we replace the value of this little red data structure with the new entry, (bar, SFR, B5), which means that the SFR is last used at statement B5 inside the bar method. You keep going like this, updating the red data structure, and once the analysis is complete over all these methods, the red data structure will contain the last usage point of the SFR across the entire program. Similarly, when you go ahead and see the bar invocation here, you pull in the summary for the bar method. But the exit set of A3 is the empty set, because the SFR is not used any further inside the foo method. So this value is not overwritten by anything, because only the empty set is coming from the exit set of A3.
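The foo, bar, baz walkthrough can be condensed into a short Java sketch. It is a deliberate simplification under stated assumptions: method bodies are straight-line lists of statements, the call graph is a single chain, only the statements the talk mentions are modelled, and "is the SFR used after this call" is approximated by scanning each body from the bottom. The class and helper names are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SummaryPropagation {
    // One statement: its label, whether it reads the static field (SFR), and its callee if it is a call.
    static final class Stmt {
        final String label;
        final boolean usesSfr;
        final String callee;

        Stmt(String label, boolean usesSfr, String callee) {
            this.label = label;
            this.usesSfr = usesSfr;
            this.callee = callee;
        }
    }

    // Methods listed in reverse topological order of the call chain foo -> bar -> baz.
    static final Map<String, List<Stmt>> PROGRAM = new LinkedHashMap<>();
    static {
        PROGRAM.put("baz", Arrays.asList(new Stmt("C4", true, null)));
        PROGRAM.put("bar", Arrays.asList(new Stmt("B3", false, "baz"), new Stmt("B5", true, null)));
        PROGRAM.put("foo", Arrays.asList(new Stmt("A3", false, "bar")));
    }

    static String lastUseOfSfr() {
        Map<String, String> summary = new HashMap<>(); // method -> location of the last SFR use it leads to
        String record = null;                          // plays the role of the "red data structure"
        for (Map.Entry<String, List<Stmt>> e : PROGRAM.entrySet()) {
            String method = e.getKey();
            List<Stmt> body = e.getValue();
            String found = null;
            for (int i = body.size() - 1; i >= 0 && found == null; i--) {
                Stmt st = body.get(i);
                if (st.usesSfr) {
                    found = method + ":" + st.label;   // a later direct use wins over earlier calls
                } else if (st.callee != null) {
                    found = summary.get(st.callee);    // reuse the callee's cached summary
                }
            }
            if (found != null) {
                summary.put(method, found);
                record = found;
            }
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(lastUseOfSfr()); // prints bar:B5
    }
}
```

The record goes from baz:C4 to bar:B5 when bar is processed and is left unchanged by foo, which mirrors the sequence of updates described above.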
We've just completed our analysis, and the results tell us that the SFR was last used at statement B5 inside the bar method. This is how, algorithmically, Androsia runs on your code and figures out the last usage point of any object in the application.

Just to summarize: we went through the definition of live variable analysis. We computed summaries for individual methods using a two-step process. First we computed the def-use sets for every statement; then, using the def-use sets, we computed the LV entry and exit sets. From the LV entry and exit sets we figured out the last usage point of every static field reference or local variable and put it in a summary. Finally, we used the summaries to compute the last usage point of a static field reference at a whole-program level. That was a lot. Intuitively you can just see where the last use is, but automating that requires a lot of algorithm and math. Things are not always easy.

For instance fields, the approach is a little different. We mark all the classes which have StringBuilder instance fields. We find their object instances. Then we track the last use of the object instances and their aliases, instead of the StringBuilder fields themselves. We'll see a demo of this as well, and I think that will explain it even better. We have three classes here, one of which is MainActivity. This is where my control starts. I have a secret defined here, which goes into a variable mySecret. Then I'm instantiating MyClass, which is here, and the variable that refers to the MyClass object is mc.
Then I'm instantiating a Wrapper class, and the reference variable for that is w. Then I have a call to the callW method, which is defined inside the Wrapper class, and I'm passing the mc object instance to it. callW invokes setA to pass on the secret value, and setA is just a setter method which sets the value of a. So basically the secret is here, then it goes here, and then in the setA method it gets set on the instance field a, which is a private instance field. If we had to reset this private instance field, we could not do it from outside, from within the onCreate method. We need to invoke a method within the class to reset the value of a, because its access modifier is private and we can't access it outside its class.

Let's quickly jump to the demo. These are the steps I explained on the previous slides. We mark all classes which have StringBuilder instance fields, which will be MyClass. We find the object instances, which is mc. Then we track the last use of the object instances, which is here in the callW statement, and we instrument the code right after the callW statement. Here you can see that I'm invoking a method called resetSBA, and resetSBA is defined in the class right here. That's because I cannot access the private instance field a from within the onCreate method, so I have to invoke this reset method.

All right, let's have a look at the demo. It's pretty much the same code. I have the callW statement here, where I'm passing mc and mySecret. I also have the Wrapper class instantiated here, with the reference variable w, and I'm printing the w.sb field. sb is another StringBuilder instance field, of the Wrapper class. And a and b, okay, we can see that. So a and b will be instance fields of MyClass, and sb will be an instance field in the
Wrapper class. We'll see that in just a while. So you can see that we have an instance field sb here in the Wrapper class, and I'm calling setA and passing on the secret's value to the setA method. And MyClass has a and b as its instance fields, and inside the setA method I'm setting the value of the secret to a. b is nowhere used right now. So inside the Wrapper class you see that the method callW also has a secret defined, called password, and its last use is inside the callW method. So b is used here, mc is last used here, and w is last used here. So the instrumentation point for a and b should be right after the callW statement here, and the instrumentation point for the StringBuilder sb should be right after the w.sb statement here, the println statement with w.sb. So you see here that there's a callW method being invoked, and I have the reset methods being injected, or instrumented, right after the callW method, because we were tracking mc and mc's last use was in the callW method call. After that we have reset both the instance fields. And similarly, for the Wrapper class we had an instance field sb, which is being reset right here in statement number 58. So these reset methods are defined in the respective classes which have those instance fields. If we go to the respective classes, this is the Wrapper class, and the reset method has been defined here. What it's doing is the same thing: it's calling the length function on the StringBuilder, and then it's calling the delete API to remove the content of the StringBuilder. And similarly we have reset methods for the variables a and b, which were the instance fields in MyClass. Here we have resetSBB, which is the reset method for variable b. So we've tackled all three scopes now. We know how to deal with instance fields, we know how to deal with static fields, and we also know how to deal with local variables. I mean, we did not need to see all of this, because this is too much in depth, but if you want to
go ahead and implement your own static code analysis, this is important stuff for you, so I thought it appropriate to be displayed here.
So the work in progress is basically that we're working on test-driven development, so that we can get rid of the remaining bugs that we have. And we're also planning to include this into the CI/CD pipeline, which should be straightforward. That way a lot of big companies could just plug in their APKs and push them across to us, and then we can analyze those APKs and instrument the code that Androsia instruments to clear the memory contents of the APK. So that way it will be helpful for big corporate giants, especially the enterprise mobility management companies. So if you want to use Androsia, or you want to contribute or get in touch, this is the URL you should take a note of, and the tool and documentation will be available here. And if you want to contribute to the tool, here's my Twitter account or email; you can just shoot an email to me. So these are some good references. The first three will get you started when it comes to creating your own data flow analysis, and the last two are a little bit more in depth, if you want to see how things are actually working in the background. And anyway, these slides will be posted on the DefCon website, so you can have a look at these references. So that brings me to the end of my talk. Thank you for your time. I'm really happy to speak over here on this podium. I hope you enjoyed it. Thank you very much.
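For readers following along without the slides, the instance-field demo described above can be roughly sketched in Java as below. This is only an approximation under assumptions: the names MyClass, Wrapper, setA, callW, resetSBA and resetSBB follow what was said in the talk, while the method bodies, the Wrapper reset method name (resetSB), the lengthOfA helper and the main method standing in for onCreate are hypothetical and not taken from the actual tool output.

```java
// Hypothetical reconstruction of the demo classes; bodies are assumed.
class MyClass {
    private StringBuilder a = new StringBuilder();
    private StringBuilder b = new StringBuilder();

    // Plain setter: stores the secret in the private field a.
    void setA(String secret) {
        a.append(secret);
    }

    // Helper for this sketch only, so a caller can observe the private field.
    int lengthOfA() {
        return a.length();
    }

    // Assumed shape of an injected reset method: take the length,
    // then call delete to remove the StringBuilder's content.
    void resetSBA() {
        if (a != null) {
            a.delete(0, a.length());
        }
    }

    void resetSBB() {
        if (b != null) {
            b.delete(0, b.length());
        }
    }
}

class Wrapper {
    StringBuilder sb = new StringBuilder("wrapper-data");

    // Passes the secret on to MyClass through its setter.
    void callW(MyClass mc, String secret) {
        mc.setA(secret);
    }

    // Hypothetical name; the talk does not give the Wrapper reset method's name.
    void resetSB() {
        if (sb != null) {
            sb.delete(0, sb.length());
        }
    }
}

class ResetDemo {
    // Stands in for onCreate in the demo application.
    public static void main(String[] args) {
        MyClass mc = new MyClass();
        Wrapper w = new Wrapper();

        w.callW(mc, "secret");   // last use of mc
        mc.resetSBA();           // instrumented right after the last use
        mc.resetSBB();

        System.out.println(w.sb); // last use of w.sb
        w.resetSB();              // instrumented right after the println

        System.out.println(mc.lengthOfA() + " " + w.sb.length());
    }
}
```

The reset calls sit inside the classes that own the fields because a is private and cannot be reached from the calling method, which is the reason given in the talk for injecting a method rather than clearing the field inline.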