[Opening remarks unintelligible in the recording] ...gone to some length to try and stop me assessing that security posture. So obviously it became a personal challenge: I would defeat them. None of the available toolkits really worked with anything other than standard compiled Python bytecode. Even the smallest obfuscation would trip them up. So that was why I decided I had to write some tools to do what I needed to do. All the decompilers and disassemblers just assume that the code you're dealing with is standard bytecode, and that people won't have gone to any length to try and stop you reversing it. So none of the toolkits, apart from this one now, has any understanding of, or makes any attempt to get around, obfuscations which might be in place. Also, obviously there's a huge amount of Python code out there, and a lot of it is used in web applications and remote applications, so there's a bigger attack surface area to be working with, but there hasn't been a huge amount of work on Python-specific techniques. So it was a good area to carry out some research in. And maybe in the past there hasn't needed to be much work in this space, because people hadn't been distributing obfuscated Python; they'd just been throwing the PYs out, so you could read the source code anyway. But this is changing.
There are some general, bigger-picture trends. Obviously people are moving away from developing in C and C++, for all the reasons that people have failed at it for the last 20 or 30 years: they found out that it's hard. Your high-level languages, Python, Ruby, Lua, et cetera, are much more rapid to develop in. The people who are able to develop in them are often straight out of university and can do Python much better than they can do C, so there are more developers, it's cheaper to develop, it's cross-platform, et cetera. There are also changes in distribution. Five or six years ago you always downloaded an application; now it's all Web 2.0 and the cloud. Everything has got something to do with the cloud, but nobody knows exactly what the cloud is. It's got something to do with the network, the internet. So everything now is software as a service. What this means for reverse engineering is that maybe you don't actually have access to the files you're trying to reverse, because you're dealing with them on a remote server. Also, overflows aren't the only bugs. Everyone's obsessed with stack overflows, then it went to the heap, and things got more and more complicated as all the protections were put in place. To work up a good memory corruption bug now will take some talented guys a good six months of solid research. There's definitely a need for that, but for those kinds of bugs it's a significant investment. Some of those researchers are at Immunity, where I work. We've got the resources to invest six months in a bug, but then the sale price of that bug has to reflect the amount of work that goes in, and a lot of people aren't prepared to pay for bugs. Python will have lots of good bugs in it, and higher-level languages in general can have lots of good bugs, and they're often very much cheaper to find, because there just hasn't been as much effort put into the techniques for finding those kinds of bugs.
It's easy; there's a lot more low-hanging fruit. In reverse engineering, a lot of the toolkits are done in a C-centric manner. With newer, reflective languages, people are stuck in this decompilation mindset, and maybe there are some better ways to go about it. Lots of the toolkits, certainly anything to do with Python reversing, relied on the fact that you had access to the file on disk, so that you could do a static reverse on the serialized object on disk. If you're working in a service-oriented model that might not be the case: you might not have access to the file on disk, but you do have runtime access to the objects on a remote server. If you can actually get source code back out of that remote object then that's highly desirable, because you're never going to have access to the disk, so we're going to talk a little bit more about that. Bug snobbery: I have certainly fallen into this category. I get loud and vocal about why I'm having to do a week's worth of assessment on XSS and CSRF. But to be quite honest, just because you're not working on some hardcore overflow doesn't mean you can't get some pretty good bugs. If it gets you in, it gets you in. You shouldn't be too proud to work on an XSS, even though it's in JavaScript and that's what all the kiddies are doing. There are lots of areas with low-hanging fruit that haven't really had much research put into them yet. Not everybody is a Nico Waisman, not everybody can do crazy heap reversing. I came to the conclusion that I would never become as good as Nico, so I had to start looking in other areas to make myself feel good. Other side effects of the new development model: obviously everything is always in beta. Less experienced developers are writing code that actually is used in products, and time to market and new features are key.
There will be a lot of new code out there that hasn't necessarily been tested as thoroughly as you'd like, and obviously the flip side of all those points is that there are going to be tons of bugs, and that's what we like. There are often large new populations of users for whatever the "in" app of that week is, and they can be seeded rapidly, so a huge population of users running some very vulnerable code will spring up, and obviously that's the best thing you can have. And a lot of these bugs, because they're in the higher-level languages, are actually cross-platform and cross-architecture. Now that's awesome: you write one bug and it will execute across whatever architecture and whatever operating system they're running, so you could hack anything from an iPhone right the way up to a mainframe. If it's running Python and there's a Python-level bug, you'll be good, which is huge. At all these conferences, and certainly at Black Hat, you'll get a lot of vendors peddling huge amounts of snake oil about how we're more secure than we've ever been, but I really think it depends on what metrics you're measuring. There are more lines of code than there have ever been. There are more people who think they can code. I'm sure everyone sitting in here knows people at their place of work who should never be near a keyboard, yet they're writing code which is going out in production systems or in internal systems. Everything's now about connectivity, everything's network aware, and the pervasiveness of technology is increasing, so I'm not sure that we really are more secure than ever. The higher-level languages are being used increasingly to seed all this crap everywhere, so if we can find some good techniques to exploit them then we will. What's not going to be discussed?
There's not going to be any dropping of commercial application source code that I've reversed out of anything, or any bugs that I found within, mainly because the lawyers don't seem to agree that what happens in Vegas stays in Vegas, and I can't afford to be taken to court. So why reverse at a high layer? Most people assume that going in at the lowest layer is the best thing, but if an application has been written in a high-level language, reversing it at the layer it was written in means that you're closer to the developer. You're further from the data but closer to the information, so you can get a much better sense of the bugs that might be around. We're not assessing the security posture of the Python runtime, which does have a ton of bugs; we're looking for bugs in the Python code itself, so reversing at the layer that the code was implemented in is the best thing to do. I'm not sure how clearly this will come out, but you should be able to see here that this is reversing in a normal debugger, and you'll see a lot of callouts to the Python DLLs.
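As a rough illustration of what "closer to the developer" means, here is a minimal sketch, not anything taken from the talk's slides, of looking at a hello-world program at the Python layer instead of in a native debugger. It assumes a current Python 3 interpreter (the talk itself concerns Python 2.x, where print is a statement rather than a function), and the one-line source string is a hypothetical stand-in for a real application.

```python
import dis

# Hypothetical one-line "application", used only for illustration.
source = 'print("hello world")'

# compile() returns the code object that the interpreter would execute.
code = compile(source, "<example>", "exec")

# The developer's own strings and names sit directly on the code object.
print(code.co_consts)   # constants, including the string literal
print(code.co_names)    # names referenced, including the function called

# dis walks the bytecode and names each instruction for us.
instructions = [ins.opname for ins in dis.get_instructions(code)]
print(instructions)
```

The string literal and the name being called are readable straight off the code object, with none of the interpreter's internal calls in the way. The exact instruction names printed vary between Python versions.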
Even to do something simple like print a hello world, there are actually a lot of layers in between you and the code, so it gets very complicated, and even simple things can take quite a lot of effort. Obviously Python is a fairly complex language; it's got quirks and flaws and bugs just like every other language. But like I say, a lot of the people who develop in it are maybe less experienced developers who don't really understand so much about how computers work, so there are mistakes that everybody makes. I'm just going to highlight a couple. Anybody out there who's good at Python and can see the problem with this code, shout. All right, I won't wait long. Well, yeah: var is a class attribute rather than an instance attribute. So you can see, with foo and bar, which are the instantiations of the test class, if you actually print out the var variable, both of those instances of the class share the same variable, because the variable is at the class layer. People make this mistake all the time. If they make a new instance of an object on a shared system, depending on how they use the class, you can get access to other people's objects, which depending on the situation can be a good thing. Can anybody see the problem in this one? Okay, it's a mutable default argument. When it was called here with an argument going in, we're all good: foo is appended to the end of the list that was supplied, as expected. If it's called without an argument, the default argument is used, and the object used for the default argument is made once, when the function is defined, so again it is shared: you call it twice and the list keeps growing. I've seen a lot of people make this mistake, certainly in remote applications. They'll have socket objects here; it means you can access somebody else's socket object, and then you've got a route back to a different client. So obviously, like every language, there's a ton of these bugs that maybe inexperienced
developers don't really understand, because they don't really understand how Python is doing things; it's just a fairly easy language to write in, so lots of people make some pretty simple mistakes. So what were my initial aims for doing this? I wanted a toolkit that would rapidly assess and find bugs within applications, even if they were obfuscated and I didn't have access to the .py files themselves. I wanted to get back to a source code representation from a live memory object. And I'd prefer a general approach against all the ways that people are obfuscating bytecode rather than a specific approach for each different thing, because that's a cat-and-mouse game and it takes a lot of time to keep up. So if there was a general way of defeating what people were doing, I wanted to take it. Obviously there was, because I'm giving this talk. We're going to blast through this because we haven't got much time: a 101 of the Python language. There's a fair number of different file types with Python. The PY is the one everyone will be familiar with. It's where the source code lives; it's human readable, and you can take that PY and run it on any Python platform. Then there are the compiled and serialized versions, PYC being the most ubiquitous one. It's a standard serialized form, and we'll have a look at the format very quickly. Any time a PY is compiled or imported (import implicitly compiles), a PYC will pop out, which is the bytecode equivalent of the PY. Contrary to popular belief it doesn't actually speed up execution; it purely speeds up instantiation, because you miss out that compile step, so you don't need to recompile every time you run the application. It is cross-platform: a PYC will run on Linux and on Windows. But it's not cross Python version, so a 2.4 PYC won't run with a 2.5 runtime, and it is
purposely undocumented by the Python developers, to allow them the flexibility to change the bytecode format without breaking a bunch of stuff where people have relied on it. PYOs have the same structure as PYCs but they're optimized. You can optimize at the first level, which removes all the asserts, or at the second level, which removes all the asserts and all the inline documentation. It's nothing to do with speed, purely file size. This shouldn't break most things, but in some corner cases it will. Python Lex-Yacc is one of those: if you remove the docstrings then everything fails, because the actual lex grammars are kept in the docstrings. But most of the time a PYO is just a smaller file, so that's all good. PYD is the most complex format that comes as standard with CPython. I've seen it a lot on Windows; it's compiled into a shared, compiled C object, and we're not really going to talk about these. There has been some good work done by Aaron Portnoy and Ali, who did a toolkit called AntiFreeze where you can unpack the PYDs, modify the bytecode and repack them, and they did some good stuff with games on Windows, making their character jump 20 times higher. So, the Python PYC format. There's a four-byte magic number, which is for the version of Python, so the runtime can check what version the Python was compiled with and bail out if it's not the version that it is. There's a four-byte timestamp; we'll show why this is important later, but basically it's used to decide whether a new PYC should be generated from a PY. And there's a marshalled code object, which is the actual serialized code object where the Python code resides. You want to say something?
[Conference staff announcement] Well, we made it to two o'clock this afternoon before I had to come and yell at you. Several someones are stiffing Kady's, who have been very nice to us, for $100-plus bills, as in seven people eat, get up, walk out, leave the bill. Very, very
uncool. Let's apply that whole new social media thing, or whatever you guys are calling it: if you know who they are, pressure them to go pay their bill. If you know who you are, go pay your bill. If you're thinking about doing it, don't. The numbers are high enough now that I'm coming over here to make an announcement about it, so it's not just one group. If it keeps happening, they will review the video tapes, they will catch you, you will be in trouble, and more importantly we will be in trouble. We have a good relationship here at the hotel. Much as we love the Alexis Park, I would rather not go back to the Alexis Park. It is not a lot of fun doing CPR on somebody in 110-degree weather. Please don't send us back to the Alexis Park, or I promise, you and I will sit down and have a talk. Okay guys, be responsible, be adult, get your friends to go pay their bills. Thank you.
[Talk resumes] Okay, so it sounds like someone's fucked up. Yeah, let's please not go back to the Alexis Park. Right, so that was the bytecode format for PYCs. All those formats come as standard with Python, and then optionally a lot of people are using packaging. Packaging allows you to distribute a runtime along with your Python code, which means people don't have to have the version of Python you need installed on their system to run your code. py2exe and py2app are examples of packagers: they take your code and the runtime, bundle them up, and make it much easier to run, with just a double click. This is important because it means that developers can distribute a modified Python runtime along with their code, and a lot of the anti-reversing techniques that we'll talk about rely on the fact that they can distribute that modified runtime. We're going to speed up. So this is the object hierarchy, and we're going to blast through it. Modules: the only thing to note is that a module doesn't have a code object once you've imported it, and this is for speed
reasons: once it's imported it's not needed. So it's a language design decision, but for reversing it's a real pain in the ass. Then there are class-level objects; all the superclasses of a class are in __bases__. Classes have methods, and methods are literally just wrappers for functions; the function is held in the im_func attribute. Functions have a func_code object. The code object, which is what we actually care about, has various attributes. It's a very robust object, and you can tell a lot about how the code will run. The biggest one is co_code, which is a string representation of the bytecode, so we're going to be able to access that and start reversing things out. And then there are all these other attributes here, the constants, the variable names, the line number that the function was instantiated at, all really useful information as we get into some of the techniques. All you need to know is that of all the Python objects, the code object is the one we're really going for, and everything else just stacks onto that. So we've blasted through this pretty quick. Python has a bytecode language, and it's a pretty simple one. Every opcode is an 8-bit opcode, so there are only 256 available; currently in Python 2.6 there are 113 already defined. Optionally an opcode can take an argument, and all arguments are two bytes. So if this is the Python source, just print 'bugs', when it goes into its bytecode instructions you can see these are the different instruction names. Each of those maps to a single integer; it's just a map between a number and the instruction, and some of them take arguments. And this is the byte stream, so you can see that 0x64 relates to LOAD_CONST and it took two arguments, both of which in this case were null. Well, one argument, and one argument is always two bytes. Oh, fail. Okay, so we've blasted through the Python. We're going to run through this and then
we're going to get on to the new techniques. The existing stuff that's available: there are disassemblers. There's dis, which comes standard with Python. The representation that you saw, from source code to instructions, was just done using dis; it will just dump out the bytecode for you. The important thing is that it relies on opcode.py, which is the Python-level module that gives you those number-to-instruction mappings. Oh, fail again. Then there are debuggers. pdb is the standard Python debugger; again it comes with normal Python, but it's very much a developer's debugger. It assumes that you're going to have access to the PY; it's for finding bugs as you're developing, as opposed to finding bugs in other people's code. That was what I based a lot of my stuff on, but I extended it to be more useful when you haven't got access to a PY file, only the PYCs. Decompilers: there are a bunch of decompilers, some of which are an application that you download and some of which are an online service. Why you'd want to use an online service to decompile, and give those people your source code, I don't know, but people obviously do. Some are free, some are commercial, and they definitely vary in quality depending on what version of Python you're running. The best free one that I found, which I based a lot of the stuff we'll talk about in pyREtic on, is UnPYC, from a Russian developer. It's not perfect, but it's good enough in most situations. For the online services, DePython is certainly the best; the quality of their decompilation is really good, but it's not free and you're giving them your source code, so it's up to you. Bytecode assemblers and modifiers: Byteplay and BytecodeAssembler are certainly good ones, and AntiFreeze is an example that we've already spoken about for PYDs. These allow you to work at the Python bytecode layer rather than at the Python source layer. If you're interested in this level of Python they're certainly good to play with, and you'll learn a lot
more about how the Python runtime actually operates by playing with them. So that's what exists so far. What are the anti-reversing techniques that I was seeing in a lot of these commercial apps? We're going to go through a taxonomy. Like I say, commercial and closed-source apps are increasingly putting in obfuscation techniques to stop you getting at their PYs, because obviously they've put their effort into capitalizing on that code, and there will be a bunch of bugs in there. The more effort they go to to keep the source away from me, the more likely it is that there are bugs in there. I've worked with Python a lot and I've seen a bunch of different techniques, so we're going to blast through them. All of these techniques mainly focus on when the bytecode is on disk, when there are those PYC files on disk. They all focus on obfuscating that PYC file so that it's not in its standard form, so all the basic tools that we've just talked about break, and that's pretty much the approach everybody is taking. The simplest one, which you see a fair amount, is just hiding in the packagers, py2exe and py2app. People believe that if they wrap their code in that, you can't get access to it. Obviously these are standard formats and it's easy to reverse them out; often you'll unpack a package and find that maybe even a PY is already present in it. So that's super easy to get by. You often see this in Win32 applications: people will pack it up into a PYD and assume that you can't get access to it, but it's a standard, documented format, so it's very easy to just reverse it out. Source code obfuscation: I've never actually seen this used in a real application. There's a commercial application that's sold to do the obfuscation, but I've never seen anyone actually use it. It's a similar kind of thing to what you see in JavaScript malware, where they'll try to make the source code look complicated; the
functionality will be the same. So for example there's this easy code on the left, and it goes to this weird obfuscated code on the right. I'm sure it's very easy to actually undo what they've done; I haven't looked into doing it because I've never seen it in a commercial application. When I was looking at the guy who sold this Python obfuscator, though, I found a gem: he also sells PawSense. PawSense is a way to cat-proof your computer, and it will tell you when cat-like typing has been detected. So if people have cat-computer problems, this is your man, and it's only $19.99, so it's a snip. The more effective modifications that you see all rely on modifying the Python runtime. As we said, you can distribute the Python runtime in a package, which means the authors can modify CPython at the C layer, compile their own version of the runtime, and then switch things around, make the bytecode that's produced different, et cetera, and this is what people are doing. One of the simplest things is just to change the bytecode magic number. You'll remember from the PYC that the first four bytes are a magic number that says what version of Python it is. If the Python that you're running isn't the same version, it will bail with a bad magic number error. You probably can't see too well, but these are all the defined magic numbers for the different versions of Python; this is in the comments of import.c. Basically all they do is change the magic number, so any standard Python, anything apart from their modified runtime, won't run their code. If you fall at this barrier you're failing pretty hard, because you just need to replace the first four bytes with the four bytes of a standard Python magic number and you're good. It does make all the standard tools fail, because like I say they're expecting standard-format bytecode. So it's a very easy change for developers to make
but it's very easy to get around as well. Changes to the marshalling format: as I said, there's a four-byte magic number, there's a four-byte timestamp, and then there's this marshalled code. The marshalling happens in a standard format. If they go into marshal.c in the CPython runtime and change how it marshals things, then all the standard tools won't understand how to unpack it. They can get arbitrarily fancy here. I've seen stuff that looks like they're pretty much doing some kind of encryption. I'm sure it's pretty crappy encryption, but still, to work out what they're doing, because it's happening in the CPython layer, you're going to have to trace it in a debugger, and you're going to have to do that for every different type of marshalling modification, which is a bit of a pain. I wanted to avoid that, and I managed to avoid it by working at the Python layer; we'll talk about that in a bit. And then opcode remapping. This is one of the more complex things that they do. Basically, that table I spoke of, of integer-to-instruction mappings: they juggle all that around so it's not the standard layout you'd expect. They change opcode.h in the runtime and remove opcode.py from the distributed runtime. That means you can get access to the co_code attribute but the byte stream makes no sense, and it makes no sense to any of the decompilers, because they're running off the standard opcode map and this has a juggled opcode map. When I found this it was a real pain, but I've got a pretty simple attack to get around it, which we'll talk about. So, the general approach we're taking with pyREtic. I want to remove the reliance on having access to the file on disk; I want to work in memory only. I want the application itself to undo any of the protections, like the marshalling, because however complex they've got with marshalling something up, they're going to undo the marshalling, and then when it's in
memory it's just a standard Python object, and obviously it's easier to understand a standard Python object than to piss about at the C layer trying to work out what they're doing with their marshalling. I'm going to get in process at the Python layer; then I have access to the full Python namespace and can start to query the objects, and from querying the objects in live memory, get back to a source code representation. The fact that with a reflective language like Python you can query objects is really useful in the cloud paradigm, because you will never have access to their files. Python is a popular choice as a sandboxed environment, or as an environment that allows you to access their API. They may think that if you only have access to their objects then that's all good, but with these techniques you can actually get the source code back from their objects. So even though you're interacting with a remote computer, you can still get back to the source code without having their PYCs on disk, which is pretty cool. Moving forward, I think this is an area that more people are going to have to put effort into, for all sorts of languages, not just Python, purely because the paradigms of distributing applications are changing, so I'd expect it to be an area where you'll see a lot more activity in future. So how do we get in process? I said we need to be in process so we can query the Python objects. If people are trying to obfuscate their code they won't distribute the PYs, they'll only distribute the PYCs, and if they've modified their runtime you need to use their interpreter to run their PYCs, because they've pissed about with the marshalling or the bytecode magic number or whatever. But the import rules of Python still apply. In import.c you'll see there's this equality test here. We've talked about the timestamp, and this is
just saying: if the timestamp in the Python bytecode is not equal to the timestamp on the PY file that's on disk, then recompile that bytecode. So you're using the PY in preference to the PYC, because something may have changed in your PY, and then you recompile the bytecode. This test means that if their module was named foomodule.pyc, you can just rename it to another name, foooriginal, and then take your module, the code that you want to run, and call it foomodule.py. Your code will be run in preference. So this is a really easy way to get in process. All the distributors would need to do is take out this equality test and that would stop people doing it. But I've seen people who have put huge amounts of effort into their remarshalling and things and haven't taken out this four-line if statement. I don't know why; they probably should. They fail. So our PY takes precedence, and then, because we're running arbitrary Python now, we can start to query those objects. There's some code that I haven't got on the disc, but it's in the paper, for blindly mirroring the object that you've replaced. You're acting as a proxy between the object that you've renamed and the code which is calling you, and you're literally passing the calls on into the object. So the code in the paper is pretty useful; it means that the application won't crash. As I said, non-standard marshalling can be a real pain in the ass. I didn't put much effort into trying to come up with a technique to undo their marshalling, because as soon as you're in the Python runtime they've already undone it. So we just completely sidestep the problem and access it at the Python layer. This next one is more complicated, however: we've got to re-remap their opcodes, so we've got to work out how they juggled up their opcode mappings, their integer-to-instruction maps. Obviously we need to do this at runtime;
we don't have access to opcode.py, so we need to reproduce that opcode.py so that all our other techniques can work. We need an understanding of the opcode mapping to be able to get anything from the byte stream, and it's this understanding which the vendors are assuming they've broken. If you don't understand the opcode mapping you can't get any sense out of the bytecode, because it's just a stream of bytes and you don't know what it means. But there was a pretty easy way to get at this new opcode mapping, essentially using a plaintext attack against the compiler. As I said, all instructions are just one byte, with an optional two-byte argument. So if we look at some normal Python bytecode on the left, which we've already seen, instructions and arguments, and say they've remapped it so that 0x64 has become 0x44, 0x47 has become 0x11, et cetera: if we can take a known set of PYs and compile them into bytecode with a standard Python, and then compile them into bytecode with the obfuscated runtime, then we get two lists of bytecode, and we can just diff these two bytecode streams and say, oh cool, 0x64 has gone to 0x44, 0x47 has gone to 0x11. We do this for enough Python source code that we hit all the bytecode instructions, and then we've got the new mapping, we can rewrite opcode.py, and we're good. Even though they've juggled it up, we can still work out what the instructions are. So for this example we've got these mappings here, and you just keep doing it for a shit ton of Python and eventually you'll get all the opcode mappings out. As I said, every runtime contains all the normal Python modules, os.py and so on, and we've got access to those as well, because they're standard library modules. So if you compare the
obfuscated pycs that they're distributing with a normal pyc which you've compiled, you get this opcode mapping back out, which is pretty useful. To do this in the tool I created a new file format, which is probably a bit of a grand term: the pyb file format, which just means that we're dumping out raw Python bytecode. It allows me to sidestep the fact that I probably don't know how they're marshalling things up. It's just a format in which we can produce two streams that are much easier to diff, and I don't have to go through that stage of unmarshalling their pycs.
So this is the step by step. First we've got to find the version of Python that the obfuscated runtime is running. We need to compile the bytecode with the exact same version: if they're running 2.5.4, we need to compile our pybs with 2.5.4, otherwise we won't get streams that are perfectly in sync and we'll get fewer collisions for the plaintext attack. So we find out the version. We use a standard Python to compile up all the reference pybs from the standard library. We compile up the pybs from in-process in the obfuscated runtime. Now we've got two sets that we can diff. We diff them, we get our new bytecode mappings, and then we rewrite opcode.py, and for unpyc we need to write its equivalent of opcode.py, and then we're good to go. So even though they've gone to all this effort, now we've remapped their opcodes, we understand what their byte streams mean, and we can start to decompile. Again, we're only getting this because we're working in memory at the Python layer, so it's quite useful.
And this is where I'm going to demo and things will fail. So I've got a little test application. It's really lame, but it shows the point. REpdb is my reverse engineering version of pdb, so we'll start that up. I'll put it into the middle of the screen. So we'll just set a project, and you can see all our source code is going to be dumped out to a
particular directory, and you can see the project was created here. All our pybs are going to be going out into here. I'll just get my cheat sheet so I don't fail at the paths. First we're going to generate the reference pybs, so it's just gen ref, 2.6, and a path to all the standard library modules, just the standard module set. So we compile all those up. We're compiling them into pycs, so we know we've got a good compile, and then we're taking them to pybs. So we've done all that, that's good.
Now we do the same for the obfuscated Python bytecode, so we can get our two sets. So gen, 2.6, and this time we give it the path to all the pycs which were distributed, the modified pycs. I've created my own: see, this Python 2.6.2 is a standard compiled runtime, and this one here is one that I've modified that will do opcode remapping. So we compile up everything here. It's got a bunch of warnings, but that's because I haven't cleaned up my code properly.
Now we just tell it to remap, and it will take those two byte streams and diff them together. You can see further up here where it's finding bytecodes and remapping: binary right shift, it was 63, it's moved to 64. So we've found all the shifts, and now we're going to rewrite the opcode.py. We say yes, we want to rewrite the opcode.py. And now if we look here, you can see there are all the pybs that we've produced for the obfuscated runtime, all the pybs that we've produced for the reference, and then in the libs we've got a new opcode.py, remapped by pyREtic, and then these are all the new opcode listings, which is just the equivalent file to opcode.py for unpyc. So we've remapped the bytecode, and now we're in a position where we can understand what the byte streams mean. And we've probably really got to hurry up.
So, in-memory decompilation versus static. Obviously all the decompilers
at the moment take files on disk and decompile them by accessing those serialized objects that we spoke about in the marshalled PYC. In memory, obviously, we're going to access the bytecode from the function objects' co_code attributes, because then everything will be automatically unmarshalled and we can get access to that at runtime. We're going to have to hurry up.
As I said, a top-level Python object doesn't have a code object, which is a real pain, so at runtime there's no code object for us to decompile. This means that we have to use what I've termed source code reconstruction. I'm not sure if that's a proper term or not, but rather than decompiling the stream of bytecode, we're prodding and poking a lot of the objects and asking questions of them in the runtime, and then from the answers to all those questions we have a good guess at what they actually look like in source code. So when we're working in memory, if we can get access to a function, we can get access to the code object and decompile that. (Five minutes? Awesome.) If we're doing reconstruction at runtime, we need to query a lot of the objects.
So say this is the source code. These objects are obviously top level, so we need to use reconstruction on those, and we have to query a lot of the objects. This function actually has a co_code object, so we can just use standard decompilation there. There's some stuff that we can and can't get out of reconstruction. You can see here that bar is calling test function 3; from reconstruction we can never tell that it called test function 3, we only get the return value from it. Similarly with foo, which was initially set to 9 and then was set to 10: we'll never know that it was initially set to 9. We only get it when it's at 10, because that's the state from which we're asking the question of it. So we have no pre-state history. So it's not perfect, but remember we're looking for bugs.
We're not looking to completely recover the Python source code, so it's good enough for finding bugs. In realistic applications most of the functionality is actually in the classes and the functions, not at the top level, so if we can get it roughly right, it's good enough for finding bugs. It's not perfect, but it's good enough. And that's just saying what I've said.
So, the pyREtic toolkit. This is what I wrote so that I didn't have to do all this manually; it obviously came from my real need to actually find some bugs. It's in three sections. There's the decompilation and source code reconstruction section that we've just spoken about, and it will do three types of decompilation depending on the kind of obfuscations which are in place and the access that you have.
First, there's file system traversal with unmarshal and decompile. This means that you're walking the file system: you've got access to the file system, you've got access to their PYCs, and they also give you access to their unmarshalling, so they haven't restricted access to it. You can take the files off disk, unmarshal them, throw them in memory into the decompiler and get the source code back out.
Second, there's file system traversal where you've got access to the code on disk but you don't have access to the unmarshalling. This means you have to do the source code reconstruction, plus the actual decompile on the objects that you can access through the co_code object. Again you get pretty good decompilation out of this, but not perfect, not as good as the first case.
And then the third case is when you don't have access to anything on disk; you've only got access to the objects in memory. You're not traversing the file system, so you can only decompile what has been instantiated at the point that you're in state. So a lot less code gets decompiled: you can only get access to something that's actually
been instantiated in memory at the time, but you don't have access to anything on disk. So you'll get less code decompiled, but at least you'll get something.
Then there's the opcode remapping, which you saw demoed earlier. It's all packaged up in a nice, easy-to-use way so you don't have to do it all by hand. And then there's REpdb, which is my version of pdb. It just subclasses pdb, but it allows you to access lots of extra functionality which is useful for assessing the code at runtime, and it calls through to third-party modules like pycallgraph, so you can get a nice call graph out and actually understand what that code is doing at a very high level. So yeah, some future directions: there's lots of potential to make this more usable by other people rather than just me.
I'll just show you the three types of decompilation and then I'll be kicked off, I think. We've already done all the remapping, so we've got the opcodes. Now we're going to do the three decompilations, so you can see how the source code differs and degrades when you're doing the pure in-memory stuff. First we'll do the file system traversal: file system traversal, unmarshal and decompile. And this is the silly test application, which I compiled up using the obfuscated Python runtime that I built. It's pretty quick, because it was a simple application. So we'll go to our source code, and then this ridiculously long path... yeah, we're there. So this is the PYC that's been decompiled through the first method. You can see it's pretty good. You get some strange things where things are specified as longs, but it's a pretty good decompilation. You'll see things change when we're doing the pure memory stuff. So we'll do this again with file system mem decompile, to the same path. Again, pretty quick. Oh, and it's in a different location and, you've guessed it,
all the way back out, and we're there. You'll see here things like this: this was a function call, but we only get the return value from that function call, for the reasons that I gave earlier. And you'll see that some of the classes have an elongated name, because when we were asking a question of that class object in memory, this is the answer that we got. So again, not perfect, but definitely good enough for finding bugs.
And then the pure memory decompile. We obviously need to import the object, so we import the test app, and then we do a pure mem... actually, I'll show you it's in, so you can see that it's actually in the namespace of the debugger. So we do a pure mem decompile on this object. Obviously we're not passing it a path anymore, we're passing it an object, because we're assuming that we don't have access to the path. It'll do the same thing. And if we go to the mem object, now we don't have a long path, and you'll see that it's come out again. We only get the return values, but it's pretty much similar to the previous one. So the decompilation is pretty good, and we had no access to anything on disk: this was source code coming out of an instantiated object in memory. Obviously it's a simple test app, but it's pretty powerful what you can do with that moving forward.
So I think I've completely run out of time, if I can get back to the Prezi... fail. So, completely run out of time. Are there any questions? Awesome, that means nobody followed. Even better.
The other thing that I'm going to announce is that there was a hack cup, a lot of hacking teams playing in a football competition yesterday, and these are the results of that. The team that won got a load of tickets to ekoparty down in Argentina, so that's pretty cool. And Nico, who put this cup on, is going to be throwing it again next year. So if people are hackers that like football, which I know is a strange mix, because
that means athleticism and getting out from the basement, it will be on next year. So if you want to enter a team, get in touch with Nico Waisman and he'll give you all the details. Other than that, thank you.