Hello, everyone. Thanks for coming. My name is Batuhan. I'm a high school student from Turkey. And in this talk, I'm going to show you a few tricks for hacking CPython. Before we start, you can ask me any question through Twitter with the handle isidentical, and get the slides from Speaker Deck with that same handle.
So we are here to hack CPython, but what is hacking? When you say hack or hacking or hacker, people often think of a guy wearing black who serves evil forces, who works for money, who tries to break into your server. Yes, they are hackers. But hacking is not just doing illegal things. Think about torrenting. When you say torrenting, people think you are doing something illegal, but you can torrent with someone's permission. So we are going to do a legal hacking of CPython. The definition of hacking is using something in a way it is not supposed to be used. Like in Turkey, we use old cheese containers as plant pots. It's a recycling hack. And we are going to hack Python for our freedom. You all know this guy, Richard Stallman, who saved us from being slaves to the bad guys. We are all grateful to him. He hacks for his freedom, and so do we. Has anyone heard of PEP 313? No? It's an old PEP about adding Roman literals as Python integers. It was rejected with some good reasons in 2005. But what if you want to use PEP 313? Isn't it your freedom? Yes, it is. That is why we are going to hack.
For doing such hacks, we need to learn how CPython works. Every step in the execution model is a breakpoint for us, where we can inject our code or alter the output. For example, if you want to implement PEP 313, you need to replace every capital X with 10 at the AST step, before compiling. The first step is tokenizing. When you say python file.py, Python will read your file in the encoding you have specified and then stream it to the tokenizer. Think about 2 plus 2. There are three tokens: 2 is an integer, plus is an operator, 2 is an integer.
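As a rough illustration of the tokenizing step, the standard library's tokenize module can list the tokens for 2 + 2. This is a minimal sketch rather than a slide from the talk, and the variable names are chosen here for illustration. Besides the three tokens mentioned above, the module also reports a few bookkeeping tokens.

```python
import io
import tokenize

# tokenize.tokenize() expects a readline callable that returns bytes.
source = b"2 + 2\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Pair each token's type name with the text it covers.
names = [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]

for type_name, text in names:
    print(type_name, repr(text))
```

The listing contains the two NUMBER tokens and the OP token, surrounded by the bookkeeping tokens that the tokenizer adds on its own.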
And by the way, the first token is the encoding token. You may use it when you are hacking. But tokens don't know about relationships. A token doesn't know whether it is in an if's body or in an if's test. So we need something more relational, like concrete syntax trees. Python's parser generates concrete syntax trees from the tokens that are streamed into it. And this is the last step where you can convert in both directions: you can convert code into the CST, the concrete syntax tree, and the concrete syntax tree back into code. The CSTs of two equivalent expressions are different. You can write 2 plus 2 with 15 whitespaces between the operator and the integer, or without whitespace. These two CSTs are different. But we need to know they are the same. So we need something more abstract, like the abstract syntax tree.
It is generated from ASDL, since 2.5, I think. And the AST keeps just the information that is relevant to the compiler: the nodes, the line information, and the column offset. We can hack the AST with Python's ast module. Python says, don't touch the parser, don't touch the bytecode; if you want to hack, use the AST. It offers a great API for hacking the AST. This is three lines of code. It adds an a to the beginning of every variable. Like x plus y plus z. When you transform the AST with this code, Python will understand it as ax plus ay plus az. And there is great documentation about the AST on a site called Green Tree Snakes. You can check that out. I think Python refers to it in the devguide or the official documentation.
And this is the last step of compiling, bytecode generation. Python uses a format called bytecode to store instructions. Python has a disassembler for these bytecode objects, but it doesn't have an assembler. I proposed an assembler on python-ideas last month, and it was rejected, like the peephole optimizer API. The peephole optimizer is the optimizer for bytecode objects. It has only a few optimizations. And if you want to add a new one, you need to write your own peephole optimizer. We are going to do that too.
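The slide with the variable-prefixing transformer is not reproduced in this transcript, so the following is only a sketch of what such a transformer might look like with ast.NodeTransformer from the standard library. The class name PrefixNames and the sample values are made up here.

```python
import ast

class PrefixNames(ast.NodeTransformer):
    """Hypothetical transformer: put an 'a' in front of every name."""

    def visit_Name(self, node):
        # Editing the node in place keeps its line and column information.
        node.id = "a" + node.id
        return node

tree = ast.parse("x + y + z", mode="eval")
tree = PrefixNames().visit(tree)

# After the transform the expression reads ax + ay + az,
# so those are the names that have to exist when it is evaluated.
result = eval(compile(tree, "<ast>", "eval"), {"ax": 1, "ay": 2, "az": 3})
print(result)
```

Because the source only mentions x, y and z, the evaluation would fail with a NameError if the transformer had not renamed them.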
And this is the last step of the execution model, evaluation. Python will go through every instruction in that bytecode with a for loop, and with the help of a GCC feature called computed gotos (labels as values). And it will push to and pop from the stack. See, the Python virtual machine is stack-based, so it uses push and pop. When you say 2 plus 2, it will push 2 to the stack, and 2 to the stack again. And then an instruction called BINARY_ADD comes in, pops the last two values from the stack, adds them together, and pushes the result back, pushes the 4 back. Yes.
So let's hack. The first hack I'm going to do is using the walrus operator on Python 3.7. It ships with Python 3.8, but with this you can use it on Python 3.7. For doing such a hack, we need to get in between reading the file and tokenization. The only step between reading the file and tokenization is the encoding. So we are going to add our own encoding. And when we are decoding, we will replace the code with Python 3.7-compatible code. Our strategy: we should run before tokenization happens. We need a new tokenizer, or we can modify Python's tokenize module. We will tokenize the source with that module, and we will alter the code by changing tokens and changing the positions of tokens. And then we are going to untokenize it and stream it back to the real tokenizer. For modifying tokens, we need something called the token module. I import that as tokens. When you add a new token to CPython, you need to add an ID for it and a name for it. Our token name is COLONEQUAL and the ID is 255. I add the token to the token module, and I add the token itself to EXACT_TOKEN_TYPES. It's a dictionary that the tokenize module uses to find the ID of a token. And for changing the tokenizing mechanism, you should look at the source code of the tokenize module. I looked, and I saw there is one rule, one main rule, that the tokenize module uses. So I am altering that by adding my walrus operator.
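The walrus backport itself depends on Python 3.7 internals and is not reproduced here. As a toy sketch of the same round trip (tokenize the source, alter tokens, untokenize), the following renames one identifier at the token level. The function name rewrite_name and the sample source are invented for illustration.

```python
import io
import tokenize

def rewrite_name(source, old, new):
    """Hypothetical helper: rename an identifier by editing tokens."""
    rewritten = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME and text == old:
            text = new
        # Two-item tokens make untokenize() work out the spacing itself,
        # so the result is valid code but not formatted like the input.
        rewritten.append((tok.type, text))
    return tokenize.untokenize(rewritten)

new_source = rewrite_name("foo = 1\nx = foo + 1\n", "foo", "bar")

namespace = {}
exec(new_source, namespace)
print(namespace["bar"], namespace["x"])
```

A real source-rewriting codec would run a step like this inside its decode function, so that the altered text is what reaches CPython's own tokenizer.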
The second step is writing a decode function for our encoding, walrus37. This decode function has an extra parameter besides the input and errors: the encoding. It is there because you can specify your own encoding along with walrus37. For example, you can say walrus37-utf8. And it will decode with the encoding you specified. Also, it will alter the code. I'm just streaming it to the code that generates the walrus source and returning with the encoding you specified. And this is the last step: adding the new encoding. We need a search function. codecs.register takes the search function, and if we return a CodecInfo, codecs.register lets us register this codec. I'm just stripping walrus37 and the dash, and checking whether you have specified an encoding or not. If you didn't specify one, it will use UTF-8 and return the CodecInfo.
By the way, you need to register this encoding every time a Python session starts. For this, we need to hack the site module. Has anyone heard of the site module? No? We don't care about the site module itself, actually. We care about its behavior. The site module processes .pth files. And we are going to inject our code into a .pth file.
This is the code for implementing rejected PEPs, like PEP 313. We are going to use Roman literals as Python integers. Our strategy is: we are going to run when we are imported, we are going to be effective only inside of this Allow scope, and we are going to raise proper error messages when it is used outside of this scope. For implementing this PEP, we need to use the ast module's NodeTransformer. It goes through every name, like ax or capital X, I, V, and checks: is it a Roman literal or not? If it is, it replaces it with its integer representation by returning a new node. For scoping, we need an extra transformer that goes through every with statement and checks whether the with statement's name is Allow, like this.
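The Roman-literal transformer could be sketched along the following lines. This is an assumption about its shape, not the code from the slides: the names ROMAN, roman_to_int and RomanLiterals are invented here, and the sketch skips the with-statement scoping and the error messages the talk describes.

```python
import ast

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(text):
    """Convert a Roman numeral, reading it from right to left."""
    total = 0
    previous = 0
    for char in reversed(text):
        value = ROMAN[char]
        # A smaller digit to the left of a larger one is subtracted.
        total += -value if value < previous else value
        previous = value
    return total

class RomanLiterals(ast.NodeTransformer):
    """Hypothetical transformer: replace Roman-looking names with ints."""

    def visit_Name(self, node):
        # Only names that are being read and use Roman letters exclusively.
        if isinstance(node.ctx, ast.Load) and set(node.id) <= set(ROMAN):
            constant = ast.Constant(value=roman_to_int(node.id))
            return ast.copy_location(constant, node)
        return node

tree = RomanLiterals().visit(ast.parse("total = XIV + II\n"))
ast.fix_missing_locations(tree)

namespace = {}
exec(compile(tree, "<roman>", "exec"), namespace)
print(namespace["total"])
```

Note that this sketch rewrites any name made only of the letters I, V, X, L, C, D and M, which is one reason a real implementation would want to limit the rewrite to an explicit scope.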
If it is Allow, it will get the first argument, and then it will get the transformer we wrote for that argument and transform only the with body, not the whole file. Just the with body. And then it will copy the locations of the nodes to the new node, fix the missing line numbers, and return the new with node. The runtime: we should run when we are imported. So we are going to call a function called allow, with lowercase. This will import ourselves, get the file, read the file, and parse the file to an AST. Then it will apply our transformer, the transformer we wrote here, to the file. And then we are going to compile the transformed AST manually and execute it under the module's namespace. Yeah.
Another funny hack is Rust-like return. Rust returns implicitly. So in Python, with that RLR decorator, we are going to return implicitly. We need two things: we should return the last statement, the last expression, and we should allow infinite branching, like if/else inside of if/else; we should return all possible last expressions. The first thing, to transform the AST: we need to go through every function definition and check whether there is an RLR decorator. If there is, we need to remove that decorator, since we don't want recursion, and then call the adjust method. The adjust method will find the last statement of a function. If it's an expression, it will pop that statement and return it as an ast.Return. If it is an if statement, it will adjust the if's body with that same function. And this is how we support infinite branching.
The other thing is an alternative to the peephole optimizer. We are going to optimize by eliminating local vars. Maybe you have heard, there was a blog post about it called The CPython Bytecode Compiler is Dumb: it doesn't eliminate local variables. It doesn't make that optimization. So we are going to make it ourselves. Our strategy is: we are going to run when a function is decorated, and we only make the optimizations the user has specified.
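The implicit-return transformation described above might be sketched as follows. This is a simplified assumption rather than the talk's code: it rewrites every function in a source string instead of looking for a decorator, and the names ImplicitReturn and exec_with_implicit_return are invented here.

```python
import ast

class ImplicitReturn(ast.NodeTransformer):
    """Hypothetical transformer: return the last expression of a function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # handle nested functions as well
        self._adjust(node.body)
        return node

    def _adjust(self, body):
        if not body:
            return
        last = body[-1]
        if isinstance(last, ast.Expr):
            # A bare expression statement becomes a return statement.
            body[-1] = ast.copy_location(ast.Return(value=last.value), last)
        elif isinstance(last, ast.If):
            # Recurse into both branches; elif chains live in orelse.
            self._adjust(last.body)
            self._adjust(last.orelse)

def exec_with_implicit_return(source):
    tree = ImplicitReturn().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<implicit-return>", "exec"), namespace)
    return namespace

SOURCE = '''
def sign(n):
    if n < 0:
        "negative"
    elif n == 0:
        "zero"
    else:
        "positive"
'''

sign = exec_with_implicit_return(SOURCE)["sign"]
print(sign(-1), sign(0), sign(5))
```

Recursing into both the body and the orelse of each if statement is what lets arbitrarily nested branches each return their own last expression.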
And we are going to put the code, the bytecode, of the function back when it's done. For optimizing, we need a decorator. This decorator will create a bytecode object from the function. The bytecode object comes from this module, and it is immutable. So we are going to return a new bytecode object every time we make an optimization. And then we will put the code of the bytecode object back at the end, and the function will be optimized. I'm going to show an example of an optimization. This is eliminating local variables. We go through the bytecode and find which symbol's value is a constant and which symbol is used where, keep track of that, and find the unused symbols in order to eliminate local variables. And then we remove them manually, using a for loop to remove all the unnecessary symbols. You can't assemble bytecode with standard library modules, but you can use a module called bytecode, a third-party module from Victor Stinner. It has its own peephole optimizer implementation that you can hack, and it assembles bytecode great.
The last hack I'm going to show is Catlizor v1 extended. Catlizor is a module for hooking functions, like audit hooks, but more general in concept. It will hook your functions. The first version of Catlizor uses decorators to hook your functions, with pre-hooks, on-call hooks, and post-hooks. But in the extended version, I didn't want to mutate the functions you give to us. So I am using a method that hacks CPython's calling function instead of the user's function. So this is the code for hooking into _PyFunction_FastCallKeywords. CPython uses this function when a bytecode instruction calls a Python function. I'm using James Powell's hooking code for this, a slightly modified version called alpy hook. And this will overwrite the memory address of CPython's function with my modified version of CPython's function. And this is how we are going to inject our code into CPython's evaluation step.
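To make the decorator-based approach of the first version more concrete, here is a minimal pure-Python sketch of pre, on-call and post hooks. It is not Catlizor's actual API: the hook decorator, its parameters, and the exact moment each hook fires are assumptions made for illustration.

```python
import functools

def hook(pre=None, on_call=None, post=None):
    """Hypothetical decorator factory; not Catlizor's real interface."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None:
                pre(func.__name__)                    # before the call
            if on_call is not None:
                on_call(func.__name__, args, kwargs)  # as the call is made
            result = func(*args, **kwargs)
            if post is not None:
                post(func.__name__, result)           # after the call
            return result
        return wrapper
    return decorator

log = []

@hook(
    pre=lambda name: log.append(("pre", name)),
    on_call=lambda name, args, kwargs: log.append(("on_call", name, args)),
    post=lambda name, result: log.append(("post", name, result)),
)
def add(a, b):
    return a + b

total = add(2, 3)
print(total, log)
```

A decorator like this replaces the user's function with a wrapper, which is the kind of mutation the extended version avoids by hooking the interpreter's call path instead.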
For modifying CPython's function, I directly copied its code and then added a check for the Catlizor sign. If a function uses Catlizor, Catlizor will sign it. If the function has a Catlizor sign, I will call Catlizor before the function call, on the function call, and after the function call. Yeah, this is it. Thank you for coming and listening. You can contact me through Twitter. OK.