So, I am Elise. I often go by Elise Zero Two online. The talk I'm holding today is on breaking the black-box security coprocessor inside of the Nintendo Switch. A small bit about me: I am a security researcher with a strong interest in Nintendo consoles. Back between 2018 and 2019, I used to maintain one of the largest FOSS custom firmware solutions for the Nintendo Switch. In this talk, I will be presenting my research on the almost black-box security coprocessor present on the Tegra X1. Our end goal by the end of the presentation is to be able to sign code to run at the highest security level offered by the coprocessor. We will do this by building up an exploit chain around a vulnerability in the authentication ROM that was discovered by others. I started my research before this vulnerability was known, but it is what we will build up to being able to exploit in this talk. Before we start though, I would just quickly like to remind everyone watching to stay hydrated, as it's incredibly important. So, the Tegra X1. The Tegra X1 is an SoC produced by Nvidia. It's used in a wide range of applications. Back in 2018, an exploit that I assume a lot of you have heard of, known as Fusée Gelée, was published. It affected the USB stack in the boot ROM and allowed us to gain arbitrary code execution on the boot processor, which at the time completely broke the boot chain security model. This led to custom firmware being developed relatively fast, and Nintendo being Nintendo, Nintendo don't like custom firmware. So, in November 2018, they attempted to regain control of the boot chain by utilising the security coprocessor. So, what is the Tegra security coprocessor, or the TSEC from here on out? The TSEC is a security coprocessor found in a lot of Nvidia hardware, but the Tegra X1 is what interests me because of my interest in Nintendo consoles. It's a Falcon v5 microprocessor with a cryptographic coprocessor built in.
Nvidia basically just took a design they already had and decided to slap a crypto coprocessor on the chip for security. We can assemble and disassemble firmware for the Falcon v5 microprocessor thanks to work already done by the envytools team, because the architecture is shared across quite a few Nvidia devices. The TSEC has its own memory space, split into data and instruction memory. The instruction memory is 0x8000 bytes long, whereas the data memory is only 0x4000 bytes long. There are also three security states that execution can happen under: No Secure, which is the default, Light Secure, and Heavy Secure. Now we will look at the details of the cryptographic coprocessor inside the TSEC a bit closer. It has eight 128-bit crypto registers. It has 64 128-bit hardware secrets. There's a DMA controller which can be used with an override to read or write to crypto registers. Hardware secrets have an access control level, and crypto registers also have an access control level. However, the crypto registers' access control levels change based on instructions. Typically, the permissions for the output register of a crypto instruction will be made from a bitwise AND of the ACLs of operand one and operand two, so the minimum amount of availability between the two operands. There is also a hardware 128-bit AES implementation. So, more on the security modes. As I did mention, the TSEC has three security modes: No Secure; Light Secure, which is not relevant to us, as it is entered from Heavy Secure mode; and finally Heavy Secure. So, No Secure mode. This is the mode that the TSEC enters after initialisation. Firmware run in this mode is unsigned. You can load any firmware blob in No Secure and just run it. None of the hardware secrets are readable. However, you are able to load any hardware secret into the crypto registers.
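The ACL rule described here, where the output register only keeps the permissions that both operands grant, can be sketched like this. The bit assignments below are illustrative assumptions for demonstration, not the real hardware encoding.

```python
# Illustrative sketch of the crypto-register ACL rule described above.
# The bit assignments are assumptions, not the real hardware encoding.

NS_READ = 0b001  # readable from No Secure mode (assumed bit)
LS_READ = 0b010  # readable from Light Secure mode (assumed bit)
HS_READ = 0b100  # readable from Heavy Secure mode (assumed bit)

def output_acl(op1_acl: int, op2_acl: int) -> int:
    """The result register's ACL is the bitwise AND of both operands',
    so it keeps only the permissions that BOTH inputs grant."""
    return op1_acl & op2_acl

# A register holding a hardware secret is not NS-readable, so any result
# derived from it also stops being readable from No Secure mode:
secret_reg = HS_READ
scratch_reg = NS_READ | LS_READ | HS_READ
assert output_acl(secret_reg, scratch_reg) & NS_READ == 0
```

This is why loading a hardware secret into a crypto register makes that register unusable from No Secure mode: everything computed from it inherits the stricter ACL.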
But as I said, the crypto register will inherit the access control level from the hardware secret, and you won't be able to use the crypto register from No Secure anymore. You are intended to initialise context to enter a signed Heavy Secure firmware blob from No Secure. So, Heavy Secure. For Heavy Secure mode, as I said, the firmware must be signed. The memory pages for the Heavy Secure blob must be marked as secret. If you enter a Heavy Secure firmware blob, it will always result in execution starting from right at the start of the authenticated pages. You could jump 200 bytes into the authenticated blob, but you will still actually start execution at the start of the page. You can perform crypto operations with some of the hardware secrets. Upon entering Heavy Secure, you need to go through the auth ROM. This is done by jumping into pages marked as secret, which triggers the authentication ROM via a secure fault. That results in the authentication ROM, which is in silicon, taking over control. Our end goal is to gain arbitrary code execution in this Heavy Secure mode. As I mentioned, there are attributes on instruction memory pages, such as them being marked as secret. You are able to mark pages as secret by writing to the TLB. However, you cannot mark pages as authenticated. Only the authentication ROM can mark pages as authenticated. Entering a new group of pages marked as secret will trigger a secure fault, as I mentioned. If the auth ROM succeeds, the pages marked as secret will also then gain the authentication flag. They keep the authentication flag until execution falls outside of the range of authenticated pages. If you attempt to read memory from pages marked as secret, it will return the constant 0xDEAD5EC1. Prior to firmware update 6.2.0, Nintendo's firmware for the TSEC that ran during the bootloader consisted of only three stages: boot, keygenloader and keygen.
Both keygenloader and keygen are signed and execute in Heavy Secure mode, whereas boot executes in No Secure mode and sets up the context. Keygen is encrypted using a key seeded by a hardware secret. The key is generated and the blob decrypted by keygenloader. Keygenloader is signed but unencrypted. It's not very interesting at all. It calculates a CMAC over the boot payload to verify integrity, optionally decrypts keygen, and then optionally runs keygen. So, as I mentioned, keygenloader attempts to calculate a CMAC over the boot blob. It expects the boot blob to always be at the base of instruction memory, so zero, because this is how Nintendo have it. They load their firmware right at the base of instruction memory, and they assume their boot blob will always be at the base of instruction memory. The size and expected CMAC come from an unsigned metadata structure, which is positioned right behind the boot blob in the combined firmware blob. Can we bypass this check somehow? Actually yes, trivially and in multiple ways. The simplest one, and my favourite one to do, is to control the entry point for the TSEC. You can set the entry vector to wherever you want in instruction memory, and the TSEC will begin execution from there after a reset. Nintendo assume that the TSEC will always start from zero at a reset. So, when they verify their boot payload, they verify it from the base of instruction memory, but the TSEC doesn't have to start executing at the base of instruction memory. Thus we can just append our own boot blob to the end of Nvidia's firmware blob, set the entry vector to that address, and suddenly we've bypassed this check completely. This brings our vulnerability counter up to one. Next we will look at a stack smash inside of keygenloader. Up there is the function signature I made for the keygenloader blob. This is what it would look like in pseudo-C.
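The entry-vector bypass described above can be sketched as follows. This is a minimal sketch under assumptions: the page size, the padding, and the helper name are mine; only the layout idea (our code appended after Nvidia's blob, entry vector pointed past the CMAC-covered region) comes from the talk.

```python
PAGE = 0x100  # assumed page granularity for illustration

def build_combined_blob(nvidia_blob: bytes, our_boot: bytes):
    """Append our own boot blob after Nvidia's firmware. The unsigned
    metadata CMAC still only covers Nvidia's boot blob at offset 0, but
    we point the entry vector at our code instead of address 0."""
    padded = nvidia_blob + b"\x00" * (-len(nvidia_blob) % PAGE)
    entry_vector = len(padded)  # the TSEC starts executing here after reset
    return padded + our_boot, entry_vector

combined, entry = build_combined_blob(b"\xAA" * 0x120, b"\x90" * 8)
# execution starts in our appended code; the CMAC check over the blob at
# offset 0 passes untouched and is simply never relevant to our code
```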
So, to validate the boot blob, it will copy the boot blob from the base of instruction memory, as I said, to the base of data memory. The size of the boot blob comes from the metadata structure passed as an argument to keygenloader by the boot blob. The structure is completely unsigned and just trusted, not checked in any way. We can control the starting stack pointer too, making this slightly cleaner. We can smash the stack. Vulnerability counter: two. This is why you should always make sure to keep your memcpy sizes validated. So, building up ROP from that stack smash, I will explain how I did that. I appended my ROP chain to the end of the combined firmware blob that I made. I patched the size of the boot blob in the metadata structure to be the total size of the combined firmware blob. I then calculated a starting stack pointer that would overlap with where the ROP chain is at the end of the combined firmware blob. Now we have gained ROP inside of keygenloader. Now we will use the ROP to be able to get the key to decrypt the keygen blob. So, to obtain the key for decrypting keygen, we need to do four main things. First, we need to call a function to generate the encryption key. It leaves the key in crypto register 1. It requires register 10, which is the first argument, to be one. It also requires register 11, which is the second argument, to be non-zero. For the second step, we will call a function to read the key from crypto register 1 and write it to data memory. This function requires that the first argument be one and the second be 16-byte aligned. The third step is to bring the TSEC out of lockdown, and then finally we want to return to our success handler. So, our ROP chain looks something like this. I hope it's readable from there. The first address is a gadget that sets register 10 to one and then returns. The second address is the function to generate the key.
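The smash itself can be reduced to a toy model: keygenloader memcpys the claimed boot-blob size from the instruction memory base to the data memory base, and we pick a starting stack pointer inside the copied region. Only the 0x4000 DMEM size comes from earlier in the talk; the rest of the numbers here are illustrative.

```python
DMEM_SIZE = 0x4000  # data memory size mentioned earlier in the talk

def plan_smash(combined_blob_len: int, rop_chain_offset: int) -> int:
    """keygenloader trusts the unsigned metadata size and memcpys that
    many bytes from IMEM[0] to DMEM[0]. Return a starting stack pointer
    that lands on the copied ROP chain (illustrative model only)."""
    assert rop_chain_offset < combined_blob_len <= DMEM_SIZE
    # after the copy, DMEM[rop_chain_offset:] holds our chain; with the
    # stack pointer placed there, the next `ret` pops our first gadget
    return rop_chain_offset
```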
It also leaves a 0x10-aligned pointer in register 11, thankfully. The third address is the same as the first: it sets register 10 to one and returns. The fourth address is a gadget that will write the crypto register to data memory. The fifth address is the exit handler for keygenloader. This one is very important, because the TSEC is in lockdown. We're not able to read any of the memory from the boot processor; it will just return that constant 0xDEAD5EC1 value. So, we have to exit lockdown. The sixth and final address is the success handler in our No Secure boot payload. If we run this, we can then read the decryption key from DMEM via DMA from the boot processor. The keygen blob is AES-CBC encrypted, and we can now just decrypt and analyse it. That's quite obviously not the key; that's the BLAKE3 hash of the key needed to decrypt the keygen blob. So, now we have a decrypted keygen. We can explore it a bit. On the right up there is my pseudo-C function signature and just an overview of what it does, like initialisation. It takes a 16-byte key seed and an unsigned 32-bit integer key version. It then calls a function to generate a key, passing both the key seed and the key version as arguments. After this, it clears the active signature and all crypto registers. Looking closer at what the function that generates the desired key does, we can see if there's anything interesting in there. Right at the top, it checks a host1x register for a magic value that's written by the bootloader. This is completely irrelevant, but that value does have to be written by the bootloader for it to work. Then it will load the provided key seed into crypto register 0 and then optionally run one of the two present key generation algorithms, depending on your specified key version. Both algorithms will write the result back to the key seed pointer that you give as an argument to keygen. Remember that.
Finally, it will always write whatever is at the key seed pointer to the four SOR1 MMIO registers. If the keygen algorithm was run, then it will be the output of the keygen algorithm; otherwise, it will just be the input key seed. Now, something here might be standing out to some of you already, given the thing that I said before. The key generation algorithm writes the result to the seed pointer that we provided. There is zero validation done on this address. It's just an address that gets passed in; it will be used for the input seed, and then the result will be written back through it. If we can control enough of the output of one of the algorithms, we could potentially smash the stack by pointing the input seed to overlap with the return address on the stack for the function that calls generate keys. Key version 2's algorithm is significantly smaller, and it also only uses Hardware Secret 0, which is readable from Heavy Secure mode, compared to the other key generation algorithm, which uses Hardware Secret 0x3F, which is not readable. If we can obtain Hardware Secret 0, we could theoretically control the outputs of the key generation algorithm and get it to give us whatever we want as the output. This increases our vulnerability counter from two to three. To obtain readable hardware secrets, the optimal setup would be a signed firmware that doesn't overwrite a crypto register before we're able to gain ROP, and that also has a gadget for then writing that crypto register to memory. Thankfully, exactly that is provided by an HDCP firmware for the Shield that Nvidia left in as a debug thing in one of their firmware builds. Thanks, Nvidia. This increases our vulnerability counter to four. So, that blob starts with a return instruction. As I said earlier on, with Heavy Secure payloads, if you jump into them, it always will start your execution from the start of that Heavy Secure payload.
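The unvalidated write-back can be modelled like this. A toy sketch: `keygen_algorithm` is a deterministic stand-in I made up (the real key-version-2 algorithm is an AES construction over Hardware Secret 0), and the stack slot address is illustrative.

```python
import hashlib

DMEM = bytearray(0x4000)  # data memory, stack included (toy model)

def keygen_algorithm(seed: bytes, key_version: int) -> bytes:
    """Stand-in for the real algorithms; only determinism matters here."""
    return hashlib.sha256(bytes([key_version]) + seed).digest()[:16]

def generate_key(seed_ptr: int, key_version: int) -> None:
    """Model of keygen's helper: read 16 bytes through the caller-supplied
    pointer and write the result straight back through it, with no
    validation of the address at all."""
    seed = bytes(DMEM[seed_ptr:seed_ptr + 16])
    DMEM[seed_ptr:seed_ptr + 16] = keygen_algorithm(seed, key_version)

# Point seed_ptr at the saved return address on the stack, and the 16-byte
# "key" overwrites it. If we can steer the output, that's ROP inside keygen.
saved_return_addr_slot = 0x3FF0  # toy stack location
generate_key(saved_return_addr_slot, 2)
```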
You can't jump past that first return instruction. Well, you can, but it will push you back there anyway. The pages do not lose their authentication status until the program counter actually points outside of the page range; for as long as the program counter stays inside of an authenticated blob, that blob stays authenticated. As I said earlier on, Nvidia took something that was not meant for security and slapped a crypto coprocessor on top of it. We can see this here: you can branch without linking into Heavy Secure blobs. For those of you who don't know what branching without linking is: typically, when you call a function, your return address gets pushed to the stack, and then when that function returns, the address is popped from the stack, and that address is where code execution will continue from. If you branch without linking, your return address is never pushed to the stack, because it's not treated as a function call; it's just treated as branching in the code. And you're allowed to branch without linking into Heavy Secure blobs. So: push a ROP chain onto the stack, branch without linking to the payload, gain ROP as soon as the blob returns. The blob starts with a return instruction. We have ROP inside of this blob right from the beginning, before anything else is done. Again, thank you, Nvidia. But yeah, as this works for all Heavy Secure blobs, this is a hardware vulnerability. This cannot be fixed without a new revision. There is a new revision, but not in any of the Switches; it's in a completely different Tegra chip. To mitigate this, most Heavy Secure blobs will clear crypto registers and the active signature before returning, rendering this in most cases fairly useless. But still, this increases the vulnerability counter up to five, this being the first hardware vulnerability of those five. So, obtaining readable hardware secrets. Now we have a ROP chain that will write the secret into DMEM for us to read out. You can see it on the right.
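The branch-without-link primitive can be reduced to a toy stack model. This is a sketch of the call-versus-branch distinction described above, not Falcon semantics in full.

```python
def enter_hs_blob(with_link: bool) -> str:
    """Toy model: the Heavy Secure blob's first instruction is `ret`,
    which pops the top of the stack and jumps there."""
    stack = ["attacker_gadget_0"]      # ROP chain pushed from No Secure mode
    if with_link:
        stack.append("caller_return")  # a `call` pushes a return address
    # ... execution enters the authenticated pages; first insn is `ret` ...
    return stack.pop()

assert enter_hs_blob(with_link=True) == "caller_return"       # normal call
assert enter_hs_blob(with_link=False) == "attacker_gadget_0"  # branch: ROP
```

The attacker's chain then runs with whatever crypto-register and signature state the blob left behind, which is exactly why well-behaved blobs clear those before returning.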
It's Falcon v5 assembly, but it should be fairly easy to understand with the comments. So, we have one more problem there. If we run this, the hardware secrets will get written into data memory, but the TSEC will still be in lockdown, and this firmware blob contains no gadget that's easily usable to exit lockdown. So, if we want to exfiltrate those keys from memory, we need to exit lockdown somehow. Once more, the solution is keygenloader. So, keygenloader, as I mentioned earlier, will bring the TSEC out of lockdown at the end. However, keygenloader brings the TSEC out of lockdown no matter whether it was successful in running or not. Always, when keygenloader exits, it will exit lockdown. So, we can just load in the keygenloader payload right afterwards and call it with values that prevent it from decrypting or running keygen. It won't do any keygenloading; it will just behave nicely for me and bring the TSEC out of lockdown. This brings our vulnerability counter up to six. And here are all of the BLAKE3 hashes for the readable hardware secrets. There are a lot of things that can be done with these secrets. These are obviously not the secrets; they're the hashes. And now, back to keygen. We have Hardware Secret 0 now. Again, thank you, Nvidia. We can reimplement the algorithm for key version 2 and attempt to brute force a seed that fulfils our output requirements. So, to create a valid seed: we cannot control the first four bytes of the seed. They are always the return address for where the generate key function is called inside of keygen. The first four bytes of the output will be used as the first address in our ROP chain. Our ROP gadgets will discard the middle eight bytes. However, the final four bytes are actually used in a gadget. So, our requirements for the seed: the first four bytes of the output will be the first entry in our ROP chain, which will be 0x940. Luckily for us, the address bus is only 16 bits wide.
If the address bus had the full 32 bits, it would have been a whole bunch harder to calculate a valid seed. But because the address bus is only 16 bits wide, we only have to match the low 16 bits of the first four output bytes to get a working ROP chain, because the address will be decoded from the low 16 bits no matter what's in the high 16 bits. The final four bytes will be used as the parameter for a gadget, as I mentioned. It requires that the lower four bits be 0. Bits 16, 17 and 18 represent a three-bit register selection, and we want crypto register 1, so bit 16 must be 1 and bits 17 and 18 must be 0. So, calculating a valid seed. We can now attempt to find a valid seed, now that we know our output requirements. My very, very, very unoptimised algorithm takes about 10 seconds to find a valid seed. With the seed, we can completely control the output of the key generation algorithm, and with the seed pointing at where the top of the stack will be when the generate key function is called, we now have ROP inside of keygen. So, exploiting keygen. Keygen doesn't clear crypto registers when it gets called. I don't know why. Do the Nintendo developers know why? Probably not. There is a gadget for using the csigenc instruction with crypto register 1 as the key. Crypto register 1 is not polluted when we obtain ROP, so any value we load into crypto register 1 before getting ROP inside of keygen will persist in crypto register 1 when we finally have execution. We can use this gadget to generate a signing key. So, the csigenc instruction. It's a bit of a strange instruction, so let's take a closer look. It takes two arguments: an output crypto register and an input key crypto register. Unlike the regular AES instructions inside of the TSEC, csigenc doesn't derive the access control level for the output register from the input register.
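The seed search described above can be sketched like this. `keygen_v2` is a hypothetical stand-in (the real algorithm is a small AES construction over Hardware Secret 0, which I am not reproducing here); the output constraints, the pinned first four seed bytes, and the 0x940 first gadget address are from the talk.

```python
import hashlib
import itertools
import struct

SECRET0 = bytes(16)  # stand-in value; use the dumped Hardware Secret 0 for real

def keygen_v2(seed: bytes) -> bytes:
    """HYPOTHETICAL stand-in for the key-version-2 algorithm. Any
    deterministic 16-byte -> 16-byte function shows the search shape."""
    return hashlib.sha256(SECRET0 + seed).digest()[:16]

def output_ok(out: bytes, first_gadget: int = 0x940) -> bool:
    """Check the output constraints described above."""
    w0, _, _, w3 = struct.unpack("<4I", out)
    if (w0 & 0xFFFF) != (first_gadget & 0xFFFF):
        return False  # 16-bit address bus: only the low 16 bits decode
    if w3 & 0xF:
        return False  # gadget parameter must be 16-byte aligned
    if ((w3 >> 16) & 0b111) != 0b001:
        return False  # 3-bit register select: must pick crypto register 1
    return True

def find_seed(fixed_prefix: bytes) -> bytes:
    """Brute-force the 12 controllable bytes; the first 4 are pinned to
    the in-keygen return address and cannot be chosen. Roughly 2^23
    attempts are expected for the ~23 constrained output bits."""
    for n in itertools.count():
        seed = fixed_prefix + n.to_bytes(12, "little")
        if output_ok(keygen_v2(seed)):
            return seed
```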
It is always just a constant that allows you to read it from Heavy Secure mode, no matter what the input ACL was. With this, we can encrypt the active signature using non-readable hardware secrets and still read the output of the operation. So, the hardware secrets that I was not able to extract because they're not readable in Heavy Secure mode, we can utilise those here a bit. So now, the topic of fake signing keys. I would like to preface this slide by stating that I did not find this vulnerability. It is publicly disclosed on SwitchBrew already and credited to the list of people on the screen: plutoo, hexkyz, shuffle2, SciresM and motezazer, independently. I got that from the wiki, so I'm just trusting it. So, when authenticating pages, you provide both a 16-byte signature and a 16-byte seed. In every single Nvidia-signed Heavy Secure blob, this seed is just 16 null bytes. The auth ROM does not validate the seed at all. You can pass in whatever seed you want, and if the seed works with the key, it's okay. Hardware Secret 1 is the hardware secret that the auth ROM uses for its operations, so we can use it as the key for the csigenc instruction, which will encrypt the current active signature using Hardware Secret 1. With AES, if you have a plaintext and encrypt it with a key, just regular AES-128, you get your ciphertext, and if you decrypt that ciphertext with the same key, you get the same plaintext back; this symmetry is how the fake signing exploit works. Using Hardware Secret 1 as the key for csigenc will give you a fake signing key. You set the seed to the signature of the firmware where csigenc was executed.
So, in our case, the signature of keygen, that is the seed. And then, using that fake signing key which we have obtained, we can sign and run any code in Heavy Secure mode and have full access to it in every way; this is the highest level of execution you can get on this security coprocessor. So, we have ROP inside of keygen. We want to build up a ROP chain that will allow us to obtain that fake signing key. First, we will load Hardware Secret 1 into crypto register 1. Then we will move the stack to a location where we can overlap it with the key seed. We push the success handler onto the stack, then we push the values of the two registers that later get popped from the stack by a gadget. We push our csigenc gadget to the stack, we push another value that will later be popped from the stack to a register, and then we move the current stack pointer into a register, as it will later be read from that register in a gadget. We place the controllable parts of the seed in three separate registers that the keygen blob will store on the stack for us nicely when initialising itself (thank you, Nintendo), and that will nicely assemble our seed for us. We finally set up the calling arguments: the key seed that will live on the stack, and key version 2, to run the keygen algorithm that we want. And then we just call keygen normally. Here you can see what the assembly for all of those previous steps looks like; it's commented. We load Hardware Secret 1 into crypto register 1 and then set up the ROP chain. I won't read it all out, but this is the seed that we're using here. We have a lot of excess room in the seed for brute forcing, so I decided to make two parts of my seed magic values. With that, we now have a fake signing key stored in the SOR1 MMIO registers that we can read from the boot processor. Implementing the signing algorithm in Rust allows us to sign firmware blobs.
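One way to model why this fake signing key works: assuming, as the talk describes, that the auth ROM turns the caller-supplied seed into a signing key using Hardware Secret 1, and that csigenc performs that same keyed operation on the active signature. The stand-in cipher, the secret value, and the signature value below are all placeholders.

```python
import hashlib

def enc(key: bytes, block: bytes) -> bytes:
    """Placeholder for one AES-128 block encryption; only the fact that
    it is a deterministic keyed function matters for the argument."""
    return hashlib.sha256(key + block).digest()[:16]

HW_SECRET_1 = b"\x01" * 16       # placeholder; the real value is in silicon
keygen_signature = b"\x22" * 16  # placeholder signature of the running blob

# csigenc with Hardware Secret 1 as the key encrypts the ACTIVE signature:
fake_signing_key = enc(HW_SECRET_1, keygen_signature)

# Model of the auth ROM: it derives its verification key from the seed we
# hand it, using the same hardware secret. Set seed = keygen's signature:
auth_rom_key = enc(HW_SECRET_1, keygen_signature)

# Both sides computed the same thing, so we now hold the key the auth ROM
# will use and can sign arbitrary Heavy Secure code with it.
assert fake_signing_key == auth_rom_key
```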
This is a screenshot of my tool that I wrote to sign firmware blobs to run in Heavy Secure. The blanked-out parts are the fake signing key and the seed, which would be the signature of the keygen blob. In conclusion, I discovered and presented six vulnerabilities today for the exploit chain, with one of them being a hardware design flaw that can't be fixed. This impacts boot chain security, general security, and DRM, such as HDCP and Widevine, for all devices that use this version of the TSEC. All of the code is available on GitLab under the link displayed on the slide down there, and it obviously contains no secrets or firmwares for legal reasons, as those are not my property; they belong to Nintendo. However, the key seed that I generated from the hardware secrets does not belong to Nintendo, so there's enough on there for you to go generate a fake signing key yourself in a matter of minutes if you want to. Thank you for being here for this talk today. Again, I would like to remind everyone to stay hydrated, as it's very warm and it's important. If anyone has any questions, I'm willing to take them, but there's a bit of a balance of questions I will and will not answer, because I'm treading legal ground here. First, one question from my side: who wants to use a Nintendo Switch now in the future? Any questions? Hi. Can you hear me? Thanks for the talk. I have a question about the verification part. I think it was the keygenloader part, where the verification starts at address zero, the base address. What I didn't quite get is why the verification stops at some point, because you mentioned that you load your code somewhere at a higher address while the verification starts at zero, the base address. Why would the verification stop at some point and not consider your own code? Because they're trying to backwards-verify their boot blob, which is unsigned.
They're validating just the boot blob to make sure that the unsigned boot blob was not tampered with. They're doing it from a Heavy Secure payload, trying to make sure that their boot chain wasn't compromised. And it knows the size of the boot payload from the metadata table, so it just does a memcpy and then checks that. It's a fixed size, right? It's a size in a metadata table, yes. Okay, awesome. Thank you. More questions? Yeah, hi. Also, thanks for the talk. This is more of a general question. When you're looking for gadgets in some firmware blob for ROP, do you do this with tool support, or do you write your own scripts? Can you elaborate a bit on this? Thanks. I didn't quite understand that, sorry. So, when you're looking for gadgets in firmware blobs, how do you do this typically? Completely manually. I've spent probably thousands of hours on this research. It's a custom architecture, so there's no tooling for finding gadgets. And the gadgets I need are fairly unique, so all of the gadget hunting was done by hand. Wow, I think this requires another applause. Yes, more questions? I don't see any. So maybe you have some questions, and more privacy, later on; maybe you are available? If anyone wants to ask me something afterwards, I'm more than happy to do so, and I can discuss more outside of this room than I can inside of this room. Thank you, Elise.