I think this assembler is getting too big: it is currently 6K. So I've been thinking of ways to shrink it, to simplify it and do more things using the same code. And I think I've got something.
So here is our instruction set, laid out in columns and broken down by the A, B and C fields to make it easy to understand how it all works. In terms of addressing modes, we have 8 here, 9 for this one and 10 for this one, plus 11 if we're going to count the implicit-argument instructions, which is probably the better way to do it.
However, how many ways are there to actually encode instructions on the machine? Well, there are the single-byte instructions, which take an implicit argument. That's 1. There are the zero page instructions, which take a single-byte argument. That form is used by plain zero page (B equals 1), the indirection operators (B equals 0 and B equals 4), normal indexing with a zero page argument (B equals 5), and the immediate constant (B equals 2). The same encoding works with all of these. So that's 2. Then we have the relative branch, which is special because it needs special treatment to encode the relative offset. That's 3. Then we have the absolute addresses, which are these (B equals 3), these (B equals 6 and 7), also these, and our indirect jump. So that is, if I remember correctly, four different ways of encoding instructions. Just four.
So if you look at our code down here under place code, this code needs to know only about the instruction encoding forms, and I think we've got them all. This chunk of code in record expr is handling all the encoding forms that we need to know about. Absolute addresses work; we've tested that. Zero page addresses work, complete with instruction shortening if necessary. Relative instructions work. And the one other form is handled under record bytes, because it never goes through this code path.
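As a rough illustration of the four encoding forms (a sketch, not the assembler's actual code), the C below splits an opcode into its A, B and C fields using the usual aaabbbcc layout and maps each form to an operand size. The names `EncodingForm`, `operand_bytes`, `alu_form` and the field helpers are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* The 6502 opcode byte viewed as aaabbbcc. */
uint8_t field_a(uint8_t op) { return op >> 5; }
uint8_t field_b(uint8_t op) { return (op >> 2) & 7; }
uint8_t field_c(uint8_t op) { return op & 3; }

/* Hypothetical names for the four encoding forms. */
typedef enum {
    ENC_IMPLICIT, /* opcode only */
    ENC_BYTE,     /* opcode + one byte (zero page, immediate, indirect) */
    ENC_RELATIVE, /* opcode + one signed branch offset */
    ENC_WORD      /* opcode + 16-bit little-endian address */
} EncodingForm;

unsigned operand_bytes(EncodingForm form)
{
    switch (form) {
    case ENC_IMPLICIT: return 0;
    case ENC_BYTE:
    case ENC_RELATIVE: return 1;
    case ENC_WORD:     return 2;
    }
    assert(!"unknown encoding form");
    return 0;
}

/* Encoding form for the ALU block (C = 1), chosen from the B field alone:
 * B = 3, 6 and 7 take an absolute address, everything else takes one byte. */
EncodingForm alu_form(uint8_t op)
{
    assert(field_c(op) == 1);
    uint8_t b = field_b(op);
    return (b == 3 || b == 6 || b == 7) ? ENC_WORD : ENC_BYTE;
}
```

For example, LDA immediate (A9) has B equals 2 and comes out as a one-byte form, while LDA absolute (AD) has B equals 3 and comes out as a two-byte form.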
So the bulk of the complexity is not in dealing with the four encodings; it's in dealing with the 11 addressing modes, and that's happening up in the parse code here. However, because there are only the four instruction encodings, whatever addressing mode we parse has to turn into a single-byte value, a single-byte relative value, a two-byte value or nothing at all. So if we can generalize our parse ALU argument routine here to parse any arbitrary operand for one of these instructions, we can use it for everything. The only extra thing we need is to have somewhere a table that says, for each mnemonic (not opcode, but mnemonic), which addressing modes are valid for it and what opcode it should be using. And given that information, we have a complete parser.
So up here we have our tables of instructions. These map the mnemonic to the opcode, so we'll need to add some more to that, and we're also going to combine these into a single table. So where is our instruction record? We have the name, we have the opcode and we have the valid addressing modes. Now, it's really tempting to use a bit field for this. The only problem is that we have more encodings than we have available bits in a byte. So either we have to use a two-byte bit field or we indirect it, and I'm going to go with indirection. So we are going to have an enumeration of encoding types, which will be all of the unique sets of encodings that are valid for a single mnemonic. So this is going to be the different classes of mnemonic.
Or actually, this might be premature optimization. How many mnemonics do we actually have? Let's combine these together; this will stop our program from working. Here is where the mnemonics start. Here is where the branch instructions start. In fact, I have missed a bunch of these. So that's going to be these ones down here. Oh, sorry, and these instructions that take A as a parameter.
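A minimal sketch of the kind of instruction record being described: a name, an opcode and a bit mask of valid addressing modes. The flag names, the `Instruction` struct, and the `find_instruction` and `mode_is_valid` helpers are assumptions made for this example, and the table holds only three sample entries.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical addressing-mode flags; more than eight, so they need 16 bits. */
enum {
    AM_IMP  = 1 << 0,
    AM_IMM  = 1 << 1,
    AM_ZP   = 1 << 2,
    AM_ABS  = 1 << 3,
    AM_XPTR = 1 << 4,  /* (zp,X) */
    AM_YPTR = 1 << 5,  /* (zp),Y */
    AM_XOFZ = 1 << 6,  /* zp,X */
    AM_XOFF = 1 << 7,  /* abs,X */
    AM_YOFF = 1 << 8   /* abs,Y */
};

typedef struct {
    char     name[4];  /* three-letter mnemonic plus NUL */
    uint8_t  opcode;   /* opcode of the B = 0 form */
    uint16_t modes;    /* bit mask of valid addressing modes */
} Instruction;

#define AM_ALU (AM_IMM | AM_ZP | AM_ABS | AM_XPTR | AM_YPTR | \
                AM_XOFZ | AM_XOFF | AM_YOFF)

static const Instruction instructions[] = {
    { "LDA", 0xA1, AM_ALU },
    { "NOP", 0xEA, AM_IMP },
    { "RTS", 0x60, AM_IMP },
};

/* Linear search by mnemonic; returns NULL if the name is not an instruction. */
const Instruction *find_instruction(const char *name)
{
    for (size_t i = 0; i < sizeof instructions / sizeof instructions[0]; i++)
        if (strcmp(instructions[i].name, name) == 0)
            return &instructions[i];
    return NULL;
}

/* A parsed addressing mode is valid if its flag is in the instruction's mask. */
int mode_is_valid(const Instruction *insn, uint16_t am)
{
    assert(insn != NULL);
    return (insn->modes & am) != 0;
}
```

With a two-byte mask per entry, checking an addressing mode is a single AND, which is the appeal of the bit field over an indirection table.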
That is going to be our 12th addressing mode. Let me just add that so I don't forget about it again. I got these, but I missed these. TXA is 8A, TAX is AA, DEX is CA, NOP is EA. Got those. I think I've got those. I've got some of them.
Okay, so let's add the weird ones from the top block. So we've got JSR; LDY (there are more LDYs, and I think we are generating some of those already); CPY at C0; CPX at E0; and LDX, which goes in with these as well. BIT: I don't think we've done BIT yet, and that has a second form over here. Got these. Oh, I haven't done STY before. Where is STX? I think that's down here. There it is. And I think we're done. I think that's a lot. So how many is that? The end is line 607 and the beginning is line 558, so that's 49 instructions that we care about.
Right. So we are going to have a two-byte bit field, because trying to do this the clever way is going to use more than 49 bytes. It is cheaper to do this the dumb way, and much easier to enter all the data. So let's just sort these alphabetically. There is then going to be a second field, which describes all the addressing modes that these use. And in fact we are going to have to be slightly cleverer than that, because some of these instructions have exceptions. For example, this one: if we were to just say that LDY supports an immediate form, then because the immediate form is B equals 2, our code will try to encode LDY using this instruction, A8. And that's not going to work. So we are going to have to cheat.
I'm not going to make you sit through me typing in a whole block of data like this, so I am just going to skip ahead. Apart from anything else, I need to think of names for all this stuff to make the table easy to read.
So I've actually split up the addressing mode enumeration from the B value enumeration. Here we have the addressing modes, and I found a new one, which is for LDX and STX here: the column which normally does zero page comma X is in fact zero page comma Y.
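The LDY problem can be sketched as follows, assuming a separate table from addressing mode to B value with a second, "special" immediate mode whose B value is 0. The `Mode` enumeration, `b_value` table and `encode` function are illustrative names, not the ones used in the assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical addressing-mode indices (not the author's enumeration). */
typedef enum { M_XPTR, M_ZP, M_IMM, M_ABS, M_YPTR, M_XOFZ, M_YOFF, M_XOFF,
               M_IMMS, M_COUNT } Mode;

/* B value for each mode. M_IMM is the ALU-style immediate (B = 2);
 * M_IMMS is the "special" immediate used by LDY, CPY, CPX and LDX (B = 0). */
static const uint8_t b_value[M_COUNT] = {
    [M_XPTR] = 0, [M_ZP] = 1, [M_IMM] = 2, [M_ABS] = 3,
    [M_YPTR] = 4, [M_XOFZ] = 5, [M_YOFF] = 6, [M_XOFF] = 7,
    [M_IMMS] = 0,
};

/* base is the opcode of the B = 0 form; the B field sits in bits 2..4. */
uint8_t encode(uint8_t base, Mode mode)
{
    assert(mode < M_COUNT);
    return (uint8_t)(base + (b_value[mode] << 2));
}
```

With LDY's base opcode of A0, the ordinary immediate gives A0 + (2 << 2) = A8, which is TAY, whereas the special immediate leaves it at A0, the real LDY immediate opcode.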
So that's nice. So yes, I have the addressing modes here and the B values here. Then here's my table of mnemonic to addressing modes, and here is the table of addressing modes to B values. And find instruction now returns a pointer to the instruction, because we're going to need to pull these values out. So I need to go and change the code for these.
So this is going to be, hang on, this is all going to be our new parser code. If this is not an instruction, then we want to skip this stuff. Once we're into the instruction, there is no going out. So I can put all the parser code in the conditional here, or I can put in a goto. Let's just stick it in here. So first we need to parse the argument. That will then return the addressing mode of the argument. Whether this addressing mode is valid we can easily check by ANDing the instruction's addressing modes with am, the addressing mode. Okay. What did I call that? Encodings. That's wrong. Now it's addressing modes.
So we now have the instruction and a parsed parameter, and we want to turn it into either an expression record or bytes. I can't remember where we were doing that. Right, it's up here, actually, in add expression record, so we don't need to worry about it, which is nice. This should apply to all encodings, provided op is set correctly. Okay. So the first thing we want to do, let's see if it goes here, is figure out the B value for this instruction, which is done by this piece of code. We then want to add it on like that, and we are done. That is all that code needs to be.
Okay, this piece of code. Parse ALU argument now becomes parse argument. Right. Now, here we have just parsed an expression, but we don't know whether this is one of these or one of these. So we're just going to have to return the normal immediate mode, and then down here in the parse code we are going to have to fudge it. So if this instruction has a special immediate, and am is a normal immediate, then switch this to the special immediate.
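The immediate fudge might look something like the sketch below. The flag names and `resolve_mode` are hypothetical, and doing the switch before the validity check is a choice made for this example, not necessarily the order used in the assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags: AM_IMMS is the special (B = 0) immediate. */
enum { AM_IMM = 1 << 0, AM_IMMS = 1 << 1, AM_ZP = 1 << 2, AM_ABS = 1 << 3 };

/* Returns the addressing mode to encode with, or 0 for "bad addressing mode".
 * supported is the instruction's mask; parsed is what the operand parser saw. */
uint16_t resolve_mode(uint16_t supported, uint16_t parsed)
{
    /* The parser always reports exactly one mode flag. */
    assert(parsed != 0 && (parsed & (parsed - 1)) == 0);

    /* The parser cannot tell the two immediates apart, so it reports AM_IMM;
     * switch to the special form when that is what the instruction has. */
    if (parsed == AM_IMM && (supported & AM_IMMS))
        return AM_IMMS;
    if (!(supported & parsed))
        return 0;
    return parsed;
}
```

An LDY-style mask containing only the special immediate turns a parsed immediate into the special one, while an LDA-style mask leaves it alone.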
We can put an else if there, because we know that this is a supported addressing mode, because we checked that here. So that is nasty, but it should work.
Let's do the rest of this. This piece of code: when we get a close parenthesis, that's going to be something like that, and we were expecting there to be a Y after it. However, that is not true anymore, because if this is a jump, then a bare close parenthesis is a valid addressing mode. So we get a close parenthesis, we read the token, and we now have to look at the next token. If it's not a comma, then this must be the wide indirect mode. Otherwise, it must be a Y, so this then turns into the Y pointer mode. And this piece of code, this has to be the X pointer mode.
Okay, now this stuff. This is if you get a normal expression followed by comma X or Y. If it's X, then that is either, what do I call it, X offset zero page, or otherwise it's X offset. If it's not an X, then previously this could only be a Y offset, but that's not true anymore either, because this can now be a Y offset zero page. Like so. Otherwise, this is as before: it's zero page, or otherwise it's a simple abs.
And there is one more case we need to put in, which is if it's an A. So if it's an ID, and the token length is one, and the first character of the parse buffer, which is the only character, is an A, capital A, then this is the A addressing mode. Otherwise, fall through as before. Oh yeah, and an addressing mode is a 16-bit value. Let's just create a type for that.
Okay, I think we also need another special case. Yeah (I was looking at the wrong place): if somebody does an absolute comma Y, and the absolute value is a number, and the number is small, then it will be treated as a zero page comma Y, which is the special fake addressing mode we're just using for these. So we're going to have to work around that as well.
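The operand rules walked through above can be sketched as a small classifier. This version works on a whole operand string with no spaces rather than on a token stream, treats only `$hex` literals below 0x100 as zero page, and uses invented names (`Mode`, `classify`, `fits_zero_page`), so it is an illustration of the decision tree rather than the real parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { AM_BAD, AM_A, AM_IMM, AM_XPTR, AM_YPTR, AM_WIND,
               AM_XOFZ, AM_XOFF, AM_YOFZ, AM_YOFF, AM_ZP, AM_ABS } Mode;

/* 1 if the first len characters are a "$hex" literal that fits in one byte.
 * Labels and anything else are assumed to need 16 bits. */
static int fits_zero_page(const char *s, size_t len)
{
    if (len < 2 || s[0] != '$') return 0;
    char *end;
    unsigned long v = strtoul(s + 1, &end, 16);
    return end == s + len && v < 0x100;
}

Mode classify(const char *op)
{
    assert(op != NULL);
    size_t n = strlen(op);
    if (n == 0) return AM_BAD;
    if (n == 1 && op[0] == 'A') return AM_A;       /* accumulator */
    if (op[0] == '#') return AM_IMM;                /* immediate */
    if (op[0] == '(') {
        const char *close = strchr(op, ')');
        if (!close) return AM_BAD;
        if (close[1] == '\0') {
            /* Nothing after ')': either "(expr,X)" or a bare "(expr)". */
            if (close - op >= 3 && close[-2] == ',' && close[-1] == 'X')
                return AM_XPTR;
            return AM_WIND;                         /* JMP (addr) */
        }
        return strcmp(close, "),Y") == 0 ? AM_YPTR : AM_BAD;
    }
    const char *comma = strchr(op, ',');
    size_t len = comma ? (size_t)(comma - op) : n;
    int zp = fits_zero_page(op, len);
    if (!comma) return zp ? AM_ZP : AM_ABS;
    if (strcmp(comma, ",X") == 0) return zp ? AM_XOFZ : AM_XOFF;
    if (strcmp(comma, ",Y") == 0) return zp ? AM_YOFZ : AM_YOFF;
    return AM_BAD;
}
```

Note that a small number followed by comma Y comes back as the zero page comma Y mode, which is exactly the case that needs the workaround for instructions without that form.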
So if the available addressing modes do not contain Y offset zero page, and the addressing mode is Y offset zero page, then change the addressing mode to Y offset. And we'll get rid of that, because we do actually want to check the addressing mode at this point: it is very, very theoretically possible that there is no Y offset. This one doesn't have it. Just looking for, yeah, like this instruction. This instruction fits the form, but there is no Y offset addressing mode. So if somebody does an STY something comma Y, then this will fail.
Okay, and that builds. It's not going to work, because everything's just borked all over the place, but how big is it? That's bigger. Yeah, it is bigger, but this should have added all the rest of the instruction parser, apart from pseudo-ops, but there you go. I suspect that the bulk of the size is, where did I put that instruction table? These, which are 16-bit values. In fact, looking at it, there are not very many different kinds. So we could save our 49 bytes by indirecting them, so that this field would then be an 8-bit value that we could use to figure out what the full addressing mode type is. But I don't think I will worry about that for now.
Let's just have a quick look at this in the hex editor. Lots of strings. Here is our actual table, which is smaller than I was expecting. Not that I'd done the math or anything. That's only about 256 bytes. That's the fixup table, which appears on disk and is loaded into memory, used and then discarded, so that is not technically part of the binary. And that's actually quite big. So our program in reality, I think this is the last byte, which is 16BD. So yeah, we're losing about a thousand bytes on the fixup table, which we can ignore. But there's a lot of code there, and some of these strings can be made shorter.
Okay. So we now have our instruction table and our new improved parser. Although I noticed when I filled out the table that actually some of these opcodes are wrong.
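The zero page comma Y workaround described at the start of this part could be sketched like this. The flag names and `widen_y_offset` are assumptions for the example; the function both widens the mode and keeps the validity check, since some instructions (STY, for instance) have neither form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags for the four modes involved. */
enum { AM_ZP = 1 << 0, AM_ABS = 1 << 1, AM_YOFZ = 1 << 2, AM_YOFF = 1 << 3 };

/* Widen zp,Y to abs,Y when the instruction has no zp,Y form, then validate.
 * Returns 0 when the instruction supports neither form. */
uint16_t widen_y_offset(uint16_t supported, uint16_t am)
{
    assert(am != 0);
    if (am == AM_YOFZ && !(supported & AM_YOFZ))
        am = AM_YOFF;
    return (supported & am) ? am : 0;
}
```

An LDA-style mask (abs comma Y only) widens the mode, an LDX-style mask (both forms) keeps zero page comma Y, and an STY-style mask (neither) is rejected.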
Anything which needs the B value added to it needs to have the opcode of the B equals 0 case. So, like STY here. I'll put that into alphabetical order. That's better. So STY here is actually 84; that should be 80. Why is that saying A4? That's LDY. That is very much the wrong instruction. And STX, which is down here, needs to be this instruction, which is 82. And why is that saying 80? Because 80 is this one. Because I'm looking at the wrong instruction, that's why. Right. STX is 86, this one. This needs to go over here and be 82. Yeah, the shifts are the same: they want to be over here again, because they're going to add on the B value. So that's 02. So ASL. Did I forget to put the shifts in? I forgot to put the shifts in. Let me fix that. Right, I've put those in. There's only four of them. I'll sort those into place. That again did not sort. Interesting, I am doing the right thing. Yeah, Vim is just ignoring my attempts to sort this for some reason.
Okay. So yes, we could actually optimize this quite a lot. I think we could even get rid of both of these bytes by arranging these in order, so that we can determine the addressing mode from the position they appear in the list. But I am not going to touch that just yet. But what's our damage? 24 bytes.
Okay, let's run this and actually see what it does. Bad addressing mode. Okay, well, let's strip this down to the minimum. Okay, that works. This does not work. So this RTS here is an AM_IMP, for implicit. So down in our parse code we do not want to parse the argument. Yeah, the imp ones are special. So if the addressing modes include AM_IMP, then this is clearly an implicit addressing mode; otherwise, parse the argument and just fall through the rest of the code. Okay. What did we get? We got a 60, which looks like the right thing, followed by 0, followed by our fixup table.
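The rule that the table stores the B equals 0 opcode can be checked with a couple of lines of bit arithmetic. `B_MASK`, `base_opcode` and `with_b` are names made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define B_MASK 0x1C  /* bits 2..4 of the opcode hold the B field */

/* Clear the B field to get the value that belongs in the instruction table. */
uint8_t base_opcode(uint8_t op) { return op & (uint8_t)~B_MASK; }

/* Add a B value (0..7) back onto a base opcode. */
uint8_t with_b(uint8_t base, uint8_t b)
{
    assert((base & B_MASK) == 0 && b < 8);
    return (uint8_t)(base | (b << 2));
}
```

So STY zero page (84) has a base of 80, STX zero page (86) has a base of 82, and the shifts sit on bases such as 02 for ASL, from which B equals 2 gives ASL A at 0A.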
I don't know where that 0 is coming from, but the fixup stuff is all currently broken because I never updated it for the variable-length instruction stuff. So let's ignore that for the time being.
So, LDA label. Okay, that's pointing here. That looks more or less right. But I am perturbed by the fact there's an 04 there. I think that, yeah, the emission code is writing too many bytes. It's trying to write an argument. It's actually going to 05, which is here. So the calculation code is doing it correctly, because that's the right address: 3 bytes, 4 bytes, 5 bytes. But the emission code is wrong, which makes me think that I know what's going on. The reason why it's writing a byte argument is because it's going down here through record expr, which is not expecting to get an implicit-argument instruction. That is because add expression record here is looking at token value and token variable. But of course we're not calling parse expression anymore, and parse expression is what actually resets those. So let us just simplify things a lot: just emit the opcode and stop. We take that completely out of line, and we're not going through any of this expensive code anymore for implicit instructions. Better.
We are, however, still going to have to fix this. I was thinking that, say, a shift of A, which the machine thinks is an implicit-argument instruction, is still going to go through this code; but of course we have called parse argument in order to parse the A, therefore it will have correctly set those things. In fact, we can try that. Our new improved parser: ROR A. 8060 6A 00. Brilliant. So in fact it is going wrong. Why is it going wrong? It's going wrong because down here we're not calling parse expression, and it's parse expression that sets token value and token variable to 0. So let's just put that in here. Right. Yeah, that is 6A 00. That should not be happening.
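A minimal sketch of the emission step with the implicit shortcut: write the opcode, stop immediately if there is no operand, otherwise write one or two operand bytes in little-endian order. `emit_instruction` and its parameters are hypothetical and stand in for whatever the real emission code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes one instruction into out and returns the number of bytes written.
 * operand_len is 0 for implicit instructions, 1 or 2 otherwise. */
size_t emit_instruction(uint8_t *out, uint8_t opcode,
                        unsigned operand_len, uint16_t value)
{
    assert(out != NULL && operand_len <= 2);
    size_t n = 0;
    out[n++] = opcode;
    if (operand_len == 0)
        return n;                          /* implicit: opcode only */
    out[n++] = (uint8_t)(value & 0xFF);    /* low byte */
    if (operand_len == 2)
        out[n++] = (uint8_t)(value >> 8);  /* high byte */
    return n;
}
```

Because the implicit case returns before touching the value, stale token state cannot leak into the output, which is the failure seen above with the stray 00 byte.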
Okay, so this second one here is our A addressing mode, and that's using a B value of 8, which is 1, 2, 4, 8; that's 0 1 0 for the B value, which is 2 shifted left. Hmm. So now I think that this value is wrong. Why did I put that in? The A mode is actually implicit, therefore that should be a 0. There we go. So now we have the same addressing mode, the B value is 0, therefore it should have generated the right opcode, which means that the calculation code should have correctly identified it as being a zero-operand instruction. And it still hasn't worked, and in fact we get 62 00. 62 is not the right instruction. ROR A should be 6A. I entered the wrong values. I entered very much the wrong values. I wanted the values here. Oh wait, no, I was wrong: that code is right. This is actually setting B equals 2, because we want this column here.
So all the things in this column in the C equals 1 block take one byte of payload; everything else takes 0. And I remember now that my code here for getting the effective B value from an opcode is ignoring implicit instructions completely. Great. OK, so C equals 2, that's this one. Wait, C equals 0, 1, 2. No, this is C equals 2. So: shift instructions with ALU-compatible B values. This is looking for any shift, including these, which is not right. We are looking for instructions with a B value of 1, 3, 5 or 7, so that is the odd B values, so you want bit 2 set, which is that one: 0, 1, 2. So that should be going through this code to be implicit.
I wonder if for some reason this is hitting the expression output. So we jump. Right, so we are in get insn props. Probably this means we should be... this is why it's doing it. Yeah, add expression record is assuming that our only instructions with parameters are going through here. That is easily fixed: if this is an absolute or a zero page, i.e. a value, or an immediate, or a relative. These are sucky properties; I need better ones. But we have the size over here, so that will probably do better: get insn length. If len is not 0, emit the next byte; if len is not 1, emit the third byte.
That should have gone through here and got 1 as the length. So we know it hit this code. From here, we shift right by 2 for this; we then load the flags value into A, which is 40; that is a size of 1. So now we should be here. Not quite sure what this is doing: it's shifting right by a large value by rotating left. So we now do have the length of 1, so we should now be down here. So we emit that byte, LDA 24, which is the 6A. If you stick a breakpoint at 44 and go, we get the length, which is 1; that's not 0, so we fall through; we load the second byte. Wait a minute. It's only half eleven; I'm not usually quite this dumb. Right, there we go: 6A. 6A is ROR A. That is the right instruction, finally. Okay, let's just get rid of the breakpoint.
So we should have a full set of working instructions. So we can do jump label. No, we can't. Jump should work, but it's not. Did I remember to put that in the table? I believe I did not. This table is expanding. So here I'd actually put JSR using the jump addressing modes, so that's not right. JSR is just abs. So jump is 4C, and it can be abs or wide indirect. There you go. And that has done the wrong thing again. So we have a 2, which is the right address, but it's encoded it as an 8-bit value rather than as a 16-bit value. So wind is wide indirect. No, hang on, I'm using the abs form here. It should be producing a simple 4C. It's not. So that's got an instruction of 58, so that's wrong. 4C. Yeah, I am incorrectly abusing the B value system. This is an abs form; it's added on the abs B value of 3 shifted left 2 and produced entirely the wrong instruction. So that wants to be a 40. And down here under the wide indirect mode: by adding that on we have the abs B value, which takes us from 40 to 4C, and then we add on an extra 20, which takes us to the indirect form. There we go: 4C 04 00, 60, and that is the fixup table. Our first nibble is at offset 2: 0, 1, 2. That's correct, yes.
And if I were to change this to that and assemble it, we should get the same code but no fixup, because this is an absolute address. Or not. So parse expression, ah, that gave us a number, which was here, and because it's a small number it gave us the zero page addressing mode. But the jump instruction does not have a zero page form and it's expecting an abs, therefore that also fails. We're going to have to have another one of these. So if there is no zero page mode and the addressing mode is zero page, change the addressing mode to abs. Better. Ah, that's produced the wrong result. Fantastic. It's less wrong, but it's still wrong.
Let's just print the token value. Right, that's a zero. Parse argument has failed to parse that thing properly. This should at least be a relatively simple fix. 0x0004: that is a valid hex number, or I'm going mad, or both. I hope this isn't the problem that I ran into last time manifesting in a peculiar way. It looks very similar. Read token should reparse the number. So we only have one expression, and we got a 02 as the token type, which is a number. So let's find our read token. I bet that we have overwritten token value from somewhere. So this should push the last read token, and then this will pull the values back out of the store. So token value is also zero before here, and this has only been called once.
Okay, let's do this the hard way. F62 is read token. LDA 7: we fetch the token lookahead. No, we don't; that's adjusting the stack. I think that's adjusting the stack; this is the virtual stack used by llvm-mos. So I think this LDA is fetching the lookahead thing, and A is 2, which is non-zero, so we are indeed fetching the pushed token. So the value in A, and now in Y, is the token lookahead. So here, STX 03, we wipe the old version, and now we fetch into XY the previous value, which is zero, and write it; then we fetch the previous variable, which is zero, and we write it; and then we return. And I should have realized that's what was happening, because it did actually print this trace, so it had read the number before it hit the breakpoint. Yeah, so the same thing has happened: we have managed to wipe, probably by pushing twice, the value that's been read. Which is brilliant.
Okay, so parse: we read the token, it's an ID, we look up the instruction, we know it's not implicit, so we call parse argument. Now read token here will have cleared the pushed token value, and parse argument here will genuinely read the token; and it had set the values to zero. There we go: 4C 04 00 and the RTS and no fixups.
It is slowly coming together. I think we are now at a point where we can assemble real code. There's still a ton of stuff missing that will actually make it usable, but let me get rid of the printf. Yup. And how big is it? Not bad. It's still too big, but it's still a plausible size for an assembler. I bet there's a ton of stuff that I can do to optimize further, but I think we're now at a point where the basic assembler is mostly working. We have variable-size instructions, we've got labels, we've got symbol tables, we've got all the appropriate encodings. Very little testing. So in order to actually produce useful programs, the next thing we're going to need is some pseudo-ops, because we need to be able to define data, BSS, zero page, and symbols that will have fixed values. But I'm going to do that next time, so see you then.
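For reference, the JMP opcode arithmetic worked out above can be written down as a short C sketch. The constant names and `jmp_opcode` are invented for the example; only the numbers come from the walkthrough.

```c
#include <assert.h>
#include <stdint.h>

enum { B_ABS = 3 };       /* B value of the absolute form */
#define JMP_BASE   0x40   /* JMP with the B field cleared */
#define WIND_EXTRA 0x20   /* added on top of the abs opcode for JMP (addr) */

/* indirect is 0 for JMP addr and 1 for JMP (addr). */
uint8_t jmp_opcode(int indirect)
{
    assert(indirect == 0 || indirect == 1);
    uint8_t op = (uint8_t)(JMP_BASE + (B_ABS << 2)); /* 0x40 + 0x0C = 0x4C */
    if (indirect)
        op = (uint8_t)(op + WIND_EXTRA);             /* 0x4C + 0x20 = 0x6C */
    return op;
}
```

Storing 4C in the table instead of 40 is what produced the bad 58 opcode: 4C plus the abs B value shifted left by 2 is 4C + 0C = 58.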