So, I've been doing some work with CP/M recently. This is Digital Research's operating system from the 70s and 80s for the 8080 and Z80 processors. It's an absolute masterpiece of minimalism, but it's so old that it predates proper licensing, and as a result most of the software is encumbered and can't really be used today. Now, I have been slowly trying to put together a properly open-source clone of CP/M, and I've found most of the pieces. One of the pieces which is missing is the assembler which was used to write programs for CP/M. It would assemble 8080 source code into binaries, and you got a copy of the assembler with the operating system distribution, which meant that when you got CP/M you got all the tools you needed to write programs. Now, what you're looking at here is a copy of the original 1980 source for the CP/M command shell, and today I'm going to try and write, from scratch, an assembler that will assemble this.
So, this is all based on the 8080, which the Z80 processor is backwards compatible with. The 8080 is a very simple processor, much simpler than the Z80, which actually kind of helps. This means that writing an assembler for it isn't actually too bad. So, let's start with a bit of boilerplate. I'm writing this in C using the SDCC compiler for reasons of simplicity. The original assembler was in handwritten machine code. I actually have a copy of the source here. So, yeah, there we go. We have a binary. It's all of 34 bytes, and we have an emulator that will actually let me run this, and it does nothing.
Now, CP/M is incredibly simple and it doesn't really have any C bindings. So, I've actually made my own. Each of these files calls a CP/M system call. So, for example, to open a file you set up the appropriate thing and call this and it makes it happen. And you'll notice that the main function here neither takes nor returns any parameters. This is because input and output to the program are actually parsed for you by the CP/M command shell.
So, the very first thing we need to do when the program starts is to actually look at the file that's passed in on the command line. Actually, before I do that: I have a copy of the original Digital Research assembler here, which is all of 8K. So, let's actually run this and assemble our program. We run the emulator, we run the Digital Research assembler, and we give it the ccp.asm source file, and that does not work. Why doesn't that work? Okay, that works. Oh, yes. I keep forgetting this. The interface to the program is kind of weird. The file name is parsed and sent to the program, but the actual assembler uses the three letters of the file extension as drive letters for where to put its various files. So, what this had actually done was try to, I think, write the output to drive M, which is stupid, but we're going to have to do the same because it's part of the public interface.
So, anyway, that assembled our ccp.asm, this, and it has emitted a hex file (once I spell it correctly), which is this, and which contains the assembled command shell. These are Intel hex records, which is an incredibly inefficient way of storing machine code, but Vim has nicely highlighted everything for you. This column is the length. This column is the address. This column is the type of record; 00 means data. Then we have some data. Then we have a checksum. So, this means write these 16 bytes of data to address zero, and you go through the rest of the thing to the end. All these Control-Zs at the bottom are because CP/M does not actually know how long files are, so this is padding out the length of the current block on disk. When a tool sees a Control-Z in the input stream, it knows to stop.
Okay. So, the assembler reads in a source file and it outputs a hex file and the listing. You've just seen the hex file.
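The record layout just described (length, address, type, data, checksum) can be sketched in a few lines of C. This is only an illustration that runs on a modern host, not code from the video: the function name `hex_record` is made up here, and it uses `sprintf`, which the video deliberately avoids on CP/M for size reasons. The checksum is the two's complement of the sum of every other byte in the record.

```c
#include <stdio.h>
#include <stdint.h>

/* Format one Intel HEX data record (type 00) as a NUL-terminated string.
 * out must have room for at least 12 + 2 * len characters.
 * hex_record is a hypothetical helper name, not taken from the video. */
static void hex_record(char *out, uint16_t addr, const uint8_t *data, uint8_t len)
{
    /* The checksum covers the length, both address bytes, the type and the data. */
    uint8_t sum = (uint8_t)(len + (addr >> 8) + (addr & 0xff) + 0x00);
    int n = sprintf(out, ":%02X%04X00", len, addr);

    for (uint8_t i = 0; i < len; i++) {
        n += sprintf(out + n, "%02X", data[i]);
        sum = (uint8_t)(sum + data[i]);
    }
    /* Two's complement, so that all bytes of the record sum to zero mod 256. */
    sprintf(out + n, "%02X", (uint8_t)(0x100 - sum));
}
```

For example, the three bytes 02 33 7A at address 0030 come out as `:0300300002337A1E`.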
This is the listing, which is a copy of the source code annotated with what it actually emitted. Our assembler is going to emit a listing file too, but it's not going to be like this. It's going to be drastically simplified. So, let's find my file again. Right. So, on entry to the program, the CP/M FCB file structure here has been populated with the source file. The FCB is CP/M's equivalent of a file descriptor. It contains all the information needed to refer to the file on the disk and all the state needed to refer to a file being used.
So, let me just copy a few helpful things from a different program. Yes, I have no standard library. SDCC will provide printf, but I don't actually want to use it. The reason for this is that it's gigantic. So, let's just keep these. What we've got here are a few helpful tools for just printing stuff to the output stream. So what this will do is just output the file name that we give it. Or not: putchar is in stdio. So, you can see that we gave it the ccp file name and it has written out the name of the file. The CP/M shell has uppercased everything for us.
So, we are going to use that as our input stream FCB. We also need to define a couple of FCBs for the output files. So, copy the FCB to the new locations and set the file extensions appropriately. Actually, before we do that, we need to copy the drive letters, which are in the ninth, tenth and eleventh characters. I believe it's like that. We've got the manual here from 1970-something. So, here we have the command with ABX: the source file is taken from disk A, the hex file goes on B, and the listing file goes to X, which is the console; and Z skips. So, source file is the first letter, hex file is the second letter. Okay, so we've got that right. Now, we also want to set the file extensions correctly. Remind myself: we don't want them set, we want them copied. Destination, source, size.
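As a rough picture of what is being set up here, this sketch shows the standard 36-byte CP/M 2.2 FCB layout and one way to derive an output FCB from the source one. The field names and `make_output_fcb` are hypothetical, not the names in the video's own bindings, and zeroing the rest of the FCB is an assumption about what a fresh FCB should look like before the file is created.

```c
#include <stdint.h>
#include <string.h>

/* CP/M 2.2 file control block, 36 bytes. Field names are illustrative. */
typedef struct {
    uint8_t dr;      /* drive code: 0 = default drive, 1 = A:, ... 16 = P: */
    char    f[8];    /* file name, upper case, space padded */
    char    t[3];    /* file type (extension), space padded */
    uint8_t ex;      /* current extent */
    uint8_t s1, s2;  /* reserved for the system */
    uint8_t rc;      /* record count in the current extent */
    uint8_t d[16];   /* allocation map, maintained by CP/M */
    uint8_t cr;      /* current record within the extent */
    uint8_t r[3];    /* random-access record number */
} FCB;

/* Build an output FCB with the same name as src but a new three-letter type.
 * Everything else is zeroed (an assumption, see the lead-in). */
static void make_output_fcb(FCB *out, const FCB *src, const char *type)
{
    memset(out, 0, sizeof *out);
    memcpy(out->f, src->f, sizeof out->f);
    memcpy(out->t, type, sizeof out->t);
}
```

With a source FCB naming `CCP`, calling this with `"HEX"` and `"PRN"` gives the two output FCBs; the drive code field still has to be filled in from the drive letters.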
We also remember that we need to set the drive letter and extension of the source. Okay, let's try that. And why doesn't it like that? "_f is not a structure/union member"? So, SDCC is a bit buggy. I'm reasonably sure that, unless I'm doing something really stupid, this should be working. That nine was wrong for a start. Yeah, I'm looking at the wrong line numbers. This always gets me with SDCC. My eye homes in on the error number here, which looks like a line number, but it's not. Line 36: "_f is not a structure". Okay, I think that's a red herring. struct FCB is definitely a pointer. Ah, it's not a struct. It's a, yeah, it's a typedef. Yeah, I should know that because I, like, wrote all this code. That was embarrassing. Much better. Line 42: "cannot assign values to aggregates". Right, that is an SDCC bug, because you're supposed to be able to do that. Let's just use memcpy instead. Nearly there. Line 50: "array or pointer". Right, okay. So, I tell it I want to assemble ccp and it produces these three files.
Now there is one other thing we want to do, which is that in this situation I haven't actually specified a drive letter. Let's actually print the drive letter. Let me first remind myself what the drive letter looks like in the FCB definition. Sometimes drives are referred to by letter. Sometimes they are referred to by a... here we go. Right, one is drive A; zero is the default. So, okay, that looks very wrong. Yes, I have screwed that up. Nine, ten, eight. Right, backtick is not a valid drive letter. This is because the file name I'm giving it here is just ccp, so the extension is supplied as three spaces, and space is not a valid drive letter. So the drive letters here actually need to be converted from the drive letter to an actual drive code. How would you do this? The drive letter can be space if it is invalid. At this point we want to... right, okay. At this point it is correct.
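The conversion from drive letter to drive code can be sketched like this. It is a guess at the shape of the logic rather than the code typed in the video: `drive_code` and `is_real_drive` are hypothetical names, and the choice to let non-drive letters such as X or Z simply fall out as codes above 16 is an assumption.

```c
#include <stdint.h>

#define DRIVE_DEFAULT 0   /* FCB drive code 0: use the current default drive */
#define DRIVE_MAX     16  /* FCB drive code 16: drive P:, the highest in CP/M 2.2 */

/* Convert a drive letter taken from the extension field to an FCB drive code.
 * ' ' -> 0 (default drive), 'A'..'P' -> 1..16. Any other letter, such as
 * 'X' or 'Z', yields a code above DRIVE_MAX. Assumes the upper-case input
 * that the CP/M command shell supplies. */
static uint8_t drive_code(char letter)
{
    if (letter == ' ')
        return DRIVE_DEFAULT;
    return (uint8_t)(letter - '@');   /* '@' is the character just before 'A' */
}

/* True if the code names the default drive or one of A: to P:. */
static int is_real_drive(uint8_t code)
{
    return code <= DRIVE_MAX;
}
```

A caller would then only create an output file when `is_real_drive` says the code is usable.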
A blank drive letter now corresponds to drive code zero, which means use the default drive for the file. Now, the other special codes are X, Y, Z, etc. Those will be like standard drives. All right. So we are now correctly parsing the file names. We now wish to open the output hex and printer files. We only want to do this if the drive letter is an actual drive. So again, this is not a drive letter; this is the drive code. So 16 is the top drive. If it's not valid, just do nothing. Now otherwise... yeah, I don't have autocompletion here. Otherwise we wish to delete any existing file and create a new file. Now, I don't actually remember whether my emulator supports delete. It does, right. So that has truncated the hex and print files, which is correct. And they should now be ready for writing. Let me just double check the make system call: create file creates the file. It should also initialize the FCB so that we can write to it. It's code 22. I have not committed this document to memory. Here we go, make file. The make file operation is similar to the open file operation, except that the FCB names a file that does not exist. Okay, that will then create the file. It is now ready to write to, which is great. Let me just see how big my program is so far. 340 bytes. Yeah. A lot of that will be the boilerplate here.
Okay, we have now opened our output files. We now need to start reading the input. Quick pause for a drink. Now, the 8080 is a pretty simple architecture. It doesn't have any instructions whose length varies with the operands. For example, the Z80 has instructions that can take either one or two bytes, depending on what you're doing. For example, the various branch instructions. That's actually not entirely true. Anyway, the 8080 is really dead simple. Every single opcode has either a parameter based on the... let me rephrase. You can tell how big the parameters are from what the opcode is. So LXI always takes a two-byte parameter. And this is different from that.
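The point that the mnemonic alone fixes the operand size is what makes a first pass easy: the assembler can advance its location counter without looking at the operands at all. A small sketch of that idea follows, covering only a handful of 8080 mnemonics. The table and `instruction_length` are illustrative names and are not how the video's assembler is organised.

```c
#include <string.h>
#include <stddef.h>

/* Operand size in bytes for a handful of 8080 mnemonics (not the full set). */
struct op_size {
    const char   *mnemonic;
    unsigned char operand_bytes;
};

static const struct op_size op_sizes[] = {
    { "NOP", 0 }, { "MOV", 0 }, { "RET", 0 },                  /* opcode only      */
    { "MVI", 1 }, { "ADI", 1 }, { "IN", 1 },  { "OUT", 1 },    /* 8-bit immediate  */
    { "LXI", 2 }, { "JMP", 2 }, { "CALL", 2 },
    { "LDA", 2 }, { "STA", 2 },                                /* 16-bit immediate */
};

/* Total encoded length: one opcode byte plus the operand.
 * Returns 0 for a mnemonic that is not in this small table. */
static unsigned instruction_length(const char *mnemonic)
{
    for (size_t i = 0; i < sizeof op_sizes / sizeof op_sizes[0]; i++)
        if (strcmp(op_sizes[i].mnemonic, mnemonic) == 0)
            return 1u + op_sizes[i].operand_bytes;
    return 0;
}
```

So `LXI` is always three bytes and `MVI` always two, whatever registers or values follow the mnemonic.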
LXI is the load-16-bit instruction. This is different from MVI, which takes an 8-bit parameter. Now, the Z80 uses the LD instruction for both of these, and it determines whether you're loading a 16-bit value into BC or an 8-bit value into B from what register you load. The 8080 doesn't do that. It uses different instructions for each. This makes it really simple to assemble for.
We need a two-pass assembler, where the first pass reads all the source and figures out where all the labels are, and the second pass actually emits the code. So we need to know what pass we're in. So we're going to do two passes. Oh yeah. I also completely forgot: error handling. Yay. CP/M's error handling is very, very straightforward. It usually either just returns a pass/fail code, or just aborts your program and returns directly to the console. I did do some error handling in another program. I think that was in dump. Dump is even simpler. This on the left is a copy of the STAT program, which I wrote in C to replace the Digital Research one. It's a surprisingly complicated and amazingly badly designed program. Okay. Yeah, it looks like I didn't do any error handling there. Probably because the underlying program didn't.
Now, I happen to have a copy of the original assembler source code here. So let's just have a quick look and see what it does. Written all in machine code. So let's search for make. Make a file. Yeah. Okay, that's straightforward enough. This is the original equivalent of the CP/M make file function. What it does is you give it an FCB in the DE registers, which I'm doing here. It calls the CP/M system call to actually create the file and checks for an error. If there's no error it continues. If there is an error it reports the appropriate error message and terminates. Now, we only want to check for an error on a CP/M make file. So we do: if we could not create a file, then what is the error?
It's actually producing an error message: no directory space. Yeah, that was the error we saw earlier when I used the wrong parameters. So we're going to do better than that. And output file. And what error is going to do is just print the message. Did I actually remember to put in a system call for exit program? I probably didn't, to be honest. Let's add one there. And then we terminate. Very, very simple. Exiting a CP/M program can be done in two ways. You can either return back to the command shell, if you haven't used the memory where the command shell lives, or you can restart the system, which is, like, very fast. If you have used the memory the command shell uses, this thing calls the system to reload the command shell. And if you want to restart the system you just jump to zero. So we do this, and that should have worked.
Okay, back to our assembler. We need to read the input file twice, once for each pass. We can either open it twice, or we can open it once and rewind the file once we've finished with it. Let's rewind the file. You don't need a function for that. In SDCC functions are actually quite expensive, because all values are passed on the stack rather than in registers. If you look at my CP/M system call bindings you'll see that each of these things has got this on it. This forces SDCC to pass a single parameter in a register, which makes the code way smaller. So at the beginning of each pass we wish to rewind the file back to the beginning, and we do this by adjusting the FCB state directly. We want to do that. That just resets the pointer back to the beginning again.
We now wish to start reading bytes out of the input file. Now, this is a little bit more exciting than it sounds, because CP/M does not actually have the ability to read bytes from a file.
Instead, what you do is read complete 128-byte records, one at a time, out of files. This means that CP/M doesn't have to care about file sizes; it just needs to care about allocated sectors. So we're actually going to need our own buffering scheme. We want to read a byte from the input file, and we need a 128-byte buffer to do this. We actually already have one defined. You get one in CP/M automatically; it's at 0x80. I need to adjust my bindings for that, actually. On entry to the program this particular buffer also contains the command line, but we're not actually using that in this program.
We need to keep track of how much is left in our buffer so that we can reload the buffer from the file. Okay, so when we read a byte from the file, if we have bytes remaining in the buffer then... do we want to do it like that? We want to do it the other way around. So, okay, this variable now contains the number of bytes we have read out of the input buffer, and we initialize it to 0x80 to indicate that we have read all the bytes and we need to read the next block from the file. Okay, if there is anything in the buffer which we haven't read, just return it. Otherwise we need to tell CP/M that we want to do file access in the default DMA buffer and then read the next block from the file. And if there is no additional data... detecting the end of file is going to be interesting, because I don't think we can distinguish between that and a read error. Right, okay, we don't actually have to. I'd forgotten about this. CP/M read errors are fatal and cause the program to exit. So all we need to do is... it says it returns zero if the read succeeded, non-zero if it fails. So if the read fails, then just fill the buffer with end-of-file characters. Okay, let's refactor this a little bit. If we have run out of characters, refill the buffer, reset the buffer pointer and return the next character.
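The buffering scheme described above can be sketched on a modern host like this. The CP/M "read sequential" call is replaced by a stand-in that copies 128-byte records out of an in-memory array, so `fake_file`, `read_record` and `rewind_input` are test scaffolding invented for this example. The names `input_buffer` and `input_buffer_read_count` follow the narration, but the code itself is a reconstruction and not the code from the video.

```c
#include <stdint.h>
#include <string.h>

#define RECORD_SIZE 128
#define CPM_EOF     0x1A   /* Control-Z */

/* Stand-in for the file on disk. The real program would ask CP/M to read
 * the next record of the input FCB into the DMA buffer instead. */
static const uint8_t *fake_file;
static unsigned fake_records, fake_next;

/* Returns 0 on success and non-zero at end of file, like the CP/M call. */
static uint8_t read_record(uint8_t *dma)
{
    if (fake_next >= fake_records)
        return 1;
    memcpy(dma, fake_file + fake_next * RECORD_SIZE, RECORD_SIZE);
    fake_next++;
    return 0;
}

static uint8_t input_buffer[RECORD_SIZE];

/* Number of bytes already consumed from input_buffer. Starting at
 * RECORD_SIZE means "everything has been read", which forces a refill. */
static uint8_t input_buffer_read_count = RECORD_SIZE;

/* Start again from the first record, as at the beginning of each pass. */
static void rewind_input(void)
{
    fake_next = 0;
    input_buffer_read_count = RECORD_SIZE;
}

static uint8_t read_byte(void)
{
    if (input_buffer_read_count == RECORD_SIZE) {
        /* Out of data: refill, or fake a record full of end-of-file marks. */
        if (read_record(input_buffer) != 0)
            memset(input_buffer, CPM_EOF, RECORD_SIZE);
        input_buffer_read_count = 0;
    }
    return input_buffer[input_buffer_read_count++];
}
```

Once the file runs out, every further call just returns Control-Z, so the caller never needs a separate end-of-file check.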
All right, so let's read one byte from the file per pass. And that did not work, because fatal is a... and we call it error. And we got what looks like two spaces, and, yep, it's, oh, it's two tab characters. Let's be a little bit... let's read several. Okay, that seems to be working. So we are successfully rewinding the file, and we are successfully reading bytes out of the buffer. Good.
Now we wish to start actually reading tokens from the file. Tokens in the Digital Research assembler can be words; simple strings (and simple strings are only supported by a very few directives); comments; numbers, which start with a digit and then end with a base specifier, in this case H; or identifiers, I think I've mentioned identifiers before; plus a few other things like the dollar sign, meaning the location of the current program counter, commas, arithmetic operators and so on. An exclamation mark, like this, is a statement terminator, and these can appear inside comments. Let me rephrase: they can terminate comments. So if I look for a comment character followed by an exclamation mark, we actually find one here. Here we have a label definition, then there's a comment, and then there's a function. Now, dollar signs are normally, in contexts other than this, ignored completely. They are padding used in identifiers to make them more readable. This completely fails from my point of view, but there you go.
So we wish to read an identifier... we read a token from the input stream. We wish to skip white space, and then the actual token depends on the character we read. Now, we've got the assembler documentation here. String constants. I'm just wondering about multi-character symbols. I don't think there are any. It's a really simple language. So we've got single-character operators, we've got multi-character words, we've got, of course, parentheses. We do have some precedence. But this means that we can read a single non-whitespace character from the input stream and we will know from that character what kind of token we're looking at.
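Deciding the token kind from one character can be sketched as a small classifier. The enum and `classify` are hypothetical names, and the video's own token codes are different. Two details are assumptions based on the Digital Research assembler's syntax rather than on the narration so far: the comment character is taken to be a semicolon, and an exclamation mark is treated the same as a newline.

```c
#include <ctype.h>

enum token_kind {
    TOK_EOF, TOK_NEWLINE, TOK_NUMBER, TOK_IDENTIFIER,
    TOK_STRING, TOK_COMMENT, TOK_SINGLE
};

/* Classify a token from its first non-whitespace character.
 * c is expected to be a byte value in the range 0..255. */
static enum token_kind classify(int c)
{
    if (c == 0x1A)              return TOK_EOF;         /* Control-Z */
    if (c == '\n' || c == '!')  return TOK_NEWLINE;     /* '!' ends a statement */
    if (isdigit(c))             return TOK_NUMBER;      /* numbers start with a digit */
    if (isalpha(c))             return TOK_IDENTIFIER;  /* includes mnemonics */
    if (c == '\'')              return TOK_STRING;      /* opening apostrophe */
    if (c == ';')               return TOK_COMMENT;     /* assumed comment character */
    return TOK_SINGLE;                                  /* , + - ( ) $ and so on */
}
```

Anything that falls through to `TOK_SINGLE` can use the character itself as the token value.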
So let us do this. We do luckily have ctype. I'm also going to need to allow printing of hex characters. I think I've got one of these in stat, actually. No, I don't. I've got one in dump. Here we go: printx4, printx8. Yeah, and I've got the... yeah, let's stick some of these in, just to reduce code size somewhat. We don't need printfcb anymore. Yeah, I'll need to go and add these throughout once the program is done, but I won't worry about it for now. So we want to read a token and print it. Okay, apparently isspace does not detect tabs. "Checks for white space characters." Oh, hang on, no, no, that's stupid. Right, that's picked up a hex 74, which is a lowercase t. Excellent.
So let's use a switch. So this can be an end-of-file character. If it starts with a digit it's a number, possibly not decimal; we're going to have to keep reading stuff until we reach the terminating base character. If it's a letter then it's an identifier; this includes instructions and pseudo-operations. If it's a single quote it's a string. And let me just double check this: string constants represent, blah blah blah, repository symbols. I think isspace might be wrong here, yeah, because this also detects newlines and carriage returns, which we don't want to skip. So: if it's a space, or if it's a tab, or if it's a carriage return, because we're going to ignore these and use the newline character to mark lines. CP/M uses CRLF separators for lines. And I think that we also need to check for form feed and vertical tab; we want to skip those too. These are very out-of-date characters no one uses anymore, but of course CP/M dates from the 70s, so we kind of expect them. Okay, so anything that's not one of these three special cases we're going to turn into a token directly. So if we get a comma or a plus sign then we use that as the token value. It simplifies the code no end. The Control-Z end of file is passed
through as a single-character token. So are newlines. So when we run this program we should get an error. Yep: can't read identifiers yet. Good, getting somewhere. Now, in order to actually read any of this stuff we have to keep reading characters from the input stream into a buffer and then parse the buffer. Now, there is a strict limit as to how long a token can be. The longest token is a string constant, which must not exceed 64 characters in length. So we're going to keep reading characters until we get one that is not a valid identifier character. But of course now we've read the character, and we don't want it. So after we've finished reading the identifier we need to put back the last character we read.
Luckily we can do that. The way we've set up our read byte function is that we refill the buffer before reading the byte out of it. This means that after you've called read byte there is always room to put one character back: the slot holding the one we've just read. If the buffer has just been refilled then this will be the first byte in the buffer. If we have just read the last character, the input buffer read count will be 0x80, but we haven't refilled the buffer yet. So unreading a byte is as simple as that, and we can unread a single byte only. That should be fine. Yes: the input buffer read count will never be zero after you call read byte.
All right, so to read the identifier we now need to... if it's a dollar sign, ignore the dollar sign; don't attempt to read it. If it's not an identifier... now, what are the valid identifier characters? Is it going to actually tell me? I don't think it is. Brilliant. There you go: identifiers... all characters are significant except for the embedded dollar symbol. Great. Any kind of formal definition? Oh yeah, there's some stuff that we're just ignoring in the spec. So the ancient Digital Research assembler actually has backwards compatibility features for an even more ancient assembler, which I'm just going to ignore
completely. This is the Processor Technology assembler. Oh yeah, I've forgotten about comments. Oh, comments, yes. Right, I'll talk about comments later. Let's just stick this in. And also I forgot about bang (exclamation mark) characters. Bang characters are treated as newlines, so we just do that. That should suffice, though we are also going to have to put some code in here to keep track of line numbers.
Okay, this is not actually telling me what the valid characters in an identifier are, so let's take a look at the source code. I'm looking for interesting-looking string constants to try and identify where the parse code is. Yeah, this is read a character, and it's now checking for various interesting values. So here we say check next character for numeric value; that's actually a helper. Okay, so alnum: return zero flag if not alphanumeric. That does suggest that identifiers need to be... you know, accumulating an identifier: is it a dollar sign? So skip it. Is it alphanumeric? Yeah. Right, that suggests that this will only support letters and numbers, which frankly is a terrible idea. We're going to do a bit better than that. So there's an is blank; what does is blank do? Space or a tab. Yeah, we could use isblank here, but let's not. Right, you see, I was hoping there'd be a standard ctype function for returning identifier-like characters. So basically we want: is this alphanumeric, or is it an underscore? That is going to be an extension to the syntax. So if the byte we've just read is not an identifier character, then terminate the loop. If the token buffer is full, fail. In fact, I'm going to be using this from multiple places. So, okay, when the loop exits, the last character read, in c, is not part of the identifier, so unread byte c. And we are very deliberately only being seven-bit safe. Are we only being seven-bit safe? No, we can do better than that. We can be eight-bit safe. Change that to an int. Okay, does that
build? No, it doesn't. Right, we have read the token. The identifier is minus one, which is being reported as FF, which is great. Let's just change this to this. So we have, I hope, successfully read our token. Oh, I don't have a printx16. Need one of them. Okay. Yeah, we've missed the initial character, because the initial character is of course in c here. Now, we know that c must be zero, so we can just do this. Right, we've read the token and it is TITLE. Now, there's actually some more work we want to do here. We actually want to resolve the identifier in the symbol table, but we haven't done the symbol table yet, so let's just ignore that for now. Right, now, I've been testing with this actual file, but I'm going to switch to a different one. Right: we want a number. Can't read numbers yet.
Oh yeah, how big is our...? One K. Could be smaller. Again, a lot of that's boilerplate. A lot of it's the print stuff that we will actually be using eventually. So, that's actually been compiled... yeah, it has been optimized. CP/M is only... it's designed for machines with 16-bit addresses, so we only have 64K to play with at all, some of which will be used up by CP/M itself, that's 3.5K, and some of which is the BIOS, which on a decent CP/M is typically half a kilobyte. So that's 4K taken away from the total. Now, we don't actually have to store anything in memory other than our program and the symbol table, so it should be okay, but bear in mind that the actual assembler, written in hand-tooled machine code, is 8K. So we've already reached an eighth of our budget.
So, numbers. This is very similar code. We wish to read bytes out of the stream. In this case, numbers may consist of digits or alphabetic letters; that is actually specified in the manual. Now here's some examples. Oh, you can add dollar signs as well. Let's put that back in. The terminator for a number is the type specifier, which is... octal... okay. Right, so that means we can't actually use the character
we read to identify the terminator, because B is a perfectly valid hex digit. Right, so we in fact do need... so that reads a number into the token buffer, but it does not parse it. Strings: strings need to be read into the token buffer as well. Now, strings are different because strings have a start and end character. Also, I have forgotten to call this. So for strings we want to skip the initial apostrophe and start reading characters and writing them. So we don't want to uppercase everything. So, read a character. If it's a newline, that is illegal. If it's a single quote, that is the end of the string. Oh no, it's not. See, I'm thinking of C-type strings that use an escape character, but the assembler strings do not; the assembler uses a double apostrophe. So in fact if we read an apostrophe we need to read another byte, and if this is also an apostrophe then we wish to add the apostrophe to the buffer. So if the second character is not an apostrophe then, yes, exit the loop; otherwise add the character to the buffer. And on exit from the loop c will contain the character immediately following the terminating apostrophe, therefore we need to push it back into the buffer. And this is wrong here. I mean, this will work, but various other illegal cases will not be picked up. For example, at end of file we don't want to just keep reading end-of-file characters into the buffer. So I think we want isprint here. Is that right? "Checks for any printable character including space." No, I want iscntrl. What's it complaining about with that? Okay, so that should be reading a string into the buffer. So, yep, that has read flored. If I put one of these in... that's correct. And if I take this off... right, that seems to be working.
Okay, the next step is we now need to start doing stuff like parsing numbers. Now, parsing the number, we wish to look at the last character of the thing we've read, and this may be a base specifier. Do we do this? Okay, so this
will chop the last character out of the buffer and then we can test it. So: binary, octal, more octal, decimal... okay: binary, octal, decimal, hexadecimal. So if the last character is a decimal digit then the base is base 10, and we don't want to have deleted that last character, because it's valid. Right, we now have the base set, and the token buffer contains just the number. So let's actually go to... 23456... ech. Why has that still got the H on it? The ID is minus 2, which is a number. Hmm. So, right, on entry token length is 2, indicating it's read two characters, which is the 1 and the H. Right, that is as I expect. So token length is now 1 on exit from the routine. So, token buffer... d'oh. Right, that's better.
We now wish to start actually parsing the value, and we do this by going from the left. We turn the character read into the value of the digit, we check that the digit is in range, and we add it to the accumulator. And so when we read the number we return the token number identifier, and on exit the token number variable will contain the value. Yeah, this is... SDCC doesn't actually do proper flow analysis. Okay, no warnings, and the value is correct. So that's 1H. So 12ABH: that's correct. Let's change this to a Z: invalid digit in character constant. 1234 decimal: 04D2, yeah. Yep. Binary... nope. 77... yeah, probably. Okay, that looks like it's more or less working. So we are now correctly parsing numbers.
So what's the damage? Yikes: nearly half a kilobyte. Not all of that will be the numeric stuff, but that's still not brilliant. I'm wondering whether we can make this a little bit better. I'd expect that this switch statement compiles into just a series of, you know, if-elses. It'll do. Identifiers we haven't done yet. Strings are fine; we don't need to touch strings. Comments. Yep, comments. Now the obvious thing to do with a comment is to just, like, read characters until we reach the terminator, which is either a newline or an exclamation mark, but
actually it's a little bit more complicated than that, because we want to read the comment and emit it into the output print file. So if I reassemble our ccp and you look at the print file, you see it's actually got all these comments in it. So I think that we're going to need to read the comment... hmm. So the way this print file is created is that this column here is just a copy of the source line. What that means is that I think, in order to replicate this, our read routine here is actually going to have to be different, because we're going to need to read a complete source line from the file into memory, so that we can then output the complete source line to the print transcript, and then read bytes from that source line. That's a shame. Yeah, let's go with that. In which case, in this piece of code here, when reading the comment, all we do is eat characters. And in fact things are a little bit different, because if we see a comment then we actually have to read more bytes. So we're going to do this. This would actually be an ideal place for a goto, to be honest. Goto is good for this kind of state machine. Okay, if we see a comment character then we just start eating stuff until we see an exclamation mark, a newline or an end-of-file character, and if we have read a comment and we see one of these things then we'll loop back again and continue reading. No, no, that's not right. Oh, no, no, I'm overcomplicating this. Right: we read a byte. If it's a comment character, keep reading until it's not a comment. Then it's not a comment, so we just continue parsing as before.
So in our test file we get a carriage return token, which is in fact the one here. Oh, that's an exclamation mark. Yeah, I actually need a bit more test code. So what that's done is we read a newline character, that's that exclamation mark, so a newline token; then another newline token, which is the actual newline; then it's a number, which is the number; then we read the newline at the end of line two, the newline at the end of
line three, and my CP/M emulator pads with zeros rather than with Control-Z characters, so we read zeros. That's strictly user error, because I should have a Control-Z at the end of the file. And you see it has actually read the Control-Z there. Zeros... I think in the interests of sanity let's put one of those in. Okay, so zeros are now translated into end-of-file characters.
Right, so where have we got to? We are parsing numbers, we are parsing identifiers but not looking them up in the symbol table, and we are parsing strings. So the tokenizer is looking solid. I think that's done. So let me just check all this lot in. Oh yeah, we're beginning to approach the bit where we need to start talking about symbols, because the way the assembler is going to work is it's going to accumulate named values internally. Every identifier, like, every identifier, will resolve to a symbol in the symbol table, and we're going to look up these symbols here in the identifier code. If a symbol is not in the symbol table then we will create a new symbol table entry for it, of a particular type that means this symbol is not defined yet, and then when people actually try to refer to it that will cause an error. There's a reason for this: it'll make life easier later on.
So we're going to have to start talking about the symbol table. One thing you may not have noticed is that this program contains no dynamic memory allocation. This is because my runtime library does not actually support dynamic memory allocation. malloc and friends are kind of expensive and we're not using them. Instead, what you get is a simple array of bytes which contains all the memory in the CP/M system that is not otherwise used by the program itself, and we are going to put our symbol table in there. So there's various ways we can do this. The cheapest way, if you're writing in machine code, is just to use an array of variable-size structures, but we are actually working in C, so we want to do things a bit differently. So we
are... we define a symbol, which consists of a null-terminated name; the numeric value of the symbol; a callback, which contains a pointer to some code, which is essentially the symbol type (this is going to be a routine for doing something with the symbol); and a pointer to the next symbol. Now the overhead, because we're in C, is two bytes for the pointer and two bytes for the next symbol, but this gives us a lot more flexibility. And we're going to store the symbols as a linked list, but we're going to be a bit cleverer about that, and we're actually going to use a hash table. So, the bottom few bits of the identifier: we're going to use five bits, for 31 entries, 32 entries even, initialized to nulls. So the way this is going to work is, to look up a symbol, we take the bottom five bits of the first character. So, for example, if you have a symbol fnord, then the bottom five bits of that f are six, so we look in the sixth slot in the hash table, and this gives us a linked list of all the symbols beginning with f. This should drastically reduce the time needed to actually look things up in the symbol table.
So we actually add symbols to the table at this point, because we want to resolve them. So that gives us the slot; that gives us the first symbol in the chain. So for each symbol in the chain, we wish to compare the name with the one in our token buffer. Now, let me think. I believe that I actually want to change this a bit, so rather than zero-terminating the strings pointed to by the symbol, we're actually going to store the length of the name in the symbol structure. Now, the 8080 and Z80 don't have strict alignment. So this pointer is two bytes, this is one byte, therefore this is aligned to three bytes, but we don't actually care. On a more modern machine we'd want to do this so that these four two-byte values are all aligned. In fact, we might as well do that anyway. So if the lengths match, or rather, put it another way, if the lengths don't match, then we
know this is not the right symbol. If the strings are the same, then we have found our symbol and we just return. If, however, we get to here, then we have not found our symbol, so we need to allocate a new one. Now, to allocate memory we are actually going to put in some more code. Can we do that? Yeah, we can't. Oh, hang on. Okay, so heap pointer is going to point at the top, sorry, at the bottom of unused memory, and as we allocate stuff, heap pointer is going to advance, and eventually it will hit the top of memory, and we should probably check for that at some point. We're going to skip that for the time being. So that is our allocation function. There is no provision for freeing memory, because we never will. This is a traditional memory allocation strategy in compilers: compilers accumulate stuff in the heap and then terminate, and because all memory is automatically freed on termination, you don't need to keep track of it. So we want to allocate enough memory for the name. Done. We wish to allocate enough memory for a symbol structure, which we then initialize. You can do this the other way around. We add it to the linked list on the front. The value is zero. The callback is unset, to mean this symbol has no type. Oh, hang on, that didn't work. It would be... oh, I called it cpm top. Why did I call it cpm top? That's a stupid name. Let's do that.
Okay, so let's take our test routine and go... what? So those are two symbols that should be allocated. Here is another symbol that should refer to the same one as in line one, and the one in line four should refer to the same symbol as line two. That did not work at all. That's because I actually... I should probably pause and get some tea. I have just paused and, you know, made dinner. Tea always helps. Yeah, I'm not very... that pointer's just wrong. So here we have the map of where things are in memory. You can see the top of RAM is a3f, which is here, which makes me wonder whether this has not actually initialized the variable correctly. No, that looks
okay. I can see the heap pointer advancing, so that looks like that bit's working correctly. Token symbol... yeah, okay. Always look at the warnings: 112, function return value mismatch. Yeah, I would expect that to be an actual error. SDCC strikes again. That's better. Okay, so we see that fnord is at a6f, so the second one is not at a6f. Fabulous. That'll be a bug, then. Maybe I want to remember to set... there we go. fnord is at a3b, fnord is still at a3b, foo is at a49. Okay. Right, now this... oh, this is a tricky one, and we're going to have to fix this elsewhere. Possibly not correct. Yeah, and that's not right at all; that skipped the second character completely. So this is the identifier: first character gets added, second character... yeah, I think I seriously do need some tea. That's better. And you see the dollar sign is no longer in foo, and the pointers match. Fantastic. So that bar and foo should be appearing in different slots in the hash table. Yeah, I think that's done. Good.
So let's now point this back at the CCP, and what do we read? We read the title, which is correct. We read the string here, which is correct. Couple of newlines and comments, etc. Why do we have... this is not right. Where's that MVI coming from? Right, we somehow have found ourselves here. What's happened, I'm pretty sure, is that the comment-munching code is faulty, and when it sees this it's actually skimmed all the way ahead to the first exclamation mark, which is here, and now it's started scanning from then on. So let's go and look at that comment-munching code. That'd be here, and this should be an 'and' as well. So only eat the comment if the character read is not an exclamation mark, newline or end of file. Good. So this has read a whole bunch of newlines, which is these ones here. We haven't reached as far as the false. Let's actually just print out some more tokens. There we go. So false is defined: EQU, then a number and a newline, and a true EQU not false. Good, that looks like it is working. Okay, the next thing we need
to do is to start doing something with those tokens, and this means we are going to need to populate our hash table with things that make sense. Now, this is one of the reasons I wanted to use explicit names and pointers, and this is going to be a lot of really irritating boilerplate. So, that EQU statement: the way this is going to look is we do const struct symbol equ_symbol equals... put that there. So the symbol name is EQU; the length of the name is three. Now, this will in fact allocate four bytes, because it's null-terminated, which I'd rather like to avoid; I'm not sure we can do that. The value is going to be ignored for this. The callback will eventually be something like this, this will be the code that actually makes EQU happen, but we're going to leave that as zero for now. And next is going to be the next symbol. And this needs to go into the hash table, and the hash of e is slot 05... 34 to 66, it's 32, that's correct. So therefore the EQU symbol goes here. Now that's actually going to complain due to const issues. That's not complained due to const issues; I'm really surprised. And if you look at the symbols being produced here, title is at a40, which is obviously a new symbol; EQU here is at 136, which is a completely different address range, so it has obviously found this one.
Right, so we now have a symbol. I'm going to need to put a whole bunch more in, including all the 8080 opcodes, but... actually, let's use a macro. I should add that our operators are also going to be here, so for example plus is 2B, and 2B hashed will end up being hex 0B, which is slot 11, so... which should go there along with the Ks. What this means is that our symbol names may contain non-C-identifier things. So: ID, name. So title has found an internal symbol. Okay: value, callback. Let me prototype the title callback, and we can put some code in here, and of course title will do nothing. That's the next term. Okay. Now, this is going to be fragile as hell. Each of these needs to be a linked list, which means
the last term of each symbol needs to refer to the previous item in the chain, and then the last of the symbols in each chain needs to go into the hash table. This is the kind of situation where it'd be so, so nice to have some kind of C metaprogramming beyond, like, #defines, but we don't, so I'm going to have to do it the long way. I have in the past resorted to horrible evil to make this happen, but there you go.
Right, we are beginning to get somewhere. We now need to start actually doing something with the symbols we read, which means we need to understand the grammar of the assembly files. Now, the grammar is defined rather poorly here, but it's essentially: we have a label, followed by an optional colon, followed by an instruction, followed by a comma-separated list of operands. Now, the label having an optional colon after it means that the only way to distinguish between labels and instructions is from context. So that is a valid instruction; that is a valid instruction. And I can demonstrate. So this is using the original assembler. It has assembled this, and I can look at the hex record, and 3E01 is this MVI instruction. And if I take the colon out, it still assembles into exactly the same thing. But also, that's valid too. So that has now generated 3E01 and 3E02.
So the way we do this is, for each instruction, we read a token, and the token will be an identifier, and it will be either a label or an instruction, and we have to look at the symbol type to determine what it is. If it's a label, then we read another token, which is the instruction. Each statement can contain either zero or one label and zero or one instruction, and the operands depend on the instruction. So what we do is we define symbols for the current label and the current instruction. We read a token. Now, the token may contain a newline; newline means end of a... yeah, actually, what tokens can we return? Ooh, single-character tokens. We have to look these up. Okay, can we just treat those as tokens? Do
we need to actually put them into the symbol table? I don't think we do. There's not very many of them. I'm mainly thinking of the arithmetic operations here, and we've got plus, minus, star and slash, and to be honest, what they do is very dependent on the exact context. No, let's not put those in the symbol table. So these are just returned as tokens.
So we want to read all tokens in the file. If it's a newline, skip to the next token. If it's not an identifier, fail. We know it's an identifier and has a resolved symbol, so we are now looking at the first token of the line. This is either going to be a label or an instruction. If it's a label, then we wish to set current label, read the optional colon if it's present, and move on to the instruction. If it's an instruction, we don't. So, to identify a label: the callback will be zero, to indicate we cannot actually do anything with it, it's just a value. So if there is no callback, it's a label. Yeah, there are in fact two types of labels: there are EQU labels and there are set labels, and we're going to need to distinguish between them somehow. So I'd rather not allocate another byte, to be honest. So I am in fact going to be kind of terrifying... yeah, sorry, I'm being ridiculous. Okay, so we're going to have a set label callback and the EQU label callback. What these will actually do if you run them is throw an error. You've got two macros, for instructions and values. Let's just do that. So if it's a label, then set current label accordingly and read the next token. If it's a colon, read the next token. Now, if this is a newline... no, let me change that. If this is not a newline, that is, if there is a next token, then it must be an identifier, and it therefore must be the instruction, so set the instruction. So we have now identified the label and the instruction for the current statement. So call the callback for the current instruction, and we expect this callback to consume all the remaining tokens in the statement, so that we're
then ready to go back here again for the next statement. Okay, let's see if this does anything. Fails, and of course it does: called object is not a function. Oh, okay, let's see if I can remember the C function pointer syntax. Well, that did a thing. Converting integral to pointer without a cast, in 426. This is token symbol. Right, and it actually detected it was a title and failed. Awesome. Let's put some generic... we'll need this in a bit. Yeah. Notice that I'm not actually doing any error recovery, which I'm going to have to deal with in due course. So CPM is so slow that when you're assembling something, you don't really want it to bail out the first time it hits an error; you want it to keep going as long as possible. That is what the printer listing file is for. The idea is you assemble stuff and you get a log out on the printer with all your errors in it. So all these fatals... well, these ones are just for debugging, but this one does not want to stop processing. It just wants to skip the current line and continue, but mark the run as failed. What's happening? Line 38... really? 390: unreachable code. Yeah, okay, that's actually correct.
Actually, a thing we could do that would improve things a bit: if we get an end of file, we don't actually want to continue reading from the file at all. So what I was going to do is, if we read an end-of-file byte, we just do not advance the pointer, so that we keep returning the same byte again and again. What this will achieve is, if we reach the end of file and there's an end-of-file byte and then some garbage after it before the end of the block, we don't read the garbage. But if we read an end-of-file byte and don't advance the pointer and then unread the byte, this violates one of our preconditions for unread byte, which is that input buffer read count is never zero. Yeah, let's just not do that. So that has correctly reached title here. Now, the title pseudo-operation... there should be a list
of them somewhere. This reads a string and just prints it. So if I assemble my CCP, you see in fact it doesn't even print the string; it just logs this line of the listing so you can read it. Let me see if I can... yeah, wrong keys. Let me see if I can find the title directive. It's not there. Interesting. Controlled instructions... no, that's the actual 8080 instruction set. That's interesting. Okay, I'll just make something up, then. I'm going to assume that title reads a string, and then we're going to print the string to the console. So we expect a string, we then print the token buffer, we then expect a newline, like so. And we have read the title and printed it. We haven't done anything else. Why have we not done anything else? That should have moved on to the next token. We read a bunch of comments and then we get a... a minus one identifier. So... minus one. I know what's happening. Right, we've got a label but we don't have an instruction, which means the current instruction is zero, which means when it tries to call the callback it just does garbage. CPM has no memory protection, no catching of nulls; in fact, zero is a perfectly valid address, and that is the address you jump to if you want to restart the system. So, if there is an instruction, call the callback for it; otherwise we have a bare label, or, you know... we do know we got an identifier, so yeah, we get a bare label. No, that still hasn't worked. So I presume we get false here and then we get EQU, minus... so we are calling a callback. So what is it? pe5... oh, oh, oh. When we create a new identifier, we set the callback to null rather than to set label. So a set label is one that can be modified by another set instruction; an EQU label is one that cannot. Do we want to set that to set label? Yes, we do, because we can always upgrade a set to an EQU but not vice versa. Right, now we hit the EQU. Good. We can lose these. Okay, EQU reads its single parameter, which in this case is zero, and assigns this
value to the label on the left. Now we are going to need to get into the horrible morass of expression parsing. So what this is going to do is read an expression. Can you have an EQU without a label? I don't think it makes sense, but it may actually be valid. Let's try that. Right, S means syntax error. Yeah, the original assembler's error reporting is kind of terrible. So yep, you are not allowed to have an EQU by itself, which is nice, because what we put here is: if no current label, syntax error. Dead easy. Right. If the current label is a set label, then we actually want to update the value. If the current label is an EQU label, then we cannot change the value; what this means is that the label attached to this instruction has already been defined somewhere. We know that current label must be a label of some description, therefore if you pass this, it is a set label, so we can just do current label value equals read expression, and upgrade it to an EQU label to prevent anyone else from changing it. And we then expect the next token to be a newline. Okay, and this is going to be our expression parser, which we hit as the error. Don't like this very much; it's a bit of a pain. So we can define a value and change it; that's allowed. If it's a set label, can we then use it? P. What does P mean? I'm pretty sure that's an error. But I think maybe I need three different types of label. I need set labels, EQU labels, and implicit labels, ones that are new on the left-hand side of the expression but which haven't been defined yet. So an undefined label can be upgraded to an EQU label or a set label. Yeah, here we go: label does not have the same value on two subsequent passes through the program. Phase error is a bane of this assembler; it just means something happened, without any real information as to what. Okay, we've got... right.
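The three label states discussed here can be sketched in C roughly as follows. This is a minimal illustration, not the code from the video: the names `undef_label_cb`, `set_label_cb`, `equ_label_cb`, `do_equ` and `do_set` are assumed for the example, the value is passed in directly rather than coming from the expression reader, and simply rejecting a second EQU on the same label is a simplification of what the original assembler reports as a phase error.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*callback_t)(void);

struct symbol {
    uint16_t value;
    callback_t callback; /* doubles as the symbol's type tag */
};

static unsigned error_count;
static void syntax_error(void) { error_count++; }

/* These act only as type tags. They run only if a label turns up
 * where an instruction was expected, which is a syntax error. */
static void undef_label_cb(void) { syntax_error(); }
static void set_label_cb(void) { syntax_error(); }
static void equ_label_cb(void) { syntax_error(); }

/* Label of the statement being parsed, or NULL if it has none. */
static struct symbol *current_label;

/* label EQU value: define the label once and freeze it. */
static void do_equ(uint16_t value)
{
    if (current_label == NULL || current_label->callback == equ_label_cb) {
        syntax_error(); /* bare EQU, or label already frozen */
        return;
    }
    current_label->value = value;
    current_label->callback = equ_label_cb;
}

/* label SET value: define or redefine, unless an EQU froze it. */
static void do_set(uint16_t value)
{
    if (current_label == NULL || current_label->callback == equ_label_cb) {
        syntax_error();
        return;
    }
    current_label->value = value;
    current_label->callback = set_label_cb;
}
```

A new identifier would start out with `undef_label_cb`; pointing `current_label` at it and calling `do_set` twice changes the value both times, whereas after a `do_equ` any further `do_set` or `do_equ` on that label only bumps `error_count`. Comparing function pointers keeps the type tag inside the existing callback field, which is the reason given above for not spending another byte per symbol.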
So we upgrade undefs to EQUs. For these callbacks to be called, the labels are used in an instruction context, which is just a simple syntax error. These three callbacks only really exist as enumeration values anyway; they will never happen in a proper program. So bare label cb is the actual routine that gets called if you use a label in a statement on its own. What that will actually do is assign the current value of the program counter to the label. And in fact we're going to be using this in a lot of places, so most instructions are going to call that. If there is no current label, do nothing. Program counter is a new variable we're going to create that contains the current program counter, obviously enough.
Okay, we're going to have to tackle the expression parser next. The expression parser actually evaluates, well, an expression. This is an infix-syntax expression, one of these, with operator precedence and parentheses and all that ghastly stuff. We have these operators that need to be evaluated, we have symbols that need to be looked up. The syntax is very simple, but it is kind of subtle. The good news is that the output of an expression can only be a uint16, so that's nice. So let's have a look at that, then.
Well, thanks to the magic of video editing, a cup of tea has miraculously appeared on the table next to me, which means everything is better. Okay, let us tackle this expression parser. I'm not fond of these; they're just fiddly. So we have five levels of precedence, and in fact we have a couple more, because we've got unary plus and unary minus. So we have leaf values, which are the highest precedence, such as a or b or plus a or minus a or parenthesized operations. We then have OR and XOR, then AND, then NOT. NOT? Why is NOT there? NOT is a prefix operator, a leaf operation; that should be at the bottom. Hmm. Right, I see what they're getting at. Oh, yikes. Right. What? So
this is not the way C does it. So in C, the NOT operator which is being applied here will bind only to c, but in the assembler it actually binds to the entire following expression. Oh, that's awful. It doesn't specify a precedence for unary minus, but if unary minus applies here, then this means that the minus binds to one times two, which I don't believe could possibly be right. In fact, that's a poor example, because... let's try that: minus one SHL two. So I'm going to treat the unary minus and plus as leaf operations. So we're going to actually do this as a recursive descent parser, and in fact we're going to need some lookahead. We can probably do without it, but we need lookahead... I think we can. So, if reading the expression includes the terminating character, which is going to be a newline or comma, this means that if I have something like this, then reading the one plus two will also include the terminating newline. And if we have some imaginary... hang on, we can do this... then reading this expression includes the terminating comma, so that reading the next expression includes the newline. The problem with this is this doesn't tell us what the terminating token was, which means that we don't know, with DB here, whether we have in fact read the terminating newline and therefore need to go on to the next statement, or whether we've read a comma and therefore need to read another expression. Now, we could do some lookahead. This would be an unread token function. The issue there is that tokens have state attached to them, particularly the token buffer, and we can't save that. So I am actually going to adjust this so it returns not the value but the terminating token, and the value can be stored in token number. So, EQU here: if reading the expression does not result in a newline, it's a syntax error. So let's read a leaf expression, and I need to prototype this, for reasons that will become obvious
in a moment. So what we do is we read a token. If the token is a number, return the value. If the token is an identifier, return the value of the identifier. If the token is a plus sign, then recurse. If the token is a minus sign, also recurse. If the token is an open parenthesis, then recurse into read expression and expect a close parenthesis; if it's not a close parenthesis, it's an error; and return the value. Otherwise it's an error. If it's a string, then we return the first character of the string. Let's just make sure it is actually the right length. Has that actually... right, it doesn't know this syntax error doesn't return, so let's just put that in. This needs to be set. Let's see... label. Okay, this is the bottom-level expression reader, and this just returns a value. I wonder if it's... yeah, this just returns a value.
We now have multiple levels of precedence to deal with, so let's start with the highest. We've got five levels. So we read the leaf on the left and we read the operator. The operator may be one of these identifiers or one of these. We're going to need a new placeholder callback, and our operators are MOD, SHL, SHR, NOT, AND, OR and XOR. Values are completely arbitrary. So NOT, AND, OR, XOR, and we need to add these to the hash table. So mod symbol, MOD... wrong key... except it has to go here, because alphabetical order. And let's put these in the hash table. And symbol... oops, I got my N and my M the wrong way round, and I need to make sure that this is the last in the chain. This is XOR. Okay, does that build? No. Right, value is unsigned; we just want the value here to be out of range for a character, that's all. Yep. Okay. Right, if it's an identifier, then the infix operator must have the operator callback, otherwise that's a syntax error. We're actually going to be using that quite a bit, so that's just... I am actually slightly wondering... I think I should put the other operators in the lookup table as well, to be honest, and we can actually use the callbacks. No. So I thought that we could
use the callback data to actually do the work, but it'd also be nice to encode the precedence in there as well. I think we can't use the operator callback to do the expression work, as the operator callback is called for using one of these things as an identifier. For example, if someone were to just do MOD, then that would be invoked as an instruction, and we want that to produce an error. So we have to use the value field for the actual operator thing. Yeah, that should actually be a callback... except the prefix ones. Well, there is only one prefix one, which is NOT. But our tokenizer is not reading the symbols; it's not turning the single-character symbols into identifiers. I mean, we can change that, but I think that'd be the wrong way to do it. Yeah, I think this is wrong. What we actually want is our shifts: shift right... arithmetic? Of course it doesn't say. Oh, here it does: zero fill. Okay, and the other arithmetic operators are unsigned too. This is not a pointer, this is a value. I mean, pointers and values are both two bytes, but it'd be kind of nice... this code currently does not have any platform-specific stuff in it other than that memory allocation routine. Yeah, I can think of numerous different ways of doing this; what I'm trying to figure out is which the least bad one is. What is more, I do not believe that this recursive descent approach is quite right, because all the logic is going to be the same. Okay, let's actually get rid of all this and go back to... oh, I know what to do. I know what to do. Right, let's put all this back again, and we also want to add plus, minus... sub. Let's sort those. Okay, so what we're going to do is that these are all going to be indexes into an operator table. Let's move this up here. So add and sub are precedence one, AND is three, DIV is zero, as is MOD, NOT is two, OR is four, left and right shift are zero, and sub is the same as add, which is one. Okay, so these values are now indexes into the operators table.
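The operator table described here might look something like the sketch below. The identifiers (`op_info`, `OP_MUL`, `apply`, `binds_tighter` and so on) are invented for illustration; only the precedence numbers, with zero binding tightest and four loosest, and the unsigned 16-bit, zero-fill arithmetic come from the discussion. Putting multiply at level zero alongside DIV and MOD, putting XOR at the same level as OR, and having division by zero yield zero are assumptions.

```c
#include <stdint.h>

/* Operator ids; in the scheme above these would sit in a symbol's value field. */
enum {
    OP_MUL, OP_DIV, OP_MOD, OP_SHL, OP_SHR,
    OP_ADD, OP_SUB, OP_NOT, OP_AND, OP_OR, OP_XOR,
    OP_COUNT
};

struct op_info {
    uint8_t precedence; /* 0 binds tightest, 4 loosest */
    uint16_t (*apply)(uint16_t left, uint16_t right);
};

/* All arithmetic is unsigned and wraps at 16 bits. */
static uint16_t op_mul(uint16_t l, uint16_t r) { return (uint16_t)((uint32_t)l * r); }
static uint16_t op_div(uint16_t l, uint16_t r) { return r ? (uint16_t)(l / r) : 0; }
static uint16_t op_mod(uint16_t l, uint16_t r) { return r ? (uint16_t)(l % r) : 0; }
static uint16_t op_shl(uint16_t l, uint16_t r) { return r < 16 ? (uint16_t)((uint32_t)l << r) : 0; }
static uint16_t op_shr(uint16_t l, uint16_t r) { return r < 16 ? (uint16_t)(l >> r) : 0; }
static uint16_t op_add(uint16_t l, uint16_t r) { return (uint16_t)(l + r); }
static uint16_t op_sub(uint16_t l, uint16_t r) { return (uint16_t)(l - r); }
static uint16_t op_not(uint16_t l, uint16_t r) { (void)l; return (uint16_t)~r; } /* unary: left is ignored */
static uint16_t op_and(uint16_t l, uint16_t r) { return (uint16_t)(l & r); }
static uint16_t op_or(uint16_t l, uint16_t r) { return (uint16_t)(l | r); }
static uint16_t op_xor(uint16_t l, uint16_t r) { return (uint16_t)(l ^ r); }

static const struct op_info operators[OP_COUNT] = {
    [OP_MUL] = {0, op_mul}, [OP_DIV] = {0, op_div}, [OP_MOD] = {0, op_mod},
    [OP_SHL] = {0, op_shl}, [OP_SHR] = {0, op_shr},
    [OP_ADD] = {1, op_add}, [OP_SUB] = {1, op_sub},
    [OP_NOT] = {2, op_not},
    [OP_AND] = {3, op_and},
    [OP_OR]  = {4, op_or},  [OP_XOR] = {4, op_xor},
};

/* True if operator a binds more tightly than operator b. */
static int binds_tighter(uint8_t a, uint8_t b)
{
    return operators[a].precedence < operators[b].precedence;
}
```

With a table like this, an expression reader only needs the operator id to find both how tightly the operator binds and what it does, for example `operators[OP_ADD].apply(1, 2)`. Keeping the id in the symbol's value field leaves the callback free to report an error when a name such as MOD is used as an instruction.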
So when we read an expression... so this is the precedence of the current expression. So this is just going to do read expression with precedence, and the top level is the lowest precedence of all. So do that. So this is the token that we read. Now, if it's an identifier, then it must be an operator callback, which means the operator comes from the value. Otherwise we test it: it is either an operator or it is a terminator. If it's a terminator, we just exit immediately: we set token number to be the value we read from the left-hand side and we return the value we read. Right, so now we know what operator we're actually operating on. We look the actual operator up. Now, if the precedence of the operator is greater than our own precedence, then... so that will happen here, in b. So what this will do is we will read the a; our current precedence is infinite, so plus is a lower precedence, so... I think I've got my precedences backwards. Yeah, the highest is five... no, I've got this right. Zero is the highest precedence, four is the lowest precedence, and 255 is infinite precedence. So if the precedence of the operator we've just read is higher than our own precedence, then we wish to bind the operator to the current value. So what we're doing at this point is we're deciding whether to apply the operator we have just got to our left-hand value. Ah, no, we're deciding what to do with the value on the right. So yeah, I need to undo all that. Right. If the current operator, let's say the star we've just read... this works, right? We read the a, we evaluate that, we now read a star. Star is higher precedence than the outer expression, which is infinite, therefore when we read the next value we want to apply the star to it. Yeah, that's just standard left accumulation. Now, the other direction is more interesting. We read the a, we read the plus. Right, we read the b, but we don't do anything with it yet, until we see the operator
following. Oh, I hate this stuff; so hard to get straight in my head. Right, we're actually going to need to stack things. I was hoping to use the CPU stack for this and call this thing recursively. Normally I do this using multiple hierarchical functions, but there's so much common code that I think... yeah, let's do some factoring. So this stuff here, for reading the token, reading the operator: it returns either an operator or it aborts, in which case we need to retain the token ID, because it's a terminator. The trouble is, to avoid lookahead... yeah, normally you have, like, the low-precedence function for handling plus. You read the a, you read the operator; if the operator is not the one you're expecting, you rewind back to where you were and you try the next lower precedence handler. Yeah, I'm going to have to stack this. Yeah, this is terrible. And it is going to have to be recursive, because we need to be able to recursively call read expression, so it's going to have to put stuff on the stack, and that's not very big. This is normally known as Dijkstra's shunting-yard algorithm, and I actually have an implementation of it in Cowgol, and it's gruesome; so many horrible special cases. If we can avoid the parentheses, that will help. The other thing you do is you just use an actual parser generator, which is so much easier, and I could do this; for this I would be using yacc or bison, and they generate plain C parsers that get all this stuff right for you, which I can then run through SDCC and compile and link. But they're not small. Yes, I'm actually going to go away and look up how this works properly.
Okay, let's try this properly, and we are going to be using Dijkstra's shunting-yard algorithm. So we need to maintain a stack of values and a stack of operators. Right, and we start reading expressions. The tea is not helping. When we get a token, we resolve it... oh, that's stupid. The very first thing, that must
be a leaf, either a parenthesis operation or a value, so we actually just read that directly into the value stack. And the heart of the operation now starts reading infix operators. So, as before, we read a token and we resolve it to the ID. If it's not a valid operator, it must be a terminator, therefore we exit. Now, it may also be a parenthesis, which are handled specially, and we are actually going to take the parenthesis code out of read leaf expression. Are we? I've got some pseudocode from Wikipedia here. I can actually do this all in one go. We read a token from the input stream. If it is a number, stack it. If it's an identifier, stack the value. If it's a string, stack the value. If it is an identifier... oh, hang on, I remember this from last time. This algorithm, I do not believe, is quite right. Yeah... no, this algorithm is right, but it doesn't do unary operators, because there actually need to be two phases, where you read values and you read operators, that is, values and infix operators, because operators do different things depending on where you are. So I believe at this point we are reading values. So at this point, if we have a plus, then we really just want to skip it. Yeah, plus is a poor example, because yes, that just does nothing. The more interesting one is unary minus, because this actually ends up as being zero minus whatever. So I believe that what we want to do is to stack a zero. So what we are effectively doing is we are converting unary minus into zero minus whatever, and then we shift into the second phase, which is to read the operator. We also need to cope with parentheses, and when we receive a parenthesis we push it. We're actually going to mark this in an incredibly dodgy way, with a null in the operator stack. And at this point we know that we must be a value; nothing else is allowed, so everything else is a syntax error. Right, and in fact that op sub is... let's not do it like that. Let's add a unary, so you just
add a neg, and this is highest priority, so let's just bump all these to make that highest priority. What we end up with on the output of the shunting-yard algorithm is postfix, but that's not going to work here, because we've actually pushed the operator before we've pushed the value. Okay, break time, while I go and look up how bloody unary operators work.
Okay, I think I have a bit of a handle on it now. So remove all this; let's put this here. So the secret is that when you push the operators, you check the topmost operator on the operator stack and apply it, but you apply operators from the operator stack while they are higher or equal precedence to yours. This actually makes stuff like unary operators fall out in the wash. The thing to remember is that when you see an operator, you don't apply it then and there; you push it onto the stack and apply it later. So what we want to do is, if there is anything on the stack... let's do it like this. If the precedence of the topmost operator is lower, by which I mean higher, because my numbers are the other way around, than the current one, then pop the operator off the stack. We need to record whether these things are unary or binary. It's a stack, so it's actually the rightmost one that's first. If the thing is binary, then also pop. If, on the other hand, it is unary, then... and once that's done, we then push the current operator. Right. Now, that's mostly bollocks. If it's a number, push it. If it's an identifier, yeah, as before, push all this stuff. Right, now we need to decide whether this is unary or binary. That we can tell from whether the last thing was an operator or a value. So if the last thing was a value, then... and I'm actually going to adjust my operator stack so that we store op IDs rather than operators themselves. So if the last thing we saw is a value, then this is the infix version, therefore it's a sub; otherwise it's the prefix version and it's a neg. Parentheses need an operator ID. This doesn't actually appear in
the operators table; I'll need to be careful about that. So let me double-check the code. Parentheses: you see a left paren and you just push it. Yeah, this does actually need an entry in the table, but it doesn't want a callback. Right. Identifiers can be infix or prefix. If the last thing we saw was a value, then it's an infix thing and therefore this must be an operator, so anything other than an operator is a syntax error. If, however, the last thing we saw was an operator, then this is a value, so push the value of the symbol. Right. If the thing we saw is a right parenthesis, then we keep applying operators from the stack until we see an open parenthesis, and then stop. If we didn't find one, then it's an error. Let me see, the top ID... if it's an open parenthesis... oh yeah, a close parenthesis only makes sense if the last thing we saw was a value. If it's not a value then it's an operator, and that means nothing. So if this is a parenthesis, then just pop it and stop; otherwise apply the operator. I can do better than that. Okay, that looks reasonable. If the token we saw was none of these, then it must be a terminating token, so stop. Right, we've now run out of tokens, so we have stuff on the value stack and possibly some operators. So while there are operators on the stack, keep applying the operators. Now, you can only push a number if the last thing was an operator; likewise strings. Okay. Unary plus: if the last thing we saw was a value, it's an infix operator; otherwise do nothing. An open parenthesis only applies after an operator. Okay, that goes here. This is going to be interesting to debug, I have to say, but it actually looks reasonably coherent now. So at this point everything has been resolved and there should be one item on the value stack. So of course it's not going to work. Of course it's unreferenced; it's supposed not to be referenced, that's
kind of the point. And we're also going to need a neg in there, and a comma here. I've not put xor in there. Okay, well, now we get a syntax error, which is a start. I don't imagine for one moment this will ever work, but let's just print that. What value did we get from that? EQU zero. Well, that's actually correct, which is nice, but it's then not doing the next thing, so we're going to need to put some debugging in here. Well, that's a newline, and newlines are terminators. Okay, let us dump the stacks, particularly around that zero operator. Ah, this would probably help a little. Right, that's better. Okay, so the first expression is this one, and this is just a simple literal. We start the loop, nothing's on the stack, so we read a number, we terminate, and that value is the one we return. So this syntax error is probably happening because we're getting that NOT. Let's just put this... before you ask, no, I don't have a debugger, or rather I do, but it's a machine code one and it's not helping here. So token FFF is an identifier. This is going to be that NOT here. Now, NOT is an operator, so we're actually going to hit this, and we have not seen a value. Do I have these backwards? Oh, NOT is prefix, so prefix operators only make sense if the last thing seen is not a value. Right, I'm going to do this the other way around. If this is an operator, then look it up. If it's binary, it only makes sense if the last thing seen was a value; if it's unary, it only makes sense if the last thing seen was not a value. So let's do this. It's an operator, so push the operator. Yeah, here we are actually pushing the operator value as a value, so that's not good. And that's an operator, therefore seen-value is false. If it's a label, then it only makes sense if the last thing seen was an operator. Right. Otherwise, unknown identifier; I think
that's a syntax error. It might be a terminator. Let's try that. Oh, we got something. We got more. So we got a number and it was zero; done, the result was zero. All right. The next one: we got an identifier, which is operator six, and we got a value, which is zero. I don't think that operator is being applied correctly, but let's try this. Okay: value, operator six, result zero. So operator six is NOT, right? That's the one we wanted. Is the NOT callback... yep, that's the one we wanted. See, this looks like it's working: it's popped a value, it's unary, it calls the callback and it pushes the result. Oh yeah, we're also going to have to put star and slash on this list, but I'll leave that until I get the rest of it working. Oh, that's applying the wrong one. Is my unwind correct? No, it's not. Right, okay, that worked. So we read an operator, which is six; we read a label value, which is zero, the one we set previously; we unwind; we apply the NOT; we get minus one. Awesome, that worked. So let's just put star and slash in here. I think it's worth using a switch there. Incidentally, I confidently expect this to be the hardest and most time-consuming part of the whole project. There's going to be a long, boring piece of data entry where I add all the opcodes, but seriously, I think this is the bulk of it, which, given how embarrassingly long it took me to do, is probably a good thing. You can't use this as a prefix, so it only makes sense if the last thing was a value. There's a lot of common code here that I might be able to do something with. Okay, right, what are we doing next? Testing EQU false... oh, the syntax error is from this if statement. Fabulous. Right, conditional compilation. This evaluates the parameter; if it is non-zero it does the first lot of stuff, and if it's zero it does the second lot of stuff. It's completely standard. But before I do that, I'm going to do some testing and probably go and have dinner. Why do I get "already defined" for this? I
think it's that blank line. Yeah, it's the blank line at the end of the file. Okay, let's try some expressions. One plus two is three: good. Two times three plus four is not good, apparently. So, two times three; right, the next operator is plus, which is lower precedence than the times between the two and the three. Do I have that the right way around? I've got this the wrong way around. Yeah, I am actually confusing myself by having these the wrong way around as well. So if the current operator is higher precedence than the one on the top of the stack... in this case it's an asterisk. I think I also want that to be less than or equal to. Let's try that. So let's just double-check what that is. Two times three plus four is 10. Result: 15? Not what I expected. I think it's done that, then this... E is 14, yep. Right, that has in fact got the precedence completely backwards, so I suspect I've also managed to confuse myself. When there is an operator on top of the operator stack of precedence higher than or equal to the operator we are currently processing, pop it off and apply it. So if there is an operator on the top of the operator stack with higher or equal priority, and that means having a lower precedence number, apply it. Why is it applying this over and over? Ah, it's not reached the unstacking stage; it's still trying to deal with the plus here. I am forgetting to actually pop the stack. Right, 10 is correct. And actually, let's have a bit of a... value SP equals, operator stack. Okay, the result of that is 10, which is correct. So let's try this the other way around, which is 15... 14. It's still 14. Okay, that's worked. I'm impressed so far. Let's try that, and it fails. Right, it's trying to apply a parenthesis, which we don't let it do. So if the thing on the top of the stack is a parenthesis, do not apply anything; we'll deal with it later. I think. So what's this done? Okay, we have an open parenthesis, which is pushed onto the stack; we have a value, two; we have an operator, add; we have a value, three; we come across a parenthesis. I don't need this
piece of code here, because parentheses have infinitely low priority, so that will come out in the wash. Okay, so it's trying to unwind the parentheses: we pop off the op ID and, if it's a parenthesis, do nothing. Wait a minute, zero... zero's an add. Right, okay, and we get 14, which is the right result. Hang on a second: 14 hex is 20 decimal. Right, that's the right result. Okay, I believe this is looking like it's working, at least enough to go on with. So let's take the debugging code out and see how that goes. I've missed an apply. Yeah, I ain't deleting any of this code; I'm sure I'm going to be needing it to debug future horrible things. Right, let's see how big our assembler is: 4K. That's going to have the bulk of the complexity in it. Yeah, this is going to end up being bigger than the hand-tooled assembler, which is not really surprising, but I think it's probably going to be okay. Okay, I need to go and sort out dinner, but I will be back in an instant. Okay, dinner has happened and my blood sugar is up again, so let's have another go at this conditional assembly. I also took the opportunity to look at the original assembler implementation. Now, there are two ways to do this kind of conditional assembly. Let's say testing here is false. The first way is to read and parse each statement but not actually do anything. This lets you guarantee that the statements you're skipping over are syntactically valid. The other way, the wrong way, is that when you hit the if statement you just start reading tokens until you see an else or an endif. And I am very glad to say that it looks like the original assembler is doing it the wrong way, because that's a lot easier to implement. Now, this does mean you can still do nested if statements, but I don't know if it supports them. Let's actually give that a test and see. So, if one; that should produce an error message. Oh, that actually created a label. So let's do that too. Right, that has created some
syntax error message. So change that to zero and it doesn't. And that's interesting. Why has that... I think I know what's going on here. Let me do a little bit more; let's actually generate real instructions. Right, well, interestingly, the documentation here does not describe else at all, which makes me wonder whether this actually works. Change that to true and try reassembling the CCP. P is phase error. Right. So when I assemble the CCP using the original assembler it looks like it's working, but I don't think it is. See, I don't think it's paying any attention to that else. Is bdosl actually used anyway? Yes, it is, once. Ah, I wondered what this line was doing. It's these stupid error messages; I failed to register that this was actually an error and not a log. Right, it's complaining that bdosl has not been referenced, and that's because this else here is not being honored. So this source code doesn't actually work in the DR assembler, and it doesn't really report it very well. Well, let's just quickly hack it so it does work, so it's not actually original. Okay, and what is N? That's another error message. Page 16: N, not implemented. Right, yeah, this was intended to be assembled using MAC, the macro assembler, which is a much more capable assembler with macro facilities and, you know, else. Okay, well, that simplifies things for us. I might implement else because it's easy and will be really useful, but let's do our nested if. So what I'm thinking is that if it supports nested ifs, it doesn't do anything at all; if it doesn't support nested ifs, then this if line will scan until it hits this endif and then assemble this thing. Did that do a thing? Yes, it did. We do not have nested ifs. Okay, I'll implement it like that to begin with; extending it will be easy. Right, we want two keywords, if and endif. So, if symbol, if; make things line up nicely; and this wants to go to the endif symbol, like
so. Right, that won't build, so we want our if callback and endif callback. Let's do the endif callback first. Where's my test gone? The endif instruction will only be executed in this situation: the if is taken, we hit the endif, we do nothing, and we go on. If the if is not taken, if zero, then we're actually going to scan for the endif without executing instructions, so the endif instruction itself will never be executed in that situation. So the endif callback is a no-op. Right, now the if callback. The if callback reads an expression, and if the result is non-zero we take the if. If the result is false, then we scan for an endif, which we do simply: if this is the end of the file then stop, or actually it's an error; if the token is an identifier and it's an endif, then consume the newline after the endif. Right, that should work. Right: "expected an identifier". I have no idea where that happened, so let's actually go and add some line number information. Where's fatal gone? I was hoping to be able to avoid printing in decimal, but I reckon that line numbers in hex are a little bit antisocial, so let's have a decimal print routine from stat. It even prints with padding and precision; I don't think we want any of that. It's incredibly crude. I copied the logic from the old built-in stat.plm when I was rewriting it. Looking at this, I do wonder whether SDCC has divmod. The way it works is that precision here selects which digit we're actually printing. This line is reached when it's printing a leading zero, so we don't actually want to print those at all. The original code would allow you to pad it with spaces, but we don't care. So if our digit is non-zero, which is spelled out for readability, or we are printing the last digit, which is always printed, or we are not suppressing zeros, then print the digit. Okay, so, line number. Right, that will tell us what line an error happens on. Line one.
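The digit rule just described (print a digit if it is non-zero, if it is the last digit, or if zeros are no longer being suppressed) can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the code written in the video: the name `format_decimal`, the output buffer, and the divisor loop are all hypothetical, and the real routine presumably prints each character directly instead of filling a buffer.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: writes the decimal form of `value` into `out`
 * (which must hold at least 6 bytes) and returns the number of digits. */
int format_decimal(uint16_t value, char *out)
{
    uint16_t divisor = 10000;   /* largest power of ten that fits in 16 bits */
    bool suppress_zeros = true; /* true until the first digit has been written */
    int len = 0;

    while (divisor != 0) {
        uint8_t digit = (uint8_t)(value / divisor);
        value = (uint16_t)(value % divisor);

        /* Print if the digit is non-zero, or it is the last digit
         * (always printed), or we are no longer suppressing zeros. */
        if (digit != 0 || divisor == 1 || !suppress_zeros) {
            out[len++] = (char)('0' + digit);
            suppress_zeros = false;
        }
        divisor /= 10;
    }
    out[len] = '\0';
    return len;
}
```

Because the last digit is always written, a value of zero still produces a single `0`, while leading zeros are skipped for everything else.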
But of course it's line one, because we haven't actually written the code to check for newlines yet. So we're going to do that by testing for newline characters. I can't actually read these dark blue characters. So if c is a newline, and I want an actual newline, not a fake one, then bump the line number. Now, this will mean that the line number will be incremented before processing the actual newline token, which may not be what I want. In particular, when we get an unexpected newline, it will be reported on the next line. So we really want to bump the line number here, before reading the next character. We can do that easily enough with state. So we add a variable to indicate that we're at the end of the line. When we start reading, we are at the end of a line and the line number is zero. If we're at the end of a line, increase the line number; we are no longer at the end of a line. And all this does is set the flag to true. That should work better. All right, now where do we get to? Line 16. That's a decent way through the program. Org: why are we expecting an identifier here? Ah, in the current state org is an unrecognized symbol and therefore is treated as a label declaration, so it reads this expecting it to be an instruction, and of course it doesn't work. So all we need to do is add the org symbol. What org does is set the origin, that is, the program counter. Org symbol, and we read an expression. We're actually doing this so often, let's make a helper function for that. Now, we had an expect up here, so let's do: program counter equals token number. Yeah, that should be fine. Line 17, syntax error. Aha, this dollar sign is a special value that represents the current program counter. That's easily added; it'll go up here. It's really a number. Whoa, 43. We're really racing through this. That big comment had nothing to do with it. And it's our first instruction, the first time we actually want to emit something, although this is currently
pass one, so we won't be emitting anything right now. This is the point where I go looking to see whether I have a nice table of the 8080 instruction set. That's not the one I was looking for. Is this one better? This one's the one I wanted. Now, the instruction set's really simple. This is it; this is all of it. We've got single-byte instructions with registers and things baked into them; we've got two-byte instructions with a register baked into the instruction and a single-byte operand; and we've got three-byte instructions with a two-byte payload and, frequently, a register baked into the instruction. Jump is really simple: a two-byte payload plus a fixed opcode. So the way we are going to do this is that this will mean it's a simple, sorry, not a one-byte or two-byte instruction, a simple three-byte instruction, simple meaning we're not mutating the opcode itself. So we create a jump instruction with the value being the opcode, which is 1100 0011, C3. I don't need that prototype because that's done for me up here. Right, and now on entry current insn will contain the instruction being executed, which means it's got the value in it. Now, there are two things that need to happen here: one is we need to process and parse the instruction itself, and the second is we need to set any implicit label that's currently in effect. The second is easily done by calling set implicit label. But now we need to read the payload, which is an expression, like so. Very easy. At this point we will then actually emit the bytes: we want to emit an 8-bit value, which is the opcode, and then we want to emit a 16-bit value, which is the result of the expression. And that is it; that is everything we need to generate all the simple three-byte instructions in the instruction set. We need to add them to the symbol table, obviously, but they will have this callback, and this value will contain the opcode. Now that's failed because we haven't actually
written emit 8 and emit 16 yet, so let's do that now. Yeah, I seem to have changed my mind about making all the functions static. Making them static is better practice, but it also makes debugging a little harder, because with SDCC they don't show up in the map. Right, so if we are on pass one, increment the program counter but do nothing else; otherwise, fail. Likewise for emit 16. Yes, I know they're unreferenced. So during pass one all we're doing is determining where everything lives in memory, so we're not actually going to emit anything. We're just going to keep track of the fact that this is a one-byte thing, this is a two-byte thing, etc. When we get to pass two we'll actually have some fixing to do, but then we will start actually emitting code, and I've had some thoughts about that which I'll get on to later. Okay, we've got our first two instructions. DB: DB is one of our first interesting instructions. What this does is emit bytes, but it's a little bit special. DB takes as an argument a comma-separated list of values, and it is special in that you can use as a value a non-zero-length string, like this. The reason we went through all that pain with the expression parser and dealing with lookahead is so that we can figure out where these arbitrary-length parameter lists end. But I've now had the thought that we can't actually identify strings that way, because if we try to read a string as an expression it'll just fail. We need to look at what the next token is in order to decide whether it's an expression or a string. Also, I noticed something in passing, which is that you're allowed to use two-byte strings for words, so we're going to have to do something there as well. Now, how do I do this? Right, the obvious thing actually is that we're testing whether the token length is one, and we also want... if we see a string and it is a
long one... no, actually, I think I know what to do here. If we see a string and there is nothing pushed onto any of the stacks, then we terminate immediately without doing anything, and we terminate with token string as the value. What this will do is let us detect this here. So we start reading stuff. If we terminate with a newline or a comma, then we want to emit the value read, but if we terminate with a string, then we want to emit the entire contents of the string. Then we read any newline or comma which is coming next, because we know there must be a terminator next, and we keep looping, doing this until we hit a newline. Does that work? There is no such thing as token comma. Line 52, "expected an identifier". Well, we did this bit, or at least we did something. DS: now, DS ought to be straightforward. What DS does is create uninitialized memory; it's exactly the same as bumping up the program counter. Let's go and add that now. I've no idea if any of these are doing the right thing yet; we won't know that until we reach pass two. So we expect an expression. Oh yeah, very important: we need to set the labels here too. All these things need to set a label first, in fact; the only pseudo-operations that won't set a label are EQU and SET. I wonder if we can do this more cleverly. Probably, but we can also do it more crudely, which works just as well. There we go, that's much better. Now we can remove three bytes of code from every single one of these. Right, DS: expect expression, program counter plus-equals token number. Sorted. Oh yeah, we haven't done SET yet. SET is actually a copy of EQU, except that if the label is undefined or a SET label then we are allowed to change it; if it's an EQU then you can't use it as a SET label. And let's just add that to the list. Table two: our first three-item chain. Yeah, adding all these opcodes is going to be a bind.
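The EQU versus SET rule just described can be sketched in C as below. This is only an illustration of the rule as stated, under assumptions: `struct symbol`, `enum sym_kind`, and `define_symbol` are hypothetical names rather than the ones used in the video, and the sketch ignores what happens when the same EQU is seen again on pass two.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical symbol kinds: a fresh symbol starts out undefined. */
enum sym_kind { SYM_UNDEFINED, SYM_EQU, SYM_SET };

struct symbol {
    enum sym_kind kind;
    uint16_t value;
};

/* Tries to give `sym` a value via EQU or SET.
 * Returns true on success, false if the definition is not allowed. */
bool define_symbol(struct symbol *sym, enum sym_kind kind, uint16_t value)
{
    if (kind == SYM_UNDEFINED)
        return false;           /* callers must ask for EQU or SET */
    if (sym->kind == SYM_EQU)
        return false;           /* an EQU label is fixed once defined */
    if (sym->kind == SYM_SET && kind != SYM_SET)
        return false;           /* a SET label cannot turn into an EQU */

    sym->kind = kind;           /* undefined, or SET being reassigned */
    sym->value = value;
    return true;
}
```

With this shape, a SET label can be assigned repeatedly, while a second definition of an EQU label is rejected regardless of which directive attempts it.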
Expression stack underflow. That's a new one. Line 52, interesting. Right, combuf is defined here, so if I just paste that into my test program... why can we not create the output file? Did I delete print fcb? I think I did. Why is it trying to open... okay, what did I do there? Open output file is called twice. Not only twice. I wonder if this is a stack overflow. Yeah, I think it's a stack overflow, because it's somehow calling open output file more than once. So it's calling this relatively complicated function, but it's not recursive, so why would it be using lots of stack? The easiest way to check that is to just double the amount of stack and see if we get the same error. We get the same error. Okay, it's not stack, then. I think it may be that... yeah. Right, okay, what's happening is that that instruction callback is garbage. That instruction callback might be null. Oh, that is interesting. Now, the instruction callback might be null, and also current insn might be null. So let's do that. Callback is never going to be null, so we do want... yeah, it's actually a lot easier than I was thinking. Okay, right, that's better. Let's move that tracing. Let's do a debug. See, I told you I'd be needing this stuff. Okay, what's it doing? Value, push value, operator minus. Ah, right, we're seeing the first parenthesis, which is of course infinitely low precedence, which means that the thing on the stack is higher precedence, so it's trying to apply it immediately, but we haven't actually pushed any values yet. So in fact we don't want to call push operator. Maybe parentheses need to be infinitely high precedence. Parentheses: let's just push it and not do anything about it. I actually need to rename those symbols to something better. Okay, that did a thing. Good. Let me see if I can remember my Vim regex syntax. Okay, we want to change push operator followed by a parenthesis to push and apply operator; now we want to change push operator raw to push operator. Right. Oh
yeah, and that's produced the right result too. Good. Let's try our CCP again. Line 54, syntax error. Ah, this is the first time we have one of those optional colons, so we need to check for them. Oh no, we are checking for them. So we've read the identifier; it's been added to the symbol table as a label; if it's a label, then set current label pointing at it; read the next token; if it's a colon, we read another token. Following that would be DW. So it's not a newline, it is an identifier, therefore it should be... well, actually, it won't be an identifier, because it's not in the symbol table. Okay, I know what's going on here. It is an identifier; it's always going to be an identifier, because reading the token has added it to the symbol table as a value. So the callback here is actually calling undef cb. Now, the most common failure case for this is going to be that it is just not an instruction we know about. There you go: line 54, unrecognized instruction. That is a DW. Okay, now, DW. We're going to have to do something here because we need to support 16-bit string constants. That's actually straightforward; I just need to figure out the endianness. In most cases string lengths are restricted to one or two characters, in which case the string becomes an 8 or 16 bit value. The second character goes to the low-order byte. That's big-endian, which is not the ordering of the 8080. Okay, the way to do that is up here in token string. So that's not going to work, this bit here. What I was going to do is do that, and, well, I still need to do this, so I'll do this. I can do better than that. Okay, that should make the string constants work. We read the first byte; if it's a two-byte constant, then we read the second byte and adjust; if the token length is wrong, then we error out; and then we push the value. Right, what's going to go wrong here is that if this is a DB and you have a 16-bit character constant at the top level, then this is going to return
out and return 16 bits, and the character constant will be written as a string. That will actually work with DB; it won't work with DW. What happens with string constants in DW? Right: ASCII strings of length one or two are allowed, but longer strings are disallowed. Yeah, I'm going to do this the cheap and nasty way. So this hack is only going to go into operation if it's a DB. Right, this means in the DW code we are never going to get a token string output, which means that that code is actually quite a lot simpler. So we read the expression and we emit it, while we read commas, and we want to do it that way around rather than testing for NL because we might receive an end of file. Let's change this, so if we see a string then we do the string code; otherwise just emit. There we go, simpler. Okay, 96, unrecognized instruction. So we've got all the way up through here to 96, which is a MOV. Well, MOV is actually a little bit interesting. MOV refers to registers; it's got two register parameters. Now, the way the assembler handles registers is incredibly crude, and it's actually in the spec and everything, which is that they're all in the symbol table. They've got values, and the actual values here are the internal 8080 register numbers. So we need to add them to the symbol table. Value: A is seven. So values have no callback... actually, because they have no callback, we can test that here. No, they actually do have a callback. Doh. EQU. So, E... oh no, there is no F. H, L, M; H is 4. Let me rephrase: the Z80 is an extension of the 8080, so if you're used to the Z80, these are all the same registers you've got there, but they have different names, and the names are all a bit weird. Six is M. SP and PSW: SP will use six, PSW... yeah. The same names are used to refer to both 16- and 8-bit registers, so if you use B as an 8-bit register it refers to register B, and if you use it as a 16-bit one it refers to the BC
register pair. We could add aliases to these to make that a bit easier, but that's not what the DR assembler did, so I won't, just for now. Okay, I need to double-check the chains. A symbol, B symbol... yeah, I'm quite certain that I'm going to screw this up at some point. M, SP symbol, PSW symbol. Right, does that build? That builds. It doesn't work yet, but... right, that was MOV, wasn't it? MOV E,A. So we want to add that. Now, MOV, if you go and look at our table, has two registers encoded into it, so you've got the destination and the source. I believe that is the only instruction that does this. It is, so it's going to have to get its own dedicated callback, and so the value is unnecessary. We get two expressions, which are the destination and the source respectively. Expect expression... we'd expect there to be a newline, but this is not a newline. And we wish to emit that, so the opcode is 40 hex plus the destination shifted left by three, plus the source. Okay, and that gives us a syntax error. It's gone from unrecognized instruction to syntax error here. I have a bit of a suspicion that it's not getting past the MOV, but let's double-check that. Actually, I think I wish to just put some hack tracing in. So if we have the instruction, we are going to print it, like so. That should just list all the instructions it sees, so we know where it gets to. Right, so it doesn't like C; it thinks C is an instruction. Yeah, okay, so it's seen the MVI, it thinks it's a label, and so C is an EQU label which it's tried to call. So I'm going to have to put a better error message in for EQU labels. Let's do that. Yep. Okay, so now I want to add the MVI instruction. What did MVI do? MVI has a destination in the middle of the instruction; in fact, there are a number of these, but this one takes a byte operand, and I think that is the only instruction that does that. So it will
also need its own dedicated callback. Let me double-check the syntax. Yep, so we've got a register followed by an 8-bit constant. So we emit the instruction, which is 06, and then we emit the value. Okay, right, and we're now on line 100. We've actually done all these instructions. Push B: now, push is a simple one. It uses a register pair value, which is just a register number. You can see from here: 00 is BC, which is 0; 01 is DE, which is D... no, it's not. Wait, it's the top two bits of the three-bit register value. And yes, PSW is used in push and pop for that; that's why we defined it as 6, 110. Right, so there's a number of these instructions that use a register pair. Push and pop both take no other operands; LDAX and STAX take no other operands. Oh, so do all these. Yeah, all the register pair instructions are the same format, which is nice. So, push symbol, and the base instruction is 1100 0101, which is C5. We'll also do pop at the same time because we're here, and pop is C1. Right, so here's the callback that makes it work, and it's the cheapest way of doing this, actually. And what does this get us to? Why is print char being passed as an instruction? Because there's a call. We shouldn't have got that far yet. Oh yeah, we're here. Vim syntax highlighting for assembly thinks exclamation marks are comment characters. Okay, call is a simple three-byte instruction and its opcode is 1100 1101, which is CD, and we already have simple 3b in place for jump. So now we are at push, call, pop. Why has that stopped? That's a ret instruction. So what have we got here? Oh, we actually have a line number. Oh, we're down here already. Ah, I'm going to have to be careful of that. There's a lone instruction here that it thought was a label. That is completely valid, so it hasn't shown up in the listing and it's just magically skipped it. I've actually run into this with other assemblers and it's fantastically annoying: it all looks fine except it doesn't work, it just
misses instructions. Now, this is a simple one-byte instruction. I think we may not even have the function for that yet. Ret is C9. In fact, there's a bunch of them with different condition codes, and we actually need to encode these as separate instructions, so let's just do that now. And these all need hooking up, so: ret symbol, rz symbol; of course, each one has to refer to the previous one. Let's see, symbol, and this refers to rm. Right, and that fails because I haven't done simple 1b yet. This is not going to be complicated. Like so. Okay, so we see it actually called ret there, which is nice, and we're at line 111 again. ORA, that's the one it doesn't like. ORA is another common one: it's an arithmetic operation with a register as the source parameter, so there's a ton of those. So, read the register parameter. ORA is 10110, which is B0. One, two... INX. That's another of our register pair instructions. We've got INX, DCX and DAD, and they all have the same format. We've also got LDAX and STAX, so let's just go and add those. For DCX the opcode is 0B; DAD is 09. Okay, INX is 03. Yeah, I forgot to add if, but it never came up because I removed the if from the source code. Okay, LDAX and STAX: LDAX is 0A, and STAX is 02. Okay, right, now, how big is our assembler now? Five and a half K, and remember the hand-coded one was eight. I've got most of the logic in place. All we need to do at this point is add more opcodes, and then do the emission and print logic, and it's done. So where were we? One, two, three; we're here, and this file is 828 lines long. Oh, it doesn't like STA. STA is another simple three-byte instruction and the value is 32, and while we're at it we'll do LDA, which is next to it, which is 3A. Where have we got to now? INR. INR increments a register. I believe it's, yeah, an eight-bit register, so that's going to be a simple ALU destination register, and
it looks like there's two of those. Three of them... well, okay, these two don't count. INR and DCR. I am just beginning to suspect that I won't get this done today, which is a shame; I wanted to try and get this all done in a single session. What is that? INR is 04. What's the value for DCR? 05. I rather like the 8080 instruction set. It's very, very simple. I mean, it's very stupid in a lot of ways and extremely restricted, but you can see all the logic and the patterns to it. Okay, now where have we got to? Oh, I've done ALU source. Line 129, XRA. XRA is another of the ALU source ones; it does exclusive-or with A. So I might as well go and add those. Where's my ORA? And the opcode is A8. Again, let's do the others; we're going to need them. ANA is A0. ORA I've done, XRA I've done, CMP is B8. All right, line 130, LXI. LXI is another three-byte instruction, but it takes a register pair parameter. Where is it? Here we go. So this is a combination of RP and simple 3b. I think it's the only one. It is the only one, so it can get its own callback. So we read a register pair, emit that; expect an expression, emit that. Line 180, ADD A. Okay, so, hang on, I thought we added all these. Oh, there's another batch here which I forgot about. ADD is 80, ADC is 88. Now SUB and SBB: SUB is 90, and SBB. That's it, I think so. Okay, how are we getting on? Line 191, CPI. Now we get to all the ALU ones again, except as simple one-byte instructions: compare with immediate. Now, with luck they should all be... and they are in fact different opcodes. So that is FE. That's a simple two-byte, actually. There should be a DB there. Okay, XRI is EE, ORI is F6, ANA is A0. Hang on, what was I looking at? ANA, yep, A0 is the wrong one. I want this one: ANI is E6. And I have in fact not set the callbacks correctly for these, so let's go through again. CPI, XRI, ORI. CPI, XRI, ORI, ANI: simple 2b. Now this batch, SBI and SUI. SBI is DE, and I've also completely lost track of whether I've remembered to update the hash table. SUI: I'm willing
to bet that is going to be D7. It's not, it's D6. SUI symbol, ORI... yep, I didn't update ADD. That's ANI, CPI. Now I need ACI and ADI. ANI, really? Oh, that's the one down here. ADI and ACI: ADI is C6 and ACI is CE. Okay, now let's try it. And I haven't done simple 2b. Right, 198, JZ. Now we've got to do all the jump instructions, of which there are a bunch, because they're more of the condition code things. So where am I? Condition code RETs, here they are. So we just want to copy those intact over to here. Oh, I forgot to update any of the opcodes for those. Yikes. Okay, so they are in fact C0 with the condition code. That's in fact 0, yeah: 0, 8, 0, 8, 0, 8, 0, 8; C, D, E, F. Okay, so let's just use these and copy again. Okay, what are the jump opcodes? They are the same as the RET opcodes with the bottom two bits set, so instead of alternating between 0 and 8, it's going to alternate between 3 and, eleven, B. 201, CNZ. You know what this means? More condition codes. These are the conditional call instructions, and these are exactly the same as the RETs, except the bottom is alternating between 4 and C. Is that the right opcode? CNZ, yep. All of these are simple 3bs. CPE symbol, CM symbol. 236. That's probably... yeah, it doesn't like SHLD. SHLD is another simple 3b. Wait, why are these... oh yeah, those are instructions. And the opcode is, SHLD, 2A. Yeah, I'm getting tired; my ability to parse binary into hex at a glance is suffering. And LHLD is... wait, that's not a 3, that's not even an R, that's a 2. 2A? No, 22 is SHLD; LHLD is 2A. Yeah. Good news is we've probably got most instructions by now. Yeah, we've now skimmed all the way down to here, and it doesn't like the one after XCHG. Wait, that's an INX. No, it doesn't like XCHG. It doesn't know that XCHG is a one-byte instruction, so it's seen a label there, and after a label there cannot be another label. And I actually need to go through and do the other
one-byte instructions, because I'm not going to get errors about them. So this is EB. Oh, we're at the end. Awesome. Right, END is a special operation that takes an optional parameter and terminates assembly, and we're going to completely fake this with an if: if the symbol's callback equals the end callback, break. And the end callback itself is a no-op; it'll never be called. So, okay, now we've reached all the way to the end. Now we've started again from the beginning, and now we are starting on pass two, when stuff actually needs to happen. And the first thing that happens is that we get a "label already defined" error. Now, this is because this label was actually defined in pass one. But you know what, I am not going to look at that right now, because instead I need to go through and do the rest of the simple instructions, because, as I say, I won't get warnings about these, so I just have to remember to do them. NOP, HLT, EI, DI: we haven't done those. PCHL, haven't done. The resets, haven't done. Done that, done that. Set carry... haven't done any of them. Done those. Decimal adjust... We've done those, we've done XCHG. So it's these ones. And yes, these resets are more condition codes. Right, IN and OUT are actually simple 1b, so let's do them. D3. The command shell doesn't actually have any machine I/O in it, so it never calls IN or OUT. Okay, well, there are a bunch of those. The resets, because those are the hard ones, or rather the annoying ones at least. There's... no, hang on, no, those aren't condition codes. So this is the system call instruction. It jumps to a set of vectors at the bottom of memory, and which vector is determined by this value here. So it's just 11, 00..., C7. This is an ALU dest, because the three-bit reset vector is in exactly the same place as for an ALU instruction. So those three Ns are the same as for all of these, so we can use the same routine, which is nice. Okay: 0, 0, 1, 0, 0, 1, 1,
1 is 27. 07, 0F, 17, 1F, 2F, 3F, 37, E9, E3, A... no, FB, F3, 76. Okay, and these are all simple 1bs. Group them: the Cs go here, CMC symbol, CMA symbol... DAA symbol. There is one E, there is one H; we don't have very many Hs. There is one N; we don't have very many Ns. There's lots of Rs, two Ss, XTHL. The distribution of keywords is interesting. No Fs, Gs, Ks, Qs, Us, Vs, Ws, Ys or Zs. Okay, so is that going to work? Right, that failed to look something up. That means that I have an infinite loop in my chain. There we go; that should be INR. Right, so we get all the way to the bottom and it's now starting on pass two. So let's nuke this and run that again. Let's actually fix TITLE, shall we, because we only want to print it once, and let's also do some progress information. Okay, so let's commit that. All right. Now, I'm not going to finish this tonight. It's 11:30, heaven knows how long we've been working on this, but I do want to get it actually showing bytes. So, all right, the stuff I was talking about with the labels. The second time through, we expect to define all the labels in the same place, so we have these tests to see whether the label is already defined. However, we do want to allow people to set the value to the same thing next time, so we're going to change this comparison so that we only produce the error if the value is different. And the same applies here, so you're actually allowed to call EQU to set the same label as often as you like, provided it's always the same value. Okay, now we've got to the point where we're actually emitting things, and I don't think I need to change this one. All right, so let's just do some really crude debugging. We always want to adjust the program counter, so in fact this goes like this. That assembled. I mean, I've no idea if this is right... but that's definitely not right. Okay, so we set the origin to 3400 and we start doing
stuff. So we've got CCP start and CCP clear; these should not be zeros. CCP start is defined way down here. Yep, those are wrong. We then get the 7F and the zero and then all the 20s from here, and we get the copyright message, which is this, and then we get the DS, which is here, and the value goes like... that's not right. I believe that either this arithmetic operation is incorrect, or my labels may not be being set correctly. Okay, short drink break, and I should be right back. Okay, let's have a go with this. I actually found a can of beer in the fridge. It is zero-alcohol beer; I've been trying to find one of these that's actually drinkable. Hmm, it's kind of bland. Oh well. So, where were we? I am pretty sure that it's not setting symbols correctly. So that first symbol here is CCP start, which is an implicit symbol. Let's sort that out. So, right, we need to call set implicit label even if there is no instruction. How's that work? Whoa, got something: 3783, 377F. Those look like actual values to me, and yes, the range of addresses looks sane. All right, so we now need to start actually emitting something to disk. Now, traditionally the CPM assembler emits Intel hex record files. Let me reassemble the CCP with the real assembler... that's these. And you then use the loader, with the LOAD command, to turn these into a binary. I am not convinced this actually adds any value whatsoever, and I'm wondering about having the assembler just emit binary files directly. The advantage is it's much simpler code at my end, and it's much faster. On an old CPM machine with a floppy disk drive doing two or three kilobytes per second at most, not including seek times, you sweat blood over every byte of I/O, and this is about 250 percent the size of the actual thing. And it's not like these files are relocatable anyway, and they don't have any symbol information. I mean, literally all you can do with them is to throw them at the loader, and the loader assembles them in memory, and most people never
actually use more than one. I'm not even sure the LOAD command supports more than one. Let's have a quick look at the manual. Is the LOAD command actually in the assembler manual? No, it's not. If you're actually going to do anything that involves multiple source files, you're going to use MAC, which does emit relocatable output files, and which I have no intention of rewriting, because it's really complicated and has a complete macro language in it. So I think, instead of emitting hex files, I just want to emit bin files. This means that we can do a one-step translation from an assembler source file to an actual runnable COM file if you use an ORG of 0100. So let's do that. I mean, it's not entirely compatible with the real thing, but you can always turn a bin file back into a hex file if you want; I can easily provide a tool that does that. You just need to remember what address it was at. Okay, so, emitting output. Now, the same problems that applied with reading apply to writing as well, so we're actually going to need a couple of buffers, because we can only write in 128-byte chunks. Once this is done I need to see how much memory I've got left, because if I can increase the buffer size I can get much better throughput. And you also need a buffer for the print file, which I haven't done anything with yet. I'll deal with that later. I probably won't deal with the print stuff in this session, because I think I can actually make this work today, which will make a nice wrap-up for the session, and I can do the rest another time. Actually, yeah, let's just lose those. This is my emit code. Yeah, so our program counter can change during assembly, but only forwards. So let me see... oh yeah, we also need to close our output file. Closing the output file flushes stuff from memory to disk. You don't need to close input files; there are no resources allocated from the system in CPM when you open a file. That's why the FCB is a structure rather than it being like a file
handle. Surprisingly elegant, actually, CPM. Yeah, I assume that returns FF on error. I don't know what error it would be, actually. Yeah, some versions of CPM always return zero. Okay, now, when our output bin is being flushed, we wish to flush any pending data in the buffer, and we also need to remember what address the data in the bin buffer was destined for. Okay: there are no bytes in the output buffer and the base address is zero. It also occurs to me that the base address of the output bin file is going to be the first address of output in the source file, so if you have, like, ORG 3400, then you'll end up with a memory image intended to be loaded at 3400. This matches CPM executables. So we need to know whether we're waiting for the first byte to be emitted, or whether we are relocating the base address because the user's just done an ORG at a different but higher address, in which case we need to emit zeros. So we're actually going to need a "buffer fixed" flag. Okay, so emit 16 is really simple: we're just going to emit little-endian values. And with FF? Don't need to do that; the compiler will do it for me. Like so. Right, emit 8 is where the work actually happens. So if we are not fixed, then the buffer starts here. If the program counter is out of range of the buffer, then we're going to need to flush the buffer. Hmm, I'm just thinking about whether we need to write zeros, or whether we can try seeking into the output file to write data at a specific location, which would actually be more efficient. And you know what, I'm just going to do this the simple way. So this is the number of zeros that we need to add to the output buffer. If the buffer is full, then flush the buffer and adjust the base address. Right, that should do that. So when we emit a byte, we keep writing zeros until the program counter matches, and then we emit our byte. So in emit 8 raw we actually write
the value to the buffer and flush the buffer if needed. So we now need to actually flush the buffer, which we do with DMA, I assume. Write sequential can fail; the value is returned in A. Okay, zero is okay. The CPM error codes are kind of a bit random: sometimes FF is error, sometimes zero is success. Normally when FF is an error, then success can be 0, 1, 2 or 3, and those values have specific meanings. So, error on the output file. And this will write a single 128-byte sector to the bin file. Okay, let's try that. I call write sequential... oh, I didn't put a system call in for that. No, I added the system call, I just didn't add the header for it. CPM write sequential, yeah. And "excess elements in array"... I know that one. Oh, huh. Why are we getting "already defined" label errors now? We didn't before. It doesn't like line 45, at maxlen. I know why: it's because emit 16 calls emit 8 twice, and emit 8 advances the program counter. Okay, well, that did something. We have a 2K bin file. That looks promising. It's got stuff in it; it looks kind of right. Well, we have LOAD there, so I actually need to now assemble the CCP with the old assembler and then load it. What did that load to? I think that loaded it to ccp.com, which is not right. Okay, it did load it to ccp.com. So, yeah, the actual data starts at 330, and yeah, I don't think that's right. So there's probably a way to tell LOAD the correct address, but let's just do it like this. Okay, that's better. This is what it should look like; this is what my assembler makes it look like. So, does Vim have a hex mode? Okay, let's do it like this, then. Right, so what it should be is on the right, and what I get is on the left. And so, okay, yeah, I can see that I seem to be inserting stuff. Let's find the byte. 2... wow, yes. Oh, of course: this is because this is an address referring to further down the file, so any changes caused by inserting things are of course going to change the addresses. So doing a simple
compare's not going to work here. However, I can see something wrong here. 5F 0E 02 C3 versus 5F 0E 02 00 C3: I think I have an emit 16 where I should have an emit 8, and that's what's causing that. 0E is... where is my... I actually now want the other 8080 instruction set page, this one. 0E is MVI: MVI C, d. Okay, so that's still not right, but it's better. I can tell it's not right because the bottom's wrong. However, this doesn't look too bad, actually. add.com, blah, blah, blah, this all looks okay. The next line is where things start going wrong. C3 82 37: C3 is a jump, and C3 82 37 has the right address, followed by some stuff. So I bet these are the jumps. And here we have a... yeah, okay. All right, I know what's going on here. I am emitting zeros in a DS, and I bet that this assembler is emitting junk. Yeah, so I am emitting zeros up to here, and then we get data, 24 24 24. And so this is doing the same thing here, 20 20 20 20. No, wait, this is the Digital Research assembler. So we get up to the 42 here and then there's nothing but zeros, but I have stuff. Oh, interesting. Right, because that zero, which is, I think, "sub" followed by three zeros, so sub, 0, 0, 0, 0; that zero corresponds to this zero here, and everything afterwards is DS. So in fact what's happening is my assembler is getting up to this point and emitting the byte, and the DSs are not actually emitting anything; they're just adjusting the program counter. So what we're doing is we're writing what's left in the output buffer from the last sector, and in fact that matches everything here. Right, that's an easy fix. Several easy fixes. Now, I could change DS to write zeros, but is that what I want to do? This will only apply to the last record, because everywhere else, if you do a DS followed by data, my emit routine will insert zeros into the output stream. Yeah, this stuff. So it's only the last record that is causing any issue. I know what I'm going to do: rather than just call
flush, if there's anything in the output buffer, I am going to write zeros until the output buffer gets flushed. That wasn't quite what I wanted. So let's take a look at this in the hex editor, and we get some nice zeros at the bottom. And we take a look at what Digital Research's assembler did, and it looks much the same, but I'm not going to take my word for it. So... and they're different. Byte 213, D5. 0, 1, 2, 3, 4, 5... 0, 1, 2, 3, 4, 5. That looks the same to me. Oh, that one, and it's different. Intriguing. Why is that different? Is that an address? Is that just a dodgy opcode? What is... 10 is a NOP and 11 is an LXI, and 3B CD looks like an address. So I reckon I've got the wrong opcode for LXI. Yeah, I have. No, wait, hang on. Yeah, okay, the opcode value is not in the value field, so it's actually just a constant, and that constant is 01. Right, now let's try it and compare. 318, line 2. Is that the same address? No, it's not. 13E... oh, I keep forgetting I can actually type decimal values in there. 318, here we go. So, CB versus CA. CA is... assuming this is the same problem... oh, it's one of them. CA is CMP D? No, it's not, it's JZ. So the Js end in either A or 2. Yeah: A, 2, A, 2, A. 352, I guess that's a bit further on. 2, yeah, C2, 33. So C2 is... C, 0, 1, 2... JNZ, really. So what we've got here: we've got C3 here and here, and C2 here. Oh, getting better. 4948. Okay, right, this is where I am emitting zeros but the DR assembler is emitting garbage. So that in fact looks like the last significant difference. That means my assembler works. Whoa. Well, let's do a very quick and crude benchmark. So I'm a little bit faster, which is interesting: faster than the hand-tooled assembler. Hmm, there's still work to do. I haven't done any of the printer output yet. Their assembler is 8K; mine is smaller, but again, I haven't done the printer code, and I don't have any of the hex code in, which is so much simpler not to have. And I haven't tried this on real
hardware at all. SDCC does generate Z80 code, whereas the Digital Research assembler is of course intended to be assembled with itself, which is an 8080 assembler, so it's written in 8080 code. I've got a copy of the source here; it's been really useful for reference. Copyright 1978. And it's all modularized, with bits that jump to other bits, because it's designed to be written in multiple files with an assembler that doesn't support relocation or linking. So what you do is you assemble it into discrete chunks, so you see this module starts at a fixed address, and then you load it all together in memory. This is at 1100. So, yeah, you can see the gaps between modules. And this is the table it's using for the opcodes, and they're all glued together. It's got another table elsewhere with the lengths in them. But my assembler works differently, so here's a set of zero-terminated strings. Okay, that's working. It's got a ton of stuff still to do, but that's actually successful. And this is all going to be BSD 2-clause licensed, along with the rest of the stuff I've been doing on this, so there will finally, after 41 years, be a proper open-source, very, very simple assembler to distribute with a CPM-like operating system. Let me just commit that. Of course, it hasn't been properly tested or anything. All right, and push. Fantastic. So I will sign off, and tomorrow I will go and edit all this footage together and find out how long I've been working on this bloody thing. It's probably a scary number. Well, I hope you enjoyed watching. Please let me know what you think in the comments.
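For anyone following along without the video, the output buffering described in this session (first emitted byte fixes the base address, gaps from ORG or DS get padded with zeros, the last 128-byte record gets padded on close) can be sketched roughly like this. This is a minimal sketch, not the code written on screen: the names `emit8`, `emit8_raw`, `emit16`, `flush_sector`, `close_output` and the in-memory `image` array are all assumptions made for illustration, and the real program would hand each sector to the CPM write sequential call rather than copying it into an array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 128   /* CPM files are written in 128-byte records */

/* Hypothetical stand-in for the output file: flushed sectors are
 * appended here instead of being written with a BDOS call. */
static uint8_t image[4096];
static uint16_t image_len = 0;

static uint8_t buffer[SECTOR_SIZE];
static uint8_t buffer_fill = 0;   /* bytes currently held in buffer */
static uint16_t output_pc = 0;    /* address the next written byte lands at */
static uint8_t base_fixed = 0;    /* set once the first byte has been emitted */

/* Append the full sector to the image and start a fresh one. */
static void flush_sector(void) {
    assert(image_len + SECTOR_SIZE <= sizeof image);
    memcpy(image + image_len, buffer, SECTOR_SIZE);
    image_len += SECTOR_SIZE;
    buffer_fill = 0;
}

/* Store one byte, flushing whenever the sector fills up. */
static void emit8_raw(uint8_t value) {
    buffer[buffer_fill++] = value;
    output_pc++;
    if (buffer_fill == SECTOR_SIZE)
        flush_sector();
}

/* Emit a byte destined for address pc. The first call fixes the base
 * address; any later forward gap is filled with zeros. */
void emit8(uint16_t pc, uint8_t value) {
    if (!base_fixed) {
        output_pc = pc;
        base_fixed = 1;
    }
    while (output_pc < pc)
        emit8_raw(0);
    emit8_raw(value);
}

/* Little-endian 16-bit value; the second byte goes at pc + 1, so the
 * caller advances its own program counter by two afterwards. */
void emit16(uint16_t pc, uint16_t value) {
    emit8(pc, (uint8_t)(value & 0xff));
    emit8((uint16_t)(pc + 1), (uint8_t)(value >> 8));
}

/* Pad the final partial record with zeros so it gets flushed. */
void close_output(void) {
    while (buffer_fill != 0)
        emit8_raw(0);
}
```

Passing the address into `emit8` explicitly, instead of having it advance a shared program counter, is one way to avoid the double-advance problem that showed up when emit 16 called emit 8 twice; whether the on-screen fix took that route is not something the session spells out.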