Hello. Welcome to EMF, Stage B. Our next speaker is Andrew Jenner, who is going to be talking about adventures in retrocomputing. Over to you. Thank you. So this is a story about some pretty absurd lengths that I went to to get an old game running on new computers. Like many of the best stories, this one starts when my parents brought a PC, a home computer, into our house for the first time. It was an Amstrad PC-1512, a pretty good PC for the time. It had an 8 MHz 8086 CPU, half a megabyte of RAM and 16-colour graphics at a resolution of 640x200. It had a PC speaker that made bleepy music that sounds pretty great to today's ears. It came with a graphical user interface and a mouse, which was the hot new thing at the time. It came with a few bits of software, but we were looking around for new things to run on it, and we somehow got hold of a disk full of pirated games. There were a few fun games on this disk: there was Woolly the Worm, and a Frogger clone called Hopper. But our favourite game to play was this one, Digger by Windmill Software. I didn't know anything about it. It didn't come with any instructions or anything; we just had to figure out how to play it as we went along. This game had these fantastic cartoony graphics. It had music that was a little more sophisticated than the music in the other games: the programmers were clever enough to make the notes in the background music actually change volume over time, so they were nicely shaped with envelopes. Because we didn't know how to play the game, it was actually some time before we figured out that the F1 key fired a fireball and you could kill the enemies that way. It made the game much easier to play once we figured that out. Some years later we finally got rid of that old machine and upgraded to a 486, which came with Super VGA graphics. Unfortunately, Digger did not work on this machine.
The programmers had programmed the machine at such a low level that they were programming the individual registers on the CGA card. VGA is not compatible with CGA to that extent, so if you try to play Digger on a VGA or Super VGA machine it looks something like this. The display is all corrupted and you can't see what's going on. It's completely unplayable, and it's also far too fast, because all the timing in Digger was done just by counting CPU cycles: if your CPU is twice as fast, the game will play twice as fast. A number of years passed, and I had an idea that I wanted to fix this, to make it playable on modern machines. The first thing I did was reverse engineer the graphics out of the executable file and redraw them in VGA resolution, just for fun; I didn't think I would actually do anything with them at the time. In 1998 I finally got around to decompiling the entire game back to C code. I believe this was the first time that somebody had remastered a game in this way. I've heard of a few similar projects since, with other games, but as far as I know I was the first person to do that; if you have any information to the contrary, let me know and I'll retract it. Once I had the game back in C, I was able to add the VGA graphics that I had drawn. I added Sound Blaster sound. I added a few more features: game recording and playback, so that you could show off your high scores to your friends, and a mode where two players can play simultaneously and either co-operate or fight each other, depending on how you want to play. I added the ability to redefine the keys, which is useful for people whose keyboards didn't work very well with the keys that it came with. I also added a way to quit: in the original Digger, to exit the game you had to reboot the entire computer, as with a number of games in those days. There was one part of the game which I never got exactly right, and that was this screen.
When you get a new high score, the original Digger would hammer the CGA's palette registers to change the colours of the letters on this screen in a shimmering fashion. Because the computer that I had originally played Digger on was not the same speed as the computer that Digger was written on, I didn't know exactly how this screen was supposed to look, so I wasn't able to reproduce it properly; I just made a guess at how I thought it should look. It always bothered me that there was this particularly visible thing showing that the remake was inexact in that respect. Of course, the speed of the game overall was also inexact, but that seemed less of a concern to me at the time. The answer to this kind of problem is, of course, emulation. An emulator is a program that will take a modern machine and teach it to behave like an older machine. There are a number of emulators for emulating old PCs. I've actually contributed to a whole bunch of them over the years, adding support for various games and bits of hardware that did weird things. The Multiple Emulator Super System, which is now part of the Multiple Arcade Machine Emulator (MAME), is extremely thorough in the number of machines that it emulates and the accuracy with which it emulates them. There's DOSBox, which may be very familiar to anyone who runs old DOS software; it's a very convenient way to do that because it can access the drives on the host machine. There are a few others that other people have written over the years for one purpose or another, including a couple that I've written myself. Berapa is an emulator of mine that has been a long-running work in progress. It's an emulator that I hope you'll be able to reconfigure dynamically: you can write a configuration file that specifies exactly what hardware you want in your emulated machine, and it'll figure out how to wire it all up for you.
If you wanted to take the sound chip from a Commodore 64 and place it in a BBC Micro, you could do stuff like that. 86sim is a very simple emulator that I wrote in order to test a port of the GCC GNU C compiler targeting the 8086 that I was working on. None of these has accurate CPU cycle timing. The reason for this is that the exact timing of each instruction, in terms of the number of clock cycles it takes, is not actually documented anywhere. There are documents online that give best-case timings for each instruction, but there are a number of reasons why any particular instruction might take longer than the optimal time; there are ways in which the other parts of the machine can steal cycles from the CPU. We'll talk more about that in a minute. In 2011, I finally got myself an original IBM XT. It was built in 1984, I believe. As far as software is concerned, it is pretty much identical to the original IBM PC from 1981, the granddaddy of all of the x86 machines that have taken over the world since. This machine has a 4.77 MHz Intel 8088 CPU, which has an 8-bit external bus but is a 16-bit CPU internally. The machine has 640K of RAM. I got quite lucky: it was a machine I bought on eBay, and it came with a bunch of expansion cards, including a RAM expansion to take the RAM from the 256K that was on the motherboard all the way up to 640K, which, as the saying goes, should be enough for anyone. I have one 5.25 inch 360K floppy drive for it, and some other floppy drives that I keep meaning to fix so I can have the dual-floppy system that was very desirable in the day. I also bought a CGA graphics card for it so that I could play all the CGA games just the way they were meant to be played. CGA graphics is a little less sophisticated than the graphics in the Amstrad PC-1512 that I started with: in the 640x200 resolution you can only have two colours on screen at once, or four colours at half the resolution.
If you plug the machine into an NTSC monitor or an American TV set, you can actually get all 16 colours on screen at once at a resolution of about 160x200. When I got this machine it didn't have a keyboard, it didn't have any working floppy drive or hard drive, and the only graphics card that it came with was not compatible with the one monitor that I had for it. So my first job was to figure out how to load code onto this machine. Back in those days, when you bought a PC from IBM it came with a great deal of technical documentation: the schematics of the entire machine, and also the assembler listing of the BIOS, the ROM chip inside that actually boots the machine. Looking over these assembler listings, I noticed that IBM had left in there a little something that they used in the factory for testing the machines as they came off the assembly line: a little piece of code that looks for a particular byte coming in over the keyboard port. If it sees that byte, instead of the normal byte that says "hey, I'm a keyboard", then it knows that it's not actually a keyboard that's connected to the machine but IBM's internal manufacturing test device. What it does with this test device is load a stream of bytes over the keyboard port, dump them into memory, and then, when that stream is finished, go and run them. So it's a really good way of getting code onto the machine really quickly. I started off just by plugging an Arduino into the keyboard port to get code onto it. I have since built this little circuit, which is basically the same thing; it's basically an Arduino. It's an ATmega328 microcontroller running at 16 MHz. The irony that this is actually quite a bit more powerful than the PC that it's plugged into is not lost on me.
As well as the Arduino, this has got a pass-through for the actual keyboard, so it plugs into the keyboard port and the actual keyboard plugs into this, and it's also got a serial port for transferring programs over from a modern machine onto the XT. There's one more connection here, this little red wire that you can see going off into the corner: that's actually spliced into the XT's power-good line from the power supply. When the microcontroller pulls this line low, it resets the entire machine, a complete hard reset, and then in a second or so it's back at the part of the BIOS where it's looking for that byte from the keyboard port. Some of you may remember that PCs in those days would go through a very long memory test when they booted up, counting up through each kilobyte of RAM in the machine; if you have all 640K of RAM in an original IBM XT, it actually takes a couple of minutes to boot. But the manufacturing test routine happens before that memory test, so you can actually get the machine running a new program in about a second this way, which makes it much quicker to iterate when you're developing software for it. I've taken this device and connected it to a modern PC in my office, and that modern PC is running an Apache web server and some CGI scripts that I wrote myself, which means that anyone, anywhere in the world, can load code onto this XT by using this web interface, reenigne.org/xtserver. The screenshot that you can see here is just a web browser from which I've sent, as a POST request, a 360K floppy disk image containing a DOS and an AUTOEXEC.BAT that just prints itself, so you can see how that works. The XT server does not yet support taking keystrokes from the web browser and sending those over to the XT; that's something I hope to add at some point. But it's useful for non-interactive things.
You can write a program, stick it on a disk image, send it to the XT server, and get the results back, so this is really useful for emulator authors who want to run experiments on the real hardware to see what the timings are for various things. None of them took me up on it though, so I was left to figure out these cycle timings myself. So this is the target that we are trying to make a cycle-exact emulation of: the Intel 8088, which was the hot new thing in 1979. It uses a 3,000-nanometre process; compare that to, I don't know what they're down to today, 15 nanometres, 10 nanometres. It has 29,000 transistors, compared to the billions in today's CPUs. It has eight general-purpose registers, each of which is 16 bits. It has a 20-bit memory address space, so it can address a whole megabyte of RAM, although normally you would only address 640K of RAM and the other 384K is for ROM and peripherals and things like that. The CPU is actually microcoded internally, so while it's running your program, it's also running its own little program in its own special-purpose instruction set. The blue rectangle you can see in the corner of this die photograph is actually the main ROM, which holds the microcode: 504 instructions, each of which is 21 bits. My purpose in building this emulator wasn't to run this original microcode program, just to get something with the same cycle timings, so that it would be indistinguishable to software that is running on the actual PC. I haven't actually got a dump of the microcode instructions themselves yet. This die photograph is high enough resolution to be able to see the individual transistors, but it's only the top layer, and I don't want to mess about with fuming nitric acid or whatever you need in my house to take die photographs of all of the layers and reverse engineer the chip at the gate level, as some people have done with chips like the 6502.
So I decided to approach it by trying to reverse engineer the chip from the outside: run code on it, do timings, and figure out how it works that way, to sufficient fidelity. Here's a little slide about the architecture of the Intel 8088. The top half here is the bit that communicates with the bus, and the bottom half is the actual execution unit, which runs that microcode program. The execution unit is adding your numbers together or multiplying, whatever you've asked the computer to do; the top part, the bus interface unit, gets the program and data in and out of the CPU, to the memory and other devices on the machine. The fact that these two parts run asynchronously, and either one can be waiting for the other at any point in time, is why the timing of this chip is so complicated and why it hasn't been done before now. Not only do we have to know how long each instruction takes overall, but also where in the execution of that instruction it asks the bus interface unit to get or put a value on the bus, or gets a byte from the prefetch queue. There's a four-byte prefetch queue in the 8088, which queues up bytes of the program, the instructions that the execution unit will be running next, so that the execution unit won't have to wait for the bus interface unit for too long. That does speed things up quite a bit over similar architectures, but it does make the timings a lot more complicated. On this diagram that shows all the pins of the chip, you probably can't read it, the text is very small and I'm sorry for that, but there are two pins, QS0 and QS1, which show the status of the prefetch queue. For each cycle that the CPU is executing, they show whether the queue is being emptied, whether the first byte of an instruction is being taken from it, whether a subsequent byte of an instruction is being taken from it, or whether it's just idle. The reason these pins exist is for the 8088 or 8086 to be able to interface with the 8087 floating-point co-processor.
The floating-point co-processor actually runs alongside the CPU and monitors the instruction stream, and if it sees a floating-point instruction, it hops in and does its thing and then sends the result back to the CPU over the bus. I wanted to be able to read these queue status pins, along with everything else that's going on in the machine, so I could see what the queue is doing at each point in the execution of these instructions. I ended up building this little ISA card. Again, it's based around the ATmega328; I like that chip. This time, rather than running at 16 MHz like an Arduino does, the clock is taken from the clock on the ISA bus, so it actually runs at 14.318 MHz, four times the NTSC colour carrier frequency. As well as the microcontroller, it also has a serial port to get the results out to a modern machine. The only other thing on the board is these multiplexers. There are only so many I/O pins on the ATmega328, and I wanted to be able to sample a lot of pins: not just the 40 pins of the CPU, but also all the pins of the ISA bus, and a few other things that I've since added. If you look inside my XT now, there are wires going from this board to all over the motherboard so that I can sample various other lines. The way that this bus sniffer works is that we run the same program multiple times, taking care to ensure that each time we run it, all the timing is exactly the same: the machine is put into a known state and the program is run, once for each set of pins that we want to sample. It works really well. This is a dump of the output of the ISA bus sniffer. The columns on the left show the actual raw data that's coming over the serial port from the bus sniffer card; on the right, there's some interpretation that's been done on it.
It shows exactly which instruction is starting on which cycle, and it shows the bus accesses, the reads or writes to the bus that are occurring at each cycle. Also, when the machine gets interrupted for a DMA transfer, which happens about 66,000 times a second for dynamic RAM refresh, that's all shown in the diagram as well. A little side quest: in 2015, I worked with some friends on this demoscene demo, 8088 MPH, which we presented at the Revision demo party in Germany. It blew everyone away with what we were able to coax these old 8088 CGA machines into doing. We wrote a bunch of effects that only work on IBM PCs and XTs because, like Digger, they require the machine to be cycle exact. We also managed to coax the CGA card into making about 1,000 colours, rather than the 16 it's normally capable of, by using and abusing the NTSC composite colour system. We also got it to play four-channel music on the PC speaker, which again is all cycle counted, so if you try to run that music routine on a modern machine, it will sound very high-pitched and fast, like a record played back at the wrong speed. The nice thing is that now we have an incentive for emulator authors to try to get their emulators cycle exact so that they can run this demo the way it's supposed to be. Of course, we didn't have a cycle-exact emulator when we were writing the demo, so we did all our development on real hardware. I'm still working on a cycle-exact emulator; I call it XTCE, for XT Cycle Exact. What I've got at the moment is a program that generates a very large number of test cases. It tries to test the timing of each of the 256 possible opcodes that the CPU can execute. Some of these are not even valid opcodes, so I'm even testing the timing of illegal instructions. Where the operands to the instruction can make a difference to the timing, I'm testing all possible combinations of operands as well.
Also, the state of the prefetch queue and the bus can make a difference, so I'm testing each possible combination of those too. There's a very large number of possible situations that the machine can be in. Once we've generated all these test cases, we batch them into chunks of 64K at a time and send them over that serial link to the microcontroller to run them on the XT. If any of those test cases turn out to have different timing on the real hardware than on the emulator, then it will run the ISA bus sniffer to get a trace of exactly what happened on which cycle, and then I can compare that with the equivalent instruction trace from XTCE and see where the emulator is diverging from the real hardware. By iterating through that process a lot of times, about 550 of the automatically generated test cases have at some point found a bug or timing problem in XTCE. It's now got to the point where millions of these tests are passing and it's able to go for days at a time without hitting any failing tests. In fact, the limiting factor right now is not the time it takes to run the tests or bugs in the emulator: I'm now running out of memory on the modern machine to actually hold all the tests. I think what I'm going to have to do is maintain two versions of the emulator in the same program. One will be the one that's actually under test, and one will be the last known-good version of the emulator, with the timings correct as far as they were known so far. Then I can compare those two, and when we run out of tests that we have known-good results for, then we can run them on the real hardware.
As well as the timings of the bus and the way that the bus interface unit and the execution unit interact, another problem that I had is that the multiplication and division routines are actually fairly complicated little bits of microcode, and these instructions take different numbers of cycles depending on what numbers you're multiplying together or dividing. Here's a fun little picture that I made that shows how the timings change depending on what numbers you put into these instructions. On the left are eight-bit multiplies: you've got one of the numbers you're multiplying on the x-axis, the other number on the y-axis, and the colour at each point corresponds to the number of cycles that multiplication takes; similarly for division over here. The multiplication one actually wasn't too difficult to figure out. It turns out to be just one cycle for each one bit in one of the operands, plus a few other things: some cycles if the result overflows, and some cycles if you're doing signed multiplication, depending on which of these four quadrants you're in. The division one, though, took me ages. I thought I knew how to implement a division algorithm the same way that a CPU would do it, just in terms of compares, additions and subtractions. But no matter where I put the delays in my own division routine, I could not get the timings to line up. Fortunately I found a patent that Intel had filed about the implementation of the 8088 and 8086, and this is an extract from that patent that actually shows the division algorithm as it is implemented in the microcode. The patent is terribly written; a whole load of terms in it are never explained anywhere.
There are mistakes all over the place, but after a lot of head scratching, looking at this patent and trying to figure it out, I actually worked out how this division routine works, re-implemented it myself in C++, and got the timings to work out. Although I'm not a fan of patents in general, this one actually turned out pretty handy for me. In fact, the algorithm as implemented in the microcode is a bit cleverer than the one that I implemented myself: it manages to do the division with fewer temporary registers, which is obviously very important if you're running on a CPU with a very small number of transistors and internal registers. I'm still working on XTCE; there are still a few more bits to do before I can call it finished. There are a number of situations involving invalid instructions, such as using multiple prefixes at once, where the behaviour isn't defined. There are also situations to do with hardware interrupts: if a hardware interrupt comes in, if a device needs an interrupt serviced, then whereabouts in the execution of the instruction does that interrupt actually occur? I want to use these 550 or so tests that have ever failed to make a torture test for other emulators, so that they can see how accurate they are. The rest of it is just implementing the rest of the machine: the CGA card, the speaker, keyboard, mouse. A host interface like DOSBox has, for reading files off your hard disk, would be very useful indeed. And fixing up the timings of the other peripherals: the timer, the interrupt controller, the DMA controller and so on. That's the end of my talk. There are a few links here to some of the projects that I've talked about. All the code for all of this is on GitHub, though I'm afraid it's all a big mess in one repository, so you'll probably have to email me if you want to find something specific in there. That's it. Sure. Any quick questions? OK, on here.
So, except for your demoscene code and the Digger game, how sensitive are most games to the timing of the XT? Most games are not that sensitive at all. I mean, there were games that were written and only tested on that particular CPU, and that ran at a speed governed just by the speed of the CPU. There were a number of games like that, but people played them on faster systems; they were just more difficult. So it's kind of an academic exercise to actually make this cycle-exact emulator; it's not very important for running real-world software. I'm hoping that by implementing this emulator I will spur the development of more software running on these old machines that is cycle exact, that does require that cycle-exact timing, and can therefore push the machine much closer to its theoretical limits than it could if it were relying on other parts of the hardware, the timer interrupt and so on, to do its timing. Thank you, Andrew Jenner.