There will be the ultimate Acorn Archimedes talk, in which we'll be told everything about the Archimedes computer. There's a promise in advance that there will be no Eureka jokes in there. Give a warm welcome to Matt Evans. Okay. A little bit of retro computing, first thing in the morning, sort of. Welcome. My name is Matt Evans. The Acorn Archimedes was my favorite computer when I was a small hacker, and I'm privileged to be able to talk a little bit about it with you today. Let's start with: what is an Acorn Archimedes? So, I'd like an interactive session, I'm afraid. Please indulge me. I'd like a show of hands. Who's heard of the Acorn Archimedes before? Ah, okay. Maybe 50, 60 percent. Who has used one? 10 percent, maybe. Okay. Who's programmed one? Who's coded on an Archimedes? Maybe half of that. Two, three people. Great. Okay. So a small percentage. I don't see these machines as being as famous as, say, the Apple Macintosh or the IBM PC. And certainly outside of Europe, they were not that common. So this is kind of interesting, just how many people here have seen this. So: it was the first ARM-based computer. These are astonishingly 1980s pictures; I think one of them is a drawing, actually. Not just the first ARM-based machine, but the machine that the ARM was originally designed to drive. Is that a comment for me? Mike? Okay. I'm being heckled already. It's only slide two. We'll see how this goes. So it's a two-box computer. It looks a bit like a Mega ST to me. It's a main unit with the processor and disks and expansion cards and so on. Now this is an A3000. This is mine, in fact, and I didn't bother to clean it before taking the photo. And now it's on this huge screen. That was a really bad idea. You can see all the disgusting muck in the keyboard. It has a bit of ink on it. I don't know why. But this machine is 30 years old, and this was luckily my machine, as I said, when I was a small hacker. And this is why I'm doing the talk today. This had a big influence on me.
I'd like to say as a person, but more as an engineer, in terms of my programming experience when I was learning to program and so on. So I live and work in Cambridge, in the UK, where this machine was designed. And through a funny sort of turn of events, I ended up there, and actually work in the building next to the building where this was designed. And a bunch of the people that were on that original team that designed this system are still around and relatively contactable. And I thought this was a good opportunity to get on the phone and call them up, or go for a beer with a couple of them, and ask them: well, why are things the way they are? There's all sorts of weird quirks to this machine. I'd been wondering about them for 20 years. Can you please tell me why you did it this way? And they're a really good bunch of people. So I talked to Steve Furber, who led the hardware design. Sophie Wilson, who did the same for software. Tudor Brown, who did the video system. Mike Muller, the I/O system. John Biggs and Jamie Urquhart, who did the silicon design. I've spoiled one of the surprises here: there's been some silicon design that's gone on in building this machine at Acorn. And they were all wonderful people that gave me their time and told me a bunch of anecdotes that I will pass on to you. So I'm going to talk about the classic Arc. There's a bunch of different machines that Acorn built into the 1990s, but the ones I'm talking about started in 1987. There are two models, effectively a low end and a high end, one with an option for a hard disk, 20 megabytes, 2,300 pounds. Up to four megabytes of RAM. They all share the same basic architecture. They're all basically the same. So the A3000 that I just showed you came out in 1989. That was the machine that I had. That was, again, the same; the memory controller was slightly updated, and it was slightly faster. They all had an ARM 2. This was the released version of the ARM processor designed for this machine.
At 8 megahertz. And then finally in 1990, what I call the last of the classic Arc, the Archimedes A540. This was the top-end machine. It could have up to 16 megabytes of memory, which is a fair bit even in 1990. It had a 30 megahertz ARM 3. The ARM 3 was the evolution of the ARM 2, but with a cache, and a lot faster. So this talk will be centered around how these machines work, not the more modern machines. So around 1987, what else was available? This is a random selection of machines. Apologies if your favorite machine is not on this list; it wouldn't fit on the slide otherwise. So at the start of the 80s, we had the exotic things, like the Apple Lisa and the Apple Mac, very expensive machines. The Amiga I had to put in here; it started off relatively expensive. Of course, the Amiga 500 was very good value for money, a very capable machine. But I'm comparing this more to PCs and Macs, because that was the sort of market it was going for. And although it was an expensive machine, compared to a Macintosh it was pretty cheap. I even put a NeXT Cube on there. I'd heard that they were incredibly expensive, and actually, compared to a Macintosh, they're not that expensive at all. Now, I don't know which one I would have preferred. So the first question I asked them, and the first thing they told me: why was it built? I'd used them in school, and as I said, had one at home. But I was never really quite sure what it was for, and I think a lot of the Acorn marketing wasn't quite sure what it was for either. They told me it was the successor to the BBC Micro, this 8-bit machine, a lovely 6502 machine, incredibly popular, especially in the UK. And the goal was to make a machine that was 10 times the performance of this. The successor would be 10 times faster at the same price. And the thing I didn't know is they'd been inspired.
The team at Acorn had seen the Apple Lisa and the Xerox Star, which descends from the famous Xerox Alto at Xerox PARC, the first GUI workstation, in the 70s, a monumental machine. They'd been inspired by these machines, and they wanted to make something very similar. So this is the same story as the Macintosh. They wanted to make a desktop machine for business, for office automation, desktop publishing, and that kind of thing. I'd never really understood this before, so this inspiration came from the Xerox machines. It was supposed to be obviously a lot more affordable and a lot faster. So this is what happens when Acorn marketing gets hold of this vision. So the Xerox Star on the left is this nice, sensible business machine where someone's wearing a nice, crisp suit, banging their microphone. And it gets turned into the very Cambridge tweed version on the right. So apparently it's illegal to program one of these if you're not wearing a top hat, but no one told me that when I was a kid. And yeah, my court case comes up next week. So Cambridge is a bit of a funny place, and for those that have been there, this picture on the right sums it all up. So they began Project A, which was: build this new machine. And they looked at the alternatives. They looked at the processors that were available at that time: the 286, the 68K, the NatSemi 32016, which was an early 32-bit machine, a bit of a weird processor. And they all had something in common: they were ridiculously expensive, and in Tudor's words, a bit crap. They weren't a lot faster than the BBC Micro. They were a lot more expensive. They were much more complicated in terms of the processor itself, but also the system around them was very complicated. They needed lots of weird support chips. This just drove the price of the system up, and it wasn't gonna hit that 10 times performance, let alone at the same price point.
They visited a couple of other companies designing their own custom silicon. They got this idea in about 1983, when they were looking at some of the RISC papers coming out of Berkeley. And they were quite impressed by what a bunch of grad students were doing: they'd managed to get a working RISC processor. And they went to Western Design Center and looked at the 6502 successors being designed there. And they had a positive experience. They saw a bunch of high school kids with Apple IIs doing silicon layout, and they thought, okay, well. They'd never designed a CPU before, at Acorn. Acorn hadn't done any custom silicon to this degree. But they were buoyed by this, and they thought, okay, well, maybe RISC is the secret, and we can do this. And this was not really the done thing in this time frame, and not for a company the size of Acorn, but they designed their computer from scratch. They designed all of the major pieces of silicon in this machine. And it wasn't about designing the ARM chip, hey, we've got a processor, cool, what should we do with it? ARM and the history of that company has kind of benefited from it, but this was all about designing the machine as a whole. They were a tiny team. They were a handful of people, about a dozen-ish, that did the hardware design. A similar sort of order for software and operating systems on top, which is orders of magnitude different from the IBMs and Motorolas and so forth that were designing computers at this time. RISC was the key. It needed to be incredibly simple. One of the other experiences they had was they went to a CISC processor design center. They had teams of a couple of hundred people, and they were on revision H, and it still had bugs, and it was just this unwieldy, complex machine. So RISC was the secret. Steve Furber has an interview somewhere where he jokes about Acorn management giving him two things. The special sauce was two things that no one else had.
He had no people and no money. So it had to be incredibly simple. It had to be built on a shoestring, as Jamie said to me. So there were lots of corners cut, but in the right way. Corners cut sounds ungenerous; there were some very shrewd design decisions, always weighing up cost versus benefit. And I think they erred on the correct side for all of them. So Steve sent me this picture. He's got a cameo here; that's the outline of him in the reflection on the glass there. He's got this up in his office. So he led the hardware design of all of these chips at Acorn. Across the top we've got the original ARMs: the ARM 1, the ARM 2, and the ARM 3. Guess the naming scheme. And the video controller, memory controller, and IO controller. You can sort of see their relative sizes, and it's kind of pretty. This was also on a process where you could really point at that and say, oh, that's the register file, and you can see the cache over there. You can't really do that nowadays with modern processes. Okay, so a bit about the specification, what it could do. The end product: as I mentioned, they all had this ARM 2 at eight megahertz, up to four megs of RAM. 26-bit addresses. Remember that; that's weird. So a lot of 32-bit machines had 32-bit addresses, or the ones that we know today do, and that wasn't the case here. And I'll explain why in a minute. The A540 had its updated CPU. The memory controller had an MMU, which was unusual for machines of the mid-80s. So the hardware would support virtual memory, page faults, and so on. It had decent sound: 8-channel sound, hardware mixed, in stereo. It was 8-bit, but it was logarithmic, so it was a bit like µ-law, if anyone knows that, instead of PCM. So you've got more precision at the low end. And it sounded, to me, a little bit like 12-bit PCM sound, so this is quite good. Storage-wise, it's the same floppy controller as the Atari ST. It's fairly boring.
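The logarithmic sound encoding mentioned above can be illustrated with standard G.711 µ-law decoding; the VIDC's exact companding law differs in detail, so treat this as a sketch of the principle, not the Archimedes' actual table:

```python
# Sketch of standard G.711 mu-law decoding -- 8-bit codes with small steps
# near silence and large steps near full scale, which is why 8 logarithmic
# bits can sound like roughly 12 bits of linear PCM.
def mulaw_decode(code: int) -> int:
    """Decode one 8-bit mu-law code to a linear sample."""
    code = ~code & 0xFF                     # codes are stored inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

# Step size near zero is 8; near full scale it is 1024 -- precision
# concentrated where your ears need it.
quiet_step = mulaw_decode(0xFE) - mulaw_decode(0xFF)   # 8
loud_step = mulaw_decode(0x80) - mulaw_decode(0x81)    # 1024
```

The 256 codes span roughly a 15-bit linear range, but with only 8 bits stored per sample.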
The hard disk controller was a horrible standard called ST506, MFM drives, which were very, very crude compared to the disks we have today. Keyboard and mouse: nothing to write home about. I mean, it was a normal keyboard, nothing particularly special going on there, and a printer port, serial port, and some expansion slots, which I'll outline later on. The thing that I really liked about the Arc was the graphics capabilities. It was fairly capable, especially for a machine of that era and at that price. It just had a flat frame buffer. So it didn't have sprites, which is unfortunate. It didn't have a blitter and bit planes and so forth. But the upshot of that is that it was dead simple to program. It had a 256-color mode, 8 bits per pixel, so one byte per pixel. And it's all just laid out as a linear stream of bytes. So it was dead easy to write some really nice, optimized code to blit stuff to the screen. Part of the reason why there isn't a blitter is actually that the CPU was so good at doing this. Color-wise, it's got paletted modes out of a 4096-color palette, same as the Amiga. It has this 256-color mode, which is different. The big high-end machines, the A540 and the A400 series, could also do this very high res, 1152 by 900, which was more of a workstation resolution. If you bought a Sun workstation, a Sun 3 in those days could do this and some higher resolutions. But this was really not seen on computers that you might have in the office or school or education, at that end of the market. And it was quite clever the way they did that. I'll come back to that in a sec. But for me, the thing about the Arc: for the money, it was the fastest machine around. It was definitely faster than 386s and all the stuff that Motorola was doing at the time, by quite a long way. It was almost eight times faster than a 68K at about the same clock speed. And that's to do with its pipelining, and to do with it having a 32-bit word, and a couple of other tricks.
I'll show you later on what the secret to that performance was. It was about minicomputer speed. And compared to some of the other RISC machines at the time: it wasn't the first RISC in the world, but it was the first cheap RISC, and the first RISC machine that people could feasibly buy and have on their desks at work or in education. And if you compare it to something like the MIPS or the SPARC, it was not as fast as a MIPS or SPARC chip. But it was also a lot smaller and a lot cheaper. Both of those other processors had very big dies. They needed other support chips. They had huge packages, lots of pins, lots of cooling requirements. So all this really added up. So I priced up a Sun-4 workstation at the time, and it was well over four times the price of one of these machines. And that was before you add on extras such as disks and network interfaces and things like that. So it was very good, very competitive for the money. And if you think about building a cluster, then you could get a lot more throughput; you could network them together. So this is about as far as I got when I was a youngster. I wasn't brave enough to really take the machine apart and poke around. Fortunately, now it's 30 years old and I'm qualified to do this. I'm gonna take it apart. Here's the motherboard. It's quite a nice, clean design. This was built in Wales, for anyone that's been to the UK; very unusual these days for anything to be built in the UK. It's got several main sections around these four chips. So remember the Steve photo earlier on: this is the chipset. The ARM, MEMC, VIDC, IOC. So the IO side of things happens over on the left, video and sound on the top right, and the memory and the processor in the middle. It's got a megabyte on board, and you can plug in an expansion for four megabytes. So, the memory map from the software view. I mentioned this 26-bit addressing, and I think this is one of the key characteristics of one of these machines.
So you have a 64-megabyte address space, and it's quite packed. There's quite a lot of stuff shoehorned into here. So there's the memory. The bottom half of the address space, 32 megabytes of that: the processor's got user space and privileged modes. It's got a concept of privilege within the processor execution. So when you're in user mode, you only get to see the bottom half, and that's virtually mapped; the MMU will map pages into that space. And then when you're in supervisor mode, you get to see the whole of the rest of the memory, including the physical memory and various registers up the top. The thing to notice here is there's stuff hidden behind the ROM. This address space is very packed together. So there's a requirement for control registers for the memory controller, for the video controller and so on. And they're write-only registers in ROM, basically. So you write to the ROM and you hit these registers. Kind of weird when you first see it, but it's quite a clever way to fit this stuff into the address space. So, we'll start with the ARM 1. The instruction set dates from late 1983; Steve took the instruction set and designed the top level, the block-level micro-architecture of this processor. So this is the data path and how all the control logic works. And then the VLSI team implemented this, did their own custom cells. There's a custom data path and custom logic throughout this. And it took them about a year all in. Well, this Project A really kicked off early 1984, and this taped out first thing early 1985. The design process: Jamie Urquhart and John Biggs gave me a bit of an insight into how they worked on the VLSI side of things. So they had an Apollo workstation, just one Apollo workstation, a DN600. This was a 68K-based washing machine, as Jamie described it. It was this huge thing. It cost about 50,000 pounds. It was incredibly expensive.
And they designed all of this with just one of these workstations. Jamie got in at five AM, worked until the afternoon, and then let someone else on the machine. So they shared the workstation; they worked shifts so that they could design this whole thing on one workstation. So this comes back to it being designed on a bit of a shoestring budget. When they got a couple of other workstations later on in the project, there was an allegation that the CAD software might not have been licensed initially on the other workstations; no one would confirm or deny whether that's true. So Steve wrote a BBC BASIC simulator for this when he was designing this block-level micro-architecture, run on his BBC Micro. So this could then run real software. There could be a certain amount of software development, but then they could also validate that the design was correct. There's no cache on this. This is quite a large chip; 50 square millimeters was the economic limit for those days, for this part of the market. There's no cache; that also would have been far too complicated. So this was also, I think, quite a big risk, no pun intended: the aim of doing this with such a small team, all very clever people, but without experience in building chips before. And I think they knew what they were up against. And so not having a cache, or complicated things like that, was the right choice to make. I'll show you later that that didn't actually affect things. So this was a RISC machine. If anyone has not programmed ARM in this room, then get out at once. But if you have programmed ARM, this is quite familiar, with some differences. It's a classical three-operand RISC. It's got a free shift on one of the operands for most of the instructions, so you can do things like multiplies by constants quite easily. It's not purest RISC, though. It does have load store multiple instructions.
So these will, as the name implies, load or store multiple registers in one go. It's one register per cycle, but it's all done through one instruction. This is very not RISC. Again, there's a good reason for doing that. So the ARM 1 comes back and it gets plugged into a board that looks a bit like this. This is called the A2P, the ARM second processor. It plugs into a BBC Micro. There's a thing called the Tube, which is sort of a FIFO-like arrangement: the BBC Micro can send messages one way and this can send messages back. And the BBC Micro has the disks; it has the IO, the keyboard and so on. And that's used as the host to then download code into one megabyte of RAM up here. And then you can run the code on the ARM. So this was the initial system, six megahertz. The thing I found quite interesting about this: I mentioned that Steve had built this BBC BASIC simulation. One of the early bits of software that could run on this: Sophie had ported BBC BASIC to ARM and written an ARM version of this. So the BASIC interpreter was very fast, very lean, and it was running on this board early on. They then built a simulator called ASIM, which was an event-based simulator for doing logic design. And all of the other chips in the chipset were simulated using ASIM on the ARM 1, which is quite nice. So this was the fastest machine that they had around. They didn't have the thousands of machines in a cluster like you'd have in a modern company doing EDA. They had a very small number of machines, and these were the fastest ones they had about. So the ARM 2 was simulated on the ARM 1, and all the rest of the chipset too. So then the ARM 2 comes along. That's a year later. This is a shrink of the design. It's based on the same basic micro-architecture, but it has a multiplier now. It's a Booth multiplier, so it's a worst-case 16-cycle multiply. It does two bits per clock. Again, no cache, but one thing they did add in the ARM 2 was banked registers.
So some of the processor modes, which I'll mention in a bit, the interrupt modes, next slide. Some of the processor modes will basically give you a different view of the registers, which is very useful. These were all validated at eight megahertz. So the product was designed for eight megahertz. The company that built them said, okay, put the stamp on the outside saying eight megahertz. There's two versions of this chip, and I think they're actually the same silicon. I've got a suspicion that they're the same; they just tested this batch and said, you know, that works at 10 or 12. So on my project list is overclocking my A3000 to see how fast it will go. See if I can get it to 12 megahertz. Okay, so the banking of the registers. ARM's got, and even modern 32-bit ARMs have got, two types of interrupt: an IRQ, pronounced "urk" in English, and FIQ, pronounced "fick" in English. I appreciate it doesn't mean quite the same thing in German. So I'll call it FIQ from here on in. And FIQ mode has this property where the top half of the registers are effectively different registers when you get into this mode. So this lets you, first of all, not have to back up those registers in your FIQ handler. And secondly, if you can write an FIQ handler using just those registers, and there's enough for doing most basic tasks, you don't have to save and restore anything when you get an interrupt. So this is designed specifically to be a very, very low-overhead interrupt mode. So I'm coming to why there's a 26-bit address space, and I found this link very, very unintuitive. Unlike 32-bit ARM, the more modern 1990s-onwards ARMs, the program counter, register 15, doesn't just contain the program counter, but also contains the status flags and processor mode. Effectively, all of the machine state is packed in there as well. So I asked the question: well, why 64 megabytes of address space? What's special about 64? And Mike told me, well, you're asking the wrong question; it's the other way around.
What we wanted was this property that all of the machine state is in one register. So this means you just have to save one register. Well, what's the harm in saving two registers? And he reminded me of this FIQ mode. Well, if you're already in a state where you've really optimized your interrupt handler so that you don't need any other registers to deal with, you're not saving and restoring anything apart from your PC, then saving another register is 50% overhead on that operation. So the prime motivator was to keep all of the state in one word. And then once you take all of the flags away, you're left with 24 bits for a word-aligned program counter, which leads to 26-bit addressing. And that was then seen as: well, 64 megs is enough. There were machines in 1985 that could conceivably have more memory than that. But for a desktop, that was still seen as a very large, very expensive amount of memory. The other thing: you don't need to invent another instruction to do a return from exception. So you can return using one of your existing instructions. In this case, it's a subtract into PC, which looks a bit strange, but trust me, that does the right thing. So, the memory controller. I mentioned the address translation, so this has an MMU in it. I was worried that these slides actually might not be the right resolution and might be too small for people to see, and in fact, a screen the size of a house is really useful here. So the left-hand side of this chip is the MMU. This chip's the same size as the ARM 2, yeah, pretty much. So that's part of the reason why the MMU is on another chip. The ARM 2 was as big as they could make it to fit the price. I don't know if anyone here has done silicon design, but as the area goes up, effectively your yield goes down, and the effect on price is non-linear.
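Going back to the R15 packing for a moment, it can be sketched in a few lines, assuming the documented ARM2 layout (NZCV flags in bits 31-28, interrupt disables in bits 27-26, the word-aligned PC in bits 25-2, and the processor mode in bits 1-0):

```python
# Sketch of how ARM2 packs the whole machine state into R15.
def pack_r15(pc: int, nzcv: int, i: int, f: int, mode: int) -> int:
    # PC must be word-aligned and fit in 26 bits, so only bits 25-2 are used
    assert pc % 4 == 0 and pc < (1 << 26)
    return (nzcv << 28) | (i << 27) | (f << 26) | pc | mode

def pc_of(r15: int) -> int:
    return r15 & 0x03FF_FFFC      # mask off flags and mode, keep the PC

# 24 PC bits plus 2 implied-zero alignment bits gives a 64 MB address space.
address_space = 1 << 26           # 67108864 bytes = 64 MB
```

Saving or restoring the whole machine state is then a single register move, which is exactly the FIQ-overhead argument made above.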
So the MMU had to be on a separate chip, and it's half the size of that one as well. MEMC does mundane things like driving DRAM: it does refresh for the DRAM, and it converts linear addresses into the row and column addresses which DRAM takes. So the key thing about this ARM-and-MEMC pairing, the key factor in performance, is making use of memory bandwidth. When the team had looked at all the other processors in Project A, before designing their own, one of the things they looked at was how well they utilized DRAM. The 68K and the NatSemi chips made very, very poor use of DRAM bandwidth. Steve said, well, okay, the DRAM is the most expensive component of any of these machines, and they're making poor use of it. And I think a key insight here is: if you maximize that use of the DRAM, then you're going to be able to get much higher performance than those machines. And so it's 32 bits wide. The ARM is pipelined, so it can do a 32-bit word every cycle. And it also indicates whether it's doing sequential or non-sequential addressing. This then lets the MEMC decide whether to do an N-cycle or an S-cycle. So there's a fast one and a slow one, basically. When you access a new random address in DRAM, you have to open that row, and that takes twice the time; it's a four-megahertz cycle. But once you've accessed that address, and you're accessing linearly ahead of that address, you can do fast page mode accesses, which are eight-megahertz cycles. So ultimately that's the reason why these load store multiples exist, the non-RISC instructions. They're there so that you can stream registers out or back in, and make use of this DRAM bandwidth. So for a store multiple, this is just a simple calculation: for 14 registers, you're hitting about 25 megabytes a second out of 30. It's not 100%, but it's way more than a tenth or an eighth, which is what a lot of the other processors were achieving. So this was really good.
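That store-multiple figure can be checked with a quick back-of-envelope. The cycle counts here are an assumption (STM as two N-cycles plus one S-cycle per remaining register, which matches the ARM2 datasheet timing as I understand it), but they reproduce the roughly 25-out-of-30 number from the talk:

```python
# Back-of-envelope for the store-multiple bandwidth claim. Assumption:
# an n-register STM costs 2 N-cycles plus (n - 1) S-cycles, with N-cycles
# at 4 MHz (250 ns) and S-cycles at 8 MHz (125 ns), 4 bytes per cycle.
N_NS, S_NS = 250, 125

def stm_mb_per_s(n_regs):
    total_ns = 2 * N_NS + (n_regs - 1) * S_NS
    return n_regs * 4 * 1000 / total_ns   # bytes/ns scaled to MB/s

bandwidth = stm_mb_per_s(14)              # about 26 MB/s for 14 registers
peak = 4 * 1000 / S_NS                    # 32 MB/s if every cycle were an S-cycle
```

A single-word store, by contrast, pays the N-cycle cost every time, which is why streaming registers in bulk is the trick.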
This is the prime factor in why this machine was so fast: effectively, the load store multiple instructions and being able to access this stuff linearly. So, the MMU is weird. It's not a TLB in the traditional sense. TLBs today, if you take a MIPS chip or something where the TLB is visible to software, will map a virtual address into a chosen physical address; you'll have some number of entries, and you more or less arbitrarily poke an entry in with a certain mapping in it. The MEMC does it upside down. It's got a fixed number of entries, one for every page of DRAM. And for each of those entries, it checks an incoming address to see whether it matches. So all of those entries that we saw on the chip diagram a couple of slides ago, that big array on the left-hand side: all of those are effectively just storing a virtual address and then matching it. They have a comparator, and then one of them lights up and says, yes, it's mine. So effectively a physical page says "that virtual address is mine", instead of the other way around. So this also limits your memory. You have to have one of these entries on chip per page of physical memory, and you don't want pages to be enormous. If you do the math, four megabytes over 128 pages is a 32K page. If you don't want the page to get much bigger than that, and trust me, you don't, then you need to add more of these entries, and it's already half the size of the chip. So effectively this is one of the limits of why you can only have four megabytes on one of these memory controller chips. Okay, so VIDC is the core of the video and sound system. It's a set of FIFOs and a set of on-chip digital-to-analog converters for doing video and sound. You stream stuff into the FIFOs, and it does the display timing and, you know, palette lookup and so forth. It has that 8-bit mode I mentioned, which is slightly strange. It also has an output for a transparency bit.
So in your palette you can set 12 bits of color, but you can set a bit of transparency as well. So you can do video genlocking quite easily with this. So there was a revision later on. Tudor explains that the very first one had a bit of crosstalk between the video and the sound. So you'd get sound with noise on it that was basically video noise, and it's quite hard to get rid of, so they did this revision. And the way he fixed it was quite cool. They shuffled the power supply around and did all the sensible engineering things, but he also filtered out a bit of the noise that was being output on the sound, inverted it, and then fed that back in as the reference current for the DACs. So it was sort of self-compensating and took out the noise, a bit like noise-canceling headphones. That was kind of a nice hack, and that was VIDC 1 and 1A. Okay, the final one. I'm going to stop showing you chip plots after this, unfortunately, so get your fill while we're here. And again, I'm really glad this is enormous, for the people in the room and maybe those zooming in online: there's a cool little Illuminati logo in the bottom left corner. I feared that you weren't going to be able to see it, and I didn't have time to do a zoomed-in version. Okay, so IOC is the center of the IO system. As much of the IO system as possible, all the random bits of glue logic to do things like timing, since some peripherals are slower than others, lives in IOC. It contains a UART for the keyboard. So the keyboard is looked after by an 8051 microcontroller, which is nice and easy; you don't have to do scanning in software. So this microcontroller just sends stuff up a serial port to this chip. So: KART, the keyboard asynchronous receiver and transmitter. It was at one point called the fast asynchronous receiver and transmitter. Mike got forced to change the name. Not everyone has a 12-year-old's sense of humor, but I admire his spirit.
So the other thing it does is interrupts. All the interrupts go into IOC, and it's got masks; it consolidates them, effectively, for sending an interrupt up to the ARM. The ARM can then check the status and do a fast response to it. So, the Eye of Providence there, the little logo I pointed out: Mike said he put that in for future archeologists to wonder about. Okay. That was it. I was hoping there'd be this big backstory about, you know, him being in the Illuminati or something. Maybe he is; you're not allowed to say. Anyway. So just like the other dev board I showed you, this one, the A500 second processor, is still a second processor that plugs into a BBC Micro. It's still got this host with disk drives and so forth attached to it, pushing stuff down the Tube into the memory here. But now, finally, all of the chipset is assembled in one place. So this is starting to look like an Archimedes. It's got video out. It's got a keyboard interface. It's got some expansion stuff. So this is for bring-up and an early software head start. But very shortly afterwards, we got the A500, internal to Acorn. And this is really the first Archimedes. This is the prototype Archimedes. It's got a gorgeous gray brick sort of look to it. Kind of concrete. It weighs like concrete, too. But it has all the hallmarks. It's got the IO interfaces. It's got the expansion slots that you can see at the back. And, you know, it runs the same operating system. Now, this was used for the OS development. There were only a couple of hundred of these made. Well, this is serial number 222, so this was one of the last, I think. But yeah, only internal to Acorn. There were lots of nice tweaks to this machine. So the hardware team had designed this; Tudor designed this, as well as the video system. And his A500 was the special one: he'd handpicked one of the VIDCs, so that instead of running at 24 megahertz, it would run at 56.
So with silicon, you know, there's variation in manufacture, and he found a 56 megahertz part. And so he could do, I think it was 1024 by 768, which was way out of spec for the rest of the Archimedes. So he had the really cool machine. They also ran some of them at 12 megahertz as well, instead of eight. That's a massive performance improvement, but I think it used expensive memory, which was kind of out of reach for the product. Right. So believe me, this is the simplified circuit diagram. The technical reference manuals are available online if anyone wants the complicated one. But the main parts are displayed. So we've got the ARM, MEMC, VIDC and some RAM. And we'll have a little walk through them. So the clocks are generated, actually, by the memory controller. The memory controller gives the clocks to the ARM. And the main reason for this is that the memory controller has to do some slow things now and then. It has to open pages of DRAM. There are refresh cycles and things. So it stops the CPU: it generates the clock, and it pauses the CPU by stopping that clock from time to time. When you do a DRAM access, on the address bus along the top, the ARM outputs an address that goes into the MEMC. And the MEMC then converts that. It does an address translation, and then it converts that into a row and column address suitable for DRAM. And then, if you're doing a read, the DRAM outputs the data onto the data bus, which the ARM then sees. So, you know, MEMC is the critical path on this, but the address flows through MEMC, effectively. Notice that MEMC is not on the data bus. It just gets addresses flowing through it. This will become important later on. DRAM is another slow thing, another reason why MEMC might slow down the access from the CPU. It works a similar sort of way. There's also a permission check done when you're doing the address translation: user permission versus OS, the supervisor. And this information is output as part of the cycle when the ARM does that access.
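As an aside, MEMC's translation works as what amounts to an inverted page table: it keeps one mapping register per physical page, and translation compares the logical page number against all of them at once. A rough sketch of that lookup — the sizes here are for a 4MB machine with 32KB pages, everything else (permission bits, fault details) is simplified away for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

#define NPAGES     128      /* physical pages: 4MB / 32KB */
#define PAGE_SHIFT 15       /* 32KB pages */

/* One entry per PHYSICAL page, recording which logical page it holds.
   On the real chip this is a CAM: all entries compare in parallel;
   the loop below just models that. */
typedef struct {
    uint32_t logical_page;
    bool     valid;
} memc_entry;

static memc_entry cam[NPAGES];

/* Returns the physical address, or -1 to model the abort signal
   (page fault / miss), which the ARM then handles in software. */
static int64_t translate(uint32_t logical_addr)
{
    uint32_t lpage = logical_addr >> PAGE_SHIFT;
    for (int ppage = 0; ppage < NPAGES; ppage++) {
        if (cam[ppage].valid && cam[ppage].logical_page == lpage)
            return ((int64_t)ppage << PAGE_SHIFT) |
                   (logical_addr & ((1u << PAGE_SHIFT) - 1));
    }
    return -1;  /* no hit: MEMC raises abort, CPU takes an exception */
}
```

One consequence of this design is that with one entry per physical page, a 4MB machine can only ever have 128 pages mapped — which is part of why the pages are so big.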
If you miss in that translation, you get a page fault or a permission fault. An abort signal comes back and you take an exception, and the ARM deals with that in software. The data bus is a critical path, and so the IO stuff is buffered; it's kept away from that. So the IO bus is 16 bits. There were not a lot of 32-bit peripherals around in those days; all the peripherals are eight or 16 bits, so that's the right thing to do. The IOC decodes that, and there's a handshake with MEMC. If it needs more time — if it's accessing one of the expansion cards and the expansion card has got something slow on it — then that's dealt with by the IOC. So I mentioned the interrupt status: that all gets funneled into IOC and then back out again. There's a v-sync interrupt, but not an h-sync interrupt. You have to use timers for that, really annoyingly. There's a two megahertz timer available — I think I had that on a previous slide and forgot to mention it. So if you want to do funny palette switching stuff or copper bars or something, that is possible with the timers. It's also a simple hardware mod to make a real h-sync interrupt as well; there are some spare interrupt inputs on the IOC, as an exercise for the reader. So, the bit I really like about this system. I mentioned that the MEMC's not on the data bus. The VIDC's only on the data bus. It doesn't have an address bus, and the VIDC is the thing responsible for turning the frame buffer into video, reading that frame buffer out of RAM and so on. So how does it actually do that RAM read without the address? Well, the MEMC contains all of the registers for doing this DMA. The start of the frame buffer, current position, size and so on all live in the MEMC. So there's a handshake where the VIDC sends a request up to the MEMC when its FIFO gets low. The MEMC then actually generates the address into the DRAM, the DRAM outputs that data, and then the MEMC gives an acknowledge to the IOC... excuse me, too many chips.
The MEMC gives an acknowledge to the VIDC, which then latches that data into the FIFO. So this partitioning is quite neat. The video DMA stuff all lives in MEMC, and there's this kind of split across the two chips. For sound — I've just highlighted one interrupt that comes from MEMC — sound works exactly the same way, except there's a double-buffering scheme that goes on, and when one half of it becomes empty, you get an interrupt so you can refill it and don't glitch your sound. So this all works really very smoothly. So finally, the high-res mono thing that I mentioned before: there's quite a novel way they did that. Tudor had realized that with one external component, a shift register running very fast, he could implement this very high resolution mode without really affecting the rest of the chip. So the VIDC still runs at 24 megahertz, which is sort of VGA resolution. It outputs on a digital bus — it was a test port originally. It outputs four bits, so four pixels in one chunk, at 24 megahertz, and then this external component shifts that out at four times the speed. So there's one component. I mean, this is a very cheap way of doing this, and as I said, this high-res mode is very unusual for machines of this era. I've got a feeling an A500 — the top-end machine — if anyone's got one of these and wants to try this trick, then please get in touch. I've got a feeling an A500 will do 1280 by 1024 by overclocking this. I think all of the parts will survive it. But for some reason, Acorn didn't support that on the board. And finally, clock selection: the VIDC on some of the machines had quite a flexible set of clocks, for different resolutions basically. So MEMC is not on the data bus. How do we program it? It's got registers for DMA and it's got all this address translation. So the memory map I showed before has an eight megabyte space reserved for the address translation registers. It doesn't have eight megabytes of...
I mean, it doesn't have two million 32-bit registers behind there, which is a hint of what's going on here. So what you do is you write any value to this space, and you encode the information that you want to put into one of these registers in the address. So in this address, the top three bits are one — it's in the top eight megabytes of the 64 megabyte address space — and you format your logical/physical page information in this address, and then you write any byte, effectively. This sort of feels really dirty, but it's also a very nice way of doing it, because there's no other space in the address map. And this comes back to the price balance. It's not worth having a data bus going into MEMC, costing 32 more pins, just to write these registers, as opposed to playing this sort of trick. If you have that data bus just for that, then you have to go up to a more expensive package. And this was really, in their minds, a 68-pin chip versus an 84-pin chip. It was a big deal. They really strived to make sure everything was in the very smallest package possible. And this system partitioning effort led to these sorts of tricks to then program it. So on the A540 we get multiple MEMCs. Each one is assigned a colored stripe here of the physical address space. So you have a 16 megabyte space, and each one looks after four megabytes of it. But when you do a virtual access in the bottom half, the user space, a regular program access, all of them light up and all of them will translate that address in parallel. And one of them, hopefully, will translate it and then energize the RAM to do the read, for example. When you put an ARM3 in this system, the ARM3 has its cache, and then the address feeds into the MEMC. So that means that the address is being translated outside of the cache, or after the cache.
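That write-any-byte-to-a-magic-address trick can be sketched like this. To be clear, the bit positions below are invented for illustration — the real MEMC encoding is different and depends on the configured page size — but the principle is the same: all the information rides on the address bus, and the written data value is ignored.

```c
#include <stdint.h>

/* Top 8MB of the 64MB address space: top three address bits set. */
#define TRANSLATION_BASE 0x03800000u

/* Pack a logical->physical page mapping into an address.
   Field positions are ILLUSTRATIVE, not the real MEMC layout:
   bits 0-6   physical page (128 pages)
   bits 7-15  logical page
   bits 16-22 protection level */
static uint32_t memc_map_addr(uint32_t logical_page,
                              uint32_t physical_page,
                              uint32_t protection)
{
    return TRANSLATION_BASE
         | ((protection    & 0x7fu)  << 16)
         | ((logical_page  & 0x1ffu) <<  7)
         |  (physical_page & 0x7fu);
}

/* On the real hardware you'd then do something like:
       *(volatile uint8_t *)(uintptr_t)addr = 0;
   Only the address matters; MEMC decodes the mapping from it. */
```

The point is that MEMC can latch everything it needs from the address lines it already watches, so it never needs a data-bus connection or extra pins.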
So you're caching virtual addresses, and as we all know, this is kind of bad for performance, because whenever you change that virtual address space, you have to invalidate your cache, or tag it — but they didn't do that. There are other ways of solving this problem, but basically on this machine, what you need to do is invalidate the whole cache. It's quite a quick operation, but it's still not good for performance to have an empty cache. The only DMA present in the system is for the video and sound. IO doesn't have any DMA at all. And this was another area where, as a younger engineer, I thought: this is crap, why didn't they have DMA? That would be way better. DMA is the solution to everyone's problems, as we all know. And I think the quote on the right ties in with the Acorn team's discovery that all of these other processors needed quite complex chipsets, quite expensive support chips. So the quote on the right says that some chipset vendors were charging more for their DMA devices than the CPU. So not having a dedicated DMA engine on board was a massive cost saving. The comment I made on the slide before last about the system partitioning: a lot of attention went into how many pins were on one chip versus another, how many buses were going around the place. Not having IOC have to access memory was a massive saving in pins, and in cost for the system as a whole. The other thing is that FIQ mode was effectively the means for doing IO. The FIQ mode was designed to be an incredibly low overhead way of doing programmed IO, having the CPU do the IO. So it was saying: the CPU is going to be doing all of the IO stuff, but let's just optimize it, let's make it as good as it can be. And that's what led to the programmed IO. Also remember, ARM2 didn't have a cache. And if you don't have a cache on your CPU, then DMA is going to hold up the CPU anyway. So on those cycles, DMA is not any performance gain.
You may as well get the CPU to do it, and get the CPU to do it in the lowest-overhead way possible. I think this can be summarized as bringing the RISC principles to the system. So the RISC principles say, for your CPU: don't put anything in the CPU that you can do in software. And this is saying, okay, well actually software can do the IO just as well, without a cache, as a DMA system. So let's get software to do that. I think this is kind of a nice way of seeing it. This is part of their cost optimization, for really very little degradation in performance compared to doing it in hardware. So this is an IO card. They're Eurocards, nice and easy. The only thing I wanted to say here was: this is my SCSI card, and it has a ROM on the left-hand side. This is an expansion ROM, basically, many, many years before PCI made this popular. Your drivers are on this ROM. There's a SCSI disk plugging into this, and you can plug this card in and then boot off the disk. You don't need any other software to make it work. So this is just a very nice user experience. There was no messing around with configuring IO windows or interrupts or any of the ISA sort of stuff that was going on at the time. So to summarize some of the hardware stuff that we've seen: the ARM is pipelined, and it has the load/store multiple instructions, which make for very high bandwidth utilization. That's what gives it its high performance. The machine was really simple. The attention to detail in partitioning the work between the chips and reducing the chip cost as much as possible, keeping that balanced, was really a good idea. The machine was designed when memory and CPUs were about the same speed, so this is before that kind of flipped over. An eight megahertz ARM2 was designed to use eight megahertz memory. There's no need to have a cache at all on there.
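The load/store multiple point is worth making concrete. The bus distinguishes non-sequential accesses (which make the DRAM open a new page) from cheaper sequential ones, so one instruction streaming several registers beats a string of single loads. Here's a toy cost model — the 2:1 cycle ratio is illustrative, not a measured figure from the real hardware:

```c
/* Toy model of ARM2-era DRAM timing: a non-sequential access
   (N-cycle, opening a new DRAM page) costs more than a sequential
   one (S-cycle). The ratio below is an assumption for illustration. */

#define N_CYCLE 2
#define S_CYCLE 1

/* Loading `count` words one LDR at a time: every access is
   non-sequential, so each one pays the full N-cycle. */
static int cost_single_loads(int count)
{
    return count * N_CYCLE;
}

/* One LDM of `count` words: pay the N-cycle once for the first
   word, then stream the rest as cheap S-cycles. */
static int cost_ldm(int count)
{
    return N_CYCLE + (count - 1) * S_CYCLE;
}
```

For eight registers this model gives 16 cycles of single loads versus 9 for one LDM — which is why unrolled loops and load/store multiples mattered so much on a cacheless machine.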
These days it sounds really crazy not to have a cache on the CPU, but if your memory is not that much slower, then this is a huge cost saving. But it was also a saving in risk. This was the first real, proper CPU — if we don't count ARM1; let's say ARM1 was a test, but ARM2 was the first product CPU — and having a cache on that would have been a huge risk for a design team that hadn't dealt with structures that complicated at that point. So that was the right thing to do, I think. And talking about DMA: I'm actually a convert on this. I thought this was crap, and actually I think this was a really good example of balanced design. What's the right tool for the job? Software's gonna do the IO, so let's make sure that FIQ mode is as low-overhead as possible. I already talked about the system partitioning. The MMU — I'm in two minds about it. I still think it's weird and backward. I think there is a strong argument, though, that a more familiar TLB is massively complicated compared to what they did here, and I think the main drive here was not just area on the chip but also to make it much simpler to implement. So it worked, and they really didn't have that many shots at doing this. This wasn't a company or a team that could afford to have many goes at this product. I think that says it all. They did a great job. Okay, so the OS story is a little bit more complicated. Remember, it's gonna be this office automation machine, a bit like a Xerox Star. It was gonna have this wonderful, you know, high-res mono mode, and people were gonna be laser printing from it. So, just like Xerox PARC, Acorn started a Palo Alto-based research center, with Californians and beanbags, writing an operating system: a microkernel, in Modula-2, all of the trendy boxes ticked for the mid-80s. It was, by the sounds of it, a very advanced operating system, and it did virtual memory and so on.
It was very resource hungry, though, and it was never really very performant. Ultimately, the hardware got done quicker than the software, and after a year or two, management got the jitters. The hardware was looming; they said, well, next year we're gonna have the computer ready — where's the operating system? And the project got canned. And this is a real shame. I'd love to know more about this operating system. Virtually nothing is documented outside of Acorn. Even the people I spoke to didn't work on this; a bunch of people in California kind of disappeared with it. So if anyone has this software archived anywhere, then get in touch. The computing museum around the corner from me is raring to go on that. That'd be a really cool thing to archive. So anyway, they now had a desperate situation. They had to go to plan B, which was: in under a year, write an operating system for the machine that was on its way to being delivered. And it kind of shows. Arthur was... I mean, I think the team did a really good job in getting something out of the door in half a year, but it was a little bit flaky. RISC OS then developed from Arthur a year later. I don't know if anyone's heard of RISC OS, but Arthur is very niche and basically got completely replaced by RISC OS, because it was a bit less usable than RISC OS. Another really strong point this machine had is quite a big ROM: half a megabyte in the 80s, going up to two megabytes in the early 90s. There's a lot of stuff in ROM. One of those things was BBC BASIC V. I know it's 2019, and I know BASIC is BASIC, but BBC BASIC is actually quite good. It has procedures, and it's got support for all the graphics and sound. You could write GUI applications in BASIC, and a lot of people did. It was also very fast. Sophie Wilson wrote this, a very, very optimized BASIC interpreter. I talked about the ROM modules on the podules — this was the expansion ROM thing. So, a really great user experience there.
But speaking of user experience, this was Arthur. I never used Arthur. I just dug out a ROM and had a play with it. It is bloody horrible. So that went away quickly at the time, too. So part of this emergency plan B was to take the Acorn software team, who were supposed to be writing applications for this, and get them to quickly knock out an operating system. So at launch, basically, this was one of the only things that you could do with the machine. It had a great demo called Lander, of a great game called Zarch, which is a 3D space you could fly around. It didn't have serious business applications, and there was not much you could do with this really expensive machine at launch, and that really hurt it, I think. So then we get RISC OS 2 in 1988, and this is now looking less like a vomit-y sort of thing. Much nicer machine. And then eventually RISC OS 3. There's drag and drop between applications. There's multitasking. There are outline fonts, anti-aliasing and so on. So, just lastly, I wanted to quickly touch on a really interesting operating system. Acorn had a UNIX operating system. As well as being a CPU geek, I'm also a UNIX geek, and I've always been fascinated by RISC iX. These machines were astonishingly expensive. They were the existing Archimedes machines with a different sticker on. So that's an A540 with a sticker on the front. And this operating system was developed after... the Archimedes was already designed at that point, when this operating system was being developed. So there's a lot of stuff about the hardware that wasn't quite right for a UNIX operating system. The 32K page size on a four megabyte machine really, really killed you in terms of your page cache and that kind of thing. They turned this into a bit of an opportunity, or at least they made good on some of this.
There was quite a novel online decompression scheme: you could demand-page in text from a binary, and it would decompress into your 32K page, but it was stored in a sparse way on disk. So actually the on-disk use was a lot less than you'd expect. That was the only way it would fit on some of the smaller machines. Also, Acorn's technical author department designed the Cybertruck, it turns out. This was their view of the A680, which was an unreleased workstation. I love this picture. The piece of cheese, or cake, as the mouse is my favorite part. But this is the real machine. This is an unreleased prototype I found at the computing museum. It's notable in that it's got two MEMCs. It's got eight megs of RAM. It's only designed to run RISC iX, the UNIX operating system, and it has high-res mono only; it doesn't have color. It was designed to run FrameMaker and drive laser printers and be a kind of desktop publishing workstation. I've always been fascinated by RISC iX, as I said a while ago. I hacked around on ArcEm for a while and got it booting in ArcEm. I'd never seen this before — I've never used a RISC iX machine. So there we go, it boots as multi-user. But wait, there's more. It has a really cool little X server, a very fast one. Sophie, again, worked on the X server here. So it's very, very well optimized and very fast for a machine of its era, and it makes quite a nice little UNIX workstation. It's quite a cool little system. By the way, Tudor, the guy that designed the VIDC, called me a saddo for getting this working in there. So that's my claim to fame. Finally — and I wanted to leave some time for questions — there's a lot of useful stuff in ROM. One of them is BBC BASIC. BASIC has an assembler, so you can walk up to this machine with a floppy disk and write assembler. It has a special bit of syntax there, and then you can just call it. And so this is really powerful.
So at school or something, with a floppy disk, you can do something that's a bit more than BASIC programming. And bizarrely, I managed to write that with only two or three tiny syntax errors after about 20 years away from this. So it's in there somewhere. Legacy-wise, the machine didn't sell very many — under 100,000, easily. I don't think it really made a massive impact; PCs had really taken off by then. The ARM processor — I'm not gonna go on about the company — that, clearly, has changed the world in many ways. The thing that I really took away from this exercise was that a handful of smart people, not that many, on the order of a dozen, designed multiple chips, designed a custom computer from scratch, got it working, and it was quite good. And I think that this really turned people's heads. It made people think differently. People that were not Motorola and IBM and really, really big companies with enormous resources could do this and could make it work. I think that actually led to the thinking that people could design their own systems-on-chip in the 90s, and that market taking off. So I think this was really key in getting people thinking that way: it was possible to design your own silicon. Finally, I just wanna thank the people I spoke to, and Adrian and Jason at the Centre for Computing History in Cambridge. If you're in Cambridge, then please visit. It's a really cool museum. And with that, I'll wrap up. If there's any time for questions... I'm getting a blank look. No time for questions? There's about five minutes left for questions. I won't go over that. Or come up to me afterwards and I'm happy to chat more about this. Thank you. First question is for the internet. Signal Angel, that will be you. Well, grab your microphones and I'll get first to the room here. Center microphone, please. Ask the question.
You mentioned that the system is making good use of the memory, but how is it actually not completely stalled on memory, having no cache and the same cycle time for the memory as for the CPU? Good question. So, how is it not always stalled on memory? Well, it's sometimes stalled on memory. When you do something that's non-sequential, you have to take one of the slow cycles — that was the N-cycle. The key is, you try and maximize the amount of time that you're doing sequential stuff. So on the ARM2, you wanted to unroll loops as much as possible, so you're fetching your instructions sequentially, right? You wanted to make as much use of load/store multiples as possible. You could load single registers with an individual register load, but it was much more efficient to pay that cost just once at the start of the instruction and then stream stuff sequentially. So you're right, it's still stalled sometimes, but that was still a good trade-off, I think, for a system that didn't have a cache for other reasons. Thanks. No worries. Next question is for the internet. Are there any Acorns on sale right now? Or, if you want to get into this kind of hardware, where do you get it? Can you repeat the first sentence, please? Sorry, the first part. If you want to get into this kind of hardware, you want to buy it right now, where to? Yeah, good question. So, how do you get hold of one? Drive prices up on eBay, I guess, I hate to say. It might be fun to play around in emulators. I always prefer to sort of hack around on the real thing — emulators always feel a bit strange — but there are a bunch of really good emulators out there, quite complete. Yeah, I think I would just go on auction sites and try and find one, unfortunately. They're not completely rare. I mean, that's the thing. They did sell — I'm not quite sure of the exact figure, but there were tens and tens of thousands of these things made.
So I would also look in Britain more than elsewhere, although I understand that Germany had quite a few. If you can get hold of one, though, I do suggest doing so. I think they're really fun to play with. Okay, next question. I found myself looking at the documentation for the LDM/STM instructions while debugging something on ARM just last week. Oh, great. And it made me wonder, watching your talk: are there any quirks of the Archimedes that have crept into the modern ARM design and instruction set that you're aware of? Most of them got purged. So, the 26-bit addressing; there were a couple of strange uses of the PC, like an XOR instruction into the PC for changing flags. There was a great purge when the ARM6 was designed. And the ARM6 is — I should know — ARMv3. That's got 32-bit addressing, and a lot of these weirdnesses got moved out. I can't think of any, aside from the resulting 32-bit ARM instruction set being quite quirky and having a lot of the good quirks — the shifted register is sort of a free thing you can do. For example, you can add one register to a shifted register in one cycle. I think that's a good quirk. So in terms of inheriting that instruction set and not changing those things, maybe that counts. Any further questions? Internet, any new questions? Nope, okay. Well, in that case, a warm round of applause. Thank you.