Okay, so this talk is called A Deep Dive into the World of DOS Viruses. And if you happened to be at the 8C3, that is 27 years ago, you would have seen a very young and awkward, even more awkward than I am at the moment, version of myself speaking on basically the same subject. The stage, of course, was a lot smaller than this. This would have really intimidated me back then. But I was talking about a university project that we had run for about three years at that point. And our possibilities were very limited. Meanwhile, 27 years later, our speaker, in between fighting battleships over the public BGP network and trying to encode data in dubstep music, was able to actually do all of the stuff that we were trying to do with a lot of effort, basically in, I guess, four hours of CPU time or something like that. Please help me in welcoming Ben to our stage to talk about a bygone era. Thank you.
Hi, I'm Ben Cartwright-Cox, as the slide suggests. So I have an admission to make. So this is a thing to be aware of. And, you know, things also to be aware of. Anyway, to get straight into it: what is DOS? You can do it in a bullet-point way. DOS is an upgrade from CP/M, another very old legacy system. But the third thing to be aware of is that DOS covers a wide range of vendors. It might not just be those old IBM PCs. Some of the DOSes had compatibility with each other, meaning that some of the DOSes shared malware with each other. But to be honest, most people know DOS as these lovely old beige boxes. The same era gave us our beloved Model M keyboard, hated by some, loved by others for the sound. But most people's knowledge of DOS came from a user interface that looked like this. Pretty basic. There we go. Okay. So this is WordStar. Some of you may not know that Game of Thrones was written on WordStar. George R.R. Martin is apparently not a big fan of modern word processing. He admitted he had some issue with how spell checking worked.
So he just uses that, and I guess it's also good from a security point of view. You can't get hacked if it literally has no internet access. Also, for a lot of people, for some of the older crowd, this was their first experience of programming. This was also the time of QBasic, which gave people a very basic language to program creatively in DOS. For some people, this was the gateway drug into programming and perhaps the gateway drug into what became their career. For other people, though, the experience of DOS was not so great. For example, let's just say you were doing some work in an infinite loop. And at some point, stuff like this happens. Unfortunately, I don't have sound for this one. But you can just, in your head, imagine your PC speaker playing some small techno music on only one frequency at a time. This might get especially embarrassing if you're in an office environment. Just slowly beeping away. You can't exit this. It has to finish fully. And if you touch the keyboard, it reminds you not to touch the keyboard and continues playing its music. So, you know, this could be fun, but it wouldn't be fun if you're in an office environment. But, you know, ultimately, it's not malicious. And that trend continues. This is another good example of a DOS virus. This is Ambulance. When you run it, an ambulance just drives past. And then your normal program just continues running. I think this is amazing. It's an interesting era of viruses. The history of it all was collected very well by a website called VX Heavens, which sort of still lives. But, unfortunately, it was at one point raided by the Ukrainian police over, and this is the fantastic wording they used, basically, that someone told them that they were distributing malware. Unfortunately, not malware that operates in this century. But I guess that's good enough for a raid. But, luckily, for the archivists, there are archivists of archivists. And so, we have a saved capture of VX Heavens.
This is actually an old snapshot. There are way more modern snapshots. But, thankfully, the MS-DOS virus era doesn't move very quickly. But the interesting thing here is, there are 66,000 items in this table, and it's 6.6 gigabytes of code. And these viruses are super dense. There's not much to them. They are just blobs of machine code. They're not like your Electron app these days that ships an entire Chrome browser. And normally an out-of-date Chrome browser. You know, this is just basically how to draw an ambulance and some infection routines. The normal distribution mechanism is different as well. For example, the normal life cycle of an MS-DOS virus is, you know, you download or, for some other reason, run an infected program. That presumably does nothing. To you, it looks like it does nothing. So, you know, it remains roughly undetected. Then you go and run more files. The DOS virus infects more files. And at some point, you're probably going to give one of those files to some other computer or some other person. Whether it was by giving someone, or copying, a floppy disk of some software, maybe some expensive software so they didn't have to pay for it. Or uploading it to a BBS where it could be downloaded by many people. So, the distribution mechanism is a far cry from the EternalBlues of this era, where, you know, we can have a strain of malware spread across the world very brutally, very quickly. So, most DOS viruses are pretty simple. They start, and they ask: have my payload conditions been met? If they are met, they'll go and display the payload. And the payloads are definitely more, I don't know, nice. You know, you have stuff like this, which is pretty. And it uses VGA colors and all sorts of really nice stuff. You also get some very demoscene vibes from this.
Another good example is this VGA, super trippy thing, which is really impressive because this is really small. This is less than one kilobyte of code. It's, in fact, way less than one kilobyte, like 64K. Or you just get interesting screen effects as well. For example, it's quick, but you can just watch the entire computer dissolve away, which also might be quite worrying if you weren't expecting that. Alternatively, if the payload conditions are not met, then, you know, you hook syscalls or, if you want to be way more aggressive as a malware author, you scan for files on the system to infect proactively. And the way you infect DOS programs is pretty simple. Imagine you have one giant tape of all the code you have for the target program. Most of them work like this, in that they replace the first three bytes of the program with an x86 jump. They append their malware onto the end of the executable. And so the first thing that happens when you run the executable is it jumps to the end of the file effectively, runs the malware chunk, and then it optionally will return control back to the original program. But there's also a thing about hooking syscalls, right? So, you know, MS-DOS is an operating system. It does have syscalls. The programs can reach out to MS-DOS to do things like file access and stuff. So as you'd expect, you run a software interrupt to get there. Thankfully, though, MS-DOS does also allow you to extend MS-DOS by adding handlers yourself, or even overwriting existing handlers, which is very convenient if you're trying to write drivers, but it's also incredibly convenient if you're trying to write malware. As for some examples of the syscalls, most of them are relevant to DOS virus making. Here is a decent example of the things that DOS will provide you. A lot of them are just very useful in general for producing functional executables that users want to use.
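From the analyst's side, the appending infection just described (a three-byte jump at the start, the virus body at the end) leaves a recognisable shape in the file. The following is a minimal sketch, not the speaker's tooling. It assumes a .COM image, which DOS loads at offset 0x100, and an infector that uses the near jump opcode 0xE9. The function names and the "last quarter of the file" heuristic are illustrative assumptions.

```python
import struct

COM_LOAD_OFFSET = 0x100  # DOS loads .COM images at this offset in their segment


def entry_jump_target(image: bytes):
    """Return the file offset a leading near JMP (E9 rel16) lands on, or None."""
    if len(image) < 3 or image[0] != 0xE9:
        return None
    (rel16,) = struct.unpack_from("<H", image, 1)
    # The displacement is relative to the end of the 3-byte instruction and
    # wraps around within the 64 KiB segment.
    target_ip = (COM_LOAD_OFFSET + 3 + rel16) & 0xFFFF
    return target_ip - COM_LOAD_OFFSET


def looks_appended(image: bytes, tail_fraction: float = 0.25) -> bool:
    """Heuristic only: does the entry jump land in the last part of the file?"""
    target = entry_jump_target(image)
    if target is None or not 0 <= target < len(image):
        return False
    return target >= len(image) * (1 - tail_fraction)
```

Plenty of clean programs also start with a jump, so this only narrows down candidates. It does not identify a virus.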
This is what the average program looks like. This is almost the shortest hello world you can make, minus the actual hello world string. In fact, the hello world string might be the largest part of this binary. It's a pretty simple binary here. We're moving a pointer to the message we just defined. We then set the AH register to 9, or hex 9. That's the syscall for printing a string. And then we run a software interrupt, 21H, which is short for 21 hex. And we continue on. We then set AH again, to 4C, which is exit with a return code. And the program will return. So in the meantime, this is roughly the loop that just happened. You have your program code that calls an interrupt. And that gets passed over to the interrupt handler. In the process of doing this, the CPU has quickly looked at the first 100 bytes of memory in the interrupt vector table, IVT, as it's abbreviated. And then it's effectively a router. If anyone has written a small piece of code to route HTTP requests or anything, it's basically like that, but in the 80s with syscalls. So it just basically is saying compare this, compare that, jump there. Then the thing gets passed to the call handler. It goes and does the syscall, the thing that was required. Normally it will leave some registers behind as state or results of actions it has performed. And it returns control back to the program. So theoretically speaking, if we wanted to go and look at what a program actually does, we need to set a breakpoint here. Because this is the only place where we can be sure the location exists. Because this is way before the era of ASLR, address space layout randomization. And this is way before the era of kernel address space randomization. In fact, MS-DOS has almost no memory protection whatsoever. Once you run a program, you are basically handing full control of the system to that program, which means you can happily also boot things like Linux directly from a COM file, which is handy if you want to upgrade.
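To make the walkthrough concrete, here is a hedged sketch that assembles that hello world by hand as a .COM image, using the standard encodings for `mov dx, imm16` (BA), `mov ah, imm8` (B4) and `int imm8` (CD). The helper names are assumptions for illustration. The second function shows where an interrupt's vector lives: in real mode the table holds four-byte entries (offset word, then segment word) starting at address 0, so the entry for interrupt 21h sits at 0x84.

```python
import struct

COM_LOAD_OFFSET = 0x100  # DOS loads .COM images at offset 0x100 in their segment


def hello_com(message: bytes = b"Hello, world!") -> bytes:
    """Hand-assemble the five-instruction program, followed by its string."""
    code_len = 3 + 2 + 2 + 2 + 2            # sizes of the instructions below
    msg_addr = COM_LOAD_OFFSET + code_len   # where the string ends up in memory
    code = (
        b"\xBA" + struct.pack("<H", msg_addr)  # mov dx, msg
        + b"\xB4\x09"                          # mov ah, 09h  (print string)
        + b"\xCD\x21"                          # int 21h
        + b"\xB4\x4C"                          # mov ah, 4Ch  (exit with return code)
        + b"\xCD\x21"                          # int 21h
    )
    return code + message + b"$"               # the string ends in a dollar sign


def ivt_slot(interrupt: int) -> int:
    """Physical address of an interrupt's 4-byte vector in the IVT."""
    return interrupt * 4
```

With the default message the image is 25 bytes long, and 14 of those are the string and its terminator, which matches the remark that the string is the largest part of the binary.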
So if we look at certain files, we can go and see what they do. So in this case, here is one example. This is a goat file. A goat file is like a sacrificial goat. It is a file that is purely designed to be infected. So what you do is you bring a virus into memory in the system and then you run a goat file in the vague hope that the virus will infect it. And then you have a nice clean sample of just that virus and not another program inside the virus, which makes it way easier to test and reverse engineer. So we can see things that are happening here, for example. We can see it opening a file, moving where it is looking within the file, reading some data from the file, just two bytes though, and closing the file. We see the same sort of thing repeat itself, except at one point it reads a large amount of data, moves the file pointer, writes another large amount of data, does some more stuff. And then we parse some file names, we display a string, which is almost definitely the goat file message. And yeah, we pretty much exit after that. So there were a few syscalls here that we would really like to know more about. First, the file opens: we would really like to know what files were being opened. We would also want to know what data was being written to the file, rather than having to fish it out of the virtual machine later. And we would also, just out of curiosity, really want to know what file names it was asking MS-DOS to parse. The display string is also a nice test to know whether your code is working. So to do this, you're going to have to look a little bit deeper into how the MS-DOS runtime and, by proxy, how x86 in 16-bit mode works, or legacy mode, I guess. This is basically all the registers you have in 16-bit mode, and some nice computations at the bottom to make it easier to read. So as we mentioned, AH is the one that you use to specify which syscall you want, and you'll notice it's not there. AH is actually the upper half of AX.
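The relationship between AX and AH, and the way AH selects the DOS function, can be sketched in a few lines. This is an illustrative sketch: the table covers only the calls mentioned in this talk, using the standard INT 21h function numbers, and the function names are assumptions.

```python
# INT 21h function numbers for the calls seen in the trace above.
DOS_FUNCTIONS = {
    0x09: "display string",
    0x29: "parse filename",
    0x2A: "get date",
    0x2C: "get time",
    0x3D: "open file",
    0x3E: "close file",
    0x3F: "read from file",
    0x40: "write to file",
    0x42: "move file pointer",
    0x4C: "exit with return code",
}


def split_ax(ax: int) -> tuple[int, int]:
    """Split the 16-bit AX register into its 8-bit halves (AH, AL)."""
    return (ax >> 8) & 0xFF, ax & 0xFF


def describe(ax: int) -> str:
    """Name the DOS function selected by the AH half of AX."""
    ah, _ = split_ax(ax)
    return DOS_FUNCTIONS.get(ah, f"unknown function {ah:02X}h")
```

For example, `describe(0x4C00)` names the exit call, and the low byte of that AX value (AL) is the return code handed back to DOS.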
AH is an 8-bit register, because sometimes people really just want only 8 bits. It's very obscure now that we were saving that much space. And so this is the definition of the print string syscall. In order to call the syscall for printing a string, you just set AH to 9, and then you need to set DS and DX to a pointer to a string that ends in a dollar sign. And that doesn't make a lot of sense, or it didn't make a lot of sense to me when I first read that. And so to do this, we need to learn a little bit more about how memory works on these old CPUs, or the CPUs that are probably in your laptops, but running in an older mode. So this is effectively what it looks like. We have a 16-bit CPU, 2 to the 16 is 64 kilobytes, and we have a 20-bit memory address space. 2 to the 20 is 1 megabyte. So if you ever see an MS-DOS machine limited to 1 megabyte, or some old operating system saying the maximum memory it can have is 1 megabyte, it's because it's running in 16-bit mode. And the maximum it can physically see is 20 bits. So the question is, how do we address anything above 64K if the CPU can only fundamentally see 16 bits? So this is where segment registers come in. We have four segment registers. Actually, we might have more, but these are the ones you need to care about. There's the code segment, the data segment, the stack segment, and the extra segment in case you need just another one. So anyway, with that in mind, let's have a quick crash course on segment registers. So imagine you have a very long piece of memory, and we can only see 16 bits' worth at a time. However, we can move the sliding window around in the memory to move our view of where it is. So we can do this and put data around the system, and we can use a final pointer to specify how far into the memory segment we should go. So the DS in DS and DX is really just a multiplier.
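The sliding-window arithmetic can be written down directly. This is a minimal sketch under the usual real-mode rule that the segment is shifted left by four bits (multiplied by 16) and added to the offset. The mask models a 20-bit address bus that wraps at 1 MiB, which is an assumption about the original 8086 rather than something stated in the talk, and the names are illustrative.

```python
OFFSET_BITS = 16    # width of a register, and of the window one segment gives you
ADDRESS_BITS = 20   # width of the physical address bus

WINDOW_SIZE = 1 << OFFSET_BITS      # 65536 bytes reachable through one segment
ADDRESS_SPACE = 1 << ADDRESS_BITS   # 1048576 bytes, i.e. 1 MiB in total


def linear_address(segment: int, offset: int) -> int:
    """Physical address for segment:offset, wrapping at 20 bits."""
    return ((segment << 4) + offset) & (ADDRESS_SPACE - 1)
```

One consequence of the overlap is that many segment:offset pairs name the same byte. For instance, 0100:0010 and 0101:0000 both resolve to physical address 0x1010.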
So where the data segment is 100, you need to just move 100 times 16 to get to the correct place in memory. And then DX is the offset. This continues on. So where we have a 16-bit CPU, we have a bunch of general use registers, or general purpose registers. They're quite useful for ensuring you don't need to touch RAM too often. x86 actually has a fairly small number of general purpose registers. Some architectures have way more. I think more modern chips like GPUs have hundreds, well, hundreds, maybe thousands. However, this doesn't really change over time in x86 because we have to keep backwards compatibility. So really what actually ends up happening when we move up the bitage is that the same registers just get wider. And we add some more for the programmers that want them. And the exact same thing happened with 64-bit, the registers got wider. So thinking about it, we have a lot of malware now. What if we want to know everything that happens in this entire archive? So we kind of want to trace all of these automatically, but we might not know what we're looking for. So let's go through the checklist of what we need to do to trace all of this malware. We need a breakpoint on the syscall handler. When we hit that breakpoint, we need to save all the registers so we know which syscall was run and potentially what data is being given to the syscall. Ideally, we're going to save 100 bytes from that data pointer, not necessarily because we always need it, but because it's quite handy in a lot of syscalls. It's, for example, what you use to get the file path when you're opening files. We should also probably record the screen for quick analysis rather than just staring at HTML tables. And so we can do that. We burn a lot of CPU time and probably cause some minor amount of environmental damage. And we get nothing. We just run a bunch of stuff, and most of them don't return anything. At best, they return a goat file string. They just do nothing.
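The checklist above (break on the handler, save the registers, grab 100 bytes at DS:DX) can be sketched as a small record type. This is an illustrative sketch, not the speaker's tracer. It assumes the debugger hands over the registers as a dictionary and guest memory as a flat byte buffer, and every name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SyscallRecord:
    ah: int            # which DOS function was requested
    registers: dict    # full register snapshot at the breakpoint
    data: bytes        # bytes found at DS:DX when the call was made


def linear(segment: int, offset: int) -> int:
    """Real-mode address: segment times 16 plus offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def snapshot(registers: dict, memory: bytes, grab: int = 100) -> SyscallRecord:
    """Record one syscall: registers plus `grab` bytes from the DS:DX pointer."""
    start = linear(registers["ds"], registers["dx"])
    return SyscallRecord(
        ah=(registers["ax"] >> 8) & 0xFF,
        registers=dict(registers),
        data=bytes(memory[start:start + grab]),
    )


def dollar_string(data: bytes) -> str:
    """Decode a print-string argument, which ends at the first dollar sign."""
    end = data.find(b"$")
    if end == -1:
        end = len(data)
    return data[:end].decode("ascii", errors="replace")
```

Grabbing a fixed 100 bytes is wasteful for calls that do not use DS:DX as a pointer, but it keeps the tracer free of per-syscall logic, which is the trade-off the checklist describes.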
So if we look deeper into the reason why, there's sort of a smoking gun here. So we can see the syscalls that run on this file that does nothing. And the smoking gun here is the date. So it's asking for the date from the system. And this flags up the first issue, which is that a lot of MS-DOS viruses don't really have a lot to go on because they have no internet connection. And there's not really any other state they can decide to activate on. So the date syscall is pretty simple. Get date and get time just return all of their values in registers, some using the 8-bit halves to save space. So a naive way of doing this is that we would run the sample. We'd wait for the syscall for date or time. We would just fiddle all the values, because in this case we're using a debugger, so we can automatically change what the state of the registers is. And we can then observe whether any of the syscalls that the program ran changed, which is a pretty good indication that you've hit some behavior that is different. And then we can say, hooray, we found a new test case. The downside is that running every one of these samples takes 15 seconds of CPU time, or 15 seconds of wall time, which when you're emulating MS-DOS is 15 seconds of CPU time, because MS-DOS doesn't have a power saving mode. So when it's not doing anything, it just goes into a busy loop, which makes it very hard to optimize. Or we could take a cleverer look. So when we think about it, we are in the interrupt handler. All we ever see is the inside of the interrupt handler, because we don't know where the program code is. The interrupt handler is the only place that we know is consistent, because MS-DOS could potentially load the code for the malware or the program anywhere. But we want to know where the code is. It would be really handy to know what the code is that is about to run. So for this, we need to look towards the stack.
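The naive approach described above amounts to overwriting the registers that the get date call returns. A hedged sketch follows, assuming the standard INT 21h, AH=2Ah return layout (CX is the year, DH the month, DL the day, AL the day of the week with Sunday as 0). The register dictionary and the function name are illustrative assumptions, not the speaker's debugger interface.

```python
from datetime import date


def forge_date_registers(registers: dict, fake: date) -> dict:
    """Return a copy of the registers as if get date had reported `fake`."""
    forged = dict(registers)
    forged["cx"] = fake.year                      # CX = year
    forged["dx"] = (fake.month << 8) | fake.day   # DH = month, DL = day
    day_of_week = fake.isoweekday() % 7           # DOS counts Sunday as 0
    forged["ax"] = (registers.get("ax", 0) & 0xFF00) | day_of_week  # AL
    return forged
```

Running the sample once per forged date and comparing the resulting syscall traces is what makes this approach so expensive: each trial costs a full emulated run.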
Just like the DS and DX registers, the stack is located by a stack segment and a stack pointer. Luckily, the first two values on the stack are the instruction pointer and the code segment. So we can use that to grab exactly what code will be run afterwards. So we just need to add a few things to our checklist. We're going to grab four bytes from the stack pointer. And then using that, we can calculate the destination that the syscall will return to. And if we look at some of them, we can look at an example here. This is what one of the calls returns to. So we see we're running a compare on DL against the hex value 0x1e. And then if that comparison is equal, it will jump to one memory address. And if not, it will jump to another. So if we look back at our definition of those syscalls, we can see that DL is the day. So with this, we can conclude that, since 0x1e is 30 and DL is the day, this malware is effectively saying: if the day of the month is 30, we need to go down a different path. If we run these across the whole dataset, what we see is roughly this, as a poorly drawn bar chart. We see that out of the 17,500 samples we have, around 4,700 of them check for the date and time. And these are the ones that are really tricky because they're really hard to activate. They're also the most interesting, though, because those are the ones trying to hide. So with that in mind, we have the code segment that we're about to run when we return. We can't brute force it inside a real or emulated machine, because it takes a lot of CPU time. But we can brute force it in a significantly more interesting way. We need to build something. We need to build the world's worst x86 emulator. So, dubbed Ben x86, it's 16-bit only. Any attempt to access memory effectively ends the simulation. It's got a fake stack. If you try and push something onto the stack, it says sure, fine. If you try and pop it, it's like, oh, actually, I never held any of that data anyway.
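The two steps described so far, reading the return address off the stack and then asking which dates send the returned-to code down a different path, can be sketched as follows. This is not the speaker's emulator. It assumes a flat byte buffer for guest memory and handles only the single pattern from the example, `cmp dl, imm8` (bytes 80 FA ib) followed by `je` (74) or `jne` (75). All names are hypothetical.

```python
import struct


def return_address(memory: bytes, ss: int, sp: int) -> int:
    """Linear address an interrupt returns to: IP, then CS, sit at SS:SP."""
    top = ((ss << 4) + sp) & 0xFFFFF
    ip, cs = struct.unpack_from("<HH", memory, top)   # the four bytes we grab
    return ((cs << 4) + ip) & 0xFFFFF


def days_taking_branch(code: bytes) -> list:
    """Days 1 to 31 for which `cmp dl, imm8` plus `je`/`jne` takes the jump."""
    if len(code) < 5 or code[0:2] != b"\x80\xFA" or code[3] not in (0x74, 0x75):
        raise ValueError("this sketch only understands cmp dl, imm8 + je/jne")
    wanted_day = code[2]
    jump_if_equal = code[3] == 0x74
    return [day for day in range(1, 32) if (day == wanted_day) == jump_if_equal]
```

Because nothing here boots DOS or touches a virtual machine, trying every day of the month costs microseconds rather than 15 seconds per run, which is the point of evaluating the comparison outside the machine.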
Sorry, we're ending the simulation. It has 80 opcodes, and most of them are jumps, because its primary purpose is comparing and jumping. The difference is that it logs every opcode and every address that it went through. And it can be run with just a small x86 code segment and a register snapshot. This means we can test all days from 1980 to 2005 in roughly 100 milliseconds. And most programs ended up having just three different code paths on average. So that leaves us with 17,000 virus samples and about 10,000 samples that had date variations, as in, once you explode the complexity. So I'm going to now use my final remaining time to go through some of my favorites. So this is an example of a virus that just doesn't do anything on the first of 1980. However, if you were to happen to be running this on New Year's Day, you would get this. No matter what you do, on every program, you can't exit out of this. Your machine is hung. This might be great, right? You might be like, oh, cool, I don't need to do work anymore because my computer will literally not let me. This also might be terrible, because you might need to do some work on New Year's Day. Here's another example. This does nothing as well. Just another innocent .com file. Of course, remember that these pieces of malware will normally be wrapped around something else. So, you know, almost anything could be infected in here. In this case, though, these binaries are nice and shaved down. However, on the right date, we instead get this, which I think is super interesting. Basically, the author is aware. They're telling you. They're actually self-disclosing. They're saying, the previous year, I infected your computer. And for some reason, they're being nice. They're just saying, yeah, actually, you have been infected. And as, I guess, a pity, I'm just going to remove myself now. For some reason, it's also encouraging you to buy McAfee. This was back in the day when John McAfee himself actually wrote McAfee.
Interesting times. Definitely interesting times. Here is another example. This one I found particularly obscure. On the 8th of November in 1980, or any year, I think, actually, it turns all zeros on the system into tiny little glyphs that say hate. If anyone understands this, I'd really like to know. I've been thinking about this a lot. What does it mean? Is it an artistic statement? Is it, I wish I knew. There could be a CCC variant that says Mate. Another good one, in that it's the last thing I ever want to see any program tell me, is this one here, where you run it and it says, error eating drive C. I never ever want an error in any program that unexpectedly just says, sorry, I failed to remove your root file system. Don't know why. Could you change your settings so I can remove it? Cheers. And finally, this is one of my absolute favorites, in that it's just brilliant, and it also stops you from running the program you want to run. It exits prematurely. This is the virus version of the Navy SEAL copypasta. It says, I am an assassin. I want to and I shall kill you. I also hate Aladdin. And I also will kill it. I will eliminate you, and we know where this is going. It says, fear the virus. It is more powerful than God. It only activates on one day though, so it's fine. Thank you for your time. I know it's late. And I will happily take any questions or corrections if you know this topic better than me.
This totally brings tears to my eyes with nostalgia. So if there are any questions, we have microphones throughout the room. There's like one, two, three, four and one in the back. We also have questions perhaps from the internet. If you want to ask a question, come up to the microphone and ask a question. Just as a reminder, a question is one or two sentences with a question mark behind it and not a life story attached. So let's see what we have. I'm going to start with microphone number one just because I can see it easiest. Let's go for it.
Hi Ben.
Thanks for the talk. Really interesting. My question would be, did you do any analysis on what ratio of the viruses was more artistic and which ones actually did damage?
So most of them surprisingly don't do damage. I actually really struggled to find a date-varying sample that specifically activated on a certain day and decided to delete every file. There are some very good ones. Some of them are like virus scanning utilities that just don't do anything on certain dates. And on one day, while they're telling you all the files they're scanning, they're actually telling you all the files they're deleting. So that's particularly cruel. But it's actually surprisingly hard to find a virus sample that was brutally malicious. There were some that were just infected binaries, but it's very hard to find one that I think was brutally malicious. Which is a far cry from the days that we live in right now, where we're taking down hospitals with Windows bugs.
As everybody's leaving the room, please do it quietly. I see a question at three on that side.
Yes, since a lot of industrial control systems still run DOS, what's the threat from DOS malware that might be written today?
It's probably unlikely that an industrial control system that's running DOS would come into contact with DOS malware. The only way I can think of is if a vendor, or a factory or supplier or whatever, was basically downloading warez onto industrial control boxes. I wouldn't be surprised, but it would be pretty irresponsible. But it would be quite surprising to find MS-DOS malware today on industrial controllers that was installed recently and is not just a lingering infection from the last 20 years.
Microphone 2?
Did you find any conditions that weren't date based?
Some of them try and circumvent the date recognition. Unfortunately, it's very hard to brute force those.
Some of them install themselves as what's called a TSR, terminate and stay resident. Which basically means that they will exit out, run in the background and continuously ask the actual system timer what time it is. It's a bit more of a risky strategy because the system timer might not exist, which would be unfortunate for the virus. So definitely there are viruses that have way more complicated execution conditions. I observed one sample that only activated after, I believe it was something silly, like 100 key presses, which is very hard to automatically test. Those sorts of viruses require static analysis, and statically analyzing 17,000 samples is a time-consuming task.
So we have a question from the internet.
Do you have the source, or what is the source of the malware that you analyzed here, and is it published somewhere?
You can still find dumps of VX Heavens, and probably more modern dumps of VX Heavens, on popular torrent websites, but I'm sure there are also copies floating about on non-popular torrent websites.
Over to microphone one.
Hi Ben. I'm Job. Thank you for your talk. I was wondering, did you learn anything from your studies of these viruses that should be taught in modern-day computer science classes? Like a more efficient sorting algorithm or some hidden gem that actually should be part of our approach to computing these days?
My primary takeaway was that x86 was a mistake.
So I'm not seeing any more questions, which, oh no, there is, okay, one more question from the internet.
Have you found malware samples that tried to detect dummy binaries or whatever to avoid easy analysis?
Oh, actually, that's a really good question. So it's complicated. So some viruses would, so maybe let's be dangerous. Let's try and go backwards on my home-written presentation software. So come on. Too many slides. I have regrets. Yeah, okay, here we are. This slide. Okay, so you know here how I'm saying that the malware infection goes to the end?
Well, some samples are really cool in that they don't change the size of the file. They just find areas of the file that are full of null bytes and just say, this is probably fine, I'm just going to plop myself here, which may have unintended consequences. It may mean that if a program has a statically defined byte array of a certain size, and the program is relying on it being zeros when it accesses it for the first time, it may get very surprised to find there's some malware code in there. But generally speaking, as far as I'm aware, this deployment procedure works pretty well, and it actually is very good at avoiding antivirus of the era, which would just be checking common system files and their size, and you know, if the size of command.com increases, then that's clearly bad news.
We have a question on microphone one.
Are there any viruses that try to eliminate or manipulate virus scanners of the day?
Oh yeah, so a lot of the samples will actively go and look for files of other antiviruses, but I am generally under the impression that it's kind of hard to find them. There weren't actually that many antivirus products back in the day. I feel like it was a bit of a niche thing to be running. Microsoft did for a while ship their own antivirus with MS-DOS. So I guess, you know, what's new is old. So there were antiviruses out there. I don't think many of them were very effective.
Any more questions there? Where? Oh right, another one from the internet. Interesting that the internet is asking about MS-DOS all the time. Go ahead.
Did you do the diagrams by hand or do you have a tool?
So many hours. No, so there's a couple of good tools to do it. ASCIIFlow, I think, is a fantastic tool. I would highly recommend it. I think it's not maintained very well though.
Microphone one.
Are you publishing the tools you wrote?
I will be publishing the tools at some point when they are less ugly.
I will definitely be publishing all of the automatic malware runs and the GIFs generated by them, so that people can easily, basically, Google for the virus names and get actual real-time versions. The hardest thing that I found when looking at virus names was literally just finding any information about them. And one of the things I really wish had existed at the time of writing this talk was being able to just query a name and be like, oh yeah, this virus, it looks like it does this.
Since I saw microphone one first, let's go with that.
Did you find any viruses that had signatures in them, not signatures as we mean them today but the name of the author, like he was very proud of what he wrote?
Yeah, there are some notable examples. DOS viruses obviously do have sample names, in the same way that, you know, we still give viruses names today. A lot of the time they will just encode the string that they want the virus to be named. It's, you know, somewhere in the file, just a random string doing nothing, and it's like, oh okay, they clearly wanted it to be called Tempest. So that does happen. One of my favorite examples is the Brain malware, which literally encodes an address and phone number of the author, I believe in Pakistan, and there's a fantastic mini documentary by F-Secure where they go and visit the people who wrote it. It's a super interesting watch and I would really recommend it.
Indeed, yes. Microphone 2.
Did you have any chance to look at any kind of viruses that did not modify the files themselves? For example, one of the largest virus infections at the time was a virus called Nymella, which modified the master boot record.
Yeah, master boot record I did consider. It was more of a time problem that I had, in that getting to the point where you could brute-force time and date combinations while also looking for master boot record changes was really hard.
I am super interested in reviewing effectively the rootkits of the era, but yeah, that's definitely something I will look into in the future.
And we have yet another question from the internet. Yeah, it's even from the same guy.
Oh damn.
Is the Ben x86 software open source, or can it be found on the web somewhere?
It probably will be. I wouldn't expect it to work well in any use case though. It's effectively designed to not work correctly, right? Like, you know, what was the spec? It basically fails at every single, anything awkward I just went, oh, that's fine, we're probably far enough down there anyway. Where are we? Yeah, like, be aware, this is the feature list.
So is that a follow-up question from the internet?
No, it's a new one.
Oh, is it a good one?
I don't know how serious it is, but wouldn't it be possible, or a good idea, to use machine learning to create new DOS malware from the existing samples?
It would not be a good idea, but I like how you think. Actually, I saw somebody trying to use NLP to generate viruses, but okay, that's a different, you could probably do Markov chains with x86, to be honest. Please don't do that, please. Don't try this at home. I have seen things, I have seen, just please don't do that.
So I think we've run out of questions. Going once, going twice. Let's thank Ben for this marvelous retrospective talk.