Saturday, May 30, 2020

How to write an Amstrad CPC emulator

0. About this text

This text is intended to shed some light on the internals of Amstrad CPC emulators, giving you an idea about how they work. It focuses mostly on CPE, which I know best, since I wrote it. (This is the missing section "Technical information" from the CPE .doc file...) But you will also find some information about CPCEMU that I have learned from exchanging letters with its author, Marco Vieth. There are currently (supposedly?) two other CPC emulators available, and more are being developed (quite a lot, actually!), but I have less information about these.

I will not describe the CPC specs here in detail, since there are other, better documents available to do this. (See the CPC Guide at this WEB site). But I will describe the basic functionality of the hardware. Occasionally, this text will also deal with other systems than the CPC. For writing this, I only have as a guideline what seems interesting to me. I hope you will find it interesting, too.

This document can probably be enhanced. If you have an idea how, please mail me at <crux@pool.informatik.rwth-aachen.de>.

This text may contain speeling and grammatical mistakes; English is my second language. Please, either ignore them or mail me and they'll be corrected.

1. Overview

What needs to be done to emulate a computer system on a completely different one (which I will call the host system from here)? The answer is rather simple: Just pretend to the software you intend to run that all the hardware of the system to be emulated is present and works as it should. "All the hardware" includes

  • the memory system
  • the CPU
  • possibly some timers
  • interrupts (these belong to the CPU, but when writing an emulator, you will probably come to think of them as something more or less separate)
  • the video hardware
  • supporting chips
  • Input/output, like keyboard or floppy/hd drives
If these are behaving like on the original system, most software ought to run. If you like, you can add less important features like sound support. When the hardware emulation is complete, you may want to enhance the program, so that the emulator becomes even better than the emulated system. (As a matter of fact, this has never happened. All emulators are in some respect inferior to "the real thing". Some important things, especially graphics, are really hard to emulate correctly.) What might be added at a later stage includes:

  • rewrite the system software "native", so that it need not be emulated. This is what WABIs like Wine are all about: They do not in fact emulate hardware, but provide the same system calls for programs to run under the emulator. This technique has not been used in CPC emulators except for replacing the tape or disk I/O routines.
  • provide "snapshots". Thus, you can freeze the system state at any point you like and load it again at some later time. This could be done with additional hardware on systems like the CPC or the C64, but with an emulator it is actually a lot easier.
  • be able to run multiple computers at the same time. There is a C64 emulator that can have more than one C64 emulation at the same time, but these are not executed in parallel. If you are writing for a multi-tasking OS, this issue is rather pointless. But DOS isn't multi-tasking very well. Neither is Windoze.
  • ..insert what you can think of here..
The following chapters will describe possibilities for emulating the hardware mentioned above. I assume throughout that assembly language is used. This is necessary on current systems to achieve maximum speed. For the future, it would be interesting if someone wrote a portable CPC emulator in C that runs in an X Window on Un*x (Linux?). There is an X Window C64 emulator called X64, which runs reasonably well, although it can't be used for most games currently. Also, I assume standard PC hardware, because CPE is written for PCs running DOS.

2. The memory system

Why start with the memory system? Because it is fundamental in a certain way. The memory layout can strongly affect other parts of the emulation as well, for example the CPU and the video emulation. For emulating computers that have memory-mapped I/O (the peripheral chips respond to memory accesses in certain restricted areas) like the C64 or the Amiga, you would have to think about how to distinguish different areas of memory on each access.

Basically, a CPC can access 64K RAM. This is a restriction imposed by the Z80 CPU, which has a 16 bit address bus. Unfortunately (for CPC emulator writers) this restriction has been overcome to a certain degree by special hardware (i.e. the Gate Array). On a Spectrum, which has a Z80 as well, only 64K can in fact be addressed (16K ROM, 48K RAM) which makes that part of the emulation fairly easy. This is a good reason why Spectrum emulators ought to be faster than CPC emulators, even though the emulated CPU is the same.

The 64K available in the CPC are split in 4 banks with 16K each. On every CPC, the upper and the lower bank can be mapped to contain ROM instead of RAM (except on writes, which are ALWAYS directed to the RAM). Thus, a CPC has 64K RAM and 32K ROM that can be made visible to the processor. On a CPC 6128, the situation is made even more complicated by RAM-banking. There are 8 RAM banks of which only four are visible at any time to the processor. These can be exchanged, so that an invisible page becomes visible within the 64K address range and another previously visible page becomes hidden. To make matters worse, the video chip ALWAYS accesses memory in the first four banks, even though they may be invisible to the processor.

ROM banking has to be done in any CPC emulator. If you don't need to emulate a 6128, you may forget about RAM banking for the moment. You can always implement it in a kludgy way by copying loads of memory whenever pages are exchanged, although this is highly inefficient.

In CPE, I first left out RAM banking and only implemented ROM banking. A 64K sized area is reserved for RAM. (Luckily, an Intel CPU can access at least THAT much of memory in one block, or I would have shot myself at this point.) On a write access, data is stored there at the appropriate location. Nothing could be easier. On a read access, the upper two bits of the address are masked out and taken as an index to a table of segments. This table contains four entries, one for each 16K bank of memory. The data is then read from the appropriate offset in this segment.

With this, switching a ROM bank only involves updating this four-word array and is therefore rather efficient. There is a penalty, though, for each read access, because the emulator must first look up the correct segment.
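The read-table scheme can be sketched in a few lines. This is only an illustration in Python, not CPE's actual code (which is assembly and uses segment registers); the names BankedMemory, read_table and select_roms are made up for this example.

```python
BANK_SIZE = 0x4000  # one 16K bank


class BankedMemory:
    """Illustrative model: writes always hit RAM, reads go through a 4-entry table."""

    def __init__(self, lower_rom, upper_rom):
        self.ram = bytearray(0x10000)   # the 64K RAM area
        self.lower_rom = lower_rom      # 16K each, read-only
        self.upper_rom = upper_rom
        self.read_table = []
        self.select_roms(lower=True, upper=True)

    def select_roms(self, lower, upper):
        # Switching ROMs only rewrites the four table entries.
        # Each entry is (buffer, start of the bank inside that buffer).
        self.read_table = [(self.ram, bank * BANK_SIZE) for bank in range(4)]
        if lower:
            self.read_table[0] = (self.lower_rom, 0)
        if upper:
            self.read_table[3] = (self.upper_rom, 0)

    def read(self, addr):
        # The upper two address bits select the table entry,
        # the lower 14 bits are the offset inside the bank.
        source, base = self.read_table[(addr >> 14) & 3]
        return source[base + (addr & 0x3FFF)]

    def write(self, addr, value):
        # Writes ignore the table and always land in RAM.
        self.ram[addr & 0xFFFF] = value & 0xFF
```

The extra table lookup in read() is the per-access penalty mentioned above; select_roms() is cheap because nothing is copied.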

The reverse applies to CPCEMU. Here, the whole system memory is stored in a contiguous 96K area. For reading and writing, two segment registers are set aside. These usually have different values, because different memory areas may be visible for read and write accesses. Also, two 16 bit offsets are kept that are added to any address before the memory access occurs. Take, for example, the following diagram:

                   type   base address
 highest address   ROM    0x0000         lower ROM
                   RAM    0xC000         RAM page 3
                   RAM    0x8000         RAM page 2
                   RAM    0x4000         RAM page 1
                   RAM    0x0000         RAM page 0
 lowest address    ROM    0xC000         upper ROM

 Write offset: 0x0000, read offset: 0x4000

All writes are directed to the central block containing the RAM banks. When the CPC is trying to read, say, from address 0xC000, first the read offset is added to the address, giving a result of 0x0000 (the sum wraps around in 16 bits). This means that the byte at the address 0x0000 in the read segment is read. Look it up in the diagram and you will see that this is the beginning of the upper ROM. The described memory map therefore corresponds to the state "lower ROM disabled, upper ROM enabled": The CPU can still read the lower 48K RAM, but when reading from 0xC000-0xFFFF, it accesses ROM.

If the banking were switched to the state "only RAM enabled", write and read segments/offsets would be set to the same values. All this requires very little overhead.
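Here is a small Python model of the segment-plus-offset idea, as I understand it from the description above. It is a guess at the principle, not CPCEMU's code. I assume the 96K area is laid out with the upper ROM at the bottom, then RAM pages 0 to 3, then the lower ROM (this is what a read offset of 0x4000 implies); the names FlatMemory, map_upper_rom_only and map_ram_only are invented.

```python
BANK = 0x4000

# Assumed physical layout of the 96K area, lowest address first:
# upper ROM, RAM page 0..3, lower ROM.
UPPER_ROM = 0 * BANK
RAM0 = 1 * BANK
LOWER_ROM = 5 * BANK


class FlatMemory:
    def __init__(self):
        self.area = bytearray(6 * BANK)   # 96K in one block
        self.map_ram_only()

    def map_ram_only(self):
        # "only RAM enabled": identical segments and offsets for both directions
        self.read_base, self.read_offset = RAM0, 0x0000
        self.write_base, self.write_offset = RAM0, 0x0000

    def map_upper_rom_only(self):
        # "lower ROM disabled, upper ROM enabled"
        self.write_base, self.write_offset = RAM0, 0x0000
        self.read_base, self.read_offset = UPPER_ROM, 0x4000

    def read(self, addr):
        # The offset is added with 16 bit wrap-around, then the segment base.
        return self.area[self.read_base + ((addr + self.read_offset) & 0xFFFF)]

    def write(self, addr, value):
        where = self.write_base + ((addr + self.write_offset) & 0xFFFF)
        self.area[where] = value & 0xFF
```

Note that a bank switch here only changes four numbers; no memory moves unless both ROMs must be visible at once.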

Unfortunately, it is sometimes necessary to exchange two 16K pages in the 96K area. If you look at the above diagram, you will notice that there is no way to have both ROMs active at the same time. You will have to exchange the lowest RAM page with the lower ROM page to do this. In old versions of CPCEMU, this was a problem, because memory had to be physically copied, and the BASIC emulation became quite slow. Now, the 96K area can be stored in EMS memory and the capability of EMM386 to modify the 386's RAM mappings quickly is used. Thus, RAM access needs hardly any overhead in CPCEMU, but bank switches are slower than in CPE (calling EMS functions takes some time).

The last two updates to CPCEMU include a version that uses a different method for banking. I'm not quite sure about how it's done, but here's my guess: Two 64K EMS frames are allocated (possible with EMS version 4.0, which is provided by programs like EMM386). One is used as a segment for reading, and the other one for writing. The emulator does not have to worry that a modification of RAM in the write page is not reflected in the read page: It uses "aliasing", which means that the same EMS page is present at two different memory locations, so that the CPU sees the same block of memory at two different addresses. This can be done without a problem using the MMU of 386 CPUs.

Currently, another CPC emulator for the PC is being developed by Herman Dullink. This is still in beta stage, but it looks very promising. It also utilizes the advanced memory management features of a 386 CPU to achieve banking. It has its own DOS extender! Unfortunately, it therefore can't coexist with EMM386 (or anything else that switches to V86 mode). Of course, if you can program the 386 MMU directly, you get an enormous speed (the author says it runs at full CPC speed on a 386SX-16). Unfortunately, I currently don't have more information about this.

A short note about RAM banking. In CPE, this is done by having two RAM areas. One is accessible by the Z80 CPU, the other one is a "backup" area where all the invisible RAM is stored. RAM banking is then done by exchanging pages between these two areas, either by copying which is as slow as one would imagine, or by using EMS, which is a little better. In CPCEMU, this is done using EMS as well and fits neatly with the system explained above.

3. The CPU

The CPU used in all the CPCs (as well as in numerous other home computers of that time, like the Spectrum) is a Zilog Z80.

How does a CPU work? It contains a little region of memory where various important data is stored. These are the CPU's registers. When it runs, it reads machine instructions ("opcodes") from a location in memory which is defined by the value of a special register called the program counter (PC). The instruction is decoded and an appropriate action is taken. (More complex CPUs have a special form of software called microcode within them that decodes the instruction. The Z80 does not have a microcode, all its functionality is hardwired. This leads to an interesting effect: Some opcodes that are not officially documented produce interesting and potentially useful results nevertheless, just the results that "should" be there if these opcodes were officially documented. Some people say the Z80 was the most complex processor ever to be made without a microcode. But the M68k FAQ says that the latest 680x0 CPU, the 68060, has no microcode as well! Interesting, but back to the topic...) Some instructions that can occur involve

  • arithmetic instructions: additions, subtractions, on more complex CPUs multiplication and division as well. Not on the Z80.
  • movement: transferring data from one place to another.
  • logical instructions: logical AND, OR, NOT instructions which affect all bits in a register. Register contents can also be shifted or rotated bitwise in various ways.
  • control instructions: branching from one point in a program to another by modifying the value of the PC
  • input/output instructions: The Z80 does not do memory-mapped I/O as described in the previous section. Instead, it can transfer data from and to so-called ports with the IN and OUT instructions. This is used for communicating with all the peripheral chips.
 Many instructions affect the value of a special register called the flag register. For example, load the value 255 in the A register and then add the value 42 to it. Since the A register is only 8 bit wide, it can only hold values between 0 and 255. So, you will get a result of 41 (it wraps around). The flag register will represent this by setting the "carry" flag which is (roughly speaking) the 9th bit of the result. There are also other flags like the zero flag and the sign flag (all values with a set 8th bit are thought to be negative, and the sign flag is set accordingly).
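The 255 + 42 example can be worked through in a few lines of Python. This is just a sketch of an 8 bit addition with three of the flags; the function name add_a and the dictionary of flags are my own invention, and the remaining Z80 flags are left out.

```python
def add_a(a, operand):
    """Add operand to an 8 bit accumulator; return (result, flags)."""
    total = a + operand            # may need nine bits
    result = total & 0xFF          # what actually fits in the A register
    flags = {
        "carry": total > 0xFF,         # the "9th bit" of the result
        "zero": result == 0,
        "sign": bool(result & 0x80),   # 8th bit set means negative
    }
    return result, flags


result, flags = add_a(255, 42)
print(result, flags["carry"])      # 41 True
```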

Most CPUs (including the Z80) have a special register called the stack pointer. This register contains a memory address where certain data can be stored. When data is stored there, the stack pointer is decreased and points to another location to store data. When data is fetched from the stack, the stack pointer is increased again. This is (for example) used to execute subroutines: Before you jump to another point in a program, store the address where the subroutine should return on the stack. When the subroutine ends, it executes the RET instruction that fetches this address from the stack and puts it in the PC.
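A subroutine call and return can be sketched like this. The class and method names are made up, and call() takes the return address as a parameter to keep the example short (a real emulator would derive it from the PC after fetching the operands). The Z80 stores the low byte of a 16 bit value at the lower address.

```python
class StackDemo:
    def __init__(self):
        self.mem = bytearray(0x10000)
        self.pc = 0x4000
        self.sp = 0xC000

    def push16(self, value):
        # The stack grows downwards: decrease SP first, then store.
        self.sp = (self.sp - 2) & 0xFFFF
        self.mem[self.sp] = value & 0xFF                    # low byte
        self.mem[(self.sp + 1) & 0xFFFF] = (value >> 8) & 0xFF  # high byte

    def pop16(self):
        value = self.mem[self.sp] | (self.mem[(self.sp + 1) & 0xFFFF] << 8)
        self.sp = (self.sp + 2) & 0xFFFF
        return value

    def call(self, target, return_addr):
        self.push16(return_addr)   # remember where to continue
        self.pc = target

    def ret(self):
        self.pc = self.pop16()     # fetch the address back into the PC
```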

The Z80 is an extension of the Intel 8080, and therefore can run all 8080 software. For example, many CP/M programs available for the CPC are written for 8080 CPUs. The Z80 has a 16 bit address bus (as mentioned above) and an 8 bit data bus. It's a true 8 bit processor, although some of the 8 bit registers are grouped to 16 bit registers which can be used for arithmetics or addressing memory. The PC and SP are 16 bit wide. The Z80 runs at 4 MHz.

It is amazing to see how similar the Z80 (or in fact the 8080) architecture is to the "modern" design found in a Pentium. For example, most registers are "special purpose" registers, whereas in almost any reasonable newer CPU you have a large set of general purpose registers. A DOS program written a couple of years ago uses a Pentium just the way it would use a Z80: it has a privileged register called the accumulator that more operations can be performed with than with the other registers, there is a "loop counter" register, the 16 bit registers are made of two 8 bit half registers which can be accessed independently, and both access only 64K at a time, which is a shame. Even the flag register has the same format! (This is in fact quite fortunate, since converting flag register contents is no fun thing, as you can see if you look at the source code for the Amiga version of CPE).

How can you simulate all this in software? First, set aside some memory for the registers. It is usually most efficient to use the processor registers of the emulating CPU to store the contents of the emulated CPU's registers. If you don't have enough (on the PC you don't) you'll have to store some of the less used registers in RAM. CPE stores the Z80 registers SP, IX, IY, R and I in memory. I think this is true for CPCEMU also. Basically, you have no choice on a PC. You probably want to have all the registers that are heavily used to be stored in registers as well, and then you have no space left.

You can then write a central loop that fetches the next instruction, increments the PC, decodes the instruction and determines what to do. It then calls the appropriate handling routine for the opcode. When it has executed the opcode, it returns to the central loop. This is a straightforward approach, and it is used in CPE. Decoding an instruction is done by looking it up in a large table. Actually, it is not that large. Opcodes are 8 bit wide on the Z80, so you have 256 of them. Four of these are only prefixes and need to read a sub-opcode which determines the type of action to be done. So, you have a table that contains pointers to about 700 simulation routines (one for each opcode).
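The central loop with its lookup table might look like the following sketch. It is in Python for readability and implements only four real Z80 opcodes (NOP, LD A,n, ADD A,n and HALT); flags and the prefix tables are left out, and the class name TinyZ80 is invented.

```python
class TinyZ80:
    def __init__(self, program):
        self.mem = bytearray(0x10000)
        self.mem[:len(program)] = program
        self.pc = 0
        self.a = 0
        self.halted = False
        # One table entry per opcode; unknown opcodes raise an error.
        self.table = [self.op_unimplemented] * 256
        self.table[0x00] = self.op_nop
        self.table[0x3E] = self.op_ld_a_n
        self.table[0xC6] = self.op_add_a_n
        self.table[0x76] = self.op_halt

    def fetch(self):
        byte = self.mem[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        return byte

    def op_unimplemented(self):
        raise NotImplementedError("opcode before address %04X" % self.pc)

    def op_nop(self):
        pass

    def op_ld_a_n(self):
        self.a = self.fetch()

    def op_add_a_n(self):
        self.a = (self.a + self.fetch()) & 0xFF   # flags omitted in this sketch

    def op_halt(self):
        self.halted = True

    def run(self):
        # Fetch, increment PC, decode through the table, execute, repeat.
        while not self.halted:
            opcode = self.fetch()
            self.table[opcode]()


cpu = TinyZ80(bytes([0x3E, 255, 0xC6, 42, 0x76]))  # LD A,255 / ADD A,42 / HALT
cpu.run()
print(cpu.a)   # 41
```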

You then have to write all these simulation routines. The amount of work for this can vary. It can be hard, if the emulated CPU is very different from the emulating CPU. Look at the Amiga CPE source code to see what I mean. The flags are handled differently, some Z80 flags don't even exist on the 68000 and you can't access the upper half of a 16 bit register on a 68000 without some shifting, whereas you can do this on the Z80 without a problem. Emulating a Z80 on an Intel based PC is much easier. You usually find the same instructions which affect the flags in the same way. You still make a lot of silly errors, though, if you have to write 700 such routines. You'll know there's a bug somewhere in the Z80 part if the 3D graphics in your favourite game look strangely "melted" :-)

The simple approach described above can be optimized in some ways. First off, you probably don't want to jump back to a central loop after each instruction. You can simply append the code that fetches and decodes instructions to each opcode simulation routine, since it is short. This is done in both CPE and CPCEMU. CPCEMU does one more, rather clever optimization: All the instruction simulation routines (at least the first 256 which are the most common) are aligned at 64 byte boundaries. So CPCEMU does not need a lookup table to determine the address to jump to, it can just multiply the opcode by 64 and jump there. I think this is the main reason for the 25% speed advantage that CPCEMU has over CPE. Unfortunately, my opcode simulation routines are somewhat longer than those in CPCEMU, and a 128 byte alignment would cause a HUGE code segment, and since I hate segments, I don't want to have too many of them...

A Z80 has only a limited number of opcodes, so you can hand-code all of these, and you probably want to if you need maximum speed. Other 8 bit CPUs have even fewer meaningful instructions (like the 6510 used in the C64), so the same method can be used here. But what if you want to try to emulate a MC68000 which has thousands of instructions? The best thing is probably to improve the opcode decoding part. The MC68000 has only about 56 different instructions. The enormous variety is produced by different addressing modes that can be used with these instructions. You can move data from an address to a data register, or to a place in memory, etc. There are a lot of combinations. So, you would probably want to have simulation routines for the 56 instructions and special code that handles all the different addressing modes. Thus, a MC68000 emulator might even be about as short as a Z80 emulator (and much more easily debugged), although the CPU has more capabilities.

A very interesting possibility to speed up the CPU emulation is to "compile" the Z80 instructions into native code that the Intel CPU can directly execute. I know one C64 emulator for the Amiga that comes with a special tool that can do exactly this and achieves a very good speed by doing so, even on an Amiga 500. The difficulty is to distinguish code from data, and self-modifying code is pretty lethal.

4. Interrupts

 This chapter is strongly related to the previous one. An interrupt moves the CPU into a state where it executes a special interrupt code that is stored at a well-defined location. Interrupts occur when external hardware signals to the processor that it needs to be serviced. Unless the running program has temporarily disabled the interrupts, the CPU reacts immediately.

In all computer systems, interrupts can occur for various reasons. In the CPC, the only source of an interrupt is a timer that runs at approximately 300Hz (actually, it's not really a timer, but we'll forget about this for now). Other computer systems raise interrupts when a key has been pressed or a character arrived at the serial port, or the sound card has finished playing a sample.

When the Z80 executes an interrupt, it usually pushes the current PC to the stack and starts executing at location 0x0038. (There are other interrupt modes, but I know only one program that actually uses one of them.)

The problem is, how do we know it's time for an interrupt? The best solution is to fiddle with the PC's timer chip which can be programmed to generate interrupts at any frequency. We set it to 300Hz and write a short interrupt handler that sets an "interrupt occurred" flag, which is tested after each Z80 instruction by the CPU emulation. If the flag is set, we know it's time for the next interrupt.
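The flag-polling approach can be modelled as below. The host timer handler is replaced by an ordinary method call, every emulated instruction is treated as a one-byte NOP, and the names FlagPollingCpu, timer_tick and step are hypothetical; only the principle is meant to match.

```python
class FlagPollingCpu:
    def __init__(self):
        self.mem = bytearray(0x10000)
        self.pc = 0x4000
        self.sp = 0xC000
        self.irq_pending = False

    def timer_tick(self):
        # Stands in for the host's 300Hz timer interrupt handler:
        # it does nothing but set the flag.
        self.irq_pending = True

    def step(self):
        # "Execute" one instruction (a NOP in this sketch).
        self.pc = (self.pc + 1) & 0xFFFF
        # Test the flag after each instruction.
        if self.irq_pending:
            self.irq_pending = False
            # Push the current PC and continue at 0x0038.
            self.sp = (self.sp - 2) & 0xFFFF
            self.mem[self.sp] = self.pc & 0xFF
            self.mem[(self.sp + 1) & 0xFFFF] = self.pc >> 8
            self.pc = 0x0038
```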

Of course, testing this bit each time a Z80 instruction has been executed is rather inefficient, since interrupts don't occur often. In CPCEMU, this is done differently. If you are one of the "structured programming people" who can't stand assembly language optimizations, please don't read on and continue with the next chapter.

Still here? Good. In CPCEMU, there is no test of an interrupt flag. Instead, the timer interrupt handler modifies all the instruction simulation routines, replacing the jump to the next Z80 instruction with a jump to the routine that handles a Z80 interrupt. When the interrupt handler returns, it does not matter where the Z80 emulation was interrupted, when the current instruction is complete it will jump to the Z80 interrupt routine. Of course, this routine then has to restore all the simulation routines to their original contents.

Although this method is faster than just checking the interrupt bit, I don't use it in CPE, because it is very difficult to implement correctly and I can use the test of the interrupt flag for other purposes as well (short delays are sometimes needed in the hardware emulation, which can be done by setting up a counter, setting the interrupt flag to a special value and decrementing the counter after each Z80 instruction until it reaches zero, then doing whatever seems appropriate).

What makes the CPCEMU method difficult? From what I've described, it seems to be a clever trick, but not overly difficult. But you don't know everything about Z80 interrupts yet!

Interrupts can be disabled. After a Z80 DI (disable interrupt) instruction, interrupts are forbidden. This does not mean they are ignored, they are just deferred until they are permitted again. Just forgetting these disabled interrupts would cause serious problems for some software.

Also, there is a "feature" in the Z80 EI (enable interrupts) instruction which prevents an interrupt from occurring directly after it. Instead, an interrupt can only occur after the instruction following the EI.
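Deferred interrupts and the EI delay can be expressed with a pending flag and an enable flag. The following is a simplified model, not code from CPE or CPCEMU; instructions are represented by their names as strings, and the class InterruptLogic is made up for the example.

```python
class InterruptLogic:
    def __init__(self):
        self.enabled = False   # state set by EI / DI
        self.pending = False   # an interrupt was requested but not yet taken

    def request(self):
        # Called when the hardware raises an interrupt. While interrupts
        # are disabled the request is remembered, not thrown away.
        self.pending = True

    def execute(self, instruction):
        """Run one instruction; return True if an interrupt is taken after it."""
        just_enabled = False
        if instruction == "DI":
            self.enabled = False
        elif instruction == "EI":
            self.enabled = True
            just_enabled = True    # no interrupt directly after EI
        if self.pending and self.enabled and not just_enabled:
            self.pending = False
            self.enabled = False   # the Z80 disables interrupts on acceptance
            return True
        return False
```

A request made during a DI section is therefore served one instruction after the next EI, which is what the typical "EI / RET" ending of an interrupt routine relies on.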

The HALT instruction, which stops the Z80 until an interrupt occurs, is easier to implement with an interrupt flag. All these don't make the method used in CPCEMU impossible (as you can see, because it works), but quite a lot of work to implement.

5. The video hardware

Most computers generate a video signal using the same basic technique: An electron beam is moved very quickly across the screen, generating intense and not-so-intense dots. The beam builds up lines of pixels from the top of the display to the bottom. Each line is built from left to right. Special synchronization impulses signal to the monitor that a line or the whole frame is complete. To determine the intensity and color of each pixel, information is continually read from the video RAM, which may be in the same memory as programs and data (as in most early home computers, like the CPC), or be in a reserved video RAM, as in VGA cards.

In the CPC, two chips are responsible for generating a video signal: The CRTC (Cathode Ray Tube Controller) and the Gate Array. The CRTC generates the addresses in the video RAM that are to be read and the VSYNC and HSYNC impulses. The Gate Array reads the memory and generates the video signal for the monitor.

The CPC has three different video modes. In mode 2, the usual resolution is 640x200 pixels. Each pixel can have one out of two different colors, which can be chosen freely among the 27 available colors. In all modes, pixels have the same height, but they can be twice (mode 1) or four times (mode 0) as wide as in mode 2. The lower resolution allows for more colors: four in mode 1, 16 in mode 0 (always out of 27). The Gate Array must be programmed to set resolution and colors, since this chip creates the video signal.

The resolutions of 160x200, 320x200 or 640x200 are in fact not obligatory. You can program different resolutions as well, and many games use the ZX Spectrum resolution of 256x192 pixels (guess why...). This can be done by programming some of the CRTC's registers.

Let's start with the CRTC. For itself, this chip is probably not evil. But in the CPC, it has been connected in a very strange way that makes the organization of the video RAM a mess. Turn on your CPC, scroll up or down a couple of lines, and try the following BASIC statement to see what I mean:

 FOR i=&C000 TO &FFFF:POKE i,255:NEXT 


It fills the video memory from the first byte to the last byte. But it does not seem to fill the screen in any particularly organized pattern. First, it draws the first line of every character row, then the second, up to the 8th line. It doesn't even start in the top left corner, but somewhere in the middle of the screen.

The thought behind all this was probably to make text-mode character drawing and scrolling as easy as possible. But for graphics, it's a nightmare. Here's how it works: The CRTC has registers to store the base address of the video RAM. It can be made to use any of the 16K memory ranges 0x0000-0x3FFF, 0x4000-0x7FFF, 0x8000-0xBFFF or 0xC000-0xFFFF. Usually, the last one is active in the CPC, but many games use the second one, too, for double buffering effects.

Additionally, there is a register to store the height of one character in scan lines. Another register stores a starting offset. With all this information, the 16 bit address generated by the CRTC looks like this:

 Bit 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
     \___/ \______/ \______________________________/
       |       |                   |
       |       |                   |- Row offset
       |       |
       |       --- character row. Usually values from 0 to 7
       |
       |---------- base address: 0x0000, 0x4000, 0x8000, 0xC000.
                   Doesn't change while the screen is displayed.

When the screen is displayed, the row offset is initialized with the CRTC offset register. Then, the CRTC draws the first scan line of the first row of characters, incrementing the offset after each character. The offset is an 11 bit value; there is NO carry into the character row field. The row field is incremented until the height of one character has been reached (usually eight scan lines). Each of these eight lines starts with the same value in the offset part. After that, the offset is incremented by the number of characters in one row, and everything starts again for the second row of characters.
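Put into code, the address layout described above comes out as follows. The sketch follows the simplified model of this text (one byte per character, 80 characters per row by default); the function name screen_address and its parameters are made up.

```python
def screen_address(base, start_offset, text_row, scan_line, column, row_width=80):
    """Return the 16 bit video RAM address of one byte on the screen.

    base         -- 0x0000, 0x4000, 0x8000 or 0xC000
    start_offset -- contents of the CRTC offset register
    text_row     -- character row, counted from the top
    scan_line    -- line inside the character, 0 to 7
    column       -- character column, counted from the left
    """
    # The offset is only 11 bits wide and wraps without carrying
    # into the character row field.
    offset = (start_offset + text_row * row_width + column) & 0x7FF
    return (base & 0xC000) | ((scan_line & 7) << 11) | offset


print(hex(screen_address(0xC000, 0, 0, 1, 0)))   # 0xc800
```

This is why the BASIC loop above first fills line 0 of every character row, then line 1, and so on: consecutive addresses differ in the offset bits, and the scan line only changes every 2048 bytes.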

There are usually 80x25 characters in mode 2. Each is 1 byte wide and 8 pixels high, so that there are 80x25x8 = 16000 bytes visible in screen memory. 384 bytes of the screen bank are unused: The row offset, if initialized with 0, will only count to 2000 (80x25), so that there are 8 holes in the bitmap with 48 bytes each. If you program the CRTC to display more pixels, some of the data will appear twice on the screen. For example, if you make characters nine pixels high instead of eight, the ninth line of each character will contain the same data as the first line.
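The numbers above are easy to check, and the same arithmetic tells you where the eight holes are. The sketch assumes the offset register is 0 and the screen is at 0xC000; the constant and function names are mine.

```python
BANK_BYTES = 0x4000            # one 16K screen bank
CHARS = 80 * 25                # 2000 character positions
SCAN_LINES = 8                 # lines per character

VISIBLE = CHARS * SCAN_LINES           # 16000 bytes shown on screen
BLOCK = BANK_BYTES // SCAN_LINES       # 2048 bytes per scan line block
HOLE = BLOCK - CHARS                   # 48 unused bytes per block
UNUSED = HOLE * SCAN_LINES             # 384 unused bytes in total


def unused_ranges(base=0xC000):
    """Return (first, last) address of each hole, assuming offset register 0."""
    return [(base + line * BLOCK + CHARS, base + (line + 1) * BLOCK - 1)
            for line in range(SCAN_LINES)]
```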

Now for emulation. Of course, you have to think of some clever scheme that will translate the CPC video address into the equivalent VGA address. If you get this right the first time, you are probably a genius.

There are different ways to ensure that whenever something is written to the CPC screen memory, the VGA memory is updated. You can monitor each write access to the CPC RAM. That's done in CPE. Each Z80 instruction that writes something to memory includes a check whether the region written to lies in screen memory. CPCEMU performs a sampled update only at the beginning of each frame. It keeps an extra region of memory where it stores the previous contents of the CPC screen memory. This is compared to the new contents to determine whether something has changed. These changes are then written to the VGA and to the backup region.
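The once-per-frame comparison can be sketched as below. This shows only the principle as I described it, not CPCEMU's code; the function name update_changed and the draw callback (which would do the actual VGA write) are placeholders.

```python
def update_changed(screen, shadow, draw):
    """Compare screen memory with the backup copy and redraw what differs.

    screen -- current contents of the CPC screen bank
    shadow -- bytearray holding the contents as of the previous frame
    draw   -- callback draw(offset, value) that updates the host display
    Returns the number of bytes that had changed.
    """
    changed = 0
    for offset, value in enumerate(screen):
        if shadow[offset] != value:
            draw(offset, value)        # write the change to the display...
            shadow[offset] = value     # ...and to the backup region
            changed += 1
    return changed
```

Compared with checking every Z80 write, this costs a fixed scan of 16K per frame, but the CPU emulation itself stays free of screen checks.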

The new emulator by Herman Dullink uses a similar method, but since it has full control of the 386, it can be even more efficient: It uses the page dirty bit to determine whether a page was modified.

Occasionally, it may be necessary to redraw the complete screen, for example when the dimensions have changed or the mode is modified. When the CPC scrolls the screen, it modifies the offset part. You can then either redraw the screen (then this routine had better be fast) or you can try to modify some VGA registers to follow the scrolling. Some CPC programs use double buffering, switching quickly between two different screens. To avoid redrawing, you might want to keep an updated second screen in another VGA page, so that you can switch quickly as well. CPE uses both VGA scrolling and double buffering emulation, but occasionally the screen has still to be redrawn. For this, an offset lookup table is generated that allows an easy translation of CPC addresses to VGA addresses.

But there are some even more basic problems. For example, which VGA mode should one choose?

The answer seems simple: The CPC can display a maximum of 16 colors at a time, in a maximum resolution of 640x200 pixels. There is a standard VGA mode which has exactly those specifications. Great, isn't it?

In fact, all CPC emulators I know of use this mode for standard graphics. The problems begin when someone programs the CRTC to display a resolution of 512x256 pixels. The lower 56 lines will be truncated. There are also some (less common) programs that make the screen wider than usual.

There is no good solution for this problem. In CPCEMU, the screen is automatically made 640x400 pixels high if this is needed. This can be done without effort by clearing one bit in the VGA control registers. The aspect is of course distorted, then. In CPE, I use 800x600 pixel SVGA mode (which isn't distorted, but renders a fairly small picture), or, for non-SVGA boards, 640x350.

It is absolutely vital to do some thinking about how the VGA memory is organized. In 16 color mode, you usually have four bitplanes. Modifying all of these can be highly inefficient if it's done the wrong way. Also, the bytes from the CPC screen memory can't just be written to VGA memory, they need to be converted. In CPE, all this is done by pre-calculating a 3*256 byte table in an invisible region of the VGA memory. The byte from the CPC screen memory, together with the current mode, serves as an index into this table. When you read one VGA byte in 16 color mode, four bytes are in fact read from memory into the VGA latches. When you write to another byte in VGA mem, the contents of these four latches are stored to the destination. This solves the problem of addressing several bitplanes, and converting the byte value only involves taking it as an offset. Other schemes to do this are possible as well (you can turn off the bitplane mode completely, but it needs some more hacking).

Another problem: The 16 color restriction can be circumvented by clever programming. In the CPC, interrupts are generated by dividing the HSYNC signal by 52. The total frame is 312 lines high. (I'm talking PAL here. For NTSC machines, the timing is different.) Thus, when an interrupt occurs, and you wait for the 6th interrupt after it, when this one occurs, the electron beam will be at the EXACT SAME position where it was when the first interrupt occurred. The interrupts split the screen in six zones. By modifying the mode and color registers of the Gate Array on each interrupt, you can have six different regions on your screen (although two of them are usually in the border). This technique is widely used in games.
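The claim that every sixth interrupt hits the same beam position follows directly from 312 = 6 x 52. A few lines of Python make this visible; the 50Hz frame rate is the PAL value, and the function name interrupt_lines is invented for the example.

```python
LINES_PER_FRAME = 312        # PAL frame height in scan lines
HSYNCS_PER_INTERRUPT = 52    # the HSYNC signal is divided by 52
FRAMES_PER_SECOND = 50       # PAL frame rate

INTERRUPTS_PER_FRAME = LINES_PER_FRAME // HSYNCS_PER_INTERRUPT       # 6
INTERRUPTS_PER_SECOND = INTERRUPTS_PER_FRAME * FRAMES_PER_SECOND     # 300


def interrupt_lines(first_line, count):
    """Scan line (within the frame) at which each successive interrupt occurs."""
    lines = []
    line = first_line
    for _ in range(count):
        lines.append(line)
        line = (line + HSYNCS_PER_INTERRUPT) % LINES_PER_FRAME
    return lines


print(interrupt_lines(0, 7))   # [0, 52, 104, 156, 208, 260, 0]
```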

It is also possible, but much more time-consuming, to use knowledge of the Z80 instruction timing to generate even better effects. For example, you can wait for the topmost (vertical blank) interrupt. Then, you can let the Z80 execute 42 times 42 NOP (no operation) instructions. When it's finished, the electron beam will just be displaying the pixel with the coordinates (314, 159) and you can determine exactly what color and mode it should have. With this technique, "copper" effects like on the Amiga are perfectly possible on the CPC, and perfectly inefficient, which is why usually only demos use it.

There are ways to emulate these effects. Let's start with the first one, because it splits the screen into only six distinct zones. We have two problems: multiple colors and multiple modes.

The color effects won't work on the emulator, because the timing of a VGA card is different from that of the CPC. If you just change the colors in sync with the CPC interrupts, everything will flicker. In CPE, I have tried to solve this problem by making the VGA display the screen at the exact same frequency as the CPC: 50Hz. I wouldn't recommend trying this. The code is big, fat and hairy. The problem is to synchronize the PC timer with the VGA card. It can't be done exactly, so the timer interrupt handler has to check if they are in sync. If not, it makes the VGA display a couple of lines shorter. The position of the electron beam on a CPC vertical blank interrupt will therefore move across the screen relatively fast (you can watch this behaviour in CPE: When a program initializes its multicolor effects, the color zones move across the screen quickly). When it hits the VGA vertical blank interrupt, the VGA display is made a little longer again. The color zones will then stay relatively stable, but the synchronization can't be perfect, so the interrupt code has to keep checking. Sometimes, it will adjust the length of the VGA display a bit. On some monitors, this will cause the display to "jump".

The multimode effects can be addressed by keeping a six entry table that contains mode information. It is updated on each interrupt. Screen updates have to take this into account whenever they modify a byte. This sounds easy, but often gets the effect wrong. Both CPE and CPCEMU try to do this, but sometimes timing problems (or whatever, I haven't completely figured it out) cause the effect to fail.
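
A minimal sketch of such a table might look like this. It is not how CPE or CPCEMU actually do it. It assumes that zone 0 starts at raster line 0 and that every zone is exactly 52 lines high, which is a simplification; the class and method names are invented for the example.

```python
# Sketch: a six-entry table that remembers the screen mode of each zone.
ZONE_HEIGHT = 52   # lines between two interrupts
ZONES = 6

class ZoneModes:
    def __init__(self, initial_mode=1):
        self.modes = [initial_mode] * ZONES
        self.zone = 0                    # zone the beam is assumed to be in
        self.current_mode = initial_mode

    def on_interrupt(self):
        """Called on every emulated CPC interrupt: the beam enters a new zone."""
        self.zone = (self.zone + 1) % ZONES
        self.modes[self.zone] = self.current_mode

    def set_mode(self, mode):
        """Called when the program writes a new mode to the Gate Array."""
        self.current_mode = mode
        self.modes[self.zone] = mode

    def mode_for_line(self, line):
        """Screen updates ask which mode applies to a given raster line."""
        return self.modes[(line // ZONE_HEIGHT) % ZONES]
```

Whenever the emulator redraws a byte, it would call mode_for_line() with the raster line of that byte and convert the byte with the mode it gets back.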

If you want to emulate the second described effect, which allows total freedom in color and mode characteristics, things become more difficult. With the second release of CPE, I include a program called CPE2.EXE which tries to achieve an exact video emulation in all circumstances. It uses the standard VGA mode with 320x200 pixels in 256 colors.

You may ask: "What, 320 pixels wide? I thought the CPC had a resolution of 640x200 pixels?!" Of course, you are right. Mode 2 doesn't look terrific when this program tries to emulate it. But since mode 2 is hardly ever used in programs which need an exact graphics emulation, this is not so much of a problem. I have tried to use a VESA mode with 640x200x256 pixels, but it didn't work at all. (Couldn't set the colors! What is this?)

Using 256 color mode doesn't by itself guarantee an exact color emulation. CPE2.EXE uses a very different method to update the screen. The screen is redrawn 50 times a second, and not in one piece, but parallel to the CPU emulation. First, the CPU executes a certain number of instructions, until it has used up all the clock cycles that a "normal" Z80 would use while one raster line is displayed on the screen. After the CPU is finished, the raster line is drawn with the current color and mode information. (Of course, this still isn't completely exact if the colors are changed in the middle of a line, but this hardly ever happens.) When the complete frame is drawn, the emulation is stopped until a 50Hz timer interrupt happens. Thus, an exact color emulation and a real-time CPU is achieved with this method. Unfortunately, this method is very time-consuming. Even on a 486DX2-66, it can be slower than the original, although usually it has the correct speed. It might be sped up by allowing the user to specify how often the screen should be redrawn. If this were set to "only every 2nd frame", the speed of a 486DX2-66 should be sufficient in all cases, I think.
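
The main loop of such a scheme can be sketched as follows. This is not the code of CPE2.EXE. The figure of 256 clock cycles per raster line is an assumption (a 4 MHz Z80 and a 64 microsecond PAL line), the FakeCpu class merely stands in for a real Z80 core, and draw_line and wait_for_timer are hypothetical callbacks.

```python
# Sketch: run the CPU one raster line at a time, then draw that line.
CYCLES_PER_LINE = 256    # assumption: 4 MHz Z80, 64 microseconds per line
LINES_PER_FRAME = 312

class FakeCpu:
    """Stand-in for a Z80 core: every instruction costs 4 cycles."""
    def __init__(self):
        self.executed = 0

    def step(self):
        self.executed += 1
        return 4             # cycles used by the instruction

def run_frame(cpu, draw_line, wait_for_timer, carry=0):
    """Emulate one frame; returns the cycle balance to carry into the next."""
    budget = carry
    for line in range(LINES_PER_FRAME):
        budget += CYCLES_PER_LINE
        while budget > 0:            # use up the cycles of this raster line
            budget -= cpu.step()
        draw_line(line)              # draw with the current colors and mode
    wait_for_timer()                 # block until the 50Hz timer fires
    return budget                    # zero or negative: cycles overspent
```

Carrying the overspent cycles into the next frame keeps the average CPU speed correct even when an instruction runs past the end of a line.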

By the way, this method is also used in the two best C64 emulators for the PC. On a C64, this type of effect is even more common, simply because it's much easier to achieve. And if you wanted to write an Amiga emulator, you would probably have to extend this concept not only to draw the screen in single lines, but to stop the processor emulation after each instruction to perform the actions of the custom chips (blitter, copper) and update the next few pixels on the display. Some people insist that an Amiga emulator is impossible because of all this, but I don't agree. If one were written, it would probably be unbearably slow on current hardware. But look at Amiga CPE and all the C64 emulators for the Amiga. They are all pretty unusable on a standard Amiga 500. All that's needed to achieve a reasonable emulation is a faster CPU.

6. Input/Output

An emulator which can't load programs is not very useful. Emulators for the CPC (and C64, too) emulate disk and tape access. Disk access is provided by storing the content of one 3" disk into a large (200K) file, called the "disk image". Tape files are simply stored in a special directory and accessed each time the CPC tries to read from the "tape".

The way the CPC ROM accesses the disk and tape hardware is rather low-level. The signals coming from the tape recorder are "digitized" and can be read from a single bit in the 8255 I/O controller. The ROM times how long this bit stays high or low and constructs a bit stream from these timings.

No sensible human being would want to emulate the behaviour of this bit. (One could imagine, though, sampling a whole CPC tape using a SoundBl*ster card and analyzing this data with a special program. I think there is a Spectrum emulator that tries this, with moderate success.) Instead, you want to trap the ROM routines that are responsible for reading the tape. This can be done simply by modifying the entry point to contain one of the Z80 illegal opcodes. Then, you have to make your Z80 emulation treat this special opcode so that an appropriate routine is called that pretends that the ROM routine was executed.
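
The trapping mechanism can be sketched like this. Everything here is illustrative: the opcode ED FF is just one possible choice of an opcode that normal programs are assumed not to use, the address 0x2000 is a made-up entry point rather than a real ROM address, and the TrapCpu class implements nothing of the Z80 except the trap itself.

```python
# Sketch: patch a routine's entry point with a trap opcode and dispatch on it.
TRAP_OPCODE = (0xED, 0xFF)   # hypothetical choice of an unused opcode

class TrapCpu:
    def __init__(self, memory):
        self.mem = memory
        self.pc = 0
        self.sp = 0xC000
        self.traps = {}              # patched address -> host routine

    def install_trap(self, address, handler):
        self.mem[address] = TRAP_OPCODE[0]
        self.mem[address + 1] = TRAP_OPCODE[1]
        self.traps[address] = handler

    def push16(self, value):
        self.sp = (self.sp - 2) & 0xFFFF
        self.mem[self.sp] = value & 0xFF
        self.mem[(self.sp + 1) & 0xFFFF] = value >> 8

    def pop16(self):
        value = self.mem[self.sp] | (self.mem[(self.sp + 1) & 0xFFFF] << 8)
        self.sp = (self.sp + 2) & 0xFFFF
        return value

    def call(self, address, return_to):
        """What a Z80 CALL does: push the return address, then jump."""
        self.push16(return_to)
        self.pc = address

    def step(self):
        opcode = (self.mem[self.pc], self.mem[(self.pc + 1) & 0xFFFF])
        if opcode == TRAP_OPCODE and self.pc in self.traps:
            self.traps[self.pc](self)    # host code replaces the ROM routine
            self.pc = self.pop16()       # then behave like a RET
            return
        raise NotImplementedError("a real emulator decodes the opcode here")
```

The handler gets the CPU object, so it can read the parameters from the registers and store results the way the original routine would before the emulated RET returns to the caller.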

In CPE, only one redirection is necessary: CAS READ. This routine is supposed to read the next block from tape, and the replacement code does exactly this: it reads the next block from the tape file. If the end of file is reached, the next file in the directory is scanned. If there is no next file, it moves to the beginning of the directory. The tape therefore loops (more a microdrive than a tape really!)

To speed up the searching process, the routine CAS IN OPEN is also redirected. When the CPC wants to open a specific file, the file pointer is set to the appropriate entry in the directory. This allows for more speedy loading.
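
The two replacement routines could look roughly like this. It is a sketch only, not CPE's code: the files are kept in memory as (name, data) pairs instead of being read from a directory, the list is assumed to be non-empty, and the block size of 2048 bytes is an assumption.

```python
# Sketch: a looping "tape" made of the files in one directory.
class TapeDirectory:
    def __init__(self, files, block_size=2048):
        self.files = list(files)      # list of (name, data) pairs, non-empty
        self.block_size = block_size  # assumed size of one tape block
        self.index = 0                # current file in the directory
        self.offset = 0               # read position within that file

    def open(self, name):
        """Replacement for CAS IN OPEN: jump straight to the wanted file."""
        for i, (fname, _) in enumerate(self.files):
            if fname == name:
                self.index, self.offset = i, 0
                return True
        return False

    def read_block(self):
        """Replacement for CAS READ: return the next block, looping forever."""
        name, data = self.files[self.index]
        if self.offset >= len(data):
            # End of file: go to the next file, wrapping to the first one.
            self.index = (self.index + 1) % len(self.files)
            self.offset = 0
            name, data = self.files[self.index]
        block = data[self.offset:self.offset + self.block_size]
        self.offset += self.block_size
        return name, block
```

Without open(), a program asking for the last file would have to "play" through every block in front of it, just like on a real tape.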

CPCEMU redirects more ROM routines and therefore can be a little more user-friendly, showing the tape directory in a window when you type "CAT" or being able to let you select files. It can also write to tape.

You might use the same approach to emulate disk files, and it is in fact the easiest. But you may run into trouble with software that directly accesses the floppy hardware (copy-protected software, for example). Floppy support can alternatively be implemented by emulating the behaviour of the FDC (floppy disk controller). This is done by CPE and CPCEMU. The FDC can be given commands like "step inward", "read sector 4", "write sector 6" etc. This was the hardest part for me to implement, because I had no documentation for the FDC. I only had a ROM listing that I had printed out myself. From this, I tried to guess how the hardware works. Because I don't really remember how it is done (I am happy that it at least works in CPE and I need not do anything more about it), I won't describe any details here.

Herman Dullink's new emulator uses the fact that the FDC in the PC is the same as the one in the CPC. It just provides an interface between the CPC's port addresses and the PC's FDC addresses. Thus, it can directly read CPC formatted disks.

One important I/O device is still missing: The keyboard. The CPC scans a keyboard matrix 50 times a second. If a key on the PC is pressed or released, this matrix has to be updated. You might install a keyboard interrupt handler that reacts to raw keyboard signals. Of course, the PC keyboard layout is different from the CPC layout, so you have to think of some "best fit". You will also have trouble with different keyboard languages. Alternatively, you might make your emulator react to processed keyboard events by letting the standard keyboard handler do its work and map the input from the PC key buffer to the CPC key matrix. But then, you may have to generate more than one CPC keyboard event when one PC key is pressed, to take shift and control keys into account as well. Usually, you will prefer the first method.
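
The first method can be sketched as follows. Several details are assumptions that this text does not state: a matrix of 10 rows with 8 keys each, a cleared bit meaning "pressed", PC scancodes that have bit 7 set on release, and the three example entries in the mapping table. A real table would of course cover the whole keyboard.

```python
# Sketch: keep the CPC key matrix up to date from raw PC scancodes.
ROWS = 10   # assumed matrix size: 10 rows of 8 keys

class KeyMatrix:
    def __init__(self):
        self.rows = [0xFF] * ROWS    # assumed: 1 = released, 0 = pressed

    def set_key(self, cpc_key, pressed):
        row, bit = divmod(cpc_key, 8)
        if pressed:
            self.rows[row] &= ~(1 << bit) & 0xFF
        else:
            self.rows[row] |= 1 << bit

    def read_row(self, row):
        """What the emulated keyboard scan sees for one matrix row."""
        return self.rows[row]

# Illustrative "best fit" table: PC scancode -> CPC key number.
PC_TO_CPC = {
    0x01: 66,   # Esc
    0x1C: 18,   # Enter -> Return
    0x39: 47,   # Space
}

def handle_scancode(matrix, scancode):
    """Called from the keyboard interrupt handler with the raw scancode."""
    released = bool(scancode & 0x80)
    cpc_key = PC_TO_CPC.get(scancode & 0x7F)
    if cpc_key is not None:
        matrix.set_key(cpc_key, not released)
```

Because every PC key maps to exactly one matrix position, shift and control need no special treatment here, which is one reason to prefer this method.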

7. "Now I know how I can write a CPC emulator, but WHY should I do so?"

Nostalgia, perhaps? Because you like all the old games you had for this computer better than today's stuff that comes on six CD-ROMs and plays itself? Because you have written some good programs for it that you would like to continue using?

If none of the above apply to you, go write a boring spreadsheet.


This text was written by Bernd Schmidt (author of the CPE emulator for the PC and Amiga).


from Hacker News https://ift.tt/2yPcNko
