S5 A Microprogrammed CPU

And what rough beast, its hour come round at last,
Slouches toward Bethlehem to be born? W.B. Yeats


S5 / 1 The Second Coming

In the last chapter of the DDZO text we glimpsed, through the binoculars of register transfer logic, a sphinx of a computer. In this final supplement we approach more closely to the computer, and try to solve its riddle. A computer can be divided into a central processing unit (CPU) and a read/write memory system. The CPU can be partitioned further into a combinational arithmetic-logic-shifter unit surrounded by clocked registers and controlled by (what else?) a control unit.



The major riddle in a computer is the control unit, and that will be our focus in this chapter. We will study a systematic approach called microprogrammed control. Before we explain microprogramming, we outline a CPU architecture to be controlled, with inspiration taken from the AMD 2901 single-chip registered ALU "slice." And before we slice open the (now obsolete) 2901, we step back to consider the place of the human in all this:


The information flow above addresses two questions often asked about a computer: "How does its memory fill up with information?" and "How does data get into its registers?" It can all start with a human typing at a keyboard.

S5 / 2 2901 jr.

Here are an ALU & registers, in the architecture we'll design a controller for. Like some of the multi-register machines we saw in Chapter 11, this one has a MUX insinuated between the registers and the two ALU inputs.


Two other choices appear on the MUX inputs: all zeros and EXTERNAL DATA. EXTERNAL DATA can be an entry point for data from memory. The MUX has two outputs, labelled R and S, which then project to the two ALU input ports. The number of combinations of 4 things taken two at a time is

$\binom{4}{2} = \frac{4!}{2!\,2!} = 6$

so 3 bits of MUX select are needed to select all input combinations. Such a calculation assumes that the two input ports to the ALU are treated equivalently by the ALU operations, which may not be true (think about subtraction).
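The count above can be checked with a short Python sketch. The input names in `MUX_INPUTS` and the helper `select_bits` are labels chosen here for illustration; the second count (ordered pairs) shows what happens if the R and S ports are not interchangeable, as with subtraction.

```python
from math import comb, perm

# The four MUX inputs named in the text (labels are illustrative).
MUX_INPUTS = ["RAM", "Q", "ZERO", "EXTERNAL DATA"]

def select_bits(choices):
    """Smallest number of select bits that can distinguish `choices` cases."""
    return (choices - 1).bit_length()

# Unordered pairs: the two ALU ports are treated as interchangeable.
unordered_pairs = comb(len(MUX_INPUTS), 2)
# Ordered pairs: (R, S) differs from (S, R), as it does for subtraction.
ordered_pairs = perm(len(MUX_INPUTS), 2)

print(unordered_pairs, select_bits(unordered_pairs))
print(ordered_pairs, select_bits(ordered_pairs))
```

If port order matters, the number of distinct selections doubles and one more select bit would be needed, which is one reason a real part may offer both "S minus R" and "R minus S" as ALU functions instead.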

Via the 4-bit address, the 16-word RAM allows one of its words access to the SOURCE MUX. If write-enable is active, then the output of the shifter will be written to the RAM location specified by the address, when the clock is LO. Like the SRAM's we considered in chapter 8 DDZO, the write-enable is level-, not edge-sensitive, in the 2901 RAM.
Since writing is a level-sensitive operation, the RAM system needs an output latch (slave, not shown above) to avoid data racing around from shifter to RAM and back again.


If the latch is transparent when clock is HI, and holds previous data when clock is LO, then racing will be prevented.
The output of the shifter will be written to the Q-register on the rising edge of a clock pulse if the register is enabled, otherwise previous data is held.

The ALU is an eight-function combinational circuit with a carry-in (CIN) and several status outputs. Three "instruction" pins select one of the functions. The functions are--
R plus S
S minus R
R minus S (for both subtractions, CIN = 1 gives the 2's complement result)
R OR S
R AND S
R-bar AND S (R is complemented before the AND)
R EXOR S
R EXNOR S

The status outputs are
C-OUT
G
P
overflow
sign bit (MSB)
OUT = 0000
COUT is generated by internal carry-look-ahead circuitry, and G and P are the CLA "generate" and "propagate" signals for connection to a second-order CLA chip, the 2902. Overflow = CN+1 EXOR CN.
The shifter here has the same three choices (left, right, no-shift) as the shifter designed in the previous §; the least and most significant bits are led out as pins for connection to another "slice", or to define circular, straight, arithmetic, etc. shifts.
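The eight ALU functions and most of the status outputs listed above can be modelled in a few lines of Python. This is a behavioral sketch of a 4-bit slice, not a description of the 2901's internal circuitry: the function names are the strings used in the list above, the G and P outputs are left out, and setting COUT and overflow to 0 for the logic functions is an assumption made here for simplicity.

```python
MASK = 0xF  # 4-bit data path

def add4(a, b, cin):
    """4-bit add returning (sum, carry-out, overflow)."""
    total = a + b + cin
    carry_into_msb = ((a & 0x7) + (b & 0x7) + cin) >> 3
    cout = total >> 4
    # Overflow is the EXOR of the carries into and out of the MSB.
    return total & MASK, cout, cout ^ carry_into_msb

def alu(func, r, s, cin=0):
    """Behavioral model of the eight-function ALU (G and P omitted)."""
    if func == "R plus S":
        out, cout, ovr = add4(r, s, cin)
    elif func == "S minus R":
        out, cout, ovr = add4(s, ~r & MASK, cin)  # 2's complement when cin = 1
    elif func == "R minus S":
        out, cout, ovr = add4(r, ~s & MASK, cin)  # 2's complement when cin = 1
    else:
        logic = {
            "R OR S": r | s,
            "R AND S": r & s,
            "R-bar AND S": ~r & s,
            "R EXOR S": r ^ s,
            "R EXNOR S": ~(r ^ s),
        }
        # Assumption: carry and overflow are reported as 0 for logic functions.
        out, cout, ovr = logic[func] & MASK, 0, 0
    return {"out": out, "cout": cout, "overflow": ovr,
            "sign": out >> 3, "zero": int(out == 0)}

print(alu("R plus S", 7, 1))          # signed 7 + 1 does not fit in 4 bits
print(alu("R minus S", 5, 3, cin=1))  # 5 - 3 in 2's complement
```

The first call shows why overflow and carry-out are separate status bits: 7 + 1 produces no carry out of the slice, yet the signed result is wrong.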

S5 / 2 / 1 A real 2901

is a 40 pin bipolar TTL chip with a 4-bit data pathway. It has 3-state outputs from its ALU. The output of the device comes from the ALU, not the shifter. The shifter is really two shifters, one each for the Q register and for RAM. The RAM is "dual port" with two sets of 4-bit addresses (A & B) and two output ports to the Source-Select MUX. Only the RAM location addressed by the B pins can be written to, but both addresses can be read from. There are latches on each of the RAM ports, as needed to avoid race. The input MUX has 5 choices (zero, external, A-RAM, B-RAM, Q) and a subset of 8 of the possible 10 combinations can be selected (B-Q & B-D are not possible).
One 2901 is intended as a "slice" of a larger 4xN bit CPU; since RAM and Q have their own shifters, 4 pins (RAM0, RAM3, Q0, and Q3) must be connected with neighboring slices.
To complete the slice connections, hook the CN+4 pins of lower order slices to the CN pins of their neighboring higher order slices.
Because it has a clock, the 2901 speed should be gauged by its maximum clock rate, which for the 2901C is 15 MHz. Propagation delay from clock to output pin "F3" is 50 nsec.
More information about the 2901 can be obtained from the AMD data book cited below, or JD Lab Manual +. AMD subsequently developed the 48-pin 2903 4-bit "super slice," with 16 ALU instructions. See an old AMD data book for comparison.

The 2900 series of chips is fabricated with bipolar, not MOS, technology. At the time it was designed, the 2901 was much faster than any CMOS microprocessor or memory components. In particular, its 16 word "cache memory" had much shorter access times than the dynamic RAM used in main memory.

There are nine other pins on the 2901 we haven't mentioned yet--instruction pins. Three pins each to instruct the 2901 about which SOURCEs for ALU input, which FUNCTION for the ALU to perform, and what DESTINATION the ALU output should go to. Since the shifters on the real 2901 are closer to the destination RAM and Q registers, shifter function is specified by the DESTINATION "field" of the instruction.
We already know what eight arithmetic and logic operations the 3 FUNCTION pins can select, and we know that the SOURCE field selects a pair of inputs from among a choice of 4 or 5 MUX inputs.
The eight destination instructions, in RTL notation, include
RAM ← ALU (write the ALU output to RAM)
Q ← ALU (load Q)
no-op (nothing loaded; just let a clock cycle pass; used as a wait state)
RAM ← RAM/2 ; Q ← Q/2 (shift RAM & Q down)
RAM ← 2×RAM ; Q ← 2×Q (shift RAM & Q up)
where "RAM" means the RAM location addressed by the "B" pins.

Advanced Micro Devices Bipolar Logic and Interface Data Book. AMD, 901 Thompson Place, Sunnyvale, CA 94088.

G.J. Myers, Digital System Design with LSI Bit-Slice Logic. John Wiley & Sons, New York, 1980. See chapter 3, "ALU/Register slices."

Mick & Brick, Bit Slice Microprocessor Design, McGraw-Hill, New York, 1980. See chapter III, "The data path."

S5 / 2 / 2 Instruction and status registers

Now that we have a registered ALU in hand, we can design a controller for it. Before doing so, we list the pins which will attach to the controller.


That's at least 16 pins, many more than the 4-bit data path of one of our registered ALUs. That's OK. We'll call for a 16-bit register, and later we will need to add more bits as the instruction register evolves into a "pipeline" register.
The registered ALU will return six status signals to our controller, which can be used for conditional branching in the microcode.
The machine's status can be saved in another register:


The instruction and status registers are an "interface" between the control unit and the registered ALU, as we have partitioned it, but all three share the same clock.
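One way to picture the instruction register is as a word divided into fields. The sketch below packs and unpacks such a word in Python. The field order and the 14 bits accounted for (SOURCE, FUNCTION, DESTINATION, a 4-bit RAM ADDRESS, CIN) are a hypothetical layout chosen for illustration; the text only says that at least 16 pins are involved.

```python
# Hypothetical field layout: (name, width in bits), most significant field first.
FIELDS = [("SOURCE", 3), ("FUNCTION", 3), ("DESTINATION", 3),
          ("ADDRESS", 4), ("CIN", 1)]

def pack(values):
    """Concatenate the named fields into one instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Split an instruction word back into its named fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

example = {"SOURCE": 0b101, "FUNCTION": 0b000, "DESTINATION": 0b011,
           "ADDRESS": 0b1001, "CIN": 1}
word = pack(example)
print(format(word, "016b"), unpack(word) == example)
```

Widening the register later (for a next-address field, for example) amounts to appending entries to `FIELDS`.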




S5 / 3 Microprogramming

An instruction (a micro-instruction, more precisely) for one CPU clock cycle will control all the SOURCE, FUNCTION, DESTINATION, ADDRESS, CIN, etc. in the ALU and its registers. All of the different micro-instructions that the CPU will need are stored in a ROM whose many output pins project synchronously to the instruction register.



Thanks to a new clock pulse, the instruction register is loaded with the output of ROM. This ROM is in the CPU; it is not in the memory system of the computer. The ROM is a combinational circuit; the instruction register and address generator on either side of it are clocked, but the ROM is not. To work effectively with the hardware in the ALU and registers, the m-code ROM should be fast: a brief propagation delay from new address to new data.
The many words in ROM can be divided into sections which represent higher-level instructions: ADD or MULTIPLY or MOVE data to a different location, for example. In general, a user instruction (higher-level instruction) has a FETCH and an EXECUTE phase. The micro-instructions for FETCH may be common to many instructions, and so may reside in ROM where they can be called as a subroutine. The 1's and 0's which make up one micro-instruction can also be called micro-code. The art of micro-programming (a level of programming below assembly language) is the art of writing micro-code, code which defines the instruction set of a computer. We do not refer to m-code RAM; ROM gives the correct impression that the m-code cannot easily be changed, and that should be true about a computer's instruction set. These various facts about m-code ROM should be kept in mind as we work through the main design problem of the micro-program controller: the address generator.

S5 / 3 / 1 Address generator

If micro-code is written systematically, then the next instruction from ROM will be one address greater than the current instruction (an increment operation in the generator). But there must still be some times when a JUMP to a discontinuous location is required, for example, every time a new user instruction is fetched.
We begin the design of the next address generator with a MUX which can choose between an increment of the present address, Y+1, or a jump to a discontinuous address, D:


There are two designs to consider for the CONTinue choice, an up-counter or a combinational incrementer + register:



The up-counter looks simpler until we think about JUMPing. The incrementer + register handles jumping with no problem because the jump location D is automatically sent to the input of the incrementer. The up-counter, on the other hand, must have a LOAD feature, and it must be loaded with D at the same time D is passed out of the MUX. The load control for the counter must come from SEL for the MUX. Another feature in favor of the incrementer+register is that it maintains both Y and Y+1 internally, a state useful for saving a return address in a subroutine operation.
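The incrementer + register design can be sketched behaviorally in Python. The class name `AddressGenerator`, the method `clock`, and the select strings "CONT" and "JUMP" are names chosen here; the model ignores propagation delay and bus width.

```python
class AddressGenerator:
    """MUX + combinational incrementer + register (INC-reg)."""

    def __init__(self):
        self.inc_reg = 0  # holds Y+1 computed during the previous cycle

    def clock(self, sel, d=0):
        """Advance one clock cycle and return the ROM address Y."""
        # The MUX picks the jump address D or the saved increment.
        y = d if sel == "JUMP" else self.inc_reg
        # Whatever leaves the MUX feeds the incrementer, so a jump
        # address is incremented automatically with no LOAD control.
        self.inc_reg = y + 1
        return y

gen = AddressGenerator()
addresses = [gen.clock("JUMP", 10), gen.clock("CONT"), gen.clock("CONT"),
             gen.clock("JUMP", 40), gen.clock("CONT")]
print(addresses)
```

Note that after every cycle the object holds Y+1 in `inc_reg`, which is the property the text points to as useful for saving a return address.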
[An end-of-§ exercise allows you to explore the propagation delay differences between the two designs.]

S5 / 3 / 2 The jump location

One source of jump location is the next instruction in the user's program stored in system RAM (more precisely, the compiled version of the user's program, stored in an assembly language "object file"). Another source can be the microcode, which is executing, over several clock cycles, the user's current instruction. Since both of these sources are outside the address generator itself, let's attach them to the D-input of the MUX by a 3-state bus:



What exactly do we mean, that a jump location is in microcode itself? That the present set of bits in the instruction register has another field, the next address field, which projects not to the registered ALU but back to the address generator D input.
The next figure shows that the instruction register has been widened to accommodate this "next address" part of the microcode word. The next address passes through a 3-state buffer, then attaches to the address bus on the D-input of the address generator MUX.
Also attached to the address bus is the output of the "mapping ROM," from the user program's next instruction. The mapping ROM "maps" an assembly language input to an m-code ROM address. The mapping ROM has a 3-state output. Later we will add output-enable controls to the address generator circuit, which will choose which, if either, of the mapping ROM or pipeline buffer should have control of the address bus.
"Instruction register" is no longer an appropriate name for the register which holds the ALU & registers & next address information. We rename it pipeline register. It's one register in a pipeline of information transfer from user program to CPU. A pipeline is a generalized shift register in which many bits are shifted in parallel.
How many bits might be in the address bus? Twelve bits would address 4096 m-code ROM addresses, and 12 bits is the width of the address bus in the 2910 microprogram sequencer. With roughly 4000 locations and an instruction set with 50 entries, each instruction could take up an average of 80 microcode words.
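The shared 3-state address bus can be modelled as a function that lets at most one enabled source through. In this Python sketch the function name `address_bus`, the opcodes, and all addresses in `MAPPING_ROM` are hypothetical values chosen for illustration.

```python
# Hypothetical mapping ROM: user opcode -> starting address in m-code ROM.
MAPPING_ROM = {"ADD": 0x040, "MOVE": 0x080}

def address_bus(drivers):
    """Resolve a shared 3-state bus.

    `drivers` maps a source name to (output_enable, value).
    Returns the driven value, or None if no source is enabled
    (a floating bus). Two enabled sources are a design error.
    """
    active = [(name, value) for name, (enabled, value) in drivers.items()
              if enabled]
    if len(active) > 1:
        names = ", ".join(name for name, _ in active)
        raise RuntimeError("bus contention between " + names)
    return active[0][1] if active else None

# Start of a user instruction: the mapping ROM owns the bus.
d_input = address_bus({
    "mapping ROM": (True, MAPPING_ROM["ADD"]),
    "pipeline next-address": (False, 0x123),
})
print(hex(d_input))
```

The exception stands in for what would be an electrical conflict in hardware, which is why the output enables must be controlled from one place.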


The feedback pathway from the pipeline register back to the address generator is clear evidence that we are constructing a clocked sequencer (Moore circuit), like we did in §6. Here we are not vexed by the problem of minimizing the number of gates in the design. Before the commercial advent of microprogramming in the late 1960's, CPU control circuits were "hard-wired" with tightly designed clocked sequencers. Now, at the expense of a ROM and the delay of the pipeline, the process of designing the hardware for the controller can be separated from the task of designing the instruction set for the computer. To change the instruction set of a micro-programmed computer, simply change the code in the ROM.

For a good development of the "random logic" approach to control unit design see

Thomas Bartee, Digital Computer Fundamentals, 6th Edition, McGraw-Hill (1985) chapter 9, "The control unit."

Pause for a moment to appreciate the micro-program sequencer's "philosophy"; it not only knows what it's doing now, but knows what it's going to do next. Look before you leap, or in this case, look before you jump.

In fact the feedback has another component we haven't shown on the figure above: The current microcode instruction also contains the SEL for the next-address MUX, so that the current instruction can tell the address generator when to look at the D input.
The feedback arrangement providing for the next sequencer instruction makes the job of writing microcode more difficult: each line of m-code controls the current Source, Function, Destination, etc. of the registered ALU, and it must contain the next sequencer instruction, and possibly the next sequencer D-address.
[There are some subtle timing issues with regard to the arrival of the next active edge of the system clock and when the control unit finally settles to the next instruction in the pipeline (propagation delay), but we defer these issues until later in the §, and to the Lab Manual.]

S5 / 3 / 3 One-level subroutine

So far our next address generator, or sequencer, has no more capability than an up-counter with LOAD. Now we take advantage of the incrementer-register to save the next continue address during a JUMP, then return to that address after executing a "subroutine". See address blocks below.


Here "Main Program" means the code address sequence in micro-code ROM.
The sequence ... M-1, M, S, ... S+3, M+1 ... must be stepped through.
Two sequencer instructions will be needed for the complete process: JSR, for "jump to subroutine" and RTS, for "return from subroutine." Again, to motivate interest in subroutines we remind you that the microcode may be structured to include a common FETCH subroutine in each micro-coded instruction. FETCH can obtain the next user instruction from main memory, and increment a "program counter" register in the ALU register set. We don't have to be concerned about what happens in the subroutine, just about going to the subroutine and getting back to M+1 in the main program when the subroutine is finished. Since FETCH is used by many instruction code sequences (such as program N, above), it cannot have built in a unique return address. In one case return is to M+1, in another to N+1.

Saving the current address + 1. When a JSR instruction arrives at the microprogram sequencer, the address M+1 must be saved in another register. Call that register R and give its output a third access to the address generator MUX, as shown below. R-reg has been placed above the INC-reg, extending the internal pipeline of the microprogram sequencer. The timing problems associated with placing R-reg at the same level as INC-reg can be explored in an end-of-chapter exercise.


When the "pop" input on the MUX is selected the current output of the R-register becomes the ROM address. The circuit above has its own timing problem, which is easier to solve: if a new clock edge sends M+1 to the INC register before the "JSR detect" signal arrives at the R-reg, then the correct M+1 value will be written into the R-reg.


Ensure that the R-reg clock occurs t_setup after the user clock, so that M+1 races through the INC-reg to land where it should in the R-reg.

When the subroutine is finished the return location is available at the output of the R-register, and a MUX select to POP will bring address M+1 to the output of the address generator, as desired.
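The JSR and RTS behavior described above can be traced with a small Python model. The class name `OneLevelSequencer` and the concrete addresses (M = 20, S = 100) are chosen here for illustration, and the model treats the R-reg write as instantaneous, setting aside the clock-skew issue just discussed.

```python
class OneLevelSequencer:
    """Address generator with one return-address register (R-reg)."""

    def __init__(self, start=0):
        self.inc_reg = start  # next "continue" address
        self.r_reg = None     # saved return address

    def clock(self, instruction, d=None):
        """Apply one sequencer instruction and return the ROM address."""
        if instruction == "JSR":
            self.r_reg = self.inc_reg  # save M+1 before it is overwritten
            y = d
        elif instruction == "RTS":
            y = self.r_reg             # the "pop" input of the MUX
        elif instruction == "JUMP":
            y = d
        else:                          # CONT
            y = self.inc_reg
        self.inc_reg = y + 1
        return y

M, S = 20, 100  # illustrative addresses
seq = OneLevelSequencer(start=M - 1)
steps = [("CONT",), ("CONT",), ("JSR", S), ("CONT",), ("CONT",),
         ("CONT",), ("RTS",)]
print([seq.clock(*step) for step in steps])
# [19, 20, 100, 101, 102, 103, 21], i.e. M-1, M, S, ..., S+3, M+1
```

A second JSR issued inside the subroutine would overwrite `r_reg` and lose the way back to M+1, which is the limitation a stack removes.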

S5 / 3 / 4 Nested subroutines

By turning the R-reg into a last-in, first-out "LIFO stack" the sequencer can handle nested subroutines.


Subroutines can be nested to a level as deep as the stack; let's say 4-deep for the example we're working. The LIFO stack can be a 4-word memory addressed by 2 bits from an up/down counter known as the stack pointer.


There are two control signals for the stack: SE = stack-enable, which is active for both JSR and RTS instructions, and PUSH/POP, which is HI (PUSH, count up) for JSR and LO (POP, count down) for RTS. SE is AND'd with the system clock to produce clock edges for the stack pointer. If more than 4 JSR's are attempted before an RTS, then the "FULL" warning is lit by the stack pointer.
For one-level subroutining only the clock on the R-reg need be coordinated with the clock on the INC-reg. Now the clock on the counter, the up/down control, and the stack write-enable need to be coordinated with INC-reg. In particular, note that SE may increment or decrement the counter near the system clock edge, but for an RTS the counter may present an incorrect value to the POP input of the MUX. This problem can be overcome by sticking another "pipeline" register at the output of the stack, and changing the phase on which the stack pointer is changed. See details in an end-of-§ problem, or JD Lab Manual.
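A 4-deep LIFO stack with a pointer can be sketched in Python as follows. The class name `LifoStack` is chosen here, and the pointer in this model counts stored entries from 0 to 4 (so it is not a literal 2-bit counter); that simplification, and raising exceptions on misuse, are assumptions of the sketch, not features of the hardware.

```python
class LifoStack:
    """4-word return-address stack addressed by a stack pointer."""

    DEPTH = 4

    def __init__(self):
        self.words = [0] * self.DEPTH
        self.pointer = 0  # number of return addresses currently stored

    @property
    def full(self):
        """Models the FULL warning."""
        return self.pointer == self.DEPTH

    def push(self, address):
        """JSR: write the return address, then count up."""
        if self.full:
            raise OverflowError("more than 4 nested JSRs")
        self.words[self.pointer] = address
        self.pointer += 1

    def pop(self):
        """RTS: count down, then read the return address."""
        if self.pointer == 0:
            raise IndexError("RTS with an empty stack")
        self.pointer -= 1
        return self.words[self.pointer]

stack = LifoStack()
for return_address in (11, 22, 33):
    stack.push(return_address)
print(stack.pop(), stack.pop(), stack.full)
```

The last address pushed is the first one popped, so returns unwind in the reverse order of the nested calls.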

S5 / 3 / 5 Conditional branching in microcode

We advertised that our sequencer would be able to handle conditional branching. Conditions to cause branching can come from three sources:
• The user program
• Internal conditions collected by the status register from the ALU.
• External conditions, like EOC from an A/D converter.
Changes in microcode address from the user program are expected and are handled by the "mapping ROM" shown earlier, which has 3-state access to the sequencer data bus.

Feedback from the status register. "Internal" conditions will come from the status register which saves COUT, overflow, sign bit, zero-detect, etc from the ALU operations.


Shown above is a "FAIL MUX" which uses the next sequencer instruction, stored in the pipeline register, to select an appropriate status bit to check. For example, if the user instruction is a signed addition, it will be important to check for overflow, but not for carry-out. We show the output of the FAIL MUX headed to a sequencer instruction decoder, which will be explained in a couple paragraphs.
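The FAIL MUX is a one-of-N selector over the status register. In the Python sketch below, the ordering of `STATUS_BITS`, the function name `fail_mux`, and the polarity (a selected bit of 1 means FAIL) are all assumptions made for illustration; the text does not fix them.

```python
# Assumed ordering of the six status bits on the FAIL MUX inputs.
STATUS_BITS = ["COUT", "G", "P", "OVERFLOW", "SIGN", "ZERO"]

def fail_mux(status, select):
    """Pass the selected status bit through as the FAIL signal.

    `status` is the saved status register as a dict of bit values;
    `select` is the condition-select field from the pipeline register.
    """
    return status[STATUS_BITS[select]]

# Status saved after a signed addition that overflowed.
status = {"COUT": 0, "G": 0, "P": 1, "OVERFLOW": 1, "SIGN": 1, "ZERO": 0}

# A signed add checks overflow and ignores carry-out.
print(fail_mux(status, STATUS_BITS.index("OVERFLOW")))
```

Because the select field comes from the microcode, each micro-instruction decides which single condition matters for its branch.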

External interrupts. Events in this category may be unexpected, or occur at unpredictable, even asynchronous times, but their possible occurrence has been planned for in the form of interrupt service routines in microcode. In this § we ignore external events whose occurrence does not require urgent attention and which can be handled by scheduled "polling" of the flags and status bits in external devices.

Here we outline how a CPU controller can manage "real time" events: some welcome, like a signal that a co-processor has completed a floating point calculation, some unwelcome, like a power failure during which time the controller may have 100 msec to start and finish a shutdown procedure. Interrupt service sub-routine addresses can be announced to the controller the same way user instructions are, through a 3-state output ROM, in this case called a "vector ROM," to suggest that the ROM supplies a vector pointing to the proper interrupt service routine:


The interrupt signals, which may have to pass through enabling, synchronizing, and pulse-capturing circuits, form the address to the vector ROM. The output of the vector ROM is an address in micro-code ROM where the interrupt service routine starts.
[Additionally, a signal that an unexpected interrupt has occurred may be sent directly to the sequencer.]
The output-enables of the mapping and vector ROM's, and the pipeline buffer output enable, will be controlled from the microprogram sequencer, as we'll see in the next paragraph.
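The vector ROM lookup can be illustrated with a two-line interrupt system in Python. Everything concrete here is hypothetical: the two interrupt lines, their bit positions, the service-routine addresses, and the choice that power failure wins when both lines are active.

```python
# Hypothetical vector ROM for two interrupt lines.
# Address bit 0 = co-processor done, address bit 1 = power fail.
VECTOR_ROM = [
    0x000,  # 00: no interrupt pending (entry not used)
    0x300,  # 01: co-processor service routine
    0x3F0,  # 10: power-fail shutdown routine
    0x3F0,  # 11: both active, power fail takes priority
]

def interrupt_vector(coprocessor_done, power_fail):
    """Form the vector ROM address from the (already synchronized)
    interrupt lines and return the m-code ROM start address."""
    address = (power_fail << 1) | coprocessor_done
    return VECTOR_ROM[address]

print(hex(interrupt_vector(coprocessor_done=1, power_fail=0)))
```

Priority between simultaneous interrupts is settled by the ROM contents alone, so it can be changed without touching the sequencer.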

S5 / 3 / 6 Instruction decoding

Assume that coordination of the timing signals for the INC-reg and stack has been taken care of by a pipeline architecture, and that the various external jump address sources have been attached to the address bus. We are left with one final problem to finish the design of this general-purpose sequencer: how are the instructions from the external pipeline decoded? In the figure above, the instructions are shown feeding back to the SEL of a MUX. Here is a complete listing of the inputs and outputs for this instruction decoder. Four bits of instruction code are sent back from the pipeline.


The decoder does not need a clock. It does not need flip flops. So the final act in the design of a micro-program sequencer, which is the final act in the design of the CPU controller, and the penultimate act in the design of the whole computer, is the realization of a truth table. REPEAT: Once we design the instruction decoder, we're finished with the inner workings of the whole computer. The design of the decoder takes us all the way back to the beginning of the book-the turning of truth tables into combinational hardware.

To actually fill in the gates for the instruction decoder design we need to know the instruction code. In one of the end-of-§ exercises you are given an 8-instruction, 3-bit code and asked to work out a complete design. Here we bring out 6/16 of the 4-bit code + FAIL lines from the 2910 instruction set, for examples.

[We have shadowed the 2910 chip in illustrating the design of a microprogram sequencer, but we have not shown all of the 2910's features. See data sheets in the AMD book, or JD Lab Manual. In this regard, one of the problems at the end of the chapter asks about the usefulness of a 4th MUX input for repeated looping. In the MUX-select options below we've added a 4th choice, ZERO.]

I3 I2 I1 I0 are the instruction bits for the sequencer. If FAIL is active, then the second condition on JUMP address and stack operation is enforced. The stack has 4 operations: CLEAR, HOLD, PUSH, POP.

Table of instructions:

                       |----------- PASS -----------|   |------ FAIL ------|
I3 I2 I1 I0   INST     MUX SEL   JUMP Addr   STACK      JUMP Addr   stack
0  0  0  0    JZ       zero      -           clear      -           -
0  0  1  0    JMAP     JUMP      MAP         hold       -           -
1  1  1  0    CONT     CONT      -           hold       -           -
0  1  1  0    CJV      JUMP      VECT        hold       CONT        hold
0  0  0  1    CJS      JUMP      PL          PUSH       CONT        hold
1  0  1  0    CRTN     POP       -           POP        CONT        hold

(A dash marks a cell that is empty in the table.)

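where the sample instruction abbreviations stand for JZ = jump to address zero, JMAP = jump to the mapping ROM address, CONT = continue, CJV = conditional jump to the vector address, CJS = conditional jump to subroutine (address from the pipeline register, PL), and CRTN = conditional return from subroutine.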
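Since the decoder is only a truth table, the six sample rows of the instruction table can be written directly as a lookup in Python. The dictionary layout, the function name `decode`, and the output key names are choices made here; the placement of the two CRTN "POP" entries (MUX select and stack operation) is this sketch's reading of the table.

```python
# (I3, I2, I1, I0) -> (name, PASS outputs, FAIL outputs or None).
# Each outputs tuple is (MUX select, address source enabled, stack op).
# FAIL outputs of None mean the instruction is unconditional.
DECODER = {
    (0, 0, 0, 0): ("JZ",   ("ZERO", None,   "CLEAR"), None),
    (0, 0, 1, 0): ("JMAP", ("JUMP", "MAP",  "HOLD"),  None),
    (1, 1, 1, 0): ("CONT", ("CONT", None,   "HOLD"),  None),
    (0, 1, 1, 0): ("CJV",  ("JUMP", "VECT", "HOLD"),  ("CONT", None, "HOLD")),
    (0, 0, 0, 1): ("CJS",  ("JUMP", "PL",   "PUSH"),  ("CONT", None, "HOLD")),
    (1, 0, 1, 0): ("CRTN", ("POP",  None,   "POP"),   ("CONT", None, "HOLD")),
}

def decode(i3, i2, i1, i0, fail):
    """Combinational decode: no clock and no stored state."""
    name, on_pass, on_fail = DECODER[(i3, i2, i1, i0)]
    mux_sel, source, stack = on_fail if (fail and on_fail) else on_pass
    return {"inst": name, "mux_sel": mux_sel,
            "enable": source, "stack": stack}

print(decode(0, 0, 0, 1, fail=0))  # CJS taken: jump and push
print(decode(0, 0, 0, 1, fail=1))  # CJS not taken: continue and hold
```

The "enable" output corresponds to the output-enable lines for the mapping ROM (MAP), vector ROM (VECT), and pipeline buffer (PL) on the shared address bus.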







S5 / 4 Connection of CPU to main memory (semiconductor RAM)

The "ultimate" hardware step in our assembly of a complete computer is hooking up the CPU to main memory. Main memory (called in the past "core memory") will have orders of magnitude more locations than the cache memory inside the CPU, and will be arranged as random-access read-and-write cells, as you saw in the Memory chapter. For the purpose of the example to be shown here, we assume that the number of bits in the data path of the CPU is sufficient to address main memory. If the data path is 16 bits, then there are 2^16 = 65,536 locations in main memory. Main memory has three parts which need connection to the CPU:

• Address pins
• Data pins, for both input and output
• Read-write control.
Consider main memory in the form of DRAM chips, which are slower than the static RAM of the CPU cache. For connection to main memory a few more registers will be needed; they will be housed in the CPU.

Start with the CPU connection to the main memory address. Another register, the Memory Address Register (MAR), which receives its input from the shifter, is dedicated to addressing main memory. Another bit is added to micro-code to control when MAR accepts a new address.



Registers near the ALU can store the program counter and data pointer (locations in main memory of the user instruction currently being executed, and the data being computed upon), and these values can be passed out to the MAR during a FETCH cycle of the machine. While the CPU is busy carrying out a user's instruction, the MAR faithfully holds the correct place in memory.

Next consider the memory data pins, and read-write control. We assume the main memory data bus is bi-directional, and that whether memory contents are being read from or written to is determined by the R/W pin on main memory.



When main memory is in the READ mode, there are two possible destinations: the mapping ROM or the EXTERNAL DATA port on the ALU MUX. As long as the MAR output is stable, registers are not needed in front of either of these destinations. It is the responsibility of the microcode to send the output of the mapping ROM into the m-code ROM at the beginning of each user instruction, but once that is done the CPU control unit doesn't need to worry about "saving" the contents of the mapping ROM. Likewise, when microcode decides that data from main memory is needed, it can FETCH the data through a change in the MAR, then accept the data through control of the ALU MUX select.
Writing to main memory is another matter. Presumably something at the output of the shifter should be sent to a memory location, so the shifter needs connection through a 3-state device to the main-memory data bus. Three-state enable is controlled from the pipeline. On the same microcode operation the pipeline register can switch the memory from READ to WRITE mode. That we have opted to structure the 3-state device as a register clocked from microcode gives the system some latitude in making sure the data meets t_hold of the main memory write cycle; the issue is explored more in an end-of-§ exercise.

In hooking up main memory to the CPU we have not dealt with secondary issues such as addressing modes, main memory being "slower" than the CPU, direct memory access and how main memory may be loaded from disk drives. But what we've done is sufficient to make a working computer, given enough intelligence and "width" in the microcode. At this point we can summarize a computer with a block diagram which includes the three major components and their interconnects:




S5 / 5 Example of a user instruction in microcode.

Before summarizing this chapter, we tread lightly into the realm of software.
Let's demonstrate how a particular user instruction could be handled by microcode.
Try the ADD instruction, ACC ← ACC + RAM
"Replace the contents of register ACC with the sum of the current ACC and a number stored in memory location RAM."
Where ACC is stored in a register near the ALU, and the address of RAM is stored in a data pointer (DP) register near the ALU.
There is another register near the ALU, called Program Counter (PC), which contains the address of the user instruction ADD ACC to RAM.
For specifics, say that "6" is in register ACC, 21 is in register PC and 43 in register DP, and that at main memory location 21 is the user code for ADD, and at location 43 is "5".


With apologies to strict register transfer notation, here is what we expect to happen during the ADD instruction. We assume the architecture developed during this § carries out these actions.


Two FETCHes, two EXECUTEs and a test for carry out; 5 cycles of microcode. Should the instruction end with a MAR ← PC, or should that be the beginning of the next instruction?

[At the end of the instruction more than incrementing the PC register may need to be done, such as microcode which enables interrupts, inspects their status, services them if they are active, then disables them before starting the next instruction.]

Without setting up all the fields (SOURCE, FUNCTION, DESTINATION, CARRY-IN, MAR CLOCK, etc.) with position codes, we can't fill in the 1's and 0's of the various microcode words, but this outline is close to the final product: a microcoded user instruction.
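The ADD example can be traced in Python using the specific values given earlier (6 in ACC, 21 in PC, 43 in DP, the ADD code at location 21, and 5 at location 43). The ordering of the register transfers below, the 16-bit word width, the string "ADD" standing in for the binary user code, and incrementing PC as the last step are assumptions of this sketch, not a transcription of the chapter's own listing.

```python
WORD_MASK = 0xFFFF  # assumed 16-bit data path

def run_add(regs, memory):
    """Trace one ADD user instruction as register transfers."""
    trace = []

    regs["MAR"] = regs["PC"]                  # FETCH the user instruction
    trace.append("MAR <- PC")
    opcode = memory[regs["MAR"]]
    if opcode != "ADD":
        raise ValueError("this sketch only handles ADD")
    trace.append("m-code address <- mapping ROM[ADD]")

    regs["MAR"] = regs["DP"]                  # FETCH the operand
    trace.append("MAR <- DP")
    total = regs["ACC"] + memory[regs["MAR"]]
    regs["ACC"] = total & WORD_MASK           # EXECUTE the addition
    trace.append("ACC <- ACC + EXTERNAL DATA")

    carry = total >> 16                       # test for carry out
    trace.append("test COUT")

    regs["PC"] = (regs["PC"] + 1) & WORD_MASK  # point at the next instruction
    return carry, trace

memory = {21: "ADD", 43: 5}
regs = {"ACC": 6, "PC": 21, "DP": 43, "MAR": 0}
carry, trace = run_add(regs, memory)
print(regs["ACC"], carry, regs["PC"])
print(trace)
```

Each entry in `trace` corresponds to one microcode word whose fields would still have to be filled in with actual SOURCE, FUNCTION, and DESTINATION codes.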



S5 / 6 References

Chapter 4, "Microprogram sequencing devices," of G.J. Myers, Digital System Design with LSI Bit-Slice Logic. John Wiley & Sons, New York, 1980. Discusses the 2910, plus similar chips from other manufacturers.

John Mick & Jim Brick, Bit Slice Microprocessor Design, McGraw-Hill, New York, 1980. Chapter II, on microprogrammed design, goes over features of the 2910.

Wilkes, Maurice, Memoirs of a Computer Pioneer, MIT Press, 1985.
Wilkes originated the idea of microprogramming.

John von Neumann and the Origins of Modern Computing, by William Aspray, MIT Press (1990). [ review by G. Tweedale, in Nature, Feb. 21, 1991, page 662. Von Neumann had such influence on the design of early computers that machines like the one we've designed in this § are called "von Neumann machines." ]

S5 / 7 Summary

S5 / 8 How it ends

We have ended the Supplemental Chapters by giving you insight into a full-featured sequencer, in the service of a micro-programmed computer. We found that instructions to a microprogram sequencer were decoded by a combinational logic circuit which realizes a truth table. At this point we have wrapped around to the beginning: The textbook started with an exposition about the means to turn a truth table into a logic circuit.