# Question 1: Single-cycle CPU implementation (20 points) You are asked to implement jr rs instruction (jump to register rs) in the single-cycle datapath. The format of jr instruction is shown below. | Field | on = 0 | 1'5 | | | | |-------|--------|-------|------|----------|--| | | | | 0 | func = 8 | | | Bits | 31-26 | 25-21 | 20-6 | 5-0 | | "Before implementing the jr instruction, do parts (a) and (b) as a small warm-up exercise." ## Part (a) The single-cycle datapath from lecture appears below. Clearly mark all wires that are active during the execution of lw instruction. (points) marked with green. ## Part (b) On the diagram below, write (next to the signal's name) values of all non-0 control signals required for the lw instruction (5 points) # Part (c) The single-cycle datapath from lecture (same as on the previous page) appears below. Show what changes are needed to support jr instruction. You should only add wires and multiplexers to the datapath; do not modify the main functional units themselves (the memory, register file and ALU). Try to keep your diagram neat (priorits) Note: While we're primarily concerned about correctness, full points will only be rewarded to solutions that do not lengthen the clock cycle. Assume that the ALU, Memory and Register file all take 2ns, and everything else is instantaneous. #### Part (d) On the diagram below, write (next to the signal's name) values of all non-0 control signals required for the jr instruction (5/points) # Question 2: Multi-cycle CPU implementation and its performance (20 points) Assume that the ALU can perform the max2 operation (i.e., return the greater of 2 inputs): alu\_result = (A\_input > B\_input) ? A\_input : B\_input; ALUOp for this instruction is MAX2. Given this improved ALU, implement the max4 instruction that writes into register rd the largest value of 4 registers: Note that register rd is both an *input* and an *output*. Instruction max4 has the following format: | Field | | 1 | | | | | |--------|-------|-------|-------|-------|------|------| | 1 1610 | ор | 15 | rt | rd | rm | func | | Bits | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0 | # Part (a) The multicycle datapath from lecture appears below. Show what changes are needed to support max4. You should only add wires and multiplexers to the datapath; do not modify the main functional units themselves (the memory, register file, and ALU). Try to keep your diagram neat! Note: While we're primarily concerned about correctness, full points will only be rewarded to solutions that use a minimal number of cycles and do not lengthen the clock cycle. Assume that the ALU, Memory and Registerfile all take 2ns, and everything else is instantaneous. Part (b) The max4 instruction can be used in place of three branch instructions, reducing the number of instructions that need to be executed. Below are two functionally equivalent programs; the second of which uses the max4 instruction: | | Program 1 | Program 2 | |---------------|--------------------------------------------|---------------------| | 00 00 Vo to | 12 lw v0, 0(a0) as +4) 12 | | | IF IM IS O(D) | add a1, a0, 12 0 1 10 | lw t0, 4(a0) | | | label: add a0, a0, 4 ow +12 cm. | lw t1, 8(a0) | | | lw t1, 0(a0) | lw t2, 12(a0) | | | <u>slt</u> t2, t1, v0 | max4 v0, t0, t1, t2 | | | bne t2, \$zero, skip | jr \$ra | | 7+7+7+7 | move v0, t1 always skip: bne a0, a1, label | lw v0, 0(a0) | | | jr \$ra | | Assume the datapath and control that you implemented in parts (a) and (b); assume slt and move can be considered as R-type instructions and jr and bne take the same amount of time as beq. Also assume that the first element in the input array is the largest (so that the first branch of program 1 is always taken). How much faster (or slower) is program 2 than program 1? You may leave your answer as a fraction. (10 points) Tor program 1: 6 instructions, assume 5 a.c. / instruction total # of mshs: 7+7+7+7+1=29 and c.c. time = 1sec. assume 'SCC (int. => Exec. time = 6x5x1= 80 sec. => Exec time = 29 x5=145 ratio = Execting() = 145 => (2) is falke by 80 than (1) >> is of higher performent Question 3: More performance (20 points) #### Part (a) Assume the following delays for the main functional units: | Functional Unit | Time delay | |-----------------|------------| | Memory | 5 no | | ALU | 4 ns | | Register File | 3 ns | Given the following instructions: Iw, sw, add, beq, calculate: - minimum time to perform each instruction - time required on a single-cycle datapath (from q. 1) - time required on multi-cycle datapath (from q. 2). | your answers in th | | te any assumptions. (16 | points) | lost for gol | |--------------------|-----------------|-------------------------|--------------------------|--------------| | Instruction | minimum<br>time | in single-cycle | in multi-cycle 5+3+4+5+3 | 2 5 5 | | · lw | 2005. | 5+3+4+5+3<br>= 20 ns | = 2002. | | | 5/// | 1200 | 5+3+4+5 | 5+3+4+5 | 5 x H | | add | 1500 | 5+3+4+3<br>= 1505 | 5+3+4+3<br>5+3+4+3 | 5+3 | | beq | Jan. | 5+3+4 | 1301 | • | all take langth of the 30 #### Part (b) How much faster than a 1 GHz single-cycle MIPS processor would a 3.0Ghz Pentium 4 x86 processor be if it achieved a CPI of 0.5 on a workload? (6 points) ormance of single = Exec of Pent = 0.5 x I x 3GH2 = 0.5 Frentium Question 4: Pipelining and forwarding (20 points) Pentium H is of better # Part (a) Show or list all of the dependencies in this program. For each dependency, indicate which instructions and register are involved. (6) points) add \$8\$ \$5, \$5 add \$2\$ \$5,\$\$ 210\$ $$+ 105 = 210$$ $$315 = 315$$ $$314 + 315 = (29) \Rightarrow 12$$ $$314 + 315 = (29) \Rightarrow 12$$ $$314 + 315 = (29) \Rightarrow 12$$ #### Part (b) The pipelined datapath on the next page shows the fifth cycle of executing this program, including values for several of the stages. Fill in the ten remaining values, marked with a ? symbol, in the EX and MEM stages. (20 points 2 points each) #### N.B.: - · Write your answers directly on the diagram, but write clearly. - · Show decimal values. - Assume that registers initially contain their number plus 100: \$2 contains 102, \$8 contains 108, etc. - Write 'X' for any numbers that cannot be determined. One CPU manufacturer has proposed the <u>10-stage pipeline</u> above for a <u>500MHz</u> (<u>2ns clock</u> cycle) machine. Here are the correspondences between this and the MIPS pipeline: - Instructions are fetched in the FET stage. - Register reading is performed in the REG stage. - ALU operations and memory accesses are both done in the EXE stage. - Branches are resolved in the DET stage. - WRB is the writeback stage. #### Part (a) How much time is required to execute one million instructions on this processor, assuming there are no dependencies or branches in the code? (5 points) Execution time = $$\frac{1}{2} \left( \frac{1}{1000,000} \times \frac{10}{10} \right) + \frac{1}{1000,000} \times \frac{10}{10} + \frac{1}{1000,000} \times \frac{10}{10} \times \frac{10}{10} \times \frac{10}{1000,000} \times \frac{10}{10} \times \frac{10}{1000,000} \frac{10}{1000,0$$ | - | | | -0 | | |----|----|-----|----|---| | 13 | | *** | n. | | | | æ. | rt | w | 3 | Without forwarding, how many stall cycles are needed for the following code fragment? (5 points) \$t0, 0(\$a0) lw \$v1 \$t0 \$t0 add [] [] # Part (c) > 2 stalls for the and ID If a branch is miss-predicted, how many instructions would have to be flushed from the pipeline? (5 points) Instructions to be plushed depend on # of instruction shipped due to branch, assuming that branch is taken due to prediction; ## Part (d) Assume that a program executes one million instructions. Of these, 15% are load instructions which stall, and 10% of the instructions are branches. The CPU predicts branches correctly 75% of the time. How much time will it take to execute this program? (10 points) > Exec the = Execution time cussuming good > prediction + 25% of the exectine customing sood prediction (100%) #### Performance 1. Formula for computing the CPU time of a program P running on a machine X: CPU $time_{XP} = Number of instructions$ executed $p \times CPI_{XP} \times Clock$ cycle $time_{XP}$ 2. CFI is the average number of clock cycles per instruction: CPI = Number of cycles needed. Number of instructions executed 3. Spredup is a metric for relative performance of 2 executions. Speedup = Performance after improvement - Performance before improvement = Execution time before improvement. Execution time after improvement