# Pipelining

Philipp Koehn

HW4-soon!

26 March 2018 / 7 Oct 2019



### Laundry Analogy





# Laundry Pipelined



|        | 6:00pm | 6:30pm | 7:00pm | 7:30pm | 8:00pm | 8:30pm |
|--------|--------|--------|--------|--------|--------|--------|
| Task A | WASH   | DRY    | FOLD   |        |        |        |
| Task B |        | WASH   | DRY    | FOLD   |        |        |
| Task C |        |        | WASH   | DRY    | FOLD   |        |
| Task D |        |        |        | WASH   | DRY    | FOLD   |

### Speed-up



- Theoretical speed-up: 3 times
- Actual speed-up in example: 2 times
  - sequential: 1:30 + 1:30 + 1:30 + 1:30 = 6 hours
  - pipelined: 1:30 + 0:30 + 0:30 + 0:30 = 3 hours
- Many tasks $\rightarrow$ speed-up approaches theoretical limit
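The laundry numbers can be checked with a few lines of Python. This is a minimal sketch, assuming three stages of 30 minutes each and at least one task; the function names are made up for illustration.

```python
def sequential_time(n_tasks: int, n_stages: int, stage_minutes: int) -> int:
    """Each task runs all of its stages before the next task starts."""
    return n_tasks * n_stages * stage_minutes


def pipelined_time(n_tasks: int, n_stages: int, stage_minutes: int) -> int:
    """The first task needs all stages; each further task finishes one stage later."""
    return (n_stages + n_tasks - 1) * stage_minutes


def speed_up(n_tasks: int, n_stages: int = 3, stage_minutes: int = 30) -> float:
    """Ratio of sequential to pipelined time (assumes n_tasks >= 1)."""
    return sequential_time(n_tasks, n_stages, stage_minutes) / pipelined_time(
        n_tasks, n_stages, stage_minutes
    )


print(speed_up(4))  # 2.0, the four loads from the example
for n in (100, 10_000):
    # The ratio moves toward the number of stages (3) as n grows.
    print(n, speed_up(n))
```

With four loads the ratio is 360 / 180 minutes; with more loads the 1:00 of start-up time matters less and less.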



# MIPS instruction pipeline

### **MIPS Pipeline**



- IF: Fetch instruction from memory
- ID: Read registers and decode instruction
   (note: registers are always encoded in same place in instruction)
- EX: Execute operation OR calculate an address
- MEM: Access an operand in memory
- WB: Write result into a register

### **Time for Instructions**





### **Pipeline Execution**





### Speed-up



- Theoretical speed-up: 4 times
- Actual speed-up in example: 1.71 times
  - sequential: 800ps + 800ps + 800ps = 2400ps
  - pipelined: 1000ps + 200ps + 200ps = 1400ps
- Many tasks $\rightarrow$ speed-up approaches theoretical limit
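The same arithmetic can be written for any number of instructions. As a sketch, let $n$ (a symbol introduced here) be the number of instructions, each taking 800ps sequentially, with the first pipelined instruction finishing after 1000ps and each further one 200ps later:

$$
\text{speed-up}(n) = \frac{800\,n}{1000 + 200\,(n-1)} = \frac{800\,n}{800 + 200\,n},
\qquad
\text{speed-up}(3) = \frac{2400}{1400} \approx 1.71,
\qquad
\lim_{n \to \infty} \text{speed-up}(n) = \frac{800}{200} = 4
$$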



MIPS instruction set properties that simplify pipelining:

- All instructions are 4 bytes
  - $\rightarrow$ easy to fetch next instruction
- Few instruction formats
  - $\rightarrow$ parallel op decode and register read
- Memory access limited to load and store instructions
  - $\rightarrow$ stage 3 calculates the address for a memory access, otherwise executes the operation
- Words aligned in memory (aligned = memory address multiple of 4)
  - $\rightarrow$ able to read a word in one instruction, with a single memory access
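Two of these properties reduce to very simple arithmetic. The following is only an illustration; the function names and the example address are hypothetical.

```python
INSTRUCTION_BYTES = 4  # all MIPS instructions are 4 bytes
WORD_BYTES = 4         # a word is 4 bytes


def next_instruction_address(pc: int) -> int:
    """Without a branch or jump, the next instruction is 4 bytes further on."""
    return pc + INSTRUCTION_BYTES


def is_aligned(address: int) -> bool:
    """A word is aligned when its address is a multiple of 4."""
    return address % WORD_BYTES == 0


print(hex(next_instruction_address(0x00400000)))  # 0x400004
print(is_aligned(8), is_aligned(10))              # True False
```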



# hazards

#### Hazards



- Hazard = next instruction cannot be executed in next clock cycle
- Types
  - structural hazard
  - data hazard
  - control hazard

### Structural Hazard



- Definition: two instructions need the same resource in the same clock cycle
- For instance: memory access conflict

|    | 1     | 2      | 3      | 4        | 5      | 6        | 7        |
|----|-------|--------|--------|----------|--------|----------|----------|
| i1 | FETCH | DECODE | MEMORY | MEMORY   | ALU    | REGISTER |          |
| i2 |       | FETCH  | DECODE | MEMORY   | MEMORY | ALU      | REGISTER |
|    |       |        |        | conflict |        |          |          |

- MIPS is designed to avoid structural hazards
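The conflict in the table can be found mechanically. This sketch treats every stage label as its own resource and uses the hypothetical six-stage instruction from the table; `find_conflicts` is a name made up for the example.

```python
from collections import defaultdict

# Resource used in each stage of the hypothetical instruction from the table.
STAGES = ["FETCH", "DECODE", "MEMORY", "MEMORY", "ALU", "REGISTER"]


def find_conflicts(start_cycles: dict[str, int], resource: str) -> dict[int, list[str]]:
    """Return {cycle: [instructions]} for every cycle in which more than one
    instruction needs `resource`."""
    users = defaultdict(list)
    for name, start in start_cycles.items():
        for offset, stage in enumerate(STAGES):
            if stage == resource:
                users[start + offset].append(name)
    return {cycle: names for cycle, names in users.items() if len(names) > 1}


# i1 starts in cycle 1, i2 in cycle 2, as in the table.
print(find_conflicts({"i1": 1, "i2": 2}, "MEMORY"))  # {4: ['i1', 'i2']}
```

Because each stage of the MIPS pipeline uses a different resource, the same check on a one-stage-per-resource layout finds no conflicts.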

#### Data Hazard



- Definition: instruction waits on result from prior instruction



- add instruction writes result to register \$s0 in stage 5
- sub instruction reads \$s0 in stage 2
- $\Rightarrow$  Stage 2 of sub has to be delayed
  - We overcome this in hardware
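How long stage 2 has to wait, if the hardware did nothing to help, follows from the stage numbers. The sketch below is an assumption-laden model, not the actual MIPS behaviour: `stall_cycles` and `same_cycle_read_ok` are hypothetical names, and whether a register can be read in the same cycle it is written depends on the register file design.

```python
# Stage numbers as used in the text: 1=IF, 2=ID, 3=EX, 4=MEM, 5=WB.
READ_STAGE = 2   # registers are read in stage 2
WRITE_STAGE = 5  # results are written in stage 5


def stall_cycles(distance: int, same_cycle_read_ok: bool = True) -> int:
    """Stall cycles for a consumer issued `distance` instructions after its
    producer, in a pipeline without any extra wiring.

    same_cycle_read_ok: assumption that a register written in a cycle can
    already be read in that same cycle.
    """
    write_cycle = WRITE_STAGE            # producer starts in cycle 1
    read_cycle = distance + READ_STAGE   # consumer starts in cycle 1 + distance
    earliest_read = write_cycle if same_cycle_read_ok else write_cycle + 1
    return max(0, earliest_read - read_cycle)


print(stall_cycles(1))                            # 2: sub directly after add
print(stall_cycles(1, same_cycle_read_ok=False))  # 3
print(stall_cycles(3))                            # 0: far enough apart
```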

# **Graphical Representation**





- IF: instruction fetch
- ID: instruction decode
- EX: execution
- MEM: memory access
- WB: write-back

# Add and Subtract





- Add wiring to pass output of ALU directly to next instruction

### Load and Subtract





- Add wiring from memory lookup to ALU
- Still 1 cycle unused: "pipeline stall" or "bubble"

### **Reorder Code**



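- Code with data hazard (left column)
- Reorder code (may be done by compiler): the third lw moves up (right column)
- Load instruction now completed in time

| Code with data hazard | Reordered code        |
|-----------------------|-----------------------|
| lw \$t1, 0(\$t0)      | lw \$t1, 0(\$t0)      |
| lw \$t2, 4(\$t0)      | lw \$t2, 4(\$t0)      |
| add \$t3, \$t1, \$t2  | lw \$t4, 8(\$t0)      |
| sw \$t3, 12(\$t0)     | add \$t3, \$t1, \$t2  |
| lw \$t4, 8(\$t0)      | sw \$t3, 12(\$t0)     |
| add \$t5, \$t1, \$t4  | add \$t5, \$t1, \$t4  |
| sw \$t5, 16(\$t0)     | sw \$t5, 16(\$t0)     |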
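The effect of the reordering can be counted. This is a rough sketch, assuming the extra wiring from the previous slides is in place, so that the only remaining cost is one unused cycle when an instruction reads a register loaded by the lw directly before it. The parser is a toy that only understands the instructions used here, and the function names are hypothetical.

```python
import re

ORIGINAL = """
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
"""

REORDERED = """
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
"""


def parse(program: str) -> list[tuple[str, list[str]]]:
    """Split each line into (opcode, [registers in order of appearance])."""
    parsed = []
    for line in program.strip().splitlines():
        opcode = line.split()[0]
        parsed.append((opcode, re.findall(r"\$\w+", line)))
    return parsed


def sources(opcode: str, regs: list[str]) -> list[str]:
    """Registers an instruction reads: sw reads all of its registers,
    lw and add read all but the first one (the destination)."""
    return regs if opcode == "sw" else regs[1:]


def load_use_stalls(program: str) -> int:
    """Count instructions that read a register loaded by the lw directly before them."""
    instructions = parse(program)
    stalls = 0
    for (prev_op, prev_regs), (op, regs) in zip(instructions, instructions[1:]):
        if prev_op == "lw" and prev_regs[0] in sources(op, regs):
            stalls += 1
    return stalls


print(load_use_stalls(ORIGINAL), load_use_stalls(REORDERED))  # 2 0
```

In the original order both add instructions follow the lw that produces one of their operands; after the reordering no instruction does.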

### **Control Hazard**



- Also called branch hazard
- Selection of next instruction depends on outcome of previous
- Example

  - add \$s0, \$t0, \$t1
  - beq \$s0, \$s1, ff40
  - sub \$t0, \$s0, \$t3

- sub instruction only executed if branch condition fails
- $\rightarrow$  cannot start until branch condition result known

### **Branch Prediction**



- Assume that branches are never taken
  - $\rightarrow$  full speed if correct
- More sophisticated
  - keep record of branch taken or not
  - make prediction based on history

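- Execution after a predicted branch is speculative: if the prediction is wrong, the instructions already started have to be discarded
- Related: "branch delay"
- Note: this is not a full explanation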
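The two strategies can be compared on a recorded sequence of branch outcomes. This is a toy sketch for a single branch: the "history" is just the last outcome (one bit), which is one simple choice among many and not necessarily what any particular processor does. Function names are hypothetical.

```python
def mispredictions_never_taken(outcomes: list[bool]) -> int:
    """Always predict 'not taken'; every taken branch is a misprediction."""
    return sum(outcomes)


def mispredictions_last_outcome(outcomes: list[bool], initial: bool = False) -> int:
    """Predict that the branch does what it did last time (1 bit of history)."""
    prediction = initial
    wrong = 0
    for taken in outcomes:
        if taken != prediction:
            wrong += 1
        prediction = taken  # record the outcome for the next prediction
    return wrong


# A loop branch: taken 9 times, then falls through once.
loop_branch = [True] * 9 + [False]
print(mispredictions_never_taken(loop_branch))   # 9
print(mispredictions_last_outcome(loop_branch))  # 2
```

For a branch that is rarely taken, the simple never-taken assumption does just as well; history pays off for branches such as loop ends that are taken most of the time.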



# pipelined data path



# **Pipelined** Datapath







# load























# store























# add























# write to register

# Which Register?





### Problem



- Write register
  - decoded in stage 2
  - used in stage 5

- Identity of register has to be passed along









# pipelined control

# **Pipelined Control**



- At each stage, information from instruction is needed
  - which ALU operation to execute
  - which memory address to consult
  - which register to write to
- This control information has to be passed through stages
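One way to picture this is a bundle of control values that moves from stage to stage together with its instruction. The sketch below is a simplification with hypothetical field names; in particular, the bundle is attached when the instruction enters the pipeline, whereas in hardware it only exists once the instruction has been decoded in stage 2. Register numbers 16 (\$s0) and 8 (\$t0) are used as examples.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


@dataclass(frozen=True)
class Control:
    """Control information for one instruction (field names are hypothetical)."""
    alu_op: str                    # which ALU operation to execute
    mem_access: Optional[str]      # "read", "write", or None
    write_register: Optional[int]  # register number to write in WB, or None


def register_writes(program: list[Control]) -> list[tuple[int, int]]:
    """Step a 5-stage pipeline and return (cycle, register number) for each write in WB."""
    # slots[i] is the control bundle of the instruction currently in STAGES[i].
    slots: list[Optional[Control]] = [None] * len(STAGES)
    waiting = list(program)
    writes = []
    cycle = 0
    while waiting or any(slot is not None for slot in slots):
        cycle += 1
        # Each bundle moves on to the next stage together with its instruction.
        entering = waiting.pop(0) if waiting else None
        slots = [entering] + slots[:-1]
        in_wb = slots[-1]
        if in_wb is not None and in_wb.write_register is not None:
            writes.append((cycle, in_wb.write_register))
    return writes


program = [
    Control(alu_op="add", mem_access=None, write_register=16),       # like an add writing $s0
    Control(alu_op="add", mem_access="write", write_register=None),  # like a sw: no register write
    Control(alu_op="add", mem_access="read", write_register=8),      # like a lw writing $t0
]
print(register_writes(program))  # [(5, 16), (7, 8)]
```

The write register is known early but only used when the bundle reaches WB, which is why it has to travel through every stage in between.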



# **Control Flags**



