## Lecture 13: Pipelining

Philipp Koehn

February 26, 2020

601.229 Computer Systems Fundamentals




# MIPS overview


Developed by MIPS Technologies in 1984, first product in 1986

- Used in
  - Silicon Graphics (SGI) Unix workstations
  - Digital Equipment Corporation (DEC) Unix workstations
  - Nintendo 64
  - Sony PlayStation
- Inspiration for ARM (esp. v8)
- 32 bit architecture (registers, memory addresses)
- 32 registers
- Multiply and divide instructions
- Floating point numbers

#### Mathematical view of addition

 $\mathsf{a}=\mathsf{b}+\mathsf{c}$ 




MIPS instruction

add a,b,c


a, b, c are registers

#### Some are special

- 0 \$zero always has the value 0
- 31 \$ra contains return address

Some have usage conventions

- 1 \$at reserved for pseudo-instructions
- 2-3 \$v0-\$v1 return values of a function call
- 4-7 \$a0-\$a3 arguments for a function call
- 8-15,24,25 \$t0-\$t9 temporaries, can be overwritten by function
- 16-23 \$s0-\$s7 saved, have to be preserved by function
- 26-27 \$k0-\$k1 reserved for kernel
- 28 \$gp global area pointer
- 29 \$sp stack pointer
- 30 \$fp frame pointer
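
The numbering above can be written down as a small lookup table. The following is a minimal Python sketch, not part of the original slides; the names `REGISTER_NAMES`, `REGISTER_NUMBER` and `must_preserve` are made up for illustration.

```python
# Register names in numeric order, following the list above.
REGISTER_NAMES = (
    ["$zero", "$at", "$v0", "$v1"]      # 0-3
    + [f"$a{i}" for i in range(4)]      # 4-7   arguments
    + [f"$t{i}" for i in range(8)]      # 8-15  temporaries $t0-$t7
    + [f"$s{i}" for i in range(8)]      # 16-23 saved
    + ["$t8", "$t9"]                    # 24-25 temporaries $t8-$t9
    + ["$k0", "$k1"]                    # 26-27 reserved for kernel
    + ["$gp", "$sp", "$fp", "$ra"]      # 28-31
)

# Reverse mapping: name -> register number.
REGISTER_NUMBER = {name: number for number, name in enumerate(REGISTER_NAMES)}


def must_preserve(name):
    """True for $s0-$s7, which a called function has to preserve."""
    return 16 <= REGISTER_NUMBER[name] <= 23


print(REGISTER_NUMBER["$t0"], REGISTER_NUMBER["$s0"], REGISTER_NUMBER["$ra"])
print(must_preserve("$s3"), must_preserve("$t3"))
```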

# Pipelining






- Theoretical speed-up: 3 times
- Actual speed-up in example: 2 times
  - sequential: 1:30+1:30+1:30+1:30 = 6 hours
  - pipelined: 1:30+0:30+0:30+0:30 = 3 hours
- Many tasks → speed-up approaches theoretical limit
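
The arithmetic above can be checked with a short Python sketch. It assumes, as the times 1:30 and 0:30 suggest, that each task consists of 3 stages of 30 minutes; the function names are illustrative only.

```python
def sequential_minutes(tasks, stages=3, stage_minutes=30):
    # Each task runs start to finish before the next one begins.
    return tasks * stages * stage_minutes


def pipelined_minutes(tasks, stages=3, stage_minutes=30):
    # The first task fills the pipeline; every further task
    # finishes one stage time after its predecessor.
    return stages * stage_minutes + (tasks - 1) * stage_minutes


def speed_up(tasks):
    return sequential_minutes(tasks) / pipelined_minutes(tasks)


for tasks in (4, 40, 400):
    print(tasks, round(speed_up(tasks), 3))
```

With 4 tasks this gives 360 / 180 = 2; with more tasks the ratio moves towards 3, the number of stages.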


# MIPS instruction pipeline


1. Fetch instruction from memory
2. Read registers and decode instruction (note: registers are always encoded in same place in instruction)
3. Execute operation OR calculate an address
4. Access an operand in memory
5. Write result into a register

#### Breakdown for each type of instruction

| Instruction class | Instr. fetch | Register read | ALU oper. | Data access | Register write | Total time |
|-------------------|--------------|---------------|-----------|-------------|----------------|------------|
| Load word (lw)    | 200ps        | 100ps         | 200ps     | 200ps       | 100ps          | 800ps      |
| Store word (sw)   | 200ps        | 100ps         | 200ps     | 200ps       |                | 700ps      |
| R-format (add)    | 200ps        | 100ps         | 200ps     |             | 100ps          | 600ps      |
| Branch (beq)      | 200ps        | 100ps         | 200ps     |             |                | 500ps      |
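
Each total in the last column is the sum of the stages that instruction class uses. The Python sketch below recomputes them from the table; the dictionary layout and key names are illustrative. Reading the slowest instruction (800ps) as the pace of a non-pipelined design and the slowest stage (200ps) as the pace of a pipelined one is an assumption, though it matches the timings used on these slides.

```python
# Stage latencies in picoseconds, copied from the table above.
STAGE_PS = {
    "lw":  {"fetch": 200, "reg_read": 100, "alu": 200, "data": 200, "reg_write": 100},
    "sw":  {"fetch": 200, "reg_read": 100, "alu": 200, "data": 200},
    "add": {"fetch": 200, "reg_read": 100, "alu": 200, "reg_write": 100},
    "beq": {"fetch": 200, "reg_read": 100, "alu": 200},
}

# Total time per instruction class.
totals = {op: sum(stages.values()) for op, stages in STAGE_PS.items()}

# Slowest whole instruction and slowest single stage.
slowest_instruction_ps = max(totals.values())
slowest_stage_ps = max(ps for stages in STAGE_PS.values() for ps in stages.values())

print(totals)
print(slowest_instruction_ps, slowest_stage_ps)
```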






| Time (ps)          | 0-200             | 200-400           | 400-600           | 600-800     | 800-1000    | 1000-1200   | 1200-1400  |
|--------------------|-------------------|-------------------|-------------------|-------------|-------------|-------------|------------|
| lw \$t1, 100(\$t0) | Instruction fetch | Reg. read         | ALU               | Data access | Reg. write  |             |            |
| lw \$t2, 104(\$t0) |                   | Instruction fetch | Reg. read         | ALU         | Data access | Reg. write  |            |
| lw \$t3, 108(\$t0) |                   |                   | Instruction fetch | Reg. read   | ALU         | Data access | Reg. write |

- Theoretical speed-up: 4 times
- Actual speed-up in example: 1.71 times
  - sequential: 800ps + 800ps + 800ps = 2400ps
  - pipelined: 1000ps + 200ps + 200ps = 1400ps
- Many tasks → speed-up approaches theoretical limit
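
The timing of the three lw instructions can be reproduced with a small Python sketch. It assumes every stage occupies one 200ps slot and a new instruction starts in every slot; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
CYCLE_PS = 200      # one pipeline slot
INSTRUCTION_PS = 800  # one lw without pipelining


def schedule(n_instructions):
    """Map (instruction index, stage) to its (start_ps, end_ps) slot."""
    return {
        (i, stage): ((i + k) * CYCLE_PS, (i + k + 1) * CYCLE_PS)
        for i in range(n_instructions)
        for k, stage in enumerate(STAGES)
    }


def pipelined_ps(n_instructions):
    # The last instruction starts n-1 slots late and needs 5 slots itself.
    return (len(STAGES) + n_instructions - 1) * CYCLE_PS


def speed_up(n_instructions):
    return n_instructions * INSTRUCTION_PS / pipelined_ps(n_instructions)


print(schedule(3)[(2, "WB")])       # slot of the last write-back
print(round(speed_up(3), 2))        # three instructions
print(round(speed_up(1000), 2))     # many instructions
```

For three instructions the last write-back ends at 1400ps, giving 2400 / 1400 ≈ 1.71. For long instruction sequences the ratio approaches 800 / 200 = 4.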

- All instructions are 4 bytes → easy to fetch next instruction
- Few instruction formats → parallel op decode and register read
- Memory access limited to load and store instructions → stage 3 used for memory address calculation, otherwise operation execution
- Words aligned in memory → a word can be read in one memory access (aligned = memory address multiple of 4)
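
Two of these properties reduce to very simple arithmetic, sketched here in Python. The function names are illustrative; the addresses are the ones used in the lw example on these slides.

```python
WORD_BYTES = 4


def is_aligned(address):
    # Aligned = memory address is a multiple of 4.
    return address % WORD_BYTES == 0


def next_instruction_address(pc):
    # All instructions are 4 bytes, so the sequential successor
    # is always at a fixed offset from the current instruction.
    return pc + WORD_BYTES


print([is_aligned(a) for a in (100, 104, 108, 102)])
print(next_instruction_address(0x400000) == 0x400004)
```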

# Hazards


Hazard = next instruction cannot be executed in next clock cycle


- Types
  - structural hazard
  - data hazard
  - control hazard

- Structural hazard definition: instructions overlap in resource use in same stage
- For instance: memory access conflict

[Figure: pipeline diagram over cycles 1-7. Instructions i1 and i2 each pass through FETCH, DECODE, MEMORY, MEMORY, ALU, REGISTER, with i2 starting one cycle after i1; their MEMORY stages overlap (conflict).]

MIPS designed to avoid structural hazards
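
A structural hazard can be found mechanically by checking whether two instructions need the same resource in the same cycle. The Python sketch below is an illustration under two assumptions: the hypothetical six-step sequence is one reading of the conflict figure, and each stage name stands for its own piece of hardware (for MIPS this means, for example, separate instruction and data memory).

```python
from collections import defaultdict


def resource_conflicts(stage_sequence, start_cycles):
    """Return (cycle, resource, instructions) wherever a resource is used twice."""
    usage = defaultdict(list)
    for name, start in start_cycles.items():
        for offset, resource in enumerate(stage_sequence):
            usage[(start + offset, resource)].append(name)
    return sorted(
        (cycle, resource, names)
        for (cycle, resource), names in usage.items()
        if len(names) > 1
    )


# Hypothetical instruction that needs memory in two consecutive cycles.
TWO_MEMORY_STEPS = ["FETCH", "DECODE", "MEMORY", "MEMORY", "ALU", "REGISTER"]
# MIPS: every instruction uses each stage at most once.
MIPS_STAGES = ["IF", "ID", "EX", "MEM", "WB"]

starts = {"i1": 1, "i2": 2}
print(resource_conflicts(TWO_MEMORY_STEPS, starts))
print(resource_conflicts(MIPS_STAGES, starts))
```

In the first case i1 and i2 both need MEMORY in cycle 4; in the second case no resource is requested twice.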

- Data hazard definition: instruction waits on result from prior instruction
- Example

add \$s0, \$t0, \$t1
sub \$t0, \$s0, \$t3

- add instruction writes result to register \$s0 in stage 5
- sub instruction reads \$s0 in stage 2
- ⇒ Stage 2 of sub has to be delayed
- We overcome this in hardware

## Graphical Representation



- IF: instruction fetch
- ID: instruction decode
- EX: execution
- MEM: memory access
- WB: write-back



Add wiring to the circuit that feeds the output of the ALU directly to the next instruction (forwarding)

## Load and Subtract



- Add wiring from memory lookup to ALU
- Still 1 cycle unused: "pipeline stall" or "bubble"
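
The number of unused cycles follows from when a result becomes available and when the dependent instruction needs it. The Python sketch below is a simplified model, not a description of the actual hardware: it assumes a value produced in one cycle can be consumed no earlier than the following cycle, and that without the extra wiring the value must pass through the register file (written in stage 5, read in stage 2).

```python
STAGE = {"IF": 1, "ID": 2, "EX": 3, "MEM": 4, "WB": 5}


def stall_cycles(producer, forwarding, distance=1):
    """Bubbles before a dependent instruction issued `distance` cycles later.

    producer: "alu" (result computed in EX) or "load" (result read in MEM).
    """
    if forwarding:
        available = STAGE["EX"] if producer == "alu" else STAGE["MEM"]
        needed = STAGE["EX"]    # value is wired straight into the ALU input
    else:
        available = STAGE["WB"]  # value only reachable via the register file
        needed = STAGE["ID"]
    # The producer is in stage s during cycle s. The consumer would reach
    # its `needed` stage in cycle distance + needed, which has to be
    # strictly later than cycle `available`.
    return max(0, available - needed - distance + 1)


print(stall_cycles("alu", forwarding=False))   # add followed by sub, no extra wiring
print(stall_cycles("alu", forwarding=True))    # add followed by sub, ALU output wired back
print(stall_cycles("load", forwarding=True))   # lw followed by sub, memory output wired to ALU
```

Under these assumptions the add/sub pair needs 3 bubbles without forwarding and none with it, while a load followed directly by its use still leaves 1 cycle unused.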

Code with data hazard

lw \$t1, 0(\$t0)
lw \$t2, 4(\$t0)
add \$t3, \$t1, \$t2
sw \$t3, 12(\$t0)
lw \$t4, 8(\$t0)
add \$t5, \$t1, \$t4
sw \$t5, 16(\$t0)




Reorder code (may be done by compiler)

lw \$t1, 0(\$t0)
lw \$t2, 4(\$t0)
lw \$t4, 8(\$t0)
add \$t3, \$t1, \$t2
sw \$t3, 12(\$t0)
add \$t5, \$t1, \$t4
sw \$t5, 16(\$t0)

Load instruction now completed in time
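
The effect of the reordering can be counted with a short Python sketch. It assumes forwarding is in place, so the only remaining stall is a load followed immediately by an instruction that reads the loaded register (1 bubble). The parsing is deliberately minimal and only handles the lw, sw and add forms used here; all function names are illustrative.

```python
import re


def registers(instruction):
    """Split an instruction into (opcode, register operands in written order)."""
    opcode, operands = instruction.split(None, 1)
    return opcode, re.findall(r"\$\w+", operands)


def load_use_stalls(program):
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_regs = registers(previous)
        cur_op, cur_regs = registers(current)
        # sw only reads registers; for lw and add the first operand is the destination.
        sources = cur_regs if cur_op == "sw" else cur_regs[1:]
        if prev_op == "lw" and prev_regs[0] in sources:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    # The first instruction needs 5 cycles, each further one adds a cycle, plus bubbles.
    return stages + len(program) - 1 + load_use_stalls(program)


original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
print(load_use_stalls(original), total_cycles(original))
print(load_use_stalls(reordered), total_cycles(reordered))
```

In this model the original order has 2 load-use stalls (13 cycles) and the reordered code has none (11 cycles).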

Clicker quiz omitted from public slides


- Control hazard, also called branch hazard
- Selection of next instruction depends on outcome of previous
- Example

add \$s0, \$t0, \$t1
beq \$s0, \$s1, ff40
sub \$t0, \$s0, \$t3

- sub instruction only executed if branch condition fails → cannot start until branch condition result known

- Assume that branches are never taken → full speed if correct
- More sophisticated
  - keep record of branch taken or not
  - make prediction based on history
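
The two strategies can be compared on a recorded sequence of branch outcomes. The Python sketch below uses the simplest possible history, a single remembered outcome per branch; this scheme and the example outcome sequence are assumptions chosen for illustration, not necessarily what a given processor implements.

```python
def mispredictions_never_taken(outcomes):
    # Predicting "not taken" is wrong exactly when the branch is taken.
    return sum(1 for taken in outcomes if taken)


def mispredictions_last_outcome(outcomes, initial=False):
    """Predict that the branch behaves as it did the last time."""
    prediction, misses = initial, 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken  # keep record of branch taken or not
    return misses


# Hypothetical loop-closing branch: taken 9 times, then falls through.
outcomes = [True] * 9 + [False]
print(mispredictions_never_taken(outcomes))
print(mispredictions_last_outcome(outcomes))
```

For this sequence the never-taken assumption is wrong 9 times, while the history-based prediction is wrong only twice (on the first and the last execution of the branch).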

# Pipelined data path


## Datapath



#### Pipelined Datapath



## Load










## Store







## Add








## Write to register


## Which Register?



- Write register
  - decoded in stage 2
  - used in stage 5
- Identity of register has to be passed along
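
Passing the register identity along can be pictured as a value moving through the storage between stages, one step per cycle. The Python sketch below is a toy model; the latch names `ID/EX`, `EX/MEM` and `MEM/WB` follow common textbook convention and are an assumption here, as is the function name.

```python
def run_pipeline(write_registers):
    """Return (cycle, register number) for each write performed in stage 5.

    write_registers: destination register number of each instruction,
    in program order, one instruction fetched per cycle.
    """
    latches = {"ID/EX": None, "EX/MEM": None, "MEM/WB": None}
    fetched = None  # instruction waiting between stage 1 and stage 2
    written = []
    # Four extra cycles let the last instruction drain from the pipeline.
    stream = list(write_registers) + [None] * 4
    for cycle, incoming in enumerate(stream, start=1):
        # Stage 5 uses the register number that travelled with the instruction.
        if latches["MEM/WB"] is not None:
            written.append((cycle, latches["MEM/WB"]))
        # End of cycle: every latch takes over the value of the one before it.
        latches["MEM/WB"] = latches["EX/MEM"]
        latches["EX/MEM"] = latches["ID/EX"]
        latches["ID/EX"] = fetched  # stage 2 has decoded the write register
        fetched = incoming          # stage 1 fetches the next instruction
    return written


# Two instructions writing $s0 (16) and $t0 (8), as in add / sub.
print(run_pipeline([16, 8]))
```

The first instruction writes register 16 in cycle 5 and the second writes register 8 in cycle 6, each using the number decoded three cycles earlier.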


#### Data Path for Write Register



# Pipelined control


- At each stage, information from instruction is needed
  - which ALU operation to execute
  - which memory address to consult
  - which register to write to
- This control information has to be passed through stages
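
One way to picture this is a bundle of control fields that shrinks as the instruction advances, since each stage consumes its own part. The Python sketch below is purely illustrative: the `Control` class, its field names and the grouping by stage are assumptions, not the actual control signals of the MIPS datapath.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    alu_operation: str              # needed in stage 3 (EX)
    memory_access: Optional[str]    # "read", "write" or None; needed in stage 4 (MEM)
    write_register: Optional[int]   # needed in stage 5 (WB)


# Fields that still have to be carried when the instruction enters a stage.
FIELDS_FROM_STAGE = {
    3: ["alu_operation", "memory_access", "write_register"],
    4: ["memory_access", "write_register"],
    5: ["write_register"],
}


def remaining_control(control, stage):
    return {name: getattr(control, name) for name in FIELDS_FROM_STAGE[stage]}


# lw $t1, 100($t0): add base and offset, read memory, write register 9 ($t1).
lw = Control(alu_operation="add", memory_access="read", write_register=9)
for stage in (3, 4, 5):
    print(stage, remaining_control(lw, stage))
```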


## Pipelined Control



## Control Flags

