### Performance Evaluation of a Reconfigurable Instruction Set Processor

Mehdipour, Farhad

Faculty of Information Science and Electrical Engineering, Kyushu University

Noori, Hamid

Faculty of Information Science and Electrical Engineering, Kyushu University

Honda, Hiroaki

Faculty of Information Science and Electrical Engineering, Kyushu University

Inoue, Koji

Faculty of Information Science and Electrical Engineering, Kyushu University

et al.

https://doi.org/10.15017/14879

Publication information: SLRC Presentation, pp.1-, 2008-11-24. System LSI Research Center, Kyushu University




# Performance Evaluation of a Reconfigurable Instruction Set Processor

Farhad Mehdipour, H. Noori, H. Honda, K. Inoue, K. Murakami

Faculty of Information Science and Electrical Engineering, KYUSHU UNIVERSITY, Fukuoka, JAPAN

{farhad@c.csce.kyushu-u.ac.jp}

### Outline

- Reconfigurable Instruction Set Processors
- A Combined Analytical and Simulation-Based Model (CAnSO)
  - Model Extraction and Calibration
  - Basic Model Definitions
  - Speedup Formulations
  - Simplification and Calibration
- Experiments
  - Experimental Setup
  - Model Validation
  - Design Space Exploration Using CAnSO
  - Effects of Modifications
- Conclusions and Future Work

#### Designing Embedded Systems

- Embedded Microprocessors
- Application-Specific Integrated Circuits (ASICs)
- Application-Specific Instruction set Processors (ASIPs)
- Extensible Processors



#### **Extensible Processors**

#### Mechanism

- Acceleration using a custom functional unit (CFU)
- Extra hardware is added to the base processor
- The added hardware executes the hot portions of applications









#### **Extensible Processors**

- Base processor (BP)'s fixed instruction set + custom instructions
- Goals
  - Improving performance and energy efficiency
  - Maintaining compatibility and flexibility

[Figure: CPU with Instruction Dispatcher, Register File, LD/ST (Load/Store) and CFU (Custom Functional Unit); label: Instructions]

### Custom Instructions

- Instruction set customization $\leftrightarrow$ hardware/software partitioning (identifying critical segments in applications)
- Custom Instructions (CIs) are
  - extracted from critical segments of an application and
  - executed on a Custom Functional Unit (CFU)

#### Critical segments:

Most frequently executed (Hot) portions of the applications



### Extensible Processors

#### Drawbacks

- Lack of flexibility
- Long, costly design and verification
- Many issues associated with designing a new processor from scratch:
  - longer time-to-market and
  - significant NRE (Non-Recurring Engineering) costs

#### Solution

Using a Reconfigurable Functional Unit (RFU) instead of a fixed-architecture CFU

# Reconfigurable Processors



Reconfigurable Processor

#### Processor coupling





#### Reconfigurable Instruction Set Processors (RISPs)

- Custom instructions are generated and added after fabrication
- A reconfigurable FU (RFU) is used instead of a custom FU

**CFU: Custom Functional Unit** 

**RFU: Reconfigurable Functional Unit** 



#### How a RISP Works

**A Hot Basic Block**

| Address | Instruction |
| --- | --- |
| 400680 | `subiu $25,$25,1` |
| 400688 | `lbu $13,0($7)` |
| 400690 | `lbu $2,0($4)` |
| 400698 | `sll $2,$2,0x18` |
| 4006a0 | `sra $14,$2,0x18` |
| 4006a8 | `addiu $4,$4,1` |
| 4006b0 | `srl $8,$2,0x1c` |
| 4006b8 | `sll $2,$8,0x2` |
| 4006c0 | `addu $2,$2,$25` |
| 4006c8 | `lw $2,0($2)` |
| 4006d0 | `xori $13,$13,1` |
| 4006d8 | `addu $10,$10,$2` |
| 4006e0 | `bgez $10,4006f0` |

[Figure: RISP consisting of the Baseline Processor (ALU, Register File), the RAC and the Configuration Memory; the instructions at 400680, 400698, 4006a0 and 400688 appear a second time in the figure.]

**RAC = RFU:** Reconfigurable Accelerator

**GPP:** General Purpose Processor

#### RISP Benefits and Drawbacks

#### Benefits

- Specialized datapath
  - Shared hardware
  - Higher Speedup
  - Less power consumption





#### **Drawbacks**

- More area
- Difficult to use

### Performance Evaluation of a RISP

- Performance evaluation of a RISP is a challenge when
  - designing a RISP architecture
  - optimizing an existing architecture for an objective function
- For a designer
  - obtaining the optimum system configuration is desirable
  - a performance analysis in terms of performance metrics (speedup, area, and so on) is required
- Performance evaluation models
  - Structural models: empirical studies based on measurements and simulations of the target system
  - Analytical models: use a (usually simplified) system structure to obtain mathematically solvable models

### Fraction of Dynamic Instructions in Applications



On average, the RAC executes almost 30% of the dynamic instructions of the applications.

#### Model Extraction and Utilization



#### General Template of a RISP



### Basic Model Definitions

- Base Processor
  - an in-order, general five-stage RISC processor
- RAC
  - a coarse-grained, tightly coupled reconfigurable hardware
- CIs are indexed for direct access to the configuration bit-stream
- The contents of all registers are sent to the RAC (shared RF)
- Controlling configurations
  - Hardware-based: the starting address of a CI and its index into the configuration memory are stored in a CAM for quick retrieval
  - Software-based: the starting address of a CI is replaced with a special instruction
- Memory accesses
- Control instructions
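The hardware-based control scheme can be pictured as a lookup from a CI's starting address to its configuration index. The following is only a minimal sketch of that idea: the class name `ConfigCAM`, its methods, and the example address/index pair are hypothetical and not taken from the slides, and a Python dictionary merely stands in for the CAM.

```python
from typing import Dict, Optional


class ConfigCAM:
    """Toy stand-in for the CAM: maps a CI's starting address to its configuration index."""

    def __init__(self) -> None:
        self._entries: Dict[int, int] = {}

    def register(self, start_address: int, config_index: int) -> None:
        # Store the (starting address, configuration-memory index) pair for one CI.
        self._entries[start_address] = config_index

    def lookup(self, pc: int) -> Optional[int]:
        # A hit means the instruction at `pc` starts a CI; a miss returns None,
        # in which case the instruction would run on the base processor.
        return self._entries.get(pc)


cam = ConfigCAM()
cam.register(0x400680, 0)     # hypothetical: a CI starting at 0x400680 uses configuration 0
hit = cam.lookup(0x400680)    # configuration index of the CI
miss = cam.lookup(0x400688)   # not the start of a registered CI
```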

#### Single and Continuous Executions



### Speedup Formulation

Fraction of instructions executing on the RAC, where $\tau_{BP}^{i}$ is the latency of executing the instructions of $CI_i$ on the BP:

$$f_{RAC} = \frac{\sum_{i=1}^{n_{CI}} \left(\tau_{BP}^{i} \times O_{i}\right)}{n_{tcc}}$$

Fraction of instructions executing on the BP:

$$f_{BP} = 1 - f_{RAC}$$

Overall speedup, where the numerator $n_{tcc}$ is the execution time on the BP and the denominator is the execution time when the CIs run on the RAC:

$$s_{o} = \frac{n_{tcc}}{\left(n_{tcc} - \sum_{i=1}^{n_{CI}} \left(\tau_{BP}^{i} \times O_{i}\right)\right) + \psi(\theta, \tau)}$$

$$\psi(\theta, \tau) = \sum_{i=1}^{n_{CI}} \left( \sum_{j \in S_i} \left( \theta_{ij} \times \left( \tau_{RAC} + \tau_{OVH} \right) \right) + \sum_{j \in C_i} \left( \left( \tau_{RAC} + \tau_{OVH} \right) + \left( \left( \theta_{ij} - 1 \right) \times \tau_{RAC} \right) \right) \right)$$

- $\theta_{ij}$: frequency of the $j$th occurrence of $CI_i$
- $\tau_{RAC}$, $\tau_{OVH}$: latency of the RAC and the reconfiguration overhead time
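The speedup formulation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tool: the class and function names are made up, $O_i$ is assumed to be the total number of executions of $CI_i$ (the sum of its $\theta_{ij}$), $S_i$ and $C_i$ are taken to index the single and continuous executions, and all numbers in the example are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CustomInstruction:
    """One custom instruction CI_i (field names are illustrative)."""
    tau_bp: int                                           # tau_BP^i, cycles on the BP
    single: List[int] = field(default_factory=list)       # theta_ij for j in S_i
    continuous: List[int] = field(default_factory=list)   # theta_ij for j in C_i

    @property
    def occurrences(self) -> int:
        # Assumption: O_i is the total number of executions of CI_i.
        return sum(self.single) + sum(self.continuous)


def psi(cis, tau_rac, tau_ovh):
    """Cycles spent on the RAC, including reconfiguration overhead."""
    total = 0
    for ci in cis:
        # Single executions: every execution pays the overhead.
        total += sum(theta * (tau_rac + tau_ovh) for theta in ci.single)
        # Continuous executions: only the first execution of a run pays it.
        total += sum((tau_rac + tau_ovh) + (theta - 1) * tau_rac
                     for theta in ci.continuous)
    return total


def speedup(n_tcc, cis, tau_rac, tau_ovh):
    """Overall speedup s_o = n_tcc / ((n_tcc - BP cycles of the CIs) + psi)."""
    moved = sum(ci.tau_bp * ci.occurrences for ci in cis)
    return n_tcc / ((n_tcc - moved) + psi(cis, tau_rac, tau_ovh))


# Invented example: one CI taking 6 cycles on the BP, executed 10 times singly
# and once as a run of 20 back-to-back executions.
cis = [CustomInstruction(tau_bp=6, single=[10], continuous=[20])]
rac_cycles = psi(cis, tau_rac=2, tau_ovh=1)       # 10*3 + (3 + 19*2) = 71
s_o = speedup(1000, cis, tau_rac=2, tau_ovh=1)    # 1000 / (1000 - 180 + 71)
```

With no RAC cycles at all ($\psi = 0$), the same expression reduces to $1 / (1 - f_{RAC})$, which bounds the achievable speedup for a given RAC fraction.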

ISOCC 2008 @ Busan, South Korea


### The Effect of CI Length

#### Large CIs

Large CIs include more instructions than the number of available resources in the RAC.

#### Temporal Partitioning

Larger CIs are divided into a number of smaller CIs.

$$\begin{aligned}
L &= \{k \mid k \in \{1, \ldots, n_{CI}\},\ l_k > n_{FU}\} \\
m'_{k} &= O_k \times p_k \ (k \in L), \qquad m'_{k} = m_k \ (k \notin L) \\
\theta'_{k} &= ((1, \ldots, 1), (1, \ldots, 1), \ldots, (1, \ldots, 1)),\ |\theta'_{k}| = m'_{k} \ (k \in L), \qquad \theta'_{k} = \theta_{k} \ (k \notin L) \\
S'_{k} &= \{1, \ldots, m'_{k}\} \ (k \in L), \qquad S'_{k} = S_{k} \ (k \notin L)
\end{aligned}$$
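A small sketch of the bookkeeping for one large CI follows. The slides do not define $p_k$; here it is assumed to be the number of partitions, computed as $\lceil l_k / n_{FU} \rceil$, which is a guess made for illustration. The function name and the example numbers are likewise hypothetical, and $\theta'_k$ is flattened into a plain list of ones.

```python
import math


def partition_large_ci(l_k: int, o_k: int, n_fu: int):
    """Return (p_k, m'_k, theta'_k) for a CI with l_k instructions and O_k occurrences."""
    if l_k <= n_fu:
        raise ValueError("CI fits in the RAC (k not in L); nothing to partition")
    # Assumption: p_k is the number of partitions needed to fit n_fu resources.
    p_k = math.ceil(l_k / n_fu)
    # m'_k = O_k * p_k: each occurrence becomes p_k separate executions.
    m_prime = o_k * p_k
    # theta'_k consists of ones only, so every partition is a single execution.
    theta_prime = [1] * m_prime
    return p_k, m_prime, theta_prime


# Invented example: a 20-instruction CI on a RAC with 16 FUs, executed 5 times.
p_k, m_prime, theta_prime = partition_large_ci(l_k=20, o_k=5, n_fu=16)
```

Because every entry of $\theta'_k$ is 1, each partition pays the reconfiguration overhead on every execution, which is why large CIs tend to gain less.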



#### Control Instructions

- The rate of mispredicted branches might be reduced → higher speedup
- Instruction Cache Misses
  - no need to fetch the instructions belonging to the CIs
  - access and miss rates of the instruction cache are reduced
  - BP fraction reduces → speedup increases

With $x \in \{b, i\}$ denoting branch mispredictions ($b$) and instruction cache misses ($i$):

- $p_{xm}$: number of penalty cycles for a branch misprediction/cache miss
- $\delta_{xm}$: variation in the number of branch mispredictions/cache misses

$$s_{o} = \frac{n_{tcc}}{\left(n_{tcc} - \sum_{x \in \{b,i\}} \delta_{xm} \times p_{xm} - \sum_{i=1}^{n_{CI}} \left(\tau_{BP}^{i} \times O_{i}\right)\right) + \psi'(\theta', \tau)}$$
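The effect of fewer mispredictions and cache misses on the speedup can be sketched as below. This reads $\delta_{xm}$ as the reduction in the number of events of kind $x$ and subtracts the saved penalty cycles from the BP time, which is one interpretation of the formula above rather than something the slides spell out. The function name, parameter names and all numbers are invented.

```python
def speedup_with_misses(n_tcc, bp_cycles_moved, psi_prime, delta, penalty):
    """s_o with penalty cycles saved on branch mispredictions ('b') and i-cache misses ('i').

    bp_cycles_moved: sum over CIs of tau_BP^i * O_i (cycles removed from the BP)
    psi_prime:       cycles spent on the RAC, psi'(theta', tau)
    delta, penalty:  dicts keyed by 'b' and 'i' (delta_xm and p_xm)
    """
    # Assumption: delta[x] counts avoided events, so delta[x] * penalty[x] is saved time.
    saved = sum(delta[x] * penalty[x] for x in ("b", "i"))
    return n_tcc / ((n_tcc - saved - bp_cycles_moved) + psi_prime)


# Invented example: 2 fewer mispredictions at 3 cycles each and
# 3 fewer i-cache misses at 10 cycles each save 36 cycles in total.
s_o_miss = speedup_with_misses(
    n_tcc=1000,
    bp_cycles_moved=180,
    psi_prime=71,
    delta={"b": 2, "i": 3},
    penalty={"b": 3, "i": 10},
)
```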

### RF's Input/Output Ports

- Register file is shared between BP and RAC
- Additional clock cycles for reading/writing from/to the RF



### The Assumed RAC Architecture



# RAC's Delay

- All FUs in the RAC implement similar operations
- Each mux receives
  - all outputs of the FUs in upper rows and
  - outputs from its adjacent FUs in the same row

$$\begin{aligned}
{\tau_{RAC}}_{h}^{w} &= \sum_{i=1}^{h} \tau_{FU} + \sum_{i=1}^{h-1} {\tau_{MUX}}_{i}^{k}, \quad k \in \{0, 1, \ldots, w\} \\
\psi(\theta,\tau) &= \sum_{i=1}^{n_{CI}} \left( \sum_{j \in S_{i}} \left( \theta_{ij} \times \left( \tau_{RAC} + \tau_{OVH} \right) \right) + \sum_{j \in C_{i}} \left( \left( \tau_{RAC} + \tau_{OVH} \right) + \left( \left( \theta_{ij} - 1 \right) \times \tau_{RAC} \right) \right) \right)
\end{aligned}$$
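The delay expression adds one FU delay per row and one mux delay per boundary between rows. A minimal sketch follows, assuming the per-row mux delays are supplied as a list; the function name and the delay values are hypothetical.

```python
def rac_delay(h, tau_fu, tau_mux):
    """Critical-path delay of a RAC with h rows of FUs.

    tau_fu:  delay of one FU (all FUs are assumed identical)
    tau_mux: list of h-1 mux delays, one per boundary between consecutive rows
    """
    if len(tau_mux) != h - 1:
        raise ValueError("expected one mux delay per row boundary (h - 1 values)")
    # Sum of h FU delays plus the h-1 mux delays on the path.
    return h * tau_fu + sum(tau_mux)


# Invented example: 4 rows, 1.0 ns per FU, muxes get slower in lower rows
# because they receive more inputs.
delay = rac_delay(h=4, tau_fu=1.0, tau_mux=[0.25, 0.25, 0.5])
```

Since both sums grow with $h$, a taller RAC has a longer critical path, which is the effect behind the speedup decline reported for large heights.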

### Simplification and Calibration

- Control instructions are not supported
- Reduction in instruction cache accesses as well as cache misses
  - average reduction in accesses to the i-cache is almost 17%
  - average i-cache miss rate is almost 3%

[Figure: i-cache accesses and misses; averages: accesses 17%, misses 3%]





$$s_{o} = \frac{n_{tcc}}{\left(n_{tcc} - \delta_{im} \times p_{im} - \sum_{i=1}^{n_{CI}} \left(\tau_{BP}^{i} \times O_{i}\right)\right) + \psi'\left(\theta', \tau_{RAC}^{w} + \tau_{OVH}\right)}$$

$$\psi'\left(\theta', \tau_{RAC}^{w} + \tau_{OVH}\right) = \sum_{i=1}^{n_{CI}} \sum_{j \in S'_{i}} \left(\theta'_{ij} \times \left(\tau_{RAC}^{w} + \tau_{OVH}\right)\right)$$

## Experimental Setup

- Fourteen applications from MiBench
  - automotive, security, consumer, network, telecommunication
- CIs (DFGs) are extracted from the applications
- SimpleScalar's cycle-accurate simulator is extended to simulate a reconfigurable instruction set processor
- Model Establishment
  - simulating all applications
  - collecting required information
  - model simplification and calibration

About 4 hours to completion on a PC: dual-core Intel 6600 @ 2400 MHz, 2 GB RAM



### Design Space Exploration Using CAnSO

- The design of a RAC including different components entails a multitude of design parameters
- Examining 100 design points using 14 applications:
  - Simulation: 17 days
  - CAnSO: 4 hours
- Using CAnSO, re-simulation is not needed after establishing the model
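
As a rough worked comparison of the two figures above (assuming 24-hour days of continuous simulation, which the slide does not state):

$$\frac{17\ \text{days} \times 24\ \text{h/day}}{4\ \text{h}} = \frac{408\ \text{h}}{4\ \text{h}} = 102$$

That is, for these 100 design points CAnSO is on the order of 100 times faster than simulating each point.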

#### Using CAnSO for Design Space Exploration of the RAC



Small heights → very low speedup

Height > 5: the RAC's longer critical path delay → speedup declines

### Effect of Modifications

#### Applying a modification to the design →

- Little time is required to repeat the simulation
- Each iteration of CAnSO takes less than a minute





### Conclusions and Future Work

- Reconfigurable instruction set processors
- A combined analytical and simulation-based model (CAnSO)
- Suitable for exploring a large design space for the accelerator
- Sufficiently flexible for rapid evaluation of modified target architectures
- Substantially reduces the design or optimization time while preserving reasonable accuracy
- CAnSO shows less than 2% variation in evaluation results
- Uncalibrated CAnSO shows a 22% difference on average
- Future work:
  - Expanding CAnSO to support control instructions
  - Considering more complicated RAC architectures