## Optimizing the architecture of SFQ-RDP (Single Flux Quantum-Reconfigurable Datapath)

Mehdipour, Farhad Graduate School of Information Science and Electrical Engineering, Kyushu University

Honda, Hiroaki Institute of Systems, Information Technologies and Nanotechnologies (ISIT)

Kataoka, Hiroshi Graduate School of Information Science and Electrical Engineering, Kyushu University

Inoue, Koji Graduate School of Information Science and Electrical Engineering, Kyushu University

et al.

https://doi.org/10.15017/14877

Publication information: SLRC Presentation, pp.1-, 2009-06-15. System LSI Research Center, Kyushu University

Version:

Rights:

Optimizing the Architecture of SFQ-RDP (Single Flux Quantum-Reconfigurable Datapath)

**F. Mehdipour**\*, Hiroaki Honda**\*\***, H. Kataoka\*, K. Inoue\* and K. Murakami\*

\*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

\*\*Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: <u>farhad@c.csce.kyushu-u.ac.jp</u>



#### CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits



#### Agenda



- Introduction
- Large-Scale Reconfigurable Data-Path (LSRDP) General Architecture and Specifications
- Design Procedure and Tool Chain
- Preliminary Results
- Conclusions and Future Work



#### Introduction



- Various accelerators are used together with general-purpose processors (GPPs) to improve performance
  - PowerXCell, GPU, GRAPE-DR, ClearSpeed, etc.
  - Small size and low power consumption compared with processors of similar performance



NVIDIA Tesla S1070 http://www.nvidia.com



#### **Acceleration Through a Data-Path Processor**

- Mechanism
  - Acceleration by using a data-path accelerator
  - The accelerator is attached to the base processor
  - Hot portions of applications are executed on the accelerator













#### How a Reconfigurable Processor Works







#### **Motivation**



Conventional accelerators:

- Demand a large memory bandwidth for high-performance computation
- Often use on-chip memories to hide memory access latency



Large-Scale Reconfigurable Data-Path (LSRDP):

- Is introduced as an alternative accelerator
- Reduces the number of memory accesses by exploiting its data-path



#### Outline of the Large-Scale Reconfigurable Data-Path (LSRDP) Processor



- The reconfigurable data-path includes:
  - A large number of floating-point Functional Units (FUs) arranged as arrays
  - A reconfigurable Operand Routing Network (ORN)
  - Dynamic reconfiguration facilities
  - Streaming Buffer (SB) for I/O ports
- Features:
  - Data Flow Graphs (DFGs) extracted from critical calculation parts are mapped directly onto the data-path
  - Pipelined execution
  - Burst transfers are used to move rearranged input/output data from/to memory

#### Single-Flux Quantum (SFQ) versus CMOS

- CMOS issues (if the LSRDP has 32x32 FUs):
  - High electric power consumption
  - High heat dissipation and difficulty in high-density packing





#### **Goals of the Project**



- Discovering appropriate scientific applications
- Developing compiler tools
- Developing performance analyzing tools

- Designing and implementing the SFQ-LSRDP architecture, considering the features and limitations of SFQ circuits



# LSRDP General Architecture and Specifications



#### Parameters to Be Decided Within the LSRDP Design Procedure



- Core structure: a rectangular matrix of PEs
- PE: a combination of a Functional Unit (FU) and a data Transfer Unit (TU)
- Width and height?
- Maximum Connection Length (MCL) between consecutive rows? (A full crossbar is impossible to implement.)
- Layout: FU types (ADD/SUB and MUL)?
- Reconfiguration mechanism? (PE, ORN, immediate data)
- On-chip memory configuration?

#### **LSRDP** Architecture

- Processing Elements (PEs)
  - FU
    - Implements basic 64-bit double-precision floating-point operations: ADD, SUB and MUL
  - TU (Transfer Unit)
    - A routing resource for transferring data from a row to a non-adjacent row







Flexible, but consumes a lot of resources

#### Layout Types: Type II (Checkered)





SSV 2009





#### MCL: maximum horizontal distance between two PEs located in two subsequent rows
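As a minimal sketch of this definition, the MCL of a mapped DFG can be computed from the PE coordinates of each connection. The `placement` dictionary, the edge list, and the function name are hypothetical illustrations and are not taken from the project's tool chain.

```python
def max_connection_length(placement, edges):
    """Return the MCL of a mapping.

    placement: dict mapping a node name to its (row, column) PE position.
    edges: iterable of (source, destination) node pairs.
    Assumption: every edge connects two subsequent rows, as in the
    definition above; anything else is rejected.
    """
    mcl = 0
    for src, dst in edges:
        src_row, src_col = placement[src]
        dst_row, dst_col = placement[dst]
        if dst_row - src_row != 1:
            raise ValueError(f"edge {src}->{dst} does not link subsequent rows")
        # Horizontal distance between the two PEs.
        mcl = max(mcl, abs(src_col - dst_col))
    return mcl


# Hypothetical 2-row mapping: three producers feeding two consumers.
placement = {"a": (0, 0), "b": (0, 2), "c": (0, 5), "x": (1, 1), "y": (1, 4)}
edges = [("a", "x"), ("b", "x"), ("b", "y"), ("c", "y")]
print(max_connection_length(placement, edges))
```

A smaller MCL means that each ORN needs to reach fewer columns of the next row.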





A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.

#### **Dynamic Reconfiguration Mechanism**





#### **Dynamic Reconfiguration Architecture**



**Three bit-stream lines for dynamic reconfiguration of:** 

- Immediate registers (64-bit) in each PE
- Selector bits for muxes selecting the input data of FUs
- Cross-bar switches in ORNs

#### Design Procedure and Tool Chain





- DFGs are manually generated from critical parts of applications
- DFG mapping results are used for
  - Analyzing LSRDP architecture statistics
  - Generating LSRDP configuration bit-streams

#### Benchmark Applications for Design Procedures

- Finite difference method calculation of 2<sup>nd</sup>-order partial differential equations
  - 1-dim. heat equation (Heat)
  - 1-dim. vibration equation (Vibration)
  - 2-dim. Poisson equation (Poisson)
- Quantum chemistry application
  - Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)

#### Only ADD/SUB and MUL operations are used in the critical calculations of all above applications



#### **DFG Extraction- Heat Equation**



- 1-dim. heat equation for $T(x,t)$



- Calculation by the Finite Difference Method (FDM)

$$T(x_i, t_{j+1}) = D * T(x_i, t_j) + B * [T(x_{i-1}, t_j) + T(x_{i+1}, t_j)]$$

The basic DFG can be extended in the horizontal and vertical directions to build a larger DFG



Basic DFG corresponding to minimum FDM calculation
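The FDM update above can be sketched in plain Python to show which operations one basic DFG performs. This is only an illustration: the function name and the fixed boundary values are assumptions, and the coefficients `D` and `B` are taken as given constants.

```python
def heat_step(T, D, B):
    """Advance the 1-dim. heat equation by one time step with the FDM update
    T(x_i, t_{j+1}) = D * T(x_i, t_j) + B * [T(x_{i-1}, t_j) + T(x_{i+1}, t_j)].

    Assumption: the two boundary points keep their values.
    """
    new_T = list(T)
    for i in range(1, len(T) - 1):
        neighbours = T[i - 1] + T[i + 1]       # ADD
        new_T[i] = D * T[i] + B * neighbours   # MUL, MUL, ADD
    return new_T


# Each interior point costs two ADDs and two MULs, which is the work of
# one basic DFG; repeating the loop body across i extends the DFG
# horizontally, and chaining time steps extends it vertically.
print(heat_step([1.0, 2.0, 3.0], D=0.5, B=0.25))
```

Only ADD and MUL operations appear, which matches the FU types considered for the LSRDP.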



#### **Example of extracted DFGs- Heat**








#### **DFG Classification**

| Class  | # of FUs | # of Inputs | # of Outputs | # of DFGs                                 |
|--------|----------|-------------|--------------|-------------------------------------------|
| RDP-S  | 128      | 19          | 12           | Heat (3)<br>Poi (1)<br>Vib (2)<br>Eri (4) |
| RDP-M  | 512      | 19          | 12           | Heat (1)<br>Poi (1)<br>Vib (1)<br>Eri (4) |
| RDP-L  | 1024     | 38          | 24           | Heat (2)<br>Poi (1)<br>Vib (2)<br>Eri (5) |
| RDP-XL | > 1024   | 64          | 52           | Heat (1)<br>Poi (1)<br>Vib (2)<br>Eri (5) |



In total, 24 DFGs were prepared for the benchmark applications.

Because the DFG sizes cover a broad range, the DFGs are classified as S, M, L and XL according to their size and their number of input/output nodes.

=> A separate LSRDP design process is carried out for each of S, M, L and XL.
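The classification can be sketched as a simple lookup. The sketch assumes that the FU, input and output counts in the classification table are upper bounds for each class, which the slides do not state explicitly; the function and constant names are hypothetical.

```python
# (class name, max FUs, max inputs, max outputs), from the classification table.
# Assumption: the table values are upper bounds for each class.
CLASS_LIMITS = [
    ("RDP-S", 128, 19, 12),
    ("RDP-M", 512, 19, 12),
    ("RDP-L", 1024, 38, 24),
]


def classify_dfg(n_ops, n_inputs, n_outputs):
    """Return the smallest class whose limits accommodate the DFG."""
    for name, max_fus, max_inputs, max_outputs in CLASS_LIMITS:
        if n_ops <= max_fus and n_inputs <= max_inputs and n_outputs <= max_outputs:
            return name
    # Anything beyond the RDP-L limits falls into the XL class.
    return "RDP-XL"


print(classify_dfg(n_ops=300, n_inputs=19, n_outputs=12))
```

Under this reading, a DFG with few operations but many I/O nodes is still pushed to a larger class, since the input/output ports are as much a constraint as the FU count.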





## **Preliminary Results**



#### **LSRDP Specifications: Width & Height**



|         | # of Input<br>ports | # of Output<br>ports | Width | Height |
|---------|---------------------|----------------------|-------|--------|
| LSRDP-S | 19                  | 12                   | 16    | 16     |
| LSRDP-M | 19                  | 12                   | 32    | 16     |
| LSRDP-L | 38                  | 24                   | 64    | 32     |

LSRDP Dimensions and the number of input/output ports






| LSRDP   | MCL (avg/max) | ORN size: no. of inputs (avg/max), outputs |
|---------|---------------|--------------------------------------------|
| LSRDP-S | 4/8           | 18/34, 3                                   |
| LSRDP-M | 5/9           | 22/38, 3                                   |
| LSRDP-L | 5/9           | 22/34, 3                                   |

Fur[…]ation needed

#### **Analyzing Various LSRDP Layouts**



| DFG       | Layout | Size  |
|-----------|--------|-------|
| Heat      | I      | 8x3   |
| Heat      | II     | 8x3   |
| Heat      | III    | 8x4   |
| Vibration | I      | 10x8  |
| Vibration | II     | 10x8  |
| Vibration | III    | 10x11 |
| Poisson   | I      | 10x10 |
| Poisson   | II     | 10x12 |
| Poisson   | III    | 15x18 |
| ERI1      | I      | 6x2   |
| ERI1      | II     | 9x3   |
| ERI1      | III    | 6x2   |
| ERI2      | I      | 10x10 |
| ERI2      | II     | 10x10 |
| ERI2      | III    | 15x8  |

- Layout I $\simeq$ Layout II (except for the ERI1 DFG, which gives a better size for Layout III)
- Layout II can be used instead of Layout I to obtain a smaller LSRDP



#### LSRDP at a Glance (1/2)



| Item                      | Sub-item        | Small                       | Medium       | Large        |
|---------------------------|-----------------|-----------------------------|--------------|--------------|
| Functional units          |                 | ADD/SUB, MUL                |              |              |
| Layout                    |                 | Type II (checkered pattern) |              |              |
| Operations                |                 | 64-bit floating point       |              |              |
| Processing structure      |                 | Pipelined                   |              |              |
| PE structure              |                 | FU, T, FU+T, T+T            |              |              |
| No. of input/output ports |                 | 19/12                       | 19/12        | 38/24        |
| Width/Height              |                 | 16/16                       | 32/16        | 64/32        |
| Conf. bit-stream size     | Imm. regs       | 16 × 16 × 64                | 32 × 16 × 64 | 64 × 32 × 64 |
|                           | ORNs            | 16 × BSS(ORN)               | 32 × BSS(ORN) | 64 × BSS(ORN) |
|                           | PEs             | 16 × 16 × 2                 | 32 × 16 × 2  | 64 × 32 × 2  |
| ORN                       | Inputs, outputs | 22, 3                       | 26, 3        | 26, 3        |
|                           | Structure       | Cross-bar switch            |              |              |
|                           | Conn. type      | One-directional             |              |              |
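The configuration bit-stream sizes in the table follow from the array dimensions: width times height times 64 bits for the immediate registers, and width times height times 2 bits for the PEs. The sketch below evaluates those two expressions; the function name is hypothetical, and the ORN term is left out because the slides do not give a numeric value for BSS(ORN).

```python
def config_bitstream_bits(width, height, imm_reg_bits=64, pe_select_bits=2):
    """Evaluate the bit-stream size expressions from the table.

    imm_reg_bits: one 64-bit immediate register per PE.
    pe_select_bits: 2 configuration bits per PE, as listed in the table.
    The ORN bit-stream is omitted because BSS(ORN) is not specified.
    """
    n_pes = width * height
    return {
        "imm_regs": n_pes * imm_reg_bits,
        "pes": n_pes * pe_select_bits,
    }


sizes = {"Small": (16, 16), "Medium": (32, 16), "Large": (64, 32)}
for name, (width, height) in sizes.items():
    bits = config_bitstream_bits(width, height)
    print(f"{name}: imm. regs {bits['imm_regs']} bits, PEs {bits['pes']} bits")
```

The immediate registers dominate the two terms shown here, at 32 times the PE selector bits for every array size.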



#### LSRDP at a Glance (2/2)



| Component         | Property                                        | Value                                      |
|-------------------|-------------------------------------------------|--------------------------------------------|
| Internal memory   | Type                                            | Immediate registers                        |
|                   | Size and count                                  | 64-bit registers, one register for each PE |
|                   | Communication mechanism                         | Serial                                     |
| External memory   | No. of memory modules                           | 16                                         |
|                   | Data trans. rate                                | 1800 Mbps/pin                              |
|                   | Overall data trans. rate                        | 24 GB/s                                    |
|                   | Mem. to LSRDP bus width                         | 64 bit                                     |
|                   | Channels per module                             | Two                                        |
| Reconf. mechanism | Bit-serial configuration through a serial chain |                                            |



#### **Preliminary Performance Evaluation**

| Parameter                       | Value                          |
|---------------------------------|--------------------------------|
| **Base processor configuration** |                               |
| Processor type                  | Out-of-order                   |
| GPP operating frequency         | 3.2 GHz                        |
| Inst. issue width               | 4 instructions/cc              |
| Inst. decode width              | 4 instructions/cc              |
| L1 data cache                   | 64KB (128B entry, 2-way, 2cc)  |
| L1 instruction cache            | 64KB (64B entry, 1-way, 1cc)   |
| L2 unified cache                | 4MB (128B entry, 4-way, 16cc)  |
| Latency of main memory          | 300cc                          |
| L2 to main memory bus width     | 64 Bytes                       |
| L2 to main memory bus frequency | 800 MHz                        |
| **GPP+LSRDP configuration**     |                                |
| LSRDP operating frequency       | 80 GHz                         |
| Reconfiguration latency         | 1cc                            |
| Latency SPM ↔ LSRDP             | 1cc                            |
| Latency main memory ↔ SPM       | 7500cc                         |
| Bandwidth SPM ↔ LSRDP           | Max. 64 * 8 Bytes/cc           |
| Bandwidth main memory ↔ SPM     | 102.4GB/sec                    |
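One way to read the 7500cc main memory to SPM latency is as the 300cc main-memory latency of the 3.2 GHz GPP re-expressed in 80 GHz LSRDP cycles. The sketch below checks that arithmetic and also converts the maximum SPM to LSRDP bandwidth into bytes per second. This interpretation and the helper names are assumptions, not statements from the slides.

```python
GPP_HZ = 3.2e9     # GPP operating frequency from the table
LSRDP_HZ = 80e9    # LSRDP operating frequency from the table


def convert_cycles(cycles, from_hz, to_hz):
    """Express a latency given in cycles of one clock in cycles of another."""
    seconds = cycles / from_hz
    return seconds * to_hz


def peak_bandwidth_bytes_per_s(bytes_per_cycle, clock_hz):
    """Peak bandwidth for a link that moves bytes_per_cycle every clock cycle."""
    return bytes_per_cycle * clock_hz


# 300 GPP cycles expressed in LSRDP cycles.
print(convert_cycles(300, GPP_HZ, LSRDP_HZ))
# Max. 64 * 8 Bytes/cc at the LSRDP clock, in bytes per second.
print(peak_bandwidth_bytes_per_s(64 * 8, LSRDP_HZ))
```

The gap between the 1cc SPM latency and the main-memory latency is why data held in the scratchpad memory matters so much for the LSRDP.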



- **GPP:** Execution time measured by means of a processor simulator
- **LSRDP:** Execution time estimated by performance modeling



Data reuse is employed to avoid both data rearrangement and frequent data retrieval from the scratchpad memory (SPM).



#### **Preliminary Performance Evaluation** (Poisson)



Only a small fraction of the total execution time is processing time on the LSRDP; the main fraction consists of various overhead times and the execution time on the GPP.



#### **Conclusions & Future Work**



- A high-performance computer comprising an accelerator (LSRDP) implemented with superconducting circuits was introduced.
- 24 benchmark Data Flow Graphs (DFGs) were manually generated.
- The LSRDP micro-architecture was designed from the characteristics of scientific applications, using a quantitative approach.
- The LSRDP is promising for resolving issues originating from CMOS technology while achieving considerable performance.

Future Work:

- To achieve higher performance, the various overhead costs, mainly those related to data management, must be reduced.
- To reduce the implementation cost of the LSRDP, we will focus on reducing the maximum connection length and the ORN size.

#### Acknowledgement

This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).






