### SPLASH-4: A MODERN BENCHMARK SUITE WITH LOCK-FREE CONSTRUCTS

#### Eduardo José Gómez-Hernández<sup>1</sup> Juan M. Cebrian<sup>1</sup> Stefanos Kaxiras<sup>2</sup> Alberto Ros<sup>1</sup>

<sup>1</sup>Computer Engineering Department University of Murcia, Spain

<sup>2</sup>Department of Information Technology Uppsala University, Sweden





- The benchmark suite is the cornerstone of performance evaluation
- Benchmark suites can misrepresent performance characteristics
- Keeping up with architectural changes while maintaining the same workloads is a real challenge
- We introduce Splash-4, a revised version of Splash-3 (itself an update of Splash-2), which introduces modern programming techniques to improve scalability on contemporary hardware





<sup>1</sup>Woo, Steven Cameron, et al., "The SPLASH-2 programs: Characterization and methodological considerations", ACM SIGARCH Computer Architecture News 23, 1995

<sup>2</sup>Venetis, Ioannis E., et al., "The Modified SPLASH-2", https://www.capsl.udel.edu//splash/ 2007

<sup>3</sup>Sakalis, Christos, et al., "Splash-3: A Properly Synchronized Benchmark Suite for Contemporary Research", ISPASS 2016

<sup>4</sup>Gómez-Hernández, Eduardo José, et al., "Splash-4: Improving Scalability with Lock-Free Constructs", ISPASS 2021






|             | Splash-2 | Minor Update | Splash-3 | Splash-4 |
|-------------|----------|--------------|----------|----------|
| Year        | 1995     | 2007         | 2016     | 2021     |
| Description | The first major parallel benchmark suite. Still in use (+5k cites)<sup>1</sup> | A small update that fixes bugs and updates the programming st… | A major update that fixes data races and performance bugs<sup>3</sup> | Current update, exploiting lock-free and atomic operations<sup>4</sup> |

21 years passed between Splash-2 and Splash-3, and computation has changed.


# IISWC 2022

#### November 07, 2022

- Splash-2 and Splash-3 are crafted using outdated programming techniques
- Previous works noticed that the default input sizes limit the scalability of some applications.
  - The computation between synchronization points is not substantially longer than the synchronization itself
  - Using larger datasets increases the execution time, and that is a problem when using simulation infrastructures









- Splash-3 exposed data races and performance bugs, and fixed them
- In the analysis done using GEMS (on a 64-core in-order multicore), benchmarks reach speedups between 16× and 47×
- In our simulated environment (gem5-20, 64 out-of-order cores resembling Intel's Ice Lake), scalability stops between 16 and 32 cores, with an average speedup of 2.3×
- On our real hardware (64-core AMD EPYC 7702P), scalability stops between 4 and 16 cores, with an average speedup of 4.7×





Synchronization is done with a combination of software locks, condition variables, and barriers

The main idea is to replace high-overhead synchronization operations with newer lightweight alternatives

This translates into a performance improvement by expanding the architectural features that the benchmarks can exercise



- Modern ISAs typically provide a basic set of atomic operations that offer both atomicity and synchronization
- This basic set consists of atomic loads and stores, atomic read-modify-write (RMW) operations, and atomic compare-and-exchange operations

Typical hardware RMW atomics are only available for integer types
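As a rough illustration of this basic set, the sketch below exercises it through the C11 `<stdatomic.h>` interface on a single integer. The function name `demo_basic_atomics` and its values are hypothetical, and every operation uses the default sequentially consistent ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: exercises the basic C11 atomic operations on one
 * integer and returns its final value. */
int demo_basic_atomics(void)
{
    atomic_int counter;
    atomic_init(&counter, 0);

    atomic_store(&counter, 5);                   /* atomic store */
    int seen = atomic_load(&counter);            /* atomic load, reads 5 */

    /* Atomic RMW: adds 3 and returns the value held before the addition. */
    int before = atomic_fetch_add(&counter, 3);  /* returns 5, counter is 8 */

    /* Compare-and-exchange: stores 10 only if counter still equals expected.
     * On failure, expected is overwritten with the current value. */
    int expected = 8;
    bool swapped = atomic_compare_exchange_strong(&counter, &expected, 10);

    (void)seen; (void)before; (void)swapped;
    return atomic_load(&counter);
}
```

The integer `atomic_fetch_add` typically maps to a single hardware RMW instruction, whereas non-integer types have to fall back on a compare-and-exchange loop.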

### VOLREND (ADAPTIVE.C.IN:182/199)



#### Splash-3

```
/* Lock */
ALOCK(Global->QLock, local_node);
work = Global->Queue[local_node][0];
Global->Queue[local_node][0] += 1;
AULOCK(Global->QLock, local_node);
```

#### Splash-4

The lock-protected read-and-increment is replaced by a single `FETCH_ADD`.

`ALOCK`/`AULOCK` → mutex lock/unlock; `FETCH_ADD` → `atomic_fetch_and_add` (sequential consistency)
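A minimal sketch of how such a fetch-and-add can stand in for the locked read-and-increment, assuming C11 atomics. The names `queue_head` and `claim_work` are hypothetical stand-ins for the per-node queue counter, not the suite's actual identifiers.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for one per-node work-queue counter.
 * Static atomic objects are zero-initialized to a valid state. */
static atomic_long queue_head;

/* Claims the next work index: the read and the increment happen as one
 * atomic step, so no lock is required. */
long claim_work(void)
{
    return atomic_fetch_add(&queue_head, 1);
}
```

Each caller receives a distinct index, which is the same guarantee the lock provided for this two-statement critical section.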

#### SPLASH-4: LOCK-FREE AND ATOMICS, WHILE & CAEXCH CONSTRUCT

- Atomic Compare-and-Swap (CAS) and Atomic Compare-and-Exchange (CAExch) are type-agnostic
- They can be used to implement RMW operations for more complex underlying types
- `CAExch(ptr, oldValue, newValue);`
  - If `oldValue == (*ptr)` then `(*ptr) = newValue`
  - Else `oldValue = (*ptr)`

```
var oldValue = LOAD(ptr);   /* atomic load of the current value */
var newValue;
do {
    newValue = new;         /* compute the new value from oldValue */
} while (!CAExch(ptr, oldValue, newValue)); /* a failed CAExch refreshes oldValue */
```

`LOAD` → atomic load

#### OCEAN (MULTI.C.IN:90) Atomic Max



#### Splash-3

```
/* Lock */
LOCK(locks->error_lock)
if (local_err > multi->err_multi) {
    multi->err_multi = local_err;
}
UNLOCK(locks->error_lock)
```

#### Splash-4

```
/* Lock-free */
double expected = LOAD(multi->err_multi);
do {
    if (local_err <= expected)
        break;
} while (!CAExch(multi->err_multi, expected, local_err));
```

`LOCK`/`UNLOCK` → mutex lock/unlock; `LOAD` → atomic load (sequential consistency)
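The atomic-max pattern can be sketched in standard C11 as follows. This is an illustrative version under the assumption that the shared value is declared `_Atomic double`; the function name `atomic_max_double` is hypothetical and is not the macro used in the suite.

```c
#include <stdatomic.h>

/* Raises *target to candidate if candidate is larger; otherwise leaves
 * *target unchanged. */
void atomic_max_double(_Atomic double *target, double candidate)
{
    double expected = atomic_load(target);
    while (candidate > expected) {
        /* On failure, expected is refreshed with the value another thread
         * stored, and the comparison is repeated against it. */
        if (atomic_compare_exchange_weak(target, &expected, candidate))
            break;
    }
}
```

The loop exits without writing as soon as the shared value is already at least as large as the candidate, so threads that cannot raise the maximum never perform an exchange. Note that the exchange compares object representations rather than numeric values, which matters for special values such as NaN.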

### SPLASH-4: CRITICAL SECTION SPLITTING



- Splash-4 uses lock-free constructs that operate on a single address, so they can only replace critical sections that modify a single address
- Splitting a larger critical section that modifies more than one address would enable the use of lock-free constructs in more cases
- Unfortunately, splitting large critical sections is not possible in the general case
- In many Splash-3 critical sections, (atomic) updates of independent variables are grouped together into a larger critical section for no apparent reason

### WATER (POTENG.C.IN:159/253)

Atomic Double Floating Point Addition (FETCH\_AND\_ADD\_DOUBLE)

```
double oldValue = LOAD(ptr);
double newValue;
do {
    newValue = oldValue + addition;
} while (!CAExch(ptr, oldValue, newValue));
```

#### Splash-3

```
/* Lock */
LOCK(gl->PotengSumLock);
*POTA = *POTA + LPOTA;
*POTR = *POTR + LPOTR;
*PTRF = *PTRF + LPTRF;
UNLOCK(gl->PotengSumLock);
```

#### Splash-4

```
/* Lock-free */
FETCH_AND_ADD_DOUBLE(POTA, LPOTA);
FETCH_AND_ADD_DOUBLE(POTR, LPOTR);
FETCH_AND_ADD_DOUBLE(PTRF, LPTRF);
```

`LOCK`/`UNLOCK` → mutex lock/unlock; `LOAD` → atomic load (sequential consistency)
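The atomic double addition and the split critical section can be sketched in C11 as shown here. This is an assumption-laden illustration: `fetch_and_add_double`, `accumulate_potentials`, and the three file-scope accumulators are hypothetical names standing in for `FETCH_AND_ADD_DOUBLE` and the shared sums, which in WATER are reached through pointers.

```c
#include <stdatomic.h>

/* Adds addition to *ptr atomically and returns the previous value. */
double fetch_and_add_double(_Atomic double *ptr, double addition)
{
    double old_value = atomic_load(ptr);
    double new_value;
    do {
        /* Recomputed on every retry, because a failed exchange
         * reloads old_value with the current contents of *ptr. */
        new_value = old_value + addition;
    } while (!atomic_compare_exchange_weak(ptr, &old_value, new_value));
    return old_value;
}

/* Hypothetical shared accumulators standing in for *POTA, *POTR and *PTRF. */
static _Atomic double pota, potr, ptrf;

/* Each sum is updated independently, so no lock spans the three updates. */
void accumulate_potentials(double lpota, double lpotr, double lptrf)
{
    fetch_and_add_double(&pota, lpota);
    fetch_and_add_double(&potr, lpotr);
    fetch_and_add_double(&ptrf, lptrf);
}
```

Splitting is acceptable in this sketch only because the three sums are independent: no thread relies on observing them updated together.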



### SPLASH-4: SENSE-REVERSING CENTRALIZED BARRIER

- The execution time between barriers is fairly short
- To minimize the overhead, we implement a sense-reversing barrier
- This construct is optimized for short waiting times, but not for runs that oversubscribe threads

`LOAD` → atomic load (sequential consistency); `STORE` → atomic store (sequential consistency)
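For readers unfamiliar with the technique, a textbook sense-reversing centralized barrier might look like the sketch below, written with C11 atomics. It is not the Splash-4 implementation: the type `sr_barrier`, its fields, and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical barrier type; field names are illustrative. */
typedef struct {
    atomic_int  arrived;  /* threads that reached the current episode */
    atomic_bool sense;    /* flipped once per completed episode */
    int         total;    /* number of participating threads */
} sr_barrier;

void sr_barrier_init(sr_barrier *b, int total)
{
    atomic_init(&b->arrived, 0);
    atomic_init(&b->sense, false);
    b->total = total;
}

/* local_sense is private to each thread and must start as false. */
void sr_barrier_wait(sr_barrier *b, bool *local_sense)
{
    *local_sense = !*local_sense;  /* sense expected for this episode */
    if (atomic_fetch_add(&b->arrived, 1) == b->total - 1) {
        /* Last arrival: reset the counter, then release the waiters. */
        atomic_store(&b->arrived, 0);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* Spin until the last arrival publishes the new sense. */
        while (atomic_load(&b->sense) != *local_sense) {
        }
    }
}
```

Because waiters spin on a shared flag instead of sleeping, the release is fast when all threads are running, but a descheduled thread keeps every other thread spinning, which is why oversubscription hurts.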






- A total of 50 critical sections (33%) have been transformed:
  - 27 critical sections (18%) were converted into C11 atomics
  - 18 critical sections (12%) were converted into the CAS construct (without splitting)
  - Splitting critical sections allowed 5 more critical sections (3%) to be converted into 15 CAS constructs

### EVALUATION: METHODOLOGY



- Measurements account for the region of interest (ROI)
- We used the recommended inputs from Splash-3 for all the executions
- The real hardware is an AMD EPYC 7702P CPU
- The simulated machine (in gem5-20) mimics an Intel Ice Lake processor
  - Measurements reset stats after a warm-up period (as suggested by the original Splash-2 developers)

### EVALUATION: SIMULATOR OVERHEAD BREAKDOWN



In most cases, the execution time is dominated by the barriers

Raytrace gets a huge overhead reduction due to the lock-free constructs



### EVALUATION: TOP-DOWN (OCEAN-CONT)





- Top-Down groups several hardware counters to make the causes of stalls easier to understand
- In Splash-4 we observe an increase in back-end pressure

### EVALUATION: SIMULATOR SCALABILITY





In general, the point up to which scalability improves moves from 16 to 32 cores

### EVALUATION: REAL MACHINE SCALABILITY





In general, the point up to which scalability improves moves from 4 to 16 cores

### EVALUATION: REAL MACHINE 64 THREADS SUMMARY



- With the lock-free constructs, the execution time is reduced by 11%
- With the new barriers, the execution time is reduced by 40%
- The combined reduction is 52%





- We presented the Splash-4 benchmark suite
- We performed a detailed performance analysis comparing it with Splash-3
- Our study shows:
  - A significant improvement in scalability
  - Execution time is reduced significantly on both the simulated and the real hardware
- In summary:
  - The Splash benchmark suite is still a cornerstone in computer architecture research
  - It should be updated to be able to exploit modern hardware features

Splash-4 removes some of the synchronization overheads, exposing the hidden reasons that prevent applications from scaling and leaving the door open for researchers to further improve their designs

### SPLASH-4: A MODERN BENCHMARK SUITE WITH LOCK-FREE CONSTRUCTS

Eduardo José Gómez-Hernández<sup>1</sup> Juan M. Cebrian<sup>1</sup> Stefanos Kaxiras<sup>2</sup> Alberto Ros<sup>1</sup>

eduardojose.gomez@um.es

Thank you for your attention!

ECHO, ERC Consolidator Grant (No 819134); Spanish Ministerio de Economía, Industria y Competitividad – Agencia Estatal de Investigación (ERC2018-092826)

This presentation and recording belong to the authors. No distribution is allowed without the authors' permission.




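For a critical section to be split safely, the variables it updates must be completely unrelated during the whole window in which the critical section may execute: one variable must not even convey information about another. For instance, if the first critical section below were split into two independent atomic stores, a thread executing the second one could see `some_random_boolean` still set while `ptr` is already `NULL`.

```
LOCK(lock);
ptr = NULL;
some_random_boolean = false;
UNLOCK(lock);
```

```
LOCK(lock);
if (some_random_boolean) {
    local = *ptr;
}
UNLOCK(lock);
```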

### OVERHEAD SUMMARY



