## EFFICIENT, DISTRIBUTED, AND NON-SPECULATIVE MULTI-ADDRESS ATOMIC OPERATIONS

#### Eduardo José Gómez-Hernández<sup>1</sup> Juan M. Cebrian<sup>1</sup> Rubén Titos-Gil<sup>1</sup> Stefanos Kaxiras<sup>2</sup> Alberto Ros<sup>1</sup>

<sup>1</sup>Computer Engineering Department University of Murcia, Spain

<sup>2</sup>Department of Information Technology Uppsala University, Sweden




#### **OVERVIEW**

- Programmers have long requested support for atomic read-modify-write operations over several addresses
- Ideally, multi-address atomics should:
  - use fine-grained locking to enable concurrency
  - be non-speculative to prevent retries (re-executions/aborts)
- Our goal is to:
  - achieve both properties: fine-grained locking and non-speculative execution
  - avoid deadlocks due to limited resources
    - relying only on the coherence protocol and a predetermined locking order
  - outperform software locks (3.4×) and Intel transactional memory (2.7×)
    - with just 68 bytes of extra storage per core





#### OUTLINE

- MOTIVATION
- BACKGROUND
- MAD ATOMICS
- DEADLOCKS
- EVALUATION
- CONCLUSIONS

Eduardo José Gómez-Hernández

MICRO-54 (2021)







#### MOTIVATION

- Atomic read-modify-write (RMW) instructions
  - are the most efficient way to atomically update a variable
- Non-blocking algorithms
  - rely on atomic RMW primitives
  - commonly, the compare-and-swap (CAS) instruction
- In general, atomic RMW instructions increase the scalability of commonly used data structures and applications
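
As a minimal illustration of the single-address case, the sketch below builds an atomic increment from a CAS retry loop with C++ `std::atomic`. The helper name `cas_increment` is hypothetical and not taken from the talk.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically add 1 to `counter` using a CAS retry loop.
// Returns the value observed just before the increment.
int cas_increment(std::atomic<int>& counter) {
    int expected = counter.load();
    // compare_exchange_weak stores expected + 1 only if counter still holds
    // `expected`; on failure it refreshes `expected` and the loop retries.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
    }
    return expected;
}
```

A single CAS covers one address only, so updating several variables together needs something more, such as a lock or a multi-address primitive.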








#### PREVIOUS WORK

A hardware implementation of the MCAS synchronization primitive<sup>1</sup>

- An MCAS table is used to set up the locks
- A set of instructions fills the structure, and a later instruction starts locking the stored addresses
- Deadlocks can occur due to resource limitations or the lack of a non-speculative solution

Non-Speculative Store Coalescing in Total Store Order<sup>2</sup>

- Limited resources are taken into account
- Atomic groups are established arbitrarily; on a conflict, atomic groups are split
- Atomic groups for atomic operations, however, are established by the programmer and cannot be split

<sup>1</sup>Patel et al., Design, Automation, and Test in Europe (DATE), 2017

<sup>2</sup>Ros and Kaxiras, ISCA 45, 2018


#### BACKGROUND: ADDRESS VERSUS LEXICOGRAPHICAL ORDER

- Typical solution: address order<sup>1</sup>
- Address order does not take into account hardware structures such as the cache
- Lexicographical order<sup>2</sup>

LexOrder = CacheLine Address % Cache Sets

Example addresses:

- A: 0x0040
- B: 0x0100
- C: 0x01C0
- D: 0x0280

<sup>1</sup>Dijkstra, EWD-310, E.W. Dijkstra Archive, Center for American History, 1971

<sup>2</sup>Ros and Kaxiras, ISCA 45, 2018
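
The LexOrder formula above can be sketched in a few lines of C++. The slides do not state the cache geometry, so the 64-byte line size and the 4 cache sets below are assumptions chosen only to make the example small. The tie-break by full address and the names `Target`, `lex_order`, and `locking_order` are also hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumed parameters (not given on the slides).
constexpr std::uint64_t kLineBytes = 64;  // cache-line size in bytes
constexpr std::uint64_t kCacheSets = 4;   // number of cache sets

struct Target {
    std::string name;
    std::uint64_t addr;
};

// LexOrder = CacheLine Address % Cache Sets
std::uint64_t lex_order(std::uint64_t addr) {
    return (addr / kLineBytes) % kCacheSets;
}

// Sort targets by lexicographical order. Ties are broken by the full
// address (an assumption) so that the resulting order is total.
std::vector<std::string> locking_order(std::vector<Target> targets) {
    std::sort(targets.begin(), targets.end(),
              [](const Target& x, const Target& y) {
                  const auto lx = lex_order(x.addr);
                  const auto ly = lex_order(y.addr);
                  return lx != ly ? lx < ly : x.addr < y.addr;
              });
    std::vector<std::string> names;
    for (const auto& t : targets) names.push_back(t.name);
    return names;
}
```

Under these assumed parameters, A, B, C, and D map to sets 1, 0, 3, and 2, so the lexicographical locking order is B, A, D, C, whereas plain address order is A, B, C, D.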








#### MAD ATOMICS

- Lock-protected critical sections

  `mutex_lock(Q); b++; a++; mutex_unlock(Q);`

- Single-instruction multi-address atomics

  `dmad.inc_inc(&b, &a)`

  - Decoded into micro-ops
  - Out-of-order execution

MICRO-54 (2021), October 16, 2021









[Figure: private cache holding lines a, b, b', c, and the directory]

- Lexicographical reOrder Unit
- Extra bit at each set of the directory
- Load\_locked & Store\_unlock
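
To make the role of these pieces more concrete, here is a toy, single-threaded software model of a two-address increment such as `dmad.inc_inc(&b, &a)`. It is a sketch under stated assumptions, not the hardware design: it assumes one `load_locked`/`store_unlock` pair per address, locks taken in lexicographical order, each variable in its own cache line, and a hypothetical geometry of 64-byte lines and 4 sets. All type and function names are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Assumed parameters (not given on the slides).
constexpr std::uint64_t kLineBytes = 64;
constexpr std::uint64_t kCacheSets = 4;

struct Line {
    int value = 0;
    bool locked = false;
};

struct Memory {
    std::map<std::uint64_t, Line> lines;    // keyed by address, one variable per line
    std::vector<std::uint64_t> lock_trace;  // order in which lines were locked

    int load_locked(std::uint64_t addr) {
        Line& l = lines[addr];
        assert(!l.locked);  // this toy model has no waiting; a locked line is an error
        l.locked = true;
        lock_trace.push_back(addr);
        return l.value;
    }
    void store_unlock(std::uint64_t addr, int v) {
        Line& l = lines[addr];
        assert(l.locked);
        l.value = v;
        l.locked = false;
    }
};

std::uint64_t lex_order(std::uint64_t addr) {
    return (addr / kLineBytes) % kCacheSets;
}

// Model of a two-address atomic increment over two distinct addresses.
void mad_inc_inc(Memory& mem, std::uint64_t first, std::uint64_t second) {
    std::vector<std::uint64_t> addrs{first, second};
    // Reorder step: lock in lexicographical order, not in program order.
    std::sort(addrs.begin(), addrs.end(), [](std::uint64_t x, std::uint64_t y) {
        return lex_order(x) != lex_order(y) ? lex_order(x) < lex_order(y) : x < y;
    });
    std::vector<int> loaded;
    for (auto a : addrs) loaded.push_back(mem.load_locked(a));  // lock phase
    for (std::size_t i = 0; i < addrs.size(); ++i)
        mem.store_unlock(addrs[i], loaded[i] + 1);              // update and unlock
}
```

Because every operation acquires its lines in the same predetermined order, two operations that share addresses cannot each hold a line the other is waiting for.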

#### DEADLOCKS

We have identified several deadlock scenarios due to resource limitations:

- Private Cache
- Shared Cache
- Eviction Buffers

















MAD atomics are limited to a maximum of 4 addresses


#### DEADLOCKS: SHARED CACHE

The set lock prevents multiple conflicts from clashing in the same set







[Figure: private cache and directory]

























We propose to enable *in-situ* replacements in this scenario



#### **EVALUATION**

- gem5-20 full-system simulator
- Mimicking an Intel Skylake processor, from 1 up to 64 cores
- Memory hierarchy and coherence protocol modeled with Ruby
- Execution and issue latencies modeled as measured on real hardware by Fog<sup>1</sup>

<sup>1</sup>Fog, http://www.agner.org/optimize/instruction\_tables.pdf, 2018



- Benchmarks: commonly used concurrent data structures and some parallel applications
- Critical sections can be translated into two categories:
  - multi-address atomic operations
  - multi-address compare-and-swap (MCAS) operations
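
For readers unfamiliar with MCAS, the sketch below gives its sequential reference semantics only: every target is compared against its expected value, and the new values are written only if all comparisons succeed. The names `McasEntry` and `mcas_reference` are hypothetical, and the code is not atomic by itself; a real implementation has to make the whole check-and-update appear as one indivisible step.

```cpp
#include <cassert>
#include <vector>

// One entry of a hypothetical MCAS request.
struct McasEntry {
    int* target;   // word to update
    int expected;  // value the word must currently hold
    int desired;   // value to write if all comparisons succeed
};

// Sequential reference semantics: returns true and writes every desired
// value if and only if every target holds its expected value.
bool mcas_reference(const std::vector<McasEntry>& entries) {
    for (const auto& e : entries)
        if (*e.target != e.expected) return false;  // any mismatch: no writes at all
    for (const auto& e : entries)
        *e.target = e.desired;
    return true;
}
```

A failed MCAS leaves all targets unchanged, which is what distinguishes it from issuing several independent CAS operations.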

#### **EVALUATION: RESULTS**

[Figures: evaluation results]






#### CONCLUSION

- New efficient, more flexible, non-speculative, deadlock-free multi-address (MAD) atomic operations
- Deadlocks due to limited resources are avoided by relying only on the coherence protocol and a predetermined locking order
- Performance is increased:
  - 3.4× on average against software locks
  - 2.7× on average compared to TSX
  - In general, scalability improves from one core (with software locks) to up to 16 cores
  - All with just 68 bytes of extra storage per core




eduardojose.gomez@um.es

Thank you for your attention!

ECHO, ERC Consolidator Grant (No 819134); Vetenskapsrådet project 2018-05254; and EPEEC (No 801051)

This presentation and recording belong to the authors. No distribution is allowed without the authors' permission.
