### Script generated by TTT

Title: Petter: Programmiersprachenh (31.10.2018)
Date: Wed Oct 31 14:13:27 CET 2018
Duration: 78:48 min
Pages: 37

## **TSO in the Wild: x86**

x86 CPUs, which power desktops and servers around the world, are a common representative of CPUs based on the TSO memory model.

- FIFO store buffers keep quite strong consistency properties
- The major obstacle to Sequential Consistency is
  $$\operatorname{St}_i[a] \leq \operatorname{Ld}_i[b] \not\Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{Ld}_i[b]$$
- modern x86 CPUs provide the `mfence` instruction
- `mfence` orders all memory instructions:
  $$\operatorname{Op}_i \leq \mathit{mfence}() \leq \operatorname{Op}_i' \quad \Rightarrow \quad \operatorname{Op}_i \sqsubseteq \operatorname{Op}_i'$$
- a fence between writes and loads gives sequentially consistent CPU behavior (and is as slow as a CPU without store buffer)
- → use fences only when necessary

## **Happened-Before Model for TSO**

Assume cache A contains: a: S0, b: S0, cache B contains: a: S0, b: S0

## **PSO Model: Formal Spec [SI92]**

### Definition (Partial Store Order)

1. The store order wrt. memory ($\sqsubseteq$) is total
   $$\forall_{a,b \in \mathit{addr},\ i,j \in \mathit{CPU}} \quad (\operatorname{St}_i[a] \sqsubseteq \operatorname{St}_j[b]) \lor (\operatorname{St}_j[b] \sqsubseteq \operatorname{St}_i[a])$$
2. Fenced stores in program order ($\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{St}_i[a] \leq \mathit{sfence}() \leq \operatorname{St}_i[b] \Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{St}_i[b]$$
3. Stores to the same address in program order ($\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{St}_i[a] \leq \operatorname{St}_i[a]' \Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{St}_i[a]'$$
4. Loads preceding another operation (wrt. program order $\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{Ld}_i[a] \leq \operatorname{Op}_i[b] \Rightarrow \operatorname{Ld}_i[a] \sqsubseteq \operatorname{Op}_i[b]$$
5. A load's value is determined by the latest write as observed by the local CPU
   $$\mathit{val}(\operatorname{Ld}_i[a]) = \mathit{val}\left(\operatorname{St}_j[a] \mid \operatorname{St}_j[a] = \max_{\sqsubseteq} \left( \{ \operatorname{St}_k[a] \mid \operatorname{St}_k[a] \sqsubseteq \operatorname{Ld}_i[a] \} \cup \{ \operatorname{St}_i[a] \mid \operatorname{St}_i[a] \leq \operatorname{Ld}_i[a] \} \right)\right)$$

Now also stores are not guaranteed to be in order any more:

$$\operatorname{St}_i[a] \leq \operatorname{St}_i[b] \not\Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{St}_i[b]$$

→ What about sequential consistency for the whole system?

## **Store Buffers**

Abstract Machine Model: defines semantics of memory accesses

- put each store into a store buffer and continue execution
- store buffers apply stores in various orders:
  - FIFO (Sparc/x86-*TSO*)
  - unordered (Sparc PSO)
- program order still needs to be observed locally
  - store buffer snoops read channel and
  - on matching address, returns the youngest value in buffer

## **Happened-Before Model for PSO**

### Thread A

`a = 1; b = 1;`

### Thread B

`while (b == 0) {}; assert(a == 1);`

Assume cache A contains: a: S0, b: E0, cache B contains: a: S0, b: I

## **Explicit Synchronization: Write Barrier**

Overtaking of messages *may be desirable* and does not need to be prohibited in general.
- generalized store buffers render programs incorrect that assume sequential consistency between different CPUs
- whenever a store in front of another operation in one CPU must be observable in this order by a different CPU, an explicit write barrier has to be inserted
  - a write barrier marks all current store operations in the store buffer
  - the next store operation is only executed when all marked stores in the buffer have completed

## **TSO Model: Formal Spec [SI92]**

### Definition (Total Store Order)

1. The store order wrt. memory ($\sqsubseteq$) is total
   $$\forall_{a,b \in \mathit{addr},\ i,j \in \mathit{CPU}} \quad (\operatorname{St}_i[a] \sqsubseteq \operatorname{St}_j[b]) \lor (\operatorname{St}_j[b] \sqsubseteq \operatorname{St}_i[a])$$
2. Stores in program order ($\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{St}_i[a] \leq \operatorname{St}_i[b] \Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{St}_i[b]$$
3. Loads preceding another operation (wrt. program order $\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{Ld}_i[a] \leq \operatorname{Op}_i[b] \Rightarrow \operatorname{Ld}_i[a] \sqsubseteq \operatorname{Op}_i[b]$$
4. A load's value is determined by the latest write as observed by the local CPU
   $$\mathit{val}(\operatorname{Ld}_i[a]) = \mathit{val}\left(\operatorname{St}_j[a] \mid \operatorname{St}_j[a] = \max_{\sqsubseteq} \left( \{ \operatorname{St}_k[a] \mid \operatorname{St}_k[a] \sqsubseteq \operatorname{Ld}_i[a] \} \cup \{ \operatorname{St}_i[a] \mid \operatorname{St}_i[a] \leq \operatorname{Ld}_i[a] \} \right)\right)$$

Particularly, one ordering property is not guaranteed:

$$\operatorname{St}_i[a] \leq \operatorname{Ld}_i[b] \not\Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{Ld}_i[b]$$

Local stores may be observed earlier by local loads than from somewhere else!
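The store-to-load relaxation of TSO, $\operatorname{St}_i[a] \leq \operatorname{Ld}_i[b] \not\Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{Ld}_i[b]$, can be made concrete with a small single-threaded simulation of the store-buffer abstract machine. The following is only a sketch under stated assumptions: the names `Cpu`, `cpu_store`, `cpu_load`, `cpu_fence` and `both_loads_see_zero` are hypothetical, and modelling a fence as "drain the whole FIFO buffer" is a simplification of what `mfence` does on real hardware. It replays one fixed interleaving of the classic store-buffering test (CPU A: `a = 1; r1 = b`, CPU B: `b = 1; r2 = a`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_ADDRS = 2, BUF_CAP = 8 };    /* address 0 is a, address 1 is b */

typedef struct { int addr; int val; } Store;

typedef struct {
    Store  buf[BUF_CAP];                /* FIFO store buffer, oldest entry first */
    size_t len;
} Cpu;

static int memory[NUM_ADDRS];           /* shared memory seen by all CPUs */

/* A store only enters the local buffer; other CPUs cannot see it yet. */
void cpu_store(Cpu *c, int addr, int val) {
    assert(c->len < BUF_CAP);
    c->buf[c->len].addr = addr;
    c->buf[c->len].val  = val;
    c->len++;
}

/* A load snoops the local buffer (youngest match wins), else reads memory. */
int cpu_load(const Cpu *c, int addr) {
    for (size_t i = c->len; i-- > 0; )
        if (c->buf[i].addr == addr)
            return c->buf[i].val;
    return memory[addr];
}

/* Assumption: a fence is modelled as draining the buffer in FIFO order. */
void cpu_fence(Cpu *c) {
    for (size_t i = 0; i < c->len; i++)
        memory[c->buf[i].addr] = c->buf[i].val;
    c->len = 0;
}

/* Returns true if both loads read 0, which sequential consistency forbids. */
bool both_loads_see_zero(bool use_fence) {
    Cpu A = { .len = 0 }, B = { .len = 0 };
    memory[0] = 0;
    memory[1] = 0;

    cpu_store(&A, 0, 1);                /* A: a = 1 */
    if (use_fence) cpu_fence(&A);
    cpu_store(&B, 1, 1);                /* B: b = 1 */
    if (use_fence) cpu_fence(&B);

    int r1 = cpu_load(&A, 1);           /* A: r1 = b */
    int r2 = cpu_load(&B, 0);           /* B: r2 = a */

    cpu_fence(&A);                      /* buffers eventually drain */
    cpu_fence(&B);
    return r1 == 0 && r2 == 0;
}
```

Calling `both_loads_see_zero(false)` reproduces the non-sequentially-consistent outcome because both stores are still buffered when the loads run; with `both_loads_see_zero(true)` each store reaches memory before the following load, so the outcome disappears. The simulation fixes one interleaving on purpose; it illustrates the model and is not a test of real hardware.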
## **Happened-Before Model for Write Barriers**

### Thread A

```
a = 1;
sfence();
b = 1;
```

### Thread B

```
while (b == 0) {};
assert(a == 1);
```

Assume cache A contains: a: S0, b: E0, cache B contains: a: S0, b: I

Further weakening the model: O-o-O Reads

## **Relaxed Memory Order**

Communication of cache updates is still costly:

- a cache-intense computation can fill up store buffers in CPUs
- waiting for invalidation acknowledgements may still happen
- invalidation acknowledgements are delayed on busy caches
- immediately acknowledge an invalidation and apply it later
  - put each invalidate message into an invalidate queue
  - if a MESI message needs to be sent regarding a cache line in the invalidate queue, then wait until the line is invalidated
- local loads and stores do not consult the invalidate queue
- What about sequential consistency?

## **RMO Model: Formal Spec [SI94]**

### Definition (Relaxed Memory Order)

1. Fenced memory accesses in program order ($\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{Op}_i[a] \leq \mathit{mfence}() \leq \operatorname{Op}_i[b] \Rightarrow \operatorname{Op}_i[a] \sqsubseteq \operatorname{Op}_i[b]$$
2. Stores to the same address in program order ($\leq$) are embedded into the memory order ($\sqsubseteq$)
   $$\operatorname{St}_i[a] \leq \operatorname{St}_i[a]' \Rightarrow \operatorname{St}_i[a] \sqsubseteq \operatorname{St}_i[a]'$$
3. Operations dependent on a load (wrt. dependence $\rightarrow$) are embedded in the memory order ($\sqsubseteq$)
   $$\operatorname{Ld}_i[a] \rightarrow \operatorname{Op}_i[b] \Rightarrow \operatorname{Ld}_i[a] \sqsubseteq \operatorname{Op}_i[b]$$
4. A load's value is determined by the latest write as observed by the local CPU
   $$\mathit{val}(\operatorname{Ld}_i[a]) = \mathit{val}\left(\operatorname{St}_j[a] \mid \operatorname{St}_j[a] = \max_{\sqsubseteq} \left( \{ \operatorname{St}_k[a] \mid \operatorname{St}_k[a] \sqsubseteq \operatorname{Ld}_i[a] \} \cup \{ \operatorname{St}_i[a] \mid \operatorname{St}_i[a] \leq \operatorname{Ld}_i[a] \} \right)\right)$$

Now we need the notion of dependence $\rightarrow$:

- Memory access to the same address:
  $$\operatorname{St}_i[a] \leq \operatorname{Ld}_i[a] \Rightarrow \operatorname{St}_i[a] \rightarrow \operatorname{Ld}_i[a]$$
- Register reads are dependent on latest register writes:
  $$\operatorname{Op}_i[a]'' = \max_{\leq} \left( \{ \operatorname{Op}_i[a]' \mid \mathit{targetreg}(\operatorname{Op}_i[a]') = \mathit{srcreg}(\operatorname{Op}_i[b]) \wedge \operatorname{Op}_i[a]' \leq \operatorname{Op}_i[b] \} \right) \quad \Rightarrow \quad \operatorname{Op}_i[a]'' \rightarrow \operatorname{Op}_i[b]$$
- Stores within branched blocks are dependent on branch conditionals:
  $$(\operatorname{Op}_i[a] \leq \operatorname{St}_i[b]) \wedge \operatorname{Op}_i[a] \rightarrow \mathit{condbranch} \leq \operatorname{St}_i[b] \quad \Rightarrow \quad \operatorname{Op}_i[a] \rightarrow \operatorname{St}_i[b]$$

## **Happened-Before Model for Invalidate Queues**

### Thread B

```
while (b == 0) {};
assert(a == 1);
```

Assume cache A contains: a: S0, b: E0, cache B contains: a: S0, b: I

## **Explicit Synchronization: Read Barriers**

Read accesses do not consult the invalidate queue.
- might read an out-of-date value
- need a way to establish sequential consistency between writes of other processors and local reads
- insert an explicit *read barrier* before the read access
  - a read barrier marks all entries in the invalidate queue
  - the next read operation is only executed once all marked invalidations have completed
- a read barrier before each read gives sequentially consistent read behavior (and is as slow as a system without invalidate queue)
- → match each write barrier in one process with a read barrier in another process

## **Happened-Before Model for Read Barriers**

### Thread A

```
a = 1;
sfence();
b = 1;
```

### Thread B

```
while (b == 0) {};
lfence();
assert(a == 1);
```

Example: The Dekker Algorithm on RMO Systems

## **Using Memory Barriers: the Dekker Algorithm**

Mutual exclusion of *two* processes with busy waiting.
```
// flag[] is a boolean array and turn is an integer
flag[0] = false;
flag[1] = false;
turn = 0; // or 1
```

```
P0:
flag[0] = true;
while (flag[1] == true)
  if (turn != 0) {
    flag[0] = false;
    while (turn != 0) {
      // busy wait
    }
    flag[0] = true;
  }
// critical section
turn = 1;
flag[0] = false;
```

```
P1:
flag[1] = true;
while (flag[0] == true)
  if (turn != 1) {
    flag[1] = false;
    while (turn != 1) {
      // busy wait
    }
    flag[1] = true;
  }
// critical section
turn = 0;
flag[1] = false;
```

## **The Idea Behind Dekker**

Communication via three variables:

- `flag[i] == true`: process $P_i$ wants to enter its critical section
- `turn == i`: process $P_i$ has priority when both want to enter

In process $P_i$:

- if $P_{1-i}$ does not want to enter, proceed immediately to the critical section

## **Dekker's Algorithm and RMO**

Problem: Dekker's algorithm requires sequential consistency.

Idea: insert memory barriers between all variables common to both threads.

```
P0:
flag[0] = true;
sfence();
while (lfence(), flag[1] == true)
  if (lfence(), turn != 0) {
    flag[0] = false;
    sfence();
    while (lfence(), turn != 0) {
      // busy wait
    }
    flag[0] = true;
    sfence();
  }
// critical section
turn = 1;
sfence();
flag[0] = false;
sfence();
```

- insert a load memory barrier `lfence()` in front of every read from common variables

## **Summary: Relaxed Memory Models**

Highly optimized CPUs may use a relaxed memory model:

- reads and writes are not synchronized unless requested by the user
- many kinds of memory barriers exist with subtle differences
- → ARM, PowerPC, Alpha, ia-64, even x86 (→ SSE Write Combining)

Memory barriers are the "lowest level" of synchronization.

## **Discussion**

Memory barriers reside at the lowest level of synchronization primitives.

Where are they useful?

- when several processes implement automata and coordinate their transitions via common synchronized variables
- when blocking should not de-schedule threads
- protocol implementations
- OS provides synchronization facilities based on memory barriers

Why might they not be appropriate?

- difficult to get right, best suited for specific well-understood algorithms
- often synchronization with locks is as fast and easier
- too many fences are costly if store/invalidate buffers are the bottleneck

## **Memory Models and Compilers**

### Standard Program Optimizations

Standard program optimizations comprise loop-invariant code motion and dead store elimination, e.g.:

Before Optimization

```
int x = 0;
for (int i=0;i<100;i++) {
  x = 1;
  printf("%d",x);
}
```

After Optimization

```
int x = 1;
for (int i=0;i<100;i++) {
  printf("%d",x);
}
```
## **Memory Models and C-Compilers**

Keeping semantics I

```
int x = 0;
for (int i=0;i<100;i++) {
  sfence();
  x = 1;
  printf("%d",x);
}
```

Keeping semantics II

```
volatile int x = 0;
for (int i=0;i<100;i++) {
  x = 1;
  printf("%d",x);
}
```

- Compilers may also reorder store instructions
- Write barriers keep the compiler from reordering across them
- The specification of `volatile` keeps the *C-Compiler* from reordering memory accesses to this address

## **Summary**

### Learning Outcomes

1. Strict Consistency
2. Happened-before Relation
3. Sequential Consistency
4. The MESI Cache Model
5. TSO: FIFO store buffers
6. PSO: store buffers
7. RMO: invalidate queues
8. Reestablishing Sequential Consistency with memory barriers
9. Dekker's Algorithm for Mutual Exclusion

## **Future Many-Core Systems: NUMA**

### Many-Core Machines' Read Responses congest the bus

In that case: Intel's MESIF (Forward) to reduce communication overhead.

- a memory-intensive computation may cause contention on the bus
- the speed of the bus is limited since the electrical signal has to travel to all participants
- point-to-point connections are faster than a bus, but do not provide the possibility of forming consensus

## **Overhead of NUMA Systems**

Communication overhead in a NUMA system.

Processors in a NUMA system may be fully or partially connected. The directory of who stores an address is partitioned amongst processors.
A cache miss that cannot be satisfied by the local memory at *A*:

- A sends a retrieve request to processor B owning the directory
- B tells the processor C who holds the content
- C sends data (or status) to A and sends acknowledge to B
- B completes transmission by an acknowledge to A

source: [Int09]

## **References**

- [Int09] An introduction to the Intel QuickPath Interconnect. Technical Report 320412, 2009.
- Time, Clocks, and the Ordering of Events in a Distributed System. *Commun. ACM*, 21(7):558–565, July 1978.
- Memory Barriers: a Hardware View for Software Hackers. Technical report, Linux Technology Center, IBM Beaverton, June 2010.
- A low overhead coherence solution for multiprocessors with private cache memories. In *Proc. 11th ISCA*, pages 348–354, 1984.
- [SI92] SPARC International, Inc. The SPARC Architecture Manual: Version 8. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.
- [SI94] SPARC International, Inc. The SPARC Architecture Manual (Version 9). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.