#### Script generated by TTT Title: Simon: Programmiersprachen (29.11.2013) Date: Fri Nov 29 14:15:17 CET 2013 Duration: 92:24 min Pages: 98 # **Consistency During Transactions** #### Consistency during a transaction. ACID states how committed transactions behave but not what may happen until a transaction commits. - a transaction that is run on an inconsistent state may generate an inconsistent state → zombie transaction - this is usually ok since it will be aborted eventually critical for C/C++ if, for instance, variables are pointers • but transactions may cause havoc when run on inconsistent states ``` atomic { int tmp1 = x; int tmp2 = y; assert(tmp1-tmp2==0); } // preserved invariant: x==y atomic { x = 10; y = 10; } ``` Concurrency: Transactions Transaction Semantics 6/33 # **Consistency During Transactions** #### Consistency during a transaction. ACID states how committed transactions behave but not what may happen until a transaction commits. - a transaction that is run on an inconsistent state may generate an inconsistent state → zombie transaction - this is usually ok since it will be aborted eventually - but transactions may cause havoc when run on inconsistent states ``` atomic { int tmp1 = x; int tmp2 = y; assert(tmp1-tmp2==0); } // preserved invariant: x==y atomic { x = 10; y = 10; } ``` • critical for C/C++ if, for instance, variables are pointers #### **Definition (opacity)** A TM system provides *opacity* if failing transactions are serializable w.r.t. committing transactions. → failing transactions still sees a consistent view of memory Concurrency: Transactions Transaction Semantics 6/33 ## **Consistency During Transactions** #### Consistency during a transaction. ACID states how committed transactions behave but not what may happen until a transaction commits. - this is usually ok since it will be aborted eventually - but transactions may cause havor when run on inconsistent states atomic { $\times = 0$ % preserved invariant: x = y ``` atomic { int tmp1 = x; int tmp2 = y; assert(tmp1-tmp2==0); } x=0 y preser atomic { x = 10; y = 10; } ``` Concurrency: Transactions Transaction Semantics 6 #### **Consistency During Transactions** #### Consistency during a transaction. ACID states how committed transactions behave but not what may happen until a transaction commits. - a transaction that is run on an inconsistent state may generate an inconsistent state → zombie transaction - this is usually ok since it will be aborted eventually - but transactions may cause havoc when run on inconsistent states ``` atomic { int tmp1 = x; int tmp2 = y; assert(tmp1-tmp2==0); } // preserved invariant: x==y atomic { x = 10; y = 10; } ``` critical for C/C++ if, for instance, variables are pointers Concurrency: Transactions Transaction Semantics 6/33 #### Weak- and Strong Isolation If guarantees are only given about memory accessed inside atomic, a TM implementation provides *weak isolation*. Can we mix transactions with code accessing memory non-transactionally? #### **Consistency During Transactions** #### Consistency during a transaction. ACID states how committed transactions behave but not what may happen until a transaction commits. - a transaction that is run on an inconsistent state may generate an inconsistent state → zombie transaction - this is usually ok since it will be aborted eventually - but transactions may cause havoc when run on inconsistent states ``` atomic { int tmp1 = x; int tmp2 = y; assert(tmp1-tmp2==0); } // preserved invariant: x==y atomic { x = 10; y = 10; } ``` • critical for C/C++ if, for instance, variables are pointers #### **Definition (opacity)** A TM system provides *opacity* if failing transactions are serializable w.r.t. committing transactions. → failing transactions still sees a consistent view of memory Concurrency: Transactions ransaction Semantics 6/33 # Weak- and Strong Isolation If guarantees are only given about memory accessed inside atomic, a TM implementation provides *weak isolation*. Can we mix transactions with code accessing memory non-transactionally? no conflict detection for non-transactional accesses Concurrency: Transactions Transaction Semantics 7/33 Concurrency: Transaction Semantics 7/33 #### Weak- and Strong Isolation If guarantees are only given about memory accessed inside atomic, a TM implementation provides *weak isolation*. Can we mix transactions with code accessing memory non-transactionally? - no conflict detection for non-transactional accesses - standard race problems as in unlocked shared accesses ``` // Thread 1 atomic 1 x = 42; ``` ``` int tmp = x; ``` Concurrency: Transactions **Transaction Semantics** 7/33 Transaction Semantic 7/3 #### Weak- and Strong Isolation If guarantees are only given about memory accessed inside atomic, a TM implementation provides *weak isolation*. Can we mix transactions with code accessing memory non-transactionally? - no conflict detection for non-transactional accesses - standard race problems as in unlocked shared accesses ``` // Thread 1 atomic { x = 42; } // Thread 2 int tmp = x; ``` - sgive programs with races the same semantics as if using a single global lock for all atomic blocks - strong isolation: retain order between accesses to TM and non-TM #### Weak- and Strong Isolation If guarantees are only given about memory accessed inside atomic, a TM implementation provides *weak isolation*. Can we mix transactions with code accessing memory non-transactionally? - no conflict detection for non-transactional accesses - standard race problems as in unlocked shared accesses ``` // Thread 1 atomic { x = 42; } // Thread 2 int tmp = x; ``` ¬ give programs with races the same semantics as if using a single global lock for all atomic blocks #### Weak- and Strong Isolation If guarantees are only given about memory accessed inside atomic, a TM implementation provides *weak isolation*. Can we mix transactions with code accessing memory non-transactionally? - no conflict detection for non-transactional accesses - standard *race* problems as in unlocked shared accesses - sive programs with races the same semantics as if using a single global lock for all atomic blocks - strong isolation: retain order between accesses to TM and non-TM #### **Definition (SLA)** The *single-lock atomicity* is a model in which the program executes as if all transactions acquire a single, program-wide mutual exclusion lock. → like *sequential consistency*, SLA is a statement about program equivalence Concurrency: Transactions Transaction Semantics 7/33 Concurrency: Transaction Semantics 7/33 #### **Properties of Single-Lock Atomicity** # **Properties of Single-Lock Atomicity** atomic $\{ k = i+j; \}$ Observation: В Observation: SLA enforces order between TM and non-TM accesses √ Concurrency: Transactions **Transaction Semantics** 8/33 ansaction Semantics k = i+j; 0/22 #### **Properties of Single-Lock Atomicity** #### Observation: - ullet SLA enforces order between TM and non-TM accesses $\sqrt{\phantom{a}}$ - ▶ this guarantees strong isolation between TM and non-TM accesses - within one transactions, accesses may be re-ordered √ - the content of <u>non-TM</u> memory conveys information which <u>atomic</u> block has executed, even if the TM regions do not access the same memory # **Properties of Single-Lock Atomicity** #### Observation: - ullet SLA enforces order between TM and non-TM accesses $\sqrt{\ }$ - ▶ this guarantees strong isolation between TM and non-TM accesses - within one transactions, accesses may be re-ordered √ - the content of non-TM memory conveys information which atomic block has executed, even if the TM regions do not access the same memory - ► SLA makes it possible to use atomic block for synchronization oncurrency: Transactions Transaction Semantics 8/33 Concurrency: Transaction Semantics 8/33 # Disadvantages of the SLA model The SLA model is *simple* but often too strong: SLA has a weaker progress guarantee than a transaction should have ``` // Thread 1 // Thread 2 atomic { atomic { while (true) {}; int tmp = x; // x in TM ``` SLA correctness is too strong in practice ``` // Thread 2 atomic { // Thread 1 int tmp = data; data = 1: // Thread 1 not in atomic atomic { if (ready) { // use tmp ready = 1; ``` Concurrency: Transactions # Disadvantages of the SLA model The SLA model is *simple* but often too strong: SLA has a weaker progress quarantee than a transaction should have ``` // Thread 1 // Thread 2 atomic { atomic { while (true) {}; int tmp = x; // x in TM ``` SLA correctness is too strong in practice ``` // Thread 2 atomic { // Thread 1 int tmp = data; data = 1: // Thread 1 not in atomic atomic { if (ready) { // use tmp ready = 1; ``` under the SLA model, atomic {} acts as barrier ## Disadvantages of the SLA model The SLA model is *simple* but often too strong: SLA has a weaker progress guarantee than a transaction should have ``` // Thread 1 // Thread 2 atomic { atomic { while (true) {}; int tmp = x; // x in TM ``` SLA correctness is too strong in practice ``` atomic { // Thread 1 int tmp = data; data = 1: // Thread 1 not in atomic atomic { if (ready) { // use tmp ready = 1; ``` // Thread 2 - under the SLA model, atomic {} acts as barrier - intuitively, the two transactions should be independent rather than synchronize # **Transactional Sequential Consistency** How about a more permissive view of transaction semantics? - TM should not have the blocking behaviour of locks - which the programmer cannot rely on synchronization #### **Definition (TSC)** The *transactional sequential consistency* is a model in which the accesses within each transaction are sequentially consistent. #### **Transactional Sequential Consistency** How about a more permissive view of transaction semantics? - TM should not have the blocking behaviour of locks - whe programmer cannot rely on synchronization #### **Definition (TSC)** The *transactional sequential consistency* is a model in which the accesses within each transaction are sequentially consistent. - TSC is weaker: gives strong isolation, but allows parallel execution √ - TSC is stronger: accesses within a transaction may not be re-ordered Concurrency: Transactions ransaction Semantic: 10 / 33 חחלחו #### Translation of atomic-Blocks convert every read access x from a shared variable to ReadTx(&x) # **Transactional Sequential Consistency** How about a more permissive view of transaction semantics? - TM should not have the blocking behaviour of locks - whe programmer cannot rely on synchronization #### **Definition (TSC)** The *transactional sequential consistency* is a model in which the accesses within each transaction are sequentially consistent. - TSC is weaker: gives strong isolation, but he we parallel execution √ - TSC is stronger: accesses within a transaction may not be re-ordered △ → actual implementations use TSC with some *race free* re-orderings Concurrency: Transactions ransaction Semantics 10 / 33 #### Translation of atomic-Blocks A TM system must track which shared memory locations are accessed: - convert every read access x from a shared variable to ReadTx(&x) - convert every write access x=e to a shared variable to WriteTx(&x,e) oncurrency: Transactions Implementation of Software TM 12/33 Concurrency: Transactions Implementation of Software TM 12/33 #### Translation of atomic-Blocks A TM system must track which shared memory locations are accessed: - convert every read access x from a shared variable to ReadTx(&x) - convert every write access x=e to a shared variable to WriteTx(&x,e) Convert atomic blocks as follows: ``` atomic { StartTx(): // code with ReadTx and WriteTx } while (!CommitTx()); ``` וחחלוחו # **Transactional Memory for the Queue** If a preprocessor is used, PopRight can be implemented as follows: ``` double-ended queue: removal int PopRight(DQueue* q) { QNode* oldRightNode; atomic { Real QNode* rightSentinel = q->right; oldRightNode = rightSentinel->left; if (oldRightNode==leftSentinel) retry; QNode* newRightNode = oldRightNode->left; newRightNode->right = rightSentinel; rightSentinel->left = newRightNode; int val = oldRightNode->val; free(oldRightNode); return val; ``` • the transaction will abort if other threads call PopRight #### Translation of atomic-Blocks A TM system must track which shared memory locations are accessed: - convert every read access x from a shared variable to ReadTx(&x) - convert every write access x=e to a shared variable to WriteTx(&x,e) Convert atomic blocks as follows: ``` do { atomic { StartTx(): // code // code with ReadTx and WriteTx } while (!CommitTx()): ``` - translation can be done using a pre-processor - determining a minimal set of memory accesses that need to be transactional requires a good static analysis - idea: translate all accesses to global variables and the heap as TM - more fine-grained control using manual translation - an actual implementation might provide a retry keyword - when executing retry, the transaction aborts and re-starts - ▶ the transaction will again wind up at retry unless its read set changes - Solution by block until a variable in the read-set has changed - ▶ similar to condition variables in monitors √ 12/33 # **Transactional Memory for the Queue** If a preprocessor is used, PopRight can be implemented as follows: ``` double-ended queue: removal int PopRight(DQueue* q) { QNode* oldRightNode; atomic { QNode* rightSentinel = q->right; oldRightNode = rightSentinel->left; if (oldRightNode==leftSentinel) retry; QNode* newRightNode = oldRightNode->left; newRightNode->right = rightSentinel; rightSentinel->left = newRightNode; int val = oldRightNode->val; free(oldRightNode); return val; ``` - the transaction will abort if other threads call PopRight - if the queue is empty, it may abort if PushLeft is executed #### A Software TM Implementation A software TM implementation allocates a *transaction descriptor* to store data specific to each atomic block, for instance: - undo-log of writes if writes have to be undone if a commit fails - redo-log of writes if writes are postponed until a commit - read- and write-set: locations accessed so far - read- and write-version: time stamp when value was accessed Consider the TL2 STM (software transactional memory) algorithm [1]: • provides opacity: zombie transactions do not see inconsistent state encurroney: maneactions Implementation of Software TM 14 / 33 Concurrency: Transaction Implementation of Software TI 14 / 33 #### **A Software TM Implementation** A software TM implementation allocates a *transaction descriptor* to store data specific to each atomic block, for instance: - undo-log of writes if writes have to be undone if a commit fails - redo-log of writes if writes are postponed until a commit - read- and write-set: locations accessed so far - read- and write-version: time stamp when value was accessed Consider the TL2 STM (software transactional memory) algorithm [1]: - provides opacity: zombie transactions do not see inconsistent state - uses lazy versioning: writes are stored in a redo-log and done on commit - eager conflict detection: a transaction aborts as soon as it conflicts ## A Software TM Implementation A software TM implementation allocates a *transaction descriptor* to store data specific to each atomic block, for instance: - undo-log of writes if writes have to be undone if a commit fails - redo-log of writes if writes are postponed until a commit - read- and write-set: locations accessed so far - read- and write-version: time stamp when value was accessed Consider the TL2 STM (software transactional memory) algorithm [1]: - provides opacity: zombie transactions do not see inconsistent state - uses lazy versioning: writes are stored in a redo-log and done on commit A Software TM Implementation A software TM implementation allocates a *transaction descriptor* to store data specific to each atomic block, for instance: - undo-log of writes if writes have to be undone if a commit fails - redo-log of writes if writes are postponed until a commit - read- and write-set: locations accessed so far - read- and write-version: time stamp when value was accessed Consider the TL2 STM (software transactional memory) algorithm [1]: - provides opacity: zombie transactions do not see inconsistent state - uses lazy versioning: writes are stored in a redo-log and done on commit plants. eager conflict detection: a transaction aborts as soon as it conflicts TL2 stores a *global version* counter and: - a read version in each <u>object</u> (allocate a few bytes more in each call to <u>malloc</u>, or inherit from a <u>transaction object</u> in e.g. Java) - a redo-log in the transaction descriptor - a read- and a write-set in the transaction descriptor - a read-version: the version when the transaction started Concurrency: Transactions Implementation of Software TM 14/33 Concurrency: Transactions Implementation of Software TM 14/33 #### **Principles of TL2** ing the The idea: obtain a version ty. RV from the global clock when starting the transaction, the <u>read-version</u>, and set the versions of all written cells to a new <u>version</u> on commit. A read from a field at offset of object obj is implemented as follows: ``` int ReadTx(TMDesc tx, object obj, int offset) { if (&(obj[offset]) in tx.redoLog) { return tx.redoLog[&obj[offset]]; } else { atomic { v1 = obj.timestamp; locked = obj.sem<1; }; result = obj[offset]; v2 = obj.timestamp; if (locked || v1 != v2 || v1 > tx.RV) AbortTx(tx); } tx.readSet = tx.readSet.add(objA AbortTx(tx); } return result; } ``` Concurrency: Transactions mplementation of Software TM 15 / 33 ## **Committing a Transaction** A transaction can succeed if none of the read locations has changed: ``` committing a transaction bool CommitTx(TMDesc tx) { foreach (e in tx.writeSet) if (!try_wait(e.obj.sem)) goto Fail; WV = FetchAndAdd(&globalClock); foreach (e in tx.readSet) if (e.obj.version > tx.RV) goto Fail; foreach (e in tx.redoLog) e.obj[e.offset] = e.value; foreach (e in tx.writeSet) { e.obj = WV; signal(e.obj.sem); } return true; Fail: // signal all acquired semaphores return false; } ``` ## **Principles of TL2** The idea: obtain a version tx.RV from the global clock when starting the transaction, the *read-version*, and set the versions of all written cells to a new version on commit. A read from a field at offset of object obj is implemented as follows: # int ReadTx(TMDesc tx, object obj, int offset) { if (&(obj[offset]) in tx.redoLog) { return tx.redoLog[&obj[offset]]; } else { atomic { v1 = obj.timestamp; locked = obj.sem<1; }; result = obj[offset]; v2 = obj.timestamp; if (locked || v1 != v2 || v1 > tx.RV) AbortTx(tx); } tx.readSet = tx.readSet.add(obj); return result; } WriteTx is simpler: add or update the location in the redo-log. Concurrency: Transactions Implementation of Software TI 15 / 33 # **Properties of TL2** Opacity is guaranteed by aborting a read access with an inconsistent value: ency: Transactions Implementation of Software TM Concurrency: Transaction Implementation of Software TM #### **Properties of TL2** Opacity is guaranteed by aborting a read access with an inconsistent value: #### Other observations: read-only transactions just need to check that read versions are consistent (no need to increment the global clock) # **Properties of TL2** Concurrency: Transactions Opacity is guaranteed by aborting a read access with an inconsistent value: #### Other observations: - read-only transactions just need to check that read versions are consistent (no need to increment the global clock) - writing values still requires locks - deadlocks are still possible - since other transactions can be aborted, one can preempt transactions that are deadlocked - since lock accesses are generated, computing a lock order up-front might be possible #### **Properties of TL2** Opacity is guaranteed by aborting a read access with an inconsistent value: #### Other observations: - read-only transactions just need to check that read versions are consistent (no need to increment the global clock) - writing values still requires locks # **Properties of TL2** Opacity is guaranteed by aborting a read access with an inconsistent value: #### Other observations: - read-only transactions just need to check that read versions are consistent (no need to increment the global clock) - writing values still requires locks - deadlocks are still possible - ▶ since other transactions can be aborted, one can preempt transactions that are deadlocked - since lock accesses are generated, computing a lock order up-front might be possible - at least two memory barriers are necessary in ReadTx Concurrency: Transactions Implementation of Software TM #### **Properties of TL2** Opacity is guaranteed by aborting a read access with an inconsistent value: #### Other observations: - read-only transactions just need to check that read versions are consistent (no need to increment the global clock) - writing values still requires locks - deadlocks are still possible - since other transactions can be aborted, one can preempt transactions that are deadlocked - since lock accesses are generated, computing a lock order up-front might be possible - at least two memory barriers are necessary in ReadTx - read version+lock, lfence, read value, lfence, read version Concurrency: Transactions mplementation of Software T 17 / 33 ## **General Challenges when using TM** Executing atomic blocks by repeatedly trying to executing them non-atomically creates new problems: a transaction might unnecessarily be aborted #### **Properties of TL2** Opacity is guaranteed by aborting a read access with an inconsistent value: #### Other observations: - read-only transactions just need to check that read versions are consistent (no need to increment the global clock) - writing values still requires locks - deadlocks are still possible - since other transactions can be aborted, one can preempt transactions that are deadlocked - since lock accesses are generated, computing a lock order up-front might be possible - at least two memory barriers are necessary in ReadTx - ► read version+lock, lfence, read value, lfence, read version - there might be contention on the global clock Concurrency: Transactions nplementation of Software TM 17 / 33 # **General Challenges when using TM** Executing atomic blocks by repeatedly trying to executing them non-atomically creates new problems: - a transaction might unnecessarily be aborted - the granularity of what is locked might be too large Concurrency: Transactions Implementation of Software TM 18/33 Concurrency: Transactions Implementation of Software TM 18/33 #### **General Challenges when using TM** Executing atomic blocks by repeatedly trying to executing them non-atomically creates new problems: - a transaction might unnecessarily be aborted - the granularity of what is locked might be too large - a TM implementation might impose restrictions: Concurrency: Transactions Implementation of Software TM 18 / 33 Concurrency: Transaction Implementation of Software TI // Thread 2 x = 42: // x is shared atomic { 18 / 33 #### **General Challenges when using TM** Executing atomic blocks by repeatedly trying to executing them non-atomically creates new problems: - a transaction might unnecessarily be aborted - the granularity of what is locked might be too large - a TM implementation might impose restrictions: - lock-based commits can cause contention - organize cells that participate in a transaction in one object - compute a new object as result of a transaction - atomically replace a pointer to the old object with a pointer to the new object if the old object has not changed - ▶ ~ idea of the original STM proposal non-atomically creates new problems: // Thread 1 atomic { lock-based commits can cause contention int r = ReadTx(&x,0); General Challenges when using TM a transaction might unnecessarily be aborted Executing atomic blocks by repeatedly trying to executing them the granularity of what is locked might be too large a TM implementation might impose restrictions: #### **General Challenges when using TM** Executing atomic blocks by repeatedly trying to executing them non-atomically creates new problems: - a transaction might unnecessarily be aborted - the granularity of what is locked might be too large - a TM implementation might impose restrictions: - lock-based commits can cause contention - organize cells that participate in a transaction in one object - compute a new object as result of a transaction - atomically replace a pointer to the old object with a pointer to the new object if the old object has not changed - ▶ → idea of the original STM proposal - TM system should figure out which memory locations must be logged Concurrency: Transactions Implementation of Software TM 18/33 Concurrency: Transactions Implementation of Software TM 18/33 #### **General Challenges when using TM** Executing atomic blocks by repeatedly trying to executing them non-atomically creates new problems: - a transaction might unnecessarily be aborted - the granularity of what is locked might be too large - a TM implementation might impose restrictions: - lock-based commits can cause contention - organize cells that participate in a transaction in one object - compute a new object as result of a transaction - atomically replace a pointer to the old object with a pointer to the new object if the old object has not changed - ▶ ~ idea of the original STM proposal - TM system should figure out which memory locations must be logged - danger of live-locks: transaction B might abort A which might abort B . . . Concurrency: Transactions Implementation of Software TI 18 / 3 #### **Integrating Non-TM Resources** Allowing access to other resources than memory inside an atomic block poses problems: - storage management, condition variables, volatile variables, input/output - semantics should be as if atomic implements SLA or TSC semantics Usual choice is one of the following: - <u>Prohibit It.</u> Certain constructs do not make sense. Use compiler to reject these programs. - <u>Execute It</u>. I/O operations may only happen in some runs (e.g. file writes usually go to a buffer). Abort if I/O happens. - Irrevocably Execute It. Universal way to deal with operations that cannot be undone: enforce that this transaction terminates (possibly before starting) by making all other transactions conflict. - <u>Integrate It.</u> Re-write code to be transactional: error logging, writing data to a file, . . . #### **Integrating Non-TM Resources** Allowing access to other resources than memory inside an atomic block poses problems: - storage management, condition variables, volatile variables, input/output - semantics should be as if atomic implements SLA or TSC semantics Concurrency: Transaction mplementation of Software T 19 / 3 # **Integrating Non-TM Resources** Allowing access to other resources than memory inside an atomic block poses problems: - storage management, condition variables, volatile variables, input/output - semantics should be as if atomic implements SLA or TSC semantics Usual choice is one of the following: - Prohibit It. Certain constructs do not make sense. Use compiler to reject these programs. - Execute It. I/O operations may only happen in some runs (e.g. file writes usually go to a buffer). Abort if I/O happens. - Irrevocably Execute It. Universal way to deal with operations that cannot be undone: enforce that this transaction terminates (possibly before starting) by making all other transactions conflict. - Integrate It. Re-write code to be transactional: error logging, writing data to a file, . . . currently best to use TM only for memory; check if TM supports irrevocable transactions Concurrency: Transactions Implementation of Software TM 19/33 Concurrency: Transactions Implementation of Software TM 19/33 #### **Hardware Transactional Memory** **Hardware Transactional Memory** Transactions of a limited size can also be implemented in hardware: - additional hardware to track read- and write-sets - conflict detection is <u>eage</u>r using the cache: additional hardware to track read- and write-sets Transactions of a limited size can also be implemented in hardware: Concurrency: Transactions lardware Transactional Memory 20 / 22 Hardware Transactional 20 / 33 ## **Hardware Transactional Memory** - additional hardware to track read- and write-sets - conflict detection is eager using the cache: - additional hardware makes it cheap to perform conflict detection # **Hardware Transactional Memory** Transactions of a limited size can also be implemented in hardware: - additional hardware to track read- and write-sets - conflict detection is *eager* using the cache: - additional hardware makes it cheap to perform conflict detection - ▶ if a cache-line in the read set is invalidated, the transaction aborts Concurrency: Transactions Hardware Transactional Memory 20/33 Concurrency: Transactions Hardware Transactional Memory 20/33 #### **Hardware Transactional Memory** Transactions of a limited size can also be implemented in hardware: - additional hardware to track read- and write-sets - conflict detection is *eager* using the cache: - additional hardware makes it cheap to perform conflict detection - if a cache-line in the read set is invalidated, the transaction aborts - if a cache-line in the write set must be written-back, the transaction aborts - --- limited by fixed hardware resources, a software backup must be provided Concurrency: Transactions Hardware Transactional Memory 20 / 33 חחלחו Concurrency: Transaction **Hardware Transactional Memor** 20 / 33 #### **Hardware Transactional Memory** - additional hardware to track read- and write-sets - conflict detection is *eager* using the cache: - additional hardware makes it cheap to perform conflict detection - if a cache-line in the read set is invalidated, the transaction aborts - if a cache-line in the write set must be written-back, the transaction aborts → limited by fixed hardware resources, a software backup must be provided Two principal implementation of HTM: - Explicit Transactional HTM: each access is marked as transactional - ▶ similar to StartTx, ReadTx, WriteTx, and CommitTx ## **Hardware Transactional Memory** Transactions of a limited size can also be implemented in hardware: - additional hardware to track read- and write-sets - conflict detection is *eager* using the cache: - additional hardware makes it cheap to perform conflict detection - ▶ if a cache-line in the read set is invalidated, the transaction aborts - if a cache-line in the write set must be written-back, the transaction aborts willimited by fixed hardware resources, a software backup must be provided. Two principal implementation of HTM: Transactions of a limited size can also be implemented in hardware: - additional hardware to track read- and write-sets - conflict detection is eager using the cache: - additional hardware makes it cheap to perform conflict detection - ▶ if a cache-line in the read set is invalidated, the transaction aborts - if a cache-line in the write set must be written-back, the transaction aborts → limited by fixed hardware resources, a software backup must be provided Two principal implementation of HTM: - Explicit Transactional HTM: each access is marked as transactional - ▶ similar to StartTx, ReadTx, WriteTx, and CommitTx - requires separate transaction instructions - a transaction has to be translated differently - ▶ ⚠ mixing transactional and non-transactional accesses is problematic - Implicit Transactional HTM: only the beginning and end of a transaction are marked Concurrency: Transactions Hardware Transactional Memory 20 / 33 Concurrency: Transactions Hardware Transactional Memory 20 / 33 #### **Hardware Transactional Memory** Transactions of a limited size can also be implemented in hardware: - additional hardware to track read- and write-sets - conflict detection is *eager* using the cache: - additional hardware makes it cheap to perform conflict detection - if a cache-line in the read set is invalidated, the transaction aborts - if a cache-line in the write set must be written-back, the transaction aborts → limited by fixed hardware resources, a software backup must be provided Two principal implementation of HTM: - Explicit Transactional HTM: each access is marked as transactional - ▶ similar to StartTx, ReadTx, WriteTx, and CommitTx - requires separate transaction instructions - a transaction has to be translated differently - ▶ ⚠ mixing transactional and non-transactional accesses is problematic - Implicit Transactional HTM: only the beginning and end of a transaction are marked - same instructions can be used, hardware interprets them as transactional **Concurrency: Transactions** Hardware Transactional Memory 20 / 33 Hardware Transactional Mem 21 / 33 ## **Example for HTM** AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - <u>LOCK</u> MOV instructions provide <u>explicit</u> data transfer between <u>normal</u> memory and speculative region #### **Example for HTM** **Example for HTM** AMD Advanced Synchronization Facilities (ASF): defines a logical speculative region AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Concurrency: Transactions Hardware Transactional Memory 21/33 Concurrency: Transactions Hardware Transactional Memory 21/33 #### **Example for HTM** Example for HTM AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): • implicit transactional, can use normal instructions within transactions **Concurrency: Transactions** Hardware Transactional Memory 21 / 33 חולחו Hardware Transactional Memo 21 / 22 ## **Example for HTM** - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): - implicit transactional, can use normal instructions within transactions - tracks read/write set using a single transaction bit on cache lines - provides space for a backup of the whole CPU state (registers, ...) # **Example for HTM** AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): - implicit transactional, can use normal instructions within transactions - tracks read/write set using a single transaction bit on cache lines - provides space for a backup of the whole CPU state (registers, ...) - use a simple counter to support nested transactions concurrency: Transactions Hardware Transactional Memory 21/33 Concurrency: Transactions Hardware Transactional Memory 21/33 #### **Example for HTM** AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): - implicit transactional, can use normal instructions within transactions - tracks read/write set using a single *transaction* bit on cache lines - provides space for a backup of the whole CPU state (registers, ...) - use a simple counter to support nested transactions - may abort at any time due to lack of resources **Concurrency: Transactions** חחלחו ## **Example for HTM** AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): - implicit transactional, can use normal instructions within transactions - tracks read/write set using a single transaction bit on cache lines - provides space for a backup of the whole CPU state (registers, ...) - use a simple counter to support nested transactions - may abort at any time due to lack of resources - aborting in an inner transaction means aborting all of them Intel provides two software interfaces to TM: Restricted Transactional Memory # HTM #### **Example for HTM** AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): - *implicit transactional*, can use normal instructions within transactions - tracks read/write set using a single transaction bit on cache lines - provides space for a backup of the whole CPU state (registers, ...) - use a simple counter to support nested transactions - may abort at any time due to lack of resources - aborting in an inner transaction means aborting all of them # **Restricted Transactional Memory (Intel)** Provides new instructions XBEGIN, XEND, XABORT, and XTEST: XBEGIN takes an instruction address where execution continues if the transaction aborts #### **Example for HTM** AMD Advanced Synchronization Facilities (ASF): - defines a logical speculative region - LOCK MOV instructions provide explicit data transfer between normal memory and speculative region - aimed to implement larger atomic operations Intel's Haswell microarchitecture (since Sep 2013): - implicit transactional, can use normal instructions within transactions - tracks read/write set using a single *transaction* bit on cache lines - provides space for a backup of the whole CPU state (registers, ...) - use a simple counter to support nested transactions - may abort at any time due to lack of resources - aborting in an inner transaction means aborting all of them Intel provides two software interfaces to TM: - Restricted Transactional Memory - Hardware Lock Elision Concurrency: Transactions ## **Restricted Transactional Memory (Intel)** Provides new instructions XBEGIN, XEND, XABORT, and XTEST: - XBEGIN takes an instruction address where execution continues if the transaction aborts - XEND commits the transaction started by the last XBEGIN # **Restricted Transactional Memory (Intel)** Provides new instructions XBEGIN, XEND, XABORT, and XTEST: XBEGIN takes an instruction address where execution continues if the transaction aborts 22 / 33 ## **Restricted Transactional Memory (Intel)** Provides new instructions XBEGIN, XEND, XABORT, and XTEST: - XBEGIN takes an instruction address where execution continues if the transaction aborts - XEND commits the transaction started by the last XBEGIN - XABORT aborts the current transaction with an error code - XTEST checks if the processor is executing transactionally The instruction XBEGIN can be implemented as a C function: ``` int data[100]; // shared void update(int idx, int value) { if(_xbegin()==-1) { data[idx] += value; _xend(); } else { // transaction failed ``` → user must provide fall-back code Concurrency: Transactions 22 / 33 #### **Considerations for the Fall-Back Path** Consider executing the following code in parallel with itself: ``` int data[100]; // shared void update(int idx, int value) { if(_xbegin()==-1) { data[idx] += value; _xend(); } else { data[idx] += value; } } ``` #### Problem: - if the fall-back code is executed, it might be interrupted by the transaction - the write in the fall-back path thereby overwrites the value of the transaction Concurrency: Transactions **Hardware Transactional Memory** 23 / 33 #### **Protecting the Fall-Back Path** Use a lock to prevent the transaction from interrupting the fall-back path: ``` int data[100]; // shared int mutex; void update(int idx, int value) { if(_xbegin()==-1) { data[idx] += value; _xend(); relse { wait(mutex); data[idx += value] signal(mutex); } ``` - ullet fall-back path may not run in parallel with others $\sqrt{\phantom{a}}$ - A transactional region may not run in parallel with fall-back path #### Considerations for the Fall-Back Path Consider executing the following code in parallel with itself: #### Problem: - if the fall-back code is executed, it might be interrupted by the transaction - the write in the fall-back path thereby overwrites the value of the transaction → need to ensure that the fall-back path is executed atomically Concurrency: Transaction Hardware Transactional Memo 23 / 33 # Implementing RTM using the Cache Transactional operation: augment each cache line with an extra bit T Concurrency: Transactions Hardware Transactional Memory 24/33 Concurrency: Transactions Hardware Transactional Memory 25/33 # Implementing RTM using the Cache Transactional operation: ullet augment each cache line with an extra bit T **Concurrency: Transactions** 25/33 # Implementing RTM using the Cache Transactional operation: - augment each cache line with an extra bit T - use a nesting counter C and a backup register set - → additional transaction logic: - read or write access to a cache line sets T if C > 0 - applying an invalidate message from invalidate queue to a cache line with T=1 issues ${\tt XABORT}$ - observing a read message for a *modified* cache line with T=1 issues XABORT # Implementing RTM using the Cache Transactional operation: store buffer CPU A ullet augment each cache line with an extra bit T register bank • use a nesting counter C and a backup register set С → additional transaction logic: read or write access to a cache line sets T if C > 0 25 / 33 # Implementing RTM using the Cache Transactional operation: - ullet augment each cache line with an extra bit T - use a nesting counter C and a backup register set → additional transaction logic: - read or write access to a cache line sets T if C > 0 - applying an invalidate message from invalidate queue to a cache line with T=1 issues XABORT - observing a read message for a *modified* cache line with T=1 issues XABORT - XABORT clears all T flags, sets C=0 and restores CPU registers register CPU A С bank store buffer cache invalidate queue Memory Memory 25 / 33 #### Implementing RTM using the Cache Transactional operation: - ullet augment each cache line with an extra bit T - ullet use a nesting counter C and a backup register set → additional transaction logic: - XBEGIN increment C and, if C = 0, back up registers - read or write access to a cache line sets T if C > 0 - applying an *invalidate* message from *invalidate queue* to a cache line with T=1 issues XABORT - observing a $\emph{read}$ message for a $\emph{modified}$ cache line with T=1 issues XABORT - XABORT clears all T flags, sets C=0 and restores CPU registers - XCOMMIT decrement C and, if C = 0, clear all T flags **Concurrency: Transactions** Hardware Transactional Memory 25 / 33 #### **Common Code Pattern for Mutexes** Using HTM in order to implement mutex: ``` void update(int idx, int val) { int data[100]; // shared lock(mutex); data[idx] += val: int mutex: void update(int idx, int value) { unlock(mutex); if(_xbegin()==-1) { if (mutex>0) _xabort(); void lock(int mutex) { data[idx] += value: if(xbegin()==-1) _xend(); if (mutex>0) _xabort(); } else { else return; wait(mutex): wait(mutex); data[idx += value] signal(mutex); void unlock(int mutex) { if (mutex>0) signal(mutex); } else _xend(); ``` - the <u>critical</u> section may be executed <u>with an <u>elided</u> lock </u> - as soon as one thread conflicts, the <u>mutex will be taken</u>, thereby aborting all other transactions that have read <u>mutex</u> #### **Illustrating Transactions** Augment MESI state with extra bit T per cache line. CPU A: E5, CPU B: I # **Hardware Lock Elision** Observation: Using HTM to implement lock elision is a common pattern → provide special handling in hardware: HLE provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock Concurrency: Transactions Hardware Transactional Memory 27/33 Concurrency: Transactions Hardware Transactional Memory 28/33 #### Hardware Lock Elision Observation: Using HTM to implement lock elision is a common pattern provide special handling in hardware: HLE - provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock - requires annotations: **Concurrency: Transactions** Hardware Transactional Memory 20 / 22 Concurrency: Transaction **Hardware Transactional Memor** 28 / 33 #### **Hardware Lock Elision** Observation: Using HTM to implement lock elision is a common pattern → provide special handling in hardware: HLE - provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock - requires annotations: - instruction setting the semaphore to 0 must be prefixed with XACQUIRE - ▶ instruction that increments the semaphore must be prefixed with XRELEASE - these prefixes are ignored on older platforms - for a successful elision, all signal/wait operations of a lock must be annotated - the memory location of the lock is locally visible as 0 ("taken") #### **Hardware Lock Elision** Observation: Using HTM to implement lock elision is a common pattern → provide special handling in hardware: HLE - provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock - requires annotations: - ▶ instruction setting the semaphore to 0 must be prefixed with XACQUIRE - ▶ instruction that increments the semaphore must be prefixed with XRELEASE - these prefixes are ignored on older platforms - for a successful elision, all signal/wait operations of a lock must be annotated **Hardware Lock Elision** Observation: Using HTM to implement lock elision is a common pattern → provide special handling in hardware: HLE - provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock - requires annotations: - ▶ instruction setting the semaphore to 0 must be prefixed with XACQUIRE - ▶ instruction that increments the semaphore must be prefixed with XRELEASE - these prefixes are ignored on older platforms - for a successful elision, all signal/wait operations of a lock must be annotated - the memory location of the lock is locally visible as 0 ("taken") - other processor see the lock as 1 ("not taken") Concurrency: Transactions Hardware Transactional Memory 28/33 Concurrency: Transactions Hardware Transactional Memory 28/33 #### Hardware Lock Elision Observation: Using HTM to implement lock elision is a common pattern → provide special handling in hardware: HLE - provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock - requires annotations: - ▶ instruction setting the semaphore to 0 must be prefixed with XACQUIRE - ▶ instruction that increments the semaphore must be prefixed with XRELEASE - these prefixes are ignored on older platforms - for a successful elision, all signal/wait operations of a lock must be annotated - the memory location of the lock is locally visible as 0 ("taken") - other processor see the lock as 1 ("not taken") - only a finite number of locks can be elided - all but one elided lock may abort **Concurrency: Transactions** חחלחו 28 / 33 # Implementing Lock Elision Transactional operation: - re-uses infrastructure for Restricted Transactional Memory - add a buffer for elided locks, similar to store buffer XACQUIRE of lock ensures shared/exclusive cache line state. issues XBEGIN and stores written value in elided lock buffer #### Hardware Lock Elision Observation: Using HTM to implement lock elision is a common pattern → provide special handling in hardware: HLE - provides a way to execute a critical section without the overhead of the atomic updates necessary to acquire and release the lock - requires annotations: - ▶ instruction setting the semaphore to 0 must be prefixed with XACQUIRE - ▶ instruction that increments the semaphore must be prefixed with XRELEASE - these prefixes are ignored on older platforms - for a successful elision, all signal/wait operations of a lock must be annotated - the memory location of the lock is locally visible as 0 ("taken") - other processor see the lock as 1 ("not taken") - only a finite number of locks can be elided - all but one elided lock may abort - progress guarantee since lock is taken on abort - no back up path is required # Implementing Lock Elision Transactional operation: - re-uses infrastructure for Restricted Transactional Memory - add a buffer for elided locks, similar to store buffer - XACQUIRE of lock ensures shared/exclusive cache line state. issues XBEGIN and stores written value in *elided lock* buffer - r/w access to a cache line sets T - applying an invalidate message from invalidate queue to an address in the elided lock buffer issues XABORT Concurrency: Transactions ## **Implementing Lock Elision** Transactional operation: - re-uses infrastructure for Restricted Transactional Memory - add a buffer for elided locks, similar to store buffer XACQUIRE of lock ensures shared/exclusive cache line state, issues XBEGIN and stores written value in elided lock buffer - r/w access to a cache line sets T - applying an invalidate message from invalidate queue to an address in the elided lock buffer issues XABORT - a read message for a modified cache line or an invalidate message makes the transaction irrevocable Concurrency: Transactions Hardware Transactional Memory 29 / 33 #### References - D. Dice, O. Shalev, and N. Shavit. Transactional Locking II. In Distributed Coputing, LNCS, pages 194–208. Springer, Sept. 2006. - T. Harris, J. Larus, and R. Rajwar. Transactional memory, 2nd edition. Synthesis Lectures on Computer Architecture, 5(1):1–263, 2010. #### Online blog entries on Intel HTM: - http://software.intel.com/en-us/blogs/2013/07/25/ fun-with-intel-transactional-synchronization-extensions #### **Implementing Lock Elision** Transactional operation: - re-uses infrastructure for Restricted Transactional Memory - add a buffer for elided locks, similar to store buffer - XACQUIRE of lock ensures shared/exclusive cache line state, issues XBEGIN and stores written value in elided lock buffer - r/w access to a cache line sets T - applying an invalidate message from invalidate queue to an address in the elided lock buffer issues XABORT - a read message for a modified cache line or an invalidate message makes the transaction irrevocable - if irrevocable, clear all T flags, set C=0 and move elided buffer to store buffer Concurrency: Transactions rdware Transactional Memor 29 / 33 Concurrency: Transactions Hardware Transactional Memory 33 / 33