Four Coherence-based Solutions to a Data Hot Spot Problem
August 31st, 2011Summary
Leveraging the DICE framework I developed last year (write-up available under “Papers and Articles” at http://www.scapps.co.uk), I developed four different solutions to a simplified (non-distributed) variant of the DICE computing problem using Coherence 3.7, each using a different approach to concurrency management. I then measured the throughput of each implementation using DICE configurations that promote a high rate of concurrent requests for access to shared data objects.
The implementation based on Coherence’s Entry Processors feature showed the highest throughput.
Two of the four implementations provide ACID transactional guarantees. The transactional implementations provided lower throughput than the non-transactional implementations.
One of the two transactional implementations uses a Coherence transactionality feature which is now deprecated. The other uses Coherence’s new Transactions Framework. The transactional implementation based on the deprecated Coherence feature provided greater throughput than the implementation based on the new Transaction Framework.
Simplified DICE Computing Problem
There are two domain classes:
- Counters, each with a unique identifier and a counter value.
- Incrementers, each containing a unique identifier and a reference to a Counter, and a flag that is used to distinguish between Incrementers that have and have not been processed. Each Incrementer represents an event to be processed.
Processing an Incrementer event involves the following steps:
- The Incrementer object is interrogated to obtain the id of the Counter object to which it refers.
- The Counter object is located.
- The value of the Counter object’s counter field is incremented.
- The Incrementer object’s processed flag is set.
- The new state of the Counter is saved in the shared data area.
- The Incrementer is written to the shared data area. The throughput being measured is the rate at which these update operations (DICE ops.) are performed.
This problem is a simplified version of the original DICE problem, which required that domain data and processing be distributed across more than one host. In the simplified version presented here, all data is stored and all processing is done on a single host. Note, however, that the solutions presented here can also work without modification with distributed clients, data or both.
The Four Implementations
The client program in each of the four implementations is a cluster member without local storage. The cache(s) needed by each implementation are established and managed by another cluster member which is started before the client is run.
The four implementations differ in the ways they perform DICE ops:
- Client non-transactional – The client’s worker threads read from and write to the caches, using explicit locks to manage concurrency. Two caches are used, one per domain type.
- EntryProcessor non-transactional – The client’s worker threads send Entry Processors to the cluster member that hosts the data. That cluster member performs the updates, taking advantage of the Entry Processors’ inherent concurrency controls. One cache is used for both domain types so that a single Entry Processor can be used for each DICE op.
- Client transactional deprecated – The client’s worker threads read from and write to the caches using the (deprecated) TransactionMap interface. Pessimistic mode is used to manage concurrency. Two caches are used, one per domain type.
- Client transactional Transaction Framework – The client’s worker threads read from and writes to the caches using the Transaction Framework’s OptimisticNamedCache interface. When updates fail for concurrency-related reasons (either a required lock is unavailable or the optimistic assumption proves false), retries are issued automatically. Two caches are used, one per domain type.
Each DICE Op. Includes two writes to the cache: a Counter is updated, and an Incrementer is inserted. In the two transactional implementations, these writes are performed as an atomic operation. In the non-transactional implementation they are not.
DICE Configuration
Here are some key points from the DICE application configuration used for these tests:
- The number of Counter objects was set to two for one series of tests, and to ten for another series (as shown in the table of results).
- One client instance was run for each test execution.
- The client spawned five worker threads that initiated DICE ops.
- Each thread was instructed to execute 2,000 DICE ops.
(The total number of DICE ops. per test was therefore 5 x 2,000 = 10,000.) - Counter objects each contain a string field that can be populated to increase the size of each instance. For these tests a string of 4,096 bytes was used.
Coherence Cache Configuration
Here are some key points from the Coherence cache configuration used for these tests:
A distributed scheme managed by the DistributedCache service was used for the two non-transactional implementations and the transactional implementation based on the deprecated TransactionMap interface.
- A transactional scheme managed by the TransactionalCache service was used for the TransactionMap implementation.
- In both cases, thread-count was set to two; lease-granularity was set to ‘lease’; and an unlimited backing map local scheme was used.
Observed Throughput
The Coherence license under which I am working does not allow me to “disclose results of any program benchmark tests without our prior consent”. Because I don’t have Oracle’s consent to publish my observations, I am reporting results on a relative basis, with the throughput rates I observed indexed to the lowest value.
Several iterations of the test were run with each implementation. The table that follows shows the best throughput achieved by each implementation with two and five Counters respectively, indexed to the throughput of the lowest observed rate (Transaction Framework with two Counters).
|
implementation |
no. Counters |
indexed throughput |
|
EntryProcessor non-transactional |
2 |
11.7 |
|
client non-transactional |
2 |
3.7 |
|
client transactional deprecated |
2 |
2.4 |
|
client transactional Transaction Framework |
2 |
1.0 |
|
EntryProcessor non-transactional |
10 |
12.1 |
|
client non-transactional |
10 |
5.3 |
|
client transactional deprecated |
10 |
3.1 |
|
client transactional Transaction Framework |
10 |
2.1 |
When considering these results it is important to remember that:
-
Throughput was measured using a DICE configuration designed to produce very heavy contention for access to a small number of data elements.
-
Neither the execution environment nor the Coherence configuration were tuned to optimize the throughput of any of the implementations.
Conclusions and Interpretation
Higher rates of throughput were achieved by the non-transactional implementations than by the transactional implementations. Presumably this is because of the overhead required to support the atomicity and isolation properties of transactions.
Of the two non-transactional implementations, the implementation that used EntryProcessors provided greater throughput by a factor of more than two. This advantage is likely attributable to the smaller amount of inter-process communication required by the Entry Processors compared with the non-transactional client implementation.
Of the two transactional implementations, the one based on the deprecated TransactionMap feature showed greater throughput. This is probably because its pessimistic mode is more efficient than the Transaction Framework’s optimistic concurrency control for this use-case, which is write-intensive and contention-intensive by design.
As shown in the following table, each implementation achieved greater throughput when the number of Counters was increased (although in the case of the EntryProcessor implementation the increase was only about 4%). The reduced rate of contention for each Counter probably explains the increases.
|
implementation |
% change |
|
EntryProcessor non-transactional |
4% |
|
client non-transactional |
42% |
|
client transactional deprecated |
33% |
|
client transactional Transaction Framework |
134% |
The percentage improvement in throughput when the number of counters was increased was greatest (134%) for the Transaction Framework implementation. The likely explanation is that the cost of the retries used by this implementation to recover from concurrency conflicts is very high. With the updates spread across five times as many Counters the number of retries was substantially reduced.
As the following table show, the throughput advantage of the implementation using the deprecated TransactionMap feature over the implementation using the newer Transaction Framework became much smaller percentage-wise when the number of Counters was increased from two to ten. This observation suggests that the retries employed by the Transaction Framework implementation to resolve concurrency conflicts account for most of the difference in throughput rates between the old and new transactionality features.
-
Counters
% increase
Two Counters
136%
Ten Counters
34%