Random Post: Rolling the DICE
RSS .92| RSS 2.0| ATOM 0.3
  • Home
  • About
  •  

    Four Coherence-based Solutions to a Data Hot Spot Problem

    August 31st, 2011

    Summary

    Leveraging the DICE framework I developed last year (write-up available under “Papers and Articles” at http://www.scapps.co.uk), I developed four different solutions to a simplified (non-distributed) variant of the DICE computing problem using Coherence 3.7, each using a different approach to concurrency management. I then measured the throughput of each implementation using DICE configurations that promote a high rate of concurrent requests for access to shared data objects.

    The implementation based on Coherence’s Entry Processors feature showed the highest throughput.

    Two of the four implementations provide ACID transactional guarantees. The transactional implementations provided lower throughput than the non-transactional implementations.

    One of the two transactional implementations uses a Coherence transactionality feature which is now deprecated. The other uses Coherence’s new Transactions Framework. The transactional implementation based on the deprecated Coherence feature provided greater throughput than the implementation based on the new Transaction Framework.

    Simplified DICE Computing Problem

    There are two domain classes:

    • Counters, each with a unique identifier and a counter value.
    • Incrementers, each containing a unique identifier and a reference to a Counter, and a flag that is used to distinguish between Incrementers that have and have not been processed. Each Incrementer represents an event to be processed.

    Processing an Incrementer event involves the following steps:

    1. The Incrementer object is interrogated to obtain the id of the Counter object to which it refers.
    2. The Counter object is located.
    3. The value of the Counter object’s counter field is incremented.
    4. The Incrementer object’s processed flag is set.
    5. The new state of the Counter is saved in the shared data area.
    6. The Incrementer is written to the shared data area. The throughput being measured is the rate at which these update operations (DICE ops.) are performed.

    This problem is a simplified version of the original DICE problem, which required that domain data and processing be distributed across more than one host. In the simplified version presented here, all data is stored and all processing is done on a single host. Note, however, that the solutions presented here can also work without modification with distributed clients, data or both.

    The Four Implementations

    The client program in each of the four implementations is a cluster member without local storage. The cache(s) needed by each implementation are established and managed by another cluster member which is started before the client is run.

    The four implementations differ in the ways they perform DICE ops:

    • Client non-transactional – The client’s worker threads read from and write to the caches, using explicit locks to manage concurrency. Two caches are used, one per domain type.
    • EntryProcessor non-transactional – The client’s worker threads send Entry Processors to the cluster member that hosts the data. That cluster member performs the updates, taking advantage of the Entry Processors’ inherent concurrency controls. One cache is used for both domain types so that a single Entry Processor can be used for each DICE op.
    • Client transactional deprecated – The client’s worker threads read from and write to the caches using the (deprecated) TransactionMap interface. Pessimistic mode is used to manage concurrency. Two caches are used, one per domain type.
    • Client transactional Transaction Framework – The client’s worker threads read from and writes to the caches using the Transaction Framework’s OptimisticNamedCache interface. When updates fail for concurrency-related reasons (either a required lock is unavailable or the optimistic assumption proves false), retries are issued automatically. Two caches are used, one per domain type.

    Each DICE Op. Includes two writes to the cache: a Counter is updated, and an Incrementer is inserted. In the two transactional implementations, these writes are performed as an atomic operation. In the non-transactional implementation they are not.

    DICE Configuration

    Here are some key points from the DICE application configuration used for these tests:

    • The number of Counter objects was set to two for one series of tests, and to ten for another series (as shown in the table of results).
    • One client instance was run for each test execution.
    • The client spawned five worker threads that initiated DICE ops.
    • Each thread was instructed to execute 2,000 DICE ops.
      (The total number of DICE ops. per test was therefore 5 x 2,000 = 10,000.)
    • Counter objects each contain a string field that can be populated to increase the size of each instance. For these tests a string of 4,096 bytes was used.

    Coherence Cache Configuration

    Here are some key points from the Coherence cache configuration used for these tests:

    A distributed scheme managed by the DistributedCache service was used for the two non-transactional implementations and the transactional implementation based on the deprecated TransactionMap interface.

    • A transactional scheme managed by the TransactionalCache service was used for the TransactionMap implementation.
    • In both cases, thread-count was set to two; lease-granularity was set to ‘lease’; and an unlimited backing map local scheme was used.

    Observed Throughput

    The Coherence license under which I am working does not allow me to “disclose results of any program benchmark tests without our prior consent”. Because I don’t have Oracle’s consent to publish my observations, I am reporting results on a relative basis, with the throughput rates I observed indexed to the lowest value.

    Several iterations of the test were run with each implementation. The table that follows shows the best throughput achieved by each implementation with two and five Counters respectively, indexed to the throughput of the lowest observed rate (Transaction Framework with two Counters).

    implementation

    no. Counters

    indexed throughput

    EntryProcessor non-transactional

    2

    11.7

    client non-transactional

    2

    3.7

    client transactional deprecated

    2

    2.4

    client transactional Transaction Framework

    2

    1.0

    EntryProcessor non-transactional

    10

    12.1

    client non-transactional

    10

    5.3

    client transactional deprecated

    10

    3.1

    client transactional Transaction Framework

    10

    2.1

    When considering these results it is important to remember that:

    • Throughput was measured using a DICE configuration designed to produce very heavy contention for access to a small number of data elements.

    • Neither the execution environment nor the Coherence configuration were tuned to optimize the throughput of any of the implementations.

    Conclusions and Interpretation

    Higher rates of throughput were achieved by the non-transactional implementations than by the transactional implementations. Presumably this is because of the overhead required to support the atomicity and isolation properties of transactions.

    Of the two non-transactional implementations, the implementation that used EntryProcessors provided greater throughput by a factor of more than two. This advantage is likely attributable to the smaller amount of inter-process communication required by the Entry Processors compared with the non-transactional client implementation.

    Of the two transactional implementations, the one based on the deprecated TransactionMap feature showed greater throughput. This is probably because its pessimistic mode is more efficient than the Transaction Framework’s optimistic concurrency control for this use-case, which is write-intensive and contention-intensive by design.

    As shown in the following table, each implementation achieved greater throughput when the number of Counters was increased (although in the case of the EntryProcessor implementation the increase was only about 4%). The reduced rate of contention for each Counter probably explains the increases.

    implementation

    % change

    EntryProcessor non-transactional

    4%

    client non-transactional

    42%

    client transactional deprecated

    33%

    client transactional Transaction Framework

    134%

    The percentage improvement in throughput when the number of counters was increased was greatest (134%) for the Transaction Framework implementation. The likely explanation is that the cost of the retries used by this implementation to recover from concurrency conflicts is very high. With the updates spread across five times as many Counters the number of retries was substantially reduced.

    As the following table show, the throughput advantage of the implementation using the deprecated TransactionMap feature over the implementation using the newer Transaction Framework became much smaller percentage-wise when the number of Counters was increased from two to ten. This observation suggests that the retries employed by the Transaction Framework implementation to resolve concurrency conflicts account for most of the difference in throughput rates between the old and new transactionality features.

    Counters

    % increase

    Two Counters

    136%

    Ten Counters

    34%

     



    It wasn’t a race, but . . .

    May 6th, 2009

    This DICE study I have been working on is as much about scalability – what happens to throughput when you add an additional unit of compute power – as about raw performance.  Nevertheless, testing eight different solutions to the same problem did provide some insight into overall performance characteristics.  I’ll tell you how the eight approaches stacked up in terms of speed, but first let me summarize the problem and the eight solutions.

    The problem was to update the objects (“counters”) in a shared data area.  Each update was represented by an object (“updaters”) containing a reference to the counter object to be updated.  The updater objects also had to be written to the shared data area.  High ratios of updaters to counters promoted contention for access to the counters, creating classic hot spots.

    Three solutions used Terracotta. The first had several instances of a simple program that processed a list of updaters.  The instances tussled for access to the counters.  The second solution was like the first except that each instance of the program had a list of updaters that referred to a distinct subset of the counters.  With this approach there was no competition for the Counters.  The third Terracotta solution had each instance of an updating program fed by a private queue.  Each queue contained updaters that referred to a distinct set of counters so, again, there was no competition for the counters.

    Among these three approaches, the second proved to be the fastest overall, achieving over 5,000 updates per second when run with two or four instances of the updating program.  The first (and simplest) approach was the second fastest.  It’s best performance came with only a single updating program running, when it achieved over 3,400 updates per second.  Overall throughput declined precipitously as additional instances were added.

    The third approach was the slowest, but got faster consistently as the number of instances was increased from one to two to four to eight (about the limit of my test environment).  With one instance of  the updating program running throughput was 576 updater per second.  At eight updaters throughput was 1,282 updates per second.

    The five GigaSpaces solutions worked as follows:

    1. A simple non-PU client that connects to a partitioned space and executes reads  and writes against the space.

    2. Clients send updater objects to a partitioned space to be processed.  Updates performed by PUs against local space instances (space-based architecture) that use  a polling containers to detect the arrival of new updater objects.  Clients use writeMultiple() to improve throughput.

    3.  Just like no. 2, but using FIFO features to preserve ordering of updates per counter.

    4. Clients invoke remote methods advertised by PUs to update counters.  Updaters are passed as arguments.  PUs do work against local space instances.

    5. Clients send update requests to spaces as Task objects.  Spaces execute the tasks.

    Among these five approached, numbers two and three, both of which use writeMultiple() and polling containers, were the fastest by substantial margins.  Number two delivered over 30,000 updates per second with two clients and four updaters.  Number three came close to 17,000 updates per second with one client and two updaters.

    Next fastest was number five at about 3,800 updates per second with one client and two updaters.  Number four peaked at around 2,400 updaters per second with two clients and two updaters.  Slowest of the five was number one, which reached between 1,000 and 1,100 updates per second in a variety of configurations.

    Analyzing these results and explaining the differences in performance are topics too large for this post.  A few things are clear  however,  from even the simple set of results presented above:

    1. The concept of locality – which decomposes into the related concepts of proximity and exclusivity – is profoundly important in designing solutions to this class of problem.

    2. A very wide range of results is possible depending on the solution architecture.

    3. For raw speed, GigaSpaces’ polling container construct offers a significant advantage over any of the other choices examined here.