It wasn’t a race, but . . .
This DICE study I have been working on is as much about scalability – what happens to throughput when you add an additional unit of compute power – as about raw performance. Nevertheless, testing eight different solutions to the same problem did provide some insight into overall performance characteristics. I’ll tell you how the eight approaches stacked up in terms of speed, but first let me summarize the problem and the eight solutions.
The problem was to update the objects (“counters”) in a shared data area. Each update was represented by an object (“updaters”) containing a reference to the counter object to be updated. The updater objects also had to be written to the shared data area. High ratios of updaters to counters promoted contention for access to the counters, creating classic hot spots.
Three solutions used Terracotta. The first had several instances of a simple program that processed a list of updaters. The instances tussled for access to the counters. The second solution was like the first except that each instance of the program had a list of updaters that referred to a distinct subset of the counters. With this approach there was no competition for the Counters. The third Terracotta solution had each instance of an updating program fed by a private queue. Each queue contained updaters that referred to a distinct set of counters so, again, there was no competition for the counters.
Among these three approaches, the second proved to be the fastest overall, achieving over 5,000 updates per second when run with two or four instances of the updating program. The first (and simplest) approach was the second fastest. It’s best performance came with only a single updating program running, when it achieved over 3,400 updates per second. Overall throughput declined precipitously as additional instances were added.
The third approach was the slowest, but got faster consistently as the number of instances was increased from one to two to four to eight (about the limit of my test environment). With one instance of the updating program running throughput was 576 updater per second. At eight updaters throughput was 1,282 updates per second.
The five GigaSpaces solutions worked as follows:
1. A simple non-PU client that connects to a partitioned space and executes reads and writes against the space.
2. Clients send updater objects to a partitioned space to be processed. Updates performed by PUs against local space instances (space-based architecture) that use a polling containers to detect the arrival of new updater objects. Clients use writeMultiple() to improve throughput.
3. Just like no. 2, but using FIFO features to preserve ordering of updates per counter.
4. Clients invoke remote methods advertised by PUs to update counters. Updaters are passed as arguments. PUs do work against local space instances.
5. Clients send update requests to spaces as Task objects. Spaces execute the tasks.
Among these five approached, numbers two and three, both of which use writeMultiple() and polling containers, were the fastest by substantial margins. Number two delivered over 30,000 updates per second with two clients and four updaters. Number three came close to 17,000 updates per second with one client and two updaters.
Next fastest was number five at about 3,800 updates per second with one client and two updaters. Number four peaked at around 2,400 updaters per second with two clients and two updaters. Slowest of the five was number one, which reached between 1,000 and 1,100 updates per second in a variety of configurations.
Analyzing these results and explaining the differences in performance are topics too large for this post. A few things are clear however, from even the simple set of results presented above:
1. The concept of locality – which decomposes into the related concepts of proximity and exclusivity – is profoundly important in designing solutions to this class of problem.
2. A very wide range of results is possible depending on the solution architecture.
3. For raw speed, GigaSpaces’ polling container construct offers a significant advantage over any of the other choices examined here.
Nice writeup.
1. I understand that the Terracotta test was using thier commercial edition, do you know what would be the numbers on the free edition?
2. When you mention:
“The second solution was like the first except that each instance of the program had a list of updaters that referred to a distinct subset of the counters”
I believe that you mean that you basically partitioned the client i.e. user application in such away that each user will only access its own “private” set of data right? – If that is the case doesn’t that breaks the entire test as if understand correctly the idea was to test the case of using “shared” data.
I believe that you didn’t had to apply the same restriction on the gigaspaces side.
3. Have you applied any measure of high availability to the two tests i.e. do you know if the configuration you where running with would be fail-safe? As far as i can tell a fail-safe configuration could be significantly different (In Terracotta it might require synch to disk which may change the entire performance characteristics and in GigaSpaces you will need synch to backup).
4. Did you tried to measure the linear scalability behaviour?
Hi Nati,
These questions and comments are very helpful. Thanks so much for looking in.
1. I did not compare performance between editions of Terracotta.
2. The work was partitioned for “perfect” locality, but the data was still in a shared space, equally accessible to every client. From my perspective this approach is consistent with the exercise.
To elaborate, the Counter objects were in a clustered data structure that made each of them equally available to every client. It was “luck of the draw” (carefully engineered by me, of course) that no two clients chose to update the same Counter. Also, the processed Incrementer objects were written to a clustered data structure. The proof of the matter was that my results checking program was able to tally the results without regard for location of data, a standard I applied to every solution.
I could have created a GigaSpaces configuration in which the pending updates were partitioned on the client side but there didn’t seem to be much point as that change would only have been significant when the client was doing the updates. With GigaSpaces I wanted to explore the rich assortment of server-side processing options. In all the server-side cases the Counters and the Incrementers were routed by the Counter’s id.
3. I bypassed the high availability dimension in the interest of keeping the test cases tree to a manageable size. (You might be astonished by how long it took to achieve even the modest results I summarized in the post; I certainly was.)
I know from experience that there is a price to be paid in reduced throughput when high availability on the form of backup spaces is activated in GigaSpaces. I assume that I would experience a similar effect with Terracotta. Note, however, that achieving high availability with Terracotta would not necessarily require activating disk writes. Apparently Terracotta can replicate data between servers using a configuration option that is orthogonal to its ability to write all of the data to disk. So I think you could have a replicated (high availability) configuration that does not provide guaranteed recoverability from disk, or vice versa, or both, or neither. (Note that I have not tested these options.)
4. Can you provide some more detail on the kind of linear scalability testing you are thinking of?
I have been reviewing a draft of the full paper from which this post was derived with a representative from Terracotta and with Uri Cohen at GigaSpaces. Both have faithfully honored my request that they not distribute the paper in its current form. However, if you would like to grab a copy of the draft from Uri, please feel free. Kindly do not distribute it.
-Dan
Dan
Thanks for the quick response.
WRT to point 2 I believe that the fact that the data can be shared potentially doesn’t make it shared in the context of the test. If I understand correctly there are different assumptions and performance benefit when the data is not “really” shared i.e. each process access its own private set of data at specific point in time.
IMO shared means by definition that two process can access the same object at the same time – changing that assumption makes it a completely different scenario.
To give you an analogy its like comparing session-state to entity state. The decision to change the access pattern to a certain object is very application specific. Unless I miss something I don’t see how it can be compared in the context of a generic benchmark. Would you agree with that?
My points on the rest of item was for clarification purposes – so thanks for the clarifications.
Another point that is important on that regard is that in many transactional system the transaction involve more then one step (update) – a typical scenario will involve parsing, matching, executing etc. In SBA all the steps will be done locally beyond the first injection of the first data item. In addition to that it is critical that all objects that are associated with the transaction will be collocated. Guaranteed FIFO between the steps is also critical to ensure the correctness of the business logic. These are things that are hard to achieve at the JVM level. See the following link as a reference for a Typical SBA Application
Hi Nati,
You make an excellent point. I’m afraid that I have caused some confusion by not being clear enough about the purpose of the DICE work from which these results are taken. Quoting from the draft write-up:
DICE is structured as a series of solutions to a simple problem
that embodies some of the basic challenges of a distributed
application acting on a shared data set. There are eight solutions
in DICE, with three designed to take advantage of Terracotta’s
capabilities and five exploiting GigaSpaces’ features. Each has been
implemented and its throughput measured under a variety of
conditions. The implementations and the results of the performance
testing are described in the remainder of this paper, with an
emphasis on scalability as opposed to absolute performance.
In other words, I’m not representing that the eight approaches are completely comparable, only that they all meet certain functional requirements. Throughput results, while interesting and informative, don’t represent the main purpose of the work.
The solutions that never produce competition for the Counter objects certainly have an advantage over those that do in terms of managing access to (potentially) shared data. Of course, some of the solutions that result in competition for the Counters have other advantages (such as asynchronous execution and multi-threading) that the no-conflict solutions may lack.
I hope this makes sense.
-Dan