
Single-Threading: Back to the Future?


So the ‘multi-core revolution’ is finally here [Merritt07][Suess07][Martin10] (some might argue that it has already been here for several years, but that’s beside the point now). Without arguing whether it is good or bad, we should agree that it is a reality which we cannot possibly change. The age of CPU frequency doubling every two years is long gone and we shouldn’t expect any substantial frequency increases in the near future; and while there are still improvements in single-core performance unrelated to raw frequency increases, from now on we should count mostly on multiple cores to improve performance. Will it make development easier? Certainly not. Does it mean that everybody will need to deal with mutexes, atomics, in-memory transactions (with both optimistic and pessimistic locking), memory barriers, deadlocks, and the rest of the really scary multi-threading stuff, or switch to functional languages merely to deal with multi-threading? Not exactly, and we’ll try to show why below. Please note that in this article we do not try to present anything substantially new; it is merely an analysis of existing (but unfortunately way too often overlooked) mechanisms and techniques.

'No Multi-Threaded Bugs' Hare & 'Multithreaded Gorrillazz'

How long is ‘as long as possible’?

It is more or less commonly accepted that multi-threading is a thing which should be avoided for as long as possible. While writing multi-threaded code might not look too bad at first glance, debugging it is a very different story. Because of the very nature of multi-threaded code, it is non-deterministic (i.e. its behavior can easily differ for every run), and as a result finding all the bugs in testing becomes unfeasible even if you know where to look and what exactly you want to test; in addition, code coverage metrics aren’t too useful for evaluating coverage of possible race scenarios in multi-threaded code. To make things worse, even if a multi-threaded bug is reproducible, every time it will happen on a different iteration, so there is no way to restart the program and stop it a few steps before the bug; and step-by-step debugging tends to disturb the very fragile race conditions it is trying to catch, so it is rarely helpful for finding multi-threaded bugs. With post-mortem analysis of multi-threaded races using log files usually being impossible too, this leaves the developer almost without any reliable means to debug multi-threaded code, making it more of a trial-and-error exercise based on ‘gut feeling’ without a real understanding of what is really going on (until the bug is identified). Maintaining multi-threaded code is even worse: heavily multi-threaded code tends to be very rigid and fragile, and making any changes requires careful analysis and lots and lots of debugging, making any mix of frequently changed business logic with heavy multi-threading virtually suicidal (unless multi-threading and business logic are clearly separated into different levels of abstraction).

With all these (and many other) problems associated with multi-threaded code, it is easy to agree that multi-threading should be avoided. On the other hand, there is some disagreement on how long we can avoid it. In this article we will try to discuss how performance issues (if there are any) can be handled without going into too much detail of multi-threading. While it is not always possible, the number of cases when multi-threading can be avoided is extensive. And, as discussed above, whenever you can avoid it – you should avoid it, despite the potential fear that programs without multi-threading aren’t ‘cool’ anymore. After all, the end-user (the guy who we all are working for) couldn’t care less how many threads the program has or whether it utilizes all the available cores, as long as the program works correctly and is fast enough. In fact, using fewer cores is often beneficial for the end-user, who is then able to do something else at the same time; we also need to keep in mind that the overhead incurred by multi-threading/multi-coring can be huge, and that Amdahl’s Law provides only the theoretical maximum speedup from parallelization, with realized gains often being not even close to that. If a single-threaded program does something in a minute on one core, and a multi-threaded one does the same thing in 55 seconds on 8 cores (which can easily happen if the granularity of context switching is suboptimal for any reason), it looks quite likely that the user would prefer the single-threaded program.

‘No Multi-Threaded Bugs’ Hare & ‘Multithreaded Gorrillazz’

Let us consider a developer who really hates dealing with those elusive multi-threading bugs. As he is our positive hero, we need to make him somewhat cute, so let’s make him a bunny rabbit. However, he’s not quite little enough to be called a bunny, so he becomes the ‘No Multi-Threaded Bugs’ Hare. Standing against him is a whole bunch of reasons which try to push him into heavy multi-threading with all the associated problems. Let’s picture them as the ‘Multithreaded Gorrillazz’ defending team. To win, our ‘No MT Bugs’ Hare needs to rush through the whole field full of Gorrillazz and score a touchdown. While it might seem hopeless, he has one advantage on his side: while extremely powerful and heavy, the Gorrillazz are usually very slow, so in many cases he can escape them before they reach him.

A few minor notes before the game starts: first of all, in this article we will address only programs which concentrate on interacting with the user one way or another (it can be a web, desktop, or mobile phone program, but interaction should be a substantial part of it). Scientific calculations/HPC/video rendering farms/etc. are a completely different game which is played on a very different field, so we will not discuss them here. The second important note is that we’re making a distinction between ‘a bit of multi-threaded code in a very limited number of isolated places’ and ‘massive multi-threading all over the code’. While the former can usually be managed with a limited effort and (given that there are no better options) we’ll consider it acceptable, the latter is exactly the thing we’re aiming to avoid.

Houston, have we got a problem?

So, our ‘No MT Bugs’ Hare is standing all alone against the whole field of fierce Gorrillazz. What is the first move he should make? First of all, we need to see if he’s writing client-side code or server-side code. In the first two quarters of the game we will concentrate on client-side code, with the server side considered in quarters 3&4 (coming in the next issue). And if the application is client-side, then the very first question for our hero is the following: does his application really experience any performance problems (in other words, would users actually care if the application ran faster)? If not, he can simply ignore all the Gorrillazz, at least for the moment, and stay single-threaded. And while Moore’s law doesn’t work any more for frequencies, and modern CPUs are stuck at a measly 3-4GHz, that is still about 1000 times more than the frequency of the first PCs, which (despite a 1000-times-less-than-measly-3-4-GHz-modern-CPUs speed) were indeed able to do a thing or three. It is especially true if Hare’s application is a business-like one, with logic like ‘if user clicks button ‘ok’, close this dialog and open dialog ‘ZZZ’ with field ‘XXX’ set to the appropriate value from the previous dialog’; in cases like this, it is virtually impossible to imagine how such logic (taken alone, without something such as an associated MP4 playing in one of the dialogs – we’ll deal with such a scenario a bit later) can possibly require more than one modern core, regardless of the code size of this logic.

To block or not to block – there is no question

If our ‘No MT Bugs’ Hare does indeed have performance problems with his client application, then there is a big chance that they are related to I/O (if there are doubts, a profiler should be able to help clarify, although it’s not always 100% obvious from the results of the profiling). If a user experiences delays while the CPU isn’t loaded, the cause often has nothing to do with using more cores, or with the CPU in general, but is all about I/O. While hard disk capacities went up tremendously in recent years, typical access time has seen much more modest improvements, and is still on the order of 10ms, or more than 10^7 CPU clock cycles. No wonder that, if Hare’s program accesses the disk heavily, the user can experience delays. Network access can incur even larger delays: while in a LAN the typical round-trip time is normally within 1ms, a typical transatlantic round-trip time is around 100-200ms, and that is only if everything on the way works perfectly; if something goes wrong, one can easily run into delays on the order of seconds (such as a DNS timeout or a TCP retransmit), or even minutes (for example, the typical BGP convergence time [Maennel02]); the last number is about eleven orders of magnitude larger than the CPU clock period. As Garfield the Cat would have put it: ‘Programmers who block the UI while waiting for network I/O should be dragged out into the street and shot’.

The way to avoid I/O delays for the user without going into multi-threading has been well known since at least the 1970s, but unfortunately is rarely used in practice: it is non-blocking I/O. The concept itself is very simple: instead of telling the system ‘get me this piece of information and wait until it’s here’, say ‘start getting this piece of information, return control to me right away and let me know when you’re done’. These days non-blocking I/O is almost universally supported (even the originally 100%-thread-oriented Java eventually gave up and started to support it), and even if it is not supported for some specific and usually rather exotic API function (such as FlushFileBuffers() on Windows), it is usually not that big a problem for our Hare to implement it himself via threads created specially for this purpose. While implementing non-blocking I/O himself via threads will involve some multithreaded coding, it is normally not too complicated, and most importantly it is still confined to one single place, without the need to spread it over all the rest of the code.
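
To illustrate the do-it-yourself variant mentioned above, here is a minimal sketch (our own, not any specific framework’s API) of wrapping a blocking call into a ‘start it and let me know when you’re done’ interface via a dedicated thread; the queue class and names are assumptions made purely for this example:

#include <thread>
#include <mutex>
#include <queue>
#include <functional>

// Sketch of a completion queue which is drained only by the single main thread.
class MainThreadQueue {
  std::mutex m;
  std::queue<std::function<void()>> q;
public:
  void post(std::function<void()> msg) {
    std::lock_guard<std::mutex> lock(m);
    q.push(std::move(msg));
  }
  void poll() { // called from the main loop; completions run on the main thread
    std::queue<std::function<void()>> local;
    { std::lock_guard<std::mutex> lock(m); std::swap(local, q); }
    while(!local.empty()) { local.front()(); local.pop(); }
  }
};

// 'Do-it-yourself non-blocking I/O': run the blocking call on a helper thread
// and post the completion back to the main thread; no other data is shared.
void start_and_notify(MainThreadQueue& main_queue,
                      std::function<void()> blocking_op, // e.g. a FlushFileBuffers() wrapper
                      std::function<void()> on_done) {
  std::thread([&main_queue, blocking_op, on_done]() {
    blocking_op();            // the only blocking call, safely off the main thread
    main_queue.post(on_done); // 'let me know when you're done'
  }).detach();                // error handling and shutdown are omitted in this sketch
}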

'No Bugs' Rabbit Escapes from Multithreaded Gorrilla

Non-blocking I/O vs heavy multi-threading

Unfortunately, doing things in parallel (using non-blocking I/O or any other means) inherently has a few not-so-pleasant implications. In particular, it means that after the main thread has started the I/O, an essentially new program state is created. It also means that program runs are not 100% deterministic anymore, and that the potential for races is opened (there can be differences in program behavior depending on the point at which the I/O has ended). Despite all of this, doing things in parallel with non-blocking I/O is still much more manageable than a generic heavily multi-threaded program with shared variables, mutexes etc. Compared to the heavily multi-threaded approach, a program which is based on non-blocking I/O usually has fewer chances for races to occur, and step-by-step debugging has more chances to work. This happens because, while there is some non-determinism in a non-blocking I/O program, the number of significantly different scenarios (‘I/O has been completed earlier or later than a certain other event’) is usually orders of magnitude smaller than the potential number of different scenarios in a heavily multi-threaded program (where a context switch after every instruction can potentially cause a substantially different scenario and lead to a race). It can even be possible to perform a formal analysis of all the substantially different scenarios due to different I/O timing, providing a theoretical proof of correctness (a similar proof is definitely not feasible for any heavily multi-threaded program which is more complicated than ‘Hello, World!’). But the most important advantage of the non-blocking I/O approach is that, with a proper level of logging, it is possible to reconstruct the exact sequence of events which has led to a problem, and to identify the bug based on this information. This means we still have some regular way to identify bugs, rather than relying on trial-and-error (which can easily take years if we’re trying to identify a problem which manifests itself only in production, and only once in a while); in addition, it also means that we can perform post-mortem analysis in production environments.

These improvements in code quality and debugging/testing efficiency don’t come for free. While adding a single non-blocking I/O is usually simple, handling lots of them can require quite a substantial effort. There are two common ways of handling this complexity. One approach is event-driven programming, ubiquitous in the GUI programming world for user events; for non-blocking I/O it needs to be extended to include ‘I/O has been completed’ events. Another approach is to use finite state machines (which can vary in many aspects, including, for example, hierarchical state machines). We will not address the differences between these approaches here, merely mentioning that any such implementation will have all the debugging benefits described above.
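
As a purely illustrative sketch (the event names and the tiny dispatcher below are our own assumptions, not a particular GUI framework), a finite state machine which treats ‘I/O has been completed’ as just another event might look like this:

#include <cstdio>

// Hypothetical event set: user events plus an 'I/O has been completed' event.
enum class EventType { UserClickedSearch, IoCompleted, UserCancelled };
struct Event { EventType type; /* payload omitted for brevity */ };

// A tiny finite state machine; all handling happens on the single main thread.
class SearchDialogFsm {
  enum class State { Idle, WaitingForIo };
  State state = State::Idle;
public:
  void on_event(const Event& ev) {
    switch(state) {
    case State::Idle:
      if(ev.type == EventType::UserClickedSearch) {
        // here we would start the non-blocking search (not shown)
        state = State::WaitingForIo;
      }
      break;
    case State::WaitingForIo:
      if(ev.type == EventType::IoCompleted) {
        std::printf("search finished - update the UI\n");
        state = State::Idle;
      } else if(ev.type == EventType::UserCancelled) {
        // here we would cancel the pending I/O (not shown)
        state = State::Idle;
      }
      break;
    }
  }
};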

One common problem for both approaches above is that if our Hare has lots of small pieces of I/O, making all of them non-blocking can be quite tedious. For example, if his program makes a long search in a file then, while the whole operation can be very long, it will consist of many smaller pieces of I/O, and handling all the associated states will be quite a piece of work. It is often very tempting to combine a whole bunch of such micro-I/Os into a single macro-operation to simplify coding. This approach often works pretty well, but only as long as two conditions are met: (a) the whole operation is treated as a kind of large custom non-blocking I/O; (b) until the macro-operation is completed, there is absolutely no interaction between this macro-operation and the main thread, except for the ability to cancel the macro-operation from the main thread. Fortunately, these two conditions can usually be met, but as soon as at least some interaction is added, this approach won’t work anymore and will need to be reconsidered (for example, by splitting the big macro-operation into two non-blocking I/O operations at the place of interaction, or by introducing some kind of message-based interaction between the I/O operation and the main thread; normally it is not too difficult, though if the interaction is extensive it can become rather tedious).
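
One possible shape of such a macro-operation is sketched below, under our own assumptions (a plain worker thread plus an atomic cancel flag; how the result is posted back is left to the completion callback). Note that the only interaction with the main thread is the cancel() call and the single completion notification:

#include <atomic>
#include <functional>
#include <memory>
#include <thread>

// Sketch: a long file search wrapped as one 'custom non-blocking I/O' macro-operation.
class FileSearchOperation {
  std::shared_ptr<std::atomic<bool>> cancelled =
      std::make_shared<std::atomic<bool>>(false);
public:
  // do_search performs all the micro-I/Os itself, checking the flag between them;
  // post_result_to_main_thread is expected to go through the main thread's queue.
  void start(std::function<bool(const std::atomic<bool>&)> do_search,
             std::function<void(bool)> post_result_to_main_thread) {
    auto flag = cancelled;
    std::thread([flag, do_search, post_result_to_main_thread]() {
      bool found = do_search(*flag);        // many micro-I/Os, no other shared state
      if(!flag->load())
        post_result_to_main_thread(found);  // a single 'macro-I/O completed' event
    }).detach();
  }
  void cancel() { cancelled->store(true); } // callable from the main thread
};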

Still, despite all the associated complexities, one of those approaches, namely event-driven approach, has an excellent record of success, at least in GUI programming (it will be quite difficult to find a GUI framework which isn’t event-driven at least to a certain extent).

'No Bugs' Rabbit running circles around 'Multithreaded Gorrillazz'

If it walks like a duck, swims like a duck, and quacks like a duck…

If, after escaping the ‘Long I/O’ Gorrilla, Hare’s client-side program is still not working as fast as the user would like, then there are chances that it is indeed related to the lack of CPU power of a single core. Let’s come back to our earlier example with business-like dialogs, but now let’s assume that somewhere in one of the dialogs there is an MP4 playing (we won’t ask why it’s necessary; maybe because a manager has said it’s cute, or maybe marketing has found it increases sales; our Hare just needs to implement it). If Hare called a synchronous function play_mp4() at the point of creating the dialog, it would stop the program from going any further until the MP4 ends. To deal with the problem, he clearly needs some kind of asynchronous solution.

Let’s think about it a bit. What we need is a way to start rendering, wait for it to end, and be able to cancel it when necessary… Wait, but this is exactly what non-blocking I/O is all about! If so, what prevents our Hare from representing this MP4 playback as yet another kind of non-blocking I/O (and in fact, it is a non-blocking output, just using the screen instead of a file as an output device)? As soon as we can call start_playing_mp4_and_notify_us_when_you_re_done(), we can safely consider MP4 playback a custom non-blocking I/O operation, just like the custom file-search operation we’ve discussed above. There might be a multi-threaded wrapper needed to turn play_mp4() into a non-blocking API, but as it needs to be done only once, multi-threading still stays restricted to a very limited number of places. The very same approach will also cover lots of situations where heavy calculations are necessary within the client. How to optimize the calculations (or MP4 playback) to split themselves over multiple cores is another story, and if our Hare is writing yet another video codec, he still has more Gorrillazz to deal with (with chances remaining that one of them will get him).
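
To connect this back to the event-driven handling described earlier, the call site might look roughly as follows (the wrapper itself would be a one-off thread-based shim around play_mp4(), similar to the do-it-yourself non-blocking wrapper sketched above; the exact signature here is our own assumption):

#include <cstdio>
#include <functional>

// Assumed signature of the one-off wrapper which turns the blocking play_mp4()
// into 'start it and notify me later' (implemented once, e.g. via a helper thread
// which posts the completion into the main thread's queue):
void start_playing_mp4_and_notify_us_when_you_re_done(
    const char* mp4_name, std::function<void()> on_done);

// Call site inside the dialog code: playback is treated as just another
// custom non-blocking I/O operation, and the main thread is never blocked.
void on_dialog_created() {
  start_playing_mp4_and_notify_us_when_you_re_done("cute_marketing_clip.mp4", []() {
    std::printf("MP4 finished - handle it like any other event\n");
  });
  // ...continue setting up the rest of the dialog right away...
}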

Single-thread: back to the future

If our ‘No MT Bugs’ Hare has managed to write a client-side program which relies on its main GUI thread as a main loop, and treats everything else as non-blocking I/O, he can afford to know absolutely nothing about threads, mutexes and other scary stuff, making the program, from his point of view, essentially a single-threaded program (while there might be threads in the background, our Hare doesn’t actually need to know about them to do his job). Some may argue that in 2010 going single-threaded might sound ‘way too 1990-ish’ (or even ‘way too 1970-ish’). On the other hand, our single thread with non-blocking I/O is not exactly the single thread of the linear programs of K&R times. Instead, we can consider it the result of taking into account the strengths and weaknesses of both previous approaches (classical single-threaded and classical heavily multi-threaded) and taking a small step further, trying to address the issues specific to both of them. In some very wide sense, we can even consider the single-thread → multi-thread → single-thread-with-nonblocking-I/O transition to be similar to the Hegelian bud → blossom → fruit [Hegel1807]. In practice, architectures based on non-blocking I/O are usually more straightforward, can be understood more easily, and most importantly, are orders of magnitude easier to test and debug than their heavily multi-threaded counterparts.

'No Bugs' Rabbit once again escapes 'Multithreaded Gorrillazz'

Last-second attempt

Our ‘No MT Bugs’ Hare has already got far across the client side of the field, but if he hasn’t scored his touchdown yet, he now faces the mightiest of the remaining Gorrillazz, and unfortunately he has almost no room to maneuver. Still, there is one more chance for him to escape the horrible fate of heavily multi-threaded programming. It is good old algorithm optimization. While a speed-up of a few percent might not be enough to keep you single-threaded, certain major kinds of optimization might make all the multi-threading (and multi-coring) unnecessary (unless, of course, you’re afraid that a program without multi-core support won’t look ‘cool’ anymore, regardless of its speed). If our Hare’s bottleneck is a bubble sort on a 10M-element array, or if he’s looking for primes by checking every number N by dividing it by every number in the 3..sqrt(N) range [Blair-Chappell10], there are significant chances that he doesn’t really need any multi-coring, but just needs a better algorithm. Of course, your code obviously doesn’t have any dreadfully inefficient stuff, but maybe it’s still worth another look just to be 100% sure? What about that repeated linear scan of a million-element list? And when was the last time you ran a profiler over your program?

Being gotcha-ed

Unfortunately, if our Hare hasn’t scored his touchdown yet, he’s not too likely to score it anymore. He’s been gotcha-ed by one of the Multithreaded Gorrillazz, and multi-threading seems inevitable for him. If such a situation arises, some developers may consider themselves lucky that they will need to write multi-threaded programs, while others will hate the very thought of it; it is just a matter of personal preference. What is clear, though, is that (even if everything is done properly) it will be quite a piece of work, with more than a fair share of bugs to deal with.

R.I.P.,  'No Bugs' Rabbit

Tools like OpenMP or TBB won’t provide too much help in this regard: while they indeed make thread and mutex creation much easier and hide the details of inter-thread communication, it is not thread creation but thread synchronization which causes most of the problems with multi-threaded code; and while OpenMP provides certain tools to help detect race conditions a bit earlier, the resulting code will still remain very rigid and fragile, and will still be extremely difficult to test and debug, especially in production environments.

Quarter 1&2 summary

'No Bugs' Rabbit vs 'Multithreaded Gorrillazz': Quarter 1&2 summary

While we have seen that our Hare didn’t score a touchdown every time, he still did pretty well. As we can see, he has scored 4 times, and has been gotcha-ed only once. The first half of the game has ended with a score of ‘No Multi-Threaded Bugs’ Hare: 4, ‘Multithreaded Gorrillazz’: 1. Stay tuned for the remaining two quarters of this exciting match.


References

[Blair-Chappell10] Stephen Blair-Chappell, “How to become a parallel programming expert in 9 minutes”, ACCU conference, 2010
[Hegel1807] Hegel, 1807, translation by Terry Pinkard, “Phenomenology of Spirit”, 2008
[Maennel02] Olaf Maennel, Anja Feldmann, “Realistic BGP traffic for test labs”, ACM SIGCOMM Computer Communication Review, 2002
[Martin10] Robert Martin, “The Language Stew”, ACCU Conference, 2010
[Merritt07] Rick Merritt, “M'soft: Parallel programming model 10 years off”, 2007
[Suess07] Michael Suess, “Is the Multi-Core Revolution a Hype?”, 2007

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.



Single-Threading: Back to the Future? (Part 2)


As we have seen in the previous article [NoBugs10], which described the 1st half of the historical match between our ‘No Multithreaded Bugs’ Hare and the Multithreaded Gorrillazz, in most cases our Hare has managed to reach the touchdown area without the need for heavily multi-threaded development. It means that on the client side the number of cases which really call for heavily multithreaded code (known to have numerous problems) is rather limited, and so in most cases heavy multi-threading can be avoided. Now it is time to take a look at server-side programs to see if our hero can avoid heavy multithreading there. As ‘server-side’ we will consider programs aimed at supporting many users simultaneously over the network. Web applications are one prominent example, but we will not limit our analysis only to them.

After the first half of the match:
'No MT Bugs' Hare: 4 'Multithreaded Gorrillazz': 1

One common feature of server-side applications is that they almost always depend on some server-side storage, normally a database; therefore, we will assume that some database (without specifying whether it is relational or not) is used by the application. Even if the application uses plain files for storage, we can still consider that a kind of (very primitive) database for the purposes of our analysis. There are a few exceptions to this database dependency, but as we will show below in the ‘No Database? No Problem’ section, the most practically important of them can be handled using the same techniques as described here and in part 1.

It should be noted that, as before, this is mostly a summary of practical experiences and cannot be used as strict instructions for ‘how to avoid multithreading’. Still, we hope it will be useful as a bunch of ‘rules of thumb’. We will also try to provide some very rough numbers to back up these ideas, but please take them with a good pinch of salt.

Quarter 3&4 line-up

As in the first half of the match, we have our ‘No Multithreaded Bugs’ Hare standing on the left side of the field, with a touchdown being his only chance to win. He faces a team of extremely strong ‘Multithreaded Gorrillazz’, and any single Gorrilla is strong enough to stop him forever. Fortunately they’re rather slow, which leaves our hero a chance. As in the first half, we will make a distinction between heavily multithreaded code all over the place (which results in perpetual debugging and a maintenance nightmare) and isolated multithreaded pieces (which are not exactly a picnic either, but can be dealt with with a finite amount of effort; we will consider them acceptable if there are no better options).

To discuss server development, the very first thing we need to see is whether our Hare is writing a program based on standard interfaces and frameworks, or needs to develop his own framework. If a framework is already there (for example, he’s developing a web application), it simplifies his task significantly. As the whole idea of a server-side application is to serve independent requests from many users, most existing frameworks (e.g. CGI/FastCGI, PHP, .NET, Java Servlets) do not require any interaction between requests. Sometimes avoiding interaction between threads within the framework requires some discipline (for example, static data in Java servlets can cause inadvertent interactions leading to problems [CWE-567]), but overall it is usually not rocket science to avoid it (unless it is dictated by the application logic itself, which is discussed below).

Now, let us consider the scenario where standard interfaces are not good enough; while it is not so common, there are still several good reasons not to use standard web frameworks in some specific cases. Such reasons may include, for example, the inherently request-response nature of the HTTP protocol, which doesn’t necessarily fit all application usage scenarios. The decision to write your own framework is always a tough one and obviously involves lots of work, and often such frameworks need to be multithreaded for performance reasons. But even when it is really necessary, the framework can still be written such that all the multithreading stuff is kept within the framework itself and is never exposed to the application. It means that even if our hero has quite an unusual case where existing frameworks don’t work, he can still confine multithreading to a relatively small and (probably even more importantly) rarely changed area of the code.

If it ain’t broken, don’t fix it

Now, one way or another, our ‘No Multithreaded Bugs’ Hare has a framework which handles multiprocessing and multithreading itself, without requiring his application code to be multithreaded. It doesn’t mean he will be able to avoid multithreading in the end; it merely means that he hasn’t been grabbed by any of the Gorrillazz yet.

The next question for our ‘No MT Bugs’ Hare is the very same ‘Houston, have we really got a problem?’ question that he needed to answer for the client side. The main reason for multithreading is performance, so if there are no performance problems there is no real need to do anything about it (and if multi-threading exists for any other reason, our ‘No Bugs’ Hare should think about it twice, especially if threads were added only because it’s ‘cool’ or because without them the program is ‘so 1990-ish’). If there are any observable performance problems, the very first thing our Hare should ask himself is ‘Are you sure that the database has all the indexes it needs?’ It is unbelievable how many cases can be drastically improved by simply adding one single index. In particular, developers and even DBAs often tend to forget that a 2-column index on columns A+B can be orders of magnitude faster than 2 separate indexes on column A and column B. The biggest improvement we’ve observed from adding a 2-column index was 1000x; not a bad incentive to take a closer look at indexes. So, we want to re-iterate: when using databases, indexing is of paramount importance, and is the very first thing to be considered if there are any performance problems. No amount of multithreading/multi-coring will save your program if the database lacks appropriate indexes. Another thing to look at at this stage is eliminating outright inefficient requests (there are usually at least a few in any application, and basic profiling using database-provided tools should be able to help).

If the database indexes are fine and there are still performance problems (for Internet applications this usually won’t happen until about 1M-10M requests/day1), then the next question arises. Usually most applications of this kind can be divided into two wide categories. The first category of applications is ‘data publishing’, with mostly read-only requests (represented by any kind of site which publishes information, including serving search requests). The second category makes many updates, but these updates are usually trivial and, after the optimizations mentioned above, should take rather little time; reporting, though, can still be ugly and include heavy and very heavy requests (this is a typical pattern of ‘Online Transaction Processing’, or OLTP, applications). At this point our Hare should understand which category his application belongs to.

Attack of the clone hare

For a ‘data publishing’ application where updates are rare but the number of read requests is huge, the next step is usually to see if some kind of caching will help. Caching does introduce interactions between requests, but with a proper implementation (similar, for example, to memcached [Facebook08]) it can easily be used in a way which has nothing to do with multithreading. For applications/sites which can cache all the information they need (for example, content management systems with updates a few times a day, or a ‘widget’ showing the weather to every user in their location), it usually means handling a virtually unlimited number of users without much further effort (in practice, the exact number will depend greatly on application specifics and the framework used, but our extremely rough estimate would be on the order of 10M-100M requests per day per typical 2-socket 8-core ‘workhorse’ server, with the option to add more such servers as necessary). If, on the other hand, there are essential requests which cannot be handled from the cache (for example, ad-hoc searches) and, even after caching everything we can, performance is still not good enough, then things become more complicated. At this stage, our ‘No Multithreaded Bugs’ Hare should consider creating a ‘master’ database which will handle all the updates, and multiple ‘replica’ databases which will handle the read-only requests.2 This approach will allow scalability for read-only requests (with an extremely rough estimate of the number of requests per single ‘workhorse’ server on the order of 1M-10M per day, though with proper optimization in some cases it can reach as high as 100M), so the only risk which remains is handling the update requests; usually this is not a problem for this kind of application, but if it is – our ‘No MT Bugs’ Hare can approach them the same way as described below for typical OLTP applications.
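
As a very rough illustration of the caching idea (our own sketch, not memcached’s API; in a multi-server deployment the in-process map would be replaced by a shared cache such as memcached, but the request-side logic stays the same), a read-through cache with a TTL might look like this:

#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>

// Sketch of an application-level read-through cache with a time-to-live:
// serve from the cache when possible, fall back to the (replica) database
// only on a miss or after expiry.
class PageCache {
  struct Entry {
    std::string value;
    std::chrono::steady_clock::time_point expires;
  };
  std::unordered_map<std::string, Entry> cache;
  std::chrono::seconds ttl;
public:
  explicit PageCache(std::chrono::seconds ttl_) : ttl(ttl_) {}

  std::string get(const std::string& key,
                  const std::function<std::string(const std::string&)>& fetch_from_db) {
    auto now = std::chrono::steady_clock::now();
    auto it = cache.find(key);
    if(it != cache.end() && it->second.expires > now)
      return it->second.value;              // cache hit - no database access at all
    std::string value = fetch_from_db(key); // cache miss - go to the replica DB
    cache[key] = Entry{ value, now + ttl };
    return value;
  }
};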

Attack of the Clone Bunnies

Heavy-weights

So, what should our ‘No MT Bugs’ Hare do if he faces an application which needs to handle lots of updates (more than a few million per day) and still experiences problems after all the necessary indexes are present and the outright inefficient requests have been eliminated? The next step is usually to optimize the database further, mostly at the physical level. It could include things like upgrading the server to a RAID controller with a battery-backed write cache (this alone can help a lot), moving DB logs to a completely separate set of HDDs (usually RAID-1), selecting an optimal RAID structure for tables (often a simple bunch of RAID-1 arrays works best, and RAID-5/RAID-6 are usually not a good idea for heavily updated tables), separating tables with different behavior into separate bufferpools and onto separate physical disks, and so on. Additionally, moving most (or all) reports to the ‘uncommitted read’ transaction isolation level could be considered; in some cases this simple optimization can work wonders. A related optimization can include separating a few frequently updated fields into a separate table, even if such a table has a 1:1 relation to the original one. Another application-related optimization which can occur at this stage is moving to prepared statements or stored procedures. It is worth noting that, despite common perception, on a DBMS where prepared statements are properly supported (the last time we checked, this still didn’t include MySQL) they tend to provide almost the same performance as stored procedures, while requiring less code rewriting and keeping more potential for switching the DBMS if necessary.

Half-gotchaed?

What will happen if our ‘No MT Bugs’ Hare has done all the above optimizations, but his system or program still doesn’t work efficiently enough (which we estimate shouldn’t normally happen until 10M update requests/day is reached)? It is no picnic, but it is still not as bad as heavy multithreading yet. At this stage our hero can try to separate the operational updatable database from the read-only reporting database, making the reporting database a replica of the ‘master’ operational database, running on a separate server. The effect achieved by this step heavily depends on the DBMS in use and the types of load, but usually the heavier the load, the bigger the effect observed (removing inherently heavy reporting requests from an operational database reduces cache/bufferpool poisoning and disk contention, and these effects can hurt performance a lot).

Half-gotchaed

If it doesn’t help, our ‘No Bugs’ Hare might need to take a closer look at inter-transaction interaction (including transaction isolation levels, SELECT FOR UPDATE statements and the order of obtained locks). We feel that if it goes as far as this, he is in quite big trouble. While inter-transaction communication is not exactly multithreading, it has many similar features. In particular, deadlocks or ‘dirty reads’ can easily occur, eliminating them can be really tricky, and debugging can become extremely unpleasant. If our ‘No MT Bugs’ Hare finds himself in such a situation, we will consider him ‘half-gotcha-ed’. One application-level option which might be useful at this point is to start postponing updates (storing them in a separate table, or in some kind of queue) for certain frequently updated statistical fields (like a ‘number of hits’ field) to avoid the locking, and to move such postponed updates into the main table later, time-permitting or in bulk, reducing locking.
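
A small sketch of such postponing follows (our own illustration; the printf stands in for whatever DB layer is actually used, and in practice the accumulated values might first go into a separate table or queue rather than straight into the main table):

#include <cstdio>
#include <unordered_map>

// Sketch: postponing 'number of hits'-style updates to reduce lock contention.
// Increments are accumulated in memory and flushed in bulk later, time-permitting.
class PostponedHitCounters {
  std::unordered_map<long long, long long> pending; // page_id -> hits since last flush
public:
  void add_hit(long long page_id) { ++pending[page_id]; } // no DB access, no locks held

  void flush() { // called periodically, e.g. every few seconds or when idle
    for(const auto& kv : pending) {
      // placeholder for the real DB call executed over the usual connection
      std::printf("UPDATE pages SET hits = hits + %lld WHERE page_id = %lld\n",
                  kv.second, kv.first);
    }
    pending.clear();
  }
};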

Single connection: back to the future?

It is worth noting that there is an option to avoid this kind of multithread-like problem altogether, which is rarely considered. It is sometimes possible to move all update statements into a single DB connection (essentially into a single thread); while such approaches are often ostracized for lack of scalability, practice shows that in some environments (especially those where data integrity is paramount with no room for mistakes, for example, in financial areas) it is a perfectly viable approach – the biggest load which we have observed for such a single-update-connection architecture was on the order of 30M update transactions per day for a single synchronous DB connection, and when that became insufficient, it was (though with a substantial effort) separated into several databases with a single connection for updates to each one, reaching 100M+ update transactions per day (and with the option to go further if necessary).
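
A minimal sketch of this single-update-connection idea is shown below (all names are ours and each ‘update’ is just a callable; the point is simply that every update goes through one queue and is executed sequentially over one connection, so update/update races cannot occur):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Sketch: all DB updates are funnelled through one queue and executed
// one by one over a single DB connection owned by a single thread.
class SingleConnectionDbWriter {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> updates; // each item uses the single connection
  bool stopping = false;
  std::thread worker; // last member, so everything else is ready before it starts
public:
  SingleConnectionDbWriter() : worker([this]() { run(); }) {}
  ~SingleConnectionDbWriter() {
    { std::lock_guard<std::mutex> lock(m); stopping = true; }
    cv.notify_one();
    worker.join();
  }
  void post_update(std::function<void()> update) { // called from request handlers
    { std::lock_guard<std::mutex> lock(m); updates.push(std::move(update)); }
    cv.notify_one();
  }
private:
  void run() { // the only thread which ever touches the update connection
    std::unique_lock<std::mutex> lock(m);
    for(;;) {
      cv.wait(lock, [this]() { return stopping || !updates.empty(); });
      if(updates.empty()) return; // stopping and nothing left to do
      std::function<void()> u = std::move(updates.front());
      updates.pop();
      lock.unlock();
      u();          // e.g. executes an INSERT/UPDATE over the single connection
      lock.lock();
    }
  }
};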

Divide et impera

If, after applying all the optimizations above, our ‘No MT Bugs’ Hare still experiences performance problems, his next chance to escape the fierce ‘Multithreaded Gorrillazz’ is to try to find out if the data he works with can be easily classified by certain criteria, and to split the single monolithic database into several partitioned ones. For example, if his application needs to handle transactions in a thousand stores worldwide, but most transactions are in-store and there are only a few interactions between the stores (similar to the task defined in the [TPC-C] benchmark), he has a chance of getting away with partitioning the database by store (one or several stores per database), achieving scalability this way. Methods of separation can vary from DBMS-supported partitioning (for example, [IBM06] and [Oracle07]) to application-level separation. Application-level separation can have many varieties (many of them extremely application-specific), and a detailed discussion of such separation could easily take a few books, so we will not try to go into more detail here.
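
A very small sketch of the application-level variant (the connection-info structure and the trivial modulo rule are our own simplifications; real routing would usually be driven by configuration):

#include <cstddef>
#include <string>
#include <vector>

// Sketch: application-level partitioning by store - each store lives entirely
// in one of several databases, and the router merely picks the right one.
struct DbConnectionInfo { std::string connection_string; };

class StorePartitionRouter {
  std::vector<DbConnectionInfo> shards; // one entry per partitioned database
public:
  explicit StorePartitionRouter(std::vector<DbConnectionInfo> shards_)
    : shards(std::move(shards_)) {}

  // In-store transactions touch exactly one partition; the rare cross-store
  // interactions need special handling and are not shown here.
  const DbConnectionInfo& partition_for_store(std::size_t store_id) const {
    return shards[store_id % shards.size()];
  }
};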

In-memory state: a bad case of dementia

If everything described above fails, and our Hare indeed has an application with 100M+ update transactions per day, he may need to resort to RAM to remember some parts of the system state, rather than keeping everything in the database. It is a fundamental change, and it won’t be easy. One important implication is that all information held only in memory will be lost if the system goes down or reboots; in some cases (like caches) it doesn’t matter, but if going beyond caches, the implications must be considered very carefully.

Where was I going to go?

Still, even with in-memory state, multithreading is not always necessary; it can be avoided either by the techniques described in the previous article [NoBugs10], or by separating the system into a series of logical objects, each having its own in-memory state and incoming queue, with all the logical object input limited to the processing of incoming messages, and with all interaction between objects restricted to sending messages to each other. One of us has seen such a system processing over 1 billion (yes, that is nine zeros) requests per day, still without any multithreading at the application level (all multithreading has been confined to a few thousand lines of a specialized framework, which is 100% isolated from the application-level business logic and is therefore changed extremely rarely). If our Hare is one of the few very lucky (or unlucky, depending on the point of view) ones who really need to process more than 1e9 requests per day – there is a chance he will be gotcha-ed, but honestly – how many of us are really working on such systems? To set some kind of benchmark: NASDAQ is currently able to process 2e8 transactions per day [NASDAQ], so we can reasonably expect that there are relatively few systems which need more than 1e9. Still, it can happen, and we have no choice other than to award a point to the Gorrillazz in this case.

Gotcha!

No database? No problem

As promised at the very beginning of the article, we will now come back to discussing examples of server-side applications which don’t use databases (or use them in a very limited way). One such example is music/video streaming servers. While such applications don’t need to rely on a database, they can be scaled easily enough, similar to any other ‘data publishing’ application (see ‘Attack of the clone hare’ above); in extreme cases where top performance is necessary, non-blocking I/O techniques can be used to improve performance further.

Another prominent example of server-side applications which don’t really need to depend on a database is game servers. While it is very difficult to generalize over such a vast field as games in general, massive server-side games usually seem to fit under ‘In-memory state: a bad case of dementia’ described above, and our ‘No MT Bugs’ Hare can try to handle them using the very same techniques as described there and in the previous article.

Quarter 3&4: ‘No MT Bugs’ Hare: 4¾ ‘Multithreaded Gorrillazz': 1¼

Now that the match between ‘No MT Bugs’ Hare and the ‘Multithreaded Gorrillazz’ has come to an end, we’re able to find out the final score of this magnificent game. As we’ve seen, similar to the client side, on the server side there aren’t too many cases for multithreading either.

'No MT Bugs' Hare: 8¾ 'Multithreaded Gorrillazz': 2¼

 

Our ‘No MT Bugs’ Hare has managed to make 9 home runs on the server side of the field, while being gotchaed only once, and being half-gotchaed once. Taking into account the relative weights of these runs, we conclude that quarters 3 & 4 have been completed with a score of ‘No MT Bugs’ Hare: 4¾ ‘Multithreaded Gorrillazz': 1¼, making the overall score for the whole game ‘No MT Bugs’ Hare: 8¾ ‘Multithreaded Gorrillazz': 2¼.

Final Score:
'No MT Bugs' Hare: 8¾ 'Multithreaded Gorrillazz': 2¼

1 All numbers in the article are extremely rough estimates; your mileage may vary. We are also assuming a ‘typical’ Internet application with a ‘typical’ distribution of requests over the day, with the difference between the minimum hour and the peak hour not exceeding 2-5x. Still, while the numbers are extremely rough, we feel that even such rough numbers can be of some value in the initial stages of analysis.
2 Unfortunately, way too many RDBMSs still experience problems under heavy load when replication is implemented using RDBMS-provided means. Heavy testing with loads and data volumes comparable to production is advised when trying to implement replication. As a workaround, custom application-level replication can be considered, but it is rather complicated and is beyond the scope of this article.

 


References

[CWE-567] “CWE-567: Unsynchronized Access to Shared Data, Common Weakness Enumeration”
[Facebook08] Paul Saab, “Scaling memcached at Facebook”, 2008
[IBM06] Rav Ahuja, “Introducing DB2 9, Part 2: Table partitioning in DB2 9”, 2006
[NoBugs10] Sergey Ignatchenko, “Single-Threading: Back to the Future?”, Overload #97, June 2010
[NASDAQ] NASDAQ, “Technology Fast Facts”
[Oracle07] Hermann Baer, “Partitioning in Oracle Database 11g”, 2007
[TPC-C] Francois Raab, Walt Kohler, Amitabh Shah, “Overview of the TPC Benchmark C: The Order-Entry Benchmark”

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Part IIIa: Server-Side (Store-Process-and-Forward Architecture) of 64 Network DO’s and DON’Ts for Game Engines


It’s Monday, so it’s time for yet another part of our Network DO’s and DON’Ts for Game Engines article.

Incoming queue: cake. Processing: restaurant table. Outgoing queue: WC.

Previous parts:

Part I. Client Side
Part IIa. Protocols and APIs
Part IIb. Protocols and APIs (continued)

In Part III of the article, we’ll discuss issues specific to the server side, as well as certain DO’s and DON’Ts related to system testing. Due to its size, Part III has been split, and in this Part IIIa we’ll concentrate on the issues related to the Store-Process-and-Forward architecture.

Upcoming parts include:

Part IIIb. Server-Side (deployment, optimizations, and testing)
Part IV. Great TCP-vs-UDP Debate
Part V. UDP
Part VI. TCP
Part VIIa. Security (TLS/SSL)
Part VIIb. Security (concluded)

18. DO consider Event-Driven programming model for Server Side too

As discussed above (see item #1 in Part I), the event-driven programming model is a must for the client side; in addition, it also comes in handy on the server side. Having multi-threaded logic is still a nightmare for the server side [NoBugs2010], and keeping the logic single-threaded simplifies development a lot. Whether to think that multi-threaded game logic is normal and single-threaded logic is a big improvement, or to think that single-threaded game logic is normal and multi-threaded logic is a nightmare – is up to you. What is clear is that if you can keep your game logic single-threaded – you’ll be very happy compared to the multi-threaded alternative.

However, unlike the client side, where performance and scalability rarely pose problems, on the server side, where you need to serve hundreds of thousands of players, they become really (or, if your project is successful, “really really”) important. I know two ways of handling performance/scalability for games while keeping the logic single-threaded.

18a. Off-loading

The first way to allow using another core is off-loading heavy processing from the main thread (very similar to the approach described in the ‘If it walks like a duck, swims like a duck, and quacks like a duck…’ section of [NoBugs2010]). Applied to our event-driven model, it would mean our “main thread” creating a request to be offloaded to another thread, starting another thread to process it (or posting the request to an existing thread via some kind of queue), and continuing to do other stuff. When the other thread is done with its computations – it will post a special message into the incoming queue of our “main thread”. For this to work properly, it is important to make sure that at no point is non-constant data shared between threads (i.e. the offloaded function should have all its input and output parameters passed by value, with no references/pointers to main-thread data).

One example of the off-loading implementation might look as follows:

class Offloader { //generic one, provided by the network engine
  public:
  virtual ~Offloader() {}
  virtual void process(const IncomingMessage&, OutgoingMessage&) = 0;
    // IncomingMessage contains input data for calculation
    // OutgoingMessage is a reply which will be fed back to "main thread"
    //   when calculation is done
};

This off-loading approach to multi-threading is very well controlled, with synchronization being barely noticeable (and those next-to-impossible-to-find inter-thread races being eliminated), and doesn’t cause much trouble in practice. However, pure off-loading scenarios (those which don’t require data sharing between threads) are rare for games, and deviating from pure off-loading can easily bring back all the multithreading nightmares. If you can offload some computation while being sure that the underlying data doesn’t change (or the data is small, so you can make a snapshot to feed to the offloaded thread) – by all means do it, but it won’t happen too often.
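
As a purely hypothetical usage sketch (the offload_to() call and the message classes are assumptions about the surrounding engine, not part of any specific API), the pattern seen from the “main thread” might look like this:

// Hypothetical game-specific offloader: everything it needs arrives inside
// IncomingMessage (copied, not referenced), so nothing from the main thread is shared.
class PathfindingOffloader : public Offloader {
public:
  void process(const IncomingMessage& request, OutgoingMessage& reply) override {
    // ...heavy pathfinding using only the data carried inside 'request'...
    // ...results are written into 'reply', which the engine posts back into
    //    the "main thread"'s incoming queue when the calculation is done...
  }
};

// Somewhere in the "main thread" logic (offload_to() is an assumed engine call):
//   offload_to(pathfinding_offloader, make_pathfinding_request(from, to));
//   ...keep processing other events; later the reply arrives as a normal
//   incoming message and is handled like any other event.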

Another (and usually more applicable to game engines) way to go beyond a single core on the server side is to use the Store-Process-and-Forward architecture, which is described in our next item.

19. DO consider Store-Process-and-Forward architecture

So, you do like single-threaded simple development, but at the same time you really need scalability, and offloading doesn’t cut it for you. What to do?

There exists a reasonably good solution out there, which works well enough in many (though not all) cases.

The basic idea is to separate your system into many loosely coupled nodes which communicate by sending and receiving messages; as soon as this is done, all each of the nodes can and should do is merely process events (with processing for each node staying within a single thread). For example, an interface implemented by a node may look as simple as:

class Node {//generic one, provided by the network engine
  public:
  virtual ~Node() {}
  virtual void process_message(const IncomingMessage&) = 0;
};

class MyNode : public Node { //game-specific one
  //here goes state of MyNode
  public:
  void process_message(const IncomingMessage& msg) override {
    //...
  }
};

This model is simple and efficient, and enforces a very well-defined message-based interface. How to divide the system into nodes depends on the game, but in practice nodes can be different parts of the same game world, casino tables, a stock exchange floor, or whatever else you can think of (as long as it is loosely coupled and doesn’t need to be absolutely synchronous with the rest of the system). We’ve named this architecture Store-Process-and-Forward (see below why). Each of the nodes in a Store-Process-and-Forward system is capable of doing only one single thing – processing incoming messages – and the processing is performed as follows:

incoming-message -> incoming-queue -> processing (node logic) -> outgoing-queue(s) -> outgoing-message(s)

For those into patterns, each of our nodes can be seen as implementing “Reactor” pattern [Schmidt2000].

In a sense, the Store-Process-and-Forward architecture is quite similar to the store-and-forward processing model as used in backbone Internet routers (and which is known to be extremely efficient). With the classical store-and-forward model, each node receives an incoming packet and puts it into an incoming queue; on the other hand, as long as there is something in the incoming queue, the node takes it out, decides where to route it, and pushes the packet into the outgoing queue(s). What we’re adding to the classical store-and-forward model is that, between taking a message out of the incoming queue and sending it out, we’re processing it. Hence the Store-Process-and-Forward name.

For our purposes, receiving-an-incoming-message-and-putting-it-into-the-incoming-queue can be implemented as non-blocking I/O (in the extreme case the incoming queue could be, for example, an incoming TCP buffer, and the I/O can be blocking) or as a separate network thread, and message processing can be implemented as the node’s “main thread”. In some rare and specific cases, processing within some of the nodes may be implemented as multi-threaded, but this should be treated as an exception (and such exceptions have been observed to cause those thread-sync problems, so unless really necessary, multi-threading in node processing should be avoided).
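
A rough sketch of what such a per-node “main thread” pump might look like is shown below (IncomingQueue is an assumed thread-safe queue provided by the engine; a real implementation would add shutdown handling, timeouts, batching, etc.):

// Sketch of the per-node event pump, building on the Node interface above:
// the node's whole life is 'take a message out of the incoming queue and
// process it', all within a single thread - no locks inside the node logic.
void run_node(Node& node, IncomingQueue& incoming) {
  for(;;) {
    IncomingMessage msg = incoming.pop_blocking(); // 'Store' (filled by the network thread)
    node.process_message(msg);                     // 'Process'; the node may push replies
                                                   // into outgoing queues - 'Forward'
  }
}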

19a. Applicability: DO Split the Hare Hair

In general, if you can split your system into reasonably small nodes – you SHOULD use Store-Process-and-Forward architecture (it has numerous advantages which are discussed in the item #19d below). So, the question is: can you split your system or not? Let’s consider it in a bit more detail.

First of all, let’s see what can be represented by a node. We have quite a bit of good news in this regard: a node can represent pretty much anything out there – from a game-logic node to a database-handling node, and from a payment-processing node to an SMS-sending node. In general, peripheral nodes (such as the two latter ones) rarely cause any problems; it is game-logic nodes and database-handling nodes (if you need the latter) which may be a bit tricky to separate.

When it comes to splitting your game into game-logic nodes, think: what is the smallest isolated unit in your game (and/or game engine)? Very often you will find out that you do have isolated units with lots of action within, and very few communications in between. For a world simulator such a unit can be a cell or a scene (usually the latter); for a casino it can be a table/lobby (note: you don’t need all the game nodes to be the same – you can and should make them specific, so it is perfectly fine to have a separate table node and a lobby node, and instantiate them as necessary). You will be surprised how many systems you will be able to split into small and relatively isolated nodes if you think about it for 5 minutes.

A word of caution: the best split is usually a “natural” one (like those described above), with a direct mapping between existing game entities and nodes. While artificial splits are possible, they tend to cause too many interactions between the nodes, which kills the whole idea. Once again – to be efficient, a node needs to be an entity which contains a lot of logic within, with relatively little communication with the other nodes.

19b. DO Think about Splitting Database-Handling Nodes

In most over-the-Internet games you will need some kind of database (or some other persistent storage) – at least to keep the user database, logins, high scores, etc. etc. On the other hand, in many games you don’t really need to care about splitting database nodes, and one single node will do it all for you. Otherwise, splitting database-handling nodes may become tricky. I know of three approaches for such a split. The first is “quick-and-dirty but not exactly scalable”, the second one is “somewhat scalable”, and the third one is “really scalable”.

The “quick-and-dirty” approach actually amounts to delaying the DB node split for a while. You have one single DB-handling node (with a single underlying DB); this greatly simplifies both development and deployment (including DBA work). On the negative side, obviously, the “quick-and-dirty” approach is not really scalable. However, surprisingly, a “quick-and-dirty” system can work for quite a long time (provided that your DB programmers are good); I’ve seen a system which managed to process 30’000’000 non-trivial database transactions a day on a single DB node, running over a single DB connection. And if the game is successful enough to exceed these numbers, this model can be changed to the “really scalable” one without rewriting the whole thing (though with significant effort on the DB-node side).

Very roughly, “quick-and-dirty” model can be described as follows:

lots of game nodes -> single DB-handling node -> single DB connection -> single DB

The “somewhat scalable” approach is to have multiple DB-handling nodes over a single underlying DB, with the objects stored in the DB shared between nodes (if objects are not shared, it is really a version of the “really scalable” approach, see below). In general, I don’t really like this “somewhat scalable” thing; one big problem with this approach is object sharing, which creates high risks of running into race conditions (in the worst case leading to game items – or even money – lost in transit), very-difficult-to-track locks (affecting performance in unpredictable ways), and deadlocks (ouch!), making the whole thing tantamount to inter-thread races (which, as discussed above and in [NoBugs2010], are a Really Bad Thing). For some systems, the “somewhat scalable” approach might be reasonable, but in general I’d suggest avoiding it, unless you’re 200% sure that (a) it is what you need, and even more importantly, (b) you’re not going to ask me to fix the resulting problems :-) .

Very roughly, “somewhat scalable” model can be described as follows:

lots of game nodes -> many DB-handling nodes -> multiple DB connections -> single DB with data objects shared between DB-handling nodes

It is worth noting that in books and lectures, this “somewhat scalable” approach will be by far the most popular and the most recommended one (and often they will also tell you that it is the only way to achieve real scalability). However, in practice (a) it appears not to be linearly scalable (due to intra-database interactions and locks on shared objects), (b) it causes pretty bad problems due to inherent synchronization issues, and (c) among real-world reasonably-large (like “multi-million transactions per day”) systems, at the very least, it is not universal; I’ve seen quite a few architects who admitted using “quick-and-dirty” or “really scalable” implementations, always feeling uneasy about it because they went against the current teachings. IMNSHO, it is one of those cases where the books are wrong, and practice is right.

The “really scalable” approach usually means that for M gaming nodes you can have N+1 database-handling nodes (where M is usually much larger than N). Here, each of the N database-handling nodes will support a bunch of gaming nodes, and 1 database-handling node will support the central user database. Each of the database-handling nodes has its own database (or its own set of tables in the database) which nobody except this database-handling node is allowed to access. To deal with inter-database-node interactions, you’ll certainly need to provide “guaranteed inter-DB transactions” (implementing them is beyond the scope of this article, but it is doable with or without explicit DB support for distributed transactions). This approach is perfectly free from races, and is scalable beyond the wildest dreams (all the nodes are completely independent, which ensures linear scaling without any risk of unexpected slowdowns; N can be increased easily, and the single central user database_handling_node+associated_database pair can be split too quite easily if it becomes necessary). It has been seen to work extremely well even for the largest systems. However, it is a bit cumbersome, so at first you might want to settle for 1+1 database-handling nodes (or even for a single one, making it a “quick-and-dirty” system at first).

Very roughly, “really scalable” model can be described as follows:

lots of game nodes -> N+1 DB-handling nodes -> single DB connection per DB-handling node -> DB(s) with each data object exclusively owned by some DB-handling node
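
To illustrate the “each data object is exclusively owned by one DB-handling node” part, here is a minimal C++ sketch of how a game node might route DB requests; all names (DbRequest, DbRouter, enqueue_to_node()) and the modulo-based mapping are my own illustration, not something prescribed by the architecture – real systems often use ranges or a lookup table instead.

#include <cstdint>
#include <string>

struct DbRequest { uint64_t object_id; std::string body; };

class DbRouter {
  size_t n_nodes; //N “regular” DB-handling nodes; the central user DB node is addressed separately
public:
  explicit DbRouter(size_t n) : n_nodes(n) {}

  size_t node_of(uint64_t object_id) const {
    //deterministic mapping: every game node computes the same owner,
    //  so all requests for a given object always go to the same DB-handling node
    return object_id % n_nodes;
  }

  void post(const DbRequest& req) {
    //hand the request over to the owning node's incoming queue/connection;
    //  no other node ever touches this object, so no cross-node locking can arise
    enqueue_to_node(node_of(req.object_id), req);
  }

private:
  void enqueue_to_node(size_t node, const DbRequest& req); //actual transport, defined elsewhere
};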

As noted above, one could stay within the “really scalable” model while keeping data from several database-handling nodes within one database; having one database doesn’t make much difference as long as there is no possible locking between the objects stored there. An ideal way to achieve separation is to have each node exclusively “own” a set of its own tables; if each node accesses only its own tables, using one database for multiple nodes shouldn’t cause any problems (unless you manage to overload the DB log, which is usually difficult unless you’re storing multimedia in your database), and it simplifies deployment and the work of DBAs quite a lot. On the other hand, while implementing DB-handling node separation at row level (with different DB-handling nodes sharing the same DB tables, but exclusively “owning” rows with a certain key field) is possible, you should keep in mind that even while objects are separated, nodes will still occasionally compete for index locks (because index scans tend to lock not only the current row, but also the previous/next ones); whether this causes problems in practice depends heavily on a lot of factors, including the nature of DB interaction, DB load, and the specific database (and index type) in use.

A variation of the “really scalable” approach (with M~=N) is to say that there are many game nodes which are essentially DB-handling nodes, and which have both game logic and their own DB (for deployment purposes it is better to provide each DB-handling node with its own set of tables within the same DB); you’ll still need to have one central database node with the user DB, and you’ll still need to have guaranteed inter-DB transactions, but the overall logic might be significantly simpler than for the generic “really scalable” approach. This variation is scalable, simple, and race-free, so feel free to try it if it fits your game :-) .

Very roughly, this variation of “really scalable” model can be described as follows:

lots of game nodes each implementing DB-handling -> single DB connection per DB-handling node -> DB(s) with each data object exclusively owned by some DB-handling node

Overall, when you start your development, as a very rough rule of thumb, I’d suggest considering one of two options:

  • the “Quick and Dirty” approach, planning to migrate to the “Really Scalable” one if the project is vastly successful
  • the Each-Node-is-a-DB-Handling-Node (M~=N) variation of the “Really Scalable” model.

Both these approaches share one very important property. In both cases, each DB-handling node has its own data and doesn’t share this data with the others; this allows DB processing to be as deterministic, and as synchronization-, race- and lock-free, as node processing under Store-Process-and-Forward. And believe me, races (whether inter-thread, or inter-DB-connection, the latter regardless of transaction isolation levels) are Really Tough To Deal With, especially when the logic changes all the time (and if your game is successful, game logic will change, there is no doubt about it).

In general, the database-handling issue is way too large to fit into one single item of an article; some day, I might write specifically on this subject; for now the most important thing is to realize that databases can be handled easily and efficiently within Store-Process-and-Forward architecture (I’ve seen such systems myself).

19c. DO Think about “I Have no Idea” State

One thing which you need to keep in mind when writing a server-side distributed system is that when you send a message to another node asking it to do something, the request can end up in one of three different states: “success”, “failure”, and “I have no idea whether it has even been delivered”. It is the last state which causes most of the problems: it arises, for example, when part of the server-side nodes fails while another part stays alive, or when some of your inter-node links fail temporarily; the latter is especially important for inter-datacenter deployments, but it does happen within a single datacenter, and has been observed even on one single server.

This “I Have no Idea” state is not specific to our Store-Process-and-Forward architecture, but appears inevitably as soon as the server side becomes distributed (it also appears for communications between client and server, but this is beyond our scope now).

For inter-datacenter interactions this “I Have no Idea” state is clearly one of the worst problems; in particular, I’ve observed it to be (by far) the worst problem when organizing communications with payment processors; from my experience, while all payment processors do address this problem one way or another, only around 1/3rd of them are doing it in a sensible way; others tend to have extremely inconvenient and/or error-prone requirements to recover from it (like “to recover, you need to request all your transactions around the time of failure to see if it went through; oh, and you can’t request more than 2 hours of your transactions at once to avoid overloading our server”).

In practice, these “I have no idea whether it has been delivered” scenarios might or might not be a problem for you, but if you can eliminate this problem in general – you’d better do it.

19c1. DO Consider implementing Explicit Support for Idempotence

One of the approaches to deal with the “I have no Idea” state is to make all the messages (or at least all the messages which may cause some kind of trouble) in the system idempotent; idempotence means that a message can be applied many times without any problems (any number of duplicate messages causes the same result as one single message).

In many cases, it is easy to support idempotence on a case-by-case basis, but the issue is so important that it is better to think about generic support for idempotence. One way to support idempotence for certain classes of messages is related to two observations:

  • all messages with read-only processing are naturally idempotent; while re-reading the data may return a different value, under certain conditions (like there being no other way for the nodes to interact) this is indistinguishable on the requestor side from the message simply being delayed
  • all messages for which processing is restricted to a mere update of a certain node state to be equal to the received value (in an “x=x_from_message” manner) are naturally idempotent too (provided that certain conditions are met, such as x having only one source of updates).
    • it is important to note that messages with even a little bit more complicated processing are not necessarily idempotent. For example, a message processed as “x+=dx_from_message” is not naturally idempotent. The key here is not simplicity as such, but following exactly the processing patterns described above
    • ordering might be an issue, and implications of mis-ordering need to be taken into account
    • this approach works well for non-guaranteed delivery. In particular, Unity 3D’s state synchronization exhibits this type of idempotence.

In addition, if timing is not that important, the engine can provide support for making all the messages idempotent; one such implementation (see the sketch after this list) would include:

  • have all the nodes provide their own ID for each and every message
  • receiving nodes storing the received IDs (for example, one option is to store the maximum processed ID if IDs are known to be monotonic)
  • receiving nodes checking whether the ID has already been received, and handling the request differently depending on whether it is the original one or a duplicate (but providing exactly the same reply in either case).
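
For illustration, here is a minimal receiver-side sketch of this kind of engine-level deduplication; it is only my own sketch of the idea, with all the names (IdempotentReceiver, Reply, process()) being placeholders, and with pruning of old entries and persistence (which a real engine would need) left out.

#include <cstdint>
#include <map>
#include <string>
#include <utility>

struct Reply { std::string payload; }; //placeholder reply type

class IdempotentReceiver {
  //replies to already-processed messages, keyed by (sender, message ID);
  //  with monotonic IDs a real implementation would keep only the maximum
  //  processed ID per sender plus a bounded window of recent replies
  std::map<std::pair<uint64_t,uint64_t>, Reply> processed;

public:
  Reply on_message(uint64_t sender, uint64_t msg_id, const std::string& body) {
    auto key = std::make_pair(sender, msg_id);
    auto found = processed.find(key);
    if(found != processed.end())
      return found->second; //duplicate: repeat exactly the same reply, no re-processing
    Reply r = process(body); //the only place where node state is actually modified;
                             //  a real engine would persist the reply atomically with the state change
    processed[key] = r;
    return r;
  }

private:
  Reply process(const std::string& body) { //game-specific processing goes here
    return Reply{ "ok:" + body };          //placeholder
  }
};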

Now that we’ve discussed how to implement idempotency, let’s think about why idempotency is so important (ok, it is a bit late to raise this question, but better late than never). The answer is: idempotency is important because if you know that the message is idempotent, and you’ve ended up in an “I Have no Idea” state, you can always repeat sending your message, while being sure that it will work without side effects regardless of whether the message was previously received or not. This resending may be handled by the application (and there are cases when it is a good idea, especially in real-time scenarios), or can be implemented at game engine level.

19c2. DO Consider implementing Explicit Support for Once-and-Only-Once Guaranteed Delivery

A close cousin of Idempotence is Once-and-Only-Once Guaranteed Delivery. In this case, the game engine guarantees that it doesn’t matter whether there is connectivity between the nodes at the moment; in any case, the message will be retransmitted as soon as connectivity is restored, and will be processed exactly once. One of the ways to implement Once-and-Only-Once Guaranteed Delivery is to combine idempotence with automated message retransmit.
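
A minimal sender-side sketch of the “automated retransmit” half might look as follows; again, this is my own illustration (ReliableSender, send_to_wire(), and the 5-second timeout are all placeholders), and it assumes that the receiving side deduplicates by message ID as in item 19c1 and sends an acknowledgement after processing.

#include <chrono>
#include <cstdint>
#include <map>
#include <string>

class ReliableSender {
  struct Pending { std::string msg; std::chrono::steady_clock::time_point last_sent; };
  std::map<uint64_t, Pending> unacked; //everything sent but not yet acknowledged
  uint64_t next_id = 1;

public:
  uint64_t send(const std::string& msg) {
    uint64_t id = next_id++;
    unacked[id] = { msg, std::chrono::steady_clock::now() };
    send_to_wire(id, msg); //may or may not arrive – we don't know yet
    return id;
  }

  void on_ack(uint64_t id) { unacked.erase(id); } //receiver confirmed processing

  void on_timer() { //called periodically, e.g. once a second
    auto now = std::chrono::steady_clock::now();
    for(auto& kv : unacked)
      if(now - kv.second.last_sent > std::chrono::seconds(5)) {
        send_to_wire(kv.first, kv.second.msg); //safe to repeat: the receiver deduplicates by ID
        kv.second.last_sent = now;
      }
  }

private:
  void send_to_wire(uint64_t id, const std::string& msg) {
    (void)id; (void)msg; //hand the message over to the actual transport here
  }
};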

Overall, handling the “I have no idea” state in a general way is quite complicated; further implementation details (such as DB/persistent-storage handling on both sides if any, generating IDs which cannot possibly clash, etc.) are beyond the scope of the present article, but maybe some day I will elaborate on it; for now – just keep in mind that this “I have no idea if it has been delivered” problem does exist, and if you see how to handle it for your specific case – do it.

19d. Store-Process-and-Forward Advantages

As soon as you have your system split into nodes and implemented as a Store-Process-and-Forward one – you’re in real luck. The advantages of the Store-Process-and-Forward architecture are numerous (yes, I know I sound like a commercial):

  • Your game (and DB, if applicable) developers don’t need to care about threads or races, development is very straightforward, code is not cluttered with synchronization issues and is easy to read
  • Store-Process-and-Forward architecture provides very clean separation of concerns and encourages very clean and well-defined interfaces, hiding implementation details in a very strong way
  • As soon as you know node state and incoming message, processing is deterministic
    • which means that automated testing can be implemented easily
  • It has been observed to simplify debugging and even post-mortem analysis greatly (that is, if you have enough logging of incoming messages). With proper logging, in most cases a bug can be found from one single crash/malfunction.
  • As game developers have no idea about the way inter-node communication is implemented, the game engine framework can allow admins to deploy it as they wish (without any changes to game logic):
    • deploy everything on one server (useful for debugging), or
    • deploy on multiple servers in single location, or
    • deploy on multiple servers in different locations CDN-style, or
    • deploy nodes to the cloud
  • In addition, different mappings of nodes to processes and threads are possible (again, without any changes to game logic):
    • admin may deploy some of the nodes as processes, and
    • deploy some of the nodes as threads within common process (saving on process overheads at the cost of less inter-node protection), and
    • deploy some of the nodes as multiple-nodes-per-single-thread. This imposes some specific requirements (only those nodes without blocking calls and lengthy processing are eligible), but tends to reduce CPU load significantly if you have 10000+ nodes per server.

19e. DO use Store-Process-and-Forward: Parting Shot

In general, the Store-Process-and-Forward architecture certainly goes beyond one simple item, and deserves a separate article (or even a book). It provides quite a few advantages (listed above), and I’ve seen it working with really great success. A system with up to half a million simultaneous highly active users, generating half a billion messages a day, has operated on hardware-which-was-5x-to-10x-weaker-per-user-than-most-of-competitors, has had response times better than any competitor, all of this with unplanned downtimes of the order of 1-2 hours per year for 24/7 operation (which translates into 99.98% uptime, and that’s not for a single component, but for the system as a whole – a number which auditors considered “too good to be true” until the proof was provided).

Bottom line: you certainly SHOULD use Store-Process-and-Forward – that is, if you can logically split your system into loosely coupled nodes. And if you cannot think of the way how to split your game into nodes – think again.

To be continued…

Due to the size of Part III, it has been split. Stay tuned for part IIIb, Server-Side (deployment, optimizations, and testing).

EDIT: the series has been completed, with the following parts published:
Part IIIb. Server-Side (deployment, optimizations, and testing)
Part IV. Great TCP-vs-UDP Debate
Part V. UDP
Part VI. TCP
Part VIIa. Security (TLS/SSL)
Part VIIb. Security (concluded)


References


[NoBugs2010] 'No Bugs' Hare, “Single-Threading: Back to the Future?”
[Schmidt2000] Schmidt, Douglas et al., “Pattern-Oriented Software Architecture Volume 2: Patterns for Concurrent and Networked Objects.”, Wiley, 2000

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Part IIIb: Server-Side (deployment, optimizations, and testing) of 64 Network DO’s and DONT’s for Game Engines


I hate Mondays

— Garfield the Cat —

We continue our series of posts on network development for game engines.

Menu: Java.Thread with a Side of synchronize, C++ boost:: Mignon, 3NF à la Codd, Ruby Thermidor

Previous parts:
Part I. Client Side
Part IIa. Protocols and APIs
Part IIb. Protocols and APIs (continued)
Part IIIa. Server-Side (Store-process-and-Forward Architecture)

In today’s part IIIb, we’ll concentrate on those server-side issues which were not addressed in the part IIIa. Mostly our discussion will be related to deployment (and front-end servers), optimizations, and testing.

Upcoming parts include:
Part IV. Great TCP-vs-UDP Debate
Part V. UDP
Part VI. TCP
Part VIIa. Security (TLS/SSL)
Part VIIb. Security (concluded)

20. DO support easily replaceable “front-end” servers

If you’ve followed the advice in item #16 of Part IIb, you have implemented those publisher/subscriber interfaces. And if you’ve followed the advice in item #14, you do have your own addressing schema. Now it’s time to reap the benefits. These two protocol-level decisions open an opportunity to implement one wonderful thing – “front-end servers”. The idea is to support deploying the gaming system as follows:

a few game servers -> a few dozen front-end servers -> a million end-users

This approach (as has been observed in practice) does help to take the load off the game servers quite significantly. The thing is that handling TCP connections is quite a burden. And if you need to support any kind of encryption – it becomes much worse (especially if we’re speaking about TLS, as public-key crypto is one hell of a CPU-cycle-eater, especially at the moment when your million users reconnect after BGP convergence or something). And if we have this kind of burden that we can offload to cheap-and-easily-replaceable servers, why not do it?

In such deployments, game-servers and front-end servers are very different:

  • Game servers process game logic; front-end servers process only connections, with no game logic at all
  • Game servers do carry game state; front-end servers don’t carry any game state
  • Game servers are mission-critical (i.e. a failure causes lots of trouble); front-end servers are easily replaceable on the fly (at the cost of an automated reconnect of the affected players)
  • Game servers are expensive (such as a 4-socket $50K box and up); front-end servers are cheap (such as a 2-socket $10K box and down)

If you try to serve half a million players from 5 game servers – it will cost you a whole damn lot to buy such servers (especially if each game server needs to talk to all of the half a million users). Serving the same half a million from 5 or so game servers + 15 front-end servers in the architecture above is easy and relatively cheap – I’ve seen it myself.

Ideally, each of the players should be served by only one of the front-end servers at any given time, even if the player interacts with multiple game servers. This single-client-single-front-end-server approach tends to help in quite a few ways, usually reducing overall traffic (due to better compression and less overhead) and overall server load. It plays well with item #47 (single-TCP-connection) from Part VI, but doesn’t really require it, and can be implemented for both TCP- and UDP-based communications.

20a. CDN-like and Cloud Deployments

One additional benefit of ‘front-end server’ deployment architectures is that they allow CDN-like deployments, where your game servers sit in one central datacenter, and front-end servers are distributed over several datacenters all over the world; if you can afford a good connection (such as frame relay with a good SLA) between your datacenters, you can improve latencies for your distant users, and make the game significantly more fair; while this is rarely an option because of the associated costs (though they can be as low as $20K/month), for one special kind of game known as “stock exchanges”, it is an option to be taken into account.

As an alternative to (or concurrently with) CDN-like deployments, you can deploy such an architecture into the cloud, keeping in mind that SLAs for the front-end cloud servers can be significantly worse (and therefore significantly cheaper) than SLAs for the game servers.

21. DO start with Berkeley-socket-based version, improve later

For servers, there are tons of different technologies which work with sockets: WaitForMultipleObjects(), completion ports and APC on Windows, poll/epoll on Linux, etc. It might be very tempting to say “hey, everything except for <your-favorite-technology-here> sucks, let’s do the latest greatest thing based on <this-technology>”. However, way too often this represents a bad case of premature optimization. In the experiments which I’ve observed, a good Berkeley-socket-based implementation (the one with multiple sockets per thread, see item #21a below) was about on par (within the margin of error) with Completion-Port- and APC-based ones. On the other hand, using shared memory for same-machine communications has been observed to provide a 20% performance improvement (which is not much to start with, but at least makes some sense). While your mileage may certainly vary, the point here is that it is usually not a good idea to start with a network engine optimized for a very specific API (in the worst case, affecting your own APIs, which would prevent you from changing the underlying technology, and from cross-porting your library in the future).

I’m not claiming that all these “newer and better” technologies are useless (in particular, they may perform substantially differently when serving large files, or it may be that we’ve implemented them poorly for our experiments); what I’m saying is that it is very unlikely that sticking with a good Berkeley-socket-based implementation will make a “life-or-death” difference for your project, and that if you have a Berkeley-socket-based implementation, you’ll be able to change it to <whatever-technology-works-best-for-you> later – without changing all the game logic around. Also, a Berkeley-socket implementation has another advantage – it will run both on Windows and Linux, which may come in handy (see also item #22 below).

21a. Optimization: DO use multiple-sockets-per-thread

The very first thing on my list of optimizations is making sure that you can support multiple sockets per networking thread. Context switches are damn expensive (if we take cache poisoning into account, the cost of a context switch is of the order of 10,000 clocks (!)), and with single-socket-per-thread and typical gaming traffic, your program will spend much more time switching back and forth between threads than on actual work within those threads. The exact effect of switching from single-socket-per-thread to multiple-sockets-per-thread heavily depends on the game specifics, but I’ve seen around a 2x improvement, which is by far the largest effect I’ve observed from any optimization at this level.

This optimization works extremely well with those front-end servers (see item #20 above), especially in gaming environments. This is because for games, per-socket traffic usually consists of small and very sparse packets, so with one-socket-per-thread, threads are heavily underutilized, causing too many of those extremely expensive context switches. Multiple-sockets-per-thread helps to mitigate this problem.

I’ve heard arguments that multiple-sockets-per-thread introduces unnecessary latency, and is therefore unfair to players. Let’s take a closer look at this potential issue. First of all, even if multiple-sockets-per-thread does introduce observable latency, it is still fair as long as the distribution of players to threads is random (which it usually is, as long as you’re not giving anybody preferential treatment). Second, as long as the number of sockets per thread is small (such as 32-64 or so – this is enough to get the optimum speedup), and the number of threads is still much larger than the number of cores, latency patterns will most likely be indistinguishable from the single-thread-per-user model. What we can say latency-wise about multiple-sockets-per-thread (assuming that there is always a free CPU core to run the thread when a packet comes in) is that it exhibits an occasional single-digit-microsecond-range delay (due to queueing within the thread), and an occasional microsecond-range speedup (due to the lack of a context switch). Combined with typical over-the-Internet jitter being at least of the order of single-digit milliseconds (which is orders of magnitude larger than those microseconds), the effect of those occasional latency changes will be lost-beyond-detection in the much larger jitter noise. Moreover, for most games microsecond-range deviations, even if theoretically detectable, are well within the acceptable range (as long as the system stays fair).

If you’re satisfied that this issue is out of the way for your specific game, let’s see what we have implementation-wise. The good news in this department is that multiple-sockets-per-thread can be easily supported in a Berkeley-socket-based implementation (via non-blocking sockets). Even more good news is that as long as you have a strict Berkeley-socket implementation (using select()), changing it to poll()/epoll()/WaitForMultipleObjects() is next-to-trivial, so you can easily try any of them as soon as you’ve got a basic select()-based one working. A bit of bad news is that it still might be a premature optimization (or it might not be).
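
For reference, a bare-bones sketch of such a select()-based multiple-sockets-per-thread loop (POSIX/Berkeley sockets, non-blocking) might look as follows; it is only an illustration of the structure – on_packet() and the error-handling policy are placeholders, and real code would also add/remove sockets dynamically and watch for writability when sends would block.

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

void set_nonblocking(int fd) {
  int flags = fcntl(fd, F_GETFL, 0);
  fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

//one networking thread serving a few dozen sockets
void serve_sockets(std::vector<int>& socks) {
  char buf[4096];
  for(;;) {
    fd_set rd;
    FD_ZERO(&rd);
    int maxfd = -1;
    for(int s : socks) {
      FD_SET(s, &rd);
      if(s > maxfd) maxfd = s;
    }
    int n = select(maxfd + 1, &rd, nullptr, nullptr, nullptr); //block until something is readable
    if(n < 0) {
      if(errno == EINTR) continue;
      break; //real error; real code would log and recover
    }
    for(int s : socks) {
      if(!FD_ISSET(s, &rd)) continue;
      ssize_t r = recv(s, buf, sizeof(buf), 0);
      if(r > 0) {
        //hand the data over to the relevant node's queue; placeholder hook:
        //on_packet(s, buf, (size_t)r);
      } else if(r == 0 || (errno != EWOULDBLOCK && errno != EAGAIN)) {
        //connection closed or failed; real code would remove s from socks here
      }
    }
  }
}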

Oh, and if you need security/TLS: OpenSSL can be made to work in a multiple-sockets-per-thread environment (via non-blocking BIOs), though it is quite a bit of work.

Bottom line: multiple-sockets-per-thread is the only optimization which I’d consider for your first implementation. However, if you do need to support OpenSSL and/or non-trivial over-the-network protocols – it might not be worth the trouble; in such cases it might be better to start with classical single-socket-per-thread model and improve it later.

21b. Optimization: DO consider shared-memory for intra-server communications

In my experience, when going along the “start with a socket implementation, improve later” route, one of the most significant improvements comes from re-implementing intra-server communications with shared memory (the overall gain I’ve seen was around 20%). It is not that much (and YMMV for different systems), and you can live without it for a while, but in a hunt for the absolute best performance, it is extremely difficult to beat shared memory.
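
To give an idea of what “shared memory for intra-server communications” means in practice, here is a deliberately tiny single-writer/single-reader “mailbox” sketch using POSIX shared memory; the names (ShmMailbox, try_post(), try_read()) are my own, and a production version would be a proper ring buffer with real flow control, parking the reader on a semaphore/futex instead of polling.

//Linux/POSIX; may need -lrt on older glibc
#include <atomic>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct ShmMailbox {
  std::atomic<std::uint32_t> owner; //0 = slot free for the writer, 1 = message ready for the reader;
                                    //  fresh shm_open()+ftruncate() memory is zero-filled, so it starts at 0
  char payload[4096];
};
static_assert(std::atomic<std::uint32_t>::is_always_lock_free,
              "cross-process use requires lock-free atomics");

ShmMailbox* open_mailbox(const char* name, bool create) {
  int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
  if(fd < 0) return nullptr;
  if(create && ftruncate(fd, sizeof(ShmMailbox)) != 0) { close(fd); return nullptr; }
  void* p = mmap(nullptr, sizeof(ShmMailbox), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd); //the mapping stays valid after close()
  return p == MAP_FAILED ? nullptr : static_cast<ShmMailbox*>(p);
}

//writer process: returns false if the previous message hasn't been consumed yet
bool try_post(ShmMailbox* m, const char* msg) {
  if(m->owner.load(std::memory_order_acquire) != 0) return false;
  std::strncpy(m->payload, msg, sizeof(m->payload) - 1);
  m->payload[sizeof(m->payload) - 1] = '\0';
  m->owner.store(1, std::memory_order_release); //publish the payload to the reader
  return true;
}

//reader process: returns false if there is nothing new
bool try_read(ShmMailbox* m, char* out, std::size_t out_size) {
  if(m->owner.load(std::memory_order_acquire) != 1) return false;
  std::strncpy(out, m->payload, out_size - 1);
  out[out_size - 1] = '\0';
  m->owner.store(0, std::memory_order_release); //hand the slot back to the writer
  return true;
}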

21c. Optimization: DO consider non-blocking and half-blocking queues

If you’re following the advice from item #19 in Part IIIa and building your system around a Store-Process-and-Forward architecture, you have quite a few queues (FIFO-style ones) in your system. At a certain point (when you’ve already optimized your engine enough) you’re likely to find that those queues start to take significant time (due to waits on queues and the associated context switches).

The solution is to avoid those expensive context switches by using non-blocking data structures (based on Compare-and-Swap, a.k.a. CAS operations, which correspond to atomics in C++11). Such “lockfree” structures are available in the Windows API (Interlocked SList, [MSDN]), or as the boost::lockfree:: family. Both do as advertised – they provide fully lock-free data structures.

However, what we often need for inter-thread communication is not exactly a fully non-blocking queue, but rather a queue where the writer never needs to block, while the reader has the option to block on it (ideally – only when the queue is empty).

This kind of “no-writer-blocks-blocking queue” can be implemented using non-blocking primitives such as those mentioned above (NB: the code below is based on Windows and Interlocked*() functions, but a similar thing can be implemented using the boost::lockfree stuff):

#include <windows.h> //Interlocked*SList(), events

class NoWriterBlockQueue {
  PSLIST_HEADER slist; //points to a properly aligned SLIST_HEADER
                       //  (e.g. from _aligned_malloc(sizeof(SLIST_HEADER),MEMORY_ALLOCATION_ALIGNMENT)),
                       //  initialized with InitializeSListHead()
  HANDLE event;
  //constructor goes here
  //  event is initialized as an auto-reset event

public:
  void noblock_push(PSLIST_ENTRY item) {
    //having PSLIST_ENTRY in queue API is ugly,
    //  done here only to demonstrate the idea,
    //  SLIST_ENTRY SHOULD be wrapped for any practical implementation
    PSLIST_ENTRY prev = InterlockedPushEntrySList(slist,item);
    if(prev==NULL)
      SetEvent(event);//not exactly optimal
                      //  but still more or less non-blocking
  }

  PSLIST_ENTRY noblock_pop() { //returns a list of entries to be processed
    return InterlockedFlushSList(slist);
  }

  PSLIST_ENTRY wait_for_pop() { //returns a list of entries to be processed
    for(;;) {
      PSLIST_ENTRY dequeued = InterlockedFlushSList(slist);
      if(dequeued!=NULL)
        return dequeued;
      WaitForSingleObject(event,INFINITE);
      //spurious returns (with slist empty)
      //  are possible but harmless
    }
  }
};

While the implementation above is not exactly optimal (it causes not-strictly-necessary SetEvent() calls and spurious wake-ups from WaitForSingleObject()), it might be a reasonably good starting point for further optimizations in this field.
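
For what it’s worth, a roughly equivalent portable sketch can be written with standard C++ facilities; this is my own illustration only (it assumes C++20 for std::counting_semaphore, uses a push-and-flush Treiber stack, and – just like the Windows version above – returns entries in LIFO order, so the consumer needs to reverse them if FIFO processing matters).

#include <atomic>
#include <semaphore>

struct QueueNode { QueueNode* next; /* payload goes here */ };

class PortableNoWriterBlockQueue {
  std::atomic<QueueNode*> head{nullptr};
  std::counting_semaphore<> ready{0}; //counts "empty -> non-empty" transitions

public:
  void noblock_push(QueueNode* n) { //never blocks
    QueueNode* old = head.load(std::memory_order_relaxed);
    do {
      n->next = old;
    } while(!head.compare_exchange_weak(old, n,
              std::memory_order_release, std::memory_order_relaxed));
    if(old == nullptr)  //the queue was empty: wake a possibly sleeping reader
      ready.release();
  }

  QueueNode* noblock_pop() { //grabs the whole list (LIFO order), or nullptr
    return head.exchange(nullptr, std::memory_order_acquire);
  }

  QueueNode* wait_for_pop() {
    for(;;) {
      QueueNode* lst = noblock_pop();
      if(lst) return lst;
      ready.acquire(); //sleep until a writer signals "became non-empty";
                       //  extra signals only cause harmless extra iterations
    }
  }
};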

21d. Optimization: DO experiment further

The list of optimizations above is certainly not exhaustive; there can be many more things which you’ll be able to optimize. In particular, I don’t want to discourage you from trying all those latest-and-greatest I/O technologies (such as completion ports/APC/whatever else). The key, however, is to have a working and stable system first, and to improve it later.

22. DO consider Linux for servers

Even if your game engine is Windows-only – you should consider making your server side support Linux. On the one hand, Windows (contrary to popular belief) is capable of running huge loads (such as hundreds of millions of messages per day) for months without a reboot, so it is not that bad. On the other hand, contrary to another popular belief, protecting Windows from attacks is still significantly more difficult than protecting Linux. In particular, for a non-money-related game I would even be ready to commit the ultimate security fallacy – to run it on a Linux server wide-open to the Internet without a hardware firewall; for Windows I would still consider it suicide.

One more reason which makes Linux somewhat better suited for production servers than Windows is the ability to take a tcpdump capture (without installing anything and at almost-zero performance cost) right there on the server, and to analyze the results offline (see item #25 below for further details). On Windows, to achieve the same thing, you’d need to install 3rd-party software (such as Wireshark) on your production server, and the less 3rd-party stuff you have on the server – the better (both from a reliability and from a performance point of view).

22a. DO fight dependencies, especially on Windows

When I was saying in the previous item that Windows is capable of running huge loads – I meant it, but there is a caveat. Windows is indeed capable of running huge loads reliably – but only as long as you severely curtail your dependencies.

The fewer services you need to run on your production server – the better; the fewer DLLs/.so’s you need to link – the better; and so on – this applies to any platform. It helps both to improve reliability and to improve security, both for Windows and for Linux. However, for Windows programs there are usually many more dependencies than for Linux programs (which means that the Windows team has done a good job promoting vendor lock-in), so dependencies tend to be much more of a problem for Windows-based programs. In any case, you should keep a list of all your dependencies, and each and every new dependency needs to be discussed before being accepted.

A tiny real-world story about dependencies: the best thing I’ve ever seen in this regard was a server-side Windows process which directly linked exactly one DLL – kernel32.dll (which, of course, indirectly linked ntdll.dll). The process used shared memory to communicate with the other processes on the same machine, and the whole thing was able to run for months without any issues. While having even fewer dependencies is certainly not an option, I usually try at least to come as close to this “absolute dependency minimum” as possible; it tends to help quite a bit in the long run.

23. DO implement application-level balancing

When you have those front-end servers (see item #20 above), there is a question of “how should different clients reach those different front-end servers?” There are three answers to this question – two are classical ones, and one is unorthodox but quite handy for our specific task of “game-with-our-own-client”.

The first classical answer to the balancing problem is “use a hardware load balancer”. This is a box (and a very expensive one) which sits in front of your front-end servers and balances the load between them. This is what your network admins will push you (and very hard at that) to do. It might even work for you, but there are inherent problems with this approach:

  • each such balancer needs to handle all your traffic (sic!)
    • as it is inherently overloaded, it might start dropping packets
      • which in turn will degrade your player experience
  • these boxes are damn expensive
  • such a box is either a single point of failure, or redundancy will make it even more expensive
    • redundancy implementations of the balancers have been observed to be a source of failures themselves
  • it is yet another box which might go wrong (either the hardware may fail, or it may get misconfigured)
  • it can’t possibly balance across different locations

On the positive side, the only thing I can see is that hardware balancers can, at least in theory, provide better balancing in cases when one single player can eat up 10+% of a single server’s load (which I don’t see happening in practice, but if your game requires it – by all means, take it into account).

As you can see, I’m not exactly an avid fan of hardware load balancers. This is not to say that they’re useless in general – there are cases when you don’t have any better options. One big example is when you need to balance web traffic, and then it is either DNS round robin described below, or hardware load balancers, with both options being rather ugly. Fortunately, for our game-with-client purposes we don’t need to decide which of them is worse (see ‘unorthodox-but-my-favorite approach’ below).

The second classical solution for the balancing problem is “DNS round robin”. It is hacking your own DNS to return different addresses to different clients in a random manner. Unfortunately, this approach has One Big Fat Problem: it doesn’t provide for easy failover, so if one of your round-robin servers goes dead, some of your clients won’t be able to play until you fix the problem. Ouch. Certainly not recommended.

An unorthodox-but-my-favorite approach to the balancing problem is to have the balancing performed by the client (it is our own client anyway). The idea is to have a list of servers embedded into the client (probably in some config file), and to try them in a random manner until a good one is found. It addresses most of the problems with the two approaches above, handles all kinds of failures in front-end servers in a very simple and natural way (the only thing you need to do in your client is to detect that the connection has failed, and try another random server from the list), and in practice achieves almost-perfect balancing. While 99% of network engineers out there will tell you that application-level balancing is a Bad Idea (preferring hardware “load balancers” instead), you should still implement it. In addition to overcoming those disadvantages of hardware load balancers described above, there are several additional reasons to prefer your own application-level balancing:

  • With application-level balancing, you can balance between different datacenters. Depending on the nature of your application and deployment architecture, it might help you to deal with DDoS attacks, and help a lot.
  • With application-level balancing, you can easily balance between different cloud nodes residing whenever-they-prefer-to-reside.
  • With application-level balancing, you don’t risk the “load balancer” itself becoming a bottleneck or introducing some packet loss etc., which in turn might affect the end-user experience (while this is a bit of a repetition of the disadvantages of hardware load balancers, it is important enough to be mentioned again)

Bottom line: just implement application-level balancing as described above. It will take you two hours to do it (including testing), and in the extreme case, your clients just won’t use it.

23a. DO use both DNS-based addresses and numeric IP addresses

When implementing your application-level balancing as described right above, store both DNS-based and number-based addresses of the same server within your client-side app. While this advice is somewhat controversial (and once again, network engineers will bash you for doing it), it allows you to handle scenarios where the end-user has Internet access, but his ISP’s DNS server is down (which does happen rather often).

By adding number-based IP addresses to the mix, you’ll make your app able to work when your competition (which is usually about any other game out there) is not working for that specific user. It won’t happen often (around 0.x% of the time, with x being a small integer), so this difference might be insignificant. However, if your game is the only one working when all the competition is down, it improves user perception of your app a lot (at essentially zero cost to you). And with modern app updates being much more frequent than IP address changes (plus you should keep the DNS address too, there is no reason not to), all the arguments against using numeric-IPs-which-might-change-all-of-a-sudden become pretty much insignificant.
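
Putting items #23 and #23a together, a minimal client-side sketch (POSIX sockets) might look as follows; the server list, host names, and IP addresses here are placeholders, and real code would add connect timeouts and keep this off the UI thread.

#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
#include <algorithm>
#include <random>
#include <string>
#include <vector>

struct ServerAddr { std::string host; std::string port; }; //shipped with the client (e.g. in a config file)

int connect_to_any(std::vector<ServerAddr> servers) {
  //random order => the load spreads evenly across front-end servers, with no balancer box involved
  std::shuffle(servers.begin(), servers.end(), std::mt19937{ std::random_device{}() });
  for(const auto& srv : servers) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    addrinfo* res = nullptr;
    //getaddrinfo() parses numeric IPs locally, so those entries keep working even when DNS is down
    if(getaddrinfo(srv.host.c_str(), srv.port.c_str(), &hints, &res) != 0) continue;
    for(addrinfo* ai = res; ai != nullptr; ai = ai->ai_next) {
      int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
      if(fd < 0) continue;
      if(connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) { //real code: non-blocking connect with a timeout
        freeaddrinfo(res);
        return fd; //connected - done
      }
      close(fd); //this one failed, try the next address/server
    }
    freeaddrinfo(res);
  }
  return -1; //everything failed: report "no connectivity" to the user
}

//usage (placeholder addresses): each server is listed both by DNS name and by numeric IP
//int fd = connect_to_any({ {"fe1.example.com","7777"}, {"203.0.113.10","7777"},
//                          {"fe2.example.com","7777"}, {"203.0.113.11","7777"} });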

24. DO test your engine/game over a Bad Connection and over Trans-Atlantic Connection

Way too often I have seen network applications which worked perfectly within a LAN, and failed badly when moved to the Internet. If your application is targeted at the Internet – test it over the Internet from the very beginning of development. I really, really recommend it. This kind of testing will save you a lot of effort (and a lot of embarrassment) later.

Moreover, you should test your application not over “just any Internet connection”, but should keep “the very worst connection you want to support” (which is often “the very worst connection you can find”) just for testing purposes. In one of my large projects back in 2000 or so, we used an AOL dial-up for testing purposes, and it worked like a charm. I don’t mean that the connection was good; on the contrary, it was pretty bad, but it meant that after we made our application work over this pretty bad connection, it worked without any issues over any connection.

Another set of tests which you SHOULD run is testing your game over a trans-Atlantic connection. Once I saw a (business) application which worked ok in a LAN, but when deployed over a trans-Atlantic connection, opening a form began to take 20+ minutes. The problem was an obviously excessive number of round-trips; rewriting it to use a different network protocol helped to improve the performance 400x, bringing the opening of a form down to an acceptable few-seconds level. Writing it this way from the very beginning would have saved a few months of work and a lot of embarrassment when showcasing it to customers.

With these two test setups (one being “the worst connection you want to support”, the other being trans-Atlantic) – you can be reasonably confident that your game won’t have too many problems when deployed to the real world.

In addition to testing your engine as described above, you should also encourage the developers who write games on top of your engine to do the same. Quite a few networking issues can easily apply not only to the engine but also to the game itself, and sorting them out from the very beginning is invariably a Good Thing.

25. DO analyze your traffic with Wireshark

There is a wonderful tool out there which every network developer should use – Wireshark. If you haven’t done it yet – take a look at your application’s traffic with Wireshark. In the ideal case, it won’t tell you anything new about your game engine, but chances are it will show you something different from what-you’d-expect, and you might be able to improve things as a result. In addition, experience with Wireshark comes in very handy when you need to debug network problems on a live production server. If your server is a Linux one, you can take a tcpdump capture of the traffic of the user who has problems, get it to your local PC, and analyze what is happening to this unfortunate user using Wireshark. Neat, and very useful in practice!

And if you’re developing a game engine intended for lots of games, consider developing a Wireshark plugin, so game developers are able to analyze your traffic. While this may be at odds with security-by-obscurity, let’s face it: if your engine is popular, all your formats and protocols will be well-known anyway.

26. DO Find Metrics to Measure your Player Experience Network-Wise

This one is a bit tricky, but the idea is the following. When deploying a large system, there is always a question about system health. And for an over-the-Internet game system, a question of network health is a Really Important one. While you can use different metrics for this purpose, practice has shown that the best metrics are those which are observable in user space.

For example, if you’re using TCP as a transport, you can use “number of non-user-initiated disconnects per user per hour”; if you’re using UDP as a transport – it can be something like “percentage of packets lost” and/or “jitter”. In any case, what is really important – is to have some way to see “how changes in deployed system affect user experience”.

Why is this so important? Because it allows you to analyze lots and lots of things which are very difficult to find out otherwise. Just one practical example: at some point, I saw a “great new blade server” installed instead of a bunch of front-end servers for a large multiplayer game. So far so good, but it was observed that those users connected via this “great new blade server” were experiencing a few more disconnects per hour than those connected to older-style 1U boxes.

An investigation led to missing flow control on the specific model of blade chassis hub (!) – which of course was promptly replaced. While this one single example didn’t make much overall difference from the player perspective, over the course of several years there were dozens of such issues (including comparisons with-hardware-balancer vs without-hardware-balancer, comparisons of different ISPs and inter-ISP peerings, comparisons before/after datacenter migration, etc.). I feel that having these metrics contributed significantly to the reputation for “the best connectivity out there” which has been enjoyed by the game in question.

BTW, the changes which can affect such metrics are not restricted to hardware stuff – certain software changes (such as protocol changes) were also observed to affect user experience metrics. It means that these metrics will be good both for the admins-who-deploy-your-game, and for you as a developer, to see whether your recent changes have made the life of your players worse. Most importantly, however, this approach allows you to keep your players happy – and this is the one thing which really matters (it is the players who’re paying the bills, whether we as developers like it or not [NoBugs2011]).

To be Continued…

This post concludes Part III of the article. Stay tuned for Part IV, Great TCP-vs-UDP Debate.

EDIT: The series has been completed, with the following parts published:
Part IV. Great TCP-vs-UDP Debate
Part V. UDP
Part VI. TCP
Part VIIa. Security (TLS/SSL)
Part VIIb. Security (concluded)


References

[MSDN] “Interlocked Single Linked Lists”, MSDN

[NoBugs2011] 'No Bugs' Hare, “The Guy We're All Working For”, 2011

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Multi-threading at Business-logic Level is Considered Harmful


Assumption is a mother of all screw-ups

— Honorable Mr. Eugene Lewis Fordsworthe —

For quite a long time (since I first needed to deal with non-trivial multi-threading 15 years ago) I have known that mixing multi-threading (especially thread synchronization) with business logic is a Really Bad Idea, and have argued for avoiding it whenever possible (see, for example, [NoBugs10]). This notion became so deeply ingrained in my mind that I erroneously started to assume that everybody else shares this knowledge (or belief, depending on which side of the argument you are on ☺).

As usually happens with assumptions, Mother Nature has once again proved that I was wrong. Recently I wrote an article on networking for games [NoBugs15], where I took ‘mixing multi-threading with business logic is a Bad Idea’ for granted; I received feedback that this is unclear and needs explaining. Ok, here goes the explanation (an outline of this explanation has already been provided in [Ignatchenko15], but this is a much more elaborate version with a few additional twists).

Last Straw: Business Logic+Multithreading

There are four Big Reasons for avoiding handling both business logic and non-trivial multi-threading within the same pieces of code. However, before going into reasons, we need to provide some definitions.

Definitions

In this field, a lot depends on how trivial your multi-threading is. For example, if you have multi-threading where all the synchronization is performed on one single mutex, we can call it ‘trivial’ (and, as shown below, you’re likely to be able to get away with it). However, the window for triviality is very narrow: for example, even going to two interrelated mutexes instead of one can easily make the multi-threading non-trivial (and, as discussed below, has the potential to make your life a nightmare).

Another example is trivialized multi-threading (a close cousin of the trivial one); one good example of trivialized multi-threading is when all the inter-thread interactions are made only via queues. It doesn’t mean that implementing queues is trivial, but that from the point of view of the developer-who-writes-business-logic, he doesn’t need to care about queue implementation details. In other words, the problem is not about having multi-threading within your program, it is about mixing multi-threading synchronization with business logic in the same piece of code.

Now, we’re all set to start discussing why you shouldn’t intermix non-trivial multi-threading synchronization with business logic.

Reason 1: Cognitive limits of the human brain

In psychology, there is a well-known ‘7 ± 2’ cognitive limit. This means that the number of objects an average human can hold in working memory is 7 ± 2. When you go above this limit, you get a kind of ‘swapping’ (you ‘swap out’ some entities to free space in your working memory, only to ‘swap them back in’ when they’re needed). And from our programming experience, we all know what swapping does to performance (‘slowing down to a crawl’ being a very mild description). A similar thing happens when a human being goes beyond his cognitive capacity – the process of solving the problem becomes so slow that often the problem cannot be solved at all (unless it can be split into smaller problems, with each of these problems fitting into the cognitive limits).

BTW, don’t think that, as you are not an average person, you will be able to process 70 objects or entities instead of the average 7 – you won’t; the best you can realistically hope for is 10–15, and this difference won’t change our analysis. And even if you have on your team one person with an exceptionally high cognitive limit, you can be sure that it is extremely uncommon, which means that relying on her abilities to maintain your program is a Really Bad Idea. The simple question, “What are we going to do when she leaves?” is enough to bury the idea of relying on One Single Developer (however much of a genius she is).

So, how does this 7 ± 2 limit apply to combining business logic with multi-threading? The answer is simple: for real-world programs, each of these things is already complicated enough and usually is already pushing this “7 ± 2” limit. Combining them will very likely take you over the limit, which will likely lead to the problem of ‘making the program work’ becoming unsolvable. Exceeding the limit becomes even more obvious when we observe that when adding multi-threading to business logic, we’re loading our brain not only with the analysis of readily visible entities such as threads and mutexes, but also with less obvious questions such as ‘how will existing business objects interact with this mutex? With these TWO mutexes?’ This brings the number of entities even higher, which in turn makes the cognitive overload even worse.

For trivial (and trivialized) multi-threading, this effect, while present, can be seen as adding (very roughly) only one additional entity; while even one additional entity can also bring you over the cognitive limit, it is still much better than having dozens of additional entities in scope. Also, cognitive limits are not exactly hard limits as in “9 and you’re fine, 10 and you’re mine”, and while one extra entity over the limit would clearly mean reduced overall performance of the developer, it isn’t likely to cause 100% drop in performance (so it shouldn’t go into the ‘problem never solved’ area). Therefore, given the very small typical numbers for cognitive limits, while adding even one entity will be noticeable (so is not desirable), it is not very likely to be fatal.

Reason 2: Non-determinism is bad enough, but inherently untestable programs are even worse

We don’t know what we have until we lose it

— proverb —

Non-trivial multi-threaded code usually has one property – it is inherently non-deterministic.

By the very definition of pre-emptive multi-threading, context switches happen not when you expect them, but between any two assembler-level instructions (yes, we’re not considering disabling interrupts within business logic). On one run of the program, a context switch may happen between lines A and B, and on the next run of the very same program, it may happen between lines B and C (on some runs it may happen even in the middle of a line of code, if it is compiled to more than one assembly instruction). It means that the multi-threaded program MAY become non-deterministic, i.e. it MAY behave differently from one run to another even if all the program inputs are exactly the same.

One may ask, “What is so bad about that?” Unfortunately, this potential non-determinism has several extremely unpleasant implications.

A: Untestability

As you have no way to control context switches, you cannot really test your program.

Your multi-threaded program can pass all of your tests for years, and then, after you’ve changed a line in one place, a bug in a completely unrelated place (which has existed for all these years, but was hidden) – starts to manifest itself. Why? Just because context switching patterns have shifted a bit, and instead of context switch between lines A and B, you’ve got a context switch between lines B and C.

In [Ignatchenko98] a multi-threading bug is described which manifested itself in a 20-line program specially written to demonstrate the bug, and it took anywhere between 20ms and 20s (on the very same computer, just depending on the run) for the bug to manifest itself (!). On a larger scale – it was a bug no less than in the Microsoft C++ STL implementation shipped with MSVC (carrying a copyright by no less than P.J. Plauger), and while the bug had been sitting there for years and did manifest itself in real-world environments, the manifestation was usually like “our program hangs about once a month on a client machine for no apparent reason”, which is virtually impossible to debug. Only careful analysis of the STL code found the bug (and the analysis wasn’t related to any specific problem with any specific program – it was done out of curiosity).

Another example of untestability is as follows. Your program passes all the tests in your test environment, but when you deploy it to the client’s computer, it starts to fail. I’ve observed this pattern quite a few times, and can tell that it is extremely unpleasant for the team involved. The reason for failure is the same – context switch patterns have shifted a bit due to different hardware or due to different load patterns on client’s machine.

Bottom line: you cannot rely on testing for multi-threaded programs. Bummer.

B: Irreproducibility

Non-determinism implies that on every program run you get different patterns.

This means that if you have a bug in your multi-threading code, you won’t really be able to jump to a certain point in the debugger and see what’s going on (nor will you be able to print what happens there, unless you’re printing everything in sight over the whole program, which by itself will shift patterns and may mask the bug). Ok, technically you are able to jump to any point of your program, but the variables you see may (and if you have a multi-threaded bug – will) differ every time you jump there.

This makes debugging multi-threaded issues a nightmare. When the bug manifests itself about every 50th run of the program, it is already bad enough for debugging, but when the pattern you see is a bit different every time when it happens – your task of debugging the program can easily become hopeless.

Many of you will say “Hey, I’ve debugged multi-threaded programs, it works perfectly”. Indeed, much debugging works in a multi-threaded environment, and you can debug a multi-threaded program, you just cannot debug subtle multi-threaded issues within your non-trivial multi-threaded program.

To allow for multi-threaded debugging in one of many complicated multi-threaded projects, we went as far as creating our own fiber-based framework which simulated threads, with our own simulated scheduler and switching at the relevant points. Our simulated scheduler was driven by a pseudo-random generator, so by seeding it with the same original seed, we got determinism back and were able to debug the program. For us, it was the only way to debug that program (!). There are similar tools out there (just Google for “deterministic framework to debug multi threaded program”), and they might help, but while helpful for debugging small primitives, such methods are inherently very time-consuming and most likely will be infeasible for ongoing debugging of your business logic.

C: Need for proofs of work (or exhaustive deterministic testing)

So, we’ve found (both from theory and illustrated by experience) that no kind of testing can serve as a reasonable assurance that your multi-threaded program will work, and that debugging is likely to be a real problem. Sounds Really Bad, doesn’t it? More importantly, can we do something about it?

In practice, I tend to provide proofs of work for any non-trivial multi-threaded code. I’ve found from experience that it is the only way to ensure that a multi-threaded program will work 100% of the time (as opposed to working 99.99% of the time, which means failing here and there), and will work everywhere.

For small pieces of code (20–50 lines) it is perfectly feasible. The level of formality you need for your proofs is up to you, but it is important at least to convince yourself and somebody else, that with any pattern of switches the piece of code in question will work as expected. One good example of code where more or less formal proofs are feasible (and necessary) is an implementation of the queue for inter-thread communications.

Of course, for thousands-of-lines business logic, such proofs are not feasible (that is, unless you trivialize the interaction of business logic with multi-threading).

An alternative to proofs of work is to use one of those deterministic testing frameworks mentioned above, and to perform exhaustive testing, testing the program behavior for all the possible (or at least relevant, though the notion of ‘relevant’ requires very careful consideration) context switches. Our own framework (the one mentioned above) did allow such testing, but times for such exhaustive testing were growing at least exponentially as the size (more precisely – number of points of interest where the context switch might be relevant) of the program grew, so once again such exhaustive testing wasn’t feasible for the programs with over 20–50 lines of code.

Reason 3: Code fragility

A logical consequence of untestability and the need for proofs of work is code fragility. If, whenever you need to change the program, you need to re-prove that it still works, this cannot be safely entwined with business logic (which, by definition, changes 5 times a day). If, whenever you’re changing something, you’re afraid that it might break something somewhere 50000 lines of code away, it won’t work either.

More formally, a non-trivial mixture of business logic with thread synchronization is inherently fragile. Any change in business logic is likely to affect non-trivial thread synchronization, which in turn is likely to lead to impossible-to-test and next-to-impossible-to-debug bugs.

Reason 4. Context switching granularity

To be efficient, multi-threaded programs SHOULD make sure that they don’t cause too much context switching (i.e. multi-threading SHOULD be coarse-grained rather than fine-grained). The thing is that context switches are damn expensive (taking into account the cost of recovery from thread caches being flushed out by another thread, think of the order of 10,000 CPU clock ticks on x86/x64).

For example, if you want to move an integer addition to another thread, you’re likely to spend 20,000 CPU clock ticks on the 2 context switches (to the other thread and back, with roughly half of this cost being borne by your original thread), in order to save 0.75 CPU clocks by offloading the addition. Of course, this is an extreme example, but way too often multi-threading is used without understanding the implications of the cost of the context switches.
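
As a deliberately absurd illustration of this kind of fine-grained offloading (my own toy example, not from the original text), compare the following to a plain int sum = a + b; – the handoff to another thread and back costs tens of thousands of clock cycles, dwarfing the addition itself:

#include <future>

//anti-pattern: offloading a ~1-clock operation to another thread
int offloaded_add(int a, int b) {
  std::future<int> f = std::async(std::launch::async, [a, b] { return a + b; });
  return f.get(); //blocks the calling thread: context switch out, and later back in
}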

In this regard, separating business logic from threading helps to establish a well-defined interface which encourages (though doesn’t guarantee) coarse-grained granularity. For example, when having queues for inter-thread communications, it is usually easier to write a coarse-grained program, which is (as a rule of thumb; as with anything else, there are exceptions) a Good Thing.

On the opposite side, code which intermixes business logic and thread synchronization tends to overlook the need to keep granularity in check; while in theory it is possible to handle it properly, in practice adding it to the equation is not feasible, not least because of adding yet another layer of entities, overloading (already overloaded) cognitive limits even further.

Hey, there are working multi-threaded programs out there!

One may say: “Hey, you’re saying that writing multi-threaded programs is impossible, but everybody and his dog is writing multi-threaded programs these days!”. You do have a point. However:

  • Quite a few multi-threaded programs are using trivial multi-threading (for example, with a single mutex). Have you ever seen a multi-threaded program which is able to utilize only 1.2 cores? They’re likely using a single mutex. And BTW, I cannot blame them as long as they provide adequate overall performance: if a marketing guy has said “we need to write ‘support for multiple cores’ on our website, because all the competition does it”, a single mutex is one way to do what marketing wants without jeopardizing the whole project.
  • Quite a few programs (think of video codecs) do really need to utilize multiple cores, but don’t really have much business logic (depends on how you define ‘business logic’, but at least it doesn’t change too often for codecs). They may get away with more or less complicated thread sync, but even for video codecs having per-frame (or per-large-part-of-frame) processing granularity (with clearly defined inter-thread interfaces such as queues) tends to work better than alternatives.
  • Quite a few multi-threaded programs out there do have those difficult-to-find-and-debug bugs. This is especially true for those programs which don’t have a multi-million install base, but having a large install base certainly doesn’t guarantee that the program is multi-threaded-bug-free. I would guesstimate that for those programs which are released (i.e. out of the development shop), at least 50% of crashes are related to multi-threading.
  • And finally, there are programs out there which do follow the principles outlined in the next section, ‘Divide and conquer’.

Divide and conquer

The first step in solving a problem is to recognize that it does exist

Zig Ziglar

Despite the ‘Divide and conquer’ concept coming from politics, it is still useful in many fields related to engineering, and is usually a Good Thing to use in the context of programming (not to be confused with programming team management!).

Jokes aside, if we can separate business logic from non-trivial multi-threading (trivializing multi-threading interaction from the point of view of business logic), we will be able to escape from (or at least heavily mitigate) all the problems described in this article. The number of entities to fit into cognitive limits will come back to reasonable numbers, business logic will become deterministic again (and while multi-threading synchronization will still require proofs of work, they are feasible for small and almost-never-changing pieces of code), code will be decoupled and will become much less fragile, and coarse-grained granularity will be encouraged.

The only teensy-weensy question remaining is “how to do it”. There are several approaches to start answering this question, and I hope to describe one of them sooner rather than later. For now, we need to recognize that we do have a problem; solving it is the next step.


References

[Ignatchenko15] Sergey Ignatchenko, “Three Reasons to Avoid Intermixing Business Logic and Thread Synchronization”
[Ignatchenko98] Sergey Ignatchenko, “STL Implementations and Thread Safety”, C++ Report, July/Aug 1998
[Loganberry04] David ‘Loganberry’, “Frithaes! – an Introduction to Colloquial Lapine!”
[NoBugs10] ‘No Bugs’ Hare, “Single-Threading: Back to the Future?”
[NoBugs15] ‘No Bugs’ Hare, “64 Network DO’s and DON’Ts for Game Engines. Part IIIa: Server-Side (Store-Process-and-Forward Architecture)”

Disclaimer

As usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translator and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


 

Chapter V(d). Modular Architecture: Client-Side. Client Architecture Diagram, Threads, and Game Loop


Cover of the upcoming book
[[This is Chapter V(d) from the upcoming book “Development&Deployment of Massively Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing“. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MMOG: Table of Contents.]]

After we’ve spent quite a lot of time discussing boring things such as deterministic logic and finite automata, we can go ahead and finally draw the architecture diagram for our MMO game client. Yahoo!

Queues and Finite State Machines (QnFSM) architecture diagram

However, as the very last delay before that glorious and long-promised diagram, we need to define one term that we’ll use in this section. Let’s define a “tight loop” as an “infinite loop which goes over and over without delays, regardless of any input”.1 In other words, a tight loop is bound to eat CPU cycles (and lots of them) regardless of whether it is doing any useful work.
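In code terms, the difference looks roughly like this (a minimal sketch; poll_input(), do_work(), queue, and fsm are purely illustrative names, not anything defined in this book):

// a tight loop: burns CPU whether or not there is anything useful to do
while(true) {
  poll_input();  // returns immediately even if there is no input
  do_work();     // may have nothing useful to do on this iteration
}                // no wait()/select()/sleep() anywhere – that’s what makes it “tight”

// a non-tight-looped thread (as in the diagrams below): sleeps inside
// its Queue whenever there is nothing to process
while(true) {
  Message msg = queue.pop();  // blocks for as long as the Queue is empty
  fsm.process_event(msg);
}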

And now we’re really ready for the diagram :-) .


1 while different interpretations of the term “tight loop” exist out there, for our purposes this one will be the most convenient and useful

 

Queues-and-FSMs (QnFSM) Architecture: Generic Diagram

Fig. V.2 shows a diagram which describes a “generic” client-side architecture. It is admittedly more complicated than many of you will want or need to use; on the other hand, it represents quite a generic case, and many simplifications can be obtained right out of it by simply joining some of its elements.

Fig V.2. MMOG Client Architecture Diagram

Let’s name this architecture a “Queues-and-FSMs Architecture” for obvious reasons, or “QnFSM” for short. Of course, QnFSM is (by far) not the only possible architecture, and not even the most popular one, but its variations have been seen to produce games with extremely good reliability, extremely good decoupling between parts, and very good maintainability. On the minus side, I can list only a bit of development overhead due to the message-based exchange mechanism, but from my experience it is more than compensated for by better separation between different parts and very well-defined interfaces, which leads to development speed-ups even in the medium run (and is even more important in the long run to avoid spaghetti code). Throw in the ability of “replay debug” and “replay-based post-mortem” in production, and it becomes a solution for lots of real-world pre-deployment and post-deployment problems.

In short – I’m an extremely strong advocate of this architecture (and its variations described below), and don’t know of any practical cases when it is not the best thing you can do. While it might look over-engineered at first glance, it pays off in the medium and long run.2

I hope the diagram on Fig V.2 is more or less self-explanatory, but I will elaborate on a few points which might not be too obvious:

  • each of the FSMs is a strictly-deterministic FSM as described in the “Event-Driven Programming and Finite State Machines” section above
    • while being strictly-deterministic is not a strict requirement, implementing your FSMs this way will make your debugging and post-mortem analysis much, much easier.
  • all the exchange between different FSMs is message-based. Here “message” is a close cousin of a network packet; in other words – it is just a bunch of bytes formatted according to some convention between the sending thread and the receiving thread.
    • There are different ways to pass these messages around; examples include explicit message posting, or implementing non-blocking RPC calls instead. While the Big Idea behind the QnFSM architecture won’t change because of the way the messages are posted, convenience and development time may change quite significantly. Still, while important, this is only an implementation detail which will be further discussed in Chapter [[TODO]].
    • for the messages between Game Logic Thread and Animation&Rendering Thread, format should be along the lines of “Logic-to-Graphics API”, described in “Logic-to-Graphics Layer” section above. In short: it should be all about logical changes in the game world, along the lines of “NPC ID=ZZZ is currently moving along the path defined by the set of points {(X0,Y0),(X1,Y1),…} with speed V” (with coordinates being game world coordinates, not screen coordinates), or “Player at seat #N is in the process of showing his cards to the opponents”.3
  • each thread has an associated Queue, which is able to accept messages, and provides a way to wait on it as long as the Queue is empty
  • the architecture is “Share-Nothing”. It means that there is no data shared between threads, and the only way to exchange data between threads is via Queues and messages-passed-via-the-Queues
    • “share-nothing” means no thread synchronization problems (there is no need for mutexes, critical sections, etc. etc. outside of your queues). This is a Really Good Thing™, as trying to handle thread synchronization with any frequently changeable logic (such as the one within at least some of the FSMs) inevitably leads to lots and lots of problems (see, for example, [NoBugs2015])
    • of course, the implementation of the Queues still needs to use inter-thread synchronization, but this is a one-time effort and it has been done many times before, so it is not likely to cause too much trouble (a minimal sketch is shown right after this list); see Chapter [[TODO]] for further details on Queues in C++
    • as a nice side effect, it means that whenever you want it, you can deploy your threads into different processes without changing any code within your FSMs (merely by switching to an inter-process implementation of the Queue). In particular, it can make answering very annoying questions such as “who’s guilty of the memory corruption” much easier
  • The Queues of the Game Logic Thread and the Communications Thread are rather unusual. They’re waiting not only for usual inter-thread messages, but also for some other stuff (namely input messages for the Game Logic Thread, and network packets for the Communications Thread).
    • In most cases, at least one of these two particular queues will be supported by your platform (see Chapter [[TODO]] for details)
    • For those platforms which don’t support such queues – you can always use your-usual-inter-thread-queue (once again, the specifics will be discussed in Chapter [[TODO]]), and have an additional thread which will get user input data (or call select()), and then feed the data into your-usual-inter-thread-queue as yet another message. This will create a strict functional equivalent (a.k.a. “compliant implementation”) of the two specific Queues mentioned above
  • all the threads on the diagram (with one possible exception being Animation&Rendering Thread, see below) are not tight-looped, and unless there is something in their respective Queue – they just wait on the Queue until some kind of message comes in (or select() event happens)
    • while “no-tight-loops” is not a strict requirement for the client-side, wasting CPU cycles in tight loops without Really Good Reason is rarely a good idea, and might hurt quite a few of your players (those with weaker rigs).
    • Animation&Rendering Thread is a potentially special case, and MAY use tight loop, see “Game Loop” subsection below for details
    • to handle delays in threads other than the Animation&Rendering Thread, Queues should allow FSMs to post some kind of “timer message” to their own thread
    • even without tight loops it is possible to write your FSM in an “almost-tight-loop” manner that closely resembles real-world real-time control systems (and the classical Game Loop too), but without the CPU overhead. See more on it in the [[TODO!! – add subsection on it to “FSM” section]] section above.
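For reference, here is a minimal sketch of such a per-thread Queue (assuming C++; the names and the trivial mutex+condition-variable implementation are mine, and a production Queue would also need timer messages, shutdown, prioritization, etc.):

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

using Message = std::vector<uint8_t>;  // “just a bunch of bytes”

class Queue {
public:
  void push(Message msg) {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      msgs_.push_back(std::move(msg));
    }
    cv_.notify_one();  // wake the owning thread if it is waiting
  }

  Message pop() {  // blocks for as long as the Queue is empty
    std::unique_lock<std::mutex> lock(mtx_);
    cv_.wait(lock, [this] { return !msgs_.empty(); });
    Message msg = std::move(msgs_.front());
    msgs_.pop_front();
    return msg;
  }

private:
  std::mutex mtx_;
  std::condition_variable cv_;
  std::deque<Message> msgs_;
};

// the owning thread then becomes a trivial loop around its FSM:
//   for (;;) { fsm.process_event(queue.pop()); }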

2 As usual, “I don’t know of any cases” doesn’t provide guarantees of any kind, and your mileage may vary, but at least before throwing this architecture away and doing something-that-you-like-better, please make sure to read the rest of this Chapter, where quite a few of potential concerns will be addressed
3 yes, I know I’ve repeated it for quite a few times already, but it is that important, that I prefer to risk being bashed for annoying you rather than being pounded by somebody who didn’t notice it and got into trouble

 

Migration from Classical 3D Single-Player Game

If you’re coming from single-player development, you may find this whole diagram confusing; this may be especially true for the inter-relation between Game Logic FSM and Animation&Rendering FSM. The idea here is to have 95% of your existing “3D engine as you know it”, with all the 3D stuff, as a part of “Animation&Rendering FSM”. You will just need to cut off game decision logic (which will go to the server-side, and may be partially duplicated in Game Logic FSM too for client-side prediction purposes), and UI logic (which will go into Game Logic FSM). All the mesh-related stuff should stay within Animation&Rendering FSM (even Game Logic FSM should know absolutely nothing about meshes and triangles).

If your existing 3D engine is too complicated to fit into a single-threaded FSM, it is ok to keep it multi-threaded as long as it looks “just like an FSM” from the outside (i.e. all the communications with Animation&Rendering FSM go via messages or non-blocking RPC calls, expressed in terms of the Logic-to-Graphics Layer). For details on using FSMs for multi-threaded 3D engines, see the “On Additional Threads and Task-Based Multithreading” section below. Note that depending on the specifics of your existing 3D rendering engine, you MAY need to resort to Option C; while Option C won’t provide you with FSM goodies for your rendering engine (sorry, my supply of magic powder is quite limited), you will still be able to enjoy all the benefits (such as replay debugging and production post-mortem) for the other parts of your client.

It is worth noting that Game Logic FSM, despite its name, can often be more or less rudimentary, and (unless client-side prediction is used) mostly performs two functions: (a) parsing network messages and translating them into the commands of the Logic-to-Graphics Layer, and (b) UI handling. However, if client-side prediction is used, Game Logic FSM can become much more elaborate.

Interaction Examples in 3D World: Single-Player vs MMO

Let’s consider three typical interaction examples after migration from single-player game to an MMO diagram shown above.

MMOFPS interaction example (shooting). Let’s consider an MMOFPS example when Player A presses a button to shoot with a laser gun, and game logic needs to perform a raycast to see where it hits and what else happens. In single-player, all this usually happens within a 3D engine. For an MMO, it is more complicated:

  • Step 1. button press goes to our authoritative server as a message
  • Step 2. authoritative server receives message, performs a raycast, and calculates where the shot hits.
  • Step 3. our authoritative server expresses “where it hits” in terms such as “Player B got hit right between his eyes”4 and sends it as a message to the client (actually, to all the clients).
  • Step 4. this message is received by Game Logic FSM, and translated into the commands of the Logic-to-Graphics Layer (still without meshes and triangles, for example, “show laser ray from my gun to the point right-between-the-eyes-of-Player B”, and “show laser hit right between the eyes of Player B”), and these commands are sent (as messages) to Animation&Rendering FSM.
  • Step 5. Animation&Rendering FSM can finally render the whole thing.5

While the process is rather complicated, most of the steps are inherently inevitable for an MMO; the only thing which you could theoretically save compared to the procedure described above, is merging step 4 and step 5 together (by merging Game Logic FSM and Animation&Rendering together), but I advise against it as such merging would introduce too much coupling which will hit you in the long run. Doing such different things as parsing network messages and rendering within one tightly coupled module is rarely a good idea, and it becomes even worse if there is a chance that you will ever want to use some other Animation&Rendering FSM (for example, a newer one, or the one optimized for a different platform).

MMORPG interaction example (ragdoll). In a typical MMORPG example, when an NPC is hit for the 93rd time and dies as a result, ragdoll physics is activated. In a typical single-player game, once again, the whole thing is usually performed within the 3D engine. And once again, for an MMO the whole thing will be more complicated:

  • Step 1. button press (the one which will cause NPC death) goes to authoritative server
  • Step 2. server checks the attack radius, calculates chances to hit, finds that the hit is successful, decreases health, and finds that the NPC is dead
  • Step 3. server performs ragdoll simulation in the server-side 3D world. However, it doesn’t need to (nor can it really) send it to clients as a complete triangle-based animation. Instead, the server can usually send to the client only a movement of the “center of gravity” of the NPC in question (calculated as a result of 3D simulation). This movement of the “center of gravity” is sent to the client (either as a single message with the whole animation or as a series of messages with a “current position” each)
    • as an interesting side-effect: as the whole thing is quite simple, there may be no real need to calculate the whole limb movement, and it may suffice to calculate just a simple parabolic movement of the “center of gravity”, which MAY save you quite a bit of resources (both CPU and memory-wise) on the server side (!)
  • Step 4. Game Logic FSM receives the message with the “center of gravity” movement and translates it into Logic-to-Graphics commands. This doesn’t necessarily need to be trivial; in particular, it may happen that Game Logic stores a larger part of the game world than Animation&Rendering FSM. In this case, Game Logic FSM may want to check whether this specific ragdoll animation is within the scope of the current 3D world of Animation&Rendering FSM.
  • Step 5. Animation&Rendering FSM performs some  ragdoll simulation (it can be pretty much the same simulation which has already been made on the server side, or something completely different). If ragdoll simulation is the same, then the process of ragdoll simulation on the client-side will be quite close to the one on the server-side; however, if there are any discrepancies due to not-so-perfect determinism – client-side simulation will correct coordinates so that “center of gravity” is adjusted to the position sent by server. In case of non-deterministic behaviour between client and server, the movement of the limbs on the client and the server may be different, but for a typical RPG it doesn’t matter (what is really important is where the NPC eventually lands – here or over the edge of the cliff, but this is guaranteed to be the same for all the clients as “center of gravity” comes from the server side).

UI interaction example. In a typical MMORPG game, a very common task is to show object properties when the object is currently under the cursor. For the diagram above, it should be performed as follows:

  • Step 1. Game Logic FSM sends a request to the Animation&Rendering FSM: “what is the object ID at screen coordinates (X,Y)?” (where (X,Y) are cursor coordinates)
  • Step 2. Animation&Rendering FSM processes this (trivial) request and returns object ID back
  • Step 3. Game Logic FSM finds object properties by ID, translates them into text, and instructs Animation&Rendering FSM to display object properties in HUD

While this may seem like overkill, the overhead (both in terms of developer time and run time) is negligible, and the good old rule of “the more cleanly separated parts you have – the easier further development is” will more than compensate for the complexities of such separation.
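To make the request/reply nature of this exchange concrete, the messages involved might look roughly like the following (struct names and fields are mine, purely illustrative; note the request_id, which allows Game Logic FSM to keep processing other events and to match the reply whenever it arrives):

#include <cstdint>
#include <string>

struct ObjectAtScreenPosRequest {  // Game Logic FSM -> Animation&Rendering FSM
  uint32_t request_id;             // to match the reply to the request later
  int screen_x, screen_y;          // cursor coordinates
};

struct ObjectAtScreenPosReply {    // Animation&Rendering FSM -> Game Logic FSM
  uint32_t request_id;
  uint64_t object_id;              // 0 if there is nothing under the cursor
};

struct ShowHudTextCommand {        // Game Logic FSM -> Animation&Rendering FSM
  std::string text;                // already-formatted object properties
};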


4 this is generally preferable to player-unrelated “laser hit at (X,Y,Z)” in case of client-side prediction; of course, in practice you’ll use some coordinates, but the point is that it is usually better to use player-related coordinates rather than absolute game world coordinates
5 I won’t try to teach you how to render things; if you’re from 3D development side, you know much more about it than myself

 

FSMs and their respective States

The diagram on Fig. V.2 shows four different FSMs; while they all derive from our FiniteStateMachineBase described above, each of them is different, has a different function, and stores a different state. Let’s take a closer look at each of them.

Game Logic FSM

Game Logic FSM is the one which makes most of the decisions about your game world. More strictly, these are not exactly decisions about the game world in general (that one is maintained by our authoritative server), but about the client-side copy of the game world. In some cases it can be almost-trivial, in some cases (especially when client-side prediction is involved) it can be very elaborate.

In any case, Game Logic FSM is likely to keep a copy of the game world (or of the relevant portion of the game world) from the game server, as a part of its state. This copy has normally nothing to do with meshes, and describes things in terms such as “there is a PC standing at position (X,Y) in the game world coordinates, facing NNW”, or “There are cards AS and JH on the table”.

Game Logic FSM & Graphics

Probably the FSM most closely related to Game Logic FSM is the Animation&Rendering one. Most of the interaction between the two goes in the direction from Game Logic to Animation&Rendering, using Logic-to-Graphics Layer commands as messages. Game Logic FSM should instruct Animation&Rendering FSM to construct a portion of its own game copy as a 3D scene, and to update it as its own copy of the game world changes.

In addition, Game Logic FSM is going to handle (but not render) UI, such as HUDs, and various UI dialogs (including the dialogs leading to purchases, social stuff, etc.); this UI handling should be implemented in a very cross-platform manner, via sending messages to Animation&Rendering Engine. These messages, as usual, should be expressed in very graphics-agnostic terms, such as “show health at 87%”, or “show the dialog described by such-and-such resource”.

To handle UI, Game Logic FSM MAY send a message to Animation&Rendering FSM, requesting information such as “what object (or dialog element) is at such-and-such screen position” (once again, the whole translation between screen coordinates into world objects is made on the Animation&Rendering side, keeping Game Logic FSM free of such information); on receiving reply, Game Logic FSM may decide to update HUD, or to do whatever-else-is-necessary.

Other messages coming from Animation&Rendering FSM to Game Logic FSM, such as “notify me when the bullet hits the NPC”, MAY be necessary for client-side prediction purposes (see Chapter [[TODO]] for further discussion). On the other hand, it is very important to understand that these messages are non-authoritative by design, and that their results can be easily overridden by the server.

As you can see, there can be quite a few interactions between Game Logic FSM and Animation&Rendering FSM. Still, while it may be tempting to combine Game Logic FSM with Animation&Rendering FSM, I would advise against it, at least for games with many platforms to be supported, and for games with an Undefined Life Span; having these two FSMs separate (as shown on Fig V.2) will ensure much cleaner separation, facilitating much-better-structured code in the medium to long run. On the other hand, having these two FSMs running within the same thread is a very different story: it is generally ok and can be decided even on a per-platform basis; see the “Variations” section below.

Game Logic FSM: Miscellaneous

There are two other things which need to be mentioned with regards to Game Logic FSM:

  • You MUST keep your Game Logic FSM truly platform-independent. While all the other FSMs MAY be platform-specific (and the separation between FSMs along the lines described above facilitates platform-specific development when/if it becomes necessary), you should make every possible effort to keep your Game Logic the same across all your platforms. The reason for it has already been mentioned before, and it is all about Game Logic being the most volatile of all your client-side code; it changes so often that you won’t be able to keep several code bases reasonably in sync.
  • If by any chance your Game Logic is so CPU-consuming that one single core won’t cope with it – in most cases this can be addressed without giving up the goodies of an FSM-based system; see the “Additional Threads and Task-Based Multi-Threading” section below.

Animation&Rendering FSM

Animation&Rendering FSM is more or less similar to the rendering part of your usual single-player game engine. If your game is a 3D one, then in the diagram above,

it is Animation&Rendering FSM which keeps and cares about all the meshes, textures, and animations; as a Big Fat Rule of Thumb, nobody else in the system (including Game Logic FSM) should know about them.

At the heart of the Animation&Rendering FSM there is a more or less traditional Game Loop.

Game Loop

Most single-player games are based on a so-called Game Loop. A classical game loop looks more or less as follows (see, for example, [GameProgrammingPatterns.GameLoop]):

while(true) {        // one iteration per frame
  process_input();   // poll (don’t wait for!) user input
  update();          // advance the game world
  render();          // draw the current frame
}

Usually, a Game Loop doesn’t wait for input, but rather polls input and goes ahead regardless of whether any input is present. This is pretty close to what is often done in real-time control systems.

For our diagram on Fig V.2 above, within our Animation&Rendering Thread we can easily have something very similar to a traditional Game Loop (with a substantial part of it going within our Animation&Rendering FSM). Our Animation&Rendering Thread can be built as follows (a minimal code sketch follows the list):

  • Animation&Rendering Thread (outside of Animation&Rendering FSM) checks if there is anything in its Queue; unlike other Threads, it MAY proceed even if there is nothing in the Queue

  • it passes whatever-it-received-from-the-Queue (or some kind of NULL if there was nothing) to Animation&Rendering FSM, along with any time-related information

  • within the Animation&Rendering FSM’s process_event(), we can still have process_input(), update() and render(), however:

    • there is no loop within Animation&Rendering FSM; instead, as discussed above, the Game Loop is a part of larger Animation&Rendering Thread

    • process_input(), instead of processing user input, processes instructions coming from Game Logic FSM

    • update() updates only the 3D scene to be rendered, and not the game logic’s representation of the game world; all the decision-making is moved at least to the Game Logic FSM, with most of the decisions actually being made by our authoritative server

    • render() works exactly as it worked for a single-player game

  • after Animation&Rendering FSM processes input (or lack thereof) and returns, Animation&Rendering Thread may conclude Game Loop as it sees fit (in particular, it can be done in any classical Game Loop manner mentioned below)

  • then, Animation&Rendering Thread goes back to the very beginning (back to checking if there is anything in its Queue), which completes the infinite Game Loop.
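A minimal sketch of such a thread might look as follows (assuming the hypothetical Queue sketched earlier plus a non-blocking try_pop(), and an Animation&Rendering FSM whose process_event() accepts ‘no message’ and the elapsed time; all names are illustrative):

#include <chrono>

void animation_and_rendering_thread(Queue& q, AnimationAndRenderingFSM& fsm) {
  auto prev = std::chrono::steady_clock::now();
  for (;;) {                        // the Game Loop lives in the Thread...
    Message msg;
    bool has_msg = q.try_pop(msg);  // unlike other threads, does NOT block
    auto now = std::chrono::steady_clock::now();
    double elapsed = std::chrono::duration<double>(now - prev).count();
    prev = now;
    // ...while process_input()/update()/render() live inside the FSM;
    // elapsed time is passed in explicitly to keep the FSM deterministic
    fsm.process_event(has_msg ? &msg : nullptr, elapsed);
    // conclude the iteration in whatever classical Game Loop manner you
    // prefer (fixed time step with a delay at the end, VSYNC, etc.)
  }
}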

All the usual variations of the Game Loop can be used within the Animation&Rendering Thread – including such things as a fixed time step with a delay at the end if there is time left until the next frame, a variable-time-step tight loop (in this case a parameter such as elapsed_time needs to be fed to the Animation&Rendering FSM to keep it deterministic), and a fixed-update-time-step-but-variable-render-time-step tight loop. Any further improvements (such as using VSYNC) can be added on top. I don’t want to elaborate further here, and instead refer you to two excellent sources for further discussion of game loops and time steps: [GafferOnGames.FixYourTimestep] and [GameProgrammingPatterns.GameLoop].

One variation of the Game Loop that is not discussed there is a simple event-driven thing which you would use for your usual Windows programming (and without any tight loops); in this case animation under Windows can be done via WM_TIMER,6 and 2D drawing – via something like BitBlt(). While usually woefully inadequate for any serious frames-per-second-oriented games, it has been seen to work very well for social- and casino-like ones.

However, the best thing about our architecture is that the architecture as such doesn’t really depend on time step choices; you can even make different time step choices for different platforms and still keep the rest of your code (beyond Animation&Rendering Thread) intact, though Animation&Rendering FSM may need to be somewhat different depending on the fixed-step vs variable-step choice.7

Animation&Rendering FSM: Running from Game Logic Thread

For some games and/or platforms it might be beneficial to run Animation&Rendering FSM within the same thread as Game Logic FSM. In particular, if your game is a social game running on Windows, there may be no real need to use two separate CPU cores for Game Logic and Animation&Rendering, and the whole thing will be quite ok running within one single thread. In this case, you’ll have one thread, one Queue, but two FSMs, with the thread code outside of the FSMs deciding which of the FSMs each incoming message belongs to.

However, even in this case I still urge you to keep it as two separate FSMs with a very clean message-based interface between them. First, nobody knows which platform you will need to port your game to next year, and second, clean well-separated interfaces at the right places tend to save lots of trouble in the long run.


6 yes, this does work, despite being likely to cause ROFLMAO syndrome for any game developer familiar with game engines
7 of course, technically you may write your Animation&Rendering FSM as a variable-step one and use it for the fixed-step too, but there is a big open question if you really need to go the variable-step, or can live with a much simpler fixed-step forever-and-ever

 

Communications FSM

Another FSM, which is all-important for your MMOG, is the Communications FSM. The idea here is to keep all the communications-related logic in one place. This may include very different things, from plain socket handling to such things as connect/reconnect logic8, connection quality monitoring, encryption logic if applicable, etc. etc. Implementations of higher-level concepts such as generic publisher/subscriber, generic state synchronization, messages-which-can-be-overridden, etc. (see Chapter [[TODO]] for further details) also belong here.

For most (if not all) platforms, the code of the Communications FSM can be kept the same, with the only things called from within the FSM being your own wrappers around sockets (for C/C++ – Berkeley sockets). Your own wrappers are nice to have just in case some other platform has peculiar ideas about sockets, or to make your system use something like OpenSSL in a straightforward manner. They are also necessary to implement “call interception” on your FSM (see the “Implementing Strictly-Deterministic Logic: Strictly-Deterministic Code via Intercepting Calls” section above), allowing you to “replay test” and to run post-mortem analysis of your Communications FSM.

The diagram of Fig. V.2 shows an implementation of the Communications FSM that uses non-blocking socket calls. For client-side it is perfectly feasible to keep the code of Communications FSM exactly the same, but to deploy it in a different manner, simulating non-blocking sockets via two additional threads (one to handle reading and another to handle writing), with these additional threads communicating with the main Communications Thread via Queues (using Communication Thread’s existing Queue, and one new Queue per new thread).9

One more thing to keep in mind with regards to blocking/non-blocking Berkeley sockets is that the getaddrinfo() function (as well as the older gethostbyname() function) used for DNS resolution is inherently blocking, with many platforms having no non-blocking counterpart. However, for the client side in most cases it is a non-issue, unless you decide to run your Communications FSM within the same thread as your Game Logic FSM. In the latter case, calling a function with the potential to block for minutes can easily freeze not only your game (which is more or less expected in case of connectivity problems), but also the game UI (which is not acceptable regardless of network connectivity). To avoid this effect, you can always introduce yet another thread (with its own Queue), with the only thing for this thread to do being to call getaddrinfo() when requested, and to send the result back as a message when the call is finished.10
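A minimal sketch of such a resolver thread might look as follows (again assuming the hypothetical Queue from above; extract_host() and make_resolved_message() are illustrative marshalling helpers, not anything standard):

#include <netdb.h>        // getaddrinfo()/freeaddrinfo(); ws2tcpip.h on Windows
#include <sys/socket.h>   // SOCK_STREAM
#include <string>

std::string extract_host(const Message& req);                              // illustrative: unmarshal the host name
Message make_resolved_message(const std::string& host, int err,
                              const addrinfo* res);                        // illustrative: marshal the results

void dns_resolver_thread(Queue& in_queue, Queue& comm_fsm_queue) {
  for (;;) {
    Message req = in_queue.pop();            // blocks until a request arrives
    std::string host = extract_host(req);
    addrinfo hints = {};
    hints.ai_socktype = SOCK_STREAM;
    addrinfo* res = nullptr;
    int err = getaddrinfo(host.c_str(), nullptr, &hints, &res);  // may block for a long time
    // post the resolved addresses (or the error) back to the Communications
    // FSM as just another message
    comm_fsm_queue.push(make_resolved_message(host, err, res));
    if (res)
      freeaddrinfo(res);
  }
}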

Communications FSM: Running from Game Logic Thread

For the Communications FSM, running it from the Game Logic Thread might be possible. One reason against doing it would be if your communications are encrypted and your Game Logic is computationally intensive.

And again, as with Animation&Rendering FSM, even if you run two FSMs from one single thread, it is much better to keep them separate. One additional reason to keep things separate (with this reason being specific to Communications FSM) is that Communications FSM (or at least large parts of it) is likely to be used on the server-side too.


8 BTW, connect/reconnect will be most likely needed even for UDP
9 for the server-side, however, these extra threads are not advisable due to the performance overhead. See Chapter [[TODO]] for more details
10 Alternatively, it is also possible to create a new thread for each getaddrinfo() (with such a thread performing getaddrinfo(), reporting result back and terminating). This thread-per-request solution would work, but it would be a departure from QnFSM, and it can lead to creating too many threads in some fringe scenarios, so I usually prefer to keep a specialized thread intended for getaddrinfo() in a pure QnFSM model

 

Sound FSM

Sound FSM handles, well, sound. In a sense, it is somewhat similar to Animation&Rendering FSM, but for sound. Its interface (and as always with QnFSM, interfaces are implemented over messages) needs to be implemented as a kind of “Logic-to-Sound Layer”. This “Logic-to-Sound Layer” message-based API should be conceptually similar to “Logic-to-Graphics Layer” with commands going from the Game Logic expressed in terms of “play this sound at such-and-such volume coming from such-and-such position within the game world”.

Sound FSM: Running from Game Logic Thread

For the Sound FSM, running it from the same thread as Game Logic FSM quite often makes sense. On the other hand, on some platforms sound APIs (while being non-blocking in the sense that they return before the sound ends) MAY cause substantial delays, effectively blocking while the sound function finds and parses the file header etc.; while this is still obviously shorter than waiting until the sound ends, it might not be short enough depending on your game. Therefore, keeping the Sound FSM in a separate thread MAY be useful for fast-paced frames-per-second-oriented games.

And once again – even if you decide to run two FSMs from the same thread – do yourself a favour and keep the FSMs separate; some months down the road you’ll be very happy that you kept your interfaces clean and different modules nicely decoupled.11


11 Or you’ll regret that you didn’t do it, which is pretty much the same thing

 

Other FSMs

While not shown on the diagram on Fig V.2, there can be other FSMs within your client. For example, these FSMs may run in their own threads, but other variations are also possible.

One practical example of such a client-side FSM (one which was actually implemented) is an “update FSM” which handled online download of DLC while making sure that the gameplay delays were within acceptable margins (see more on client updates in general and updates-while-playing in Chapter [[TODO]]).

In general, any kind of entity which performs mostly-independent tasks on the client-side, can be implemented as an additional FSM. While I don’t know of practical examples of extra client-side FSMs other than “update FSM” described above, it doesn’t mean that your specific game won’t allow/require any, so keep your eyes open.

On Additional Threads and Task-Based Multithreading

If your game is very CPU-intensive, and either your Game Logic Thread or your Animation&Rendering Thread becomes overloaded beyond the capabilities of one single CPU core, you might need to introduce an additional thread or five into the picture. This is especially likely for the Animation&Rendering Thread/FSM if your game uses serious 3D graphics. While the complexities of the threading models of 3D graphics engines are well beyond the scope of this book, I will try to provide a few hints for those who’re just starting to venture there.

As usual with multi-threading, if you’re not careful, things can easily become ugly, so in this case:

  • first of all, take another look to see whether you have some Gross Inefficiencies in your code; it is usually much better to remove these than to try to parallelize. For example, if you have calculated Fibonacci numbers recursively, it is much better to switch to a non-recursive implementation (which IIRC has a humongous O(2^N) advantage over the recursive one12) than to try getting more and more cores working on unnecessary stuff.
  • From this point on, to the best of my knowledge you have about three-and-a-half options:
    • Option A. The first option is to split the whole thing into several FSMs running within several threads, dedicating one thread to one specific task. In the 3D rendering world, this is known as “System-on-a-Thread”, and was used by the Halo engine (in Halo, they copy the whole game state between threads [GDC.Destiny], which is equivalent to having a queue, so this is a very direct analogy of our QnFSM).
    • Option B. The second option is to “off-load” some of the processing to a different thread, with this new thread being just like all the other threads on Fig V.2; in other words, it should have an input queue and an FSM within. This is known as “Task-Based Multithreading” [GDC.TaskBasedMT]. In this case, after doing its (very isolated) part of the job, a.k.a. “task”, the thread may report back to whichever-thread-has-requested-its-services. This option is really good for several reasons, from keeping all the FSM-based goodies (such as “replay testing” and post-mortem) for all parts of your client, to encouraging a multi-threading model with very few context switches (known as “coarse-grained parallelism”), and context switches are damn expensive on all general-purpose CPUs.13 The way “task off-loading” is done depends on the implementation. In some implementations, we MAY use data-driven pipelines (similar to those described in [GDC.Destiny]) to enable dynamic task balancing, which allows optimizing core utilization on different platforms. Note that in pure “Option B” we still have a shared-nothing model, so each of the FSMs has its own exclusive state. On the other hand, for serious rendering engines, due to the sheer size of the game state, a pure “shared-nothing” approach MIGHT not be too feasible.
      • Option B1. That’s the point where “task-off-loading-with-an-immutable-shared-state” emerges (a minimal sketch of this idea follows this list). It improves14 over a basic Option B by allowing for a very-well-controlled use of a shared state – namely, sharing is allowed only when the shared state is guaranteed to be immutable. It means that, in a limited departure from our shared-nothing model, in addition to the inter-thread queues in our QnFSM, we MAY have a shared state. However, to avoid those nasty inter-thread problems, we MUST guarantee that while there is more than one thread which can be accessing the shared state, the shared state is constant/immutable (though it may change outside of “shared” windows). At the moment, it is unclear to me whether the Destiny engine (as described in [GDC.Destiny]) uses Option B1 (with an immutable game state shared between threads during “visibility” and “extract” phases) – while it looks likely, it is not 100% clear. In any case, both Option B and Option B1 can be described more or less in terms of QnFSM (and most importantly – both eliminate all the non-maintainable and inefficient tinkering with mutexes etc. within your logic). From the point of view of determinism, Option B1 is equivalent to Option B, provided that we consider the immutable-shared-state as one of our inputs (as it is immutable, it is indistinguishable from an input, though delivered in a somewhat different way); while such state sharing would effectively preclude applying recording/replay in production (as recording the whole game state on each frame would be too expensive), determinism can still be used for regression testing etc.
    • Option C. To throw away “replay debug” and post-mortem benefits for this specific FSM, and to implement it using multi-thread in-whatever-way-you-like (i.e. using traditional inter-thread synchronization stuff such as mutexes, semaphores, or Dijkstra forbid – memory fences etc. etc.).
      • This is a very dangerous option, and it is to be avoided for as long as possible. However, there are some cases when clean separation between the main-thread-data and the data-necessary-for-the-secondary-thread is not feasible, usually because the piece of data to be used by both parallel processes is too large; it is in these cases (and to the best of my knowledge, only in these cases) that you need to choose Option C. And even in these cases, you might be able to stay away from handling fine-grained thread synchronization, see Chapter [[TODO]] for some hints in this direction.
      • Also, if you need Option C for your Game Logic – think twice, and then twice more. As Game Logic is the one which changes a damn lot, with Option C this has all the chances of becoming unmanageable (see, for example, [NoBugs2015]). It is that bad, that if you run into this situation, I would seriously think whether the Game Logic requirements are feasible to implement (and maintain) at all.
      • On the positive side, it should be noted that even in such an unfortunate case you should be losing FSM-related benefits (such as “replay testing” and post-mortem) only for the FSM which you’re rewriting into Option C; all the other FSMs will still remain deterministic (and therefore, easily testable).
    • In any case, your multi-threaded FSM SHOULD look as a normal FSM from the outside. In other words, multi-threaded implementation SHOULD be just this – implementation detail of this particular FSM, and SHOULD NOT affect the rest of your code. This is useful for two reasons. First, it decouples things and creates a clean well-defined interface, and second, it allows you to change implementation (or add another one, for example, for a different platform) without rewriting the whole thing.
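As an illustration of the Option B1 idea (this is my own sketch, not code from any of the engines mentioned above), the shared state can be handed to the worker threads via a pointer-to-const, so that while the state is shared, the compiler helps to keep it immutable for the duration of the off-loaded tasks; pop_task() and process_slice() are illustrative helpers:

#include <memory>

struct RenderSnapshot { /* whatever the off-loaded tasks need, built once per frame */ };

struct RenderTask {                                 // message posted to a worker’s Queue
  std::shared_ptr<const RenderSnapshot> snapshot;   // shared, but immutable
  int first_object, last_object;                    // the slice this task covers
};

RenderTask pop_task(Queue& q);                      // illustrative: pop() + unmarshal
Message process_slice(const RenderSnapshot& snap,
                      int first_object, int last_object);  // illustrative: the actual work

// each worker is still a normal QnFSM-style thread: pop a task, do the work,
// and report the result back as just another message
void worker_thread(Queue& in_queue, Queue& report_queue) {
  for (;;) {
    RenderTask task = pop_task(in_queue);
    // task.snapshot points to const data: no writes possible, no mutexes needed
    report_queue.push(process_slice(*task.snapshot,
                                    task.first_object, task.last_object));
  }
}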

12 that is, if you’re not programming in Haskell or something similar
13 GPGPUs are the only place I know of where context switches are cheap, but usually we’re not speaking about GPGPUs for these threads
14 or “degrades”, depending on the point of view

 

On Latencies

One question which may arise for queue-based architectures and fast-paced games is about the latencies introduced by those additional queues (we do want to show the data to the user as fast as possible). My experience shows that15 we’re speaking about an additional latency16 of the order of single-digit microseconds. Probably it can be lowered further into the sub-microsecond range by using less trivial non-blocking queues, but this I’m not 100% sure of because of the relatively expensive allocations usually involved in marshalling/unmarshalling; for further details on implementing high-performance low-latency queues in C++, please refer to Chapter [[TODO]]. As this single-digit-microsecond delay is at least 3 orders of magnitude smaller than the inter-frame delay of 1/60 sec or so, I am arguing that nobody will ever notice the difference, even for single-player or LAN-based games; for Internet-based MMOs, where the absolute best we can hope for is a 10ms delay,17 it is even less relevant.

In short – I don’t think this additional single-digit-microsecond delay can possibly have any effect which is visible to end-user.


15 assuming that the thread is not busy doing something else, and that there are available CPU cores
16 introduced by a reasonably well-designed message marshalling/unmarshalling + reasonably well-designed inter-process single-reader queue
17 see Chapter [[TODO]] for conditions when such delays are possible before hitting me too hard

 

Variations

The diagram on Fig V.2 shows each of the FSMs running within its own thread. On the other hand, as noted above, each of the FSMs can be run in the same thread as Game Logic FSM. In the extreme case it results in a system where all the FSMs are running within a single thread, with the corresponding diagram shown on Fig V.3:

Fig V.3. MMOG Client Architecture Diagram, Single Thread

Each and every FSM on Fig V.3 is exactly the same as the corresponding FSM on Fig V.2; moreover, logically, these two diagrams are exactly equivalent (and a “recording” from one can be “replayed” on the other). The only difference on Fig V.3 is that we’re using the same thread (and the same Queue) to run all our FSMs. The FSM Selector here is just a very dumb selector, which looks at the destination-FSM field (set by whoever-sent-the-message) and routes the message accordingly.
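In code terms, such a ‘dumb’ FSM Selector might look roughly as follows (a sketch only; destination_fsm(), the Dest enum, and the FSM class names are all illustrative):

enum class Dest { GameLogic, Rendering, Comm, Sound };
Dest destination_fsm(const Message& msg);   // illustrative: reads the destination field

void single_threaded_main_loop(Queue& q, GameLogicFSM& logic,
                               AnimationAndRenderingFSM& render,
                               CommunicationsFSM& comm, SoundFSM& sound) {
  for (;;) {
    Message msg = q.pop();            // one single Queue serving all the FSMs
    switch (destination_fsm(msg)) {   // field set by whoever-sent-the-message
      case Dest::GameLogic:  logic.process_event(msg);  break;
      case Dest::Rendering:  render.process_event(msg); break;
      case Dest::Comm:       comm.process_event(msg);   break;
      case Dest::Sound:      sound.process_event(msg);  break;
    }
  }
}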

This kind of threading could be quite practical, for example, for a casino or a social game. However, not all platforms allow waiting on select() in the main graphics loop, so you may need to resort to the configuration on Fig V.4:

Fig V.4. MMOG Client Architecture Diagram, with Socket Thread

Here the Sockets Thread is very simple and doesn’t contain any substantial logic; all it does is push whatever-it-got-from-the-Queue to the socket, and whatever-it-got-from-the-socket to the Queue of the Main Thread; all the actual processing will be performed there, within the Communications FSM.

Another alternative is shown on Fig V.5:

Fig V.5. MMOG Client Architecture Diagram, with Communication Thread

Both Fig V.4 and Fig V.5 will work for a social or casino-like game on Windows.18

On the other end of the spectrum, lie such heavy-weight implementations as the one shown on Fig V.6:

Fig V.6. MMOG Client Architecture Diagram, Process-based

Here, Animation&Rendering FSM and Communications FSM run in their own processes. This approach might be useful during testing (in general, you may even run FSMs on different developers’ computers if you prefer this kind of interactive debugging). However, for production it is better to avoid such configurations, as inter-process interfaces may help bot writers.

Overall, the exact thread (and even process) configuration you will deploy is not that important and may easily be system-dependent (or even situation-dependent, as in “for the time being, we’ve decided to separate this FSM into a separate process to debug it on the respective developer’s machine”). What really matters is that

as long as you’re keeping your development model FSM-based, you can deploy it in any way you like without any changes to your FSMs.

In practice, this property has been observed to provide quite a bit of help in the long run. While this effect has significantly more benefits on the server-side (and will be discussed in Chapter [[TODO]]), it has been seen to aid client-side development too; for example, different configurations for different platforms do provide quite a bit of help. In addition, situation-dependent configurations have been observed to help a bit during testing.


18 While on Windows it is possible to create both “| select()” and “| user-input” queues, I don’t know how to create one single queue which will be both “| select()” and “| user-input” simultaneously, without resorting to a ‘dumb’ extra thread; more details on these and other queues will be provided in Chapter [[TODO]]

On Code Bases for Different Platforms

As noted above, you MUST keep your Game Logic FSM the same for all the platforms (i.e. as a single code base). Otherwise, given the frequent changes to Game Logic, all-but-one of your code bases will most likely start to fall behind, to the point of being completely useless.

But what about other FSMs? Do you need to keep them as a single code base? The answer here is quite obvious:

while the architecture shown above allows you to make non-Game-Logic FSMs platform-specific, it makes perfect sense to keep them the same as long as possible

For example, if your game is graphics-intensive, there can be really good reasons to have your Animation&Rendering FSM different for different platforms; for example, you may want to use DirectX on some platforms, and OpenGL on some other platforms (granted, it will be quite a chunk of work to implement both of them, but at least it is possible with the architecture above, and it becomes a potentially viable business choice, especially as the OpenGL and DirectX versions can be developed in parallel).

On the other hand, chances that you will need a platform-specific Communications FSM are much lower.19 Even if you’re writing in C/C++, usable implementations of Berkeley sockets exist on most (if not all) platforms of interest. Moreover, the behavior of sockets on different platforms is quite close from a game developer’s point of view (at least with regards to those things which we are able to affect).

So, while all such choices are obviously specific to your specific game, statistically you should have many more Animation&Rendering FSMs than Communications FSMs :-) .


19 I don’t count conditional inclusion of WSAStartup() etc. as being really platform-specific

 

QnFSM Architecture Summary

The Queues-and-FSMs Architecture shown on Fig V.2 (as well as its variations on Fig V.3–Fig V.6) is quite an interesting beast. In particular, while it does ensure a clean separation between parts (FSMs in our case), it tends to go against commonly used patterns of COM-like components or even usual libraries. The key difference here is that COM-like components are essentially based on blocking RPC, so after you call a COM-like RPC20, you’re blocked until you get a reply. With the FSM-based architecture from Fig V.2–V.6, even if you’re requesting something from another FSM, you can still (and usually should) process events coming in while you’re waiting for the reply. See in particular the [[TODO!! add subsection on callbacks to FSM]] section above.

From my experience, while developers usually see this kind of FSM-based programming as somewhat more cumbersome than usual procedure-call-based programming, most of them agree that it is beneficial in the medium to long run. This is also supported by the experiences of people writing in Erlang, which has almost exactly the same approach to concurrency (except for certain QnFSM goodies, see also the “Relation to Erlang” section below). As advantages of the QnFSM architecture, we can list the following:

  • very good separation between different modules (FSMs in our case). FSMs and their message-oriented APIs tend to be isolated very nicely (sometimes even a bit too nicely, but this is just another side of the “somewhat more cumbersome” negative listed above).
  • “replay testing“ and post-mortem analysis. See “Strictly-Deterministic Logic: Benefits” section above.
  • very good performance. While usually it is not that important for client-side, it certainly doesn’t hurt either. The point here is that with such an architecture, context switches are kept to the absolute minimum, and each thread is working without any pauses (and without any overhead associated with these pauses) as long as it has something to do. On the flip side, it doesn’t provide inherent capabilities to scale (so server-side scaling needs to be introduced separately, see Chapter [[TODO]]), but at least it is substantially better than having some state under the mutex, and trying to lock this mutex from different threads to perform something useful.

We will discuss more details on this Queues-and-FSMs architecture as applicable to the server-side, in Chapter [[TODO]], where its performance benefits become significantly more important.

Relation to Actor Concurrency

NB: this subsection is entirely optional, feel free to skip it if theory is of no interest to you

From a theoretical point of view, the QnFSM architecture can be seen as a system which is pretty close to the so-called “Actor Concurrency Model” (that is, unless Option C from “Additional Threads and Task-Based Multithreading” is used), with QnFSM’s deterministic FSMs being Actor Concurrency’s ‘Actors’. However, there is a significant difference between the two, at least perceptionally. Traditionally, Actor concurrency is considered a way to ensure concurrent calculations; that is, the calculation in question is originally a “pure” calculation, with all the parameters known in advance. With games, the situation is very different, because we don’t know everything in advance (by definition). This has quite a few implications.

Most importantly, system-wide determinism (lack of which is often considered a problem for Actor concurrency when we’re speaking about calculations) is not possible for games.21 In other words, games (more generally, any distributed interactive system which produces results substantially dependent on timing; dependency on timing can be either absolute, like “whether the player pressed the button before 12:00”, or relative such as “whether player A pressed the button before player B”) are inherently non-deterministic when taken as a whole. On the other hand, each of the FSMs/Actors can be made completely deterministic, and this is what I am arguing for in this book.

In other words – while QnFSM is indeed a close cousin of Actor concurrency, quite a bit of the analysis made for Actor-concurrency-for-HPC types of tasks is not exactly applicable to inherently time-dependent systems such as games, so take it with a big pinch of salt.


20 also DCE RPC, CORBA RPC, and so on; however, game engine RPCs are usually very different, and you’re not blocked after the call, in exchange for not receiving anything back from the call
21 the discussion of this phenomenon is out of scope of this book, but it follows from inherently distributed nature of the games, which, combined with Einstein’s light cone and inherently non-deterministic quantum effects when we’re organizing transmissions from client to server, mean that very-close events happening for different players, may lead to random results when it comes to time of arrival of these events to server. Given current technologies, determinism is not possible as soon as we have more than one independent “clock domain” within our system (non-deterministic behaviour happens at least due to metastability problem on inter-clock-domain data paths), so at the very least any real-world multi-device game cannot be made fully deterministic in any practical sense.

 

Relation to Erlang Concurrency and Akka Actors

On the other hand, if we look at Erlang concurrency (more specifically, at the ! and receive operators), or at Akka’s Actors, we will see that QnFSM is pretty much the same thing.22 There is no shared state, everything goes via message passing, et cetera, et cetera, et cetera.

The only significant difference is that for QnFSM I am arguing for determinism (which is not guaranteed in Erlang/Akka, at least not without “call interception”; on the other hand, you can write deterministic actors in Erlang or Akka the same way as in QnFSM, it is just an additional restriction you need to keep in mind and enforce). Other than that, and some of those practical goodies in QnFSM (such as recording/replay with all the associated benefits), QnFSM is extremely close to Erlang’s concurrency (as well as to Akka’s Actors, which were inspired by Erlang) from a developer’s point of view.

Which can be roughly translated into the following observation:

to have a good concurrency model, it is not strictly necessary to program in Erlang or to use Akka

22 While both Erlang and Akka zealots will argue ad infinitum that their favourite technology is much better, from our perspective the differences are negligible

 

Bottom Line for Chapter V

Phew, it was a long chapter. On the other hand, we’ve managed to provide a 50’000-feet (and 20’000-word) view on my favorite MMOG client-side architecture. To summarize and re-iterate my recommendations in this regard:

  • Think about your graphics, in particular whether you want to use pre-rendered 3D or whether you want/need dual graphics (such as 2D+3D); this is one of the most important questions for your game client;23 moreover, client-side 3D is not always the best choice, and there are quite a few MMO games out there which have rudimentary graphics
    • if your game is an MMOFPS or an MMORPG, most likely you do need fully-fledged client-side 3D, but even for an MMORTS the answer can be not that obvious
  • when choosing your programming language, think twice about resilience to bot writers, and also about those platforms you want to support. While the former is just one of those things to keep in mind, the latter can be a deal-breaker when deciding on your programming language
    • Usually, C++ is quite a good all-around candidate, but you need to have pretty good developers to work with it
  • Write your code in a deterministic event-driven manner (as described in “Strictly-Deterministic Logic” and “Event-Driven Programming and Finite State Machines” sections), it helps, and helps a lot
    • This is not the only viable architecture, so you may be able to get away without it, but at the very least you should consider it and understand why you prefer an alternative one
    • The code written this way magically becomes a deterministic FSM, which has lots of useful implications
    • Keep all your FSMs perfectly self-contained, in a “Share-Nothing” model. It will help in quite a few places down the road.
    • Feel free to run multiple FSMs in a single thread if you think that your game and/or current platform is a good fit, but keep those FSMs separate; it can really save your bacon a few months later.
    • Keep one single code base for Game Logic FSM. For other FSMs, you may make different implementations for different platforms, but do it only if it becomes really necessary.

23 yes, I know I’m putting on my Captain Obvious’ hat once again here

 

[[To Be Continued…

This concludes beta Chapter V(d) from the upcoming book “Development and Deployment of Massively Multiplayer Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter VI, “Modular Architecture: Server-Side. Naive and Classical Deployment Architectures”.]]



Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Chapter VI(a). Server-Side MMO Architecture. Naïve, Web-Based, and Classical Deployment Architectures


Cover of the upcoming book
[[This is Chapter VI(a) from the upcoming book “Development&Deployment of Massively Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MMOG: Table of Contents.]]

After drawing all those nice client-side QnFSM-based diagrams, we need to describe our server architecture. The very first thing we need to do is to start thinking in terms of “how are we going to deploy our servers when our game is ready?” Yes, I really mean it – architecture starts not in terms of classes, and for the server-side – not even in terms of processes or FSMs; it starts with the highest-level meaningful diagram we can draw, and for the server-side this is a deployment diagram with servers being its main building blocks. If deploying to a cloud, these may be virtual servers, but the concept of a “server” – a “more or less self-contained box running our server-side software” – still remains central to the server-side software. If you’re not thinking about clear separation between the pieces of your software, you can easily end up with a server-side architecture that looks nice while you program it, but falls apart on the third day after deployment, exactly when you’re starting to think that your game is a big success.

Building from servers

Deployment Architectures, Take 1

In this Chapter we’ll discuss only “basic” deployment architectures. These architectures are “basic” in the sense that they’re usually sufficient to deploy your game and run it for several months, but as your game grows, further improvements may become necessary. Fortunately, these improvements can be made later, when/if the problems with the basic deployment architecture arise; they will be discussed in Chapter [[TODO]].

Also note that for your very first deployment, you may have many fewer physical/virtual boxes than shown on the diagram, by combining quite a few of them together. On the other hand, you should be able to increase the number of your servers quickly, so you need to have software which is able to work in the basic deployment architecture from the very beginning. This is important, as the demand for an increase in the number of servers can develop very soon if you’re successful. We’ll discuss your very first deployment in Chapter [[TODO]].

First, let’s start with an architecture you shouldn’t do.

Don’t Do It: Naïve Game Deployment Architectures

Quite often, when faced with developing their very first multi-player game, developers start with something like the following (Fig VI.1):

Fig VI.1. Naïve Game Deployment Architecture, initial stage

It is dead simple: there is a server, and there is a database to store persistent state. And later on, as one single Game World server proves to be insufficient, it naturally evolves into something like the diagram on Fig VI.2:

Fig VI.2. Naïve Game Deployment Architecture, extensive expansion

with each of Game World servers having its own database.

My word of advice about such naïve deployment architectures:

DON’T DO THIS!

Such a naïve approach won’t work well for the vast majority of games. The problem here (usually ranging from near-fatal to absolutely-fatal depending on the specifics of your game) is that this architecture doesn’t allow for interaction between players coming from different servers. In particular, such an architecture becomes absolutely deadly if your game allows some way for a player to choose who he’s playing with (or if you have some kind of merit-based tournament system); in other words – if you’re not allowed to arbitrarily separate your players (and in most cases you will need some kind of interaction, at least because of the social network integration, see Chapter II for further discussion in this regard).

For the naïve architecture shown on Fig VI.2, any interaction between separate players coming from separate databases leads to huge mortgage-crisis-size problems. Inter-DB interaction, while possible (and we’ll discuss it in Chapter [[TODO]]), won’t work well along these lines, between completely independent databases. You’re going to have lots and lots of problems, ranging from delays due to improperly implemented inter-DB transactions (apparently this is not that easy), to your CSRs going crazy because of two different users having the same ID in different databases. Moreover, if you start like this, you will even have trouble merging the databases later (the very first problem you will face will be about collisions in user names between different DBs, with much more to follow).

To summarize relevant discussion from Chapter II and from present Chapter:

A. You WILL need inter-player interaction between arbitrary players. If not now, then later.
B. Hence, you SHOULD NOT use the “naïve” architecture shown above.

Fortunately, there are relatively simple and practical architectures which allow you to avoid the problems typical for the naïve approaches shown above.

Web-Based Game Deployment Architecture

If your game satisfies two conditions:

  • first, it is reeeeallyyyy sloooow-paaaaaced (in other words, it is not an MMOFPS and even not a poker game) and/or “asynchronous” (as defined in Chapter I, i.e. it doesn’t need players to be present simultaneously),
  • and second, it has little interaction between players (think farming-like games with only occasional inter-player interaction),

then you might be able to get away with Web-Based server-side architecture, shown on Fig VI.3:

Fig VI.3. Web-based Game Deployment Architecture

Web-Based Deployment Architecture: How It Works

The whole thing looks along the lines of a heavily-loaded web app – with lots of caching, both at the front-end (to cache pages) and at the back-end. However, there are also significant differences (special thanks to Robert Zubek for sharing his experiences in this regard, [Zubek2016]).

The question “which web server to use” is not that important here. On the other hand, there exists an interesting and not-so-well-known web server which went an extra mile to improve communications in game-like environments. I’m speaking about [Lightstreamer]. I didn’t try it myself, so I cannot vouch for it, but what they’re doing with regards to improving interactivity over TCP is really interesting. We’ll discuss some of their tricks in Chapter [[TODO]].

Peculiarities of Web-Based Game architectures are mostly about the way caching is built. First, on Fig VI.3 both front-end caching and back-end caching are used. Front-end caching is your usual page caching (like nginx in reverse-proxy mode, or even a CDN), though there is a caveat. As your current-game-data changes very frequently, you normally don’t want to cache it, so you need to take an effort and clearly separate your static assets (.SWFs, CSS, JS, etc. etc.), which can (and should) be cached, from dynamic pages (or AJAX) with current game state data, which changes too frequently to bother about caching it (and which will likely go directly from your web servers) [Zubek2010].

At the back-end, the situation is significantly more complicated. According to [Zubek2016], for games you will often want not only to use your back-end cache as a cache to reduce the number of DB reads, but also to make it a write-back cache (!), to reduce the number of DB writes. Such a write-back cache can be implemented either manually over memcached (with web servers writing to memcached only, and a separate daemon writing ‘dirty’ pages from memcached to DB), or a product such as Redis or Couchbase (formerly Membase) can be used [Zubek2016].

Taming DB Load: Write-Back Caches and In-Memory States

One Big Advantage of having a write-back cache (and of the in-memory state of the Classical deployment architecture described below) is related to the huge reduction in the number of DB updates. For example, if we needed to save each and every click on a simulated farm with 25M daily users (each coming twice a day and doing 50 modifying-farm-state clicks in each 5-minute session), we could easily end up with 2.5 billion DB transactions/day (which is infeasible, or at least non-affordable). On the other hand, if we’re keeping a write-back cache and writing it into DB only once per 10 minutes, we’d reduce the number of DB transactions 50-fold, bringing it to a much more manageable 50 million/day.
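
To illustrate the mechanics (not the scale), here is a minimal write-back-cache sketch; GameWorldState, DbConnection, and the flush policy are hypothetical placeholders – a real implementation would sit over memcached/Redis, with a separate flushing daemon as described above.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical placeholders; a real cache would live in memcached/Redis,
// with a separate daemon doing the flushing.
struct GameWorldState { std::string blob; };
struct DbConnection {
    void save_world(const std::string& id, const GameWorldState& s) {
        std::cout << "DB write: " << id << " (" << s.blob.size() << " bytes)\n";
    }
};

class WriteBackCache {
public:
    // apply a player's click purely in memory; note: no DB write here
    template <class Mutation>
    void apply(const std::string& world_id, Mutation mutate) {
        Entry& e = entries_[world_id];
        mutate(e.state);
        e.dirty = true;   // remember that the DB is now behind
    }

    // called, say, once per 10 minutes: one DB write per dirty world,
    // regardless of how many clicks happened in between
    void flush(DbConnection& db) {
        for (auto& [id, e] : entries_)
            if (e.dirty) { db.save_world(id, e.state); e.dirty = false; }
    }

private:
    struct Entry { GameWorldState state; bool dirty = false; };
    std::unordered_map<std::string, Entry> entries_;
};
```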

For faster-paced games (usually implemented as the Classical Architecture described below, but facing the same challenge of the DB being overloaded), the problem surfaces even earlier. For example, to write each and every movement of every character in an MMORPG, we’d have a flow of updates on the order of 10 DB transactions/sec/player (i.e. for 10’000 simultaneous players we’d have 100’000 DB transactions/second, or around 10 billion DB transactions/day, once again making it infeasible, or at the very least non-affordable). On the other hand, with in-memory states stored in-memory-only (and saving to DB only major events such as changing zones, or obtaining a level) – we can reduce the number of DB transactions by 3-4 orders of magnitude, bringing it down to a much more manageable 1M-10M transactions/day.

As an additional benefit, such write-back caches (as long as you control write times yourself) and in-memory states also tend to play well with handling server failures. In short: for multi-player games, if you disrupt a multi-player “game event” (such as match, hand, or fight) for more than a few seconds, you won’t be able to continue it anyway because you won’t be able to get all of your players back; therefore, you’ll need to roll your “game event” back, and in-memory states provide a very natural way of doing it. See “Failure Modes & Effects” section below for detailed discussion of failure modes under Classical Game Architecture.

A word of caution for stock exchanges. If your game is a stock exchange, you generally do need to save everything in DB (to ensure strict correctness even in case of Game Server loss), so in-memory-only states are not an option, and the DB savings do not apply. However, even for stock exchanges, at least the Classical Game architecture described below has been observed to work very well despite DB transaction numbers being rather large; on the other hand, for stock exchanges transaction numbers are usually not as high as for an MMORPG, and the price of the hardware is generally less of a problem than for other types of games.

Write-Back Caches: Locking

As always, having a write-back cache has some very serious implications, and will cause lots of problems whenever two of your players try to interact with the same cached object. To deal with it, there are three main approaches: “optimistic locking”, “pessimistic locking”, and transactions. Let’s consider them one by one.

Optimistic Locking. This one is directly based on memcached’s CAS operation.1 The idea of using CAS for optimistic locking goes along the following lines. To process some incoming request, Web Server does the following:

  • reads the whole “game world” state as a single blob from memcached, alongside a “cas token”; the “cas token” is essentially a “version number” for this object
  • we’re optimists! :-) so the Web Server processes the incoming request, ignoring the possibility that some other Web Server also got the same “game world” and is working on it
    • the Web Server is NOT allowed to send any kind of reply back to the user (yet)
  • the Web Server issues a cas operation with both the new-value-of-“game-world”-blob and the same “cas token” which it has received
    • if the “cas token” is still valid (i.e. nobody has written to the blob since our Web Server read it), memcached writes the new value and returns ok
      • then our Web Server may send the reply back to whoever requested it
    • if, however, there was a second Web Server which managed to write after we’ve read our blob – memcached will return a special error
      • in this case, our Web Server MUST discard all the prepared replies
      • in addition, it MAY read the new value of the “game world” state (with a new “cas token”), and try to re-apply the incoming request to it
        • this is perfectly valid: it is just “as if” the incoming request came a little bit later (which can always happen)

Optimistic locking is simple, it is lock-less (which is important, see below why), and it has only one significant drawback for our purposes: while it works fine as long as the collision probability (i.e. two Web Servers working on the same “game world” at the same time) is low, as soon as the probability grows (beyond, say, 10%) you will start getting a significant performance hit (for processing the same message twice, three times, and so on and so forth). For slow-paced asynchronous games it is very unlikely to become a problem, and therefore by default I’d recommend optimistic locking for web-based games, but you still need to understand the limitations of the technology before using it.
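
For illustration, here is a minimal sketch of this CAS loop; KvClient is a hypothetical in-memory stand-in for memcached (its gets()/cas() mimic the semantics described above), and process_request() is a placeholder for your game-specific processing. Note that the reply is sent only after cas succeeds, exactly as required by the flow above.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical in-memory stand-in for memcached: gets() returns the value plus
// a "cas token" (version); cas() succeeds only if the token is still current.
class KvClient {
public:
    struct GetResult { std::string value; uint64_t cas_token; };
    GetResult gets(const std::string& key) {
        auto& e = data_[key];
        return { e.value, e.version };
    }
    bool cas(const std::string& key, const std::string& new_value, uint64_t token) {
        auto& e = data_[key];
        if (e.version != token) return false;  // somebody wrote in between
        e.value = new_value;
        ++e.version;
        return true;
    }
private:
    struct Entry { std::string value; uint64_t version = 0; };
    std::unordered_map<std::string, Entry> data_;
};

// Game-specific processing (placeholder): old state + request -> new state + reply
struct Processed { std::string new_state; std::string reply; };
Processed process_request(const std::string& state, const std::string& request) {
    return { state + "|" + request, "ok:" + request };
}

// The optimistic-locking loop described above (sketch only)
std::string handle_request(KvClient& kv, const std::string& world_key,
                           const std::string& request) {
    for (;;) {
        auto [state, token] = kv.gets(world_key);       // read blob + cas token
        Processed p = process_request(state, request);  // we're optimists :-)
        if (kv.cas(world_key, p.new_state, token))      // nobody wrote meanwhile?
            return p.reply;                             // only now may we reply
        // cas failed: discard the prepared reply and retry, "as if" the
        // incoming request simply arrived a little bit later
    }
}

int main() {
    KvClient kv;
    std::cout << handle_request(kv, "world#42", "plant_tomato") << "\n";
}
```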


1 a supposedly equivalent optimistic locking for Redis is described in [Redis.CAS]

 

Pessimistic Locking. This is pretty much classical multi-threaded mutex-based locking, applied to our “how to handle two concurrent actions from two different Web Servers over the same game world” problem.

In this case, the game state (usually stored as a whole in a blob) is protected by a sorta-mutex (so that two web servers cannot access it concurrently). Such a mutex can be implemented, for example, over something like memcached’s CAS operation [Zubek2010]. For pessimistic locking, the Web Server acts as follows:

  • obtains a lock on the mutex associated with our “game world” (we’re pessimists :-( , so we need to be 100% sure, before processing, that we’re not processing in vain)
    • if the mutex cannot be obtained – the Web Server MAY try again after waiting a bit
  • reads the “game world” state blob
  • processes it
  • writes the “game world” state blob
  • releases the lock on the mutex

This is a classical mutex-based schema, and it is very robust when applied to classical multi-thread synchronization. However, when applying it to web servers and memcached, there is a pretty bad caveat :-( . The problem here is related to the “how to detect a hung/crashed web server – or process – which didn’t remove the lock” question, as such a lock will effectively prevent all future legitimate interactions with the locked game world (which reminds me of the nasty problems from the early-90ish pre-SQL FoxPro-like file-lock-based databases).

For practical purposes, such a problem can be resolved via timeouts, effectively breaking the lock on the mutex (so that if the original owner of the broken mutex comes back later, it just gets an error). However, allowing mutex locks to be broken on timeouts, in turn, has significant further implications, which are not typical for usual mutex-based inter-thread synchronization:

  • first, if we’re breaking the mutex on timeout – there is the problem of choosing the timeout: make it too low, and we can end up with fake timeouts (breaking locks which are still legitimately held); make it too high, and users will be frustrated waiting for a stuck game world to be unlocked
  • second, it implies that we’re working EXACTLY according to the pattern above. In particular:
    • having more than one memcached object per “game world” is not allowed
    • “partially correct” writes of “game state” are not allowed either, even if they’re intended to be replaced “very soon” under the same lock

In practice, these issues rarely cause too many problems when using memcached for mutex-based pessimistic locking. On the other hand, as with memcached we’d need to simulate a mutex over CAS anyway, I still suggest optimistic locking (just because it is simpler and causes fewer memcached interactions).
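
A minimal sketch of the timeout-based “sorta-mutex” logic might look as follows. WorldLocks here is a purely local, in-memory stand-in for illustration only; in a real deployment the lock would live in memcached/Redis and be shared by all web servers (e.g. via a store-only-if-absent operation with an expiry), and the game-specific parts are placeholders.

```cpp
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>

// Local stand-in for a distributed "sorta-mutex" with automatic expiry, so a
// crashed web server cannot hold the lock forever (the timeout trade-off
// discussed above applies to the ttl chosen here).
class WorldLocks {
    using Clock = std::chrono::steady_clock;
public:
    explicit WorldLocks(std::chrono::seconds ttl) : ttl_(ttl) {}

    bool try_lock(const std::string& world_id) {
        auto now = Clock::now();
        auto it = locks_.find(world_id);
        if (it != locks_.end() && it->second > now)
            return false;               // somebody else holds a non-expired lock
        locks_[world_id] = now + ttl_;  // lock is "broken" automatically after ttl
        return true;
    }
    void unlock(const std::string& world_id) { locks_.erase(world_id); }

private:
    std::chrono::seconds ttl_;
    std::unordered_map<std::string, Clock::time_point> locks_;
};

// Pessimistic flow from the list above (game-specific parts are placeholders)
std::optional<std::string> handle_request_pessimistic(WorldLocks& locks,
                                                      const std::string& world_id,
                                                      const std::string& request) {
    if (!locks.try_lock(world_id))
        return std::nullopt;  // caller MAY wait a bit and try again
    // read the "game world" blob, process the request, write the blob back --
    // all of it under the lock, as a single whole-state update
    std::string reply = "processed:" + request;
    locks.unlock(world_id);
    return reply;
}
```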

Transactions. Classical DB transactions are useful, but dealing with concurrent transactions is really messy. All those transaction isolation levels (with interpretations subtly different across different databases), locks, and deadlocks are not a thing which you really want to think about.

Fortunately, Redis transactions are completely unlike classical DB transactions and come without all this burden. In fact, a Redis transaction is merely a sequence of operations which are executed atomically. This means no locking, and an ability to split your “game world” state into several parts to reduce traffic. On the other hand, I’d rather suggest staying away from this additional complexity for as long as possible, using Redis transactions only as a means of optimistic locking as described in [Redis.CAS]. Another way of utilizing the capabilities of Redis transactions is briefly mentioned in the “Web-Based Deployment Architecture: FSMs” section below.

Web-Based Deployment Architecture: FSMs

You may ask: how can finite state machines (FSMs) possibly be related to web-based stuff? They seem to be as different as night and day, don’t they?

Actually, they’re not. Let’s take a look at both optimistic and pessimistic locking above. Both take the whole state, generate a new state out of it, and store this new state. But this is exactly what our FSM::process_event() function from Chapter V does! In other words, even for a web-based architecture, we can (and IMHO SHOULD) write processing in an event-driven manner: taking a state and processing inputs, producing a new state and issuing replies as a result.

As soon as we’ve done it this way, the question “should we use optimistic locking or pessimistic locking?” becomes a deployment implementation detail

In other words, if we have FSM-based (a.k.a. event-driven) game code, we can change the wrapping infrastructure code around it, and switch it from optimistic locking to pessimistic locking (or vice versa). All this without changing a single line within any of the FSMs!

Moreover, if using FSMs, we can even change from Web-Based Architecture to Classical one and vice versa without changing FSM code

If by any chance reading the whole “game world” state from the cache becomes a problem (which it shouldn’t, but you never know), it MIGHT still be solved via FSMs together with the Redis-style transactions mentioned above. Infrastructure code (the one outside of the FSM) may, for example, load only a part of the “game world” state depending on the type of the incoming request (while locking all the other parts of the state to avoid synchronization problems), and also MAY implement some kind of on-demand exception-based state loading along the lines of the on-demand input loading discussed in the [[TODO]] section below.
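
To make the “locking is an implementation detail” point more concrete, here is a minimal sketch of how such an event-driven FSM can stay completely unaware of the locking strategy. GameWorldFSM, Storage, and InMemoryStorage are hypothetical names; a real Storage implementation would be the optimistic or pessimistic wrapper discussed above.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical event-driven game logic in the spirit of Chapter V's
// FSM::process_event(): state in, (new state, reply) out; it knows nothing
// about memcached, locking, or deployment.
struct GameWorldFSM {
    static std::pair<std::string, std::string>
    process_event(const std::string& state, const std::string& event) {
        return { state + "|" + event, "ack:" + event };  // placeholder logic
    }
};

// Infrastructure wrapper: Storage::update() hides whether we use optimistic
// CAS retries, a pessimistic distributed mutex, or a Classical in-memory state.
template <class Storage>
std::string run_event(Storage& storage, const std::string& world_key,
                      const std::string& event) {
    return storage.update(world_key, [&](const std::string& old_state) {
        return GameWorldFSM::process_event(old_state, event);
    });
}

// Trivial stand-in Storage for testing; a production one would wrap the
// web-based cache (or the Classical architecture) instead.
struct InMemoryStorage {
    template <class F>
    std::string update(const std::string& key, F process) {
        auto [new_state, reply] = process(data_[key]);
        data_[key] = new_state;
        return reply;
    }
    std::unordered_map<std::string, std::string> data_;
};
```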

Web-Based Deployment Architecture: Merits

Unlike the naïve approach above, Web-Based systems may work. Their obvious advantage (especially if you have a bunch of experienced web developers on your team) is that they use familiar and readily-available technologies. Other benefits are also available, such as:

  • easy-to-find developers
  • simplicity and being relatively obvious (that is, until you need to deal with locks, see above)
  • web servers are stateless (except for caching, see below), so failure analysis is trivial: if one of your web servers goes down, it can be simply replaced
  • can be easily used both for the games with downloadable client and for browser-based ones

Web-Based Architecture (as well as any other one), of course, also has downsides, though they may or may not matter depending on your game:

  • there is no way out of a web-based architecture; once you’re in – switching to any other one will be impossible. This might not be that important for you, but keep it in mind.
  • it is pretty much HTTP-only (with an option to use Websockets); migration to plain TCP/UDP is generally not feasible.
  • as everything will work via operations on the whole game state, different parts of your game will tend to be tightly coupled. Not a big problem if your game is trivial, but may start to bite as complexity grows.
  • as the number of interactions between players and the game world grows, Web-Based Architecture becomes less and less efficient (as distributed-mutex-locked accesses to retrieve the whole game state from the back-end cache and write it back as a whole don’t scale well). Even medium-paced “synchronous” games such as casino multi-players are usually not good candidates for Web-Based Architecture.
  • you need to remember to keep all the accesses to game objects synchronized; if you miss one – it will work for a while, but will cause very strange-looking bugs under heavier load.
  • you’ll need to spend A LOT of time meditating over your caching strategy. As the number of players grows, you’re very likely to need a LOT of caching, so start designing your caching strategies ASAP. See above about the peculiarities of caching when applied to games (especially the write-back part and mutexes), and do your own research.
  • as the load grows, you will be forced to spend time on finding a good and really-working-for-you solution for that nasty web-server-never-releases-mutex problem mentioned above. While not as hopeless as ensuring consistency within pre-SQL DBF-like file-lock-based databases, expect quite a chunk of trouble until you get it right.

Still,

if your game is rather slow/asynchronous and inter-player interactions are simple and rather few and far between, Web-Based Architecture may be the way to go

While the Classical Architecture described below (especially with Front-End Servers added, see the [[TODO]] section) can also be used for slow-paced games, implementing it yourself just for this purpose is a Really Big Headache and might easily not be worth the trouble if you can get away with the Web-Based one. On the other hand,

even for medium-paced synchronous multi-player games (such as casino-like multi-player games) Web-Based Architecture is usually not a good candidate

(see above).

Classical Game Deployment Architecture

Fig VI.4 shows a classical game deployment diagram.

Fig VI.4. Classical Game Deployment Architecture

In this deployment architecture, clients are connected to Game Servers directly, and Game Servers are connected to a single DB Server, which hosts the system-wide persistent state. Each of the Game Servers MIGHT (or might not) have its own database (or other persistent storage) depending on the needs of your specific game; however, usually Game Servers store only in-memory states, with all the persistent storage going into a single DB residing on the DB Server.

Game Servers

Game Servers are traditionally divided according to their functionality, and while you can combine different types of functionality on the same box, there are often good reasons to avoid combining too many different things together.

Different types of Game Servers (more strictly – different types of functionality hosted on Game Servers) should be mapped to the entities on your Entities&Relationships Diagram described in Chapter II. You should do this mapping for your specific game yourself. However, as an example, let’s take a look at a few typical Game Servers (as always, YMMV, but these are likely to be present for quite a few games):

Game World Servers. Your game worlds are running on Game World Servers, plain and simple. Note that “Game World” here doesn’t necessarily mean a “3D game world with simulated physics etc.”. Taking a page from the casino-like games book, a “Game World” can be a casino table; going even further into the realm of stock exchanges, a “Game World” may be a stock exchange floor. Surprisingly, from an architecture point of view, all these seemingly different things are very similar. All of them represent a certain state (we usually name it the “game world”) which is affected by players’ actions in real time, with changes to this state shown to all the players.2

Matchmaking Servers. Usually, when a player launches her client app, the client by default connects to one of Matchmaking Servers. In general, matchmaking servers are responsible for redirecting players to one of your multiple game worlds. In practice, they can be pretty much anything: from lobbies where players can join teams or select game worlds, to completely automated matchmaking. Usually it is matchmaking servers that are responsible for creating new game worlds, and placing them on the servers (and sometimes even creating new servers in cloud environments).

Tournament Servers. Not always, but quite often your game will include certain types of “tournaments”, which can be defined as game-related entities that have their own life span and may create multiple Game World instances during this life span. Technically, these are usually reminiscent of Matchmaking Servers (they need to communicate with players, they need to create Game Worlds, they tend to use about the same generic protocol synchronization mechanics, see Chapter [[TODO]] for details), but of course, Tournament Servers need to implement tournament rules of the specific tournament etc. etc.

Payment Server and Social Gateway Server. These are necessary to provide interaction of your game with the real world. While these servers might look like an “optional thing nobody should care about”, they usually play an all-important role in increasing the popularity and monetization of your game, so you’d better account for them from the very beginning.

The very nature of Payment Servers and Social Gateway Servers is to be “gateways to the real world”, so they’re usually exactly what is written on the tin: gateways. It means that their primary function is usually to get some kind of input from the player and/or other Game Servers, write something to DB (via the DB Server), and make some request according to some external protocol (defined by the payment provider or by the social network). On the other hand, implementing them when you need to support multiple payment/social providers (each with their own peculiarities, you can count on it) is a challenge; they also tend to change a lot due to requirements coming from business and marketing, changes in providers’ APIs, the need to support new providers, etc. And of course, at least for payment servers, there are questions of distributed transactions between your DB and the payment provider’s DB, with all the associated issues of recovery from “unknown-state” transactions, and semi-manual reconciliation of reports at the end of the month. As a result, these two seemingly irrelevant-to-gameplay servers tend to have their own teams after deployment; more details on payment servers will be discussed in Chapter [[TODO]].

One of the things these servers should do is isolate Game World Servers (and preferably Matchmaking Servers) from the intimate details of the specifics of payment providers and social networks. In other words, Game World Servers shouldn’t generally know about such things as “a guy has made a post on Facebook, so we need to give him a bonus of 25% extra experience for 2 days”. Instead, this functionality should be split in two: the Social Gateway Server should say “this guy has earned bonus X” (with an explanation in DB of why he’s got the bonus, for audit purposes), and the Game World Server should take the “this guy has bonus X” statement and translate it into 25% extra experience.


2 restrictions may apply to which parts of the state are shown to which players. One such example is a server-side fog-of-war, that we’ll discuss in Chapter [[TODO]]

 

Implementing Game Servers under QnFSM architecture

In theory, Game Servers can be implemented in whatever way you prefer. In practice, however, I strongly suggest implementing them under the Queues-and-FSMs (QnFSM) model described in Chapter V. Among other things, QnFSM provides very clean separation between different modules, enables replay-based debug and production post-mortem, allows for different deployment scenarios without changing the FSM code (this one becomes quite important for the server side), and completely avoids all those pesky inter-thread synchronization problems at the logical level; see Chapter V for further discussion of QnFSM benefits.

Fig VI.5 shows a diagram with an implementation of a generic Game Server under QnFSM:

Fig VI.5. Game Server implemented over QnFSM

If it looks complicated at first glance – well, it should. First of all, the diagram represents quite a generic case, and for your specific game (at least at the first stages) you may not need all of that stuff; we’ll discuss it below. Second, but certainly not unimportant: writing an anywhere-close-to-scalable server is not easy.

Now let’s take a closer look at the diagram on Fig VI.5, going in an unusual direction from right to left.

Game Logic and Game Logic Factory. On the rightmost side of the diagram, there is the most interesting part – the things closely related to your game logic. The specifics of those Game Logic FSMs are different for the different Game Servers you have, and can vary from a “Game World FSM” to a “Payment Processing FSM”, with anything else you need in between. It is worth noting that while for most Game Logic FSMs you won’t need any communications with the outside world except for sending/receiving messages (as shown on the diagram), for gateway-style FSMs (such as a Payment FSM or a Social Gateway FSM) you will need some kind of external API (most of the time they go over outgoing HTTP, though I’ve seen quite strange things, such as X.25); it doesn’t change the nature of those gateway-style FSMs, so you still have all the FSM goodies (as long as you “intercept” all the calls to that external API, see Chapter V for details). [[TODO! – discussion on blocking-vs-non-blocking APIs for gateway-style FSMs]]

Game Logic Factory is necessary to create new FSMs (and, if necessary, new threads) by external request. For example, when a Matchmaking Server needs to create a new game world on server X, it sends a request to the Game Logic Factory which resides on server X, and the Game Logic Factory creates a game world with the requested parameters. Deployment-wise, usually there is only one instance of the Game Logic Factory per server, but technically there is no such strict requirement.
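
As an illustration, a Game Logic Factory handling such a request might look roughly like this; all the message and FSM types here are hypothetical, and real code would also spawn/choose a hosting thread and report the new world’s address back to the requesting Matchmaking Server.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical message asking the factory to create a new game world
struct CreateGameWorldRequest {
    uint64_t    world_id;
    std::string params;  // game-specific parameters, opaque to the factory
};

// Hypothetical base class for game logic FSMs (in the spirit of Chapter V)
struct GameLogicFSM {
    virtual ~GameLogicFSM() = default;
    virtual void process_event(const std::string& event) = 0;
};

struct GameWorldFSM : GameLogicFSM {
    explicit GameWorldFSM(std::string params) : params_(std::move(params)) {}
    void process_event(const std::string& /*event*/) override { /* game logic here */ }
    std::string params_;
};

// Game Logic Factory FSM: creates FSMs (and, if needed, threads) on request
class GameLogicFactory {
public:
    void process_event(const CreateGameWorldRequest& req) {
        // real code would also pick/spawn a thread to host the new FSM,
        // and notify the matchmaking server where the world now lives
        worlds_[req.world_id] = std::make_unique<GameWorldFSM>(req.params);
    }
private:
    std::unordered_map<uint64_t, std::unique_ptr<GameLogicFSM>> worlds_;
};
```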

TCP Sockets and TCP Accept. Going to the left of the Game Logic on Fig VI.5, we can see TCP-related stuff. Here things are relatively simple: we have a classical accept() thread that passes the accepted sockets to Socket Threads (creating Socket Threads when it becomes necessary).

The only really important thing to be noted here is that each Socket Thread3 should normally handle more than one TCP socket; usually the number of TCP sockets per thread for a game server should be somewhere between 16 and 128 (or “somewhere between 10 and 100” if you prefer decimal notation to hex). On Windows, if you’re using WaitForMultipleObjects()4, you’re likely to hit the wall at around 30 sockets per thread (see further discussion in Chapter [[TODO]]), and this has been observed to work perfectly fine. Having one thread (even worse – two, one for recv() and another one for send()) per socket on the server side is generally not advisable, as threads have substantial associated overhead (both in terms of resources and in terms of context switches). In theory, multiple sockets per thread may cause additional latencies and jitter, but in practice, for reasonably well-written code running on a non-overloaded server, I wouldn’t expect additional latencies and jitter of more than single-digit microseconds, which should be non-observable even for the most fast-paced games.
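
For reference, the accept() side of this picture is pretty much a textbook POSIX loop; the sketch below assumes a hypothetical pass_to_socket_thread() helper which enqueues the accepted socket to a Socket Thread currently handling fewer sockets than its quota (creating a new Socket Thread if needed).

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstdint>
#include <cstdio>

// Hypothetical hand-off into the QnFSM world; the real version would push the
// fd into the queue of an appropriate Socket Thread.
void pass_to_socket_thread(int client_fd) {
    std::printf("accepted socket %d\n", client_fd);  // placeholder
}

// Classical accept() thread (sketch; most error handling omitted)
void accept_thread(uint16_t port) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(listen_fd, SOMAXCONN) != 0) {
        std::perror("bind/listen");
        return;
    }
    for (;;) {
        int client_fd = accept(listen_fd, nullptr, nullptr);
        if (client_fd < 0)
            continue;                      // transient error: keep accepting
        pass_to_socket_thread(client_fd);  // each Socket Thread multiplexes many sockets
    }
}
```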


3 and accordingly, Socket FSM, unless you’re hosting multiple Socket FSMs per Socket Thread, which is also possible
4 which IMHO provides the best balance between performance and implementation complexity (that is, if you need to run your servers on Windows), see Chapter [[TODO]] for further details

 

UDP-related FSMs. UDP (shown on the left side of Fig VI.5) is quite a weird beast; in some cases, you can use really simple things to get UDP working, but in some other cases (especially when high performance is involved), you may need to resort to quite heavy solutions to achieve scalability. The solution on Fig VI.5 is on the simpler side, so you MIGHT need to get into more complicated things to achieve performance/scalability (see below).

Let’s start explaining things here. One problem which you will [almost?] universally have when using UDP is that you will need to know whether your player is connected or not. And as soon as you have a concept of a “UDP connection” (for example, provided by your “reliable UDP” library), you have some kind of connection state/context that needs to be stored somewhere. This is where those “Connected UDP Threads” come in.

So, as soon as we have the concept of “player connected to our server” (and we need this concept at least because players need to be subscribed to the updates from our server), we need those “Connected UDP Threads”. Not exactly the best start from a KISS point of view, but at least we know what we need them for. As for the number of those threads – we should limit the number of UDP connections per Connected UDP Thread; as a starting point, we can use the same ballpark numbers of UDP connections per thread as we were using for TCP sockets per thread: that is, between 16 and 128 UDP connections per thread.

UDP Handler Thread and FSM is a very simple thing – it merely takes whatever comes in from recvfrom(), and passes it to the appropriate Connected UDP Thread (as the UDP Handler FSM also creates those Connected UDP Threads, it is not a problem for it to keep a map of incoming-packet-IP/port-pairs to threads).
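
A minimal sketch of this simpler approach (a single UDP Handler Thread dispatching by source IP/port) might look as follows; the Connected-UDP-Thread hand-off is represented by hypothetical placeholder types.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Key identifying a "UDP connection": the client's IP/port pair
struct Endpoint {
    uint32_t ip;
    uint16_t port;
    bool operator<(const Endpoint& o) const {
        return std::tie(ip, port) < std::tie(o.ip, o.port);
    }
};

// Hypothetical placeholders for the Connected UDP Threads and their queues
struct ConnectedUdpThread {
    void post(std::vector<char> /*packet*/) { /* enqueue to the thread's FSM */ }
};
ConnectedUdpThread* create_connected_udp_thread(const Endpoint& /*ep*/) {
    return new ConnectedUdpThread{};  // placeholder: real code would spawn/choose a thread
}

// UDP Handler Thread (sketch): recvfrom() and dispatch by source IP/port
void udp_handler_thread(int udp_fd) {
    std::map<Endpoint, ConnectedUdpThread*> connections;
    char buf[2048];
    for (;;) {
        sockaddr_in from{};
        socklen_t from_len = sizeof(from);
        ssize_t n = recvfrom(udp_fd, buf, sizeof(buf), 0,
                             reinterpret_cast<sockaddr*>(&from), &from_len);
        if (n <= 0) continue;
        Endpoint ep{ ntohl(from.sin_addr.s_addr), ntohs(from.sin_port) };
        auto it = connections.find(ep);
        if (it == connections.end())  // new "UDP connection": create its handler
            it = connections.emplace(ep, create_connected_udp_thread(ep)).first;
        it->second->post(std::vector<char>(buf, buf + n));  // forward the packet
    }
}
```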

However, you MAY find that this simpler approach doesn’t work for you (and your UDP Handler Thread becomes a bottleneck, causing incoming packets to drop while your server is not overloaded yet); in this case, you’ll need to use platform-specific stuff such as recvmmsg(),5 or to use multiple recvfrom()/sendto() threads. The latter multi-threaded approach will in turn raise the question of “where to store this mapping of incoming-packet-IP/port-pairs to threads”. This can be addressed either by using shared state (which is a deviation from the pure FSM model, but in this particular case it won’t cause too much trouble in practice), or via a separate UDP Factory Thread/FSM (with the UDP Factory FSM storing the mapping, and notifying recvfrom() threads about the mapping on request, in a manner somewhat similar to the one used for the Routing Factory FSM described in the [[TODO]] section below).


5 see further discussion on recvmmsg() in Chapter [[TODO]]

 

Websocket-related FSMs and HTTP-related FSMs (not shown). If you need to support Websocket clients (or, Stevens forbid, HTTP clients) in addition to, or instead of, TCP or UDP ones, this can be implemented quite easily. The basic Websocket protocol is very simple (with basic HTTP being even simpler), so you can use pretty much the same FSMs as for TCP, implementing the additional header parsing and frame logic within your Websocket FSMs. If you think you need to support the HTTP protocol for a synchronous game – think again, as implementing interactive communications over request-response HTTP is difficult (and tends to cause too much server load), so Websockets are generally preferable over HTTP for synchronous games, while providing about-the-same (though not identical) benefits in terms of browser support and being firewall-friendly; see further discussion on these protocols in Chapter [[TODO]]. For asynchronous games, HTTP (with simple polling) MAY be a reasonable choice.

CUDA/OpenCL/Phi FSM (not shown). If your Game Worlds require simulation which is very computationally heavy, you may want to use your Game World servers with CUDA (or OpenCL/Phi) hardware, and to add another FSM (not shown on Fig VI.5) to communicate with CUDA/OpenCL/Phi GPGPU. A few things to note in this regard:

  • We won’t discuss how to apply CUDA/OpenCL/Phi to your simulation; this is your game and a question “how to use massively parallel computations for your specific simulation” is utterly out of scope of the present book.
  • Obtaining strict determinism for CUDA/OpenCL FSMs is not trivial due to potential inter-thread interactions which may, for example, change the order of floating-point additions which may lead to rounding-related differences in the last digit (with both results practically the same, but technically different). However, for most of gaming purposes (except for replaying server-side simulation forever-and-ever on all the clients), even this “almost-strict-determinism” may be sufficient. For example, for “recovery via replay” feature discussed in “Complete Recovery from Game World server failures: DIY Fault-Tolerance in QnFSM World” section below, results during replay-since-last-state-snapshot, while not guaranteed to be exactly the same, are not too likely to result in macroscopic changes which are too visible to players.
  • Normally, you’re not going to ship your game servers to your datacenter. Well, if the life of your game depends on it, you might, but this is a huuuge headache (see below, as well as Chapter [[TODO]] for further discussion)
    • As soon as you agree that it is not your servers, but leased ones or cloud ones (see also Chapter [[TODO]]), it means that you’re completely dependent on your server ISP/CSP on supporting whatever you need.
    • Most likely, with 3rd-party ISP/CSP it will be Tesla or GRID GPU (both by NVidia), so in this case you should be ok with CUDA rather than OpenCL.
    • The choice of such ISPs which can lease you GPUs, is limited, and they tend to be on an expensive side :-(. As of the end of 2015, the best I was able to find was Tesla K80 GPU (the one with 4992 cores) rented at $500/month (up to two K80’s per server, with the server itself going at $750/month). With cloud-based GPUs, things weren’t any better, and started from around $350/month for a GRID K340 (the one with 4×384=1536 total cores). Ouch!
  • If you are going to co-locate your servers instead of leasing them from an ISP6, you should still realize that server-oriented NVidia Tesla GPUs (as well as AMD FirePro S GPUs designated for servers) are damn expensive. For example, as of the end of 2015, a Tesla K80 costs around $4000(!); at this price, you get 2xGK210 cores, 24GB RAM@5GHz, a clock of 562/875MHz, and 4992 CUDA cores. At the same time, the desktop-class GeForce Titan X is available for about $1100, has 2 of the newer GM200 cores, 12GB RAM@7GHz, a clock of 1002/1089MHz, and 3072 CUDA cores. In short – Titan X gets you more or less comparable performance parameters (except for RAM size and double-precision calculations) at less than 30% of the price of the Tesla K80. It might look like a no-brainer to use desktop-class GPUs, but there are several significant things to keep in mind:
    • the numbers above are not directly comparable; make sure to test your specific simulation with different cards before making a decision. In particular, differences due to RAM size and double-precision maths can be very nasty depending on the specifics of your code
    • even if you’re assembling your servers yourself, you are still going to place your servers into a 3rd-party datacenter; hosting stuff within your office is not an option (see Chapter [[TODO]])
      • space in datacenters costs, and costs a lot. It means that tower servers, even if allowed, are damn expensive. In turn, it usually means that you need a “rack” server.
      • Usually, you cannot just push a desktop-class GPU card (especially a card such as Titan X) into your usual 1U/2U “rack” server; even if it fits physically, in most cases it won’t be able to run properly because of overheating. Feel free to try, and maybe you will find a card which runs ok, but don’t expect it to be the-latest-greatest one; thermal conditions within “rack” servers are extremely tight, and air flows are traditionally very different from those of desktop machines, so throwing in an additional 250W or so with a desktop-oriented air flow into a non-GPU-optimized server isn’t likely to work for more than a few minutes.
    • IMHO, your best bet would be to buy rack servers which are specifically designated as “GPU-optimized”, and ideally – explicitly supporting those GPUs that you’re going to use. Examples of rack-servers-supporting-desktop-class-GPUs range from7 a 1U server by Supermicro with up to 4x Titan X cards,8 to 4U boxes with up to 8x Titan X cards, and monsters such as a 12U multi-node “cluster” which includes a total of 10×6-core Xeons and 16x GTX 980, the whole thing going at a humble $40K total, by ExxactCorp. In any case, before investing a lot to buy dozens of specific servers, make sure to load-test them, and load-test a lot, to make sure that they won’t overheat under many hours of heavy load and datacenter-class thermal conditions (where you have 42 such 1U servers lying right on top of each other, ouch!, see Chapter [[TODO]] for further details).

To summarize: if your game cannot survive without server-side GPGPU simulations – it can be done, but be prepared to pay a lot more than you would expect based on desktop GPU prices, and keep in mind that deploying CUDA/OpenCL/Phi on servers will take much more effort than simply making your software run on your local Titan X :-( . Also – make sure to start testing on real server rack-based hardware as early as possible; you do need to know ASAP whether the hardware of your choice has any pitfalls.


6 this potentially includes even assembling them yourself, but I generally don’t recommend it
7 I didn’t use any of these, so I cannot really vouch for them, but at least you, IMHO, have reasonably good chances if you try; also make sure to double-check if your colocation provider is ready to host these not-so-mainstream boxes
8 officially Supermicro doesn’t support Titans, but their 1U boxes can be bought from 3rd-party VARs such as Thinkmate with 4x Titan X for a total of $10K, Titans included; whether it really works with Titans in datacenter environment 24×7 under your type of load – you’ll need to see yourself

 

Simplifications. Of course, if your server doesn’t need to support UDP, you won’t need the corresponding threads and FSMs. However, keep in mind that usually your connection to the DB Server SHOULD be TCP (see the “On Inter-Server Communications” section below), so if your client-to-server communication is UDP, you’ll usually need to implement both. On the other hand, our QnFSM architecture provides a very good separation between protocols and logic, so usually you can safely start with a TCP-only server – this will almost certainly be enough to test your game intra-LAN (where packet losses and latencies are negligible) – and implement UDP support later (without the need to change your FSMs). Appropriate APIs which allow this kind of clean separation will be discussed in Chapter [[TODO]].

On Inter-Server Communications

One of the questions you will face when designing your server-side, will be about the protocol used for inter-server communications. My take on it is simple:

even if you’re using UDP for client-to-server communications, seriously consider using TCP for server-to-server communications

A detailed discussion of TCP’s (lack of) interactivity is due in Chapter [[TODO]], but for now, let’s just say that the poor interactivity of TCP (assuming you have the Nagle algorithm disabled) becomes observable only when you have packet loss, and if you have non-zero packet loss within your server LAN – you need to fire your admins.9

On the positive side, TCP has two significant benefits. First, if you can get acceptable latencies without disabling the Nagle algorithm, TCP is likely to produce far fewer hardware interrupts (and overall context switches) on the receiving server’s side, which in turn is likely to reduce the overall load of your Game Servers and, even more importantly, of your DB Server. Second, TCP is usually much easier to deal with than UDP (on the other hand, this advantage may be offset if you have already implemented UDP support to handle client-to-server communications).
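
For reference, disabling the Nagle algorithm on a connected TCP socket is a one-liner on POSIX systems; whether you actually want to do it for your inter-server links is exactly the latency-vs-interrupts trade-off discussed above.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable the Nagle algorithm (TCP_NODELAY) on an already-connected socket.
// Returns true on success.
bool disable_nagle(int sock_fd) {
    int flag = 1;
    return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
                      &flag, sizeof(flag)) == 0;
}
```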


9 to those asking “if it is zero packet loss, why would we need to use TCP at all?” – I’ll note that when I’m speaking about “zero packet loss”, I can’t rule out two packets lost in a day, which can happen even if your system is really, really well-built. And while an additional delay of a few dozen microseconds twice a day won’t be noticeable, crashing twice a day is not too good

 

QnFSM on Server Side: Flexibility and Deployment-Time/Run-Time Options.

When it comes to the available deployment options, QnFSM is an extremely flexible architecture. Let’s discuss your deployment and run-time options provided by QnFSM in more detail.

Threads and Processes

First of all, you can have your FSMs deployed in different configurations depending on your needs. In particular, FSMs can be deployed as multiple-FSMs-per-thread, one-FSM-per-thread-multiple-threads-per-process, or one-FSM-per-process configurations (all this without changing your FSM code at all).10

In one real-world system with hundreds of thousands of simultaneous players, but lightweight processing on the server side and rather high acceptable latencies, they’ve decided to have some of the game worlds (those for novice players) deployed as multiple-FSMs-per-thread, another bunch of game worlds (intended for mature players) deployed as single-FSM-per-thread (improving latencies a bit, and providing an option to raise thread priority for these FSMs), and the game worlds for pro players deployed as single-FSM-per-process (additionally improving memory isolation in case of problems, and almost-unobservably improving memory locality and therefore performance); all these FSMs were using exactly the same FSM code, but it was compiled into different executables to provide slightly different performance properties.

Moreover, in really extreme cases (like “we’re running the Tournament of the Year with live players”), you may even pin a single-FSM-per-thread to a single core (preferably the same one where interrupts from your NIC arrive on this server) and pin other processes to other cores, keeping your latencies to the absolute minimum.11


10 Restrictions apply, batteries not included. If you have blocking calls from within your FSM, which is common for DB-style FSMs and some of gateway-style FSMs, you shouldn’t deploy multiple-FSMs-per-thread
11 yes, this will further reduce latencies in addition to any benefits obtained by simple increase of thread priority, because of per-core caches being intact

 

Communication as an Implementation Detail

With QnFSM, communication becomes an implementation detail. For example, you can have the same Game Logic FSM serve both TCP and UDP. Not only can this come in handy for testing purposes, but it may also enable some of your players (those who cannot access your servers via UDP due to firewalls/weird routers etc.) to play over TCP, while the rest are playing over UDP. Whether you want this capability (and whether you want to match TCP players only with TCP players to make sure nobody has an unfair advantage) is up to you, but at least QnFSM does provide you with such an option at a very limited cost.

Moving Game Worlds Around (at the cost of client reconnect)

Yet another flexibility option which QnFSM can provide (though with some additional headache, and a bit of additional latency) is to allow moving your game worlds (or, more generally, FSMs) from one server to another. To do it, you just need to serialize your FSM on server A (see Chapter V for details on serialization), transfer the serialized state to server B’s Game Logic Factory, and deserialize it there. Bingo! Your FSM runs on server B right from the same moment where it stopped running on server A. In practice, however, moving FSMs around is not that easy, as you’ll also need to notify your clients about the changed address where this moved FSM can be reached; but while this is an additional chunk of work, it is also perfectly doable if you really want it.
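
In code, the move itself is conceptually trivial; the sketch below uses a hypothetical, Chapter-V-style serialization interface and a hypothetical channel to the remote Game Logic Factory, both of which are placeholders.

```cpp
#include <iostream>
#include <string>

// Hypothetical FSM interface in the spirit of Chapter V: the whole state can
// be serialized into an opaque blob and reconstructed from it later.
struct GameWorldFSM {
    std::string serialize() const { return state_; }            // on server A
    static GameWorldFSM deserialize(const std::string& blob) {  // on server B
        return GameWorldFSM{ blob };
    }
    std::string state_;
};

// Hypothetical channel to server B's Game Logic Factory (placeholder)
void send_to_factory(const std::string& server_b, const std::string& fsm_blob) {
    std::cout << "sending " << fsm_blob.size() << " bytes to " << server_b << "\n";
}

// Moving a game world from this server to server B (sketch); notifying the
// clients about the world's new address is not shown.
void move_world(const GameWorldFSM& world, const std::string& server_b) {
    std::string blob = world.serialize();  // FSM stops running here...
    send_to_factory(server_b, blob);       // ...and resumes from the same state on B
}
```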

Online Upgrades

Yet another two options provided by QnFSM, enable server-side software upgrades while your system is running, without stopping the server.

The first of these options is just to start creating new game worlds using new Game Logic FSMs (while existing FSMs are still running the old code). This works as long as the changes within the FSMs are minor enough that all external inter-FSM interfaces stay 100% backward compatible, and the lifetime of each FSM is naturally limited (so that at some point you’re able to say that the migration from the old code is complete).

The second of these online-upgrade options allows upgrading FSMs while the game world is still running (via serialization – replacing the code – deserialization). This second option, however, is much more demanding than the first one, and migration problems may be difficult to identify. Therefore, severe automated testing using the “replay” technique (also provided by QnFSM, see Chapter V for details) is strongly advised. Such testing should use big chunks of real-world data, and should simulate online upgrades at random moments of the replay.

On Importance of Flexibility

Quite often we don’t realize how important flexibility is. Actually, we rarely realize how important it is until we run into the wall because of lack of flexibility. Deterministic FSMs provide a lot of flexibility (as well as other goodies such as post-mortem) at a relatively low development cost. That’s one of the reasons why I am positively in love with them.

DB Server

DB Server handles access to a database. This can be implemented using several very different approaches.

The first and the most obvious model is also the worst one. While in theory it is possible to use your usual ODBC-style blocking calls to your database right from your Game Server FSMs, do yourself a favor and skip this option. It has several significant drawbacks: from making your Game Server FSMs too tightly coupled to your DB, to having blocking calls with undefined response times right in the middle of your FSM simulation (ouch!). In short – I don’t know of any game where this approach is appropriate.

DB API and DB FSM(s)

A much better alternative (which I’m arguing for) is to have at least one FSM running on your DB Server, to have your very own message-based DB API (expressed in terms of messages or non-blocking RPC calls) to communicate with it, and to keep all DB work where it belongs – on the DB Server, within the appropriate DB FSM(s). An additional benefit of such a separation is that you don’t need to be a DB guru to write your game logic, and you can easily have a DB guru (who’s not a game logic guru) writing your DB FSM(s).

The DB API exposed by the DB Server’s FSM(s) SHOULD NOT be plain SQL (which would violate all the decoupling we’re after). Instead, your DB API SHOULD be specific to your game, and (once again) should be expressed in game terms such as “take PC Z and place it (with all its gear) into game world #NN”. All the logic to implement this request (including pre-checking that the PC doesn’t belong to any other game world, modifying the PC’s row in the table of PCs to reflect the number of the world where she currently resides, and reading all the PC attributes and gear to pass them back) should be done by your DB FSM(s).

In addition, all the requests in the DB API MUST be atomic; no things such as “open a cursor and return it back, so I can iterate over it later” are ever allowed in your DB API (and you won’t really need such things, this stands in spite of whatever-your-DB-guru-may-tell-you).

As soon as you have this nice DB API tailored for your needs, you can proceed with writing your Game Server FSMs, without worrying about exact implementation of your DB FSM(s).
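
To give a feel of what such a game-level DB API might look like, here is a hypothetical request/reply pair (names and fields are made up for illustration); the point is that everything is expressed in game terms rather than SQL, and that each request is atomic from the Game Server’s point of view.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical request in the spirit of "take PC Z and place it (with all its
// gear) into game world #NN"; the DB FSM does all the SQL behind the scenes.
struct PlacePCIntoGameWorldRequest {
    uint64_t pc_id;
    uint64_t world_id;
};

// The reply carries everything the Game World FSM needs to start simulating the PC
struct PlacePCIntoGameWorldReply {
    bool success = false;            // false if e.g. the PC is already in another world
    std::string error_description;
    std::string pc_attributes;       // serialized attributes read by the DB FSM
    std::vector<std::string> gear;   // serialized inventory items
};
```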

Meanwhile, at the King’s Castle…

As soon as we have this really nice separation between the Game Server’s FSMs and the DB FSM(s) via your very own message-based DB API, the implementation of the DB FSM becomes, in a sense, an implementation detail. Still, let’s discuss how this small but important detail can be implemented. Here I know of two major approaches.

Single-connection approach. This approach is very simple: you run just one FSM on your DB Server and process everything within one single DB connection:

Fig VI.6. Single-Connection DB Server

Here, there is a single DB FSM which has a single DB connection (such as an ODBC connection, but there are lots of similar interfaces out there), and which performs all the operations using blocking calls. A very important thing in this architecture is the application-level cache, which allows speeding things up very considerably. In fact, this application-level cache has been observed to provide a 10x+ performance improvement over the DB cache, even if all the necessary performance-related optimizations (such as prepared statements or even stored procedures) are made on the DB side. Just think about it – what is faster: a simple hash-based in-memory search within your DB FSM (where you already have all the data, so we’re speaking about 100 CPU clocks or so even if the data is out of L3 cache), or marshalling -> going-to-DB-side-over-IPC -> unmarshalling -> finding-execution-plan-by-prepared-statement-handle -> executing-execution-plan -> marshalling results -> going-back-to-DB-FSM-side-over-IPC -> unmarshalling results? In the latter case, we’re speaking of at least a few dozen microseconds, or over 1e4 CPU clocks – over two orders of magnitude difference.12 And with a single DB connection being the only one able to write data, keeping cache coherency is trivial. The main thing which gets cached for games is usually the ubiquitous USERS (or PLAYERS) table, as well as some small game-specific near-constant tables.
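
A minimal sketch of such an application-level cache inside the single-connection DB FSM might look as follows; UserRow and DbConnection are hypothetical placeholders for your schema and your blocking DB access. Since the DB FSM is the only writer, the cache and the DB cannot diverge.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical row of the ubiquitous USERS table
struct UserRow { uint64_t user_id = 0; std::string name; int64_t gold = 0; };

// Hypothetical blocking ODBC-style access; used only on cache misses and writes
struct DbConnection {
    std::optional<UserRow> select_user(uint64_t /*user_id*/) { return std::nullopt; }  // placeholder
    void update_user(const UserRow& /*row*/) {}                                        // placeholder
};

// Application-level cache inside the single-connection DB FSM
class UsersCache {
public:
    explicit UsersCache(DbConnection& db) : db_(db) {}

    const UserRow* get(uint64_t user_id) {        // ~100 CPU clocks on a hit
        auto it = cache_.find(user_id);
        if (it != cache_.end()) return &it->second;
        auto row = db_.select_user(user_id);      // dozens of microseconds on a miss
        if (!row) return nullptr;
        return &(cache_[user_id] = *row);
    }

    void update(const UserRow& row) {             // write-through: DB stays authoritative
        db_.update_user(row);
        cache_[row.user_id] = row;
    }

private:
    DbConnection& db_;
    std::unordered_map<uint64_t, UserRow> cache_;
};
```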

Despite all the benefits provided by caching, this schema clearly sounds like heresy from the point of view of any-DB-gal-out-there. On the other hand, in practice it works surprisingly well (that is, as soon as you manage to convince your DB gal that you know what you’re doing). I’ve seen such a single-connection architecture13 handling 10M+ DB transactions per day for a real-world game, and these were real transactions, with all the necessary changes, transactions, audit tables and so on.

Actually, at least at the first stages of your development, I’m advocating going with this single-connection approach.

It is very nice from many different points of view.

  • First, it is damn simple.

  • Second, there is no need to worry about transaction isolation levels, locks, and deadlocks.

  • Third, it can be written as a real deterministic FSM (with all the associated goodies); moreover, this determinism stands both (a) if you “intercept calls” to the DB for the DB FSM itself, and (b) if we consider the DB itself as a part of the FSM state; in the latter case no call interception is required for determinism.

  • Fourth, the performance is very good. There are no locks whatsoever, the light is always green, so everything goes unbelievably smoothly. Add application-level caching here, and we have a winner! The single-connection system I’ve mentioned above had an average transaction processing time in the below-1ms range (once again, with real-world transactions, a commit after every transaction, etc.).

The only drawback of this schema (and the one which will make DB people extremely skeptical about it, to put it very mildly) is an apparent lack of scalability. However, there are ways to modify this single-connection approach to provide virtually unlimited scalability.14 The ways to achieve DB scalability for this single-connection model will be discussed in Vol. 2.

One thing to keep in mind for this single-connection approach is that it (at least if we’re using blocking calls to the DB, which is usually the case) is very sensitive to latencies between the DB FSM and the DB; we’ll speak about it in more detail in Chapter [[TODO]], but for now let’s just say that to get to any serious performance (that is, comparable to the numbers above), you’ll need to use a RAID card with BBWC in write-back mode,15 or something like NVMe, for the disk which stores DB log files (other disks don’t really matter much). If your DB server is a cloud one, you’ll need to look for one which has low-latency disk access (such things are available from quite a few cloud providers).


12 with stored procedures the things become a bit better for DB side, but the performance difference is still considerable, not to mention vendor lock-in which is pretty much inevitable when using stored procedures
13 with a full cache of PLAYERS table
14 while in practice I’ve never gone above around 100M DB transactions/day with this “single-connection-made-scalable” approach, I’m pretty sure that you can get to 1B pretty easily; beyond that it MAY become tough, as the number is too different from what-I’ve-seen, so some unknown-as-of-now problems can start to develop. On the other hand, I daresay reaching these numbers is even more challenging with the traditional multiple-connection approach
15 don’t worry, it is a perfectly safe mode for this kind of RAID, even for financial applications

Multiple-Connections approach. This approach is much more along the lines of traditional DB development, and is shown on Fig VI.7:

Fig VI.7. Multi-Connection DB Server

In short: we have one single DB-Proxy FSM (with the same DB API as discussed above),16 which does nothing but dispatch requests to DB-Worker FSMs; each of these DB-Worker FSMs keeps its own DB connection and issues DB requests over this connection. The number of these DB-Worker FSMs should be comparable to the number of cores on your DB server (usually 2*number-of-cores is not a bad starting number), which effectively makes this schema a kind of transaction monitor.

The upside of this schema is that it is inherently somewhat-scalable, but that’s about it. The downsides, however, are numerous. The most concerning one is the cost of maintaining code whose logic changes all the time and which runs over multiple connections. This inevitably leads us to a well-known but way-too-often-ignored discussion about transaction isolation levels, locks, and deadlocks at the DB level. And if you don’t know what these are – believe me, you Really don’t want to know about them. Updating DB-handling code when you have lots of concurrent access (with isolation levels above UR) is possible, but it is extremely tedious. Restrictions such as “to avoid deadlocks, we must always issue all our SELECT FOR UPDATEs in the same order – the one written in blood on the wall of the DB department” can be quite a headache, to put it mildly.

Oh, and don’t try using application-side caching with multiple connections (i.e. even the DB-Proxy SHOULD NOT be allowed to cache). While this is theoretically possible, the re-ordering of replies on the way from the DB to the DB-Proxy makes the whole thing way too complicated to be practical. While I’ve done such a thing myself once, and it worked without any problems (after several months of heavy replay-based testing), it was the most convoluted thing I’ve ever written, and I clearly don’t want to repeat the experience.

But IMNSHO the worst thing about using multiple DB connections is that while each of those DB FSMs can be made deterministic (via “call interception”), the whole DB Server cannot possibly be made deterministic (for multiple connections), period. It means that it may work perfectly under test, but fail in production while processing exactly the same sequence of requests.

Worse than that, there is a strong tendency for improper-transaction-isolation bugs to manifest themselves only under heavy load.

So, you can easily live with such a bug (for example, using SELECT instead of SELECT FOR UPDATE) quietly sitting in, but not manifesting itself until your Big Day comes – and then it crashes your site.17 Believe me, you really don’t want to find yourself in such a situation; it can be really (and I mean Really) unpleasant.

In a sense, working with transaction isolation levels is akin to working with threads: about the same problems with lack of determinism, bugs which appear only in production and cannot be reproduced in a test environment, and so on. On the other hand, there are DB guys&gals out there who’re saying that they can design a real-world multi-connection system which works under a load of 100M+ write transactions per day and never deadlocks, and I don’t doubt that they can indeed do it. The thing which I’m not so sure about is whether they can really maintain such quality of their system in the face of new-code-required-twice-a-week, and I’m even less sure that you’ll have such a person on your game team.

In addition, the scalability of this approach, while apparent, is never perfect (and no, those TPC-C linear-scalability numbers don’t prove that linear scalability is achievable for real-world transactions). In contrast, the single-connection-made-scalable approach which we’ll discuss in Vol. 2 can be extended to achieve perfect linear scalability (at least in theory).


16 in particular, it means that we can rewrite our DB FSM from Single-connection to Multiple-connections without changing anything else in the system
17 And it is not a generic “all the problems are waiting for the worst moment to happen” observation (which is actually pure perception), but the real deal. When the probability of the problem depends on site load in a non-linear manner (and this is the case for transaction isolation bugs), the chances of it happening for the first time exactly during your heavily advertised Event of the Year are huge.

DB Server: Bottom Line.

Unless you happen to have on your team a DB gal with real-world experience of dealing with locks, deadlocks, and transaction isolation levels for your specific DB under at least a million-per-day DB write-transaction load – go for the single-connection approach.

If you do happen to have such a DB guru who vehemently opposes going single-connection – you can try multi-connection, at least if she’s intimately familiar with SELECT FOR UPDATE and practical ways of avoiding deadlocks (and no, using the RDBMS’s built-in mechanism to detect the deadlock 10 seconds after it happens is usually not good enough).

And in any case, stay away from anything which includes SQL in your Game Server FSMs.

Failure Modes & Effects

When speaking about deployment, one all-important question which you’d better have an answer to, is the following: “What will happen if some piece of hardware fails badly?” Of course, within the scope of this book we won’t be able to do a formal full-scale FMEA for an essentially unknown architecture, but at least we’ll be able to give some hints in this regard.

Communication Failures

So, what can possibly go wrong within our deployment architecture? First of all, there are (not shown, but existing) switches (or even firewalls) residing between our servers; while these can be made redundant, their failures (or transient software failures of the network stack on hosts) may easily cause occasional packet loss, and also (though extremely infrequently) may cause TCP disconnects on inter-server connections. Therefore, to deal with it, our Server-to-Server protocols need to account for potential channel loss and allow for guaranteed recovery after the channel is restored. Let’s write this down as a requirement and remember until Chapter [[TODO]], where we will describe our protocols.

Server Failures

In addition, of course, any of the servers can go badly wrong. There are tons of solutions out there claiming to address this kind of failure, but you should keep in mind that usually, the stuff marked as “High Availability” doesn’t help with losing in-memory state: what you need if you want to avoid losing in-memory state is “Fault-Tolerant” techniques (see the “Server Fault Tolerance: King is Dead, Long Live the King!” section below).

Fortunately, though, for reasonably good hardware (hardware which has reasonably good monitoring, including fans, and which at least has ECC and RAID, see Chapter [[TODO]] for more discussion on it), such fatal server failures are extremely rare. From my experience (and more or less consistently with manufacturer estimates), the failure rate for reasonably good server boxes (such as those from one of the Big Three major server vendors) is somewhere between “once per 5 years” and “once per 10 years”, so if you have only one such server (and unless you’re running a stock exchange), you’d be pretty much able to ignore this problem completely. However, if you have 100 servers – the failure rate goes up to “once or twice a month”, which is unacceptable if such a failure leads to the whole site going down.

Therefore, at the very least you should plan to make sure that a single failure of a single server doesn’t bring your whole site down. BTW, most of the time it will be a Game World Server going down, as you’re likely to have many more of these than of the other servers, so at first stages you may concentrate on containment of Game World server failures. Also we can note that, counter-intuitively, failures of the DB Server are not that important to deal with;18 not because they have less impact (they do have much more impact), but because they’re much less likely to happen than a failure of one-of-the-Game-World-servers.


18 that is, beyond keeping a DB backup with DB logs being continuously moved to another location, see Chapter [[TODO]] for further discussion

 

Containment of Game World server failures

The very first (and very obvious) technique to minimize the impact of a Game World server failure on the whole site is to make sure that your Game World reports relevant changes (without sending the whole state) to the DB Server as soon as they occur. This way, if a Game World server fails, it can be restarted from scratch, losing all the changes since the last save-to-DB, but at least preserving previous results. These saves-to-DB are best done at naturally arising points within your game flow.

For example, if your game is essentially a Starcraft- or Titanfall-like sequence of matches, then the end of each match represents a very natural save-to-DB point. In other words, if Game World server fails within the match – all the match data will be lost, but all the player standings will be naturally restored as of beginning of the match, which isn’t too bad. In another example, for a casino-like game the end of each “hand” also represents the natural save-to-DB point.

If your gameplay is an MMORPG with continuous gameplay, then you need to find a way to save-to-DB all the major changes of the players’ stats (such as “level has been gained”, or “artifact has changed hands”). Then, if the Game Server crashes, you may lose the current positions of PCs within the world and a few hundred XP per player, but players will still keep all their important stats and achievements more or less preserved.

Two words of caution with regards to save-to-DB points. First,

For synchronous games, don’t try to keep the whole state of your Game Worlds in DB

Except for some rather narrow special cases (such as stock exchanges and some slow-paced and/or “asynchronous” games as defined in Chapter I), saving all the state of your game world into the DB won’t work due to performance/scalability reasons (see discussion in the “Taming DB Load: Write-Back Caches and In-Memory States” section above). Also keep in mind that even if you were able to perfectly preserve the current state of the game-event-currently-in-progress (with a game event being a “match”, “hand”, or an “RPG fight”) without killing your DB, there is another very big practical problem of a psychological rather than technical nature. Namely, if you disrupt the game-event-currently-in-progress for more than 0.5-2 minutes, for almost any synchronous multi-player game you won’t be able to get the same players back, and will need to roll back the game event anyway.

For example, if you are running a bingo game with a hundred players, and you disrupt it for 10 minutes for technical reasons, you won’t be able to continue it in a manner which is fair to all the players, at the very least because you won’t be able to get all those 100 players back into playing at the same time. The problem is all about numbers: for a two-player game it might work; for 10+ players, succeeding in getting all the players back at the same time is extremely unlikely (that is, unless the event is about a Big Cash Prize). I’ve personally seen a large commercial game that handled crashes in the following way: to restore after the crash, first, it rolled forward its DB at the DB level to get a perfectly correct current state, and then it rolled all the current game-events back at the application level, exactly because continuing these events wasn’t a viable option due to the lack of players.

Trying to keep all the state in the DB is a common pitfall which arises when guys-coming-from-single-player-casino-game-development are trying to implement something multiplayer. Once again: don’t do it. While for a single-player casino game having the state stored in the DB is a big fat Business Requirement (and is easily doable too), for multi-player games it is neither a requirement, nor is it feasible (at least because of the can’t-get-the-same-players-together problem noted above). Think of a Game World server failure as a direct analogy of a fire-in-a-brick-and-mortar-casino in the middle of the hand: the very best you can possibly do in this case is to abort the hand, return all the chips to their respective owners (as of the beginning of the hand), and run out of the casino, just to come back later when the fire is extinguished, so you can start an all-new game with all-new players.

The second pitfall on this way is related to DB consistency issues and DB API.

Your DB API MUST enforce logical consistency

For example, if (as a part of your very own DB API) you have two DB requests, one of which says “Give PC X artifact Y”, and another one “Take artifact Y from PC X”, and you are trying to report an occurrence of “PC X took over artifact Y from PC XX” as two separate DB requests (one “Take” and one “Give”), you’re risking that in case of a Game World server failure, one of these two requests will go through, and the other one won’t, so the artifact will get lost (or will be duplicated) as a result. Instead of using these two requests to simulate the “taking over” occurrence, you should have a special DB request “PC X took over artifact Y from PC XX” (and it should be implemented as a single DB transaction within the DB FSM); this way at least the consistency of the system will be preserved, so whatever happens – there is still exactly one artifact. The very same pattern MUST be followed for passing around anything of value, from casino chips to artifacts, with any other goodies in between.
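As a minimal sketch of the DB-FSM side of such a request (with DBConnection, execUpdate(), and the table/column names being assumptions for illustration – your own DB layer will look different):

#include <cstdint>

//DBConnection is a hypothetical thin wrapper around your DB library;
//  only the calls used below are sketched (no bodies shown)
class DBConnection {
  public:
  void beginTransaction();
  void commit();
  void rollback();
  int execUpdate(const char* sql, uint64_t p1, uint64_t p2, uint64_t p3);//returns #affected rows
};

struct DBRequest_TakeOverArtifact { //"PC X took over artifact Y from PC XX"
  uint64_t to_pc, from_pc, artifact_id;
};

//inside the DB FSM
bool handleTakeOverArtifact(DBConnection& db, const DBRequest_TakeOverArtifact& req) {
  db.beginTransaction();
  //single UPDATE, conditional on the current owner: either the whole takeover
  //  happens, or nothing does - the artifact can be neither lost nor duplicated,
  //  whatever crashes around us
  int rows = db.execUpdate(
    "UPDATE artifacts SET owner_pc=? WHERE artifact_id=? AND owner_pc=?",
    req.to_pc, req.artifact_id, req.from_pc);
  if(rows == 1) {
    db.commit();
    return true;
  }
  db.rollback();//wrong current owner (or no such artifact) - report failure
  return false;
}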

Server Fault Tolerance: King is Dead, Long Live the King!

If you want to have your servers to be really fault-tolerant, there are some ways to have your cake and eat it too.

However, keep in mind that all fault-tolerant solutions are complicated and costly, and in the games realm I generally consider them over-engineering (even by my standards).
Fault-Tolerant Servers: Damn Expensive

Historically, fault-tolerant systems were provided by damn-expensive hardware such as [Stratus] (I mean their hardware solutions such as ftServer; see the discussion on hardware-vs-software redundancy in Chapter [[TODO]]) and [HPIntegrityNonStop], which have everything doubled (and CPUs often quadrupled(!)) to avoid all single points of failure, and these tend to work very well. But they’re usually way out of a game developer’s reach for financial reasons, so unless your game is a stock exchange – you can pretty much forget about them.

Fault-Tolerant VMs

Fault-Tolerant VMs (such as VMWare FT feature or Xen Remus) are quite new kids on the block (for example, VMWare FT got beyond single vCPU only in 2015), but they’re already working. However, there are some significant caveats.  Take everything I’m saying about fault-tolerant VMs with a really good pinch of salt, as  all the technologies are new and evolving, and information is scarce; also I admit that I didn’t have a chance to try these things myself :-( .

When you’re using a fault-tolerant VM, the Big Picture looks like this: you have two commodity servers (usually right next to each other), connect them via 10G Ethernet, run VM on one of them (the “primary” one), and when your “primary” server fails, your VM magically reappears on the “secondary” box. From what I can see, modern Fault-Tolerant VMs are using one of two technologies: “virtual lockstep” and “fast checkpoints”. Unfortunately, each of them has its own limitations.

Virtual Lockstep: Not Available Anymore?

The concept of virtual lockstep is very similar to our QnFSM (with the whole VM treated as an FSM). Virtual lockstep takes one single-core VM, intercepts all the inputs, passes these inputs to the secondary server, and runs a copy of the VM there. As with any other fault-tolerant technology, virtual lockstep causes additional latencies, but it seems to be able to restrict its appetite for additional latency to a sub-ms range, which is acceptable for most of the games out there. Virtual lockstep is the method of fault tolerance which vSphere was using prior to vSphere 6. The downside of virtual lockstep is that it (at least as implemented by vSphere) wasn’t able to support more than one core. For our QnFSMs, this single-core restriction wouldn’t be too much of a problem, as they’re single-threaded anyway (though balancing FSMs between VMs would be a headache), but there are lots of applications out there which are still heavily multithreaded, so it was considered an unacceptable restriction. As a result, starting from vSphere 6, vSphere has changed its fault-tolerant implementation from virtual lockstep to a checkpoint-based implementation. As of now, I don’t know of any supported implementations of Virtual Lockstep :-( .

Checkpoint-Based Fault Tolerance: Latencies

To get around the single-core limitation, a different technique, known as “checkpoints”, is used by both Xen Remus and vSphere 6+. The idea behind checkpoints is to take a kind of incremental snapshots (“checkpoints”) of the full state of the system and log them to a safe location (“secondary server”). As long as you don’t let anything out of your system before the corresponding “checkpoint” is committed to the secondary server, all the calculations you’re making in the meantime are inherently unobservable from the outside, so in case of a “primary” server failure, the outside world cannot tell the difference from the primary simply never having received the incoming data at all. It means that for the world outside of your system, your system (except for the additional latency) becomes almost-indistinguishable19 from a real fault-tolerant server such as Stratus (see above). In theory, everything looks perfect, but with VM checkpoints we seem to hit a wall with checkpoint frequency, which defines the minimum possible latency. On systems such as VMWare FT and Xen Remus, checkpoint intervals are measured in dozens of milliseconds. If your game is ok with such delays – you’re fine, but otherwise – you’re out of luck :-( . For more details on checkpoint-based VMs, see [Remus].

Apart from the latencies (and the need to have 10G connections between the servers, which is not that big a deal), checkpoint-based fault tolerance has several significant advantages over virtual lockstep; these include such important things as support for multiple CPU cores, and N+1 redundancy.


19 strictly speaking, the difference can be observed as some network packets may be lost, but as packet loss is a normal occurrence, any reasonable protocol should deal with transient packet loss anyway without any observable impact

 

Complete Recovery from Game World server failures: DIY Fault-Tolerance in QnFSM World

If you’re using FSMs (as you should anyway), you can also implement your own fault-tolerance. I should confess that I didn’t try this approach myself, so despite looking very straightforward, there can be practical pitfalls which I don’t see yet. Other than that, it should be as fault-tolerant as any other solution mentioned above, and it should provide good latencies too (well in sub-ms range).

As with any other fault-tolerant solution, for games IMHO it is over-engineering, but if I felt strongly about failures causing per-game-event rollbacks, this is the one I’d try first. It is latency-friendly, it allows for N+2 redundancy (saving you from doubling the number of your servers as in 1+1 redundancy schemas), and it plays really well alongside our FSM-related stuff.

The idea here is to have separate Logging Servers logging all the events going to all the FSMs residing on your Game World servers; then you will essentially have enough information on your Logging Servers to recover from a Game World server failure. More specifically, you can do the following (a tiny roll-forward sketch follows the list):

  • have an additional Logging Server(s) “in front of Game Servers”; these Logging Server(s) perform two functions:

    • log all the messages incoming to all Game Server FSMs

      • these include: messages coming from clients, messages coming from other Game Servers, and messages coming from DB Server

      • moreover, even communications between different FSMs residing on the same Game Server, need to go via Logging Server and need to be logged

    • timestamp all the incoming messages

  • all your Game Server FSMs need to be strictly-deterministic

    • in particular, Game Server FSMs won’t use their own clocks, but will use timestamps provided by Logging Servers instead

  • In addition, from time to time each of the Game Server FSMs needs to serialize its whole state, and report it to the Logging Server

  • then, we need to consider two scenarios: Logging Server failure and Game Server failure (we’ll assume that they never fail simultaneously, and such an event is indeed extremely unlikely unless it is a fire-in-datacenter or something)

    • if it is Logging Server which fails, we can just replace it with another (re-provisioned) one; there is no game-critical data there

    • if it is Game Server which fails, we can re-provision it, and then roll-forward each and every FSM which was running on it, using last-reported-state and logs-saved-since-last-reported-state saved on the Logging Server. Due to the deterministic nature of all the FSMs, the restored state will be exactly the same as it was a few seconds ago20

      • at this point, all the clients and servers which were connected to the FSM, will experience a disconnect

      • on disconnect, the clients should automatically reconnect anyway (this needs to account for an IP change, which is a medium-sized headache, but is doable; in the [[TODO]] section we’ll discuss Front-End servers which will isolate clients from disconnects completely)

      • issues with server-to-server messages should already be solved as described in “Communication Failures” subsection above
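A tiny sketch of the roll-forward step for one such FSM is shown below; the GameServerFSM/LoggedEvent names and the deserialize()/process() signatures are assumptions made purely for illustration – the only thing that really matters is that the FSM is strictly deterministic:

#include <cstdint>
#include <vector>

struct LoggedEvent { uint64_t timestamp; /* plus the serialized message */ };//hypothetical

class GameServerFSM { //stand-in for your deterministic FSM (no bodies shown)
  public:
  void deserialize(const std::vector<uint8_t>& state);//restore a serialized snapshot
  void process(const LoggedEvent& ev);                //handle one logged event
};

void rollForward(GameServerFSM& fsm,
                 const std::vector<uint8_t>& last_reported_state,
                 const std::vector<LoggedEvent>& events_since_last_report) {
  fsm.deserialize(last_reported_state);//restore the snapshot kept by the Logging Server
  for(const LoggedEvent& ev : events_since_last_report)
    fsm.process(ev);                   //same inputs + determinism
                                       //  => exactly the same resulting state
}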

In a sense, this “Complete Recovery” thing is conceptually similar to EventProcessorWithCircularLog from Chapter V (but with logging residing on different server, and with auto-rollforward in case of server failure), or to a traditional DB restore-and-log-rollforward.

Note that only hardware problems (and software bugs outside of your FSMs, such as OS bugs) can be addressed with this method; bugs within your FSM will be replayed and will lead to exactly the same failure :-( .

Last but not least, I need to re-iterate that I would object to any fault-tolerant schema for most of the games out there on the basis of over-engineering, though I admit that there might be good reasons to try achieving it, especially if it is not too expensive/complicated.


20 or, in case of almost-strictly-deterministic FSMs such as those CUDA-based ones, it will be almost-exactly-the-same
[[TODO!]] DIY Virtual-Lockstep

Classical Game Deployment Architecture: Summary

To summarize the discussion above about Classical Game Deployment Architecture:

  • It works
  • It can and should be implemented using QnFSM model with deterministic FSMs, see discussion above for details
  • Your communication with DB (DB API) SHOULD use game-specific requests, and SHOULD NOT use any SQL; all the SQL should be hidden behind your DB FSM(s)
  • Your first DB Server SHOULD use single-connection approach, unless you happen to have a DB guy who has real-world experience with multi-connection systems under at least millions-per-day write(!) transaction loads
    • Even in the latter case, you SHOULD try to convince him, but if he resists, it is ok to leave him alone, as long as external DB API is still exactly the same (message-based and expressed in terms of whatever-your-game-needs). This will provide assurance that in the extreme case, you’ll be able to rewrite your DB Server later.

[[To Be Continued…

This concludes beta Chapter VI(a) from the upcoming book “Development and Deployment of Massively Multiplayer Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter VI(b), “Modular Architecture: Server-Side. Throwing in Front-End Servers.]]


References

[Lightstreamer] http://www.lightstreamer.com/

[Redis.CAS] http://redis.io/topics/transactions#cas

[Zubek2016] Robert Zubek, “Private communications with”

[Zubek2010] Robert Zubek, “Engineering Scalable Social Games”, GDC2010

[Stratus] “Stratus Technologies”, Wikipedia

[HPIntegrityNonStop] “HP Integrity NonStop”, Wikipedia

[Remus] Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield, “Remus: High Availability via Asynchronous Virtual Machine Replication”

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.


Implementing Queues for Event-Driven Programs

Multiple Writers Single Reader Queue

[[This is Chapter XIII(d) from “beta” Volume 2 of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “Book Beta Testing“. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

We’ve already discussed things related to sockets; now let’s discuss the stuff which is often needed (in particular, it is of Utmost Importance when implementing Reactors), but which is not common enough to be universally available as a part of the operating system.

I’m speaking about queues in the context of inter-thread communications (where “threads” are usual preemptive threads able to run on different cores, and not cooperative ones a.k.a. fibers). And not only just about “some” implementation of queue, but about queues which have certain properties desirable for our Reactors a.k.a. ad-hoc Finite State Machines a.k.a. Event-Driven Programs.

Simple MWSR Queue

What we usually need from our Queue, is an ability to push asynchronous messages/events there (usually from different threads), and to get them back (usually from one single thread) – in FIFO order, of course. Such Multiple-Writer-Single-Reader queues are known as MWSR queues. In our case, reading from an empty queue MUST block until something appears there; this is necessary to avoid polling. On the other hand, writing MAY block if the queue is full, though in practice this should happen Really Rarely.

Let’s consider the following simple implementation (with no blocking, as our queue cannot become “full”):

#include <cassert>
#include <condition_variable>
#include <mutex>
#include <utility>

template <class Collection>
class MWSRQueue {
  private:
  std::mutex mx;
  std::condition_variable waitrd;
  Collection coll;
  bool killflag = false;

  public:
  using T = typename Collection::value_type;
  MWSRQueue() {
  }

  void push_back(T&& it) {
    //as a rule of thumb, DO prefer move semantics for queues
    //   it reduces the number of potential allocations
    //   which happen under the lock(!), as such extra
    //   unnecessary allocations due to unnecessary copying
    //   can have Bad Impact on performance
    //   because of significantly increased mutex contention
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    coll.push_back(std::move(it));
    }//unlocking mx

    waitrd.notify_one();
    //Yep, notifying outside of lock is usually BETTER.
    //  Otherwise the other thread would be released
    //  but will immediately run into
    //  our own lock above, causing unnecessary
    //  (and Very Expensive) context switch
  }

  std::pair<bool,T> pop_front() {
    //returns pair<true,popped_value>,
    //  or – if the queue is being killed - <false,T()>
    std::unique_lock<std::mutex> lock(mx);
    while(coll.size() == 0 && !killflag) {
      waitrd.wait(lock);
    }
    if(killflag)
      return std::pair<bool,T>(false,T());
        //creates an unnecessary copy of T(),
        //  but usually we won’t care much at this point

    assert(coll.size() > 0);
    T ret = std::move(coll.front());
    coll.pop_front();
    lock.unlock();
    return std::pair<bool,T>(true, std::move(ret));
  }

void kill() {
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    killflag = true;
    }//unlocking mx

    waitrd.notify_all();
  }

};

[[TODO!:test!]]

This is a rather naïve implementation of MWSR queues, but – it will work for quite a while, and it uses only very standard C++11, so it will work pretty much everywhere these days. More importantly, it does implement exactly the API which you need: you can push the items back from other threads, you can read your items from a single thread, and you can request that the wait (if any) is aborted, so your thread can terminate (for example, if you want to terminate your app gracefully). Moreover, our queue provides the whole API which you’ll ever need from your queue; this IS important as it means that you can re-implement your queue later if necessary in a better-performing manner, and nobody will notice the difference.

A nice (though side) implementation detail is that our template class MWSRQueue can use any collection which implements usual-for-std-containers functions push_back(), pop_front(), and front(). It means that you can use std::list<> or std::deque<> directly, or make your own class which satisfies this API (for example, you can make your own prioritized queue1). Oh BTW, and (by pure accident) it seems to be exception-safe too (even in a strong sense2).

OTOH, this naïve implementation has several significant drawbacks, which MAY come into play as soon as we become concerned about performance and reliability. Let’s see these drawbacks one by one.


1 Note that std::priority_queue<> as such does NOT guarantee the order in case of elements with equal priority, so to make a FIFO-queue-with-priority out of it, you’ll need to make another adapter which assigns a number-of-item-since-very-beginning to each item, and then sorts by the tuple (priority, number_of_item_since_very_beginning) – and DON’T forget about potential wraparounds too! (that is, unless you’re using uint64_t as your number_of_item_since_very_beginning, in which case in most practical cases you can demonstrate that wraparound will never happen). A sketch of such an adapter follows these footnotes.
2 assuming that your type T has a moving constructor with a no-throw guarantee, which it usually does
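Here is a minimal sketch of the FIFO-within-priority adapter from footnote 1; it is built over std::map rather than std::priority_queue<> (so that the front element stays movable), it assumes that T has an int priority member (an assumption made for this sketch), and it exposes exactly the push_back()/size()/front()/pop_front() API which MWSRQueue above expects:

#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

template<class T> //T is assumed to have an 'int priority' member
class PrioritizedFifo {
  //key: (-priority, seq), so higher priorities come first,
  //  and within the same priority - strictly FIFO by sequence number
  std::map<std::pair<int,uint64_t>, T> items;
  uint64_t next_seq = 0;//uint64_t, so wraparound is not a practical concern

  public:
  using value_type = T;

  void push_back(T&& t) {
    int prio = t.priority;//read before moving t
    items.emplace(std::make_pair(-prio, next_seq++), std::move(t));
  }
  size_t size() const { return items.size(); }
  T& front() { return items.begin()->second; }
  void pop_front() { items.erase(items.begin()); }
};

Usage would then be simply MWSRQueue<PrioritizedFifo<MyEvent>> (with MyEvent being your own event type carrying a priority field).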

 

Fixed-Size Queues

As our class MWSRQueue above is organized, the queue size may grow indefinitely. This might look like a Good Thing from a theoretical point of view (“hey, we don’t put any limits on our Queue”), but in the real world it often causes severe issues 🙁 . For example, if for some reason one of your servers/Reactors starts to delay processing (or even hangs), such infinite-sized queues can easily eat up all the available RAM, causing swapping or denial of allocations, and potentially affecting MANY more players than it should.

Flow Control

Let’s consider what will happen in the case of one of the Reactors hanging/slowing-down if we limit the size of ALL our queues within the system.

If we limit sizes of ALL our queues, AND all our connections are TCP, then in case of severe overload the following scenario will unfold. First, one queue (the one close to the slow Reactor) will get full; in turn, queue being full will cause TCP thread which fills it, to block.3 Then, the TCP thread on the other side of TCP connection will find that it cannot push data into TCP, so it will block too. Then, the queue which feeds that TCP thread on pushing side, will get full. Then, the sending Reactor’s supposedly-non-blocking function sendMessage(), will be unable to push the message into the queue-which-just-became-full, so our supposedly-non-blocking Reactor will block.

As we can see, when working with all flow-controlled transports (TCP is flow-controlled, and fixed-size queue is flow-controlled too), severe delays tend to propagate from the target to the source. Whether it is good or not – depends on specifics of your system, though from what I’ve seen, in most cases such propagating delays are at least not worse than exhausting RAM which happens in case of infinite queues.

Also, it gives us back control over what-we-want-to-do in case of such a problem. For example, to prevent one Reactor (which processes messages from pretty much independent channels and feeds them to different Reactors) from blocking all the channels when one of the target Reactors is slow or hanged, we MAY be able to “teach” our single Reactor to postpone just the messages from the affected channel, while working with the other channels as usual. Implementing it would require two things: (a) adding a trySendMessage() function, which tries to send a message and returns a “send wait handle” if the sending is unsuccessful, and (b) adding a list_of_wait_handles parameter to the pop_front() function, with the understanding that if some space becomes available in any of the “send wait handle”s, pop_front() stops the wait and returns the “send wait handle” to the caller (and then infrastructure code will need to send a message/call a callback or continuation from our Reactor).


3 in case when there is no TCP between Reactors so that Reactors are interacting directly, sending Reactor’s supposedly-non-blocking sendMessage() will block immediately, as described below

 

Dropping Packets

When dealing with messages coming over TCP or over internal communications, we’re usually relying on ALL the messages being delivered (and in order too); that’s why dropping messages is usually not an option on these queues.4

However, for UDP packets, there is always an option to drop them if the incoming queue is full;5 this is possible because any UDP packets can be dropped anyway, so that our upper-level protocols need to handle dropped packets regardless of us dropping some packets at application level. Moreover, we can implement a selective packet drop if we feel like it (for example, we can drop less important traffic in favor of more important one).


4 strictly speaking, if you DO implement reliable inter-Server communications as described in Chapter III, you MAY be able to force-terminate TCP connection, AND to drop all the messages from that connection from the queue too. Not sure whether it is ever useful though.
5 Or even almost-full, see, for example, [RED] family of congestion avoidance algorithms

 

Full Queues are Abnormal. Size. Tracking

Regardless of the choice between blocking and dropping outlined above, full queues SHOULD NOT happen during normal operation; they’re more like a way to handle scenarios when something has Already Gone Wrong, and to recover from them while minimizing losses. That’s why it is Really Important to keep track of all the queue blocks (due to the queue being full), and to report them to your monitoring system; for this purpose, our queues should provide counters so that infrastructure code can read them and report to a system-wide monitor (see more on monitoring in Vol.3).

Now let’s discuss a question of maximum size of our fixed-size queues. On the one hand, we obviously do NOT want to have any kind of swapping because of the memory allocated to our fixed-size queues. On the other hand, we cannot have our queues limited to maximum size of 2 or 3. If our queue is too small, then we can easily run into scenarios of starvation, when our Reactor is effectively blocked by the flow control mechanisms from doing things (while there is work somewhere in the system, it cannot reach our Reactor). In the extreme cases (and ultra-small sizes like 2 or 3), it is possible even to run into deadlocks (!).6

My recommendation when it comes to maximum size of the queues, goes as follows:

  • DO test your system with all queue sizes set to 1
    • see whether you have any deadlocks
      • if yes – DO understand whether you really need those dependencies which are causing deadlocks
        • if yes – DO establish such limits on minimum queue sizes, which guarantee deadlock-free operation
  • Start with maximum size of between 100 and 1000; most of the time, it should be large enough to stay away from blocks and also to avoid allocating too much memory for them.
  • DO monitor maximum sizes in production (especially “queue is full” conditions), and act accordingly

6 There is a strong argument that deadlocks SHOULD NOT happen even with all queue sizes == 1. I would not say that this qualifies as a firm rule, however, I do agree that if using flow-controlled queues, you SHOULD test your system with all queue sizes set to 1, see below

 

Implementing Fixed-Size Queue with Flow Control

Now, after we’ve specified what we want, we’re ready to define our own Fixed-Size Queues. Let’s start with a Fixed-Size Queue with Flow Control:

#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>

template <class FixedSizeCollection>
class MWSRFixedSizeQueueWithFlowControl {
  private:
  std::mutex mx;
  std::condition_variable waitrd;
  std::condition_variable waitwr;
  FixedSizeCollection coll;
  bool killflag = false;

  //stats:
  int nfulls = 0;
  size_t hwmsize = 0;//high watermark on queue size

  public:
  using T = typename FixedSizeCollection::value_type;

  MWSRFixedSizeQueueWithFlowControl() {
  }
  void push_back(T&& it) {
    //if the queue is full, BLOCKS until some space is freed
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    while(coll.is_full() && !killflag) {
      waitwr.wait(lock);
      ++nfulls;
      //this will also count spurious wakeups,
      //  but they’re supposedly rare
    }

    if(killflag)
      return;
    assert(!coll.is_full());
    coll.push_back(std::move(it));
    size_t sz = coll.size();
    hwmsize = std::max(hwmsize,sz);
    }//unlocking mx

    waitrd.notify_one();
  }

  std::pair<bool,T> pop_front() {
    std::unique_lock<std::mutex> lock(mx);
    while(coll.size() == 0 && !killflag) {
      waitrd.wait(lock);
    }
    if(killflag)
      return std::pair<bool,T>(false,T());

    assert(coll.size() > 0);
    T ret = std::move(coll.front());
    coll.pop_front();
    lock.unlock();
    waitwr.notify_one();

    return std::pair<bool,T>(true, std::move(ret));
  }

  void kill() {
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    killflag = true;
    }//unlocking mx

  waitrd.notify_all();
  waitwr.notify_all();
  }
};

Implementing Fixed-Size Queue with a Drop Policy

And here goes a Fixed-Size Queue with a Drop Policy:

#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>

template <class FixedSizeCollection, class DropPolicy>
  // DropPolicy should have a function
  //    pushAndDropOne(T&& t, FixedSizeCollection& coll)
  //    it MAY either skip t,
  //    OR drop something from coll while pushing t
class MWSRFixedSizeQueueWithDropPolicy {
  private:
  DropPolicy drop;
  std::mutex mx;
  std::condition_variable waitrd;
  FixedSizeCollection coll;
  bool killflag = false;

  //stats:
  int ndrops = 0;
  size_t hwmsize = 0;//high watermark on queue size

  public:
  using T = typename FixedSizeCollection::value_type;

  MWSRFixedSizeQueueWithDropPolicy(const DropPolicy& drop_)
  : drop(drop_) {
  }

  void push_back(T&& it) {
    //if the queue is full, calls drop.pushAndDropOne()
    {//creating a scope for lock
    std::unique_lock<std::mutex> lock(mx);

    if(coll.is_full()) {//you MAY want to use
                        //  unlikely() here
      ++ndrops;
      drop.pushAndDropOne(std::move(it), coll);
      return;
    }

    assert(!coll.is_full());
    coll.push_back(std::move(it));
    size_t sz = coll.size();
    hwmsize = std::max(hwmsize,sz);
    }//unlocking mx

    waitrd.notify_one();
  }

  std::pair<bool,T> pop_front() {
    std::unique_lock<std::mutex> lock(mx);
    while(coll.size() == 0 && !killflag) {
      waitrd.wait(lock);
    }

    if(killflag)
      return std::pair<bool,T>(false,T());
    assert(coll.size() > 0);
    T ret = std::move(coll.front());
    coll.pop_front();
    lock.unlock();
    return std::pair<bool,T>(true, std::move(ret));
  }

  void kill() {
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    killflag = true;
    }//unlocking mx

    waitrd.notify_all();
  }
};
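As a usage illustration, here is one trivial DropPolicy (purely an example, not something prescribed above): when the queue is full, it throws away the oldest item to make room for the new one:

template<class FixedSizeCollection>
struct DropOldestPolicy {
  using T = typename FixedSizeCollection::value_type;
  void pushAndDropOne(T&& t, FixedSizeCollection& coll) {
    coll.pop_front();             //drop the oldest item...
    coll.push_back(std::move(t)); //...to make room for the newcomer
  }
};

//usage (MyColl being your fixed-size collection, MyEvent your event type):
//  MWSRFixedSizeQueueWithDropPolicy<MyColl, DropOldestPolicy<MyColl>>
//    q(DropOldestPolicy<MyColl>());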

Performance Issues

As we’re running our system, we MAY run into performance issues; sometimes, it is those queues which cause us trouble.

With queues-implemented-over-mutexes like the ones we’ve written above, the most annoying thing performance-wise is that there is a chance that the OS’s scheduler can force the preemptive context switch right when the thread-being-preempted-is-owning-our-mutex. This will cause quite a few context switches going back and forth. Such unnecessary context switches have a Big Fat impact on the performance 🙁 (as discussed in [TODO], context switch can cost up to a million CPU clocks7).


7 Most of the time, such Bad Cases won’t apply to the kind of context switches we’re discussing here, but several context switches each costing 10K CPU clocks, is already Pretty Bad

 

To deal with it, two approaches are possible. Approach #1 would be simply to

Reduce Time Under Lock

As we reduce the time spent under the mutex lock, the chances of that unfortunate context switch can be reduced to almost zero (if we’re doing a Really Good Job, time-under-lock can be as little as a hundred CPU clocks, so the chances of being force-switched there become very minimal). And without the lock being occupied, the time to acquire/release the lock usually becomes just two atomic/LOCK/Interlocked operations (and you cannot really do better than that).

Removing Allocations from Under the Lock

A mathematician is asked “how to boil water?” His answer goes as follows:

Let’s consider two cases. In the first case, there is no water in the kettle.

Then, we need to light a fire, put some water into the kettle,

place the kettle over the fire, and wait for some time.

In the second case, there is water in the kettle.

Then we need to pour the water out, and the problem is reduced to the previous case.

— A mathematician who Prefers to stay Anonymous —

Now, let’s see what we can do to reduce time under the lock. If we take a closer look at our class MWSRQueue, we’ll realize that all the operations under the lock are very minimal, except for potential allocations (and/or O(N) operations to move things around).

The problem is that none of the existing std:: containers provides a guarantee that there are neither allocations/deallocations nor O(N) operations within their respective push_back() and pop_front() operations.

  • std::list<>::push_back()/pop_front() – allocation/deallocation; some implementations MAY cache or pool allocations, but such optimizations are implementation-specific 🙁
  • std::vector<>::erase(begin()) (as a replacement for pop_front()) – O(N)
  • std::deque<>::push_back()/pop_front() – allocation/deallocation; some implementations MAY cache or pool allocations, but such optimizations are implementation-specific 🙁

I know of two ways to deal with this problem. First, it is possible to use some kind of pool allocator and feed it to std::list<> or std::deque<> (effectively guaranteeing that all the items are always taken from the pool and nothing else). However, IMO this solution, while workable, looks too much like the way the mathematician gets the kettle boiled (see the epigraph to this subsection).

Instead, I suggest to do the following:

  • If you need an infinite-size queue, you can use “intrusive lists” (allocating list elements outside the mutex lock, and reducing contention); a sketch of this follows the circular-buffer code below
  • If you need a fixed-size queue, then you can create your own Collection based on circular buffer along the following lines:
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>

template<class T, size_t maxsz_bits>
class CircularBuffer {
  static constexpr size_t bufsz = 1 << maxsz_bits;
  static constexpr size_t maxsz = bufsz - 1;
    //-1 to make sure that head==tail always means ‘empty’
  static constexpr size_t mask = maxsz;
  size_t head = 0;
  size_t tail = 0;
  alignas(T) uint8_t buffer[bufsz*sizeof(T)];

  T* slot(size_t idx) {
    //buffer[] is raw bytes, so the index needs to be scaled by sizeof(T)
    return reinterpret_cast<T*>(&buffer[idx*sizeof(T)]);
  }

  public:
  using value_type = T;

  size_t size() const {
    return head - tail +
      (((size_t)(head>=tail)-(size_t)1) & bufsz);
      //trickery to avoid pipeline stalls via arithmetic
      //supposedly equivalent to:
      //if(head >= tail)
      //  return head - tail;
      //else
      //  return head + bufsz - tail;
  }

  bool is_full() const {
    return size() == maxsz;
  }

  void push_back(T&& t) {
    assert(size() < maxsz);
    new(slot(head)) T(std::move(t));//placement new into the raw buffer
    head = ( head + 1 ) & mask;
  }

  T& front() {
    assert(size() > 0);
    return *slot(tail);
  }

  T pop_front() {
    assert(size() > 0);
    T ret = std::move(*slot(tail));
    slot(tail)->~T();
    tail = ( tail + 1 ) & mask;
    return ret;
  }
};
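And here is a sketch of the first bullet above (the infinite-size case). Strictly speaking, a true intrusive list would embed the next pointer into T itself; this sketch keeps a small wrapper node to stay generic, but it preserves the key point – allocation and deallocation happen outside the mutex lock, so the only work under the lock is a couple of pointer assignments. Treat it as a sketch under these assumptions, not as a drop-in replacement for MWSRQueue:

#include <cassert>
#include <condition_variable>
#include <mutex>
#include <utility>

template<class T>
class MWSRIntrusiveQueue {
  struct Node {
    T item;
    Node* next = nullptr;
    explicit Node(T&& t) : item(std::move(t)) {}
  };
  std::mutex mx;
  std::condition_variable waitrd;
  Node* head = nullptr;//oldest element
  Node* tail = nullptr;//newest element
  bool killflag = false;

  public:
  void push_back(T&& it) {
    Node* n = new Node(std::move(it));//allocation is OUTSIDE the lock
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    if(tail)
      tail->next = n;
    else
      head = n;
    tail = n;
    }//unlocking mx
    waitrd.notify_one();
  }

  std::pair<bool,T> pop_front() {
    Node* n;
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    while(head == nullptr && !killflag)
      waitrd.wait(lock);
    if(killflag)
      return std::pair<bool,T>(false,T());
    n = head;
    head = n->next;
    if(head == nullptr)
      tail = nullptr;
    }//unlocking mx
    T ret = std::move(n->item);
    delete n;//deallocation is also OUTSIDE the lock
    return std::pair<bool,T>(true, std::move(ret));
  }

  void kill() {
    {//creating scope for lock
    std::unique_lock<std::mutex> lock(mx);
    killflag = true;
    }//unlocking mx
    waitrd.notify_all();
    //NB: items still in the queue at kill() time are leaked in this sketch
  }
};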

 

Removing locks completely

The second approach is MUCH more radical – it is the one to remove locks completely. And at the first glance, it seems that it is easy to find an appropriate “lockless queue” library. However, there is a caveat:

We do NOT really need “completely lockless queue”. What we need, is a “queue which is lockless until it becomes empty or full”

In other words, our (almost-)lockless queue still needs to lock (otherwise we’d need to poll it, which puts us in a predicament between sacrificing latency and burning CPU cycles – a MUCH worse trade-off than any losses from the very-infrequent context switches on barely-loaded locks).

Unfortunately, I do NOT know of any readily-available library which supports such “blocking-only-when-necessary” queues 🙁 . Writing such a thing yourself is certainly possible, but keep in mind that it is going to be a Really Major Effort even if you’re proficient in writing synchro primitives 🙁 (and an Even More Major Effort to debug/test it and to prove its correctness8). Overall, if rating the complexity of writing such a “blocking-only-when-necessary” queue in terms of exercises from Knuth’s “The Art of Computer Programming”, I would rate it at around 40 🙁 (with “50” being a “not-yet-proven theorem”).

One library which I didn’t try myself, but which MAY help in converting lockless algorithms into lock-when-necessary ones, is [EventCount] from Facebook’s folly library. Let me know whether it worked for you 🙂 .


8 yes, for non-trivial primitives such proof is necessary, even if it is done by an exhaustive analysis of all the context switches in all the substantially different points – of course, not forgetting about those nasty ABA problems

 

Waiting for Other Stuff

More often than not, in addition to waiting for incoming events, we MAY want to wait for “something else”. Examples of these “something else” things range from “something coming in from socket” to “user moving mouse”.

Of course, we could dedicate a thread to wait for several sockets (user input, DNS response, whatever-else) and pushing the result to one of our MWSR Queues, but it means extra context switches, and therefore is not always optimal.

In such cases, we MAY want to use some OS-specific mechanism which allows to wait for several such things simultaneously. Examples of such mechanisms include:

  • (not exactly OS-specific, but still different enough to be mentioned here): using select() (poll()/epoll()) as a queue. If MOST of your IO is sockets, and everything-else (like “a message coming in from another thread”) happens very occasionally, then it often makes sense to use select() etc. to deal with sockets – and with anything else too (with absolutely no mutexes etc. in sight). To deal with those very-occasional other events (which cannot be handled via select()/poll()/epoll() because they’re not file handles, or because they’re regular files(!)), a separate anonymous pipe (or equivalent) can be created, which can be listened to by the very same select()-like function. Bingo! Most of the things are handled with select()/poll()/epoll()/… without any unnecessary context switches, and the very-occasional stuff is occasional enough to ignore the associated (usually not-too-bad) overhead of sending it over the pipe.
    • On Linux, you can (and IMHO SHOULD) use eventfd() instead of an anonymous pipe, to get an improvement in performance. For thread-to-thread communications, it makes select()-etc.-based queues rather efficient; a minimal sketch of this approach follows this list.
    • Note however, that this approach does NOT work too well performance-wise when most of your events  CANNOT be handled by select()-like function directly (and need to be simulated over that pipe). While such a thing WILL work, the time spent on simulating events over pipes, can become substantial :-(.
  • kqueue(). On BSD, kqueue() allows waiting not only on file handles, and provides more flexibility than epoll(), occasionally allowing you to avoid an extra-thread-with-an-anonymous-pipe which would be necessary otherwise.
  • Win32 WaitForMultipleObjects(). WaitForMultipleObjects() can wait both for sockets and for “events”. This can be used to build a queue which can handle both sockets etc. and other stuff – all without those unnecessary context switches.[[TODO:MsgWaitForMultipleObjects()]]
  • Win32 thread queues. Another Win32-specific mechanism is related to thread queues (and GetMessage() function). These come handy when you need to handle both Windows messages and something-else (especially when you need to do it in a UI thread).
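Here is a minimal Linux-only sketch of the eventfd() trick from the first bullet above (error handling and the actual socket/queue processing are omitted; the function names are mine, chosen for this sketch, not part of any library):

#include <cstdint>
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int make_wakeup_fd() {
  return eventfd(0, EFD_CLOEXEC);//counter-style fd, becomes readable when non-zero
}

void post_wakeup(int efd) {//called by writer threads after pushing to the in-memory queue
  uint64_t one = 1;
  (void)write(efd, &one, sizeof(one));//increments the counter, wakes the poller
}

//reader side: waits for either socket activity or a cross-thread wakeup;
//  returns true if it was a wakeup (so the caller should drain its queue)
bool wait_for_events(int sock_fd, int efd) {
  pollfd fds[2] = { { sock_fd, POLLIN, 0 }, { efd, POLLIN, 0 } };
  (void)poll(fds, 2, -1 /*no timeout*/);
  if(fds[1].revents & POLLIN) {
    uint64_t cnt;
    (void)read(efd, &cnt, sizeof(cnt));//resets the counter; cnt = number of posts
    return true;
  }
  return false;//socket activity: the caller should go and read sock_fd
}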

On libuv

In a sense, [libuv] is The King when we speak about 3rd-party event handling libraries. It can take pretty much anything and make it asynchronous. However, being that universal comes at a price: libuv’s performance, while “pretty good”, is not “the best one possible”. In particular, the trickery described above, can often outperform libuv.

 

[[TODO: IPC/shared-memory]]

[[To Be Continued…

This concludes beta Chapter XIII from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter XIV, where we’ll start discussing graphics (though ONLY as much as it is necessary for a multiplayer gamedev).]]


References

[RED] “Random Early Detection”, Wikipedia

[EventCount] https://github.com/facebook/folly/blob/master/folly/experimental/EventCount.h

[libuv] http://libuv.org

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.


 


Infographics: Operation Costs in CPU Clock Cycles

Operation costs in CPU clock cycles on x86/x64 Platform

NB: scale is logarithmic!

[[TODO: TLB]]

Whenever we need to optimise the code, we should profile it, plain and simple. However, sometimes it makes sense just to know ballpark numbers for relative costs of some popular operations, so you won’t do grossly inefficient things from the very beginning (and hopefully won’t need to profile the program later 🙂 ).

So, here it goes – an infographic which should help to estimate the costs of certain operations in CPU clock cycles – and to answer questions such as “hey, how much does an L2 read usually cost?”. While the answers to all these questions are more or less known, I don’t know of a single place where all of them are listed and put into perspective. Let’s also note that while the listed numbers, strictly speaking, apply only to modern x86/x64 CPUs, similar patterns of relative operation costs are expected to be observed on other modern CPUs with large multi-level caches (such as ARM Cortex A, or SPARC); on the other hand, MCUs (including ARM Cortex M) are different enough that some of the patterns may be different there.

Last but not least, a word of caution: all the estimates here are just indications of the order of magnitude; however, given the scale of differences between different operations, these indications may still be of use (at least to be kept in mind to avoid “premature pessimisation”).

On the other hand, I am still sure that such a diagram is useful to avoid saying things “hey, virtual function calls cost nothing” – which may or may not be true depending on how often you call them. Instead, using the infographics above – you’ll be able to see that

if you call your virtual function 100K times per second on a 3GHz CPU – it probably won’t cost you more than 0.3% of your CPU total; however, if you’re calling the same virtual function 10M times per second – it can easily mean that virtualisation eats up double-digit percentages of your CPU core.

Another way of approaching the same question is to say that “hey, I’m calling virtual function once per piece of code which is like 10000 cycles, so virtualisation won’t eat more than 1% of the program time” – but you still need some kind of way to see an order of magnitude for the related costs – and the diagram above will still come in handy 🙁 .
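For the curious, the back-of-the-envelope arithmetic behind the claim above can be written down in a few lines; the ~50 cycles per virtual call used here is my own assumed ballpark (indirect call plus a likely cache miss on the VMT), not a number taken from the diagram:

#include <cstdio>
#include <initializer_list>

int main() {
  const double cycles_per_call = 50.0;//assumed ballpark, NOT a measured value
  const double cpu_hz = 3e9;          //one 3GHz core
  for(double calls_per_sec : {1e5, 1e7})
    std::printf("%.0f calls/sec -> %.1f%% of the core\n",
                calls_per_sec, 100.0 * calls_per_sec * cycles_per_call / cpu_hz);
  //prints roughly 0.2% for 100K calls/sec, and roughly 17% for 10M calls/sec -
  //  i.e. negligible in the first case, and a double-digit share in the second one
  return 0;
}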

Preliminaries aside, let’s take a closer look at those items on our infographics above.

ALU/FPU Operations

For our purposes, when speaking about ALU operations, we will consider only register-register ones. If memory is involved, the costs can be VERY different – and will depend on “how bad the cache miss was” for the memory access, as discussed below.

“Simple” Operations

These days (and on modern CPUs), “simple” operations such as ADD/MOV/OR/… can easily have costs of less than 1 CPU cycle. This doesn’t mean that the operation will be literally performed in half a cycle. Instead –

while all operations are still performed in a whole number of cycles, some of them can be performed in parallel

In [Agner4] (which BTW is IMO the best reference guide on CPU operation costs), this phenomenon is represented by each operation having two associated numbers – one is latency (which is always a whole number of cycles), and another is throughput. It should be noted, however, that in the real world, when going beyond order-of-magnitude estimates, exact timing will depend a lot on the nature of your program, and on the order in which the compiler has put seemingly-unrelated instructions; in short – whenever you need something better than an order-of-magnitude guesstimate, you need to profile your specific program, compiled with your specific compiler (and ideally – on a specific target CPU too).

Further discussion of such techniques (known as “out of order execution”), while being Really Interesting, is going to be way too hardware-related (what about “register renaming” which happens under the hood of CPU to reduce dependencies which prevent out-of-order from working efficiently?), and is clearly out of our scope at the moment 🙁 .
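
To make the latency-vs-throughput distinction above a bit more tangible, here is a minimal C++ sketch (my own illustration, not from [Agner4]); it assumes an optimizing compiler with auto-vectorization disabled, and that you time both functions yourself (e.g. with a profiler):

#include <cstddef>
#include <cstdint>

// one long dependency chain: each addition has to wait for the previous
// one, so we pay the full ADD latency on every iteration
int64_t sum_dependent(const int64_t* a, size_t n) {
  int64_t s = 0;
  for (size_t i = 0; i < n; ++i) s += a[i];
  return s;
}

// four independent chains: the CPU can execute several ADDs per cycle,
// so we're limited by throughput rather than latency
int64_t sum_independent(const int64_t* a, size_t n) {
  int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += a[i]; s1 += a[i+1]; s2 += a[i+2]; s3 += a[i+3];
  }
  for (; i < n; ++i) s0 += a[i];
  return s0 + s1 + s2 + s3;
}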

Integer Multiplication/Division

Integer multiplication/division is quite expensive compared to the “simple” operations above. [Agner4] gives the cost of 32/64-bit multiplication (MUL/IMUL in the x86/x64 world) at between 1-7 cycles (in practice, I’ve observed a narrower range of values, such as 3-6 cycles), and the cost of 32/64-bit division (known as DIV/IDIV on x86/x64) at between 12-44 cycles.

Floating-Point Operations

Costs of floating-point operations are taken from [Agner4], and range from 1-3 CPU cycles for addition (FADD/FSUB) and 2-5 cycles for multiplication (FMUL), to 37-39 cycles for division (FDIV).

If using SSE scalar operations (which apparently every compiler and its dog does these days), the numbers go down to 0.5-5 cycles for multiplication (MULSS/MULSD), and to 1-40 cycles for division (DIVSS/DIVSD); in practice, however, you should expect more like 10-40 cycles for division (1 cycle is “reciprocal throughput”, which is rarely achievable in practice).

128-bit Vector Operations

For quite a few years now, CPUs have been providing “vector” operations (more precisely – Single Instruction Multiple Data, a.k.a. SIMD operations); in the Intel world they’re known as SSE and AVX, and in the ARM world – as ARM Neon. One funny thing about them is that they operate on “vectors” of data, with the data being of the same size (128 bits for SSE2-SSE4, 256 bits for AVX and AVX2, and 512 bits for the upcoming AVX-512) – but with the interpretations of these bits being different. For example, a 128-bit SSE2 register can be interpreted as (a) two doubles, (b) four floats, (c) two 64-bit integers, (d) four 32-bit integers, (e) eight 16-bit integers, or (f) sixteen 8-bit integers.

[Agner4] gives the cost of integer addition over a 128-bit vector at < 1 cycle if the vector is interpreted as 4×32-bit integers, and at 4 cycles if it is 2×64-bit integers; multiplication (4×32 bits) goes at 1-5 cycles – and last time I checked, there were no integer division vector operations in the x86/x64 instruction set. For floating-point operations over 128-bit vectors, the numbers go from 1-3 CPU cycles for addition and 1-7 CPU cycles for multiplication, to 17-69 cycles for division.

Bypass Delays

One not-so-obvious thing related to calculation costs, is that switching between integer and floating-point instructions is not free. [Agner3] gives this cost (known as “bypass delay”) at 0-3 CPU cycles depending on the CPU. Actually, the problem is more generic than that, and (depending on CPU) there can also be penalties for switching between vector (SSE) integer instructions and usual (scalar) integer instructions.

Optimisation hint: in performance-critical code, avoid mixing floating-point and integer calculations.

Branching

The next thing which we’ll be discussing is branching code. A branch (an if within your program) is essentially a comparison, plus a change of the program counter. While both these things are simple, there can be a significant cost associated with branching. Discussing why this is the case is once again way too hardware-related (in particular, pipelining and speculative execution are two of the things involved here), but from the software developer’s perspective it looks as follows:

  • if the CPU guesses correctly where the execution will go (that’s before actually calculating the condition of the if), then the cost of the branch is about 1-2 CPU cycles.
  • however, if the CPU makes an incorrect guess – it results in the CPU effectively “stalling”

The amount of this stall is estimated at 10-20 CPU cycles [Wikipedia.BranchPredictor]; for recent Intel CPUs, it is around 15-20 CPU cycles [Agner3].

Let’s note that while GCC’s __builtin_expect() is widely believed to affect branch prediction – and it did work this way some 15 years ago – it is no longer the case, at least for Intel CPUs (since Core 2 or so). As described in [Agner3], on modern Intel CPUs branch prediction is always dynamic (or at least dominated by dynamic decisions); this, in turn, implies that __builtin_expect()-induced differences in the code are not expected to have any effect on branch prediction (on modern Intel CPUs, that is). However, __builtin_expect() still has an effect on the way code is generated, as described in the “Memory Access” section below.
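
For reference, here is how __builtin_expect() is typically wrapped (GCC/Clang-specific; the LIKELY/UNLIKELY macro names are merely a common convention, not a standard API):

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int parse_byte(int c) {
  if (UNLIKELY(c < 0 || c > 255)) // the error path is expected to be rare,
    return -1;                    //   so the compiler may move it out of the hot path
  return c;
}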

Memory Access

Back in the 80s, CPU speed was comparable with memory latency (for example, the Z80 CPU, running at 4MHz, spent 4 cycles on a register-register instruction, and 6 cycles on a register-memory instruction). At that time, it was possible to calculate the speed of a program just by looking at the assembly.

Since that point, CPU speeds have grown by 3 orders of magnitude, while memory latency has improved only 10-30-fold or so. To deal with the remaining 30x+ discrepancy, all kinds of caches were introduced. A modern CPU usually has 3 levels of caches. As a result, the speed of memory access depends very significantly on the answer to the question “where does the data we’re trying to read reside?” The lower the cache level where your request is found – the faster you can get it.

L1, L2, and L3 cache access times can be found in official documents such as [Intel.Skylake]; it lists L1/L2/L3 access times at 4/12/44 CPU cycles respectively (NB: these numbers vary slightly from one CPU model to another). Actually, as mentioned in [Levinthal], L3 access times can go as high as 75 cycles if the cache line is shared with another core.

However, what is more difficult to find is information about main RAM access times. [Levinthal] gives it at 60ns (~180 cycles if the CPU is running at 3GHz).

Optimisation hint: DO improve data locality. For more discussion on it, see, for example, [NoBugs].

Besides memory reads, there are also memory writes. While intuitively a write is perceived to be more expensive than a read, most often it is not; the reason is simple – the CPU doesn’t need to wait for the write to complete before going forward (instead, it just starts the write – and goes ahead with its other business). This means that most of the time, the CPU can perform a memory write in ~1 cycle; this is consistent with my experience, and seems to correlate with [Agner4] reasonably well. On the other hand, if your system happens to be memory-bandwidth-bound, the numbers can get EXTREMELY high; still, from what I’ve seen, having memory bandwidth overloaded by writes is a very rare occurrence, so I didn’t reflect it on the diagram.

And besides data, there is also code.

Another optimisation hint: try to improve code locality too. This one is less obvious (and usually its effect on performance is less drastic than that of data locality). A discussion of the ways to improve code locality can be found in [Drepper]; these ways include such things as inlining, and __builtin_expect().

Let’s note that while __builtin_expect(), as mentioned above, no longer has an effect on branch prediction on Intel CPUs, it still has an effect on the code layout, which in turn impacts spatial locality of the code. As a result, the effects of __builtin_expect() are not too pronounced on modern Intel CPUs (on ARM – I have no idea TBH), but it can still affect a thing or three performance-wise. Also, there have been reports that under MSVC, swapping the if and else branches of an if statement has effects which are similar to those of __builtin_expect() (with the “likely” branch being the if branch of a two-handed if), but make sure to take this with a good pinch of salt.

NUMA

One further thing which is related to memory accesses and performance, is rarely observed on desktops (as it requires multi-socket machines – not to be confused with multi-core ones). As such, it is mostly server-land; however, it does affect memory access times significantly.

When multiple sockets are involved, modern CPUs tend to implement so-called NUMA architecture, with each processor (where “processor” = “that thing inserted into a socket”) having its own RAM (as opposed to the earlier FSB architecture with a shared FSB a.k.a. Front-Side Bus, and shared RAM). In spite of each of the CPUs having its own RAM, the CPUs share the RAM address space – and whenever one of them needs access to RAM physically located within another one, it is done by sending a request to the other socket via an ultra-fast protocol such as QPI or HyperTransport.

Surprisingly, this doesn’t take as long as you might have expected – [Levinthal] gives the numbers of 100-300 CPU cycles if the data was in the remote CPU L3 cache, and of 100ns (~=300 cycles) if the data wasn’t there, and remote CPU needed to go to its own main RAM for this data.

Software Primitives

Now we’re done with those things which are directly hardware-related, and will be speaking about certain things which are more software-related; still, they’re really ubiquitous, so let’s see how much we spend every time we’re using them.

C/C++ Function Calls

First, let’s see the cost of a C/C++ function call. Actually, the C/C++ caller does a damn lot of stuff before making a call – and the callee does a few more things too.

[Efficient C++] estimates the CPU cost of a function call at 25-250 CPU cycles depending on the number of parameters; however, it is quite an old book, and I don’t have a better reference of the same caliber 🙁 . On the other hand, from my experience, for a function with a reasonably small number of parameters, it is more like 15-30 cycles; this also seems to apply to non-Intel CPUs, as measured by [eruskin].

Optimisation hint: Use inline functions where applicable. However, keep in mind that these days compilers tend to ignore inline specifications more often than not 🙁 . Therefore, for really time-critical pieces of code you may want to consider __attribute__((always_inline)) for GCC, and __forceinline for MSVC, to make the compiler do what you need. However, do NOT overuse this forced-inline stuff for not-so-critical pieces of code, as it can make things worse rather easily.
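
Just as an illustration, a commonly used portability shim for forced inlining might look as follows (the FORCE_INLINE name is our own; as noted above, use it sparingly):

#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

FORCE_INLINE int clamp_to_byte(int x) { // a really hot helper, presumably
  return x < 0 ? 0 : (x > 255 ? 255 : x);
}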

BTW, in many cases gains from inlining can exceed simple removal of call costs. This happens because inlining enables quite a few additional optimisations (including those related to reordering to achieve the proper use of hardware pipeline). Also let’s not forget that inlining improves spatial locality for the code – which tends to help a bit too (see, for example, [Drepper]).

Indirect and Virtual Calls

The discussion above was related to usual (“direct”) function calls. The costs of indirect and virtual calls are known to be higher, and there is pretty much a consensus that an indirect call causes branching (however, as [Agner1] notes, as long as you happen to call the same function from the same point in code, the branch predictors of modern CPUs are able to predict it pretty well; otherwise – you’ll get a misprediction penalty of 10-30 cycles). As for virtual calls – there is one extra read (reading the VMT pointer), so if everything is cached at this point (which it usually is), we’re speaking about an additional 4 CPU cycles or so.

On the other hand, practical measurements from [eruskin] show that the cost of a virtual function call is roughly double the direct call cost for small functions; within our margin of error (which is “an order of magnitude”) this is quite consistent with the analysis above.

Optimisation hint: IF your virtual calls are expensive – in C++ you may want to think about using templates instead (implementing so-called compile-time polymorphism); CRTP is one (though not the only) way of doing it.
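
To illustrate the idea, here is a minimal CRTP sketch (class names are purely illustrative); the “virtual-like” call is resolved at compile time, so there is no VMT read and the call can be inlined:

template <class Derived>
class ShapeBase {
public:
  double area() const { // statically dispatched to the derived class
    return static_cast<const Derived*>(this)->area_impl();
  }
};

class Square : public ShapeBase<Square> {
public:
  explicit Square(double side) : side_(side) {}
  double area_impl() const { return side_ * side_; }
private:
  double side_;
};

template <class T>
double total_area(const ShapeBase<T>& s) { return s.area(); } // no vtable involved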

Allocations

These days, allocators as such can be quite fast; in particular, tcmalloc and ptmalloc2 allocators can take as little as 200-500 CPU cycles for allocation/deallocation of a small object [TCMalloc].

However, there is a significant caveat related to allocation – and adding to indirect costs of using allocations: allocation, as a Big Fat rule of thumb, reduces memory locality, which in turn adversely affects performance (due to uncached memory accesses described above). Just to illustrate how bad this can be in practice, we can take a look at a 20-line program in [NoBugs]; this program, when using vector<>, happens to be from 100x to 780x faster (depending on compiler and specific box) than an equivalent program using list<> – all because of poor memory locality of the latter :-(.
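
Just to illustrate the effect (this is NOT the exact program from [NoBugs], and it won’t reproduce the 100x-780x difference, as it exercises only traversal and not insertions), a sketch along the following lines already shows the locality gap between a contiguous vector<> and a pointer-chasing list<> when you time the two loops:

#include <cstddef>
#include <cstdio>
#include <list>
#include <vector>

int main() {
  const size_t N = 10000000;
  std::vector<int> v(N, 1); // contiguous storage => cache-friendly traversal
  std::list<int> l(N, 1);   // one allocation per node => poor locality

  long long sv = 0, sl = 0;
  for (int x : v) sv += x;  // time this loop...
  for (int x : l) sl += x;  // ...and this one, and compare
  std::printf("%lld %lld\n", sv, sl);
  return 0;
}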

Optimisation hint: DO think about reducing the number of allocations within your programs – especially if there is a stage when lots of work is done on read-only data. In some real-world cases flattening your data structures (i.e. replacing allocated objects with packed ones) can speed up your program as much as 5x. A real-world story in this regard: once upon a time, there was a program which used some gigabytes of RAM, which was deemed too much; ok, I rewrote it to a “flattened” version (i.e. each node was first constructed dynamically, and then an equivalent “flattened” read-only object was created in memory); the idea of “flattening” was to reduce the memory footprint. When we ran the program, we observed that not only was the memory footprint reduced by 2x (which was what we expected), but also – as a very nice side effect – execution speed went up by 5x.

Kernel Calls

If our program runs under an operating system,1 then we have a whole bunch of system APIs available. In practice,2 quite a few of those system calls cause kernel calls, which involve switches to kernel mode and back; this includes switching between different “protection rings” (on Intel CPUs – usually between “ring 3” and “ring 0”). While this CPU-level switching back and forth itself takes only ~100 CPU cycles, other related overheads tend to make kernel calls much more expensive, so a usual kernel call takes at least 1000-1500 CPU cycles 🙁 [Wikipedia.ProtectionRing].


1 yes, there are still programs which run without it
2 at least if we’re speaking about more or less conventional OS

 

C++ Exceptions

These days, C++ exceptions are said to be zero-cost until thrown. Whether it is really zero – is still not 100% clear (IMO it is even unclear whether such a question can be asked at all), but it is certainly very close.

However, these “zero-cost until thrown” implementations come at the cost of a huge pile of work which needs to be done whenever an exception is thrown. Everybody agrees that the cost of a thrown exception is huge; however (as usual) experimental data is scarce. Still, an experiment by [Ongaro] gives us a ballpark number of around 5000 CPU cycles (sic!). Moreover, in more complicated cases, I would expect it to take even more.

Return Error and Check

An ages-old alternative to exceptions is returning error codes and checking them at each level. While I don’t have references for performance measurements of this kind of thing, we already know enough to make a reasonable guesstimate. Let’s take a closer look at it.

Basically, the cost of return-and-check consists of three separate costs. The first one is the cost of the conditional jump itself – and we can safely assume that 99+% of the time it will be predicted correctly; which means the cost of the conditional jump in this case is around 1-2 cycles. The second cost is the cost of copying the error code around – and as long as it stays within the registers, it is a simple MOV – which, given the circumstances, is 0 to 1 cycles (0 cycles means that the MOV has no additional cost, as it is performed in parallel with some other stuff). The third cost is much less obvious – it is the cost of the extra register necessary to carry the error code; if we’re out of registers – we’ll need a PUSH/POP pair (or a reasonable facsimile), which is in turn a write + an L1 read, or 1+4 cycles. On the other hand, let’s keep in mind that the chances of PUSH/POP being necessary vary from one platform to another; for example, on x86 any realistic function would require them almost for sure; however, on x64 (which has double the number of registers) the chances of PUSH/POP being necessary go down significantly (and in quite a few cases, even if a register is not completely free, making it available may be done by the compiler more cheaply than with a dumb PUSH/POP).

Adding all three costs together, I’d guesstimate the cost of return-error-code-and-check at anywhere between 1 and 7 CPU cycles. Which in turn means that if we have one exception per 10000 function calls – we’re likely to be better off with exceptions; however, if we have one exception per 100 function calls – we’re likely to be better off with error codes. In other words, we’ve just reconfirmed a very well-known best practice – “use exceptions only for abnormal situations” 🙂 .
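
For completeness, here is what the return-error-code-and-check pattern might look like in practice (a minimal sketch; the Err enum and function names are purely illustrative):

#include <cstddef>

enum class Err { Ok = 0, BadInput, Overflow };

Err parse_u16(const char* s, size_t len, unsigned& out) {
  if (len == 0 || len > 5) return Err::BadInput;
  unsigned v = 0;
  for (size_t i = 0; i < len; ++i) {
    if (s[i] < '0' || s[i] > '9') return Err::BadInput;
    v = v * 10 + unsigned(s[i] - '0');
  }
  if (v > 65535) return Err::Overflow;
  out = v;
  return Err::Ok;                      // error code travels in a register
}

Err handle_port_field(const char* s, size_t len, unsigned& port) {
  Err e = parse_u16(s, len, port);
  if (e != Err::Ok)                    // ~1-2 cycles when predicted correctly
    return e;                          // propagate up, no stack unwinding
  // ... normal processing ...
  return Err::Ok;
}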

Thread Context Switches

Last but certainly not least, we need to speak about the costs of thread context switches. One problem with estimating them is that, well, it is very difficult to figure them out. Common wisdom says that they’re “damn expensive” (hey, there should be a reason why nginx outperforms Apache) – but how much is this “damn expensive”?

From my personal observations, the costs were at least 10000 CPU cycles; however, there are lots of sources which give MUCH lower numbers. In fact, it is all about “what exactly we’re trying to measure”. As noted in [LiEtAl], there are two different costs with relation to context switches.

  • The first cost is the direct cost of thread context switching – and it is measured at about 2000 CPU cycles3
  • However, the second cost is MUCH higher; it is related to cache invalidation caused by the switched-in thread; according to [LiEtAl], it can be as large as 3M CPU clocks. In theory, with a completely random access pattern, a modern CPU with 12M of L3 cache (and taking a penalty on the order of 50 cycles per access) can take a penalty of up to 10M cycles per context switch; still, in practice the penalties are usually somewhat lower than that, so the numbers from [LiEtAl] make sense.

3 that is, if my math is correct when converting from microseconds into cycles

 

Wrapping it Up

Phew, it was quite a bit of work to find references for all these more-or-less known observations.

Also please note that while I’ve honestly tried to collect all the related costs in one place (checking 3rd-party findings against my own experience in the process), it is just a very first attempt at this, so if you find reasonably compelling evidence that a certain item is wrong – please let me know, and I will be happy to make the diagram more accurate.

 


References

[Agner4] Agner Fog, “Instruction tables. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs”

[Agner3] Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs. An optimization guide for assembly programmers and compiler makers”

[Intel.Skylake] “Intel® 64 and IA-32 Architectures Optimization Reference Manual”, 2-6, Intel

[Levinthal] David Levinthal, “Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors”, 22

[NoBugs] 'No Bugs' Hare, “C++ for Games: Performance. Allocations and Data Locality”

[eruskin] http://assemblyrequired.crashworks.org/how-slow-are-virtual-functions-really/

[Agner1] Agner Fog, “Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms”

[Efficient C++] Dov Bulka, David Mayhew, “Efficient C++: Performance Programming Techniques”, p. 115

[Drepper] Ulrich Drepper, “Memory part 5: What programmers can do”, section 6.2.2

[TCMalloc] Sanjay Ghemawat, Paul Menage, “TCMalloc : Thread-Caching Malloc”

[Wikipedia.ProtectionRing] “Protection Ring”, Wikipedia

[Ongaro] Diego Ongaro, “The Cost of Exceptions of C++”

[LiEtAl] Chuanpeng Li, Chen Ding, Kai Shen, “Quantifying The Cost of Context Switch”

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.


Network Programming: Socket Peculiarities, Threads, and Testing

Socket Peculiarities

Cover of the upcoming book
[[This is Chapter XI(e) from “beta” Volume 2 of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “Book Beta Testing“. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

Socket peculiarities

Most of the time, to work with the network (both TCP and UDP), you will be using so-called Berkeley Sockets. I won’t go into a detailed discussion of them (you can find pretty much everything you need on the “how to use Berkeley Sockets” subject in [Stevens]). However, there are several socket-related things which are not that well-known, and this is what I’ll try to discuss here.

To IPv6 or not to IPv6?

One common question which arises these days is “do we need to have our Clients and Servers support IPv6?” – or in a stronger version, “maybe we can support ONLY IPv6?”

In short – these days ALMOST ALL players’ devices support IPv4, and only around 10% support IPv6 (though the number is growing) [Google]. On the other hand, there are reports of some ISPs using IPv6-only within their networks (and converting it to IPv4 via NAT64/DNS64); the number of such setups is expected to grow as the exhaustion of the IPv4 address space proceeds.

When applied to games, it (rather counterintuitively) means the following:

  • You MUST support IPv4 both on Server and Client
  • You SHOULD support IPv6 on the Client. This is necessary to deal with those IPv6-only player ISPs
    • It MAY be achieved by simply taking an IPv6 address from getaddrinfo() (though with a fallback to other IPs, including IPv4 to account for potential misconfigurations)
  • You MAY support IPv6 on the Server if you like it. However, even if your Server doesn’t support IPv6, pretty much all the real-world Clients-supporting-IPv6 will be able to connect to it anyway via NAT64/DNS64 (a bit more on it will be discussed below).
    • Supporting IPv6 on Server can be generally done either via listening on two sockets (one IPv4, and another IPv6 with IPV6_V6ONLY option), or via single IPv6 socket with IPV6_V6ONLY turned off; there are, however, some OS-specific peculiarities in this regard, see [StackOverflow.BothIPv4IPv6] for details.
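
As an illustration of the single-socket approach, here is a minimal sketch of a dual-stack listening socket (POSIX sockets assumed, error handling reduced to the bare minimum; as noted above, the behavior of IPV6_V6ONLY is OS-specific, so check [StackOverflow.BothIPv4IPv6] before relying on it):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns a listening socket accepting both IPv6 and IPv4-mapped
// connections, or -1 on error.
int make_dual_stack_listener(uint16_t port) {
  int s = socket(AF_INET6, SOCK_STREAM, 0);
  if (s < 0) return -1;

  int off = 0;  // 0 => do NOT restrict the socket to IPv6-only
  if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off)) < 0) {
    close(s); return -1;
  }

  sockaddr_in6 addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin6_family = AF_INET6;
  addr.sin6_addr = in6addr_any;
  addr.sin6_port = htons(port);

  if (bind(s, (sockaddr*)&addr, sizeof(addr)) < 0 || listen(s, SOMAXCONN) < 0) {
    close(s); return -1;
  }
  return s;
}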

Horrors of gethostbyname(): using getaddrinfo() instead, or NOT using anything at all

On the client, we will often need to convert DNS name (like us-server.ourgamedomain.com) into an IP address. This process is known as “name resolution”.

To DNS or Not to DNS?

Actually, even before we start discussing “HOW to do name resolution”, we need to think a bit about “WHETHER we need name resolution at all?” After all (unlike, say, a web browser), games CAN have the IP addresses of their servers embedded into their Clients and avoid name resolution altogether.

Whether it is a good idea to embed IPs into the Client is arguable. Each and every admin-who-knows-his-stuff will ask “Are You Crazy?” at the very mention of such an option. But let’s see what the real reasons behind DNS names (as opposed to IPs) are.

In the context of games, one reason to have your IP addresses obtained via DNS rather than embedded into your Client is that IP addresses can change. Granted, it does happen, but for server IPs such a change is a Very Infrequent occurrence (like “from once per several years to never ever”); plus, if your hosting ISP happens to force an IP change, you will know about it at least 2 months in advance, etc. etc. As most of the games out there are routinely updated much more frequently than that, it is not a real problem.

Another reason for using DNS is to allow fast adding/removal of servers when they’re added/removed from the pool of active servers. And depending on the way you’re balancing your servers, this IS one valid reason to use DNS,1,2 though its efficiency is limited because of DNS propagation times being of the order of hours.3

On the other hand, using DNS has been seen to cause problems for some players in some cases. Yes, a failing DNS is not really a problem of your game, but if, whenever the DNS server of a player’s ISP is down for 15 minutes, your game is accessible while your competitor’s is not – well, you DO get a bit of competitive advantage (that’s for free BTW); as a side bonus, it also greatly reduces the incentive to mount a DDoS against your DNS.

As a result, my advice with regards to “To DNS or not to DNS” goes as follows:

  • DO have BOTH DNS names and IP addresses in the list-of-servers stored within your Client4
  • Your Client should try all of them one by one (when you get a list of IPs when resolving your DNS-name via getaddrinfo(), you MUST try all the addresses you get from there)
    • Make sure to have a timeout on the Client side, in case you did connect but didn’t receive anything from the server side for a while

This way, you’ll be fine BOTH if IP address has changed, and if player’s DNS server cannot resolve your DNS name for whatever reason (which can range from player-ISP’s DNS server failure to DDoS on your DNS provider infrastructure).


1 In particular, if you’re using either Round-Robin DNS, or Client-Side Random Balancing, then DNS might help.
2 just make sure to use short TTL in your Zone Files, and keep in mind that lots of DNS caches out there will ignore TTL/enforce their-own-minimums.
3 usually, you will start feeling DNS change almost immediately, with 50% of your clients migrating in a few (like 3-6, though YMMV) hours, and 95% migrating by the end of 24 hours.
4 in the same manner, DO publish BOTH DNS name and IP address of your servers in your directory service.

 

gethostbyname() vs getaddrinfo()

Since time immemorial, The Way to do DNS name resolution has been via the gethostbyname() function. Unfortunately, this function is riddled with numerous problems:

  • It returns a pointer to a static variable, making it non-thread-safe (ouch!)5
  • It is blocking
  • It doesn’t support IPv6
  • Most (all?) of the time, gethostbyname() returned only one IP address from all the IP addresses advertised by DNS (we’ll see why it is important, in a moment).

Not surprisingly, with all the problems of gethostbyname(), there is a newer and better replacement – the getaddrinfo() function. I don’t know of any cases where you’d need to use gethostbyname() these days (well, maybe save for some Really Obscure Platform which still lives in the 1980s and doesn’t implement getaddrinfo()). In short:

Use getaddrinfo() and forget about gethostbyname()

However, even despite several major improvements, getaddrinfo() is still a blocking function. While non-blocking alternatives do exist (such as getaddrinfo_a() on Linux and GetAddrInfoEx()-with-OVERLAPPED on Windows), they’re not too universal 🙁 . Fortunately enough:

  1. you don’t normally need getaddrinfo() on the server side. While we’re at it – DO NOT use reverse DNS lookups on your production servers; in other words – when logging a Client’s IP – DO NOT try to log it as a DNS name; settle for a plain IP.
  2. for Clients, it is usually a Good Idea to have a separate thread which does nothing but receives-DNS-resolution-requests, calls blocking getaddrinfo(), and sends the results back.

When it comes to multiple addresses returned by getaddrinfo(), let’s recall that usually I am arguing for using Client-Side Random Load Balancing (as was described in Chapter VII), as opposed to DNS Round-Robin. For the rationale, see Chapter VII, but in short – while they DO look very similar, DNS Round-Robin is subject to MUCH more severe imbalances due to DNS caching (while Client-Side Random Load Balancing is not affected by caching much). To implement Client-Side Random Load Balancing via getaddrinfo(), you can do the following:

  1. Get the list of server IPs (either from getaddrinfo(), or from embedded list within the Client)
  2. Choose one IP address at random (DO use something better than time-based srand(time(0)) for randomness)
  3. Try connecting there
  4. If not successful – take this IP out of the list and take another IP address at random from the remaining IPs
  5. Rinse and repeat (starting from step #2)
  6. if still not successful – rinse and repeat (starting from step #1)
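
A minimal sketch of steps 1-6 might look as follows (blocking connect() is used for brevity – real Client code would use a non-blocking connect() with the timeout mentioned above; the ServerAddr struct and helper name are our own):

#include <algorithm>
#include <random>
#include <vector>
#include <sys/socket.h>
#include <unistd.h>

struct ServerAddr {           // filled from getaddrinfo() results and/or
  sockaddr_storage sa;        //   from the embedded list within the Client
  socklen_t len;
};

// Shuffling the list and then trying addresses in order is equivalent to
// "pick one at random, remove it on failure, repeat".
int connect_to_any(std::vector<ServerAddr> servers) {
  std::mt19937 rng{std::random_device{}()};   // better than srand(time(0))
  std::shuffle(servers.begin(), servers.end(), rng);

  for (const ServerAddr& srv : servers) {
    int s = socket(srv.sa.ss_family, SOCK_STREAM, 0);
    if (s < 0) continue;
    if (connect(s, (const sockaddr*)&srv.sa, srv.len) == 0)
      return s;                               // connected - step 3 succeeded
    close(s);                                 // steps 4-5: try the next one
  }
  return -1;                                  // step 6: caller starts over
}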

Yet another peculiarity in this regard is related to IPv6. In presence of so-called DNS64, even if you don’t have an IPv6 address in your Zone File, your Client still MAY get an IPv6 address. As a rule of thumb, you should just use this “synthetic” IPv6 address – it is rarely malicious and it will allow your Client to work over those IPv6-only networks which sit behind NAT64/DNS64.


5 Ok, on some platforms it returns a pointer to thread-local storage, but this is a non-standard extension, non-guaranteed, etc.

 

Scalability Issues

select vs epoll vs kqueue vs Completion Ports

The very first question which usually arises at the beginning of a discussion about scalability and sockets, is almost-universally a religious-war-like question of “what is better – epoll or kqueue or Completion Ports?” As with quite a few things ;-), I have my own answer to this question (and I do know that I will be hit hard for articulating it). My take on it goes as follows:

In the context of games, there is little difference between different non-blocking network APIs

Yes, it also means that select() is going to work reasonably well too (though YMMV). By all means, try to experiment (I mean on the Server-Side), but don’t expect miracles from platform-specific APIs. As one example, in two major works comparing select()/poll()/epoll() ([GammoEtAl] and [Libenzi]) we can see that for workloads without idle connections, the performance of select()/poll()/epoll() is more or less on par, and only when the number of idle connections goes high does epoll() start to take a significant lead. However, as it is very common (and recommended) for game servers to drop idle connections after a very brief period of inactivity, this advantage of epoll() doesn’t really manifest itself in games.

What is Really Important, however, is to make your calls non-blocking and process more-than-one connection per thread

Indeed, with 1’000 (10’000 if we’re speaking about front-end servers, see Chapter VII for discussion on front-end servers) players per server and one connection-per-thread we’ll have 1K-10K threads (running over only 10 or so CPU cores), which will cause too much otherwise-unnecessary context switching if run simultaneously.

on limitations of select()

That being said, select() (being the oldest of the bunch) has a rather nasty limitation. There is a limit of 1024 file handles for select().6 It might seem like not a big deal, but unfortunately it is NOT a limit on the “number of file handles which you’re waiting for in a select() call”, but rather a limit on the “number of overall file handles within the process”(!!). While this limit can be raised (on Linux – via __FD_SETSIZE and ulimit, see, for example, [StackOverflow.Over1024]), this is a rather nasty property of select(), and if you’re hitting this limit, you MAY be better off using alternatives such as poll() or epoll()/kqueue() (and the change from select() at least to poll() is usually a very simple one).


6 that’s default for Linux, MAY vary for other platforms

 

TCP: multiple sockets per thread

Personally, for TCP connections in the game-like contexts I’ve had very good experience with the following rather simple architecture:

  • There is a fixed number of maximum TCP sockets-per-thread (in practice – between 32 and 256)
  • Each of these threads has an input queue of data-to-be-sent for all associated sockets
  • Each thread is using some-kind-of-non-blocking-IO and single wait (select/poll/WaitForMultipleObjects/epoll/kqueue) for all these sockets (plus the input queue!), and processes all the input/output from them as needed (this processing includes encryption)
  • Of course, there is also an additional thread which handles accept()’s on the listen()-ing socket, but it works only when we have a new connection, so it is not really loaded. I haven’t seen a single thread handling all accept()s run into any performance problems (that is, as long as it does nothing but accept()-then-push-accepted-socket-to-the-input-queue-of-some-thread), but if you ever run into it, you MAY be able to have more than one such accept()-ing thread.7
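
To make the above a bit more concrete, here is a very condensed sketch of the per-thread loop (poll() used as the wait primitive; the eventfd/pipe trick for waking up on the input queue, as well as all the send()/recv() handling, is only hinted at in the comments):

#include <poll.h>
#include <unistd.h>
#include <vector>

// socks: the (32-256 or so) TCP sockets owned by this thread;
// queue_fd: e.g. an eventfd/pipe read end which becomes readable whenever
//   the thread's input queue of data-to-be-sent is non-empty (assumption)
void tcp_socket_thread(std::vector<int>& socks, int queue_fd) {
  for (;;) {
    std::vector<pollfd> fds;
    fds.push_back(pollfd{queue_fd, POLLIN, 0});
    for (int s : socks)
      fds.push_back(pollfd{s, POLLIN, 0});

    if (poll(fds.data(), fds.size(), /*timeout ms*/ -1) <= 0)
      continue;

    if (fds[0].revents & POLLIN) {
      // drain the input queue and send() the pending data to its sockets
    }
    for (size_t i = 1; i < fds.size(); ++i) {
      if (fds[i].revents & (POLLIN | POLLHUP | POLLERR)) {
        // recv() from socks[i-1], decrypt, and process the incoming data
      }
    }
  }
}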

As we can see, within this architecture the number of handles per select()/WaitForMultipleObjects() call is quite limited, so most of the problems related to having too many handles in one call are gone.8 On the other hand, it reduces the number of concurrent threads by 1.5-2.5 orders of magnitude, bringing the number of threads down to around 40 for 256 connections/thread and 10K connections; this is not that different from the optimum for a typical server-side 12-core server with HT.

I am not saying that this architecture is the only viable one, but it does work for TCP for sure.9 And it performs reasonably well too; for example, for one specific game, this architecture has been compared to a Completion-Port-based one (in production), and the performance differences were found to be negligible.10


7 in this case you WILL need to experiment with different APIs as wake-up behavior can vary significantly between different platforms; we certainly don’t want BOTH accept() threads to be woken up on EACH accepted connection.
8 however, it doesn’t help with a per-process limit of select() described above 🙁
9 one thing to remember about this architecture in the context of stock exchanges, is that 100% fairness may or may not be guaranteed. In quite a few cases fairness can be achieved via randomizing the order of sockets before calling your wait-for-socket API, but you really need to consult your manuals very carefully first.
10 actually, the architecture described above, performed 2-5% better than a Completion-Port-based one, but from our perspective, it clearly qualifies as “pretty much the same”

 

UDP: Only-One-Socket Problem

With UDP, achieving scalability becomes significantly more complicated 🙁 . In particular, with all-UDP-traffic-going-over-one-single-port we have only one socket for all the 1’000-10’000 connections. It means that our thread-reading-on-UDP-socket is rather likely to become overloaded :-(. In practice, there are at least three different architectures aiming to address this problem.

The very first (naïve) approach is to give each of your Clients its own UDP port number, which allows you to have a socket for each of them too. Then you can do pretty much what we’ve done for TCP (the multiple-sockets-per-thread stuff). This approach does work, but it also limits the number of clients-per-IP to the number of UDP ports you can have (which is usually in the range of several hundred, and often is not enough).11

The second take would be to modify the model above slightly to have one-UDP-port-and-one-UDP-socket-per-thread instead of one-UDP-port-and-socket-per-player. This one will work (as noted above, even for 10K players and 32 sockets/thread there are only about 300 threads), but it still has some (admittedly rather minor) drawbacks; in particular, it is usually considered a not-so-good idea to expose the internal structure of your system to the outside world (as pretty much everything you reveal MIGHT be used to attack your server); also, exposing the port doesn’t allow for easy movement of your users across the threads (which in turn can affect load balancing between the threads).

The third option is to have your UDP-reading thread do only the very basic job of determining which processing thread the incoming packet belongs to, and to dispatch it there (using some kind of non-blocking queue; more on queues in Chapter [[TODO]]). In this case it is rather unlikely that your UDP-reading thread will become overloaded, and the processing threads will do their job exactly like in the TCP case. And in the unlikely event that your UDP-reading thread becomes overloaded just by receiving-and-dispatching – you can have more than one thread reading the same socket;12 and as all the UDP-reading threads are only dispatching – it won’t matter too much where the packet arrives, though occasional packet reorderings will happen. See also the discussion on implementing this UDP threading architecture on top of Reactors, in Chapter VII.

Which option to choose – it depends. I would stay away from option #1, but both options #2 and #3 do have their own merits. Option #2 is usually a bit faster (there is less data passed around), which is especially noticeable on server boxes (which are pretty much all NUMA these days). Option #3, on the other hand, encapsulates your server-side implementation better (and hides more implementation details from the view of a potential attacker).

IMNSHO, the most important thing in this regard is to avoid tying all of your code to one specific option right away, but rather to have your UDP-threading architecture completely isolated from the rest of your code, so that when you come to the point when this choice becomes Really Important – you can change it without any changes to your Game Logic.

The whole task of optimizing performance beyond, say, 20-50K packets/second per box tends to be Quite Elaborate, and involves quite a few things which are platform- and hardware-dependent. Chances are that you won’t need to go further than that, but if you do – make sure to read about an interesting exercise described in [CloudFlare]; while merely receiving the packets (as described in [CloudFlare]) is different from receiving-and-processing them (as we need for a game server), if you want the absolutely-best performance, you MIGHT need to play with stuff such as RX queues as described there (see also the discussion on Receive Side Scaling a.k.a. RSS, and Receive Packet Steering a.k.a. RPS, in one of the following sources: [Balode], [kernel.org], and [MSDN]).


11 technically, the limit on port numbers is 65535 (minus first 1024), but using too many ports often starts to conflict with firewalls rules and/or NATs. In practice, while games often DO want more than one UDP port to be open, the number of ports they use rarely goes above 500 or so.
12 more or less the same effect can be achieved via IO Completion Ports.

 

[[TODO: recvmmsg() – referenced in Chapter VII on Server-Side Architecture]]

[[TODO: RSS/RPS/RFS and netmap/DPDK/RIO, see also Chapter VII on Server-Side Architecture]]

 

Testing

When implementing network protocols, you DO need to test your implementation very thoroughly (even more so if you’re developing your own protocol). As Glenn Fiedler has put it in [GafferOnGames.PacketFragmentation]:

“Writing a custom network protocol is hard. So hard that I’ve done this from scratch at least 10 times but each time I manage to fuck it up in a new and interesting ways. You’d think I’d learn eventually but this stuff is complicated. You can’t just write the code and expect it to work. You have to test it!” – Glenn Fiedler

and I can sign under each and every one of these words. Now let’s see how such testing can/should be done.

Wireshark

One of the main tools you will need to use when debugging your network protocol (or your implementation of an existing protocol) is [Wireshark]. It is even more true if you need to debug your over-TCP protocol.

While debugging and testing your own network protocol, just install Wireshark on your development machine and monitor all the packets going between your Client and your Server; I am sure you will learn quite a few new things about your protocol even if you previously thought it worked perfectly; this applies regardless of you using TCP or UDP.

Wireshark and encryption

One of the many things which Wireshark can do, is decrypting TLS traffic (seems also to apply to DTLS, though I’ve never used DTLS decrypting myself). Of course, it is not possible to decrypt traffic without a key, but there is a way to supply your server’s private key to Wireshark (see [Wireshark.SSL] for details).

Note that NOT all the cipher suites are supported by Wireshark, so you MAY need to adjust your ciphersuite-of-choice to be able to decrypt your traffic with Wireshark.

tcpdump+Wireshark in Production

One interesting (and not-too-well-known) feature of Wireshark is that you can use it to analyze production communications without installing Wireshark on your production server (that is, at least if you’re running your servers on Linux). The usual sequence in this case goes as follows:

  • You need to analyze what is going on with a specific player
  • You find out her IP address
  • You run tcpdump (easily available on all Linuxes) on your server to get the traffic (into a “capture file”), filtering for that IP address (using a tcpdump filter such as “src host <IP> or dst host <IP>”).
    • While you’re at it, make sure to use tcpdump option “-n” to avoid reverse DNS lookups
  • You download that capture file to your development environment
  • You run Wireshark to see the capture file in a parsed format (feeding your server’s private key to Wireshark to decrypt traffic as described above if applicable)
  • Bingo! You can see what has happened with that unfortunate player, and maybe even fix the bug affecting hundreds of others.

While this option should be considered as a “last resort”, I’ve seen it used in production to solve issues which were otherwise-next-to-impossible-to-identify.

“Soak Testing”, Simulation, and Replay

In [GafferOnGames.PacketFragmentation], Glenn Fiedler mentions “soak testing”. While I myself didn’t call this thing “soak testing” before, I’ve done LOTS of it (and I like the term too 😉 ). The idea (when applied to network/distributed testing) is to make a test run of your implementation with more-or-less random data, and as-random-as-possible usage patterns; a typical running time for a “soak test” is “overnight”, and the “soak test” is considered passed if the next morning there are no apparent problems (like core dumps / asserts / hung connections / etc.).

As noted above, this kind of “soak testing” is pretty well-known among network developers. However, I tend to add two (IMHO very important) things to it.

Network Problems Simulation

First of all, I suggest running “soak tests” while simulating problems at the network layer. The rationale for it is trivial: in a LAN (and even more so on your local machine), chances are that you will never face packet loss (especially “two in a row” packet loss), large and inconsistent latencies, reordered packets, corrupted packets, etc. To make sure that your program does work in the real world outside of your LAN, you do need to test it in the presence of network problems such as those described above.

To get a “bad network connection” in your lab, I know of three different approaches:

  1. Write your own Bad-Network simulator at UDP level. As UDP is a packet-level protocol, you can just write your own wrapper around sendto() (and/or around recvfrom()) and insert all the nasty stuff you want there (see the sketch right after this list). This does work (and when dealing with UDP, I prefer this way personally), but it is restricted to UDP only (for TCP, you don’t have control over packets, so simulating packet loss is not really feasible at app level)
  2. Having your test traffic routed via “latency simulator”, such as Linux-box-with-netem. This is a bit less flexible than Option #1 (you’re restricted to whatever-your-latency-simulator-can-do), but in practice, it is usually enough for in-lab testing, and it works both for UDP and for TCP
    • One variation includes running Linux-with-netem inside VM; this way you’re able to run all the tests on one single developer machine
  3. Get a Really Bad Real-World connection. I’m deadly serious about this, and usually suggest doing it at some point after option #1 or option #2, and before the real-world deployment. However thorough our simulations are, we cannot imagine what kind of weird stuff the real-world Internet can throw at us; a really bad real-world connection tends to reveal quite a few new things about your system
    • While you’re at it, and if you’re using TCP, make sure to use all kinds of your Client platforms over this Really Bad connection. TCP stacks are notorious for having quite different implementations, and those differences have potential to hit you in pretty bad ways.
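
Here is the sketch promised in option #1 above: a drop-in wrapper around sendto() which randomly drops and duplicates outgoing UDP packets (delaying/reordering would additionally require a timer queue, which is omitted here; the loss/duplication rates are purely illustrative):

#include <random>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t lossy_sendto(int s, const void* buf, size_t len, int flags,
                     const sockaddr* to, socklen_t tolen) {
  static std::mt19937 rng(12345); // fixed seed => reproducible test runs
  static std::uniform_real_distribution<double> u(0.0, 1.0);

  if (u(rng) < 0.05)              // 5% simulated packet loss
    return (ssize_t)len;          // pretend the packet was sent
  if (u(rng) < 0.01)              // 1% simulated duplication
    sendto(s, buf, len, flags, to, tolen);
  return sendto(s, buf, len, flags, to, tolen);
}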

Replay Testing as a Big Helper for “Soak Testing”

One problem with “soak testing” is that most of the time we’re actually hunting for those elusive valid-packets-coming-in-an-unusual-order patterns (see the discussion in Chapter V). And when the problem hits – the usual response is “add more logging – run again – hope that the same problem will occur this time”.

This approach does work, BUT it tends to take LOTS of trial-and-error. I was using it myself for years – that is, until I figured out that deterministic record-and-replay helps with debugging deterministic systems (including network protocols) A LOT. Let me elaborate on it a bit.

First, let’s note that most of the time, pretty much any network protocol is described in terms of a state machine (and if by any chance it is not – protocol description can be trivially rewritten this way).

Then, let’s observe that protocol state machines are an ideal fit for “Reactors” (such as those described in Chapter V, and also known as event-driven programs or ad-hoc state machines) – and are often implemented as such. And now, if we just add determinism (as described in Chapter V) – then bingo! We’ve just got an ideal way to test our implementation in a post-mortem of a failed-soak-test.

Normally, if we have our network protocol implemented as a deterministic Reactor (a.k.a. deterministic state machine), development process goes as follows:

  • We run “soak test” while recording all the messages/packets coming to each of the sides of our conversation, into an input-log13
  • If/when the “soak test” fails, we can easily reproduce the whole sequence of events which has led to the problem
    • For example, we can go as follows:
      • We can usually run 10-hours-before-last-2-seconds-before-the-failure in just 20 minutes (for fully deterministic stuff, we are not bound to run it at the same pace, and usually replay runs MUCH faster than original stuff)
      • We can make a snapshot (of the state of our Reactor) at this point to be able to run it from this point pretty much instantly
      • We can launch the debugger and execute our protocol handler exactly as it behaved under these conditions, showing exactly how the bug has brewed and unfolded

Honestly, after spending a substantial portion of my life on debugging network stuff in the usual (non-replay) manner, I can say that for a complicated network protocol, replayable debugging can reduce debugging time by as much as an order of magnitude(!). BTW, most (though admittedly not all) of my fellow network protocol developers loved this replay technique too. In addition, it was observed that this replay technique tends to improve the quality of the resulting protocol/implementation; with replay in place, we can say that we can identify and fix each and every failure which has happened during our “soak testing”; this statement doesn’t hold when using the usual trial-and-error-based fixes during “soak testing”.

Of course, it IS possible to debug network protocols (and implementations) in a traditional trial-and-error style, but I’ve tried both, and I strongly prefer the replay-based one.


13 unlike many other practical uses of deterministic replay in this book, this time it is usually a full input-log, not a circular one.

 

[[TODO: big provider down. handling massive connectivity problems]]

[[To Be Continued…

This concludes beta Chapter XI(e) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter XII, describing marshalling and encodings.]]


References


[Stevens] W. Richard Stevens, Bill Fenner, Andrew M. Rudoff, “UNIX Network Programming: Networking APIs: Sockets and XTI”

[Google] “Google IPv6”

[StackOverflow.BothIPv4IPv6] “How to support both IPv4 and IPv6 connections”

[GammoEtAl] Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag, “Comparing and Evaluating epoll, select, and poll Event Mechanisms”

[Libenzi] Davide Libenzi, “Improving (network) I/O performance ...”

[CloudFlare] Marek Majkowski, “How to receive a million packets per second”

[StackOverflow.Over1024] “Handling more than 1024 file descriptors, in C on Linux”

[Balode] Amit Balode, “Receive Side Scaling and Receive Packet Steering”

[kernel.org] “Scaling in the Linux Networking Stack”

[MSDN] “Introduction to Receive Side Scaling”

[GafferOnGames.PacketFragmentation] Glenn Fiedler, “Packet Fragmentation and Reassembly”

[Wireshark] https://en.wikipedia.org/wiki/Wireshark

[Wireshark.SSL] https://wiki.wireshark.org/SSL

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.


Allocator for (Re)Actors with Optional Kinda-Safety and Relocation

Effects of External Fragmentation on Memory Usage

What is it about

As it says on the tin, this article is about allocators within the context of (Re)Actors (a.k.a. Reactors, Actors, ad hoc FSMs, Event-Driven Programs, and so on).

The main benefits we get from our C++ allocator (with (Re)Actors and proposed allocation model as a prerequisite), are the following:

  • We have the option of having the best possible performance (same as that of good ol’ plain C/C++)
  • Without changing app-level code, we have the option of tracking access to ‘dead’ objects via ‘dangling’ pointers (causing exception rather than memory corruption in the case of such access)
  • Again, without changing app-level code, we have the option of having a compactable heap. Very briefly, compacting the heap is often very important for long-running programs, as without relocation, programs are known to fall victim to so-called ‘external fragmentation’. Just one very common scenario: if we allocate a million 100-byte small objects, we will use around 25,000 4K CPU pages; then if we randomly delete 900,000 of our 100-byte objects, we’ll still have around 24,600 pages in use (unable to release them back to OS), just because it so happened that each of the remaining 24,600 pages has at least one non-deleted object. Such scenarios are quite common, and tend to cause quite a bit of trouble (in the example above, we’re wasting about 9x more memory than we really need, plus we have very poor spatial locality too, which is quite likely to waste cache space and to hurt performance).
    • As a side note, many garbage-collected programming languages have been using compactable heaps for ages; I’ve seen this capability to compact used as an argument that garbage-collected languages are inherently better (and an argument against C++).

Let’s note that while what we’ll be doing allows us to achieve benefits which are comparable to using traditional non-C++ mark-compact garbage collectors, we’re achieving those benefits in a significantly different manner. On the other hand, I don’t want to argue whether what we’re doing really qualifies as ‘automated garbage collection’, or if the name should be different. In the form described in this article, it is not even reference-counted garbage collection (though a similar approach can be applied to allocation models based on std::shared_ptr<> + std::weak_ptr<> – as long as we’re staying within (Re)Actors).

What is important though, is to:

  • Significantly reduce chances for errors/mistakes while coding.
    • Within the proposed allocation model, there are no manual deletes, which should help quite a bit in this regard.
    • In addition, the handling of ‘dangling’ pointers is expected to help quite a bit too (at least while debugging, but in some cases also in production).
  • Allow for best-possible performance when we need it, while allowing it to be a little bit reduced (but still good enough for most production code) if we happen to need to track some bugs (or to rely on the handling of ‘dangling’ pointers).
  • Allow for a compactable heap (again, giving some performance hit compared to the best-possible performance – but the performance hit should usually be mild enough to run our compactable heap in production).

Message-passing is the way to go

Before starting to speak about memory allocation, we need to define what those (Re)Actors we’re about to rely on are about (and why they’re so important).

For a long while, I have been a strong proponent of message-passing mechanisms over mutex-based thread sync for concurrency purposes (starting from [NoBugs10]). Fortunately, I am not alone with such a view; just as one example, the Go language’s concept of “Do not communicate by sharing memory; instead, share memory by communicating” [Go2010] is pretty much the same thing.

However, only after returning from ACCU2017 – and listening to a brilliant talk [Henney17] – did I realize that we’re pretty much at the point of no return, and are about to reach a kinda-consensus that

Message-passing is THE way to implement concurrency at app-level

(as opposed to traditional mutex-based thread sync).

The reasons for this choice are numerous – and range from “mutexes and locks are there to prevent concurrency” (as it was pointed out in [Henney17]), to “doing both thread sync and app-level logic at the same time tends to exceed cognitive limits of the human brain” [NoBugs15].

For the time being, it is not clear which of the message passing mechanisms will win (and whether one single mechanism will win at all) – but as I have had very good experiences with (Re)Actors (a.k.a. Actors, Reactors, ad hoc FSMs, and Event-Driven Programs), for the rest of this article I will concentrate on them.

Setting

To be a bit more specific, let’s describe what I understand as (Re)Actors.

Let’s use Generic Reactor as the common denominator for all our (Re)Actors. This Generic Reactor is just an abstract class, and has a pure virtual function react():

class GenericReactor {
public:
  virtual void react(const Event& ev) = 0;
  virtual ~GenericReactor() = default;
};

Let’s name any piece of code which calls GenericReactor’s react() the ‘Infrastructure Code’. Quite often, this call is within the so-called ‘event loop’:

std::unique_ptr<GenericReactor> r
    = reactorFactory.createReactor(...);
while(true) { //event loop
  Event ev = get_event(); //from select(), libuv, ...
  r->react(ev);
}

Let’s note that the get_event() function can obtain events from wherever we want – from select() (which is quite typical for servers) to libraries such as libuv (which is common for clients).

Also let’s note that an event loop, such as the one above, is by far not the only way to call react(): I’ve seen implementations of Infrastructure Code ranging from one running multiple (Re)Actors within the same thread, to another one which deserialized the (Re)Actor from a DB, then called react(), and then serialized the (Re)Actor back to the DB. What’s important, though, is that even if react() can be called from different threads – it MUST be called as if it were called from one single thread (=‘if necessary, all thread sync should be done OUTSIDE of our (Re)Actor, so react() doesn’t need to bother about thread sync regardless of the Infrastructure Code in use’).

Finally, let’s name any specific derivative from Generic Reactor (which actually implements our react() function), a Specific Reactor:

class SpecificReactor : public GenericReactor {
  void react(const Event& ev) override;
};

Also, let’s observe that whenever a (Re)Actor needs to communicate with another (Re)Actor – adhering to the ‘Do not communicate by sharing memory; instead, share memory by communicating’ principle – it merely sends a message, and it is only this message which will be shared between (Re)Actors.
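Just to illustrate the ‘share memory by communicating’ point in code: in the sketch below (all names are hypothetical, no particular framework implied), a message is moved into a queue, so after sending it the sender has no access to it anymore – only the receiving (Re)Actor does:

#include <deque>
#include <memory>
#include <mutex>
#include <utility>

struct Message { /* plain data passed between (Re)Actors */ };

// Hypothetical inter-(Re)Actor queue; this is the ONLY place where thread
// sync is needed, and it stays invisible to app-level (Re)Actor code.
class MessageQueue {
  std::mutex mtx;
  std::deque<std::unique_ptr<Message>> q;
public:
  void push(std::unique_ptr<Message> msg) {
    std::lock_guard<std::mutex> guard(mtx);
    q.push_back(std::move(msg));
  }
  std::unique_ptr<Message> pop() {   // returns nullptr if the queue is empty
    std::lock_guard<std::mutex> guard(mtx);
    if(q.empty())
      return nullptr;
    std::unique_ptr<Message> msg = std::move(q.front());
    q.pop_front();
    return msg;
  }
};

// Sending side: after std::move() the sender cannot touch the message anymore,
// so there is nothing left to synchronize at app level.
void send_to(MessageQueue& dst, std::unique_ptr<Message> msg) {
  dst.push(std::move(msg));
}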

Trivial optimization: single-threaded allocator

Armed with (Re)Actors, we can easily think of a very simple optimization for our allocation techniques. As all the processing within (Re)Actors is single-threaded, we can easily say that:

  • (Re)Actor allocators can be single-threaded (i.e. without any thread sync – and avoiding relatively expensive ‘compare-and-swap’ operations).
    • One exception to this is those messages which the (Re)Actor sends to the others – but classes implementing those messages can easily use a different (thread-synced) allocator.
  • For the purposes of this article, we’ll say that each (Re)Actor will have its own private (and single-threaded) heap. While this approach can be generalized to per-thread heaps (which may be different from per-(Re)Actor heaps, in cases of multiple (Re)Actors per thread) we won’t do that here.

Ok, let’s write it down that our (Re)Actor allocator is single-threaded – and we’ll rely on this fact for the rest of this article (and everybody who has written a multi-threaded allocator will acknowledge that writing a single-threaded one is a big relief).
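Just to show how much simpler life becomes without thread sync, here is a toy single-threaded free-list allocator for one fixed ‘bucket’ size (a sketch of mine only, not related to any specific production allocator):

#include <cassert>
#include <cstddef>
#include <cstdlib>

// Toy per-(Re)Actor allocator for ONE bucket size; note the complete absence
// of mutexes/atomics/compare-and-swap - it is never called from more than one
// thread at a time.
class SingleThreadedBucketAllocator {
  struct FreeBlock { FreeBlock* next; };
  FreeBlock* free_list = nullptr;
  std::size_t bucket_size;
public:
  explicit SingleThreadedBucketAllocator(std::size_t bucket_size_)
    : bucket_size(bucket_size_) {
    assert(bucket_size >= sizeof(FreeBlock));
  }
  void* allocate() {
    if(free_list) {                  // plain reads/writes, no CAS anywhere
      void* p = free_list;
      free_list = free_list->next;
      return p;
    }
    return std::malloc(bucket_size); // fallback; real code would carve whole pages
  }
  void deallocate(void* p) {
    FreeBlock* blk = static_cast<FreeBlock*>(p);
    blk->next = free_list;
    free_list = blk;
  }
};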

However, we’ll go MUCH further than this rather trivial observation.

Allocation model: owning refs, soft refs, naked refs

At this point, we need to note that in C++ (as mentioned, for example, in [Sutter11]), it is impossible to provide compacted heaps “without at least a new pointer type”. Now, let’s see what can be done about it.

Let’s consider how we handle memory allocations within our (Re)Actor. Let’s say that within our (Re)Actor:

  • We allow for three different types of references/pointers:
    • ‘owning’ references/pointers, which are conceptually similar to std::unique_ptr<>. In other words, if the ‘owning’ reference object goes out of scope, the object referenced by it is automatically destroyed. For the time being, we can say that ‘owning’ references are not reference-counted (and therefore copying them is prohibited, though moving is perfectly fine – just as with std::unique_ptr<>).
    • ‘soft’ pointers/references. These are quite similar to std::weak_ptr<> (though our ‘soft’ references are created from ‘owning’ references and not from std::shared_ptr<>), and to Java WeakRef/SoftRef. However, I don’t want to call them ‘weak references’ to avoid confusion with std::weak_ptr<> – which is pretty similar in concept, but works only in conjunction with std::shared_ptr<>, hence the name ‘soft references’. (A minimal interface sketch of both reference types is shown right after this list.)
      • Most importantly – trying to dereference (in C++, call an operator ->(), operator *(), or operator[]) our ‘soft’ reference when the ‘owning’ reference is already gone is an invalid operation (leading – depending on the mode of operation – to an exception or to UB; more on different modes of operation below).
    • ‘naked’ pointers/references. These are just our usual C/C++ pointers.
  • Our (Re)Actor doesn’t use any non-const globals. Avoiding non-const globals is just good practice – and an especially good one in case of (Re)Actors (which are not supposed to interact beyond exchanging messages).
  • Now, we’re saying that whatever forms the state of our (Re)Actor (in fact – it is all the members of our SpecificReactor) MUST NOT have any naked pointers or references (though both ‘owning’ and ‘soft’ references are perfectly fine). This is quite easy to ensure – and is extremely important for us to be able to provide some of the capabilities which we’ll discuss below.
  • As for collections – we can easily say that they’re exempt from the rules above (i.e. we don’t care how collections are implemented – as long as they’re working). In addition, memory allocated by collections may be exempt from other requirements discussed below (we’ll note when it happens, in appropriate places).
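The minimal interface sketch promised above (the names owning_ptr<>/soft_ptr<> are mine; these are declarations only – each ‘mode of operation’ discussed below provides its own definitions):

// Interface sketch only; implementations differ per mode of operation
// ('Fast', 'kinda-Safe', 'Safe with relocation') discussed below.
template<class T>
class owning_ptr {            // conceptually similar to std::unique_ptr<T>
public:
  owning_ptr(const owning_ptr&) = delete;             // no copying...
  owning_ptr& operator=(const owning_ptr&) = delete;
  owning_ptr(owning_ptr&&) noexcept;                  // ...but moving is fine
  owning_ptr& operator=(owning_ptr&&) noexcept;
  ~owning_ptr();              // destroys the owned object
  T* operator->() const;      // temporary 'naked' pointer, never stored in (Re)Actor state
  T& operator*() const;
};

template<class T>
class soft_ptr {              // does NOT own; may outlive the object
public:
  soft_ptr(const owning_ptr<T>& owner);  // created from an 'owning' ref
  soft_ptr(const soft_ptr&) = default;
  T* operator->() const;      // invalid if the owner is already gone:
  T& operator*() const;       //   exception or UB, depending on the mode
};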

With this memory allocation model in mind, I am very comfortable to say that

It is sufficient to represent ANY data structure, both theoretically and practically

The theoretical part can be demonstrated by establishing a way to represent an arbitrary graph with our allocation model. This can be achieved in two steps: (a) first, we replace all the refs in the arbitrary graph with ‘soft’ refs, and (b) second, there is always a subset of refs (essentially a spanning tree or forest) which makes every node in the graph reachable exactly once; by replacing exactly this subset of references with our ‘owning’ refs, we get the original arbitrary graph represented with our ‘owning refs’+‘soft refs’.

As for a practical part – IMO, it is quite telling that I’ve seen a very practical over-a-million-LOC codebase which worked exactly like this, and it worked like a charm too.

BTW,

most of the findings in this article are also applicable to a more-traditional-for-C++11-folks allocation model of ‘shared ptr’+‘weak ptr’
(though for single-threaded access, so atomic requirements don’t apply; also, we’ll still need to avoid ‘naked’ pointers within the state of our (Re)Actor). However, it is a bit simpler to tell the story from the point of view of ‘owning’ refs +‘soft’ refs, so for the time being we’ll stick to the memory allocation model discussed above.

An all-important observation

Now, based on our memory allocation model, we’re able to make an all-important

Observation 1. Whenever our program counter is within the Infrastructure Code but is outside of react(), there are no ‘naked pointers’ to (Re)Actor’s heap.

This observation directly follows from the prohibition on having ‘naked pointers’ within the (Re)Actor’s state: when we’re outside of react(), there are no ‘naked pointers’ (pointing to the heap of our (Re)Actor) on the stack; and as there are no non-const globals, and no ‘naked pointers’ within the heap itself either – well, we’re fine.

Modes of operation

Now, let’s see how we can implement these ‘owning refs’ and ‘soft refs’. Actually, the beauty of our memory model is that it describes WHAT we’re doing, but doesn’t prescribe HOW it should be implemented. This leads us to several possible implementations (or ‘modes of operation’) for ‘owning refs’/‘soft refs’. Let’s consider some of these modes.

‘Fast’ mode

In ‘Fast’ mode, ‘owning refs/pointers’ are more or less std::unique_ptr<>s – and ‘soft refs/pointers’ are implemented as simple ‘naked pointers’.

With this ‘Fast’ mode, we get the best possible speed, but we don’t get any safety or relocation goodies. Still, it might be perfectly viable for some production deployments where speed is paramount (and crashes are already kinda-ruled out by thorough testing, running new code in production in one of the ‘Safe’ modes for a while, etc. etc.).
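For the ‘Fast’ mode, the interface sketch above can be implemented in a few lines (again, a sketch of mine; construction helpers and error handling are omitted):

// 'Fast' mode: zero overhead over plain new/delete plus a naked pointer.
template<class T>
class owning_ptr {
  T* p;
public:
  explicit owning_ptr(T* p_ = nullptr) : p(p_) {}
  owning_ptr(const owning_ptr&) = delete;
  owning_ptr& operator=(const owning_ptr&) = delete;
  owning_ptr(owning_ptr&& other) noexcept : p(other.p) { other.p = nullptr; }
  owning_ptr& operator=(owning_ptr&& other) noexcept {
    if(this != &other) { delete p; p = other.p; other.p = nullptr; }
    return *this;
  }
  ~owning_ptr() { delete p; }
  T* operator->() const { return p; }
  T& operator*() const { return *p; }
  T* get() const { return p; }       // used to construct soft_ptr<>
};

template<class T>
class soft_ptr {
  T* p;                              // just a 'naked' pointer in 'Fast' mode
public:
  soft_ptr(const owning_ptr<T>& owner) : p(owner.get()) {}
  T* operator->() const { return p; }  // no checks whatsoever: dangling => UB
  T& operator*() const { return *p; }
};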

‘kinda-Safe’ mode

In a ‘kinda-Safe’ mode, we’ll be dealing with ‘dangling pointers’; the idea is to make sure that ‘dangling pointers’ (if there are any) don’t cause memory corruption but cause an exception instead.

First of all, let’s note though that because of the semantics of ‘owning pointers’, they cannot be ‘dangling’, so we need to handle only ‘soft’ and ‘naked’ pointers, and references.

‘Dangling’ soft references/pointers

To deal with ‘dangling’ soft pointers/references, we could go the way of double reference counting (similar to the one done by std::weak_ptr<> – which actually uses the ages-old concept of tombstones), but we can do something better (and BTW, the same technique might be usable to implement std::weak_ptr<> too – though admittedly, generalizing our technique to a multi-threaded environment is going to be non-trivial).

Our idea will be to:

  • Say that our allocator is a ‘bucket allocator’ or ‘slab allocator’. What’s important is that if there is an object at memory address X, then there cannot be an object crossing memory address X, ever.
    • Let’s note that memory allocated by collections for their internal purposes is exempt from this requirement (!).
  • Say that each allocated object has an ID – positioned right before the object itself. IDs are just incremented forever-and-ever for each new allocation (NB: 64-bit ID, being incremented 1e9 times per second, will last without wraparound for about 600 years – good enough for most of the apps out there if you ask me).
  • Each of our ‘owning refs’ and ‘soft refs’, in addition to the pointer, contains an ID of the object it is supposed to point to.
  • Whenever we need to access our ‘owning ref’ or ‘soft ref’ (i.e. we’re calling operator ->() or operator *() to convert from our ref to naked pointer), we’re reading the ID from our ref, AND reading the ID which is positioned right before the object itself – and comparing them. If there is a mismatch, we can easily raise an exception (as the only reason for such a mismatch is that the object has been deleted).
    • This approach has an inherent advantage over a tombstone-based one: as we do not need an extra indirection, this implementation is inherently more cache-friendly. More specifically, we’re not risking an extra read from the L3 cache or, Ritchie forbid, from main RAM – and the latter can easily take as much as 150 CPU cycles. On the other hand, for our ID-reading-and-comparing, we’ll usually be speaking only about the cost of 2–3 CPU cycles.

NB: of course, it IS still possible to use double-ref-counting/tombstones to implement ‘kinda-Safe mode’ – but at this time, I prefer an ID-based implementation as it doesn’t require an extra indirection (and such indirections, potentially costing as much as 150 cycles, can hurt performance pretty badly). OTOH, if it happens that for some of the real-world projects tombstones work better, it is always still possible to implement ‘kinda-Safe mode’ via a traditional tombstone-based approach.
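A sketch of what the ID check might look like on dereference (all names here are mine; the allocator is assumed to place a 64-bit ID right before each allocated object, exactly as described above):

#include <cstdint>
#include <stdexcept>

// Assumed heap layout: [ uint64_t id ][ T object ]; the same ever-incrementing
// id is also stored within each 'owning'/'soft' ref.
inline uint64_t id_before(const void* obj) {
  return *(reinterpret_cast<const uint64_t*>(obj) - 1);
}

template<class T>
class soft_ptr {                 // 'kinda-Safe' mode
  T* p;
  uint64_t expected_id;          // id of the object we're supposed to point to
public:
  soft_ptr(T* p_, uint64_t id_) : p(p_), expected_id(id_) {}
  T* operator->() const {
    // no extra indirection: the id sits right before the object we're about
    // to access anyway, so this is usually just a few extra CPU cycles
    if(id_before(p) != expected_id)
      throw std::runtime_error("access to already-deleted object");
    return p;
  }
  T& operator*() const { return *operator->(); }
};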

‘Dangling’ naked references/pointers

With naked references/pointers – well, strictly speaking, we cannot provide strict guarantees on their safety (that’s why the mode is ‘kinda-Safe’, and not ‘really-Safe’). However, quite a few measures are still possible to both detect such accesses in debugging, and to mitigate the impact if it happens in production:

  • Most importantly, our allocation model already has a restriction on the lifetime of ‘naked’ pointers (they can only live within a single call to react()), which already significantly lowers the risks of ‘naked’ pointers dangling around.
  • In addition, we can ensure that within our (Re)Actor allocator, we do NOT really free memory of deleted objects (leaving them in a kind of ‘zombie’ state) – that is, until we’re out of the react() function. This will further reduce risks of memory corruption due to a ‘dangling’ pointer (just because within our memory allocation model, all the dangling naked pointers will point to ‘zombie’ objects and nothing but ‘zombie’ objects). As for increased memory usage due to delayed reclaiming of the memory – in the vast majority of use cases, it won’t be a problem because of a typical react() being pretty short with relatively few temporaries.
    • In debug mode, we may additionally fill deleted objects with some garbage. In addition, when out of react(), we can check that the garbage within such deleted objects is still intact; for example, if we filled our deleted objects with 0xDEAD bytes, we can check that after leaving react() the deleted objects still have the 0xDEAD pattern – and raise hell if they don’t (messing with the contents of supposedly deleted objects would indicate severe problems within the last call to react()). A sketch of such a check is shown right after this list.
    • In production mode, we can say that our destructors leave our objects in a ‘kinda-safe’ state; in particular, ‘kinda-safe’ state may mean that further pointers (if any) are replaced with nullptrs (and BTW, within our memory allocation model, this may be achieved by enforcing that destructors of ‘owning pointers/refs’ and ‘soft pointers/refs’ are setting their respective pointers to nullptrs; implementing ‘kinda-safe’ state of collections is a different story, though, and will require additional efforts).
    • This can help to contain the damage if a ‘dangling’ pointer indeed tries to access such a ‘zombie’ object – at least we won’t be trying to access any further memory based on garbage within the ‘zombie’.
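As promised, here is a sketch of the debug-mode ‘zombie’ handling (this is my own illustration under the assumption that the per-(Re)Actor allocator keeps a list of objects deleted during the current call to react(); the 0xDEAD pattern and all names are mine):

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Debug-mode helpers: memory of deleted objects is NOT returned to the
// allocator until we're out of react(); instead it is filled with a pattern,
// and the pattern is re-checked after react() returns.
struct ZombieBlock { void* p; std::size_t sz; };
static std::vector<ZombieBlock> zombies;  // in real code - a member of the per-(Re)Actor allocator

inline void mark_as_zombie(void* p, std::size_t sz) {
  uint8_t* bytes = static_cast<uint8_t*>(p);
  for(std::size_t i = 0; i < sz; ++i)
    bytes[i] = (i % 2 ? 0xAD : 0xDE);     // 0xDE 0xAD 0xDE 0xAD ...
  zombies.push_back(ZombieBlock{p, sz});
}

// Called by Infrastructure Code right after react() returns.
inline void check_zombies_and_reclaim() {
  for(const ZombieBlock& z : zombies) {
    const uint8_t* bytes = static_cast<const uint8_t*>(z.p);
    for(std::size_t i = 0; i < z.sz; ++i)
      assert(bytes[i] == (i % 2 ? 0xAD : 0xDE)); // somebody wrote to a deleted object(!)
    // ...only now is the memory really returned to the per-(Re)Actor heap
  }
  zombies.clear();
}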

‘Safe with relocation’ mode

In ‘Safe with relocation’ mode, in addition to dealing with ‘dangling’ soft refs, we’ll allow our allocated objects to be relocated. This lets us eliminate the dreaded ‘external fragmentation’ – which tends to cause quite a bit of trouble for long-running systems: over time we end up with lots of CPU pages each of which still holds just a single allocated object (which in turn, if we cannot possibly relocate those single objects, tends to cause lots of memory waste).

To implement relocation, in addition to the trickery discussed for ‘kinda-Safe’ mode, we’ll be doing the following:

  • All relocations will happen only outside of the react() function (i.e. when there are no ‘naked’ pointers to the heap, phew)
    • How exactly to relocate objects to ensure freeing pages is outside the scope of this article; here, we are concentrating only on the question of how to ensure that everything works after we’re done relocating some of our objects
  • Keep a per-(Re)Actor-heap ‘relocation map’ – a separate map of object IDs (the ones used to identify objects, as discussed for ‘kinda-Safe’ mode) into new addresses.
    • To keep the size of ‘relocation map’ from growing forever-and-ever, we could:
      • For each of our heap objects, keep a counter of all the ‘owning’ and ‘soft’ pointers to the object.
      • Whenever we relocate object, copy this counter to the ‘relocation map’. Here, it will have the semantics of ‘remaining pointers to be fixed’.
      • Whenever we update our ‘owning’ or ‘soft’ pointer as described below, decrement the ‘remaining pointers to be fixed’ counter (and when it becomes zero, we can safely remove the entry from our ‘relocation map’).
    • An alternative (or complementing) approach is to rely on ‘traversing’, as described below.
    • Exact implementation details of the ‘relocation map’ don’t really matter much; as it is accessed only very infrequently, search times within it are not important (though I am not saying we should use linear search there).
  • Whenever we detect access to a non-matching object ID (i.e. an ‘owning pointer’ or ‘soft pointer’ tries to convert to a ‘naked’ pointer and finds out that the object ID in the heap is different from the ID it has stored), instead of raising an exception right away, we’ll look into the ‘relocation map’ using the object ID within the pointer trying to access the object, and then:
    • If an object with this object ID is found in the ‘relocation map’, we update our ‘owning pointer’ or ‘soft pointer’ to the new value and continue.
    • If an object with the ID within the pointer is not found, the object has been deleted, so we raise an exception to indicate an attempt to access a deleted object (just as in ‘kinda-Safe’ mode above).
  • If our relocation has led to a page being freed (and decommitted), attempts to dereference ‘owning pointers’ or ‘soft pointers’ may cause a CPU access violation. In such cases, we should catch the CPU exception, and once again look into our ‘relocation map’ using exactly the same logic as above (and causing either updating the current pointer, or an app-level exception).
    • To make sure that our system works as intended (and that all the pointers can still rely on an object ID always being before the object), we need to take the following steps:
      • After decommitting the page, we still need to keep address space for it reserved.
      • In addition, we need to keep track of such decommitted-but-reserved pages in some kind of ‘page map’, and make sure that if we reuse the same page, we use it only for allocations of exactly the same ‘bucket size’ as before.
      • While this might sound restrictive, for practical x64 systems it is usually not a big deal because (as we’re decommitting the page) we’ll be wasting only address space, and not actual memory. As modern x64 OSs tend to provide processes with a 47-bit address space, this means that for a program which uses no more than 100G of RAM at any given time, and uses 100 different bucket sizes, in the very worst case we’ll waste at most 10000G of address space – and this is still well below the 47-bit address space we normally have.

Bingo! We’ve got (kinda-)safe implementation – and with the ability to compact our heap too, if we wish.
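To give a feel for the dereferencing logic in ‘Safe with relocation’ mode, here is a sketch (all names and the relocation-map interface are mine; handling of CPU access violations for decommitted pages is omitted for brevity):

#include <cstdint>
#include <stdexcept>
#include <unordered_map>

// Per-(Re)Actor-heap 'relocation map': object id -> new address.
// It is consulted only on an id mismatch, so its lookup speed is not critical.
using RelocationMap = std::unordered_map<uint64_t, void*>;
static RelocationMap relocation_map;      // in real code - a member of the heap

inline uint64_t id_before(const void* obj) {
  return *(reinterpret_cast<const uint64_t*>(obj) - 1);
}

template<class T>
class soft_ptr {                          // 'Safe with relocation' mode
  mutable T* p;                           // may be fixed up on dereference
  uint64_t expected_id;
public:
  soft_ptr(T* p_, uint64_t id_) : p(p_), expected_id(id_) {}
  T* operator->() const {
    if(id_before(p) == expected_id)
      return p;                           // fast path: the object didn't move
    auto found = relocation_map.find(expected_id);
    if(found == relocation_map.end())
      throw std::runtime_error("access to already-deleted object");
    p = static_cast<T*>(found->second);   // fix the pointer up and continue
    // (in a refcounted variant we'd also decrement 'remaining pointers to be fixed')
    return p;
  }
  T& operator*() const { return *operator->(); }
};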

Traversing SpecificReactor state

In spite of all our efforts discussed above, there might still be situations when the size of our ‘page map’, and especially of our ‘relocation map’, grows too large. While I expect such situations to be extremely rare, it is still nice to know that there is a way to handle them.

If we say that for every object within our class SpecificReactor there is a traverse() function (with traverse() at each level doing nothing but calling traverse() for each of its child objects), then after calling traverse() for the whole SpecificReactor we can be sure that all the pointers have been dereferenced, and therefore have been fixed up where applicable; as a result – after such a traverse() – our ‘relocation map’ is no longer necessary and can be cleared (BTW, if we’re doing traverse() frequently enough, we may avoid storing the reference count which was mentioned above in the context of cleaning up the ‘relocation map’).

Moreover, after such a call to SpecificReactor::traverse(), we can be sure that there are no more pointers to decommitted pages, which means that ‘page map’ can be cleaned too.
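A sketch of such traverse() functions (it relies on the Event/GenericReactor listings and the owning_ptr<>/soft_ptr<> sketches above; Monster and all other names are made up purely for illustration):

#include <vector>

// traverse() merely dereferences every owning/soft ref within the state;
// in 'Safe with relocation' mode this fixes stale pointers up, so afterwards
// the 'relocation map' (and the 'page map') can be dropped.
struct Monster {                    // hypothetical piece of (Re)Actor state
  soft_ptr<Monster> target;
  explicit Monster(soft_ptr<Monster> target_) : target(target_) {}
  void traverse() {
    (void)&*target;                 // touching it fixes it up if it was relocated
                                    // (real code would also handle deleted targets)
  }
};

class SpecificReactor : public GenericReactor {
  owning_ptr<Monster> boss;
  std::vector<owning_ptr<Monster>> minions;
public:
  void react(const Event& ev) override;
  void traverse() {                 // called by Infrastructure Code, outside of react()
    boss->traverse();               // operator->() itself fixes up the owning ref
    for(auto& m : minions)
      m->traverse();
  }
};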

On the one hand, let’s note that for (Re)Actors with a large state, traversing the whole state may take a while (especially if the state is large enough to spill out of the CPU caches) – which may be undesirable for latency-critical apps. On the other hand, in such cases it is usually possible to implement traversing in an incremental manner (relying on the observation that any newly created objects are not a problem) – but all methods I know for such incremental traversals require us to be very careful about object moves (from a not-traversed-yet into a supposedly-already-traversed area) and about invalidating collection iterators. Still, it is usually possible and fairly easy to write such an incremental traversal – albeit an ad hoc one (i.e. taking the specifics of the app into account).

Further discussion planned

Actually, this is not the end of discussion about (Re)Actors and their allocators. In particular, I hope to discuss how to use such allocators to implement (Re)Actor serialization (and as mentioned in [NoBugs17], serialization of the (Re)Actor state is necessary to achieve quite a few (Re)Actor goodies, including such big things as Replay-Based Regression Testing and production post-factum debugging).


References

[Go2010] “Share Memory By Communicating”, The Go Blog
[Henney17] Kevlin Henney, “Thinking Outside the Synchronisation Quadrant”, ACCU2017
[Loganberry04] David ‘Loganberry’, “Frithaes! – an Introduction to Colloquial Lapine!”
[NoBugs10] ‘No Bugs’ Hare, “Single-Threading: Back to the Future?”, Overload #97–#98, June–Aug 2010
[NoBugs15] ‘No Bugs’ Hare, “Multi-threading at Business-logic Level is Considered Harmful”, Overload #128, Aug 2015
[NoBugs17] ‘No Bugs’ Hare, “Deterministic Components for Interactive Distributed Systems”, ACCU2017
[Sutter11] Herb Sutter, “Garbage Collection Synopsis, and C++”

Disclaimer

as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.


#CPPCON2017. Day 2. Why Local Allocators are a Good Thing(tm) Performance-Wise, and Why I am Very Cautious about C++17 STL parallelized algos


At CPPCON2017 Day 2, two talks were of special interest to me. One was a 2-hour talk about Local Allocators – and another about C++17 STL parallelised algorithms.

Local Allocators

The talk on Local Allocators by John Lakos was brilliant both in terms of content and in terms of presentation (I wish I were able to speak as convincingly as him some day <sigh />). I highly recommend that all performance-conscious C++ developers watch his talk as soon as it becomes available on YouTube.

Very very shortly:

  • using local allocators can easily improve performance by a factor of 10x
    • It doesn’t matter much which of the global allocators you’re using (the default one, tcmalloc, whatever-else) – compared to local allocators, all of them lose hopelessly.
    • It is NOT something new – however, it MIGHT easily come as a revelation to LOTS of developers out there
Poor Locality as a Performance Killer
  • pretty often (IMO – most of the time), the most important thing is not the time of allocation/deallocation – but rather the spatial locality of the allocated data. It may come as a revelation to (and can cause friction with) LOTS of developers trying to optimize their allocators for pure alloc/dealloc times – but here I am 100% on John’s side.
    • diffusion (the randomising of memory access patterns which naturally occurs over the program’s lifetime) IS a performance-killer. Local allocators can help against diffusion too – but keep in mind that it is all about your typical access patterns, so your allocators should be optimized for the access patterns you’re mostly facing.
  • penalty of C++17-style virtualized allocators is low (and often, though not always, is non-existing); the reason is two-fold:
    • because for C++17-style virtualized allocators, most of the time, modern compilers can inline the virtualized calls (while inlining of virtual calls has been known since time immemorial, the fact that it can eliminate most of the virtualized calls in such usage scenarios is new to me; still, I am ready to trust John’s experience and experiments in this regard, though IMO there will be restrictions on the programming patterns which allow this, so in most-performance-critical scenarios we should be aware of them).
    • [not really articulated by John as a reason, but IMNSHO still standing] – most of the time, direct alloc/dealloc costs are minor compared to the benefits of improved locality.

Sure, going for local allocators if you don’t care about performance still qualifies as a premature optimization and has to be avoided – but IF you’re after performance – they’re one of the very first things to look at (IMO – the second one, right after getting rid of unnecessary thread context switches, which can be even more devastating).
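As a quick illustration of what a ‘local allocator’ can look like in today’s C++17 terms, here is my own minimal example using std::pmr (which is one way to get C++17-style virtualized allocators; it is NOT an example from John’s talk):

#include <cstddef>
#include <memory_resource>
#include <vector>

// A local monotonic arena sitting on the stack: allocations are bump-pointer
// ones within 'buf', everything is released at once when 'arena' goes out of
// scope - and, just as importantly, the allocated data ends up spatially
// close together.
void process_request() {
  std::byte buf[64 * 1024];
  std::pmr::monotonic_buffer_resource arena(buf, sizeof(buf));
  std::pmr::vector<int> tmp(&arena);   // uses the local arena, not the global heap
  for(int i = 0; i < 1000; ++i)
    tmp.push_back(i);
  // ... do request-local work with 'tmp' ...
}   // no per-element deallocation: the whole arena is dropped here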

Local Allocators for (Re)Actors

Last but not least: while it wasn’t discussed in John’s talk, I hope he would agree <wink />: (Re)Actors (a.k.a. event-driven programs, ad-hoc finite state machines, etc. etc., discussed in many places of this blog) tend to play really nicely with local allocators; at the very least, a per-(Re)Actor allocator can help A LOT; if necessary – it can be subdivided further into even-more-local allocators, but it is still a very-good-to-have first step. For a discussion of (Re)Actor allocators – see also my own recent series of articles in Overload: “Allocator for (Re)Actors with Optional Kinda-Safety and Relocation”, “A Usable C++ Dialect that is Safe Against Memory Corruption”, and the third one coming in a few days in the October issue of Overload, titled “Allocator for (Re)Actors. Part III – ‘Speedy Gonzales’ Serializing (Re)Actors via Allocators”; while the issues raised by me and in John’s talk are pretty much orthogonal – we can easily reap both the performance benefits discussed by John and the other benefits discussed by me from using per-(Re)Actor allocators.

STL parallel algorithms

Another talk of significant interest to me was the talk by Anthony Williams on concurrency, parallelism, and coroutines. I have to admit that even before the talk, I was already cautious about the newer STL parallel algorithms, but was wondering whether I was missing something.

Anthony Williams. Concurrency, Parallelism and Coroutines

And Anthony’s presentation (which BTW was brilliant, and answered all my questions without me even asking) had a probably-undesired-by-him effect – it convinced me even further that STL parallel algorithms are way way too easy to misuse, and as a result – I am strongly against recommending them to the average developer.

The idea behind C++17 parallel algos looks nice at first glance – “hey, you can just add this ‘policy’ parameter, and your algo will magically become parallel!”. However, it is exactly this perceived simplicity which gives me the creeps. I see the following major problems along this way (to be very very clear: this is NOT what Anthony said, these are my own feelings about the subject; to make your own judgements based on his talk – make sure to watch it when it becomes available on YouTube):

  • adding parallelism only looks simple, while it is not.
    • Most importantly, thread sync is still the responsibility of the developer <ouch! />. As getting thread sync right is Extremely Difficult (just one example – even WG21 members are guilty of providing and proposing code-with-grave-MT-bugs, more on it in my Friday talk), this means that anything which requires thread sync makes the program inherently fragile. And while thread sync in parallel algos might indeed be necessary – still, providing a very syntactically simple way of making your program extremely fragile is hardly a good thing.
      • In other words – I am of the very strong opinion that the standard should not only allow doing simple things in a simple manner, but should also make doing dangerous things more complicated; IMNSHO, we should make an effort to make shooting ourselves in the foot more difficult. Of course, it is a philosophical question and as such can be debated at length – but well, personally I am very sure about it.
    • Another major problem along this way is that in practice, one of the most important things needed to make the algo not just formally parallel, but actually perform better, is the granularity of the work offloaded to different threads. This absolutely-important issue is completely missing from the STL parallel algos. And while in theory it might be fixed solely by the policies – I am extremely sceptical that policies can do it magically and efficiently without significant help from a developer who has intimate knowledge of the algo involved (and having per-algo policies, while technically possible, effectively defeats the idea of separating algos from policies)
      • Sure, these concerns might be mitigated by providing some benchmarks – but these are conspicuously missing from all-the-talks-I’ve-seen-on-the-STL-parallel-algos (!). This certainly doesn’t improve my confidence in their current practical value.
      • And without paying attention to making our algos coarse-grained, at least on non-GPU architectures, we can easily run into a seen-many-times-myself situation where the developer happily says “hey, I was able to utilize all 8 cores”, while at a second glance it turns out that his program-utilizing-all-8-cores actually takes more wall-clock time to execute (!!).
  • The level of control provided by the default policies is extremely limited – and this is another thing which is extremely important in practice.

Given this, I remain extremely cautious about the merits of the STL parallelised algorithms for real-world use on non-GPU processors; in particular, for Intel CPUs I still strongly prefer TBB. On the other hand, for GPGPUs, the STL parallelised algos mentioned by Anthony might happen to be much more interesting, for two big reasons: (a) writing GPGPU code is a major headache to start with; (b) for GPGPUs, the issue of coarse granularity, while still somewhat present, is not that important, so automated balancing has much better chances to work.

Overall: even if STL parallelised algos can be made to perform well after spending a significant amount of work (which is yet to be observed; most likely, even if it happens eventually, there is still a loooong road to get there), and (given the right tools) can easily provide significant value for GPGPUs, what is clear is that

by pretending that parallelisation is simple, they have an enormous potential for getting an unsuspecting developer who tries to use them – and the whole project – badly burned

This can very easily lead to improper thread sync, which will in turn lead to bugs-exhibited-only-once-a-month-on-the-client-site (and increasing the already-unacceptably-high number of programs which fail occasionally is hardly a good idea). In addition – as noted above, it is very easy to write a parallel algo which performs worse even wallclock-wise than a non-parallel one.1


1 BTW, consumed-power-wise, parallel algos pretty much inevitably perform worse than serial ones, especially if thread sync is involved(!) – and this should also be articulated, to discourage developers from using parallel stuff just because it is “cool”

 

It was another long day – and there is another one coming. Tomorrow for me it will be SG14 (“Game Development and Low Latency”) series of meetings.

#CPPCON2017. Day 4. Async Rulezzz!


During Day 4 of CPPCON, I had to prepare for my own talk tomorrow; still – I was able to attend all the talks I was interested in.

But before I can start – let’s discuss a thing which I just understood during CPPCON17; it is related to…

Big Fallacies of Parallel Programming for Joe Average Developer

Twice today, I heard pretty much the same thing, related to the concept of “utilization” of the available CPU cores. And the funniest thing about it was that while on one occasion a reference to “core utilization” felt like a fallacy, on the other occasion it made perfect sense. After thinking about it – I realized that the whole difference was about whether the person realized what it really means.

The first time, I heard about CPU core utilization from a vendor trying to sell their profiler, and saying something along the lines of “hey, if your core is idle – we’ll show it to you in red, so you can see your cores are idling, which is a Bad Thing(tm)”.1 Unfortunately, if interpreting core utilization this way – it is extremely easy to fall into the trap of assuming that utilization is The Only Thing Which Matters(tm). And with such an assumption – the concept of “core utilization” becomes a Really Big Problem. Unfortunately, it is perfectly possible to have a parallelized program, which utilizes all the available cores – while working slower (by wall-clock(!)) than original single-threaded one. Moreover, even if we’re using exactly the same algorithms in single-threaded implementation and in a parallel one – single-threaded one can still win wall-clock-wise. The point here is related to a so-called granularity: very shortly, if we’re doing our job in chunks which are too small – we’ll be incurring LOTS of thread context switches, and with each context switch taking between 2’000 and 1’000’000 CPU clock cycles2 – the overhead can easily make the whole thing perform really badly.

This, in turn, means that if we run a program which simply calculates the sum of an array in parallel, with each sub-task doing only one addition before doing a thread context switch back to the main one, then while we can easily utilize all the 24 cores of our box – the wall-clock time of the calculation can easily be LARGER than the single-threaded one, while consuming 24x more power, and causing a 24x larger CO2 footprint too.

As a result, an interpretation of “core utilization” being The Only Thing Which Matters – is not just wrong, but Really Badly Wrong.

The second time, I heard about CPU core utilization within the talk “The Asynchronous C++ Parallel Programming Model” by Hartmut Kaiser. However, in this talk – it was in the context of HPC, and more importantly – it was accompanied by a discussion on overheads and granularity. In this context – sure, utilizing all the cores, as a rule of thumb, is indeed a Good Thing(tm).

Observing this difference in the potential interpretations of “core utilization” has led me to a feeling that there are several different layers of understanding of parallelizing our calculations, and that the understanding of a Joe Average developer (such as myself) can be very different from the understanding of an HPC pro. This, in turn, means that

taking an HPC pro’s comments out of context and without proper understanding can easily be devastating for our (=”Joe Average”) programming efforts

Here is the list of concepts which, while being true in the proper context (and so obvious to HPC pros that they use them without specifying that context), can be very dangerous when pulled into the context of what-most-of-us-are-doing-for-a-living:

  • the concept of “core utilization”, briefly described above.
    • For HPC pros, it is indeed one of the most important metrics (mostly because they already handled everything else, including choosing proper granularity)
    • For us, “Joe Average” developers – it is not that obvious, and pretty often having the wrong granularity can kill our efforts faster than we can say “Joe Average” – while still keeping 100% utilization.
  • “Let’s parallelize as many things as humanly possible”.
    • While this recommendation comes right from Amdahl’s Law, strictly speaking, even for HPC it stands only if we postulate that “we want the result ASAP at all costs” (and if we take things such as “power costs” or “CO2 footprint” into consideration – things will be somewhat different). Still, for real-world HPC I am perfectly ready to defer to HPC pros on this one 🙂 .
    • However, for us “Joe Average” programmers, most of the time such an approach is just plain wrong.
      • First of all – more often than not, we’re speaking not about long calculations – but rather about interactive stuff. Sure, there MAY be a requirement to finish our work before-next-frame-comes (before-user-dies-from-old-age, …) – but all the other things which are not time-critical are better left without parallelization. Most importantly, non-parallelized code is simpler and much less risky. In addition – it will have less overhead, which in turn will leave more CPU power available for other applications, reduce electricity bills, and be more environment-friendly.
      • Second – if we try to parallelize as many things as possible in the context of interactive programs (which tend to be rather short) – we’ll likely end up with lots of algos whose calculation chunks are too short to be efficient – which, in turn, will cause severe losses of performance.
      • Overall, for us (=”Joe Average developers working on interactive programs”), the rule of “Let’s parallelize as many things as humanly possible”, most of the time, becomes its exact opposite: “Let’s NOT parallelize anything which still fits into time bounds required by interactivity”.
  • “Oversubscribe and utilize all resources”
    • For HPC, it makes perfect sense – after all, if calculation chunks are large enough, any idle core is indeed a waste of resources.
      • Indiscriminate oversubscription can be still a problem (running a million threads per core is rarely a good idea) – but HPC guys do know their stuff, so they know how much to oversubscribe.
    • However, for interactive programs – oversubscription is risky. In particular – having all the cores in use all the time reduces the responsiveness of the system. In other words – to reduce latencies, we have to keep at least one core idle more or less at all times.
      • NB: while we can improve responsiveness by using the priorities – in practice, it happens to be rather difficult (it is easy to forget to raise priority of one of the threads on the critical path, priority inversion can kick in, etc.)
      • Moreover, heavy oversubscription can kill an interactive program much easier than HPC.

1 BTW, later the guy has admitted he’s just a sales person, so he has no idea what it all means <sad-face />
2 that is, on modern desktop/server CPUs, not on GPUs or MCUs

 

Why HPX kicks a** of current parallel STL

Now, after all those necessary disclaimers and clarifications above – I can convey my feelings about the talk “The Asynchronous C++ Parallel Programming Model” by Hartmut Kaiser. Very briefly – he was speaking about HPX, which is IMHO inherently superior to the new parallelized STL algos in C++17. The reason for this is the difference in programming paradigm. While C++17 STL follows a traditional OpenMP-like model of “hey, we have this loop, let’s parallelize it” – HPX goes along the lines of describing “how the thing-we-need can be calculated”, and then hands it over to the infrastructure code to ensure that the stuff is calculated efficiently.

'The Asynchronous C++ Parallel Programming Model' by Hartmut Kaiser

For example, with C++17’s parallel algos, if we have two loops – first, we need to incur lots of context switches to start loop #1, then we have to wait until all the threads finish their calculations (incurring another bunch of expensive context switches), and only then can we start calculating the second loop. On the other hand, with HPX (while the code actually looks reasonably similar to the C++17-based one) – in practice, we’re just describing how certain futures can be calculated, so as soon as all the information necessary to calculate a certain future becomes available – HPX can start calculating it. This allows us to avoid all the unnecessary thread synchronization – which, in turn, provides both better core utilization and reduced overall overhead.
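Purely to illustrate the ‘describe how the thing can be calculated, and let the runtime schedule it’ idea, here is a rough conceptual sketch of mine in terms of the Concurrency-TS-style continuations which HPX provides (it is NOT code from Hartmut’s slides, and the exact header name/signatures below are assumptions):

// Conceptual sketch only; treat exact HPX header/signature details as assumptions.
#include <hpx/include/async.hpp>
#include <utility>
#include <vector>

// Stage #1: start computing a partial sum; returns a future immediately.
hpx::future<double> compute_chunk(std::vector<double> chunk) {
  return hpx::async([chunk = std::move(chunk)] {
    double sum = 0.0;
    for(double d : chunk)
      sum += d;
    return sum;
  });
}

// Stage #2 is attached as a continuation: it starts as soon as ITS OWN input
// is ready - there is no barrier waiting for ALL the chunks of stage #1.
hpx::future<double> pipeline(std::vector<double> chunk) {
  return compute_chunk(std::move(chunk))
    .then([](hpx::future<double> f) {
      return f.get() * 2.0;        // some per-chunk post-processing
    });
}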

For more details – make sure to watch the presentation when it appears on YouTube. What is clear though – is that

once again, asynchronous systems have demonstrated their advantage over the systems-relying-on-explicit-synchronization (such as mutexes)

NB: I heard that a similar future-based implementation is planned for the C++ standard too; when we can hope for it is unclear though.

Naked Coroutines

Another excellent presentation (and once again, demonstrating the power of the asynchronous stuff <wink />) was a talk by Gor Nishanov titled “Naked coroutines live (with networking)”. The point was to take just bare coroutines+new-networking-stuff-from-networking-TS – and to make a non-blocking networked app out of it – live within one hour, no less <smile />. When video of the talk becomes available on YouTube – make sure to watch it with your own eyes to see how easy writing an async network app has become.

Naked Coroutines

Disclaimer: for the time being, for serious network development I’d rather still use native OS APIs, as they’re more likely to provide more knobs to turn – and at least for now, this is often necessary to achieve optimal networking performance. However, nothing prevents us from using coroutines with our-own-async-calls-built-on-top-of-{select()|poll()|epoll()|kqueue()|whatever-else} – in pretty much the same manner as Gor has described in his talk.

Once Again on Metaprogramming

As you might have noticed from my previous post – I am wildly excited about those “metaprogramming” features for C++ coming-to-a-compiler-near-you-some-time-in-the-next-decade. As a result – I couldn’t really afford to miss the talk “Language Support for Metaprogramming in C++” by Andrew Sutton, and I certainly wasn’t disappointed with it. Very shortly – it is another MUST-watch talk when it appears on YouTube.

It was another day at CPPCON2017; tomorrow is the last day, and I’ll be speaking myself – so while I still hope to write something about it, please don’t expect too much <wink />.

Using Parallel Without a Clue: 90x Performance Loss Instead of 8x Gain



Parallel vs Single-Threaded

Ubi nihil vales, ibi nihil velis
Where you are worth nothing, there you should want nothing

— Arnold Geulincx, XVII century —

Disclaimer: if you’re doing parallel programming for living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programming can be made easy-as-pie – make sure to read it.

With C++17 supporting1 parallel versions of the std:: algorithms, there are quite a few people saying “hey, it became really simple to write parallel code!”.

Just as one example, [MSDN] wrote: “Only a few years ago, writing parallel code in C++ was a domain of the experts.” (implying that these days, to write parallel code, you don’t need to be an expert anymore).

I always had my extremely strong suspicions about this position being deadly wrong, but recently I made an experiment which demonstrates Big Fat Dangers(tm) of implying that parallelization can be made as simple as just adding a policy parameter to your std:: call.


1 well, at least on paper; to the best of my knowledge, both libstdc++ and libc++ do NOT support it yet (all my GCC/Clang compilers, as well as Godbolt, are choking on #include <execution>)

 

Task at Hand

Let’s consider the following very simple scenario: we have to calculate a sum of the elements in a million-element integer array. The code we’re starting from, is as simple as

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(begin, end, [&](int item) {
    x += item;
  });
  return x;
}

When running this code on my pretty old Windows box (compiled with MSVC VS2017 in Release mode2) – it takes about 450us.


2 as noted above – as of now and to the best of my knowledge, MSVC is the only compiler+lib able to handle this kind of stuff; note that even in MSVC it is still “experimental”

 

Adding parallelism: Take 1 (badly broken)

First, we have to note that simply adding std::execution::par to our std::for_each() call will NOT work. If trying to write something like

//DON'T USE: WILL CAUSE WRONG RESULT
template<class Iterator>
size_t par_calc_sum_deadly_broken(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(std::execution::par,begin, end, [&](int item) {
    x += item;//data race; often causes wrong calculation result(!)
  });
  return x;
}

– it will compile and will run, but we’ll easily get wrong result (in my experiments with a million-element array, the result was wrong each and every time, but YMMV, which only makes things worse <sad-face />).

Adding parallelism: Take 2 (90x performance hit)

IF we are observant enough to notice this problem – and to find a neat recommendation in [CppReference] – we’ll realize that in addition to specifying std::execution::par, we also have to use std::mutex (or std::atomic) to make our program correct.

Ok, so our next (still very naive BTW) version would be along the following lines:

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par,begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}

This does work correctly – and if we take a look at taskman, we’ll see that it DOES use all the cores (4 physical x 2-due-to-HT = 8 logical ones in my case). And if we didn’t measure the performance against the sequential version – we could think that everything is good here. Not so fast <sad-face />.

Measurements have shown that the function above takes about 40 milliseconds of wall-clock time, so instead of the expected speedup of about 4x-8x, we get about a 90x slowdown compared to the sequential one

(BTW, if you have doubts and want to run it yourself – the code is available at [demo]).
To make things even worse, the code above is written strictly along the lines recommended in [CppReference] (actually, it is almost exactly the same code).

Adding parallelism: Take 3. Atomics to the rescue? (still 50x performance hit)

Ok, as a next step we could think “hey, mutexes are bad for performance, so we should use atomics instead”. So, we rewrite it as

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_atomic(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//changing it to x.fetch_add(item) doesn't change anything - neither it should
  });
  return x;
}

Well, it is still correct, AND (as expected) it is faster than our previous mutex-based version. The problem is that

It is still 50x slower than sequential one

Oh, and BTW – replacing std::execution::par with std::execution::par_unseq (with an assertion on x.is_lock_free() to prevent potential for deadlocks due to vectorization) didn’t make it better.3


3 it is unclear whether using of std::atomic when it is_lock_free() is safe for par_unseq; IMO it is, but there are voices out there that formally it isn’t; in any case, currently MSVC doesn’t really implement par_unseq (falling back to par), so as of now, it is a moot issue.

 

Results

| Box | non-parallelized | std::execution::par with std::mutex | std::execution::par with std::atomic | std::execution::par_unseq with std::atomic |
|---|---|---|---|---|
| #1 (4 physical, 8 logical cores) | 470+-4us | 41200+-900us (90x slower, 600x+ less power-efficient) | 23400+-140us (50x slower, 300x+ less power-efficient) | 23400+-140us (50x slower, 300x+ less power-efficient) |
| #2 (2 physical, 4 logical cores) | 900+-150us | 52500+-6000us (60x slower, 200x+ less power-efficient) | 25100+-4500us (30x slower, 100x+ less power-efficient) | 21800+-2800us (25x slower, 100x+ less power-efficient) |

As we can see, not only is our naive parallel code hopelessly inefficient (greatly increasing the CO2 footprint for absolutely no reason4),

it also punishes users with more powerful boxes

(more strictly, it seems that the more cores we have – the more penalty we get; this will start making sense in the second part of this mini-series).


4 desire to utilize all cores, or to get the program parallel does not qualify as a reason

 

Intermediate Conclusions and To Be Continued

As we have seen from the results above, naive attempts to make our code parallel (while having no clue about the way parallel code works) can easily cause HUGE problems (ranging from wrong results and even crashes, to having even correct programs slow down by factors of 50x-90x).

In other words (arguing with the quote from [MSDN] cited in the very beginning of this post):

Writing parallel code in C++ is still a domain of the experts.5

OTOH, the point of the exercise above is NOT to say that it is not possible to write efficient code with parallel std:: functions (it is). However, to do it – it is still necessary to understand what we’re doing. An explanation of what-is-wrong-with-the-code-above is planned for the next post of mine (spoiler: it is all about overhead which happens to be waaay too high in the code above).


5 It MIGHT change with the async stuff such as [HPX], but current support for parallel algos in std:: (except for std::reduce()), which forces us to mix calculations with thread sync, is not going to make writing of correct-and-high-performance programs any simpler <sigh />

 

References


[MSDN] Artur Laksberg, “Parallel STL – Democratizing Parallelism in C++”, Visual C++ Team Blog
[CppReference] “std::execution::sequenced_policy, std::execution::parallel_policy, std::execution::parallel_unsequenced_policy”
[demo] “parallel.cpp”
[HPX] “The C++ Standard Library for Parallelism and Concurrency”

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.


 

Parallel Coding: From 90x Performance Loss To 2x Improvement



Parallel vs Single-Threaded: Speed vs Power Consumption

Disclaimer: if you’re doing parallel programming for living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programs can be achieved just by adding a few things here and there to your usual single-threaded code – make sure to read it.

In my previous post, we observed pretty bad results for our calculations as we tried to use mutexes and even atomics to do things in parallel. OTOH, I promised to show how the parallel <algorithm> CAN be used both correctly and efficiently (that is, IF you need it, which is a separate question); this is what we’ll start discussing within this post.

Parallel Code Which Made It 90x Slower on 8-Core Box

As it was mentioned in the previous post, the code below – while being correct – was observed to be REALLY REALLY slow (90x slower than original, that’s while using all 8 cores instead of one):

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}

And BTW, while replacing mutex with atomic did improve things, it still was 50x slower than original serialized code.

The Big Fat Overhead Problem

The problem we’re facing here, is actually a problem of the overhead. It is just that the cost of thread context switch (which is likely to happen pretty often as we’re trying to obtain lock_guard) is sooo huge,1 that addition (which costs about 1 CPU cycle2) happens to be about zero-cost compared to the costs of synchronization. That’s exactly what happens here – and in this extreme case, we have this generic problem amplified sooooo badly that it happens that ALL the calculations are actually serialized by this mutex (so effectively, there is no parallelism at all3).


1 in general, it can be within 2000-1’000’000 CPU cycles [NoBugs2016], though here it is on the lower side of things, as cache invalidation doesn’t really happen
2 NB: this is a wrong place to go into a long discussion whether it is 1 CPU cycle or we should speak about it statistically being 3/4th of CPU cycle
3 however, if not for the overhead, 100% serialization would mean only that we’re not gaining anything from going parallel; it is the overhead which is responsible for 90x slowdown

 

Tackling The Overhead Manually

Let’s see what we could do to speed things up; note that the code below is NOT intended for real-world use – it is rather code-to-provide-some-“feeling”-of-what-is-going-on. While performance-wise it is a huge improvement over the previous code, compared to the alternatives-which-we’ll-see-later it is certainly NOT the best option.

  //DON'T USE: UGLY, ERROR-PRONE, AND A-BIT-SLOWER-THAN-OPTIMAL
  size_t TestFuncManualParSum::run(RandomAccessIterator begin, RandomAccessIterator end) {
    std::atomic<size_t> x = 0;
    constexpr int NCHUNKS = 128;
    assert( (end-begin) % NCHUNKS ==0 );//too lazy to handle it for sample code
    RandomAccessIterator starts[NCHUNKS];
    size_t sz = (end - begin) / NCHUNKS;
    for (int i = 0; i < NCHUNKS; ++i) {
      starts[i] = begin + sz * i;
      assert(starts[i]<end);
    }
    std::for_each(std::execution::par, starts, starts + NCHUNKS, [&](RandomAccessIterator start) {
      size_t y = 0;
      for (auto it = start; it < start + sz; ++it) {
        y += *it;//NO synchronization here
      }
      x += y;//synchronization
    });
    return x;
  }

The basic idea here is to avoid synchronization on each and every element; instead – we’re splitting our array into 128 chunks – and running for_each over these large chunks (and within each chunk – we’re calculating partial sum in a local variable y, avoiding any synchronization until the very end of the chunk).4 Let’s note that in the code above, we’re actually relying on the sum to be associative to achieve the speed improvement (we DO change the order of additions compared to sequential code).

If we run this code – we’ll see that it is ACTUALLY FASTER than sequential one (in my measurements – see also a table below – it was by a factor of almost-2x on an 8-core box, while using all the 8 cores).

This is a HUGE improvement over our previous 90x-slower-than-sequential code – AND is the first time we’ve got it faster than sequential.


4 As noted above – it is NOT a production-level code, so I didn’t bother with creating an iterator-which-jumps-over-each-sz-elements, and created an ugly temporary array instead; however, it should be good enough to illustrate the point

 

Enter std::reduce()

Well, while we did manage to improve speed by using multiple cores in our previous code <phew />, it is NOT what we really want to write in our app-level code (that’s to put it mildly). Fortunately, there is a function which will do pretty much the same thing (actually, even a bit better) for us: it is std::reduce().

The point of std::reduce() is that it – just like our code above – exploits the fact that operation-which-is-used-for-reduce (default is ‘+’), is associative.5 Then, the whole thing we wrote above, can be written as follows (it is MUCH less ugly, MUCH less error-prone, and a tiny bit more efficient):

  //DO USE
  size_t TestFuncReduceParSum::run(Iterator begin, Iterator end) {
    return std::reduce(std::execution::par, begin, end, (size_t)0);
       //NB: (size_t)0 specifies not only an initial value, but also the TYPE of the accumulator
       //    which ensures that accumulation is done in terms of size_t, and NOT in terms of *begin
       //    This, in turn, is necessary to keep our calculation EXACTLY the same as our previous ones
  }

Interestingly enough, when I ran the code with std::reduce(), I observed that it is merely 2-3% faster6 than the manual parallelization we used above. While it is certainly not a proof, it is an indication that the methods used within std::reduce() are conceptually similar to the stuff-we-wrote-ourselves-above.7 This approximation, in turn, can be useful to “feel” what we can realistically expect from std::reduce(). After all, there is NO magic involved, and everything-that-happens-in-the-world can (and IMO SHOULD) be explained at least at a very high level of abstraction (and BTW, if you have a better explanation – without going into “we’re experts, just trust us to do things right” – please LMK).


5 or at least almost-associative, which happens with floats, more on it in the next post
6 actually, given the standard deviation being 3%, it qualifies as “barely noticeable”
7 except for “work stealing” which is most likely used by std::reduce(), but I don’t think we’re ready to discuss work stealing at this point

 

Per-Element Mutations

In terms of std::reduce() we can express LOTS of parallelizable things (a bit more on related trickery in the next post), but all of them are inherently limited to our array-being-processed remaining unmodified in the process (in other words – std::reduce() can be seen as a “pure” function). However, as in C++ we’re NOT confined to the world of “pure” functions such as std::reduce(), we MAY consider certain mutating operations over our large collection. However, having learned from our experience with mutexes, we WON’T try to modify more than one single element of our collection in our parallel algorithm. And as long as all our mutations are limited to only one element of our array – we are good to go without mutexes / atomics / etc. <phew />:

void TestFuncForEachParModify::run(Iterator begin, Iterator end) {
  std::for_each(std::execution::par, begin, end, [](int& item) {
    item += 2;
  });
}

In general, this is The Right Way(tm) of doing things in parallel; however – in this particular case of ultra-simple ops (which happen to make our app consume all the memory bandwidth, at least on my box) – the parallel version still happens to be slower than the sequential one(!).8 This is NOT to say that parallel mutations are always slower than serial ones – but rather that while with some experience it MIGHT be possible to predict which parallelization WON’T work (one such example is our mutex-based calculation above) – it is next-to-impossible to predict which parallelization WILL work for sure; hence – whatever we think about our parallelization, it is necessary to test its performance (and under conditions which are as close as possible to real-world ones).

Results

Method                          | Cores Used | Wall-Clock Time (us) | CPU Time (us) | Wall-Clock Compared to Sequential | Power Consumption Compared to Sequential (guesstimate)

“Pure” (mutation-free) Calculation
sequential (for)                | 1          | 2960±20              | 2960          | 1x                                | 1x
sequential (for_each)9          | 1          | 4480±50              | 4480          | 1.5x slower                       | 1.5x higher
std::accumulate()10             | 1          | 2940±30              | 2940          | 1x                                | 1x
std::reduce(seq)                | 1          | 2960±60              | 2960          | 1x                                | 1x
naive std::mutex11              | 8          | 201’000±4000         | 1’600’000     | 70x slower                        | 300x higher
naive std::atomic11             | 8          | 68’600±20            | 550’000       | 25x slower                        | 100x higher
manual chunking                 | 8          | 1620±50              | 13000         | 1.82x faster                      | 2.5x higher
std::reduce(par)                | 8          | 1580±50              | 12600         | 1.87x faster                      | 2.5x higher

Mutation
sequential (for)                | 1          | 3310±20              | 3310          | 1x                                | 1x
sequential (for_each)           | 1          | 3300±4               | 3300          | 1x                                | 1x
sequential (in-place transform) | 1          | 4510±30              | 4510          | 1.36x slower12                    | 1.3x higher
manual chunking                 | 8          | 3540±90              | 28300         | 1.07x slower                      | 4x higher
for_each(par)                   | 8          | 3520±100             | 28100         | 1.06x slower                      | 4x higher

8 once again, manually-parallelized version performs pretty much the same as the best-library-provided one, see [demo2] for source
9 To the best of my understanding, this SHOULD NOT be the case; hopefully it is merely a flaw in MSVC’s handling of lambdas
10 see [demo2]
11 from previous post, see also [demo2]
12 probably – but not 100% sure – this is another under-optimization by MSVC

 

 

Hey fellas, don’t be jealous
When they made him they broke the mould
So charismatic with an automatic
Never prematurely shooting his load

— 'Man for All Seasons' song from Johnny English film —

My observations from the table above:

  • All sequential methods are equivalent (all the differences are within measurement errors).
    • The only exceptions are related to sequential for_each() and in-place transform() – but hopefully these are merely flaws in MSVC’s optimization of lambdas; OTOH, the for_each() case does highlight the risks of using lambdas in performance-critical code even in 2018(!).
  • Out of the parallel methods – those which serialize too often (“naive” versions above) fail BADLY performance-wise.
  • Those parallel methods which serialize “rarely enough” MAY (or may not) outperform the serialized version latency-wise, but are always losing energy-consumption-wise; in fact, it is a UNIVERSAL observation that parallel versions are (almost) always losing to the serial ones in terms of power consumed and CO2 footprint. Which in turn means that – as long as serial code provides good-enough response times from the end-user perspective – DON’T PARALLELIZE; you’ll do a favour BOTH to your user (she will be able to use the cores for something else, AND will pay less in electricity bills), AND to the planet. Or from a bit different perspective – out of all the premature optimizations, premature parallelization is the most premature one.
    • mutexes are evil;13 even atomics are to be avoided as long as possible
    • Even IF we’re mutex-free, parallelization MAY slow things down. So after you found the bottleneck and parallelized – make sure to re-test performance – and under close-to-real-world conditions too.
    • FWIW, the effects of our manual parallelization look very close to those of std::reduce(); this is NOT an argument for doing things manually – but a way to get a very-simplified idea of how std::reduce() does its magic internally.

13 especially so as you’re reading this post (see disclaimer at the very beginning)

 

Intermediate Conclusions and To Be Continued

In the previous post, we took a look at “how NOT to write parallel programs” (which is apparently where LOTS of parallel programming tutorials will lead you). In this one, we have taken a look at “how efficient parallel programs CAN be written”. In the next (and hopefully last-in-this-mini-series) post we’ll discuss some guidelines-and-hints for-NOVICE-parallel programmers, which MAY allow speeding up their programs here and there, without the risk of jeopardizing program correctness – and without the risk of slowing things down by 50x too. Sure, it won’t be the-very-best optimization – but it MAY bring SOME benefits to the table (and it is rather low-risk too, so as a whole it MAY fly).


References


[NoBugs2016] 'No Bugs' Hare, “Operation Costs in CPU Clock Cycles”
[demo2] “parallel2.cpp”

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.



#ACCU2018 Day 1. From Gender Equality to (kinda-)Quantum Computing, with Threads and C++ copy/move in between


As #ACCU2018 is underway, and as I am here, it would be strange if I didn’t use the opportunity to tell you about what I like (and don’t like ;-)) here.

Russel Winder with opening notes

Diversity in Tech – What Can We Do?

Gen Ashley opened the conference with a keynote speech on “Diversity & Inclusivity in Tech”. Main point: we have way too few women in programming (to prove it, it was sufficient to look around).

Gen Ashley

What can be done about it – is a different story. In particular, I am not sure that I agree with Gen that educating specifically girls and women is “the way to go”; to me, it is more like trying to fix the problem after it has already happened, rather than preventing it from happening in the first place. I can very easily be very wrong, but for me, trying to educate (all) children earlier, when they do not have this bias yet, would be a better and more sustainable solution (in particular, it would be an approach which unites people, instead of differentiating them because of their gender).

OTOH, from my perspective this is NOT what I (or most of the developers around here, for that matter) can realistically affect. As a down-to-earth person, I tried to look for those things which I can do while staying within my current job position, and managed to take away a few very simple things which we can (and SHOULD) do to make this field just a tiny bit better:

  • let’s try to avoid bias in the workplace. In other words, let’s treat all our fellow developers as equal, regardless of their gender. This means that a “compliment” of “oh, you’re doing extremely well… for a girl” is NOT a good thing (and yes, there are guys out there who think they’re making a compliment). We are all here for one simple thing – to develop software, so any kind of special treatment based on gender SHOULD NOT happen.1
  • let’s try to avoid unconscious bias when hiring. In this regard, I’ve learned of two simple but hopefully all-important techniques (which are more for HR, but well, some of us MIGHT be able to tell our HRs “what we want” – especially as it will go along with their other tasks such as ensuring gender equality):
    • to have resumes anonymized before reviewing them
      • thinking aloud (was NOT mentioned in the keynote): in theory, we could even try to have the first phone interview anonymized – while being awkward, it might work in the long run (hint: for almost-every female name out there, there is a similarly-sounding male one, so we can try giving this “substitute” name to the over-the-phone interviewer, and to warn the woman-being-interviewed, about it in advance).
    • (was NOT mentioned during the keynote, but is mentioned here, and IMO looks reasonable):
      • have multiple people involved in the decision-making; not only should they tell who they prefer, but also why. While it is not a guarantee (there is always a chance of the bias being systemic), it might help to reduce bias a little bit.
    • as for standardizing the hiring process (which was proposed as one of the ways to deal with bias) – I have to say I am against it; not because I want to promote bias, but because any kind of formalism is incompatible with creative processes (see, for example, my article on it), and we do agree that software development is a creative process, right?

Overall, I think it is long overdue that every one of us starts making small but important steps when doing the things we are already doing (such as development, or hiring, etc. etc.). I do not want to go as far as saying that we should aim for any specific gender ratio (this is much more controversial, in particular because, according to Goodhart’s Law, at the very moment we start aiming for a metric, it ceases to be a good one); however, eliminating existing biases is certainly a Good Thing(tm) from any possible point of view.


1 well, I have to admit that personally, I’d still try to open the door for a woman developer, at least unless she objects (I just hope I won’t be beaten too hard for doing it) – or unless I forget about it, which is unfortunately easy in such environments 🙁

 

Multithreading and C++ copy/move

Anthony William's 'Designing multithreaded code for scalability'

Then I went to the “Designing multithreaded code for scalability” talk by Anthony Williams, and to “Nothing is better than copy or move” by Roger Orr. Coming from two top experts in their respective fields, both were great talks, with significant takeaways. My only complaint about Anthony’s talk was that it should have come with a big fat disclaimer in the very beginning: “If you’re writing app-level code, DON’T EVEN THINK about using any of this!” (I spoke to Anthony after the talk, and he said that he never thought that somebody would try to do it, so we don’t have any disagreements on this point). As for Roger’s talk – I cannot find anything to complain about, and trust me, I tried hard 😉 (seriously, when it is out on video – make sure to watch it).

Roger Orr's 'Nothing is better than copy or move'

Quantum Computing for C++ Devs? Well, not 100% so

Charley Bay's 'A Quantum Data Structure For Classical Computers'

The last talk I attended on Wednesday, was “A Quantum Data Structure For Classical Computers” by Charley Bay. It was quite an interesting talk, however, a word of caution:

  • while it WAS a talk on the history of quantum computing (and quantum mechanics in general), and on very generic principles behind it,
  • it WAS NOT a talk on “how to do practical quantum computing programming” in any way. This is not to say the talk should have included it, but is to say that if you’re looking for anything of practical value (beyond an understanding of the very very basics, such as “if you read your quantum state – some information in the state gets destroyed”) – it is not the talk you really want 🙁 .

Post-Mortem Debugging

My report of the 1st day of the ACCU2018 wouldn’t be complete if I don’t mention one of the sponsor’s exhibits which managed to impress me (it is not an easy feat, BTW 😉 ). I do NOT normally comment on sponsor’s exhibits (it would be too detrimental for most of the sponsors out there <sad-wink />), and no, I am not getting paid for mentioning them here, but this exhibit really stood out by being (IMNSHO) both clever and (potentially) useful.

Exhibit by Undo company

I am speaking about the exhibit by the Undo company. Their main idea is (a) to instrument your existing (compiled) program to be deterministic, (b) to record all the inputs (including returns from system calls and instructions such as RDTSC) into a log, and (c) to allow you to replay this log in the comfort of your own developer’s box – with the replay guaranteed to exhibit exactly the same behaviour as happened in production. In other words, we get not just a core dump (which usually contains memory which is FUBAR and therefore useless), but we can see which series of events has brought our system from a normal state to a crashed one.

Of course, the technology is not without its limitations, in particular:

  • in many cases, you cannot possibly record the whole history for all the years your program has been running; this means that there is a chance that the problem started to develop before the beginning of the log we have. This is “tough luck”, but from my experience with similar systems – it is relatively rare.
  • performance hit due to instrumentation. Depends on your code, but MAY lead to your code becoming unusable in production.
    • In particular, to ensure determinism – they serialize multithreading, so if your program is essentially multithreaded (opposed to being multithreaded only accidentally) – it is likely to have troubles.

BTW, if you noticed similarities with my own talk on ACCU2017 about determinism in (Re)Actors – it is because the principles behind are indeed the same; the main differences between the two are the following:

  • (Re)Actor-based systems provide determinism from the very beginning; Undo tries to implement it as an afterthought (that’s not because THEY didn’t think about it in advance, but because WE as developers didn’t think about it in advance). And while I have deep sympathy for the hurdles Undo is facing, I have to say that I still strongly prefer to have determinism built into the system from the very beginning. OTOH, for existing non-(Re)Actor-based systems – Undo can happen to be just the ticket.
  • the performance hit is inherently smaller for (Re)Actor-based determinism (at least because we do know about our threading model, and we do have the luxury to restrict system calls to those-absolutely-necessary).
  • Overall:
    • if you have to run a serious existing distributed system (the one which was NOT developed with (Re)Actors in mind) – DO try Undo (while I cannot vouch whether they really work, they certainly look good enough to give them a try).
    • however, for NEW development – I still strongly suggest going the route of deterministic (Re)Actors, which would provide all of this (almost-)for-free – and with TONS of other goodies too…

Phew, it was a long day – and there is another one coming. I hope to continue my reports from ACCU2018 tomorrow.

#ACCU2018. Day 2. Threads and Locks ARE a Dead End, Period!


As the Day 2 of ACCU2018 is over, here is my report about it.

Kotlin ~= Java with a Better Syntax

Hadi Hariri's 'Kotlin/Native – Embracing existing ecosystems'

The keynote of Day 2 was about Kotlin (titled “Kotlin/Native – Embracing existing ecosystems”), delivered by Hadi Hariri. While the talk as such was pretty good, I wasn’t convinced that Kotlin is worth the trouble. My takeaways from the talk:

  • Kotlin is a separate language, with a syntax which at first glance looks like a weird hybrid of Java and Python
  • There are quite a few improvements which might make the code a bit less verbose than Java here and there
    • OTOH, sometimes it comes at a cost of pretty convoluted rules, which are IMO rather difficult to grasp (and are very easy to abuse).
    • Also, there are MANY ways to write the same thing with differences being purely syntactical (which I expect to hit readability)
  • Other than that – the ideas behind Kotlin are pretty much the same as that of Java.
    • Arguably – Kotlin is “Java as it should have been written from the very beginning”
    • On the uglier side – I hate when people are saying “hey, you don’t need to think about memory management”; as practice shows, it inevitably leads to semantic memory leaks Eclipse- and HAB-style. While it is not exactly a flaw of the language, it is certainly a flaw of the culture, and when one of the people behind Kotlin says this kind of things – they’re likely to be ingrained into the dominating culture of Kotlin developers, and it is NOT a good sign.
  • And most importantly – as all the improvements of Kotlin are merely syntactic sugar – I have my doubts whether all these improvements are worth the cost of migration to a yet another almost-identical programming language. Sure, IF Google will do the heavy lifting of pushing Kotlin to the mainstream – it might become a contender, but other than that – don’t hold your breath over it. In other words, while there is nothing inherently bad about Kotlin – it is just IMO not “better enough than Java” to justify migration.

Team Management 101

Arjan van Leeuwen's 'How not to lead a team of software professionals'

The second talk I attended today, was “How not to lead a team of software professionals” by Arjan van Leeuwen; it was a pretty good talk, though for those of you who will decide to watch it (when it is available), I have to say that I disagree with the author on two significant points:

  • I do not think that a good manager can afford to be “isolated” (and feel lonely, communicating with the other managers rather than with the team, ouch!); while this is a classical management approach, for us (ppl who are both managers and team leads) there is an option to be a part of the team, and if you can achieve this – it becomes soooo major an improvement for the team that it trumps pretty much all the other considerations.
  • From what I’ve seen, it tends to be very beneficial not to position your developers as subordinates, but rather to have some Greater Good(tm) (best product in the world, making millions of your customers happy, …), and to subordinate both yourself and everybody-else to this Greater Good(tm). This is known to help A Damn Lot(tm) (and BTW, I was able to trace this kind of findings at least back to 60s, so it is certainly nothing new).

Other than that – the talk IMO does map pretty well to the realities of team management. As an added bonus – take a look at the diagram on the photo above – it shows how much time (4 hours per week) you will have for development after you take that lead/management position (and this is consistent with my own experience in this regard). This is to all those ppl who’re saying “hey, architects and team leads must code themselves” (FWIW, in my books, 4 hours out of 40 does not really qualify as “coding”).

Finally! There is a consensus that threads-with-locks MUST DIE!

And now I can discuss the juiciest of today’s talks – two talks related to an observation that threads-with-locks (or more generally – shared-memory approaches) are inherently evil (at least at the app level). A year ago (after ACCU2017), I wrote that “we are past the point of no return” in this regard, and now we can already observe a landslide of talks (by very knowledgeable and respected people) quoting and re-quoting that “mutex should have been named a bottleneck”, that threads as such are not scalable, and so on, and so forth.

Hubert Matthews's 'Read and write considered harmful'

The first related talk was “Read and write considered harmful” by Hubert Matthews. It was a brilliant talk (ABSOLUTE MUST WATCH as soon as it appears on YouTube).1 Overall, it is very much consistent with my own talk coming on Saturday, going along the lines that whether you’re dealing with in-memory state or a DB, you should go for Shared-Nothing architectures for one simple reason: nothing else scales (and in addition, Shared-Nothing Message-Passing architectures provide all kinds of goodies such as being testable, having post-mortem debugging, etc. etc.).

Kevlin Henney's 'Procedural Programming: It’s Back? It Never Went Away'

A talk by Kevlin Henney (titled “Procedural Programming: It’s Back? It Never Went Away”) was more of a brilliantly-delivered lecture about the history of procedural programming, and I didn’t expect any further trouble for multithreading, but – all of a sudden – it culminated with the quote on the photo above: “Threads and locks – they’re kind of a dead end, right?”

Overall, I am extremely happy that we got a consensus that we should get back to single-threading (after all, I have been writing about it since 2010); I am sure that it will allow writing MUCH more reliable (and significantly better performing) programs. For details – if you’re in Bristol now, you may want to visit my talk on Friday…


1 and I have to admit that I felt like a student (ok, let’s make it a Ph.D. student) listening to a lecture by a professor: while everything he said was perfectly understandable and in line with my own experience, it was for the first time in many years that I felt that I’m listening to a person who knows more than me, in what I consider “my” field (FWIW, last time I got the same feeling when I spoke to Alexander Stepanov about 20 years ago).

 

Phew, it was a long day – and there is another one coming. I hope to continue my reports from ACCU2018 tomorrow.

Parallel STL for Newbies: Reduce and Independent Modifications



Parallel Programming: Good Way and Bad Way

Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programs can be written just by adding a few things here and there to your usual single-threaded code – make sure to read it.

In my previous post, we have observed the whole spectrum of parallel coding results – ranging from “awfully bad” to “reasonably improving”. So, the Big Fat Question(tm) we’re facing, is the following:

How SHOULD parallel stuff be coded if you’re NOT a parallel expert?1

1 and most of us aren’t, even if we fancy ourselves parallel experts

 

Do we REALLY need to parallelize?

First of all, we should ask ourselves: do we really need to parallelize? The ONE AND ONLY case for parallelization exists when we have to meet certain latency requirements. In other words, if we do not care for latencies – we MUST NOT parallelize. Any parallel code will be both more error-prone, and less energy-efficient than its serial counterpart, so unless we’re tasked with making our customers unhappy – or are paid to do it by an oil-producing company – we MUST NOT parallelize unless we have proven that we cannot live without it.

TBH, even if we’re latency-bound, I still suggest taking a look at our algorithm before going parallel. There is no point in parallelizing inherently poor algorithms; for example, there is no need to parallelize our recursive calculation of Fibonacci – it is orders of magnitude better to do it iteratively and without any parallelization. In another example, before trying to parallelize calculations of big-number exponents (which BTW won’t parallelize well), you should certainly consider using Montgomery multiplication first.2
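For instance, here is a minimal sketch of the iterative Fibonacci mentioned above – a plain O(n) loop which beats the naive recursive version by orders of magnitude, without spawning a single thread:

#include <cstdint>

//iterative Fibonacci: O(n) additions, no recursion, no parallelization needed
uint64_t fib(unsigned n) {
  uint64_t a = 0, b = 1;
  for(unsigned i = 0; i < n; ++i) {
    uint64_t next = a + b; //NB: overflows for n > 93, which is fine for this illustration
    a = b;
    b = next;
  }
  return a;
}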

Real-world example: once upon a time, I was involved in the development of a (potentially parallelizable) universal hash function, the calculations of which were performed modulo 2^32+15; apparently, 2^32+15 can be seen as a “pseudo+Mersenne” prime (similar to pseudo-Mersenne primes which are represented as 2^N-epsilon), which allows calculations modulo 2^32+15 to be sped up significantly using methods similar to the optimizations used for Mersenne and pseudo-Mersenne primes. Overall, comparing our final code to the original one (which was based on a simplistic use of Boost.Multiprecision), we got a performance improvement of about 100x – that’s without any parallelization.

This is not to say that parallelization can’t possibly make sense – but rather that

Parallelization MUST be used ONLY as a last resort – when all the algorithmic improvements are exhausted

2 well, “first after using ‘exponentiation by squaring'”

 

Parallelization on the Server-Side

When speaking about parallelization on the Server-Side, we should be even more careful than usual.

One example of a horrible misuse of parallelization is to try parallelizing things in a naive manner when processing HTTP requests which come to a web server. Let’s imagine that we decided to parallelize such a thing, just by requesting more threads whenever we feel like it. We can parallelize to our heart’s content, and even tests on our development box will show significant improvements – but when we try to deploy such a contraption to at least somewhat-loaded web server – it will fall apart very quickly. The problem here is that seriously loaded web servers are already heavily parallelized (just by having lots of requests from different users being handled in parallel) – and any further parallelization will at best do nothing, and at worst – will create NCORES threads for each request, which can easily lead to thousands of threads running – with these threads causing TONS of completely unnecessary context switches, and consuming TONS of resources, easily bringing down your whole website because of such naive Server-Side parallelization.

This is not to say that Server-Side parallelization is inherently evil – BUT you have to be extra careful when parallelizing on the Server-Side. In particular, you MUST understand how many cores you have – and how your setup will use them. Examples where Server-Side parallelization DOES work include:

  • HPC-like stuff (I don’t want to argue whether HPC qualifies as a Server-Side, but more importantly, HPC and parallelization do work very well together).
  • Queued web requests. If you have some of your web server requests processed in a special time-consuming manner – you MAY want to have a queue processing all such requests. And as soon as you have such a queue – you MAY dedicate some of your cores (or whole boxes) to processing requests in such a queue, and as soon as you know this – you MAY parallelize without the risk of thrashing your whole system with too many threads.

The two scenarios above are the most common ones, but there are, of course, other parallelization cases which do work. The most important thing is to remember to avoid creating threads on a per-request basis; instead, you should make sure that the number of threads you’re using for parallelization is related not to the number of requests, but to the number of cores you really have.
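As a minimal sketch of this principle (the worker below is an empty stub, and all the names are mine; a real server would feed the workers from a request queue):

#include <cstddef>
#include <thread>
#include <vector>

void worker_loop() {
  //in a real server this would pop requests from a queue and process them;
  //here it is an empty stub, just to show how the pool is sized
}

int main() {
  std::size_t n = std::thread::hardware_concurrency();
  if(n == 0) n = 1; //hardware_concurrency() is allowed to return 0 if unknown
  std::vector<std::thread> pool;
  for(std::size_t i = 0; i < n; ++i)
    pool.emplace_back(worker_loop); //fixed number of workers, independent of the number of requests
  for(auto& t : pool) t.join();
}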

Reduce rulezz!

One observation which we can make, based on the results already seen in the previous post, is that

reduce (in particular, std::reduce()) is a very good way to achieve parallelization (that is, IF we need it).

Reduce allows us to calculate a certain associative operation over a huge set of data (with pre-filtering if necessary). In general, reduce works very well (among other things, it scales to Shared-Nothing architectures, which is why it is so popular for Big Data DBs). However, reduce is not without its quirks. We won’t go into advanced stuff here, but will restrict ourselves to the three most obvious observations.

Reduce Caveat #1: Operations on floats are NOT 100% Associative

As noted above, parallel reduce (in particular, std::reduce()) relies on the operation being associative. While this may seem a minor restriction, it becomes much more significant once we realize that

Operations over floats/doubles are not really associative

Indeed, every time we’re adding two floats, we also have an implicit rounding operation after the sum-in-math-sense is calculated. Moreover, this rounding is non-linear with relation to all the arithmetic operations, which means that the result of the addition of 3 float numbers MAY depend on the order in which we’re doing it. One very practical manifestation of this property is the observation that the well-known catastrophic cancellation can often be avoided by reordering additions, but it is much more general than that.
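To see this non-associativity with your own eyes, here is a minimal demo (the specific constants are mine; the effect itself is a direct consequence of IEEE-754 round-to-nearest):

#include <cstdio>

int main() {
  float a = 1e8f, b = -1e8f, c = 1.f;
  std::printf("%f\n", (a + b) + c); //prints 1.000000
  std::printf("%f\n", a + (b + c)); //prints 0.000000 - the 1.f is lost when rounding b + c
}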

When applied to reduce, it means that

If we’re calling reduce() over floats, the end result is NOT deterministic

In practice, it is often not that bad (=”reduce is still usable”), but you still need to keep in mind at least two things:

  • unit tests which are based on exact comparisons won’t work anymore (or worse – they might work on the first few runs, but may fail at some point later); a tolerance-based comparison sketch follows below.
  • in some relatively rare cases, it may lead your algorithm to become unstable (starting to diverge, etc.). Usually, it is just a sign of a time bomb which is ready to fire anyway, but sometimes it can happen when your algorithm is implicitly relying on certain unwritten properties (such as an order of elements in the array).

When using reduce with floats (and regardless of the programming language), we have to always keep these in mind <sad-face />.
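Regarding the unit-test point above, a minimal sketch of a tolerance-based comparison (the helper name and the default tolerance are mine; the tolerance needs to be chosen for your data and algorithm):

#include <algorithm>
#include <cmath>

//compare with a relative tolerance instead of operator==
bool approx_equal(double a, double b, double rel_tol = 1e-9) {
  return std::fabs(a - b) <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}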

Reduce Caveat #2: Different Semantics for Non-Commutative Calculations than std::accumulate()

The second caveat is actually C++-specific, and goes to certain assumptions which are easy to make – and which (as quite a few assumptions out there) will lead to mortgage-crisis-size ummm… hiccups.

The story of this particular assumption unfolds as follows. When using std::accumulate(), it is common to write something along the following lines:

double add_sq(double acc, double item) { return acc + item * item; }
//...
double sum_sq = std::accumulate(v.begin(), v.end(), 0., add_sq);

This way, we can calculate things such as sum of squares. However,

with std::reduce() such code, while it compiles at least under MSVC, MAY occasionally provide wrong results

The thing here is that for std::reduce(), it is not one function, but four of them which we have to provide (as we’ve seen in the previous post, reduce needs to accumulate data in separate threads separately – and then needs to combine the accumulators; this is where the 2nd function comes from; the other two could actually be avoided, but as the almighty standard requires us to implement all four – libraries are allowed to use any of these, and unless we implement all four, we’ll be in Big Trouble(tm)).

These four overloads, which we have to provide via a functor we’re feeding to std::reduce(), are the following (special thanks to Billy O’Neal from the MSVC Lib Team for explaining these things and the underlying mechanics to me):

struct Acc {
  double sum_sq;

  Acc() : sum_sq(0.) {}
  Acc(double sum_sq_) : sum_sq(sum_sq_) {}
};

struct add_sq {
  Acc operator()(Acc acc, double item) {   //accumulator + element
    return Acc(acc.sum_sq + item * item);
  }
  Acc operator()(double item, Acc acc) {   //element + accumulator
    return Acc(acc.sum_sq + item * item);
  }
  Acc operator()(double item1, double item2) {   //element + element
    return Acc(item1*item1 + item2*item2);
  }
  Acc operator()(Acc a, Acc b) {   //combining two per-thread accumulators
    return Acc(a.sum_sq + b.sum_sq);
  }
};
Acc sum = std::reduce(v.begin(), v.end(), Acc(), add_sq());
  //NB: the initial value is Acc() (NOT 0.), and it is a functor instance add_sq() which is passed

IF we’re specifying only a function add_sq() rather than the whole functor-with-four-overloads – we’re effectively saying that all four overloads are the same(!). This is fine for “symmetric” accumulators like simple sum,

but leads to invalid results when dealing with non-commutative accumulation function such as calculating sum of squares

Overall, I am arguing for writing the four overloads above in pretty much any case (ok, we can exclude simple addition); it is only when you write all four overloads and see that they’re identical that you may switch back to using a single function.

NB: a similar thing should be doable with std::transform_reduce(), which is probably a more appropriate way of doing things; still, if coming from the std::accumulate() side, the way described above may feel more “natural”. As we’ll see below, it is also more generic, as it allows us to avoid unnecessary multiple passes over the same container.
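For completeness, here is a minimal sketch of the same sum-of-squares via std::transform_reduce() (assuming v is a std::vector<double>); the squaring goes into the unary “transform” part, so the “reduce” part stays a plain commutative addition and no functor-with-four-overloads is needed:

//needs <numeric>, <execution>, and <functional>
double sum_sq = std::transform_reduce(std::execution::par, v.begin(), v.end(),
                                      0.,                                        //initial value (and accumulator type)
                                      std::plus<>{},                             //reduce: plain addition
                                      [](double item) { return item * item; });  //transform: square each element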

 

Reduce Caveat #3: Avoiding Multiple Passes

Another thing to keep in mind is that when shoehorning your program into using reduce(), it is easy to make the performance mistake of replacing one single pass over your collection with two calls to reduce() – merely to calculate two different things over the same collection. The two-pass version will usually take longer (in many cases – 2x longer!) – often negating any benefit from going parallel.

Let’s consider a very simple case when you had to calculate both sum of the elements – and sum of squares of the elements – over the same container. Your original code looked as follows:

double sum = 0., sum_sq = 0.;
for(double item:v) {
  sum += item;
  sum_sq += item * item;
}

(that is, assuming that you didn’t bend to not-so-wise voices of “hey, just run std::accumulate() twice” in the first place3).

When going parallel, it does become tempting to rewrite it as

//using Acc and add_sq from above
double sum = std::reduce(v.begin(), v.end(), double(0.));
  //yes, I know that 0. would suffice, 
  //  but I still prefer to be very explicit in such critical places
double sum_sq = std::reduce(v.begin(), v.end(), Acc(), add_sq()).sum_sq;

As a Big Fat Rule of Thumb(tm), this counts as a pretty bad idea – as we’ll be making two passes over the same data, at the very least we’re causing a double hit on the memory bus. However, there is a way to write the same thing as one single reduce():

struct Acc2 {
  double sum;
  double sum_sq;

  Acc2() : sum(0.), sum_sq(0.) {}
  Acc2(double sum_, double sum_sq_) 
  : sum(sum_), sum_sq(sum_sq_) {
  }
};
struct add_item_and_sq {
  Acc2 operator()(Acc2 acc, double item) { 
    return Acc2(acc.sum+item, acc.sum_sq + item * item);
  }
  Acc2 operator()(double item, Acc2 acc) { 
    return Acc2(acc.sum+item,acc.sum_sq + item * item); 
  }
  Acc2 operator()(double item1, double item2) {
    return Acc2(item1+item2, item1*item1 + item2*item2);
  }
  Acc2 operator()(Acc2 a, Acc2 b) {
    return Acc2(a.sum+b.sum,a.sum_sq + b.sum_sq); 
  } 
}; //...
Acc2 acc = std::reduce(v.begin(), v.end(),Acc2(),add_item_and_sq());

Bingo! We got our parallel code – and in one pass too!

IMPORTANT: at least under MSVC, results of using one-pass vs two-pass calculations varied greatly: I’ve got anything from single-pass being 2x faster, to it being 2x slower, so make sure to performance-test your code. I tend to attribute it to optimization flaws in MSVC (FWIW, from my experience they tend to affect both MSVC and GCC, but not Clang – and I am going to research this phenomenon further), so there is a chance that in some distant future one-pass calculations will become faster than two-pass as they should.


3 it is about 2x performance loss over all but the smallest arrays

 

Independent Per-Item Modifications Are Ok

Reduce is a very powerful mechanism – that is, IF we have to calculate something over a large collection without modifying it; in other words, modifications are out of scope for reduce()4. If we have to modify something – we MAY still do it using something like parallel for_each() – but we MUST be sure that ALL parallel modifications are going over independent elements, without ANY interaction between any two elements involved. In other words, it is perfectly ok to modify all the elements in your array as item = f(item, params), where f is an arbitrarily complex function, and params are exactly the same for all the elements. Such an approach guarantees that we can run our parallel for_each without any mutexes – and it will be correct too <smile />.
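A minimal sketch of such an item = f(item, params) mutation (the Params struct and its fields are purely illustrative): params is captured by value, and each invocation of the lambda touches only its own element, so no mutexes/atomics are needed:

struct Params { double scale; double offset; };

void scale_all(std::vector<double>& v, Params params) {
  std::for_each(std::execution::par, v.begin(), v.end(),
    [params](double& item) {                      //params copied into the lambda - nothing is shared
      item = item * params.scale + params.offset; //reads/writes only this single element
    });
}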

However, at the very moment when we try to use any other item in our modification code – we’re badly out of luck <sad-face />. Not only will demons fly out of our noses,5 but even if we manage to place mutexes correctly – it will be extremely difficult to define the semantics of what we’re really trying to do here (and we haven’t even started discussing performance, which is likely to be atrocious when using mutexes).


4 and also of transform_reduce(), which only modifies copies of the elements – so it can feed these modified-copies to reduce()
5 that’s what happens when we allow Undefined Behavior to happen

 

Beyond That – RTFM, RTFM, and even more RTFM

If you can limit your parallelism efforts to (a) reduce() or transform_reduce() for calculating values over constant-at-the-moment arrays, and (b) modifications which are strictly per-item – you’ll be fine. But if you happen to need something beyond that – take a deep breath and allocate at least several months6 to understanding how parallelism really works under the hood. Just to get you started in this direction, you can read two books (make sure to read both!): Anthony Williams’ “C++ Concurrency in Action”, and Paul McKenney’s “Is Parallel Programming Hard, And, If So, What Can You Do About It?”. [[TODO: links]] Believe me7 – this is very likely to become the most difficult part of software engineering to master <sad-face />.

[[NOTE TO PARALLEL PROS: IF you can point out some other simplistic patterns which can be done without mutexes/atomics – please LMK, I’ll be happy to include them in my list-of-patterns-for-parallel-non-experts.]]


6 I am NOT kidding! <sad-face /> Actually, to master the subject of parallelism will take years – but in some months you MIGHT be able to produce something of value
7 NB: I am still assuming you’re NOT writing parallel code for living

 

CAS (Re)Actor for Non-Blocking Multithreaded Primitives

nano-(Re)Actors exchanging messages

Those of you who happen to follow my ramblings, probably already know that I am a big fan of so-called (Re)Actors (see, for example, [NoBugs10], [NoBugs15], and [NoBugs17]).

Very, very briefly, a (Re)Actor is a thing which is known under a dozen different names, including ‘Actor’, ‘Reactor’, ‘event-driven program’, and ‘ad hoc state machine’. What is most important for us now is that the logic within a (Re)Actor is inherently thread-agnostic; in other words, logic within the (Re)Actor runs without the need to know about inter-thread synchronization (just as a single-threaded program would do). This has numerous benefits: it simplifies development a lot, makes the logic deterministic and therefore testable (and determinism further enables such goodies as post-mortem production debugging and replay-based regression testing), tends to beat mutex-based multithreaded programs performance-wise, etc. etc. And in 2017, I started to feel that the Dark Ages of mutex-based thread sync were over, and that more and more opinion leaders were starting to advocate message-passing approaches in general (see, for example, [Henney17] and [Kaiser17]) and (Re)Actors in particular.

Next, let’s note that in spite of the aforementioned single-threaded (or, more precisely, thread-agnostic) nature of each single (Re)Actor, multiple (Re)Actors can be used to build Shared-Nothing near-perfectly-scalable multi-threaded/multi-core systems [NoBugs17]. This observation has recently led me to a not-so-trivial realization that in quite a few cases, we can use (Re)Actors to… implement non-blocking multithreaded primitives. The specific problem I was thinking about at that point, was a multiple-writer single-reader (MWSR) blocking-only-when-necessary queue with flow control, but I am certain the concept is applicable to a wide array of multithreaded primitives.

Basic Idea – CAS (Re)Actor

As noted above, distributed systems consisting of multiple (Re)Actors are known to work pretty well. Basically, in what I call a (Re)Actor-fest architecture, all we have is a bunch of (Re)Actors, which exchange messages with each other, with nothing more than this bunch of (Re)Actors in sight. Apparently, this model is sufficient to implement any distributed real-world system I know about (and very practically too).

Now, let’s try to use pretty much the same idea to build a multithreaded primitive (using (Re)Actors with an ultra-small state). Let’s start with the following few basic concepts:

  • We have one or more (Re)Actors
    Each of these (Re)Actors has its state fitting into one CAS block (i.e. the whole state can be processed within one CAS operation). Let’s call these (Re)Actors ‘CAS (Re)Actors’.
  • When we’re saving the state to the CAS block, all kinds of compression are permissible, as long as we guarantee that the state always fits into one single CAS block. In particular, all kinds of bitfields are perfectly fine.
  • All interactions between (Re)Actors are implemented as message exchanges (i.e. no (Re)Actor can access another (Re)Actor’s state, except via sending a message asking to perform a certain operation).
    As for the nature of messages – it depends, and in theory they can be as complicated as we wish, but in practice most of the time they will be as simple as a tuple (enum message_type, some_int_t parameter)
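For instance, such a message could be as simple as the following (names are purely illustrative):

enum class MsgType : uint8_t { LOCK_REQUEST, LOCK_GRANTED, UNLOCK };

struct Message {     //the whole message is a simple (type, parameter) tuple
  MsgType type;
  uint32_t param;    //e.g. an index into a pre-allocated array, rather than a pointer
};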

As soon as this is in place, we can write and use our (Re)Actors as shown in Listing 1, annotated with (a), (b), (c) and (d) to correspond with the following explanation. The logic within the infinite while loop with compare_exchange_weak inside is very standard for CAS-based primitives. First, we’re reading the data (in our case, we’re doing it in the constructor). Then, we’re going into an infinite loop: (a) calculating a new value for the CAS block; (b) executing compare_exchange_weak(). If compare_exchange_weak() returns true (c), our job is done, and we can return the value. If, however, compare_exchange_weak() returns false, our write is guaranteed not to have happened (i.e. we did not change the CAS block), so we can easily discard all our on-stack changes to bring the system back to the exact state which was there before we started (but with an up-to-date value of last_read), and try again (d). In practice, it is extremely rare to have more than 2–3 rounds within this ‘infinite’ loop, but in theory, on a highly contentious CAS block, any number of iterations is possible.

//Listing 1
using CAS=std::atomic<CAS_block>;
CAS global_cas;//accessible from multiple threads
               //in practice, shouldn’t be global
               //but for the example it will do

class ReactorAData { //state of our ReactorA
  CAS_block data;

  public:
  ReactorAData() { ... }

  private:
  int OnEventX_mt_agnostic(int param) {
    //modifies our data
    //absolutely NO worries about multithreading here(!)
    //MUST NOT have any side effects
    //such as modifying globals etc.
    //...
  }
  //other OnEvent*_mt_agnostic() handlers
  friend class ReactorAHandle;
};
class ReactorAHandle {//’handle’ to the state of ReactorA
  CAS* cas; //points to global_cas
  ReactorAData last_read;

  public:
  ReactorAHandle(CAS* cas_) {
    cas = cas_;
    last_read.data = cas->load();
  }
  int OnEventX(int param) {
    while(true) {
      ReactorAData new_data = last_read;
      int ret = new_data.OnEventX_mt_agnostic(param);//(a)
      bool ok = cas->compare_exchange_weak(
                last_read.data, new_data.data );//(b)
      if( ok )
        return ret;//(c)
      //(d)
    }
  }
  //other OnEvent*() handlers
};

Another way to see it is to say that what we’re doing here is an incarnation of the good old optimistic locking: we’re just trying to perform a kinda-‘transaction’ over our CAS block, with the kinda-‘transaction’ being a read-modify-write performed in an optimistic manner. If a mid-air collision (= “somebody has already modified the CAS block while we were working”) happens, it will be detected by compare_exchange_weak(), and – just as for any other optimistic locking – we just have to rollback our kinda-‘transaction’ and start over.

That’s pretty much it! We’ve got our multithread-safe event-handling function OnEventX() for ReactorAHandle, while our OnEventX_mt_agnostic() function is, well, multithread-agnostic. This means that we do NOT need to think about multithreading while writing OnEventX_mt_agnostic(). This alone counts as a Big Fat Improvement™ when designing correct multithreaded primitives/algorithms.

Moreover, with these mechanics in place, we can build our multithreaded primitives out of multiple (Re)Actors using the very-simple-to-follow logic of “hey, to do this operation, I – as a (Re)ActorA – have to change my own state and to send such-and-such message to another (Re)ActorB”. This effectively introduces a layer of abstraction, which tends to provide a much more manageable approach to designing multithreaded primitives/algorithms than designing them right on top of CAS (which are rather difficult to grasp, and happen to be even more difficult to get right).

Of course, as always, it is not really a silver bullet, and there are certain caveats. In particular, two things are going to cause us trouble on the way: these are (a) a limitation on CAS block size, and (b) the ABA problem.

On CAS block size

One thing which traditionally plagues writers of multithreaded primitives is a limitation on the CAS block size. Fortunately, all modern x64 CPUs support CMPXCHG16B operations, which means that we’re speaking about 128-bit CAS blocks for our (Re)Actors. This, while not being much, happens to be not too shabby for the purposes of our extremely limited (Re)Actors.

To further help with the limitations, we can observe that (rather unusually for CAS-based stuff) we can use all kinds of bit-packing techniques within our CAS_block. In other words, if we have to have a field within ReactorAData::data, we can use as many bits as we need, and don’t need to care about alignments, byte boundaries, etc. In addition, we can (and often should) use indexes instead of pointers (which usually helps to save quite a few bits), etc. etc.
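As an illustration only (the field names and bit widths are mine, and any real layout would depend on the specific primitive being built; also note that on x64, a 128-bit std::atomic typically needs compiler support such as GCC/Clang’s -mcx16 to actually use CMPXCHG16B), a bit-packed CAS_block might look along the following lines:

struct CAS_block {               //must fit into one 128-bit CAS block
  uint64_t aba_counter : 60;     //see ‘Solving the ABA problem’ below
  uint64_t head_idx    : 20;     //index into a pre-allocated array, instead of a pointer
  uint64_t tail_idx    : 20;
  uint64_t lock_count  : 24;     //no byte-boundary or alignment worries - any bit widths will do
};
static_assert(sizeof(CAS_block) <= 16, "whole state must fit into a single CAS-able block");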

Solving the ABA problem

Another issue which almost universally rears its ugly head when speaking about not-so-trivial uses of CAS is the so-called ABA problem. Very, very roughly it is about the system being in exactly the same state under CAS, while being in a different semantic state (for examples of ABA in action, see, for example, [Wikipedia.ABA]).

Of course, the same problem would apply to our CAS (Re)Actors too. However, apparently there is a neat workaround. If we keep a special ABAcounter field as a part of our ReactorAData::data – a counter of successful modifications of ReactorAData::data (i.e. we’ll increment this counter on each and every modification of ReactorAData::data) – then we’re guaranteed to avoid the ABA problem as long as the ABAcounter doesn’t overflow. This stands merely because for each modification we’ll get a different value of the CAS block, and therefore we won’t run into a ‘having the same state’ situation, ever.

Now, let’s take a look at how large this counter needs to be to avoid overflowing in practice. Let’s consider a system with the CPU clock running at 3GHz, and a maximum lifetime of the running program of 10 years. Let’s also assume that a CAS takes no less than 1 cycle (in practice, it takes 10+ at least for x64, but we’re being conservative here). Then, the most CAS operations we can possibly make within one single program run is 1 CAS/cycle * 3e9 cycles/sec * 86400 sec/day * 365 days/year * 10 years ~= 1e18 CAS operations. And as 1e18 can be represented with a mere 60 bits, this means that

by using a 60-bit ABA counter, we’re protected from ABA even under extremely conservative assumptions.

NB: 40–48 bit counters will be more than enough for most practical purposes – but even a 60-bit counter is not too bad, especially as our whole allocation, as discussed above, is 128 bits (at least for x64).
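If you want the compiler to double-check the arithmetic above, a compile-time sanity check (using the same conservative constants as in the text) can look as follows:

constexpr unsigned long long max_cas_ops =   //1 CAS/cycle * 3e9 cycles/sec * 10 years
    3'000'000'000ULL * 86'400 * 365 * 10;    //~9.5e17 operations
static_assert(max_cas_ops < (1ULL << 60),
    "a 60-bit ABA counter cannot overflow within 10 years at 3GHz");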

Relaxing the requirement for ABAcounter modifications

As discussed above (with sufficient sizes of ABACounter) we can guarantee that no ABA problem occurs as long as we increment ABAcounter on each and every modification of our ReactorAData::data. However, there are cases when we can provide the same guarantees even when we skip incrementing on some of the modifications. More specifically, we can go along the following lines:

  • We divide fields within ReactorAData::data into two categories: (a) those fields ‘protected’ by ABAcounter, and (b) those fields ‘unprotected’ by ABAcounter
  • Then, we still increment ABAcounter on any modification to ‘protected’ fields, but are not required to increment ABAcounter on those modifications touching only ‘unprotected’ fields
  • Then, we’re still providing ‘no-ABA-problem’ guarantees as long as all our ‘unprotected’ fields have the property that the same value of those ‘unprotected’ fields is guaranteed to have the same semantic meaning.
    • For example, if we have a ‘number of current locks’ field within our ReactorAData::data – for most of the typical usage patterns, we don’t really care why this field got this value, but care only about its current value; this means that whatever we’re doing with this field, it is ABA-free even without the ABAcounter, so it can be left ‘unprotected’.

Conclusions and Ongoing Work

We have presented a hopefully novel way of building non-blocking multithreaded primitives and algorithms, based on ‘CAS (Re)Actors’ (essentially, (Re)Actors with a state fitting into one CAS block).

This approach is practically interesting because it provides an additional layer of abstraction, and – as a result – allows us to reason about multithreaded primitives/algorithms in terms which don’t involve multithreading (in particular, such issues as the semantics of CAS and the ABA problem are out of the picture completely). Instead, the reasoning can be done in terms of distributed systems (more specifically – in terms of Actors, Reactors, event-driven programs, or ad hoc finite state machines). This, in turn, is expected to enable composing more complicated primitives/algorithms than is currently possible. In particular, the author is currently working on an MWSR queue with locking-only-when-necessary and providing different means of flow control; when the work is completed he hopes to present that in Overload too. [[EDIT: such work was indeed published in Overload #143, see, for example, https://accu.org/index.php/journals/2467 ]]


References

[Henney17] Kevlin Henney, “Thinking Outside the Synchronisation Quadrant”, ACCU2017
[Kaiser17] Hartmut Kaiser, “The Asynchronous C++ Parallel Programming Model”, CPPCON2017
[Loganberry04] David ‘Loganberry’, “Frithaes! – an Introduction to Colloquial Lapine!”
[NoBugs10] ‘No Bugs’ Hare, “Single-Threading: Back to the Future?”, Overload #97, 2010
[NoBugs15] ‘No Bugs’ Hare, “Client-Side. On Debugging Distributed Systems, Deterministic Logic, and Finite State Machines”
[NoBugs17] ‘No Bugs’ Hare, “Development and Deployment of Multiplayer Online Games, Vol. II”
[Wikipedia.ABA] Wikipedia, “ABA problem.”

Disclaimer

as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.

