On 5/13/11 9:43 AM, Andriy Gapon wrote:
> This is a change in the vein of what I've been doing in the xcpu branch, and
> it's supposed to fix the issue exposed by the recent commit that (probably
> unintentionally) stress-tests smp_rendezvous in the TSC code.
>
> Non-essential changes:
> - ditch initial, and in my opinion useless, pre-setup rendezvous in

As long as IPIs ensure all data is up to date (I think this is certainly
true on x86), that is fine. Presumably sending an IPI has an implicit
store barrier on all other platforms as well?
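
To illustrate the assumption (my sketch of the ordering argument, not
code from the tree): the master only needs its stores to the smp_rv_*
fields to be globally visible before a slave enters the handler, which
a release store plus the IPI delivery provides on x86:

	/*
	 * Sketch: publish the rendezvous parameters, then send the
	 * IPI.  The release store (and, on x86, the IPI delivery
	 * itself) orders the stores, so no pre-setup rendezvous is
	 * needed for the slaves to see initialized data.
	 */
	smp_rv_action_func = action_func;
	smp_rv_func_arg = arg;
	atomic_store_rel_int(&smp_rv_waiters, 0);
	ipi_all_but_self(IPI_RENDEZVOUS);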

> Essential changes (the fix):
> - re-use freed smp_rv_waiters to indicate that a slave/target is really fully
> done with rendezvous (i.e. it's not going to access any members of smp_rv_*)
> - spin on smp_rv_waiters upon _entry_ to smp_rendezvous_cpus() to not re-use
> the smp_rv_* pseudo-structure too early

Hmmm, so this is not actually sufficient. NetApp ran into a very
similar race with virtual CPUs in BHyVe. In their case, because virtual
CPUs are threads that can be preempted, the race window is much longer.
The problem that they see is that even though the values have been
updated, the next CPU to start a rendezvous can clear smp_rv_waiters
to zero before one of the other CPUs notices that it has finished.
A more concrete example:
Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator.
All three CPUs run the rendezvous down to the point of increasing
smp_rv_waiters. CPU 1 is the last one to rendezvous for some reason,
so it is the first to notice that smp_rv_waiters is complete and exits
the rendezvous. It immediately starts a new rendezvous. Even with your
change, it will see that smp_rv_waiters is correct and will proceed
to clear it before starting the next rendezvous.
Meanwhile one of the other CPUs never sees the final update by CPU 1 to
smp_rv_waiters, instead it sees the value go from 2 to 0 (e.g. if CPU
1's two writes were coalesced, or in the case of the hypervisor, CPU 2
or 3's backing thread is preempted and both writes have posted before
the thread backing CPU 2 or 3 gets to run again).
At this point that CPU (call it CPU 2) will spin forever. When CPU 1
sends a second rendezvous IPI it will be held in the local APIC of CPU 2
because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump
smp_rv_waiters and the entire system will deadlock.
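
To make the window concrete, here is the exit path in simplified form
(condensed from subr_smp.c, not the exact code), annotated with the
interleaving above:

	atomic_add_int(&smp_rv_waiters, 1);	/* CPU 2: 1 -> 2 */
	/*
	 * CPU 1 increments 2 -> 3, sees the full count, exits, starts
	 * a new rendezvous, and resets smp_rv_waiters to 0.  If CPU 2
	 * never observes the intermediate value 3, the test below can
	 * never become false.
	 */
	while (smp_rv_waiters < smp_rv_ncpus)
		cpu_spinwait();		/* CPU 2 spins here forever */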
NetApp's solution is to add a monotonically increasing generation count
to the rendezvous data set, which is cached in the rendezvous handler,
and to exit the last synchronization point if either smp_rv_waiters
is high enough or the generation count has changed (see the sketch
after this paragraph).
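
In other words (my paraphrase; their actual patch is below), the exit
spin becomes:

	generation = atomic_load_acq_int(&smp_rv_generation);
	atomic_add_int(&smp_rv_waiters, 1);
	/*
	 * A generation bump means a new rendezvous has started, which
	 * implies everyone left the old one, so stop waiting.
	 */
	while (smp_rv_waiters < smp_rv_ncpus &&
	    generation == smp_rv_generation)
		cpu_spinwait();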
I think this would also handle the case the TSC changes have provoked.
I'm not sure if this would be sufficient for the error case Max Laier
described. Here is the description from NetApp's patch:

Extra protection for consecutive smp_rendezvous() calls.
We need this because interrupts are not really disabled when
executing the smp_rendezvous_action() when running inside a
virtual machine.

In particular it is possible for the last cpu to increment
smp_rv_waiters so that the exit rendezvous condition is
satisfied and then get interrupted by a hardware interrupt.
When the execution of the interrupted vcpu continues it is
possible that one of the other vcpus that did *not* get
interrupted exited the old smp_rendezvous() and started a
new smp_rendezvous(). In this case 'smp_rv_waiters' is
again reset to 0. This would mean that the vcpu that got
interrupted will spin forever on the exit rendezvous.

We protect against this by spinning on the exit rendezvous only while
the generation of the smp_rendezvous() matches what we started out
with before incrementing 'smp_rv_waiters'.

==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ====
@@ -127,6 +127,7 @@
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void * volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters;
+static volatile int smp_rv_generation;
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -418,6 +419,7 @@
+	int generation;
 	void* local_func_arg = smp_rv_func_arg;
 	void (*local_setup_func)(void*) = smp_rv_setup_func;
 	void (*local_action_func)(void*) = smp_rv_action_func;
@@ -457,11 +459,14 @@
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
+	generation = atomic_load_acq_int(&smp_rv_generation);
+
 	/* spin on exit rendezvous */
 	atomic_add_int(&smp_rv_waiters, 1);
 	if (local_teardown_func == smp_no_rendevous_barrier)
 		return;
-	while (smp_rv_waiters < smp_rv_ncpus)
+	while (smp_rv_waiters < smp_rv_ncpus &&
+	    generation == smp_rv_generation)
 		cpu_spinwait();
@@ -565,6 +570,8 @@
 	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
+	atomic_add_acq_int(&smp_rv_generation, 1);
+
 	/* set static function pointers */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
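
For what it's worth, here is a tiny userland program (my own
illustration, nothing from the tree; build with 'cc -pthread') showing
why the generation check releases a waiter that missed the final count
update:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int rv_waiters;
static atomic_int rv_generation;
#define	RV_NCPUS	3

static void *
stale_waiter(void *arg)
{
	int gen = atomic_load(&rv_generation);

	(void)arg;
	/*
	 * Simulate the losing CPU: the next rendezvous has already
	 * reset rv_waiters to 0, so the count alone would spin forever.
	 */
	while (atomic_load(&rv_waiters) < RV_NCPUS &&
	    gen == atomic_load(&rv_generation))
		;	/* spin */
	printf("stale waiter released by the generation bump\n");
	return (NULL);
}

int
main(void)
{
	pthread_t t;

	atomic_store(&rv_waiters, 0);		/* the "lost" reset */
	pthread_create(&t, NULL, stale_waiter, NULL);
	sleep(1);
	atomic_fetch_add(&rv_generation, 1);	/* next rendezvous begins */
	pthread_join(t, NULL);
	return (0);
}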