2022-08-13 22:43:18

by Ashok Raj

Subject: [PATCH 5/5] x86/microcode: Handle NMIs during microcode update.

Microcode updates need a guarantee that the thread sibling that is waiting
for the update to finish on the primary core will not execute any
instructions until the update is complete. This is required to guarantee
that no MSR access or instruction that is being patched gets executed
before the update is complete.

An NMI handler is registered before the stop_machine() rendezvous. If an
NMI arrives while the microcode update is still in progress, the secondary
thread will spin in the handler until the ucode update state is cleared.

A couple of choices were discussed:

1. Rendezvous inside the NMI handler, and also perform the update from
within the handler. This seemed too risky: the races that we would need
to solve might cause instability, which makes this a difficult choice.
2. Thomas (tglx) suggested that we could look into masking all the
LVT-originated NMIs, such as LINT1 and the perf counter LVT entries.
Since we are in the rendezvous loop, we don't need to worry about any
NMI IPIs generated by the OS.

The one source we have no control over is the ACPI mechanism of sending
notifications to the kernel for Firmware First Processing (FFM).
Apparently there is a PCH register that the BIOS, from SMI, writes to
generate such an interrupt (ACPI GHES).
3. This is a simpler option, and the one implemented here. The OS
registers an NMI handler and doesn't do any NMI rendezvous dance. If an
NMI happens, the handler checks whether the CPU is a thread sibling of a
core with an update in progress; only those CPUs spin in the handler. The
thread performing the wrmsr() will only take the NMI after the wrmsr 0x79
flow has completed.

Signed-off-by: Ashok Raj <[email protected]>
---
arch/x86/kernel/cpu/microcode/core.c | 88 +++++++++++++++++++++++++++-
1 file changed, 85 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index d24e1c754c27..ec10fa2db8b1 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -40,6 +40,7 @@
#include <asm/cmdline.h>
#include <asm/setup.h>
#include <asm/mce.h>
+#include <asm/nmi.h>

#define DRIVER_VERSION "2.2"

@@ -411,6 +412,10 @@ static int check_online_cpus(void)

static atomic_t late_cpus_in;
static atomic_t late_cpus_out;
+static atomic_t nmi_cpus;
+static atomic_t nmi_timeouts;
+
+static struct cpumask cpus_in_wait;

static int __wait_for_cpus(atomic_t *t, long long timeout)
{
@@ -433,6 +438,53 @@ static int __wait_for_cpus(atomic_t *t, long long timeout)
 	return 0;
}

+static int ucode_nmi_cb(unsigned int val, struct pt_regs *regs)
+{
+	int cpu = smp_processor_id();
+	int timeout = 100 * NSEC_PER_USEC;
+
+	atomic_inc(&nmi_cpus);
+	if (!cpumask_test_cpu(cpu, &cpus_in_wait))
+		return NMI_DONE;
+
+	/* Hold the CPU here until the primary clears it from the wait mask. */
+	while (cpumask_test_cpu(cpu, &cpus_in_wait)) {
+		if (timeout < SPINUNIT) {
+			atomic_inc(&nmi_timeouts);
+			break;
+		}
+		ndelay(SPINUNIT);
+		timeout -= SPINUNIT;
+		touch_nmi_watchdog();
+	}
+	return NMI_HANDLED;
+}
+
+static void set_nmi_cpus(struct cpumask *wait_mask)
+{
+	int first_cpu, wait_cpu, cpu = smp_processor_id();
+
+	first_cpu = cpumask_first(topology_sibling_cpumask(cpu));
+	for_each_cpu(wait_cpu, topology_sibling_cpumask(cpu)) {
+		if (wait_cpu == first_cpu)
+			continue;
+		cpumask_set_cpu(wait_cpu, wait_mask);
+	}
+}
+
+static void clear_nmi_cpus(struct cpumask *wait_mask)
+{
+	int first_cpu, wait_cpu, cpu = smp_processor_id();
+
+	first_cpu = cpumask_first(topology_sibling_cpumask(cpu));
+	for_each_cpu(wait_cpu, topology_sibling_cpumask(cpu)) {
+		if (wait_cpu == first_cpu)
+			continue;
+		cpumask_clear_cpu(wait_cpu, wait_mask);
+	}
+}
+
/*
* Returns:
* < 0 - on error
@@ -440,7 +492,7 @@ static int __wait_for_cpus(atomic_t *t, long long timeout)
*/
static int __reload_late(void *info)
{
-	int cpu = smp_processor_id();
+	int first_cpu, cpu = smp_processor_id();
 	enum ucode_state err;
 	int ret = 0;

@@ -459,6 +511,7 @@ static int __reload_late(void *info)
* the platform is taken to reset predictively.
*/
mce_set_mcip();
+
/*
* On an SMT system, it suffices to load the microcode on one sibling of
* the core because the microcode engine is shared between the threads.
@@ -466,9 +519,17 @@ static int __reload_late(void *info)
* loading attempts happen on multiple threads of an SMT core. See
* below.
*/
+	first_cpu = cpumask_first(topology_sibling_cpumask(cpu));
 
-	if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
+	/*
+	 * Set the CPUs that we should hold in NMI until the primary has
+	 * completed the microcode update.
+	 */
+	if (first_cpu == cpu) {
+		set_nmi_cpus(&cpus_in_wait);
 		apply_microcode_local(&err);
+		clear_nmi_cpus(&cpus_in_wait);
+	}
 	else
 		goto wait_for_siblings;

@@ -502,20 +563,41 @@ static int __reload_late(void *info)
*/
static int microcode_reload_late(void)
{
-	int ret;
+	int ret = 0;
 
 	pr_err("Attempting late microcode loading - it is dangerous and taints the kernel.\n");
 	pr_err("You should switch to early loading, if possible.\n");
 
 	atomic_set(&late_cpus_in, 0);
 	atomic_set(&late_cpus_out, 0);
+	atomic_set(&nmi_cpus, 0);
+	atomic_set(&nmi_timeouts, 0);
+	cpumask_clear(&cpus_in_wait);
+
+	ret = register_nmi_handler(NMI_LOCAL, ucode_nmi_cb, NMI_FLAG_FIRST,
+				   "ucode_nmi");
+	if (ret) {
+		pr_err("Unable to register NMI handler\n");
+		goto done;
+	}
 
 	ret = stop_machine_cpuslocked(__reload_late, NULL, cpu_online_mask);
 	if (ret == 0)
 		microcode_check();
 
+	unregister_nmi_handler(NMI_LOCAL, "ucode_nmi");
+
+	if (atomic_read(&nmi_cpus))
+		pr_info("%d CPUs entered NMI while microcode update in progress\n",
+			atomic_read(&nmi_cpus));
+
+	if (atomic_read(&nmi_timeouts))
+		pr_err("%d CPUs entered NMI and timed out waiting for their wait mask to be cleared\n",
+		       atomic_read(&nmi_timeouts));
+
 	pr_info("Reload completed, microcode revision: 0x%x\n", boot_cpu_data.microcode);
 
+done:
 	return ret;
 }

--
2.32.0


2022-08-14 00:41:02

by Andy Lutomirski

Subject: Re: [PATCH 5/5] x86/microcode: Handle NMIs during microcode update.



On Sat, Aug 13, 2022, at 3:38 PM, Ashok Raj wrote:
> Microcode updates need a guarantee that the thread sibling that is waiting
> for the update to finish on the primary core will not execute any
> instructions until the update is complete. This is required to guarantee
> that no MSR access or instruction that is being patched gets executed
> before the update is complete.
>
> An NMI handler is registered before the stop_machine() rendezvous. If an
> NMI arrives while the microcode update is still in progress, the secondary
> thread will spin in the handler until the ucode update state is cleared.
>
> A couple of choices were discussed:
>
> 1. Rendezvous inside the NMI handler, and also perform the update from
> within the handler. This seemed too risky: the races that we would need
> to solve might cause instability, which makes this a difficult choice.

I prefer choice 1. As I understand it, Xen has done this for a while to good effect.

If I were implementing this, I would rendezvous via stop_machine as usual. Then I would set a flag or install a handler indicating that we are doing a microcode update, send NMI-to-self, and rendezvous in the NMI handler and do the update.
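
Very roughly, and with every name below made up (an untested sketch, not a
patch; it assumes the handler was registered with register_nmi_handler()):

static atomic_t ucode_nmi_update;

/* NMI handler: every CPU lands here after the self-NMI below. */
static int ucode_update_in_nmi(unsigned int cmd, struct pt_regs *regs)
{
	enum ucode_state err;
	int cpu = smp_processor_id();

	if (!atomic_read(&ucode_nmi_update))
		return NMI_DONE;		/* unrelated NMI */

	/*
	 * A real version needs a sibling rendezvous with timeouts here;
	 * this only shows the shape of the flow.
	 */
	if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
		apply_microcode_local(&err);	/* one WRMSR 0x79 per core */

	return NMI_HANDLED;
}

/* stop_machine() callback: set the flag, then take the update in NMI. */
static int reload_late_via_nmi(void *info)
{
	atomic_set(&ucode_nmi_update, 1);
	apic->send_IPI_self(NMI_VECTOR);	/* NMI-to-self */
	return 0;
}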

> [...]

2022-08-14 01:48:09

by Andy Lutomirski

Subject: Re: [PATCH 5/5] x86/microcode: Handle NMIs during microcode update.

On Sat, Aug 13, 2022, at 5:13 PM, Andy Lutomirski wrote:
> On Sat, Aug 13, 2022, at 3:38 PM, Ashok Raj wrote:
>> [...]
>
> I prefer choice 1. As I understand it, Xen has done this for a while
> to good effect.
>
> If I were implementing this, I would rendezvous via stop_machine as
> usual. Then I would set a flag or install a handler indicating that we
> are doing a microcode update, send NMI-to-self, and rendezvous in the
> NMI handler and do the update.
>

So I thought about this a bit more.

What's the actual goal? Are we trying to execute no instructions at all on the sibling or are we trying to avoid executing nasty instructions like RDMSR and WRMSR?

If it's the former, we don't have a whole lot of choices. We need the sibling to be in HLT or MWAIT, and we need NMIs masked or we need all likely NMI sources shut down. If it's the latter, then we would ideally like to avoid a trip through the entry or exit code -- that code has nasty instructions (RDMSR in the paranoid path if !FSGSBASE, WRMSRs for mitigations, etc). And we need to be extra careful: there are cases where NMIs are not actually masked but we just simulate NMIs being masked through the delightful logic in the entry code. Off the top of my head, the NMI-entry-pretend-masked path probably doesn't execute nasty instructions other than IRET, but still, this might be fragile.

So here's my half-baked suggestion. Add a new feature to the NMI entry code: at this point:

/* Everything up to here is safe from nested NMIs */

At that point, NMIs are actually masked. So check a flag indicating that we're trying to do the NMI-protected ucode patch dance. If so, call a special noinstr C function (this part is distinctly nontrivial) that does the ucode work. Now the stop_machine handler does NMI-to-self in a loop until it actually hits the special code path.
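
In very rough, untested pseudo-C, with every name here invented:

/* Set by the stop_machine() handler before the self-NMIs start. */
static bool ucode_patch_pending;

/*
 * Called from the NMI entry code just after the point where nested
 * NMIs are known to be masked. Must be noinstr: nothing in here may
 * fault, trace, or touch nasty MSRs.
 */
noinstr void nmi_maybe_patch_ucode(void)
{
	if (!READ_ONCE(ucode_patch_pending))
		return;

	/*
	 * NMIs are genuinely masked at this point, so the WRMSR 0x79
	 * flow cannot be re-entered by another NMI.
	 */
	ucode_patch_in_nmi();	/* the hypothetical noinstr worker */
}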

Alternatively, stop_machine, then change the IDT to a special one with a special non-IST NMI handler that does the dirty work. NMI-to-self, do the ucode work, set the IDT back *inside the handler* so a latched NMI will do the right thing, and return. This may be much, much simpler.

Or we stop messing around and do this for real. Soft-offline the sibling, send it INIT, do the ucode patch, then SIPI, SIPI and resume the world. What could possibly go wrong?

I have to say: Intel, can you please get your act together? There is an entity in the system that is *actually* capable of doing this right: the microcode.

2022-08-14 03:02:00

by Ashok Raj

Subject: Re: [PATCH 5/5] x86/microcode: Handle NMIs during microcode update.

On Sat, Aug 13, 2022 at 05:13:13PM -0700, Andy Lutomirski wrote:
>
>
> On Sat, Aug 13, 2022, at 3:38 PM, Ashok Raj wrote:
> > [...]
>
> I prefer choice 1. As I understand it, Xen has done this for a while to good effect.
>
> If I were implementing this, I would rendezvous via stop_machine as usual. Then I would set a flag or install a handler indicating that we are doing a microcode update, send NMI-to-self, and rendezvous in the NMI handler and do the update.

Well, that is exactly what I did for the first attempt. The code looked so
beautiful in the eyes of the creator :-) but somehow I couldn't get it to
stop locking up. The new implementation seemed more efficient: we do
nothing until the secondary drops into the NMI handler, and then hold it
hostage right there.

I thought this was a slight improvement, in the sense that we don't take
the extra hit of sending and receiving interrupts.

In my first attempt the rendezvous wasn't only between the threads of a
core; everything looked great, but I just couldn't get it working in time.

Cheers,
Ashok

2022-08-14 03:27:53

by Ashok Raj

Subject: Re: [PATCH 5/5] x86/microcode: Handle NMIs during microcode update.

Hi Andy

On Sat, Aug 13, 2022 at 06:19:04PM -0700, Andy Lutomirski wrote:
> On Sat, Aug 13, 2022, at 5:13 PM, Andy Lutomirski wrote:
> > On Sat, Aug 13, 2022, at 3:38 PM, Ashok Raj wrote:
> >> [...]
> >
> > I prefer choice 1. As I understand it, Xen has done this for a while
> > to good effect.
> >
> > If I were implementing this, I would rendezvous via stop_machine as
> > usual. Then I would set a flag or install a handler indicating that we
> > are doing a microcode update, send NMI-to-self, and rendezvous in the
> > NMI handler and do the update.
> >
>
> So I thought about this a bit more.
>
> What's the actual goal? Are we trying to execute no instructions at all on the sibling or are we trying to avoid executing nasty instructions like RDMSR and WRMSR?

Basically, when the thread running wrmsr 0x79 asks for exclusive access to
the core, the sibling CPU is pulled into some ucode-internal context while
the primary thread updates the ucode, and is then released. But if the
sibling was in the middle of an instruction that is being patched, it may
resume execution in some non-existent context and weird things can happen.

I'm not sure if the interrupt entry code does any fancy stuff, in which
case dropping into the NMI handler early might be the better option.

I tried doing apic->send_IPI_all(), then just waiting for the two threads
of a core to rendezvous. Maybe a self-IPI would have been better. I'll do
some more experiments with what I sent in this patchset.
>
> If it's the former, we don't have a whole lot of choices. We need the sibling to be in HLT or MWAIT, and we need NMIs masked or we need all likely NMI sources shut down. [...]

Remember, mwait() itself was patched once, and that caused pain. I remember
Thomas mentioned this a while back.

>
> Or we stop messing around and do this for real. Soft-offline the sibling, send it INIT, do the ucode patch, then SIPI, SIPI and resume the world. What could possibly go wrong?


All of these are looking more complicated, and might add significant
latency. We might as well invoke an SMI and let the BIOS SMM handler do
the update :-)

>
> I have to say: Intel, can you please get your act together? There is an entity in the system that is *actually* capable of doing this right: the microcode.

So we have it.. this dirtiness is for all the current-gen products. Much
improved microcode loading is on the way.

2022-08-14 15:09:01

by Ashok Raj

Subject: Re: [PATCH 5/5] x86/microcode: Handle NMIs during microcode update.

Hi Andrew,

On Sun, Aug 14, 2022 at 11:58:17AM +0000, Andrew Cooper wrote:
> >> If I were implementing this, I would rendezvous via stop_machine as usual. Then I would set a flag or install a handler indicating that we are doing a microcode update, send NMI-to-self, and rendezvous in the NMI handler and do the update.
> > Well, that is exactly what I did for the first attempt. The code looked so
> > beautiful in the eyes of the creator :-) but somehow I couldn't get it to
> > stop locking up.
>
> So the way we do this in Xen is to rendezvous in stop_machine, then have
> only the siblings self-NMI. The primary threads don't need to be in NMI
> context, because the WRMSR to trigger the update *is* atomic with NMIs.
>
> However, you do need to make sure that the NMI wait loop knows not to
> wait for primary threads, otherwise you can deadlock when taking an NMI
> on a primary thread between setting up the NMI handler and actually
> issuing the update.
>

I'm almost sure that was the deadlock I ran into. You are correct, the
primary thread doesn't need to be in NMI, since once the wrmsr starts it
can't be interrupted.

But the primary needs to wait until its own siblings have dropped into NMI
before proceeding to perform the wrmsr.

In the stop_machine() handler, the primary thread waits for its thread
siblings to enter NMI and report in; the siblings simply self-IPI an NMI
and then wait for the exit rendezvous.

The primary then does the wrmsr flow and clears the wait_cpus mask so that
each secondary inside the NMI handler can release itself.

Finally, everyone resyncs at the exit rendezvous.
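
Roughly, as an untested sketch (names made up, one core shown; the real
version needs per-core state and timeouts):

static atomic_t siblings_in_nmi;
static bool ucode_update_done;

/* Secondary: self-NMIs into here, reports in, then waits for release. */
static int ucode_wait_in_nmi(unsigned int cmd, struct pt_regs *regs)
{
	atomic_inc(&siblings_in_nmi);
	while (!READ_ONCE(ucode_update_done))
		cpu_relax();		/* a real version needs a timeout */
	return NMI_HANDLED;
}

/* Primary: wait for all siblings to check in, update, then release them. */
static void ucode_update_primary(void)
{
	enum ucode_state err;
	int nr_sibs = cpumask_weight(topology_sibling_cpumask(smp_processor_id())) - 1;

	while (atomic_read(&siblings_in_nmi) < nr_sibs)
		cpu_relax();		/* siblings dropping into NMI */

	apply_microcode_local(&err);		/* the wrmsr 0x79 flow */
	WRITE_ONCE(ucode_update_done, true);	/* release the siblings */
}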

I have this coded, will test and repost.

Cheers,
Ashok