2003-08-06 02:26:13

by john stultz

[permalink] [raw]
Subject: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

All,
I've been noticing occasional (and with some kernels, not so
ocassional) hangs after the mtrr init call. Booting with
"initcall_debug" pointed the finger at init_bio(), but further
investigation showed that cpus were actually getting hung in the mtrr
ipi_handler, thus they could not respond to the smp_call_function made
deeper within init_bio().

I've found a race in the mtrr ipi_handler() and set_mtrr() functions.

Basically the server, set_mtrr() does the following:

1.1 initialize a flag
1.2 send ipi
1.3 waits for all cpus to check in
1.4 toggle flag
1.5 do stuff
1.6 wait for all cpus to check out
1.7 toggle flag again
1.8 return

While the clients, running ipi_handler() do the following:

2.1 check in
2.2 wait for flag
2.3 do stuff
2.4 check out
2.5 wait for flag
2.6 return

The problem is the flag is on the servers stack! So if 1.7 and 1.8
executes before 2.5 happens, the client's pointer to the flag becomes
invalid (likely overwritten) which causes the client to never see the
flag change, hanging the box.

I believe the solution to this is to just remove the needless
synchronization steps: 1.7 and 2.5. Once all the clients have checked
out, there is no reason to have them hang around only to return all at
once. The final flagging seems needless and racy.

Amazingly this has been in the kernel for about a year, yet its not
bothered me until very recently. Go figure.

Please review and comment on the following fix. Please let me know if
I'm just wrong and the final flagging is more needed then I think.

thanks
-john

diff -Nru a/arch/i386/kernel/cpu/mtrr/main.c b/arch/i386/kernel/cpu/mtrr/main.c
--- a/arch/i386/kernel/cpu/mtrr/main.c Tue Aug 5 18:45:17 2003
+++ b/arch/i386/kernel/cpu/mtrr/main.c Tue Aug 5 18:45:17 2003
@@ -165,10 +165,6 @@
mtrr_if->set_all();

atomic_dec(&data->count);
- while(atomic_read(&data->gate)) {
- cpu_relax();
- barrier();
- }
local_irq_restore(flags);
}

@@ -257,7 +253,6 @@
barrier();
}
local_irq_restore(flags);
- atomic_set(&data.gate,0);
}

/**




2003-08-06 08:53:07

by Mathias Fröhlich

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0


Hi,

You should not remove the barrier past mtrr change. If you do that, it is
possible that cpu's run with inconsistent mtrrs. This can have bad
sideeffects since at least the cache snooping protocol used by intel uses
assumptions about the cachability of memory regions. Those information about
the cachability is also taken from the mtrrs as far as I remember.
This intel cpu developer manual, which documented the early PII and PPro
chips, recommended this algorithm. Since actual intel cpus use the same old
cpu to chipset bus protocol, this old documentation most propably still
applies.

So the conclusion is that as far as you don't know the exact way all those SMP
protocols between chipsets and CPUs with all the possible sideeffects very
well, dont't change this behavour.

But I'm shure that fixes to the stack allocated variable problem are welcome
:)

Greetings

Mathias Fröhlich

--
Mathias Fröhlich, email: [email protected]
old email was: [email protected]

2003-08-06 17:04:06

by Mark Haverkamp

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Tue, 2003-08-05 at 19:17, john stultz wrote:
> All,


>
> Amazingly this has been in the kernel for about a year, yet its not
> bothered me until very recently. Go figure.

I started seeing this on an 8-way after the mtrr_init function was
changed from core_initcall to subsys_initcall.

>
> Please review and comment on the following fix. Please let me know if
> I'm just wrong and the final flagging is more needed then I think.

I applied this patch and so far I have been able to boot every time
without a hang on my 8-way.

Mark.


>
> thanks
> -john

--
Mark Haverkamp <[email protected]>

2003-08-06 17:25:11

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 2003-08-06 at 01:52, Mathias Fr?hlich wrote:
> Hi,
>
> You should not remove the barrier past mtrr change. If you do that, it is
> possible that cpu's run with inconsistent mtrrs. This can have bad
> sideeffects since at least the cache snooping protocol used by intel uses
> assumptions about the cachability of memory regions. Those information about
> the cachability is also taken from the mtrrs as far as I remember.
> This intel cpu developer manual, which documented the early PII and PPro
> chips, recommended this algorithm. Since actual intel cpus use the same old
> cpu to chipset bus protocol, this old documentation most propably still
> applies.

Hmm. I should dig up that doc. Its a little hazy in my mind, but I think
I understand your description. I'm glad you caught this, as I can't
imagine the subtle bugs that might have popped up.

Well, let me look at it again and see if I can come up with a proper
fix.

Thanks for the knowledgeable feedback and sanity checking!
-john

2003-08-06 18:11:33

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 2003-08-06 at 11:05, Mark Haverkamp wrote:
> On Wed, 2003-08-06 at 10:15, john stultz wrote:
> >
> > Well, let me look at it again and see if I can come up with a proper
> > fix.
> I added an extra sync up from the caller after the last gate change so
> it is the last one to touch the automatic data.

Ah, you beat me to it!

I'm actually testing the very same change (comments differ a touch, but
that's ok).

Looks good. If everyone is happy I'd say resubmit it to Andrew.

thanks
-john



2003-08-06 18:05:33

by Mark Haverkamp

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 2003-08-06 at 10:15, john stultz wrote:
> On Wed, 2003-08-06 at 01:52, Mathias Fr?hlich wrote:
> > Hi,
> >
> > You should not remove the barrier past mtrr change. If you do that, it is
> > possible that cpu's run with inconsistent mtrrs. This can have bad
> > sideeffects since at least the cache snooping protocol used by intel uses
> > assumptions about the cachability of memory regions. Those information about
> > the cachability is also taken from the mtrrs as far as I remember.
> > This intel cpu developer manual, which documented the early PII and PPro
> > chips, recommended this algorithm. Since actual intel cpus use the same old
> > cpu to chipset bus protocol, this old documentation most propably still
> > applies.
>
> Hmm. I should dig up that doc. Its a little hazy in my mind, but I think
> I understand your description. I'm glad you caught this, as I can't
> imagine the subtle bugs that might have popped up.
>
> Well, let me look at it again and see if I can come up with a proper
> fix.
>
> Thanks for the knowledgeable feedback and sanity checking!
> -john


I added an extra sync up from the caller after the last gate change so
it is the last one to touch the automatic data.

Mark.


===== arch/i386/kernel/cpu/mtrr/main.c 1.29 vs edited =====
--- 1.29/arch/i386/kernel/cpu/mtrr/main.c Tue Jul 15 10:08:48 2003
+++ edited/arch/i386/kernel/cpu/mtrr/main.c Wed Aug 6 10:46:00 2003
@@ -169,6 +169,7 @@
cpu_relax();
barrier();
}
+ atomic_dec(&data->count);
local_irq_restore(flags);
}

@@ -256,8 +257,18 @@
cpu_relax();
barrier();
}
- local_irq_restore(flags);
+ atomic_set(&data.count, num_booting_cpus() - 1);
atomic_set(&data.gate,0);
+
+ /*
+ * Wait here for everyone to have seen the gate change
+ * So we're the last ones to touch 'data'
+ */
+ while(atomic_read(&data.count)) {
+ cpu_relax();
+ barrier();
+ }
+ local_irq_restore(flags);
}

/**


--
Mark Haverkamp <[email protected]>

2003-08-06 18:43:42

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 6 Aug 2003, john stultz wrote:

> On Wed, 2003-08-06 at 01:52, Mathias Fr?hlich wrote:

> > You should not remove the barrier past mtrr change. If you do that, it is
> > possible that cpu's run with inconsistent mtrrs. This can have bad
> > sideeffects since at least the cache snooping protocol used by intel uses
> > assumptions about the cachability of memory regions. Those information about
> > the cachability is also taken from the mtrrs as far as I remember.
> > This intel cpu developer manual, which documented the early PII and PPro
> > chips, recommended this algorithm. Since actual intel cpus use the same old
> > cpu to chipset bus protocol, this old documentation most propably still
> > applies.
>
> Hmm. I should dig up that doc. Its a little hazy in my mind, but I think
> I understand your description. I'm glad you caught this, as I can't
> imagine the subtle bugs that might have popped up.
>
> Well, let me look at it again and see if I can come up with a proper
> fix.

The intel manual (10-36 Memory Cache Control - vol3) actually recommends
the following procedure I was a bit anal about explicitely setting and
clearing flags and used the specific TLB flush via cr3->reg->cr3. John
could you give this a spin on your afflicted systems?

Index: linux-2.5/arch/i386/kernel/cpu/mtrr/generic.c
===================================================================
RCS file: /home/cvs/linux-2.5/arch/i386/kernel/cpu/mtrr/generic.c,v
retrieving revision 1.8
diff -u -p -B -r1.8 generic.c
--- linux-2.5/arch/i386/kernel/cpu/mtrr/generic.c 15 Feb 2003 20:30:00 -0000 1.8
+++ linux-2.5/arch/i386/kernel/cpu/mtrr/generic.c 6 Aug 2003 17:05:59 -0000
@@ -8,6 +8,7 @@
#include <asm/msr.h>
#include <asm/system.h>
#include <asm/cpufeature.h>
+#include <asm/tlbflush.h>
#include "mtrr.h"

struct mtrr_state {
@@ -241,19 +242,22 @@ static void prepare_set(void)
more invasive changes to the way the kernel boots */
spin_lock(&set_atomicity_lock);

+ /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
+ cr0 = read_cr0() | 0x40000000; /* set CD flag */
+ cr0 &= ~(0x20000000); /* clear NW flag */
+ wbinvd();
+ write_cr0(cr0);
+ wbinvd();
+
/* Save value of CR4 and clear Page Global Enable (bit 7) */
if ( cpu_has_pge ) {
cr4 = read_cr4();
write_cr4(cr4 & (unsigned char) ~(1 << 7));
}

- /* Disable and flush caches. Note that wbinvd flushes the TLBs as
- a side-effect */
- cr0 = read_cr0() | 0x40000000;
- wbinvd();
- write_cr0(cr0);
- wbinvd();
-
+ /* Flush all TLBs via a mov %cr3, %reg; mov %reg, %cr3 */
+ __flush_tlb();
+
/* Save MTRR state */
rdmsr(MTRRdefType_MSR, deftype_lo, deftype_hi);

@@ -263,14 +267,18 @@ static void prepare_set(void)

static void post_set(void)
{
+ unsigned long cr0;
+
/* Flush caches and TLBs */
wbinvd();
+ __flush_tlb();

/* Intel (P6) standard MTRRs */
wrmsr(MTRRdefType_MSR, deftype_lo, deftype_hi);

- /* Enable caches */
- write_cr0(read_cr0() & 0xbfffffff);
+ /* Enable caches, clear previously set CD flag */
+ cr0 = read_cr0() & ~(0x40000000);
+ write_cr0(cr0);

/* Restore value of CR4 */
if ( cpu_has_pge )

2003-08-06 20:34:09

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 2003-08-06 at 11:31, Zwane Mwaikambo wrote:
> The intel manual (10-36 Memory Cache Control - vol3) actually recommends
> the following procedure I was a bit anal about explicitely setting and
> clearing flags and used the specific TLB flush via cr3->reg->cr3. John
> could you give this a spin on your afflicted systems?

I'm not quite sure I follow. Really I've never looked at the mtrr code
before, so forgive my daftness. I just found a deadlock that was
locking my box caused by the synchronization between ipi_handler() and
set_mtrr() (which Mark's patch seems to properly fix).

How does this change affect the deadlock? Is this just a separate issue?
thanks
-john


2003-08-07 05:17:57

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [RFC][PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 6 Aug 2003, john stultz wrote:

> On Wed, 2003-08-06 at 11:31, Zwane Mwaikambo wrote:
> > The intel manual (10-36 Memory Cache Control - vol3) actually recommends
> > the following procedure I was a bit anal about explicitely setting and
> > clearing flags and used the specific TLB flush via cr3->reg->cr3. John
> > could you give this a spin on your afflicted systems?
>
> I'm not quite sure I follow. Really I've never looked at the mtrr code
> before, so forgive my daftness. I just found a deadlock that was
> locking my box caused by the synchronization between ipi_handler() and
> set_mtrr() (which Mark's patch seems to properly fix).
>
> How does this change affect the deadlock? Is this just a separate issue?
> thanks

Mostly seperate issue, but seeing as we where there, there is a slight
deviation from recommended procedure. Sorry about that, you may ignore it.

Thanks
Zwane
--
function.linuxpower.ca

2003-08-07 14:39:07

by Mark Haverkamp

[permalink] [raw]
Subject: [PATCH] linux-2.6.0-test2_mtrr-race-fix_A0

On Wed, 2003-08-06 at 11:02, john stultz wrote:
> On Wed, 2003-08-06 at 11:05, Mark Haverkamp wrote:
> > On Wed, 2003-08-06 at 10:15, john stultz wrote:
> > >
> > > Well, let me look at it again and see if I can come up with a proper
> > > fix.
> > I added an extra sync up from the caller after the last gate change so
> > it is the last one to touch the automatic data.
>
> Ah, you beat me to it!
>
> I'm actually testing the very same change (comments differ a touch, but
> that's ok).
>
> Looks good. If everyone is happy I'd say resubmit it to Andrew.
>
> thanks
> -john

I'd like to submit this patch for inclusion. It adds an extra sync up
in set_mtrr and ipi_handler to make sure that set_mtrr is the last to
touch the automatic set_mtrr_data.


===== arch/i386/kernel/cpu/mtrr/main.c 1.29 vs edited =====
--- 1.29/arch/i386/kernel/cpu/mtrr/main.c Tue Jul 15 10:08:48 2003
+++ edited/arch/i386/kernel/cpu/mtrr/main.c Wed Aug 6 10:46:00 2003
@@ -169,6 +169,7 @@
cpu_relax();
barrier();
}
+ atomic_dec(&data->count);
local_irq_restore(flags);
}

@@ -256,8 +257,18 @@
cpu_relax();
barrier();
}
- local_irq_restore(flags);
+ atomic_set(&data.count, num_booting_cpus() - 1);
atomic_set(&data.gate,0);
+
+ /*
+ * Wait here for everyone to have seen the gate change
+ * So we're the last ones to touch 'data'
+ */
+ while(atomic_read(&data.count)) {
+ cpu_relax();
+ barrier();
+ }
+ local_irq_restore(flags);
}

/**


--
Mark Haverkamp <[email protected]>