2006-02-02 01:19:32

by NeilBrown

Subject: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)


I've been testing md/raid in 2.6.16-rc1-mm4 on a dual Xeon with most
of the md personalities compiled as modules, and weird stuff is
happening.

In particular I'm getting lots of

BUG: atomic counter underflow at:

reports in raid10 and raid5, which are modules.

I reverted to 2.6.16-rc1-mm2, which still has that BUG check, but
doesn't muck about with the LOCK prefix, and the "atomic" problems go
away (leaving me to look into the other problems of my own making:-).

My guess is that there is something wrong with the 'alternative'
stuff which strips out the lock prefix, but I couldn't see anything
obviously wrong. The CPUs don't have FEATURE_UP (see below) so it
cannot possibly be removing the 'lock' prefix... but it certainly acts
like it is.
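
For illustration only, the suspected effect can be simulated in user
space. The sketch below assumes the patching code has a list of offsets
at which LOCK prefixes sit and overwrites each one with a NOP when it
believes the kernel is running on a uniprocessor. The function name
strip_lock_prefixes, the offsets table and the feature_up argument are
hypothetical and are not taken from the -mm4 source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX_BYTE 0xf0 /* x86 LOCK prefix */
#define NOP_BYTE         0x90 /* x86 one-byte NOP */

/*
 * Hypothetical helper: if feature_up is non-zero, overwrite every
 * recorded LOCK prefix in 'text' with a NOP. Returns the number of
 * bytes patched. With feature_up == 0 the text is left untouched.
 */
static size_t strip_lock_prefixes(uint8_t *text,
                                  const size_t *lock_offsets,
                                  size_t n_offsets,
                                  int feature_up)
{
    size_t patched = 0;

    assert(text != NULL && lock_offsets != NULL);

    if (!feature_up)
        return 0;

    for (size_t i = 0; i < n_offsets; i++) {
        /* Only touch bytes that really are LOCK prefixes. */
        if (text[lock_offsets[i]] == LOCK_PREFIX_BYTE) {
            text[lock_offsets[i]] = NOP_BYTE;
            patched++;
        }
    }
    return patched;
}
```

Feeding it the bytes f0 ff 08 (lock decl (%eax)) with offset 0 turns the
instruction into a NOP followed by a plain decl, which is no longer
atomic across CPUs. That would be consistent with the underflow reports
if the UP feature were wrongly considered set on an SMP box.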

Help?

NeilBrown



processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 4
cpu MHz : 3192.524
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 6389.26


2006-02-02 01:30:58

by J.A. Magallon

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

On Thu, 2 Feb 2006 12:19:22 +1100, Neil Brown <[email protected]> wrote:

>
> I've been testing md/raid in 2.6.16-rc1-mm4 on a dual Xeon with most
> of the md personalities compiled as modules, and weird stuff is
> happening.
>
> In particular I'm getting lots of
>
> BUG: atomic counter underflow at:
>
> reports in raid10 and raid5, which are modules.
>
>

I also run this kernel (plus a couple of patches) on a SATA raid5 setup, and
had no problems. People put and get files via SMB/AFP, mainly.

My box is dual PIII@933.

--
J.A. Magallon <jamagallon()able!es> \ Software is like sex:
werewolf!able!es \ It's better when it's free
Mandriva Linux release 2006.1 (Cooker) for i586
Linux 2.6.15-jam7 (gcc 4.0.2 (4.0.2-1mdk for Mandriva Linux release 2006.1))


2006-02-02 01:50:29

by NeilBrown

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

On Thursday February 2, [email protected] wrote:
> On Thu, 2 Feb 2006 12:19:22 +1100, Neil Brown <[email protected]> wrote:
>
> >
> > I've been testing md/raid in 2.6.16-rc1-mm4 on a dual Xeon with most
> > of the md personalities compiled as modules, and weird stuff is
> > happening.
> >
> > In particular I'm getting lots of
> >
> > BUG: atomic counter underflow at:
> >
> > reports in raid10 and raid5, which are modules.
> >
> >
>
> I also run this kernel (plus a couple of patches) on a SATA raid5 setup, and
> had no problems. People put and get files via SMB/AFP, mainly.
>
> My box is dual PIII@933.

Is 'raid5' a module, or is it compiled in?

NeilBrown

2006-02-02 08:10:21

by J.A. Magallon

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

On Thu, 2 Feb 2006 12:50:19 +1100, Neil Brown <[email protected]> wrote:

> On Thursday February 2, [email protected] wrote:
> > On Thu, 2 Feb 2006 12:19:22 +1100, Neil Brown <[email protected]> wrote:
> >
> > >
> > > I've been testing md/raid in 2.6.16-rc1-mm4 on a dual Xeon with most
> > > of the md personalities compiled as modules, and weird stuff is
> > > happening.
> > >
> > > In particular I'm getting lots of
> > >
> > > BUG: atomic counter underflow at:
> > >
> > > reports in raid10 and raid5, which are modules.
> > >
> > >
> >
> > I also run this kernel (plus a couple of patches) on a SATA raid5 setup, and
> > had no problems. People put and get files via SMB/AFP, mainly.
> >
> > My box is dual PIII@933.
>
> Is 'raid5' a module, or is it compiled in?
>

nada:/usr/src/linux# grep _MD_ .config
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID5=y
CONFIG_MD_RAID6=y
CONFIG_MD_MULTIPATH=m
# CONFIG_MD_FAULTY is not set

nada:/usr/src/linux# lsmod
Module Size Used by
w83627hf 22512 0
hwmon_vid 2016 1 w83627hf
i2c_isa 3392 1 w83627hf
i2c_viapro 7188 0
i2c_core 16640 3 w83627hf,i2c_isa,i2c_viapro
snd_ens1371 18656 0
snd_rawmidi 17312 1 snd_ens1371
snd_ac97_codec 87744 1 snd_ens1371
snd_ac97_bus 1760 1 snd_ac97_codec
snd_pcm 72772 2 snd_ens1371,snd_ac97_codec
snd_timer 18724 1 snd_pcm
snd_page_alloc 7784 1 snd_pcm
snd 38104 5 snd_ens1371,snd_rawmidi,snd_ac97_codec,snd_pcm,snd_timer
e100 31620 0
ide_cd 35264 0
loop 11816 0
via82cxxx 8132 0 [permanent]
ide_core 109844 2 ide_cd,via82cxxx
via_agp 7584 1
agpgart 25352 1 via_agp
microcode 5592 0
sata_promise 8868 7
libata 63956 1 sata_promise
uhci_hcd 20044 0
sg 20984 0
st 34880 0
sr_mod 14020 0
cdrom 34320 2 ide_cd,sr_mod

nada:/usr/src/linux# grep BUG /var/log/syslog
nada:/usr/src/linux# grep BUG /var/log/messages
nada:/usr/src/linux#

;)

--
J.A. Magallon <jamagallon()able!es> \ Software is like sex:
werewolf!able!es \ It's better when it's free
Mandriva Linux release 2006.1 (Cooker) for i586
Linux 2.6.15-jam7 (gcc 4.0.2 (4.0.2-1mdk for Mandriva Linux release 2006.1))



2006-02-02 08:23:19

by NeilBrown

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

On Thursday February 2, [email protected] wrote:
> On Thu, 2 Feb 2006 12:50:19 +1100, Neil Brown <[email protected]> wrote:
>
> > On Thursday February 2, [email protected] wrote:
> > > On Thu, 2 Feb 2006 12:19:22 +1100, Neil Brown <[email protected]> wrote:
> > >
> > > >
> > > > I've been testing md/raid in 2.6.16-rc1-mm4 on a dual Xeon with most
> > > > of the md personalities compiled as modules, and weird stuff is
> > > > happening.
> > > >
> > > > In particular I'm getting lots of
> > > >
> > > > BUG: atomic counter underflow at:
> > > >
> > > > reports in raid10 and raid5, which are modules.
> > > >
> > > >
> > >
> > > I also run this kernel (plus a couple of patches) on a SATA raid5 setup, and
> > > had no problems. People put and get files via SMB/AFP, mainly.
> > >
> > > My box is dual PIII@933.
> >
> > Is 'raid5' a module, or is it compiled in?
> >
>
> nada:/usr/src/linux# grep _MD_ .config
> CONFIG_MD_LINEAR=y
> CONFIG_MD_RAID0=y
> CONFIG_MD_RAID1=y
> CONFIG_MD_RAID10=y
> CONFIG_MD_RAID5=y

I see it's not a module.
That points pretty strongly to the module loading code then I guess.

Thanks,
NeilBrown

2006-02-02 18:14:43

by Chuck Ebbert

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

In-Reply-To: <[email protected]>

On Thu, 2 Feb 2006 at 12:19:22 +1100, Neil Brown wrote:

> My guess is that there is something wrong with the 'alternative'
> stuff which strips out the lock prefix, but I couldn't see anything
> obviously wrong. The CPUs don't have FEATURE_UP (see below) so it
> cannot possibly be removing the 'lock' prefix... but it certainly acts
> like it is.

Look closer:

> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm
> constant_tsc pni monitor ds_cpl cid xtpr
  ^^^^^^^^^^^^

SMP alternatives is re-using the constant_tsc X86 feature bit.
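
To make the collision concrete, here is a minimal user-space sketch of
how a feature number such as (3*32+ 8) selects a word and a bit in the
capability array. The helpers set_feature() and has_feature(), the
_BROKEN and _FIXED suffixes, and the NCAPINTS value are assumptions made
for this example; they are not the kernel's own code.

```c
#include <assert.h>
#include <stdint.h>

#define NCAPINTS 7 /* number of 32-bit capability words (assumed here) */

/* Feature numbers are word*32 + bit, as in the header quoted in this thread. */
#define X86_FEATURE_CONSTANT_TSC (3*32 + 8)
#define X86_FEATURE_UP_BROKEN    (3*32 + 8) /* same bit as CONSTANT_TSC */
#define X86_FEATURE_UP_FIXED     (3*32 + 9) /* its own bit */

/* Hypothetical stand-in for setting a bit in the per-CPU capability words. */
static void set_feature(uint32_t *caps, int feature)
{
    assert(feature >= 0 && feature < NCAPINTS * 32);
    caps[feature / 32] |= UINT32_C(1) << (feature % 32);
}

/* Hypothetical stand-in for testing a bit; returns 1 if set, else 0. */
static int has_feature(const uint32_t *caps, int feature)
{
    assert(feature >= 0 && feature < NCAPINTS * 32);
    return (int)((caps[feature / 32] >> (feature % 32)) & 1u);
}
```

With only X86_FEATURE_CONSTANT_TSC set, has_feature() reports the
_BROKEN definition of UP as present and the _FIXED one as absent, so any
CPU with a constant TSC would look like it was running a UP kernel.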

--- 2.6.16-rc1-mm4-386.orig/include/asm-i386/cpufeature.h
+++ 2.6.16-rc1-mm4-386/include/asm-i386/cpufeature.h
@@ -71,7 +71,7 @@
 #define X86_FEATURE_P4 (3*32+ 7) /* P4 */
 #define X86_FEATURE_CONSTANT_TSC (3*32+ 8) /* TSC ticks at a constant rate */

-#define X86_FEATURE_UP (3*32+ 8) /* smp kernel running on up */
+#define X86_FEATURE_UP (3*32+ 9) /* smp kernel running on up */

 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3 (4*32+ 0) /* Streaming SIMD Extensions-3 */
--
Chuck

2006-02-02 21:50:55

by Andrew Morton

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

Chuck Ebbert <[email protected]> wrote:
>
> In-Reply-To: <[email protected]>
>
> On Thu, 2 Feb 2006 at 12:19:22 +1100, Neil Brown wrote:
>
> > My guess is that there is something wrong with the 'alternative'
> > stuff which strips out the lock prefix, but I couldn't see anything
> > obviously wrong. The CPUs don't have FEATURE_UP (see below) so it
> > cannot possibly be removing the 'lock' prefix... but it certainly acts
> > like it is.
>
> Look closer:
>
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm
> > constant_tsc pni monitor ds_cpl cid xtpr
>   ^^^^^^^^^^^^
>
> SMP alternatives is re-using the constant_tsc X86 feature bit.
>
> --- 2.6.16-rc1-mm4-386.orig/include/asm-i386/cpufeature.h
> +++ 2.6.16-rc1-mm4-386/include/asm-i386/cpufeature.h
> @@ -71,7 +71,7 @@
> #define X86_FEATURE_P4 (3*32+ 7) /* P4 */
> #define X86_FEATURE_CONSTANT_TSC (3*32+ 8) /* TSC ticks at a constant rate */
>
> -#define X86_FEATURE_UP (3*32+ 8) /* smp kernel running on up */
> +#define X86_FEATURE_UP (3*32+ 9) /* smp kernel running on up */
>
> /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
> #define X86_FEATURE_XMM3 (4*32+ 0) /* Streaming SIMD Extensions-3 */

Darn, how did you spot that?

Should `feature_up' appear in /proc/cpuinfo?

2006-02-02 22:41:40

by NeilBrown

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

On Thursday February 2, [email protected] wrote:
> Chuck Ebbert <[email protected]> wrote:
> >
> > In-Reply-To: <[email protected]>
> >
> > On Thu, 2 Feb 2006 at 12:19:22 +1100, Neil Brown wrote:
> >
> > > My guess is that there is something wrong with the 'alternative'
> > > stuff which strips out the lock prefix, but I couldn't see anything
> > > obviously wrong. The CPUs don't have FEATURE_UP (see below) so it
> > > cannot possibly be removing the 'lock' prefix... but it certainly acts
> > > like it is.
> >
> > Look closer:
> >
> > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > > cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm
> > > constant_tsc pni monitor ds_cpl cid xtpr
> >   ^^^^^^^^^^^^
> >
> > SMP alternatives is re-using the constant_tsc X86 feature bit.
> >
> > --- 2.6.16-rc1-mm4-386.orig/include/asm-i386/cpufeature.h
> > +++ 2.6.16-rc1-mm4-386/include/asm-i386/cpufeature.h
> > @@ -71,7 +71,7 @@
> > #define X86_FEATURE_P4 (3*32+ 7) /* P4 */
> > #define X86_FEATURE_CONSTANT_TSC (3*32+ 8) /* TSC ticks at a constant rate */
> >
> > -#define X86_FEATURE_UP (3*32+ 8) /* smp kernel running on up */
> > +#define X86_FEATURE_UP (3*32+ 9) /* smp kernel running on up */
> >
> > /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
> > #define X86_FEATURE_XMM3 (4*32+ 0) /* Streaming SIMD Extensions-3 */
>
> Darn, how did you spot that?

I can't say how he found the needle in the haystack, but I can confirm
that it fixes the problem. I'm running -mm4 quite successfully with
this patch now.

Thanks!

Thinks.. maybe this typo would have been harder to make if the columns
lined up better, like this:
> > #define X86_FEATURE_P4           (3*32+ 7) /* P4 */
> > #define X86_FEATURE_CONSTANT_TSC (3*32+ 8) /* TSC ticks at a constant rate */
> >
> > -#define X86_FEATURE_UP          (3*32+ 8) /* smp kernel running on up */
> > +#define X86_FEATURE_UP          (3*32+ 9) /* smp kernel running on up */


NeilBrown

2006-02-02 23:30:36

by Chuck Ebbert

Subject: Re: 2.6.16-rc1-mm4 i386 atomic operations broken on SMP (in modules at least)

In-Reply-To: <[email protected]>

On Thu, 2 Feb 2006 at 13:52:05 -0800, Andrew Morton wrote:

> Chuck Ebbert <[email protected]> wrote:
> >
> > SMP alternatives is re-using the constant_tsc X86 feature bit.
> >
>
> Darn, how did you spot that?

I went looking for which bit represented X86_FEATURE_UP and there
it was...

>
> Should `feature_up' appear in /proc/cpuinfo?

Probably. The bug would have been nearly impossible if that had
been done to begin with.
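
As a rough sketch of why a name table entry matters: a flag shows up in
the "flags" line only if its bit is set and its slot in the name table
is non-NULL. The table below copies the first entries of the
Linux-defined word from the patch that follows; format_flags() is a
hypothetical helper written for this example and is not the code in
proc.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Names for word 3 (Linux-defined); unnamed slots default to NULL. */
static const char *const word3_names[32] = {
    "cxmmx", "k6_mtrr", "cyrix_arr", "centaur_mcr",
    NULL, NULL, NULL, NULL,
    "constant_tsc", "up",
};

/*
 * Hypothetical helper: write the space-separated names of all set bits
 * in 'word' to 'out'. Bits whose name is NULL are silently skipped,
 * which is why an unnamed feature never appears in the output.
 */
static void format_flags(uint32_t word, const char *const names[32],
                         char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    out[0] = '\0';

    for (int bit = 0; bit < 32; bit++) {
        if (!(word & (UINT32_C(1) << bit)) || names[bit] == NULL)
            continue;
        size_t used = strlen(out);
        /* snprintf truncates safely if the buffer is too small. */
        snprintf(out + used, outlen - used, "%s%s",
                 used ? " " : "", names[bit]);
    }
}
```

With bits 8 and 9 both set this produces "constant_tsc up"; with the
slot for bit 9 left as NULL, as it was before the patch, the second
name would simply be missing.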


i386: show x86 feature "up" in cpuinfo

Show feature bit "up" (SMP kernel running on uniprocessor) in
/proc/cpuinfo.

Signed-off-by: Chuck Ebbert <[email protected]>

--- 2.6.16-rc1-mm4-386.orig/arch/i386/kernel/cpu/proc.c
+++ 2.6.16-rc1-mm4-386/arch/i386/kernel/cpu/proc.c
@@ -40,7 +40,7 @@ static int show_cpuinfo(struct seq_file
/* Other (Linux-defined) */
"cxmmx", "k6_mtrr", "cyrix_arr", "centaur_mcr",
NULL, NULL, NULL, NULL,
- "constant_tsc", NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+ "constant_tsc", "up", NULL, NULL, NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,

--
Chuck