2004-04-30 16:13:59

by Mikael Pettersson

Subject: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

The change to mm/slab.c between 2.6.6-rc2-bk4 and -bk5
broke x86-64 SMP. The symptoms are general protection
faults in __switch_to shortly after init starts, and
then the machine is dead. (Can't be more specific, my
box can't log early boot oopses.)

I'm only seeing this with x86-64 SMP; x86-64 UP and i386
SMP on the same machine (Athlon64 UP) have no problems.

Reverting 2.6.6-rc2-bk5's change to mm/slab.c eliminates
the problem.

/Mikael


2004-04-30 18:23:28

by Jonathan Corbet

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

> The change to mm/slab.c between 2.6.6-rc2-bk4 and -bk5
> broke x86-64 SMP. The symptoms are general protection
> faults in __switch_to shortly after init starts, and
> then the machine is dead. (Can't be more specific, my
> box can't log early boot oopses.)
>
> I'm only seeing this with x86-64 SMP; x86-64 UP and i386
> SMP on the same machine (Athlon64 UP) have no problems.

FWIW, this sure looks a lot like the boot-time crash I'm seeing; I get the
same __switch_to oopses once init starts. *But* I'm running a UP,
no-preempt kernel. And I get it with -rc1 as well. Might reverting the
later slab change be concealing a different problem?

jon

2004-04-30 18:27:36

by Andrew Morton

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

Mikael Pettersson <[email protected]> wrote:
>
> The change to mm/slab.c between 2.6.6-rc2-bk4 and -bk5
> broke x86-64 SMP. The symptoms are general protection
> faults in __switch_to shortly after init starts, and
> then the machine is dead. (Can't be more specific, my
> box can't log early boot oopses.)
>
> I'm only seeing this with x86-64 SMP; x86-64 UP and i386
> SMP on the same machine (Athlon64 UP) have no problems.
>
> Reverting 2.6.6-rc2-bk5's change to mm/slab.c eliminates
> the problem.

The "-bk5" terminology doesn't mean much to people who use bitkeeper or who
use http://www.kernel.org/pub/linux/kernel/v2.5/testing/cset/ - I assume
you refer to the alignment changes?

Does this fix?

diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
--- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
+++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
@@ -20,6 +20,8 @@
 #include <asm/mmsegment.h>
 #include <linux/personality.h>
 
+#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
+
 #define TF_MASK 0x00000100
 #define IF_MASK 0x00000200
 #define IOPL_MASK 0x00003000

_

2004-04-30 19:26:24

by Rafał J. Wysocki

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

On Friday 30 of April 2004 20:27, Andrew Morton wrote:
> Mikael Pettersson <[email protected]> wrote:
> > The change to mm/slab.c between 2.6.6-rc2-bk4 and -bk5
> > broke x86-64 SMP. The symptoms are general protection
> > faults in __switch_to shortly after init starts, and
> > then the machine is dead. (Can't be more specific, my
> > box can't log early boot oopses.)
> >
> > I'm only seeing this with x86-64 SMP; x86-64 UP and i386
> > SMP on the same machine (Athlon64 UP) have no problems.
> >
> > Reverting 2.6.6-rc2-bk5's change to mm/slab.c eliminates
> > the problem.
>
> The "-bk5" terminology doesn't mean much to people who use bitkeeper or who
> use http://www.kernel.org/pub/linux/kernel/v2.5/testing/cset/ - I assume
> you refer to the alignment changes?
>
> Does this fix?
>
> diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> @@ -20,6 +20,8 @@
> #include <asm/mmsegment.h>
> #include <linux/personality.h>
>
> +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> +
> #define TF_MASK 0x00000100
> #define IF_MASK 0x00000200
> #define IOPL_MASK 0x00003000
>

AFAICS, yes, it does. :-)
I'm now (happily) running 2.6.6-rc3 on a dual-Opteron box.

RJW

2004-04-30 19:31:49

by Jonathan Corbet

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

> Does this fix?
>
> diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> @@ -20,6 +20,8 @@
> #include <asm/mmsegment.h>
> #include <linux/personality.h>
>
> +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> +

That made my x86_64 boot problem go away; with that patch the system comes
up just fine.

Now I have weird display problems with my Radeon card instead. Ever seen X
running 100% in kernel space, unkillable?

jon

Jonathan Corbet
Executive editor, LWN.net
[email protected]

2004-04-30 20:22:01

by Andrew Morton

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

"R. J. Wysocki" <[email protected]> wrote:
>
> > Does this fix?
> >
> > diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> > --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> > +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> > @@ -20,6 +20,8 @@
> > #include <asm/mmsegment.h>
> > #include <linux/personality.h>
> >
> > +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> > +
> > #define TF_MASK 0x00000100
> > #define IF_MASK 0x00000200
> > #define IOPL_MASK 0x00003000
> >
>
> AFAICS, yes, it does. :-)
> I'm now (happily) running 2.6.6-rc3 on a dual-Opteron box.

OK, thanks. I suspect that change has broken other architectures for the
same reason.

I think I'll just change the default:


diff -puN kernel/fork.c~task-struct-alignment-fix kernel/fork.c
--- 25/kernel/fork.c~task-struct-alignment-fix Fri Apr 30 13:22:24 2004
+++ 25-akpm/kernel/fork.c Fri Apr 30 13:22:36 2004
@@ -211,7 +211,7 @@ void __init fork_init(unsigned long memp
 {
 #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
 #ifndef ARCH_MIN_TASKALIGN
-#define ARCH_MIN_TASKALIGN 0
+#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
 #endif
 /* create a slab on which task_structs can be allocated */
 task_struct_cachep =

_
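
To see why a default of 0 can bite only some tasks, here is a minimal sketch in C. It assumes (this is not stated in the thread) that a requested alignment of 0 makes the allocator fall back to word alignment, that objects are packed at a stride of the object size rounded up to the alignment, and that the slab itself starts on a 16-byte boundary. The object size of 1000 bytes and all function names are hypothetical and are not taken from mm/slab.c.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up_pow2(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Offset of the idx-th object when objects are packed back to back
 * at a stride of round_up_pow2(obj_size, align). Hypothetical model. */
static size_t object_offset(size_t idx, size_t obj_size, size_t align)
{
    return idx * round_up_pow2(obj_size, align);
}

/* Count how many of the first n objects start on a 16-byte boundary. */
static size_t count_16_aligned(size_t n, size_t obj_size, size_t align)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        if (object_offset(i, obj_size, align) % 16 == 0)
            ok++;
    return ok;
}
```

With a hypothetical 1000-byte object, `count_16_aligned(4, 1000, 8)` finds only every other object on a 16-byte boundary, while `count_16_aligned(4, 1000, 64)` finds all four, because the stride becomes 1024.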

2004-04-30 20:28:51

by Andrew Morton

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

[email protected] (Jonathan Corbet) wrote:
>
> > Does this fix?
> >
> > diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> > --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> > +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> > @@ -20,6 +20,8 @@
> > #include <asm/mmsegment.h>
> > #include <linux/personality.h>
> >
> > +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> > +
>
> That made my x86_64 boot problem go away; with that patch the system comes
> up just fine.

OK, thanks. It broke parisc too...

> Now I have weird display problems with my Radeon card instead. Ever seen X
> running 100% in kernel space, unkillable?

I did, about a year ago. It was spinning madly in some ioctl waiting for a
bit in a device register to change state. Are you able to generate a
kernel profile while it's being silly? That will tell us where it's stuck.

2004-05-01 01:48:10

by Andi Kleen

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

> Does this fix?
>
> diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> @@ -20,6 +20,8 @@
> #include <asm/mmsegment.h>
> #include <linux/personality.h>
>
> +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES

16 should be enough actually. The problem is the FXSAVE instruction that
is used to switch the FPU state, and that only requires 16 byte alignment.

-Andi
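
The FXSAVE constraint described above can be sketched in C as follows. This is only an illustration: `struct demo_task` is a hypothetical stand-in, not the kernel's task_struct layout, and the helper name is invented. It shows that a save area placed at a 16-byte-multiple offset is aligned exactly when the enclosing object starts on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FXSAVE stores a 512-byte image; a memory operand that is not
 * 16-byte aligned raises a fault (typically #GP). */
#define FXSAVE_ALIGN 16
#define FXSAVE_SIZE  512

/* Hypothetical stand-in for a task structure embedding the FPU save area. */
struct demo_task {
    long state;  /* some leading field */
    _Alignas(FXSAVE_ALIGN) unsigned char fxsave_area[FXSAVE_SIZE];
};

/* The compiler places the member at a multiple of 16 inside the struct. */
static_assert(offsetof(struct demo_task, fxsave_area) % FXSAVE_ALIGN == 0,
              "save area offset must be a multiple of 16");

/* Returns 1 if the save area of a task located at `base` would be
 * suitably aligned for FXSAVE, 0 otherwise. */
static int fxsave_area_aligned(uintptr_t base)
{
    uintptr_t addr = base + offsetof(struct demo_task, fxsave_area);
    return addr % FXSAVE_ALIGN == 0;
}
```

A task placed at 0x1000 passes the check, while one placed at 0x1008 (word aligned only) does not, which is consistent with faults showing up in __switch_to when the FPU state is saved.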

2004-05-01 02:01:33

by Andrew Morton

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

Andi Kleen <[email protected]> wrote:
>
> > Does this fix?
> >
> > diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> > --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> > +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> > @@ -20,6 +20,8 @@
> > #include <asm/mmsegment.h>
> > #include <linux/personality.h>
> >
> > +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
>
> 16 should be enough actually. The problem is the FXSAVE instruction that
> is used to switch the FPU state, and that only requires 16 byte alignment.
>

yup. I sent Linus the patch which changes the default from 0 to
L1_CACHE_BYTES in kernel/fork.c. x86_64 can override that by setting
ARCH_MIN_TASKALIGN to 16 in asm/processor.h.
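
The default-plus-override arrangement described here can be sketched as a self-contained C fragment. The value 64 for L1_CACHE_BYTES is an assumption for illustration only, and `DEMO_ARCH_OVERRIDE` and `task_align()` are hypothetical names that stand in for an architecture header and for the code that consumes the macro.

```c
#include <assert.h>

/* Assumed value for illustration; in the kernel this comes from the
 * architecture's cache header. */
#define L1_CACHE_BYTES 64

/* An architecture header may set its own minimum, e.g. 16 for FXSAVE.
 * Build with -DDEMO_ARCH_OVERRIDE to simulate that (hypothetical switch). */
#ifdef DEMO_ARCH_OVERRIDE
#define ARCH_MIN_TASKALIGN 16
#endif

/* Generic code supplies the default only if the architecture did not. */
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
#endif

static_assert((ARCH_MIN_TASKALIGN & (ARCH_MIN_TASKALIGN - 1)) == 0,
              "task alignment must be a power of two");

/* Hypothetical consumer: the alignment that would be requested for
 * task structures. */
static unsigned long task_align(void)
{
    return ARCH_MIN_TASKALIGN;
}
```

Built as is, `task_align()` yields the cache-line default; built with `-DDEMO_ARCH_OVERRIDE`, it yields 16. The point of the design is that architectures with no special requirement get a safe default without touching their headers.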

2004-05-01 02:02:46

by Andi Kleen

Subject: Re: [BUG] 2.6.6-rc2-bk5 mm/slab.c change broke x86-64 SMP

On Fri, Apr 30, 2004 at 07:01:02PM -0700, Andrew Morton wrote:
> Andi Kleen <[email protected]> wrote:
> >
> > > Does this fix?
> > >
> > > diff -puN include/asm-x86_64/processor.h~a include/asm-x86_64/processor.h
> > > --- 25/include/asm-x86_64/processor.h~a Fri Apr 30 11:24:58 2004
> > > +++ 25-akpm/include/asm-x86_64/processor.h Fri Apr 30 11:25:28 2004
> > > @@ -20,6 +20,8 @@
> > > #include <asm/mmsegment.h>
> > > #include <linux/personality.h>
> > >
> > > +#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> >
> > 16 should be enough actually. The problem is the FXSAVE instruction that
> > is used to switch the FPU state, and that only requires 16 byte alignment.
> >
>
> yup. I sent Linus the patch which changes the default from 0 to
> L1_CACHE_BYTES in kernel/fork.c. x86_64 can override that by setting
> ARCH_MIN_TASKALIGN to 16 in asm/processor.h.

Ok, I will change it in my next patchkit.

For i386 it is the same - 16 should be enough.

-Andi