2008-06-19 14:07:58

by Alexander Beregalov

Subject: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

Hi

A Sun Ultra 10 cannot boot the latest 2.6.26-rc6-git; it hangs right after

Console: colour dummy device 80x25
console handover: boot [earlyprom0] -> real [tty0]

netconsole does not work at this early stage.
Here is the boot log of the working kernel closest to the bad commit:

PROMLIB: Sun IEEE Boot Prom 'OBP 3.25.3 2000/06/29 14:12'
PROMLIB: Root node compatible:
Linux version 2.6.26-rc3-00123-g3651751 (alexb@sparky) (gcc version
4.1.2 (Gentoo 4.1.2 p1.0.1)) #21 PREEMPT Thu Jun 19 17:42:45 MSD 2008
console [earlyprom0] enabled
ARCH: SUN4U
Ethernet address: 08:00:20:ff:e6:ff
Kernel: Using 3 locked TLB entries for main kernel image.
Remapping the kernel... done.
OF stdout device is: /pci@1f,0/pci@1,1/SUNW,m64B@2
PROM: Built device tree with 42083 bytes of memory.
Top of RAM: 0x1ff46000, Total RAM: 0x1ff44000
Memory hole size: 0MB
Entering add_active_range(0, 0, 65407) 0 entries of 256 used
Entering add_active_range(0, 65408, 65443) 1 entries of 256 used
[0000000200000000-fffff80001000000] page_structs=131072 node=0 entry=0/0
[0000000200000000-fffff80001400000] page_structs=131072 node=0 entry=1/0
Allocated 532480 bytes for kernel page tables.
Zone PFN ranges:
Normal 0 -> 65443
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0: 0 -> 65407
0: 65408 -> 65443
On node 0 totalpages: 65442
Normal zone: 447 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 64995 pages, LIFO batch:15
Movable zone: 0 pages used for memmap
Booting Linux...
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 64995
Kernel command line: root=/dev/sda2
Preemptible RCU implementation.
PID hash table entries: 2048 (order: 11, 16384 bytes)
clocksource: mult[245d1] shift[16]
clockevent: mult[70a3d70a] shift[32]
Console: colour dummy device 80x25
console handover: boot [earlyprom0] -> real [tty0]
======= It hangs here! =======
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES: 8
... MAX_LOCK_DEPTH: 48
... MAX_LOCKDEP_KEYS: 2048
... CLASSHASH_SIZE: 1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS: 16384
... CHAINHASH_SIZE: 8192
memory used by lock dependency info: 1648 kB
per task-struct memory footprint: 2688 bytes


I have bisected it:

a051bc5bb1ac6dc138d529077fa20cbbc6622d95 is first bad commit
commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95
Author: David S. Miller <[email protected]>
Date: Wed May 21 18:14:28 2008 -0700

sparc64: Fix kernel thread stack termination.

Because of the silly way I set up the initial stack for
new kernel threads, there is a loop at the top of the
stack.

To fix this, properly add another stack frame that is copied
from the parent and terminate it in the child by setting
the frame pointer in that frame to zero.

Signed-off-by: David S. Miller <[email protected]>

:040000 040000 beba7e8c740188a7c19a3569f08b1edd25b8000d
aed42d5af3ab9a84ccb3dfe07eb444a78f9001fa M arch


2008-06-19 16:02:58

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/6/19 Alexander Beregalov <[email protected]>:
> Hi
>
> A Sun Ultra 10 cannot boot the latest 2.6.26-rc6-git; it hangs right after
>
> Console: colour dummy device 80x25
> console handover: boot [earlyprom0] -> real [tty0]
<cut>
> I have done bisect:
>
> a051bc5bb1ac6dc138d529077fa20cbbc6622d95 is first bad commit
> commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95
> Author: David S. Miller <[email protected]>
> Date: Wed May 21 18:14:28 2008 -0700
>
> sparc64: Fix kernel thread stack termination.
>
> Because of the silly way I set up the initial stack for
> new kernel threads, there is a loop at the top of the
> stack.
>
> To fix this, properly add another stack frame that is copied
> from the parent and terminate it in the child by setting
> the frame pointer in that frame to zero.
>
> Signed-off-by: David S. Miller <[email protected]>
>
> :040000 040000 beba7e8c740188a7c19a3569f08b1edd25b8000d
> aed42d5af3ab9a84ccb3dfe07eb444a78f9001fa M arch
>

I have reverted this commit and
2.6.26-rc6-00233-g99d3b2d works fine. It can boot.

2008-06-19 23:10:23

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Thu, 19 Jun 2008 20:02:44 +0400

> I have reverted this commit and
> 2.6.26-rc6-00233-g99d3b2d works fine. It can boot.

Thanks for your report, I'll look into this.

The irony is that this changeset was added to fix a
hang with lockdep enabled :-)

2008-06-20 02:00:59

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Thu, 19 Jun 2008 20:02:44 +0400

> I have reverted this commit and
> 2.6.26-rc6-00233-g99d3b2d works fine. It can boot.

I know what the problem is.

It's hanging because lockdep is starting to actually be used. :-)
Beforehand various bugs caused lockdep to disable itself almost
immediately. So it was never actually enabled on sparc64.

When you revert that changeset, there is a loop in the backtrace of
all kernel threads, and therefore lockdep turns itself off when all of
the stack backtrace slots get consumed by that loop in the backtraces.

After the revert you should see a set of kernel messages like:

BUG: MAX_STACK_TRACE_ENTRIES too low!
turning off the locking correctness validator.

and that would confirm my theory.
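
To illustrate the mechanism described above, here is a minimal user-space sketch, not kernel code. The names `struct frame`, `walk_frames` and `demo_walk` are hypothetical, and the fixed per-walk limit is only a stand-in for whatever cap the caller of a backtrace imposes. It shows that a walker following a looped frame-pointer chain can stop only by hitting its limit, while a NULL-terminated chain ends on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a saved stack frame: only the link matters. */
struct frame {
	struct frame *fp;	/* caller's frame, or NULL at the top of the stack */
};

/*
 * Follow the fp chain, counting at most max_entries frames.
 * Returns the number of frames visited; *truncated is set to 1 when the
 * limit was reached before a NULL terminator was found.
 */
size_t walk_frames(const struct frame *start, size_t max_entries,
		   int *truncated)
{
	const struct frame *f = start;
	size_t n = 0;

	assert(truncated != NULL);
	*truncated = 0;
	while (f != NULL) {
		if (n == max_entries) {
			*truncated = 1;	/* the walker gives up here */
			break;
		}
		n++;
		f = f->fp;
	}
	return n;
}

/* Build a 3-frame chain; if looped is nonzero the top frame points at itself. */
size_t demo_walk(int looped, size_t max_entries, int *truncated)
{
	struct frame top, mid, leaf;

	top.fp = looped ? &top : NULL;
	mid.fp = &top;
	leaf.fp = &mid;
	return walk_frames(&leaf, max_entries, truncated);
}
```

With a terminated chain the walk visits three frames; with the looped chain it consumes every available entry and reports truncation, which is the situation in which a validator with a fixed budget of trace slots would run out.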

Lockdep really isn't usable on sparc64 at this time and it's one of
the ongoing things I'm trying to get fully fixed. The changeset in
question fixes a real bug in stack backtrace output, so it should
stay in the tree.

FWIW, I just verified the above on my ultra5 box.

2008-06-20 21:19:44

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/6/20 David Miller <[email protected]>:
> When you revert that changeset, there is a loop in the backtrace of
> all kernel threads, and therefore lockdep turns itself off when all of
> the stack backtrace slots get consumed by that loop in the backtraces.
>
> After the revert you should see a set of kernel messages like:
>
> BUG: MAX_STACK_TRACE_ENTRIES too low!
> turning off the locking correctness validator.
>
> and that would confirm my theory.

No, I do not have such messages.
Does it mean that we need a different explanation?

2008-06-20 21:21:50

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Sat, 21 Jun 2008 01:19:29 +0400

> 2008/6/20 David Miller <[email protected]>:
> > When you revert that changeset, there is a loop in the backtrace of
> > all kernel threads, and therefore lockdep turns itself off when all of
> > the stack backtrace slots get consumed by that loop in the backtraces.
> >
> > After the revert you should see a set of kernel messages like:
> >
> > BUG: MAX_STACK_TRACE_ENTRIES too low!
> > turning off the locking correctness validator.
> >
> > and that would confirm my theory.
>
> No, I do not have such messages.

When you revert that patch, this is exactly what you should
see.

Make sure to check "dmesg" because they get lost on the screen
during the time between registering the VC console driver and
the ATY framebuffer registry.

2008-06-20 22:51:50

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Sat, 21 Jun 2008 02:42:36 +0400

> I can not see such messages in lockdep output:
>
> [ 17.960404] Console: colour dummy device 80x25
> [ 18.060543] console handover: boot [earlyprom0] -> real [tty0]
> [ 18.169062] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc.,
> Ingo Molnar
> [ 18.169180] ... MAX_LOCKDEP_SUBCLASSES: 8
> [ 18.169246] ... MAX_LOCK_DEPTH: 48
> [ 18.169310] ... MAX_LOCKDEP_KEYS: 2048
> [ 18.169376] ... CLASSHASH_SIZE: 1024
> [ 18.169444] ... MAX_LOCKDEP_ENTRIES: 8192
> [ 18.169509] ... MAX_LOCKDEP_CHAINS: 16384
> [ 18.169577] ... CHAINHASH_SIZE: 8192
> [ 18.169643] memory used by lock dependency info: 1648 kB
> [ 18.169722] per task-struct memory footprint: 2688 bytes
> [ 18.169798] ------------------------
> [ 18.169855] | Locking API testsuite:

Something is screwy here... Hmmm...

When I added the changeset in question, it fixed a problem in that
any backtrace of a kernel thread would loop forever at the end.
Any stack backtrace would hang or reach a safety limit (such as
the one imposed by lockdep).

Please double check that you are precisely reverting this patch
below _before_ doing these tests:

commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95
Author: David S. Miller <[email protected]>
Date: Wed May 21 18:14:28 2008 -0700

sparc64: Fix kernel thread stack termination.

Because of the silly way I set up the initial stack for
new kernel threads, there is a loop at the top of the
stack.

To fix this, properly add another stack frame that is copied
from the parent and terminate it in the child by setting
the frame pointer in that frame to zero.

Signed-off-by: David S. Miller <[email protected]>

diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
index 0a0c05f..2084f81 100644
--- a/arch/sparc64/kernel/process.c
+++ b/arch/sparc64/kernel/process.c
@@ -657,20 +657,39 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long sp,
struct task_struct *p, struct pt_regs *regs)
{
struct thread_info *t = task_thread_info(p);
+ struct sparc_stackf *parent_sf;
+ unsigned long child_stack_sz;
char *child_trap_frame;
+ int kernel_thread;

- /* Calculate offset to stack_frame & pt_regs */
- child_trap_frame = task_stack_page(p) + (THREAD_SIZE - (TRACEREG_SZ+STACKFRAME_SZ));
- memcpy(child_trap_frame, (((struct sparc_stackf *)regs)-1), (TRACEREG_SZ+STACKFRAME_SZ));
+ kernel_thread = (regs->tstate & TSTATE_PRIV) ? 1 : 0;
+ parent_sf = ((struct sparc_stackf *) regs) - 1;

- t->flags = (t->flags & ~((0xffUL << TI_FLAG_CWP_SHIFT) | (0xffUL << TI_FLAG_CURRENT_DS_SHIFT))) |
+ /* Calculate offset to stack_frame & pt_regs */
+ child_stack_sz = ((STACKFRAME_SZ + TRACEREG_SZ) +
+ (kernel_thread ? STACKFRAME_SZ : 0));
+ child_trap_frame = (task_stack_page(p) +
+ (THREAD_SIZE - child_stack_sz));
+ memcpy(child_trap_frame, parent_sf, child_stack_sz);
+
+ t->flags = (t->flags & ~((0xffUL << TI_FLAG_CWP_SHIFT) |
+ (0xffUL << TI_FLAG_CURRENT_DS_SHIFT))) |
(((regs->tstate + 1) & TSTATE_CWP) << TI_FLAG_CWP_SHIFT);
t->new_child = 1;
t->ksp = ((unsigned long) child_trap_frame) - STACK_BIAS;
- t->kregs = (struct pt_regs *)(child_trap_frame+sizeof(struct sparc_stackf));
+ t->kregs = (struct pt_regs *) (child_trap_frame +
+ sizeof(struct sparc_stackf));
t->fpsaved[0] = 0;

- if (regs->tstate & TSTATE_PRIV) {
+ if (kernel_thread) {
+ struct sparc_stackf *child_sf = (struct sparc_stackf *)
+ (child_trap_frame + (STACKFRAME_SZ + TRACEREG_SZ));
+
+ /* Zero terminate the stack backtrace. */
+ child_sf->fp = NULL;
+ t->kregs->u_regs[UREG_FP] =
+ ((unsigned long) child_sf) - STACK_BIAS;
+
/* Special case, if we are spawning a kernel thread from
* a userspace task (via KMOD, NFS, or similar) we must
* disable performance counters in the child because the
@@ -681,12 +700,7 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long sp,
t->pcr_reg = 0;
t->flags &= ~_TIF_PERFCTR;
}
- t->kregs->u_regs[UREG_FP] = t->ksp;
t->flags |= ((long)ASI_P << TI_FLAG_CURRENT_DS_SHIFT);
- flush_register_windows();
- memcpy((void *)(t->ksp + STACK_BIAS),
- (void *)(regs->u_regs[UREG_FP] + STACK_BIAS),
- sizeof(struct sparc_stackf));
t->kregs->u_regs[UREG_G6] = (unsigned long) t;
t->kregs->u_regs[UREG_G4] = (unsigned long) t->task;
} else {
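
The core idea of the patch above (copy one extra frame from the parent, then terminate it in the child) can be sketched in user space as follows. This is a simplified model under stated assumptions: `struct stackf`, `struct regs`, `struct trap_area`, `copy_kthread_frames` and `backtrace_depth` are hypothetical stand-ins, `STACK_BIAS` is ignored, and the looping parent layout used to exercise it is made up for illustration rather than a reconstruction of the old behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for sparc_stackf and pt_regs. */
struct stackf {
	struct stackf *fp;
};

struct regs {
	uintptr_t saved_fp;	/* plays the role of u_regs[UREG_FP] */
};

/* Parent layout at the trap: [stackf][regs][stackf], contiguous. */
struct trap_area {
	struct stackf sf;
	struct regs regs;
	struct stackf caller_sf;	/* the extra frame copied for kernel threads */
};

/*
 * Copy the parent's trap area, including the extra frame, to the top of
 * the child's stack and terminate the child's backtrace at that frame.
 */
struct trap_area *copy_kthread_frames(void *stack_base, size_t stack_size,
				      const struct trap_area *parent)
{
	struct trap_area *child;

	assert(stack_size >= sizeof(*child));
	child = (struct trap_area *)((unsigned char *)stack_base +
				     stack_size - sizeof(*child));
	memcpy(child, parent, sizeof(*child));

	/* Zero terminate the stack backtrace. */
	child->caller_sf.fp = NULL;
	child->regs.saved_fp = (uintptr_t)&child->caller_sf;
	return child;
}

/* Count frames reachable from the saved frame pointer, up to limit. */
size_t backtrace_depth(const struct regs *regs, size_t limit)
{
	const struct stackf *sf = (const struct stackf *)regs->saved_fp;
	size_t n = 0;

	while (sf != NULL && n < limit) {
		n++;
		sf = sf->fp;
	}
	return n;
}
```

Calling `backtrace_depth()` on the child's copied registers yields a depth of one frame regardless of what the parent's chain looked like, because the walk stops at the frame whose `fp` was set to NULL.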

2008-06-20 23:12:44

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/6/21 David Miller <[email protected]>:
> Something is screwy here... Hmmm...
>
> When I added the changeset in question, it fixed a problem in that
> any backtrace of a kernel thread would loop forever at the end.
> Any stack backtrace would hang or reach a safety limit (such as
> the one imposed by lockdep).
>
> Please double check that you are precisely reverting this patch
> below _before_ doing these tests:
>
> commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95

Yes, I am sure. It runs without this commit and hangs with it.
I can connect a serial console, but if it is an infinite loop it will not
provide more info.

$ git log arch/sparc64/kernel/process.c
commit 99d3b2d0d3df1fa171a7ee1d2d3a92f540873b15
Author: alexb <alexb@sparky>
Date: Thu Jun 19 18:49:46 2008 +0400

Revert "sparc64: Fix kernel thread stack termination."

This reverts commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95.

commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95
Author: David S. Miller <[email protected]>
Date: Wed May 21 18:14:28 2008 -0700

sparc64: Fix kernel thread stack termination.

2008-06-20 23:21:48

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Sat, 21 Jun 2008 03:12:30 +0400

> 2008/6/21 David Miller <[email protected]>:
> > Something is screwy here... Hmmm...
> >
> > When I added the changeset in question, it fixed a problem in that
> > any backtrace of a kernel thread would loop forever at the end.
> > Any stack backtrace would hang or reach a safety limit (such as
> > the one imposed by lockdep).
> >
> > Please double check that you are precisely reverting this patch
> > below _before_ doing these tests:
> >
> > commit a051bc5bb1ac6dc138d529077fa20cbbc6622d95
>
> Yes, I am sure. It runs without this commit and hangs with it.
> I can connect a serial console, but if it is an infinite loop it will not
> provide more info.

Ok I have to find some way to reproduce this. Please post the
kernel .config you are using during these tests. Also please
let me know what distribution and compiler version you are using.

Thanks.

2008-06-20 23:36:22

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/6/21 David Miller <[email protected]>:
> Ok I have to find some way to reproduce this. Please post the
> kernel .config you are using during these tests. Also please
> let me know what distribution and compiler version you are using.
>
> Thanks.
>
It is Gentoo; gcc version 4.1.2 (Gentoo 4.1.2 p1.0.1); sys-devel/kgcc64-4.1.2.
I cross-compiled it as you advised me:
make -j2 CROSS_COMPILE=sparc64-unknown-linux-gnu- image modules &&
sudo make modules_install

The config is attached.


Attachments:
(No filename) (494.00 B)
sparky-config.gz (6.06 kB)

2008-07-07 09:19:19

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

Hi David

Is it possible to add some debug code to understand what is going on?

2008-07-07 11:01:36

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Mon, 7 Jul 2008 13:19:05 +0400

> Hi David
>
> Is it possible to add some debug code to understand what is going on?

What do you want me to add? The thing hangs and we have no idea
where since there are no messages on the console and the hang
happens before we have any kind of console output.

Why do you think I haven't been able to fix this yet?

Besides I'm too busy with some networking stuff to work at
all on this at the moment, so folks will just need to be
patient or do the grunt work of debugging this themselves.

2008-07-07 13:05:28

by Mikael Pettersson

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

David Miller writes:
> From: "Alexander Beregalov" <[email protected]>
> Date: Mon, 7 Jul 2008 13:19:05 +0400
>
> > Hi David
> >
> > Is it possible to add some debug code to understand what is going on?
>
> What do you want me to add? The thing hangs and we have no idea
> where since there are no messages on the console and the hang
> happens before we have any kind of console output.
>
> Why do you think I haven't been able to fix this yet?
>
> Besides I'm too busy with some networking stuff to work at
> all on this at the moment, so folks will just need to be
> patient or do the grunt work of debugging this themselves.

My Ultra 5 (same mainboard as the Ultra 10) boots 2.6.26-rc9
just fine. So Alexander's problem is probably caused by .config
settings or his toolchain.

My working 2.6.26-rc9 .config and a boot log are available in
<http://user.it.uu.se/~mikpe/linux/ultra5/> in case someone wants
to use them as a starting point for debugging this problem.

The toolchain I used has gcc-4.3.1 and binutils-2.17.50.0.3.

/Mikael

2008-07-07 15:59:19

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

I have turned off LOCKDEP and it boots properly.
2.6.26-rc9-00005-g1b40a89

Mikael's config also does not contain LOCKDEP.

2008-08-08 06:01:49

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Mon, 7 Jul 2008 19:59:04 +0400

> I have turned off LOCKDEP and it boots properly.
> 2.6.26-rc9-00005-g1b40a89
>
> Mikael's config also does not contain LOCKDEP.

I have finally reproduced the problem locally and figured out the
bug.

Please try this patch:

sparc64: Fix end-of-stack checking in save_stack_trace().

Bug reported by Alexander Beregalov.

Before we dereference the stack frame or try to peek at the
pt_regs magic value, make sure the entire object is within
the kernel stack bounds.

Signed-off-by: David S. Miller <[email protected]>

diff --git a/arch/sparc64/kernel/stacktrace.c b/arch/sparc64/kernel/stacktrace.c
index c73ce3f..c5576e8 100644
--- a/arch/sparc64/kernel/stacktrace.c
+++ b/arch/sparc64/kernel/stacktrace.c
@@ -25,13 +25,15 @@ void save_stack_trace(struct stack_trace *trace)

/* Bogus frame pointer? */
if (fp < (thread_base + sizeof(struct thread_info)) ||
- fp >= (thread_base + THREAD_SIZE))
+ fp > (thread_base + THREAD_SIZE - sizeof(struct sparc_stackf)))
break;

sf = (struct sparc_stackf *) fp;
regs = (struct pt_regs *) (sf + 1);

- if ((regs->magic & ~0x1ff) == PT_REGS_MAGIC) {
+ if (((unsigned long)regs <=
+ (thread_base + THREAD_SIZE - sizeof(*regs))) &&
+ (regs->magic & ~0x1ff) == PT_REGS_MAGIC) {
if (!(regs->tstate & TSTATE_PRIV))
break;
pc = regs->tpc;
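
The difference between the old and new frame-pointer checks in the patch above can be seen in a small user-space sketch. The constants below are assumed values chosen for illustration (they stand in for `THREAD_SIZE`, `sizeof(struct thread_info)` and `sizeof(struct sparc_stackf)` and are not taken from the kernel headers), and `fp_ok_old` / `fp_ok_new` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THREAD_SIZE	8192u	/* assumed stack size, illustration only */
#define INFO_SIZE	64u	/* stand-in for sizeof(struct thread_info) */
#define FRAME_SIZE	192u	/* stand-in for sizeof(struct sparc_stackf) */

/* Old check: only the first byte of the frame must lie inside the stack. */
int fp_ok_old(uintptr_t base, uintptr_t fp)
{
	return !(fp < base + INFO_SIZE || fp >= base + THREAD_SIZE);
}

/* New check: the whole frame must fit below the end of the stack. */
int fp_ok_new(uintptr_t base, uintptr_t fp)
{
	assert(THREAD_SIZE >= FRAME_SIZE);
	return !(fp < base + INFO_SIZE ||
		 fp > base + THREAD_SIZE - FRAME_SIZE);
}
```

A frame pointer that sits a few bytes below the end of the stack passes the old test even though most of the frame it names lies beyond the stack; the new test rejects it, and the last address it accepts is exactly one frame below the end.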

2008-08-08 09:31:53

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/8/8 David Miller <[email protected]>:
> From: "Alexander Beregalov" <[email protected]>
> Date: Mon, 7 Jul 2008 19:59:04 +0400
>
>> I have turned off LOCKDEP and it boots properly.
>> 2.6.26-rc9-00005-g1b40a89
>>
>> Mikael's config also does not contain LOCKDEP.
>
> I have finally reproduced the problem locally and figured out the
> bug.
>
> Please try this patch:
>
Thanks David, but 2.6.27-rc2-00166-gaeee90d hangs in the same way.

The config is attached.


Attachments:
(No filename) (474.00 B)
config-27-lockdep.gz (6.54 kB)

2008-08-08 09:40:39

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Fri, 8 Aug 2008 13:31:40 +0400

> 2008/8/8 David Miller <[email protected]>:
> > From: "Alexander Beregalov" <[email protected]>
> > Date: Mon, 7 Jul 2008 19:59:04 +0400
> >
> >> I have turned off LOCKDEP and it boots properly.
> >> 2.6.26-rc9-00005-g1b40a89
> >>
> >> Mikael's config also does not contain LOCKDEP.
> >
> > I have finally reproduced the problem locally and figured out the
> > bug.
> >
> > Please try this patch:
> >
> Thanks David, but 2.6.27-rc2-00166-gaeee90d hangs in the same way.

That patch was for you to add on top of whatever tree you
have handy. Did you apply the patch?

That patch will fix all trees.

2008-08-08 10:14:31

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/8/8 David Miller <[email protected]>:
> From: "Alexander Beregalov" <[email protected]>
> Date: Fri, 8 Aug 2008 13:31:40 +0400
>
>> 2008/8/8 David Miller <[email protected]>:
>> > From: "Alexander Beregalov" <[email protected]>
>> > Date: Mon, 7 Jul 2008 19:59:04 +0400
>> >
>> >> I have turned off LOCKDEP and it boots properly.
>> >> 2.6.26-rc9-00005-g1b40a89
>> >>
>> >> Mikael's config also does not contain LOCKDEP.
>> >
>> > I have finally reproduced the problem locally and figured out the
>> > bug.
>> >
>> > Please try this patch:
>> >
>> Thanks David, but 2.6.27-rc2-00166-gaeee90d hangs in the same way.
>
> That patch was for you to add on top of whatever tree you
> have handy. Did you apply the patch?
>
> That patch will fix all trees.
>
Yes, I applied it manually on top of 2.6.27-rc2-0166

$ git diff
diff --git a/arch/sparc64/kernel/stacktrace.c b/arch/sparc64/kernel/stacktrace.c
index b3e3737..c22a131 100644
--- a/arch/sparc64/kernel/stacktrace.c
+++ b/arch/sparc64/kernel/stacktrace.c
@@ -26,13 +26,15 @@ void save_stack_trace(struct stack_trace *trace)

/* Bogus frame pointer? */
if (fp < (thread_base + sizeof(struct thread_info)) ||
- fp >= (thread_base + THREAD_SIZE))
+ fp > (thread_base + THREAD_SIZE - sizeof(struct sparc_stackf)))
break;

sf = (struct sparc_stackf *) fp;
regs = (struct pt_regs *) (sf + 1);

- if ((regs->magic & ~0x1ff) == PT_REGS_MAGIC) {
+ if (((unsigned long)regs <=
+ (thread_base + THREAD_SIZE - sizeof(*regs))) &&
+ (regs->magic & ~0x1ff) == PT_REGS_MAGIC) {
if (!(regs->tstate & TSTATE_PRIV))
break;
pc = regs->tpc;

2008-08-08 10:38:21

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Fri, 8 Aug 2008 14:14:00 +0400

> 2008/8/8 David Miller <[email protected]>:
> > From: "Alexander Beregalov" <[email protected]>
> > Date: Fri, 8 Aug 2008 13:31:40 +0400
> >
> >> 2008/8/8 David Miller <[email protected]>:
> >> > From: "Alexander Beregalov" <[email protected]>
> >> > Date: Mon, 7 Jul 2008 19:59:04 +0400
> >> >
> >> >> I have turned off LOCKDEP and it boots properly.
> >> >> 2.6.26-rc9-00005-g1b40a89
> >> >>
> >> >> Mikael's config also does not contain LOCKDEP.
> >> >
> >> > I have finally reproduced the problem locally and figured out the
> >> > bug.
> >> >
> >> > Please try this patch:
> >> >
> >> Thanks David, but 2.6.27-rc2-00166-gaeee90d hangs in the same way.
> >
> > That patch was for you to add on top of whatever tree you
> > have handy. Did you apply the patch?
> >
> > That patch will fix all trees.
> >
> Yes, I applied it manually on top of 2.6.27-rc2-0166

And then the problem goes away right?

2008-08-08 10:56:46

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/8/8 David Miller <[email protected]>:
>> Yes, I applied it manually on top of 2.6.27-rc2-0166
>
> And then the problem goes away right?
>
No, it hangs in the same way, right after
console handover: boot [earlyprom0] -> real [tty0]

2008-08-08 11:18:23

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Fri, 8 Aug 2008 14:56:35 +0400

> 2008/8/8 David Miller <[email protected]>:
> >> Yes, I applied it manually on top of 2.6.27-rc2-0166
> >
> > And then the problem goes away right?
> >
> No, It hangs in the same way, right after
> console handover: boot [earlyprom0] -> real [tty0]

Please edit arch/sparc64/kernel/setup.c, where it says:

static struct console prom_early_console = {
.name = "earlyprom",
.write = prom_console_write,
.flags = CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME,
.index = -1,
};

and remove "CON_BOOT |".

This will allow you to see the crash message.
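
Why dropping that flag keeps the early console alive can be modelled roughly as below. This is a simplified user-space model and an assumption about the handover logic, not the kernel's actual registration code: the flag values are illustrative, and `register_console_model` and the `registered` field are hypothetical. The idea is that a console marked as a boot console is remembered and then unregistered once a real console arrives, whereas an unmarked one is left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values are illustrative, not copied from the kernel headers. */
#define CON_PRINTBUFFER	0x01
#define CON_CONSDEV	0x02
#define CON_BOOT	0x08
#define CON_ANYTIME	0x10

struct console {
	const char *name;
	int flags;
	int registered;		/* 1 while the console still receives output */
};

static struct console *bootconsole;	/* remembered early console, if any */

/*
 * Simplified model of registration: a CON_BOOT console is remembered and
 * dropped when a real console (modelled here as CON_CONSDEV) shows up.
 */
void register_console_model(struct console *con)
{
	assert(con != NULL);
	con->registered = 1;

	if (con->flags & CON_BOOT) {
		bootconsole = con;
		return;
	}
	if (bootconsole && (con->flags & CON_CONSDEV)) {
		/* corresponds to "console handover: boot [...] -> real [...]" */
		bootconsole->registered = 0;
		bootconsole = NULL;
	}
}
```

In this model an early console registered with `CON_BOOT` goes silent at the handover, while the same console registered without it keeps printing alongside the real one, so messages emitted before the framebuffer is ready remain visible.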

Please also double check that you patched the kernel with my
fix correctly. I used your exact config, on the exact same
kind of system, reproducing the exact same hang, and it goes
away with my fix.

2008-08-08 11:53:09

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/8/8 David Miller <[email protected]>:
> This will allow you to see the crash message.
Yes, I saw it.
There were a few WARNINGs at lib/list_debug.c:__list_add.
Those messages scrolled by fast; I cannot see them now.
Now I see call trace:
__free_pages_ok
__free_pages
__free_pages_bootmem
free_all_bootmem_core
free_all_bootmem
mem_init
start_kernel
tlb_fixup_done

Can it be helpful?

I tried to connect a console cable to it, but nothing was there. I found
that I should disconnect the keyboard and VGA to be able to see it. I will
try it later, perhaps not today.

>
> Please also double check that you patched the kernel with my
> fix correctly. I used your exact config, on the exact same
> kind of system, reproducing the exact same hang, and it goes
> away with my fix.
>

2008-08-08 14:28:39

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/8/8 David Miller <[email protected]>:
> Please also double check that you patched the kernel with my
> fix correctly. I used your exact config, on the exact same
> kind of system, reproducing the exact same hang, and it goes
> away with my fix.
>
It could be a different compiler version,
but I tried 4.3.1 and 4.1.2 with the same result.

After I removed CON_BOOT, the warnings appear after
Locking API testsuite:
<..>
Good, all 218 testcases passed!
Dentry cache hash table entries: 131072 (order: 7, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 6, 524288 bytes)
=== here ===
Memory: 1019600k available (2672k kernel code, 1264k data, 128k init)
[fffff80000000000,000000003ff46000]
SLUB: Genslabs=13, HWalign=32, Order=0-2, MinObjects=8, CPUs=1, Nodes=1
Calibrating delay using timer specific routine.. 884.33 BogoMIPS (lpj=4421694)

Is it possible that lockdep messages did not appear when CON_BOOT was
in the flags?

Hope it will help.

2008-08-08 23:17:38

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Fri, 8 Aug 2008 15:52:53 +0400

> Now I see call trace:
> __free_pages_ok
> __free_pages
> __free_pages_bootmem
> free_all_bootmem_core
> free_all_bootmem
> mem_init
> start_kernel
> tlb_fixup_done
>
> Can it be helpful?

This is probably a different bug than the one I fixed, we'll have
to analyze this somehow.

I'll cook up a patch that will let you see the crash without it
scrolling off the screen.

2008-08-14 03:53:43

by David Miller

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

From: "Alexander Beregalov" <[email protected]>
Date: Fri, 8 Aug 2008 15:52:53 +0400

> 2008/8/8 David Miller <[email protected]>:
> > This will allow you to see the crash message.
> Yes, I saw it.
> There were a few WARNINGs at lib/list_debug.c:__list_add.
> Those messages scrolled by fast; I cannot see them now.
> Now I see call trace:
> __free_pages_ok
> __free_pages
> __free_pages_bootmem
> free_all_bootmem_core
> free_all_bootmem
> mem_init
> start_kernel
> tlb_fixup_done
>
> Can it be helpful?

Mikulas Patocka is seeing the same bug (see thread "Re: console
handover badness"). I just posted the following patch there that can
help track this down.

Please try it out on your machine too.

BTW, how much ram is in your system?

Thanks.

diff --git a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
index 217de3e..26b018f 100644
--- a/arch/sparc64/mm/init.c
+++ b/arch/sparc64/mm/init.c
@@ -1643,6 +1643,8 @@ void __init setup_per_cpu_areas(void)
{
}

+extern void sparse_validate_usemap(const char *file, int line);
+
void __init paging_init(void)
{
unsigned long end_pfn, shift, phys_base;
@@ -1788,7 +1790,9 @@ void __init paging_init(void)
#ifndef CONFIG_NEED_MULTIPLE_NODES
max_mapnr = last_valid_pfn;
#endif
+ sparse_validate_usemap(__FILE__, __LINE__);
kernel_physical_mapping_init();
+ sparse_validate_usemap(__FILE__, __LINE__);

{
unsigned long max_zone_pfns[MAX_NR_ZONES];
@@ -1798,12 +1802,15 @@ void __init paging_init(void)
max_zone_pfns[ZONE_NORMAL] = end_pfn;

free_area_init_nodes(max_zone_pfns);
+ sparse_validate_usemap(__FILE__, __LINE__);
}

printk("Booting Linux...\n");

central_probe();
+ sparse_validate_usemap(__FILE__, __LINE__);
cpu_probe();
+ sparse_validate_usemap(__FILE__, __LINE__);
}

int __init page_in_phys_avail(unsigned long paddr)
diff --git a/init/main.c b/init/main.c
index 0bc7e16..80771f5 100644
--- a/init/main.c
+++ b/init/main.c
@@ -536,6 +536,8 @@ void __init __weak thread_info_cache_init(void)
{
}

+extern void sparse_validate_usemap(const char *file, int line);
+
asmlinkage void __init start_kernel(void)
{
char * command_line;
@@ -567,12 +569,19 @@ asmlinkage void __init start_kernel(void)
printk(KERN_NOTICE);
printk(linux_banner);
setup_arch(&command_line);
+ sparse_validate_usemap(__FILE__, __LINE__);
mm_init_owner(&init_mm, &init_task);
+ sparse_validate_usemap(__FILE__, __LINE__);
setup_command_line(command_line);
+ sparse_validate_usemap(__FILE__, __LINE__);
unwind_setup();
+ sparse_validate_usemap(__FILE__, __LINE__);
setup_per_cpu_areas();
+ sparse_validate_usemap(__FILE__, __LINE__);
setup_nr_cpu_ids();
+ sparse_validate_usemap(__FILE__, __LINE__);
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
+ sparse_validate_usemap(__FILE__, __LINE__);

/*
* Set up the scheduler prior starting any interrupts (such as the
@@ -580,35 +589,52 @@ asmlinkage void __init start_kernel(void)
* time - but meanwhile we still have a functioning scheduler.
*/
sched_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
/*
* Disable preemption - early bootup scheduling is extremely
* fragile until we cpu_idle() for the first time.
*/
preempt_disable();
build_all_zonelists();
+ sparse_validate_usemap(__FILE__, __LINE__);
page_alloc_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
parse_early_param();
+ sparse_validate_usemap(__FILE__, __LINE__);
parse_args("Booting kernel", static_command_line, __start___param,
__stop___param - __start___param,
&unknown_bootoption);
+ sparse_validate_usemap(__FILE__, __LINE__);
if (!irqs_disabled()) {
printk(KERN_WARNING "start_kernel(): bug: interrupts were "
"enabled *very* early, fixing it\n");
local_irq_disable();
}
sort_main_extable();
+ sparse_validate_usemap(__FILE__, __LINE__);
trap_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
rcu_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
init_IRQ();
+ sparse_validate_usemap(__FILE__, __LINE__);
pidhash_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
init_timers();
+ sparse_validate_usemap(__FILE__, __LINE__);
hrtimers_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
softirq_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
timekeeping_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
time_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
sched_clock_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
profile_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
if (!irqs_disabled())
printk("start_kernel(): bug: interrupts were enabled early\n");
early_boot_irqs_on();
@@ -620,10 +646,12 @@ asmlinkage void __init start_kernel(void)
* this. But we do want output early, in case something goes wrong.
*/
console_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
if (panic_later)
panic(panic_later, panic_param);

lockdep_info();
+ sparse_validate_usemap(__FILE__, __LINE__);

/*
* Need to run this when irqs are enabled, because it wants
@@ -631,6 +659,7 @@ asmlinkage void __init start_kernel(void)
* too:
*/
locking_selftest();
+ sparse_validate_usemap(__FILE__, __LINE__);

#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
@@ -643,7 +672,9 @@ asmlinkage void __init start_kernel(void)
}
#endif
vfs_caches_init_early();
+ sparse_validate_usemap(__FILE__, __LINE__);
cpuset_init_early();
+ sparse_validate_usemap(__FILE__, __LINE__);
mem_init();
enable_debug_pagealloc();
cpu_hotplug_init();
diff --git a/mm/sparse.c b/mm/sparse.c
index 5d9dbbb..116559c 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -262,6 +262,52 @@ unsigned long usemap_size(void)
return size_bytes;
}

+#if 1
+static int check_one_blockval(unsigned long *bitmap, unsigned long off, unsigned long nbits)
+{
+ unsigned long i, value = 1, flags = 0;
+
+ for (i = 0; i < nbits; i++, value <<= 1)
+ if (test_bit(off + i, bitmap))
+ flags |= value;
+
+ if (flags >= MIGRATE_TYPES) {
+ printk(KERN_ERR "BUG: Bogus migrate type %lu\n", flags);
+ return 1;
+ }
+ return 0;
+}
+
+void sparse_validate_usemap(const char *file, int line)
+{
+ void *caller = __builtin_return_address(0);
+ unsigned long size = usemap_size();
+ unsigned long pnum;
+ static int reported = 0;
+
+ if (reported)
+ return;
+
+ for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+ unsigned long *bitmap;
+ unsigned long off;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ bitmap = ms->pageblock_flags;
+ for (off = 0; off < size; off += 3) {
+ if (check_one_blockval(bitmap, off, 3)) {
+ printk(KERN_ERR "BUG: Usemap for section %lu corrupted at %pS[%s:%d]\n",
+ pnum, caller, file, line);
+ reported = 1;
+ break;
+ }
+ }
+ }
+}
+#endif
#ifdef CONFIG_MEMORY_HOTPLUG
static unsigned long *__kmalloc_section_usemap(void)
{
@@ -445,10 +491,16 @@ void __init sparse_init(void)
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
usemap);
}
+#if 1
+ sparse_validate_usemap(__FILE__, __LINE__);
+#endif

vmemmap_populate_print_last();

free_bootmem(__pa(usemap_map), size);
+#if 1
+ sparse_validate_usemap(__FILE__, __LINE__);
+#endif
}

#ifdef CONFIG_MEMORY_HOTPLUG
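
The check performed by check_one_blockval() in the patch above amounts to reading fixed-width bit fields out of a bitmap and rejecting values that are not valid migrate types. A minimal user-space sketch follows. It assumes three bits per block (as in the patch) and five valid types, which is an assumption about this kernel version; `test_bit_model`, `block_value` and `first_bogus_block` are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NR_PAGEBLOCK_BITS 3	/* bits per block, the "3" used in the patch */
#define MIGRATE_TYPES	  5	/* assumed number of valid migrate types */

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* User-space equivalent of testing one bit in an unsigned long bitmap. */
static int test_bit_model(const unsigned long *bitmap, size_t nr)
{
	return (int)((bitmap[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}

/* Assemble nbits bits starting at off, least significant bit first. */
unsigned long block_value(const unsigned long *bitmap, size_t off,
			  size_t nbits)
{
	unsigned long value = 1, flags = 0;
	size_t i;

	assert(nbits < BITS_PER_WORD);
	for (i = 0; i < nbits; i++, value <<= 1)
		if (test_bit_model(bitmap, off + i))
			flags |= value;
	return flags;
}

/* Index of the first block holding an invalid type, or -1 if all are valid. */
long first_bogus_block(const unsigned long *bitmap, size_t nblocks)
{
	size_t b;

	for (b = 0; b < nblocks; b++)
		if (block_value(bitmap, b * NR_PAGEBLOCK_BITS,
				NR_PAGEBLOCK_BITS) >= MIGRATE_TYPES)
			return (long)b;
	return -1;
}
```

Under these assumptions a field with its two upper bits set decodes to 6, which is outside the valid range and would be flagged as a bogus migrate type.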

2008-08-14 10:20:16

by Alexander Beregalov

Subject: Re: 2.6.26-rc: SPARC: Sun Ultra 10 can not boot

2008/8/14 David Miller <[email protected]>:
>> __free_pages_ok
>> __free_pages
>> __free_pages_bootmem
>> free_all_bootmem_core
>> free_all_bootmem
>> mem_init
>> start_kernel
>> tlb_fixup_done
>>
>> Can it be helpful?
>
> Mikulas Patocka is seeing the same bug (see thread "Re: console
> handover badness") I just posted the following patch there that can
> help track this down.
>
> Please try it out on your machine too.

Bogus migrate type 6
Usemap for section 0 corrupted
paging_init+0xcac/0xd38[arch/sparc64/mm/init.c:1795]

1790 #ifndef CONFIG_NEED_MULTIPLE_NODES
1791 max_mapnr = last_valid_pfn;
1792 #endif
1793 sparse_validate_usemap(__FILE__, __LINE__);
1794 kernel_physical_mapping_init();
1795 sparse_validate_usemap(__FILE__, __LINE__);
1796
1797 {
1798 unsigned long max_zone_pfns[MAX_NR_ZONES];
1799
1800 memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
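
Since the checker reports only its first failure, and the report comes from the checkpoint at line 1795 while the one at line 1793 stayed silent, this suggests the usemap first looks corrupted after kernel_physical_mapping_init() has run. The general technique (validate an invariant between consecutive steps and note the first failing checkpoint) can be sketched in user space like this; all names here (`first_bad_step`, `step_harmless`, `step_corrupting`, `state_valid`) are hypothetical and the toy invariant is only for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*step_fn)(unsigned long *state);
typedef int (*check_fn)(const unsigned long *state);

/*
 * Run the steps in order, checking the invariant before the first step and
 * after each one. Returns the index of the step after which the check first
 * failed, -1 if the state was already bad on entry, or nsteps if the
 * invariant held throughout.
 */
long first_bad_step(const step_fn *steps, size_t nsteps,
		    unsigned long *state, check_fn ok)
{
	size_t i;

	assert(steps != NULL && state != NULL && ok != NULL);
	if (!ok(state))
		return -1;
	for (i = 0; i < nsteps; i++) {
		steps[i](state);
		if (!ok(state))
			return (long)i;	/* step i broke the invariant */
	}
	return (long)nsteps;
}

/* Toy steps and invariant for demonstration. */
void step_harmless(unsigned long *state) { (void)state; }
void step_corrupting(unsigned long *state) { *state = 6; }
int state_valid(const unsigned long *state) { return *state < 5; }
```

The cost of this approach is one check per step, which is why the debug patch can afford to call its validator after nearly every line of start_kernel(): the first report brackets the culprit between two adjacent checkpoints.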

>
> BTW, how much ram is in your system?
Ultra 10, 1024 MB

Thanks David