2006-03-14 11:55:37

by Andrew Clayton

Subject: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

Hi,

With the above kernels I am seeing spontaneous system reboots. Nothing
seems to get logged anywhere and when I've been at the console I haven't
noticed any oops or anything before the machine resets.

This was first triggered by accessing a usb key drive thing, this
happened a couple of times and then this morning while investigating
some more it happened as I was exiting my X session.

The machine is an AMD Athlon(tm) 64 Processor 3500+ (Single processor,
single core), with 1GB RAM. GCC is gcc (GCC) 4.0.2 20051125 (Red Hat
4.0.2-8) from Fedora Core 4


2.6.16-rc6 is working fine.


The following change looked an obvious candidate

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c33d4568aca9028a22857f94f5e0850012b6444b

So I took a 2.6.16-rc6-git2 tree and reverted arch/x86_64/kernel/entry.S
to the one in 2.6.16-rc6 and so far (35 minutes) no problems.



Let me know if you'd like any more info.


Cheers,

Andrew



2006-03-14 15:29:55

by Hugh Dickins

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tue, 14 Mar 2006, Andrew Clayton wrote:
>
> With the above kernels I am seeing spontaneous system reboots. Nothing
> seems to get logged anywhere and when I've been at the console I haven't
> noticed any oops or anything before the machine resets.
>
> This was first triggered by accessing a usb key drive thing, this
> happened a couple of times and then this morning while investigating
> some more it happened as I was exiting my X session.
>
> The machine is an AMD Athlon(tm) 64 Processor 3500+ (Single processor,
> single core), with 1GB RAM. GCC is gcc (GCC) 4.0.2 20051125 (Red Hat
> 4.0.2-8) from Fedora Core 4
>
>
> 2.6.16-rc6 is working fine.
>
>
> The following change looked an obvious candidate
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c33d4568aca9028a22857f94f5e0850012b6444b
>
> So I took a 2.6.16-rc6-git2 tree and reverted arch/x86_64/kernel/entry.S
> to the one in 2.6.16-rc6 and so far (35 minutes) no problems.

Yep, that one's a turkey, definitely something for Linus to revert.

Seeing your report, I gave 2.6.16-rc6-git2 a try at concurrent kernel
builds on dual HT EM64T: collapsed in all kinds of weird page table
corruption or slab corruption within minutes, three boots in a row.
Backed out that patch and it's going fine for half an hour now.

Andi, if you've a replacement patch you'd like everybody to test,
please post: I for one will surely give it a try.

Hugh

2006-03-14 15:40:13

by Andi Kleen

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tuesday 14 March 2006 16:30, Hugh Dickins wrote:
> Andi, if you've a replacement patch you'd like everybody to test,
> please post: I for one will surely give it a try.

Hrm, it worked on my test machine.

But what happens when you just revert the last hunk (the stub_execve change)?

-Andi

2006-03-14 16:06:31

by Linus Torvalds

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64



On Tue, 14 Mar 2006, Hugh Dickins wrote:
>
> Yep, that one's a turkey, definitely something for Linus to revert.

Reverted. Let's get wider testing before applying an alternate fix.

Linus

2006-03-14 16:24:35

by Andrew Clayton

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tue, 2006-03-14 at 08:06 -0800, Linus Torvalds wrote:
>
> On Tue, 14 Mar 2006, Hugh Dickins wrote:
> >
> > Yep, that one's a turkey, definitely something for Linus to revert.
>
> Reverted. Let's get wider testing before applying an alternate fix.
>
> Linus


Just to note: Doing what Andi suggested seems to be working OK.

Cheers,

Andrew


2006-03-14 16:26:52

by Hugh Dickins

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tue, 14 Mar 2006, Andi Kleen wrote:
>
> But what happens when you just revert the last hunk (the stub_execve change)?

Still no good: spontaneously rebooted under load after eight minutes.

Hugh

2006-03-14 18:49:55

by Hugh Dickins

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tue, 14 Mar 2006, Andrew Clayton wrote:
> On Tue, 2006-03-14 at 08:06 -0800, Linus Torvalds wrote:
> >
> > Reverted. Let's get wider testing before applying an alternate fix.
>
> Just to note: Doing what Andi suggested seems to be working OK.

Whereas on EM64T I found the opposite,
reverting just the stub_execve hunk still behaved badly.

I've double-checked that finding since, built and ran another
kernel to confirm it. But your Athlon64 still works OK that way?

Just trying to clarify - I don't think we're in any rush to
settle it now that Linus has reverted the damage from his tree.

Hugh

2006-03-14 18:55:36

by Andrew Clayton

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tue, 2006-03-14 at 18:50 +0000, Hugh Dickins wrote:
> On Tue, 14 Mar 2006, Andrew Clayton wrote:
> > On Tue, 2006-03-14 at 08:06 -0800, Linus Torvalds wrote:
> > >
> > > Reverted. Let's get wider testing before applying an alternate fix.
> >
> > Just to note: Doing what Andi suggested seems to be working OK.
>
> Whereas on EM64T I found the opposite,
> reverting just the stub_execve hunk still behaved badly.
>
> I've double-checked that finding since, built and ran another
> kernel to confirm it. But your Athlon64 still works OK that way?

Yeah, with just the stub_execve hunk reverted, 3 hours later everything
still looks good.

> Just trying to clarify - I don't think we're in any rush to
> settle it now that Linus has reverted the damage from his tree.

Sure.

> Hugh

Andrew



2006-03-14 21:31:11

by Andrew Clayton

Subject: Re: 2.6.16-rc6-git[12] spontaneous reboots on x86_64

On Tue, 2006-03-14 at 18:50 +0000, Hugh Dickins wrote:
> On Tue, 14 Mar 2006, Andrew Clayton wrote:
> > On Tue, 2006-03-14 at 08:06 -0800, Linus Torvalds wrote:
> > >
> > > Reverted. Let's get wider testing before applying an alternate fix.
> >
> > Just to note: Doing what Andi suggested seems to be working OK.
>
> Whereas on EM64T I found the opposite,
> reverting just the stub_execve hunk still behaved badly.
>
> I've double-checked that finding since, built and ran another
> kernel to confirm it. But your Athlon64 still works OK that way?

OK, looks like I may have spoken too soon: I just found my ssh session to
it dead and the machine no longer reachable (other machines on the same
network are). I'll be able to see for sure when I get into work in the
morning.


Andrew

