2008-06-04 14:27:31

by Dave Jones

Subject: Re: NFS oops in 2.6.26rc4

On Fri, May 30, 2008 at 03:37:01PM -0400, Chuck Lever wrote:

> > Something else of note which I hadn't seen before, usually things lock
> > up just after that first oops. For some reason, today it survived
> > a little longer, but things really went downhill fast.
> > It survived a 'dmesg ; scp dmesg davej@gelk', and then wedged solid.
> > So as well as the oops, it seems we're corrupting memory too.
> > For reference, this kernel has both SLUB_DEBUG and PAGEALLOC_DEBUG
> > enabled.
>
> I haven't seen this kind of problem here with .26, but yes, it does
> look like something is clobbering memory during an NFS mount.
>
> I introduced some NFS mount parsing changes in this commit range:
>
> 2d767432..82d101d5
>
> A quick bisect should show which, if any of these, is the guilty
> party. If any of these are the problem, I suspect it's 3f8400d1.

I didn't get time to try this out yet (hopefully tomorrow).
In the meantime, we've just gotten word of another user seeing memory
corruption with nfs - https://bugzilla.redhat.com/show_bug.cgi?id=449958

Dave

--
http://www.codemonkey.org.uk


2008-06-04 18:14:42

by Chuck Lever III

Subject: Re: NFS oops in 2.6.26rc4


On Jun 4, 2008, at 10:19 AM, Dave Jones wrote:

> On Fri, May 30, 2008 at 03:37:01PM -0400, Chuck Lever wrote:
>
>>> Something else of note which I hadn't seen before, usually things
>>> lock
>>> up just after that first oops. For some reason, today it survived
>>> a little longer, but things really went downhill fast.
>>> It survived a 'dmesg ; scp dmesg davej@gelk', and then wedged solid.
>>> So as well as the oops, it seems we're corrupting memory too.
>>> For reference, this kernel has both SLUB_DEBUG and PAGEALLOC_DEBUG
>>> enabled.
>>
>> I haven't seen this kind of problem here with .26, but yes, it does
>> look like something is clobbering memory during an NFS mount.
>>
>> I introduced some NFS mount parsing changes in this commit range:
>>
>> 2d767432..82d101d5
>>
>> A quick bisect should show which, if any of these, is the guilty
>> party. If any of these are the problem, I suspect it's 3f8400d1.
>
> I didn't get time to try this out yet (hopefully tomorrow).
> In the meantime, we've just gotten word of another user seeing memory
> corruption with nfs - https://bugzilla.redhat.com/show_bug.cgi?id=449958

449958 could very well be the same problem. The stack traceback is a
lot cleaner than the one you originally sent, but there are a lot of
similarities. (I doubt this is related to symlinks, as the comment
suggests).

Is commit 86d61d863 applied to the current rawhide kernel?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-06-04 18:27:17

by Dave Jones

Subject: Re: NFS oops in 2.6.26rc4

On Wed, Jun 04, 2008 at 02:13:08PM -0400, Chuck Lever wrote:
>
> On Jun 4, 2008, at 10:19 AM, Dave Jones wrote:
>
> > On Fri, May 30, 2008 at 03:37:01PM -0400, Chuck Lever wrote:
> >
> >>> Something else of note which I hadn't seen before, usually things
> >>> lock
> >>> up just after that first oops. For some reason, today it survived
> >>> a little longer, but things really went downhill fast.
> >>> It survived a 'dmesg ; scp dmesg davej@gelk', and then wedged solid.
> >>> So as well as the oops, it seems we're corrupting memory too.
> >>> For reference, this kernel has both SLUB_DEBUG and PAGEALLOC_DEBUG
> >>> enabled.
> >>
> >> I haven't seen this kind of problem here with .26, but yes, it does
> >> look like something is clobbering memory during an NFS mount.
> >>
> >> I introduced some NFS mount parsing changes in this commit range:
> >>
> >> 2d767432..82d101d5
> >>
> >> A quick bisect should show which, if any of these, is the guilty
> >> party. If any of these are the problem, I suspect it's 3f8400d1.
> >
> > I didn't get time to try this out yet (hopefully tomorrow).
> > In the meantime, we've just gotten word of another user seeing memory
> > corruption with nfs - https://bugzilla.redhat.com/show_bug.cgi?id=449958
>
> 449958 could very well be the same problem. The stack traceback is a
> lot cleaner than the one you originally sent, but there are a lot of
> similarities. (I doubt this is related to symlinks, as the comment
> suggests).
>
> Is commit 86d61d863 applied to the current rawhide kernel?

That kernel was .26rc4.git2, so unless it's only gone in in the last day
or two, yes. (Bandwidth impaired right now, and no local git repo to check)

Dave

--
http://www.codemonkey.org.uk

2008-06-04 19:13:42

by Chuck Lever III

Subject: Re: NFS oops in 2.6.26rc4

On Wed, Jun 4, 2008 at 2:20 PM, Dave Jones wrote:
> On Wed, Jun 04, 2008 at 02:13:08PM -0400, Chuck Lever wrote:
> >
> > On Jun 4, 2008, at 10:19 AM, Dave Jones wrote:
> >
> > > On Fri, May 30, 2008 at 03:37:01PM -0400, Chuck Lever wrote:
> > >
> > >>> Something else of note which I hadn't seen before, usually things
> > >>> lock
> > >>> up just after that first oops. For some reason, today it survived
> > >>> a little longer, but things really went downhill fast.
> > >>> It survived a 'dmesg ; scp dmesg davej@gelk', and then wedged solid.
> > >>> So as well as the oops, it seems we're corrupting memory too.
> > >>> For reference, this kernel has both SLUB_DEBUG and PAGEALLOC_DEBUG
> > >>> enabled.
> > >>
> > >> I haven't seen this kind of problem here with .26, but yes, it does
> > >> look like something is clobbering memory during an NFS mount.
> > >>
> > >> I introduced some NFS mount parsing changes in this commit range:
> > >>
> > >> 2d767432..82d101d5
> > >>
> > >> A quick bisect should show which, if any of these, is the guilty
> > >> party. If any of these are the problem, I suspect it's 3f8400d1.
> > >
> > > I didn't get time to try this out yet (hopefully tomorrow).
> > > In the meantime, we've just gotten word of another user seeing memory
> > > corruption with nfs - https://bugzilla.redhat.com/show_bug.cgi?id=449958
> >
> > 449958 could very well be the same problem. The stack traceback is a
> > lot cleaner than the one you originally sent, but there are a lot of
> > similarities. (I doubt this is related to symlinks, as the comment
> > suggests).
> >
> > Is commit 86d61d863 applied to the current rawhide kernel?
>
> That kernel was .26rc4.git2, so unless it's only gone in in the last day
> or two, yes. (Bandwidth impaired right now, and no local git repo to check)

Argh, I was afraid of that. I expected that commit to improve things.
Maybe it did, but this is a different problem? You're going to force
me to actually think about this. :-)

In any event, a bisect would be helpful here, when you can. I will
also stare at the traceback in 449958 and see if anything new jumps
out. It's certainly taken the heat off of the NFS client; it looks
like an rpcbind issue.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-06-23 15:40:41

by Myklebust, Trond

Subject: Re: NFS oops in 2.6.26rc4

Hi Dave,

Any chance you could give the attached patch a whirl to see if it fixes
the NFS oops you reported?

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
http://www.netapp.com


Attachments:
linux-2.6.26-001-reduce_mount_stack_usage.dif (5.06 kB)

2008-06-23 16:03:53

by Dave Jones

Subject: Re: NFS oops in 2.6.26rc4

On Mon, Jun 23, 2008 at 11:40:29AM -0400, Trond Myklebust wrote:
> Hi Dave,
>
> Any chance you could give the attached patch a whirl to see if it fixes
> the NFS oops you reported?

Yeah, I'll give it a shot, won't be until the end of the day/tomorrow though.

Dave

--
http://www.codemonkey.org.uk

2008-06-23 16:05:13

by Myklebust, Trond

Subject: Re: NFS oops in 2.6.26rc4

On Mon, 2008-06-23 at 11:55 -0400, Dave Jones wrote:
> On Mon, Jun 23, 2008 at 11:40:29AM -0400, Trond Myklebust wrote:
> > Hi Dave,
> >
> > Any chance you could give the attached patch a whirl to see if it fixes
> > the NFS oops you reported?
>
> Yeah, I'll give it a shot, won't be until the end of the day/tomorrow though.
>
> Dave

That will be great. Thanks!

--
Trond Myklebust
Linux NFS client maintainer

NetApp
http://www.netapp.com

2008-06-23 23:19:46

by Dave Jones

Subject: Re: NFS oops in 2.6.26rc4

On Mon, Jun 23, 2008 at 11:40:29AM -0400, Trond Myklebust wrote:
> Hi Dave,
>
> Any chance you could give the attached patch a whirl to see if it fixes
> the NFS oops you reported?

Seems to have done the trick for me.

Dave

--
http://www.codemonkey.org.uk

2008-06-23 23:20:10

by Myklebust, Trond

Subject: Re: NFS oops in 2.6.26rc4

On Mon, 2008-06-23 at 19:11 -0400, Dave Jones wrote:
> On Mon, Jun 23, 2008 at 11:40:29AM -0400, Trond Myklebust wrote:
> > Hi Dave,
> >
> > Any chance you could give the attached patch a whirl to see if it fixes
> > the NFS oops you reported?
>
> Seems to have done the trick for me.
>
> Dave

Thanks Dave!

--
Trond Myklebust
Linux NFS client maintainer

NetApp
http://www.netapp.com