2005-09-14 04:46:41

by Simon Horman [Horms]

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

Hi Marc,

would it be possible to test linux-image-2.6.12-1-686-smp from
unstable to see if this problem persists? I am CCing the NFS
maintainer and LKML as this looks reasonably nasty and they
may be interested in looking into it.

--
Horms

On Tue, Sep 13, 2005 at 03:47:07PM -0400, Marc Horowitz wrote:
> Package: kernel-image-2.6.11-1-686-smp
> Version: 2.6.11-7
> Severity: important
>
> cvs D 00000008 0 6344 3830 6345 6162 (NOTLB)
> db6bdd18 00000086 db6bdd08 00000008 00000002 e203cba4 c02edb60 00000001
> 00000001 00000001 000001d2 c013fe43 c02f33a0 c02f2b80 c1805fa0 00000000
> 00000000 b3f24580 000f9fc6 00000000 e3f4f540 e3f4f694 c02f33a0 00000002
> Call Trace:
> [<c013fe43>] __alloc_pages+0x2e3/0x420
> [<c02ac668>] io_schedule+0x28/0x40
> [<c013a6c5>] sync_page+0x45/0x60
> [<c02ac9bf>] __wait_on_bit_lock+0x5f/0x70
> [<c013a680>] sync_page+0x0/0x60
> [<c0131f10>] wake_bit_function+0x0/0x60
> [<c013af51>] __lock_page+0x91/0xa0
> [<c0131f10>] wake_bit_function+0x0/0x60
> [<c014286d>] page_cache_readahead+0x24d/0x2d0
> [<c0131f10>] wake_bit_function+0x0/0x60
> [<c013af9d>] find_get_page+0x3d/0x50
> [<c013b897>] do_generic_mapping_read+0x517/0x630
> [<c013bcb2>] __generic_file_aio_read+0x212/0x250
> [<c013b9b0>] file_read_actor+0x0/0xf0
> [<c013bd4b>] generic_file_aio_read+0x5b/0x80
> [<f8cd1920>] nfs_file_read+0xa0/0xf0 [nfs]
> [<c015ada7>] do_sync_read+0xb7/0xf0
> [<c014d131>] vma_merge+0xd1/0x1d0
> [<c014d7e1>] do_mmap_pgoff+0x461/0x790
> [<c0131eb0>] autoremove_wake_function+0x0/0x60
> [<c015aec5>] vfs_read+0xe5/0x160
> [<c015b1e1>] sys_read+0x51/0x80
> [<c0103123>] syscall_call+0x7/0xb
>
> I was doing a "cvs add" on a working directory in NFS, and the process
> got stuck here. I don't know how to tell what file it was accessing.
>
> I have seen this happen twice with this kernel in the past month, but
> I don't know how to reliably reproduce it.
>
> -- System Information:
> Debian Release: 3.1
> APT prefers testing
> APT policy: (990, 'testing'), (500, 'unstable')
> Architecture: i386 (i686)
> Kernel: Linux 2.6.11-1-686-smp
> Locale: LANG=en_US.ISO8859-1, LC_CTYPE=en_US.ISO8859-1 (charmap=ISO-8859-1)
>
> Versions of packages kernel-image-2.6.11-1-686-smp depends on:
> ii coreutils [fileutils] 5.2.1-2 The GNU core utilities
> ii initrd-tools 0.1.77 tools to create initrd image for p
> ii module-init-tools 3.2-pre1-2 tools for managing Linux kernel mo
>
> -- no debconf information


2005-09-14 23:59:17

by Trond Myklebust

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

On Wed, 14.09.2005 at 11:51 (+0900), Horms wrote:
> Hi Marc,
>
> would it be possible to test linux-image-2.6.12-1-686-smp from
> unstable to see if this problem persists? I am CCing the NFS
> maintainer and LKML as this looks reasonably nasty and they
> may be interested in looking into it.
>

I doubt this has anything to do with NFS. We should no longer have a
sync_page VFS method in the 2.6 kernels. What other filesystems is the
user running?

Cheers,
Trond

2005-09-15 01:10:19

by Marc Horowitz

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

Trond Myklebust writes:

>> On Wed, 14.09.2005 at 11:51 (+0900), Horms wrote:
>> > Hi Marc,
>> >
>> > would it be possible to test linux-image-2.6.12-1-686-smp from
>> > unstable to see if this problem persists? I am CCing the NFS
>> > maintainer and LKML as this looks reasonably nasty and they
>> > may be interested in looking into it.
>> >
>>
>> I doubt this has anything to do with NFS. We should no longer have a
>> sync_page VFS method in the 2.6 kernels. What other filesystems is the
>> user running?

In the stack trace I sent, from a running 2.6.11 kernel, vfs_read
appears to be the vfs method, not sync_page. sync_page is called much
deeper in the stack trace.

I haven't had a chance to try a 2.6.12 kernel, but I should be able to
this week.

Marc

2005-09-15 08:33:11

by Trond Myklebust

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

On Wed, 14.09.2005 at 21:10 (-0400), Marc Horowitz wrote:
> Trond Myklebust writes:
>
> >> On Wed, 14.09.2005 at 11:51 (+0900), Horms wrote:
> >> > Hi Marc,
> >> >
> >> > would it be possible to test linux-image-2.6.12-1-686-smp from
> >> > unstable to see if this problem persists? I am CCing the NFS
> >> > maintainer and LKML as this looks reasonably nasty and they
> >> > may be interested in looking into it.
> >> >
> >>
> >> I doubt this has anything to do with NFS. We should no longer have a
> >> sync_page VFS method in the 2.6 kernels. What other filesystems is the
> >> user running?
>
> In the stack trace I sent, from a running 2.6.11 kernel, vfs_read
> appears to be the vfs method, not sync_page. sync_page is called much
> deeper in the stack trace.

So? It is clearly the call to sync_page that is Oopsing.

The NFS call is just trying to lock a page that appears to be owned by
someone else. That triggers a call to that filesystem's sync_page, which
then goes on to do a page allocation, which again Oopses.

Cheers,
Trond

2005-09-15 09:30:53

by Simon Horman [Horms]

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

On Thu, Sep 15, 2005 at 09:32:47AM +0100, Trond Myklebust wrote:
> On Wed, 14.09.2005 at 21:10 (-0400), Marc Horowitz wrote:
> > Trond Myklebust writes:
> >
> > >> On Wed, 14.09.2005 at 11:51 (+0900), Horms wrote:
> > >> > Hi Marc,
> > >> >
> > >> > would it be possible to test linux-image-2.6.12-1-686-smp from
> > >> > unstable to see if this problem persists? I am CCing the NFS
> > >> > maintainer and LKML as this looks reasonably nasty and they
> > >> > may be interested in looking into it.
> > >> >
> > >>
> > >> I doubt this has anything to do with NFS. We should no longer have a
> > >> sync_page VFS method in the 2.6 kernels. What other filesystems is the
> > >> user running?
> >
> > In the stack trace I sent, from a running 2.6.11 kernel, vfs_read
> > appears to be the vfs method, not sync_page. sync_page is called much
> > deeper in the stack trace.
>
> So? It is clearly the call to sync_page that is Oopsing.
>
> The NFS call is just trying to lock a page that appears to be owned by
> someone else. That triggers a call to that filesystem's sync_page, which
> then goes on to do a page allocation, which again Oopses.

I take it from your initial remarks that the use of sync_page()
in the VFS has changed recently. In any case, it would be worth
testing 2.6.12 or 2.6.13 before investigating any further, as in
your opinion the problem is not NFS related, but related to
something that NFS coincidentally triggers (and that could just as
easily be triggered by anything else).

--
Horms

2005-09-15 09:50:57

by Trond Myklebust

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

On Thu, 15.09.2005 at 18:22 (+0900), Horms wrote:
> I take it from your initial remarks that the use of sync_page()
> in the VFS has changed recently. In any case, it would be worth
> testing 2.6.12 or 2.6.13 before investigating any further, as in
> your opinion the problem is not NFS related, but related to
> something that NFS coincidentally triggers (and that could just as
> easily be triggered by anything else).

Right. What I'm saying is that NFS has no special hooks inside
lock_page(), so this is 100% generic VFS code.

Cheers,
Trond

2005-09-15 14:29:47

by Marc Horowitz

Subject: Re: Bug#328135: kernel-image-2.6.11-1-686-smp: nfs reading process stuck in disk wait

Trond Myklebust writes:

>> On Wed, 14.09.2005 at 21:10 (-0400), Marc Horowitz wrote:
>> > Trond Myklebust writes:
>> >
>> > >> On Wed, 14.09.2005 at 11:51 (+0900), Horms wrote:
>> > >> I doubt this has anything to do with NFS. We should no longer have a
>> > >> sync_page VFS method in the 2.6 kernels. What other filesystems is the
>> > >> user running?
>> >
>> > In the stack trace I sent, from a running 2.6.11 kernel, vfs_read
>> > appears to be the vfs method, not sync_page. sync_page is called much
>> > deeper in the stack trace.
>>
>> So? It is clearly the call to sync_page that is Oopsing.
>>
>> The NFS call is just trying to lock a page that appears to be owned by
>> someone else. That triggers a call to that filesystem's sync_page, which
>> then goes on to do a page allocation, which again Oopses.

Ah, I understand now. I misinterpreted what you said to mean you
didn't expect to see a sync_page call at all.

That said, I'd like to clarify one thing: there is no oops in the
dmesg output. That stack trace comes from dmesg after I do
"echo t > /proc/sysrq-trigger".
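For anyone reading along: a task dump like the one in the original report can be split into frames mechanically, which makes it easier to see which entries are real call sites. What follows is only a rough Python sketch, assuming the frame format visible in the report ("[<addr>] symbol+0xoff/0xsize [module]"); the names FRAME_RE, parse_frames and is_function_start are made up for illustration and are not part of any kernel tool. One thing it can flag with certainty: an entry with offset 0x0 points at the first instruction of a function, so it cannot be a return address and is more likely a function pointer that happened to be on the stack.

```python
import re

# One frame of the trace format seen in the report, for example:
#   [<f8cd1920>] nfs_file_read+0xa0/0xf0 [nfs]
# The trailing "[module]" part is optional (built-in symbols have none).
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frames(text):
    """Return one dict per trace frame found in text, in order."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # header lines, register dumps, quoted prose, ...
        frames.append({
            "addr": int(m.group("addr"), 16),
            "symbol": m.group("symbol"),
            "offset": int(m.group("offset"), 16),
            "size": int(m.group("size"), 16),
            "module": m.group("module"),  # None for built-in symbols
        })
    return frames


def is_function_start(frame):
    """True if the address is the very start of the function.

    A return address always lies after a call instruction, so an
    offset of 0 means this entry is not a return address.
    """
    return frame["offset"] == 0


# A few lines copied from the report, including the mail quote prefix.
SAMPLE = """\
> [<c013a6c5>] sync_page+0x45/0x60
> [<c013a680>] sync_page+0x0/0x60
> [<f8cd1920>] nfs_file_read+0xa0/0xf0 [nfs]
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        note = " (function start, not a return address)" if is_function_start(frame) else ""
        print(f'{frame["symbol"]}+{frame["offset"]:#x}{note}')
```

The parser uses search() rather than match(), so it tolerates "> " quote prefixes and dmesg timestamps in front of each frame.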

I'll give the 2.6.12 kernel a try today or tomorrow, and see what
happens.

Marc
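
On the question raised in the original report (how to tell what file the stuck process was accessing), one possible starting point is /proc. The Python sketch below is an illustration only, assuming a Linux /proc layout; the helper names parse_stat_state, blocked_processes and open_files are hypothetical. It lists processes in state "D" (uninterruptible sleep) and the targets of their open file descriptors. It cannot say which descriptor the blocked read() is on, only which files are open, and reading another user's /proc/<pid>/fd normally requires root.

```python
import os


def parse_stat_state(stat_text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat.

    The format is "pid (comm) state ...". comm may itself contain
    spaces or parentheses, so split at the last ')'.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def blocked_processes(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    result = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return result  # no /proc here (non-Linux system)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or is unreadable
        if state == "D":
            result.append((pid, comm))
    return result


def open_files(pid, proc_root="/proc"):
    """Return sorted (fd, target) pairs for the open descriptors of pid."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    try:
        fds = os.listdir(fd_dir)
    except OSError:
        return []  # permission denied or process gone
    pairs = []
    for fd in fds:
        try:
            pairs.append((int(fd), os.readlink(os.path.join(fd_dir, fd))))
        except (OSError, ValueError):
            continue
    return sorted(pairs)
```

Calling blocked_processes() and then open_files(pid) for each result gives a list of candidate files; the ones that live on the NFS mount are the ones worth a closer look.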