2002-08-08 19:21:31

by Jeremy Sanders

Subject: RedHat Rawhide Kernels

Just to let you guys know (as it's close to the current thread), I'm
getting very bad lock-ups using the current rawhide kernel (2.4.18-7.94).
When I connect with a machine running 2.4.18-5 (standard 7.3 errata
kernel), both the client and the server get processes stuck in a "D" state
- the nfsd processes on the server and the user command on the client.
This means you can't shut the server down as the nfsd processes can't get
killed. Strangely 2.4.19 kernels can talk to the server fine!
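
For anyone wanting to check a box for the same symptom, here is a minimal sketch that lists processes in the "D" (uninterruptible sleep) state. It assumes the usual Linux /proc/<pid>/stat layout, where the state letter is the first field after the parenthesised command name. The function names are only illustrative.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start].strip())
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were looking, or the file was
            # unreadable or malformed; skip it.
            continue
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling uninterruptible_processes() on the server while the hang is in progress should show the nfsd threads; on the client it should show the user command.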

See
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=70561

Tomorrow I'm going to try getting a stack trace of the nfsd processes, as
the RH guys suggested.

Jeremy

--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2002-08-08 19:33:35

by Benjamin LaHaise

Subject: Re: RedHat Rawhide Kernels

On Thu, Aug 08, 2002 at 08:21:27PM +0100, Jeremy Sanders wrote:
> Just to let you guys know (as it's close to the current thread), I'm
> getting very bad lock-ups using the current rawhide kernel (2.4.18-7.94).
> When I connect with a machine running 2.4.18-5 (standard 7.3 errata
> kernel), both the client and the server get processes stuck in a "D" state
> - the nfsd processes on the server and the user command on the client.
> This means you can't shut the server down as the nfsd processes can't get
> killed. Strangely 2.4.19 kernels can talk to the server fine!

For the sake of spreading the knowledge around, I've also got a couple
of reports of -5 (which has an earlier set of Trond's client patches)
and vanilla 2.4.18 getting D state stuck processes in lock_page on NFS
mounts. There's no useful data point beyond that, but it looks like a
request is getting lost somewhere.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."



2002-08-08 20:16:26

by Trond Myklebust

Subject: Re: RedHat Rawhide Kernels

>>>>> " " == Benjamin LaHaise <[email protected]> writes:

> For the sake of spreading the knowledge around, I've also got a
> couple of reports of -5 (which has an earlier set of Trond's
> client patches) and vanilla 2.4.18 getting D state stuck
> processes in lock_page on NFS mounts. There's no useful data
> point beyond that, but it looks like a request is getting lost
> somewhere.

Do you know offhand which set of patches are included?

Cheers,
Trond



2002-08-08 20:21:42

by Trond Myklebust

Subject: Re: RedHat Rawhide Kernels

>>>>> " " == Jeremy Sanders <[email protected]> writes:

> Just to let you guys know (as it's close to the current
> thread), I'm getting very bad lock-ups using the current
> rawhide kernel (2.4.18-7.94). When I connect with a machine

I don't know about the server stuff, but the version of the client
patches that they appear to have included in 2.4.18-7.94 contains one
pretty nasty race in the RPC code that can cause significant
corruption.
I've already notified RH of this, and provided them with details on
how to fix it...

Cheers,
Trond



2002-08-08 20:23:30

by Benjamin LaHaise

Subject: Re: RedHat Rawhide Kernels

On Thu, Aug 08, 2002 at 10:16:18PM +0200, Trond Myklebust wrote:
> Do you know offhand which set of patches are included?

# 15xx
# NFS patches: selected bits from Trond's 2.4.19pre8 patchset
#
Patch1501: linux-2.4.19-nfs-01-pathconf.dif.txt
Patch1503: linux-2.4.19-nfs-03-noac.dif.txt
Patch1504: linux-2.4.19-nfs-04-seekdir.dif.txt
Patch1505: linux-2.4.19-nfs-05-rdplus.dif.txt
Patch1506: linux-2.4.19-nfs-06-rpc_bkl.dif.txt
Patch1507: linux-2.4.19-nfs-07-bkl2.dif.txt
Patch1508: linux-2.4.19-nfs-08-rpc_cong.dif.txt
Patch1509: linux-2.4.19-nfs-09-rpc_wspace.dif.txt
Patch1510: linux-2.4.19-nfs-10-ping.dif.txt
Patch1511: linux-2.4.19-nfs-11-rpc_tweaks.dif.txt
Patch1520: linux-2.4.19-nfs-nosvc.patch
Patch1550: linux-2.4.18-nfs-default-size.patch

nosvc just silences the annoying unknown version (0) printk, and the
default size patch reverts to a 4KB default if none is specified.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."



2002-08-08 20:32:07

by Trond Myklebust

Subject: Re: RedHat Rawhide Kernels

>>>>> " " == Benjamin LaHaise <[email protected]> writes:

> On Thu, Aug 08, 2002 at 10:16:18PM +0200, Trond Myklebust
> wrote:
>> Do you know offhand which set of patches are included?

> # 15xx NFS patches: selected bits from Trond's 2.4.19pre8
> # patchset

That's for 2.4.18-7.94 (with the known RPC race problem), but I
understood that you mentioned a problem affecting the 2.4.18-5 client
too?

Cheers,
Trond



2002-08-08 20:57:15

by Benjamin LaHaise

Subject: Re: RedHat Rawhide Kernels

On Thu, Aug 08, 2002 at 10:31:57PM +0200, Trond Myklebust wrote:
> >>>>> " " == Benjamin LaHaise <[email protected]> writes:
>
> > On Thu, Aug 08, 2002 at 10:16:18PM +0200, Trond Myklebust
> > wrote:
> >> Do you know offhand which set of patches are included?
>
> > # 15xx NFS patches: selected bits from Trond's 2.4.19pre8
> > # patchset
>
> That's for 2.4.18-7.94 (with the known RPC race problem), but I
> understood that you mentioned a problem affecting the 2.4.18-5 client
> too?

That is the list of patches applied to 2.4.18-5. Sorry for not being
clear on that.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."



2002-08-08 21:16:31

by Trond Myklebust

Subject: Re: RedHat Rawhide Kernels

>>>>> " " == Benjamin LaHaise <[email protected]> writes:

> That is the list of patches applied to 2.4.18-5. Sorry for not
> being clear on that.

Duh... My mistake I didn't read your mail clearly enough...

I know of 1 possible hang in that patchset (a hang which also affects
the standard 2.4.19 kernel): The spinlock in xprt_write_space() needs
to be converted to a bh-safe spinlock. The race should be very rare,
but definitely exists (it's been fixed BTW in the newer patchset)...

Do you have any details on the hangs in question that might help?

Cheers,
Trond



2002-08-08 21:27:34

by Benjamin LaHaise

Subject: Re: RedHat Rawhide Kernels

On Thu, Aug 08, 2002 at 11:16:17PM +0200, Trond Myklebust wrote:
> Duh... My mistake I didn't read your mail clearly enough...
>
> I know of 1 possible hang in that patchset (a hang which also affects
> the standard 2.4.19 kernel): The spinlock in xprt_write_space() needs
> to be converted to a bh-safe spinlock. The race should be very rare,
> but definitely exists (it's been fixed BTW in the newer patchset)...

Ah, interesting. I'll put that into a test rpm for the people
experiencing the problem to see if it helps.

> Do you have any details on the hangs in question that might help?

Basically, under heavy load several processes end up stuck in
lock_page being called from generic_file_read. The problem is
very hard to reproduce (at least I can't on my local machines).

-ben
--
"You will be reincarnated as a toad; and you will be much happier."



2002-08-08 21:35:17

by Trond Myklebust

Subject: Re: RedHat Rawhide Kernels

>>>>> " " == Benjamin LaHaise <[email protected]> writes:

>> Do you have any details on the hangs in question that might
>> help?

> Basically, under heavy load several processes end up stuck in
> lock_page being called from generic_file_read. The problem is
> very hard to reproduce (at least I can't on my local machines).

The only other hang I can think of concerns only HIGHMEM machines,
where you can deadlock while exhausting all free kmap() resources.

Also fixed (well - at least chances are *very* heavily reduced) in the
new 'kmap' patchsets.

Cheers,
Trond



2002-08-09 01:04:39

by Thomas Langås

Subject: Re: RedHat Rawhide Kernels

Trond Myklebust:
> The only other hang I can think of concerns only HIGHMEM machines,
> where you can deadlock while exhausting all free kmap() resources.
> Also fixed (well - at least chances are *very* heavily reduced) in the
> new 'kmap' patchsets.

Is this included in 2.4.19 or won't we see it before 2.4.20 comes along?

--
Thomas



2002-08-09 04:51:13

by Trond Myklebust

Subject: Re: RedHat Rawhide Kernels

>>>>> " " == Thomas Langås <[email protected]> writes:

> Trond Myklebust:
>> The only other hang I can think of concerns only HIGHMEM
>> machines, where you can deadlock while exhausting all free
>> kmap() resources. Also fixed (well - at least chances are
>> *very* heavily reduced) in the new 'kmap' patchsets.

> Is this included in 2.4.19 or won't we see it before 2.4.20
> comes along?

You can see it in 2.5.x or from my patchsets for 2.4.19 (in the usual
place).

Cheers,
Trond



2002-08-12 10:41:40

by Jeremy Sanders

Subject: Re: RedHat Rawhide Kernels

On Thu, Aug 08, 2002 at 08:21:27PM +0100, Jeremy Sanders wrote:
> Just to let you guys know (as it's close to the current thread), I'm
> getting very bad lock-ups using the current rawhide kernel (2.4.18-7.94).
> When I connect with a machine running 2.4.18-5 (standard 7.3 errata
> kernel), both the client and the server get processes stuck in a "D" state
> - the nfsd processes on the server and the user command on the client.
> This means you can't shut the server down as the nfsd processes can't get
> killed. Strangely 2.4.19 kernels can talk to the server fine!
>
> See
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=70561

Interestingly this bug seems to be due to ext3 (or a combination of ext3
and nfs). Remounting the partition as ext2 stops the hang.

The processes get stuck (when using ext3) in

nfsd D F68108E0 5976 1142 1 1141 1143 (L-TLB)
Call Trace: [<c0107f7a>] __down [kernel] 0x6a (0xf6af9de4))
[<c01080d4>] __down_failed [kernel] 0x8 (0xf6af9e08))
[<f8824aa0>] ext3_readdir [ext3] 0x0 (0xf6af9e10))
[<c014fdce>] .text.lock.readdir [kernel] 0x5 (0xf6af9e18))
[<f89997e3>] nfsd_readdir [nfsd] 0xc3 (0xf6af9e38))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6af9e40))
[<f8837ca0>] ext3_dir_operations [ext3] 0x0 (0xf6af9e80))
[<f899ee6e>] nfsd3_proc_readdirplus [nfsd] 0xde (0xf6af9ef0))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6af9f04))
[<f89a61c4>] nfsd_procedures3 [nfsd] 0x264 (0xf6af9f24))
[<f89935c0>] nfsd_dispatch [nfsd] 0xd0 (0xf6af9f30))
[<f89a5898>] nfsd_version3 [nfsd] 0x0 (0xf6af9f44))
[<f89754cc>] svc_process_R6eda96b1 [sunrpc] 0x43c (0xf6af9f50))
[<f89a61c4>] nfsd_procedures3 [nfsd] 0x264 (0xf6af9f78))
[<f89a58b8>] nfsd_program [nfsd] 0x0 (0xf6af9f7c))
[<f89933b0>] nfsd [nfsd] 0x1d0 (0xf6af9f98))
[<c010765e>] kernel_thread [kernel] 0x2e (0xf6af9ff0))
[<f89931e0>] nfsd [nfsd] 0x0 (0xf6af9ff8))

nfsd D F68108E0 5592 1143 1 1142 1144 (L-TLB)
Call Trace: [<c0107f7a>] __down [kernel] 0x6a (0xf6aefc24))
[<c01080d4>] __down_failed [kernel] 0x8 (0xf6aefc48))
[<f8837e20>] ext3_dir_inode_operations [ext3] 0x0 (0xf6aefc50))
[<f8999fab>] .text.lock.vfs [nfsd] 0xaf (0xf6aefc58))
[<c01af97c>] ide_wait_stat [kernel] 0xcc (0xf6aefc68))
[<f89a1a20>] encode_fattr3 [nfsd] 0x200 (0xf6aefc90))
[<f89a0f6b>] encode_entry [nfsd] 0x1ab (0xf6aefcb4))
[<f8824d02>] ext3_readdir [ext3] 0x262 (0xf6aefd70))
[<c0223760>] udp_getfrag [kernel] 0x0 (0xf6aefdcc))
[<c014f6b2>] vfs_readdir [kernel] 0x92 (0xf6aefe18))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6aefe24))
[<f89997e3>] nfsd_readdir [nfsd] 0xc3 (0xf6aefe38))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6aefe40))
[<f8837ca0>] ext3_dir_operations [ext3] 0x0 (0xf6aefe80))
[<f899ee6e>] nfsd3_proc_readdirplus [nfsd] 0xde (0xf6aefef0))
[<f89a1070>] nfs3svc_encode_entry_plus [nfsd] 0x0 (0xf6aeff04))
[<f89a61c4>] nfsd_procedures3 [nfsd] 0x264 (0xf6aeff24))
[<f89935c0>] nfsd_dispatch [nfsd] 0xd0 (0xf6aeff30))
[<f89a5898>] nfsd_version3 [nfsd] 0x0 (0xf6aeff44))
[<f89754cc>] svc_process_R6eda96b1 [sunrpc] 0x43c (0xf6aeff50))
[<f89a61c4>] nfsd_procedures3 [nfsd] 0x264 (0xf6aeff78))
[<f89a58b8>] nfsd_program [nfsd] 0x0 (0xf6aeff7c))
[<f89933b0>] nfsd [nfsd] 0x1d0 (0xf6aeff98))
[<c010765e>] kernel_thread [kernel] 0x2e (0xf6aefff0))
[<f89931e0>] nfsd [nfsd] 0x0 (0xf6aefff8))


(See the above bug for more details).
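
To compare traces like the two above, a small parser can help. This is only a sketch: the regular expression assumes the frame layout seen in these traces ("[<address>] symbol [module] 0xoffset"), and the function names are made up for illustration. Dropping zero-offset entries is a heuristic, on the assumption that those are function or table addresses that happened to be on the stack rather than real return addresses.

```python
import re

# One frame as printed in the traces above, e.g.
#   [<c0107f7a>] __down [kernel] 0x6a (0xf6af9de4))
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-fA-F]+)>\]\s+"
    r"(?P<symbol>\S+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+"
    r"0x(?P<offset>[0-9a-fA-F]+)"
)


def parse_frames(trace_text):
    """Return a list of (symbol, module, offset) tuples in trace order."""
    return [
        (m.group("symbol"), m.group("module"), int(m.group("offset"), 16))
        for m in FRAME_RE.finditer(trace_text)
    ]


def likely_call_frames(frames):
    """Keep only frames with a non-zero offset into the symbol.

    Heuristic (an assumption, not something the kernel guarantees):
    an offset of 0 usually means a pointer to the start of a function
    or data table was found on the stack, not a return address.
    """
    return [frame for frame in frames if frame[2] != 0]
```

Applied to both traces, this makes it easier to see that each nfsd thread is sleeping in __down under the readdirplus path.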

Jeremy

--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

