I'm seeing problems writing to shared directories mounted as nfs3.
Wondering if anyone else is seeing similar problems ?
I noticed this last week with -rc2 (my first new kernel since
3.4.0), but haven't managed to find a minimal test case to replicate
it.
The first problem was with firefox - I use it to download tarballs
and patches to a shared /sources. The first download worked, maybe
also another, but some time later one stalled and firefox stopped
refreshing its window. Looked as if it successfully created
packagename.tar.gz.part, but with an empty packagename.tar.gz.
I killed firefox, then I used wget to download to a local filesystem
and then tried to cp it to the shared /sources - cp hung indefinitely,
apparently after completing the transfer.
At this time I tried to run a script *in* /sources which hung
trying to rm a file [ worked fine when I killed it, ssh'd to the
server, and ran it locally ].
I've also got regular backup scripts which wrap rsync writes to a
different writable directory on the server. These stall with -rc2
and -rc3, and get killed by SIGINT when I eventually reboot or shut
down.
I tried using only the rsync command and playing with just an rsync
of a single file in a directory, but that doesn't provoke the
problem. I've also tried various attempts to cp or move a file
to the shared directory, but again without problems. It seems that
things are fine until something provokes the problem, then all
updates stall. This makes it a bit hard to get a reliable and
simple testcase.
I suppose I'll have to start to bisect using the backup script as
my test case (booted rc2 earlier, did nothing except ssh and allow
fcron to run it - it stalled. built rc3, booted that, fcron tried
to run the incomplete backup, again it has stalled).
Config from rc2 attached, any suggestions are welcome.
ĸen
--
the first time as tragedy, the second time as farce
On Sun, Jun 17, 2012 at 09:07:00PM +0100, Ken Moffat wrote:
> I'm seeing problems writing to shared directories mounted as nfs3.
> Wondering if anyone else is seeing similar problems ?
>
[...]
>
> I suppose I'll have to start to bisect using the backup script as
> my test case (booted rc2 earlier, did nothing except ssh and allow
> fcron to run it - it stalled. built rc3, booted that, fcron tried
> to run the incomplete backup, again it has stalled).
>
The good news is that the backup script appears to be adequate to
distinguish good and bad kernels : backing up /boot (I have a new
kernel at each stage of bisection, so always something to transfer
with rsync) either works, or appears to transfer something (I see
network activity in my desktop's panel) and then hangs.
The bad news is that so many of the nfs commits fail to build with
my config (attached to original post). At the moment I seem to be
stuck at 15 commits to test. So far I've installed and tested about
eight commits between 3.4 and 3.5-rc2, and failed to build another
seven (three different errors so far). Enough for tonight, I'll
resume somewhen.
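For anyone wanting to script the build check during a bisect like this, here is a minimal sketch. The function name classify_build is hypothetical and the build command is whatever you normally use; all it does is decide between skipping a commit that does not build and boot-testing one that does.

```shell
# classify_build: run the build command given as arguments and print the
# next step for the bisect: "skip" when the tree does not build, "test"
# when it builds and still needs a boot test. (Hypothetical helper.)
classify_build() {
    if "$@" >/dev/null 2>&1; then
        echo test
    else
        echo skip
    fi
}
```

Possible use: `[ "$(classify_build make bzImage modules)" = skip ] && git bisect skip`. The boot test itself still has to be done by hand.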
ĸen
On Sun, 2012-06-17 at 21:07 +0100, Ken Moffat wrote:
> I'm seeing problems writing to shared directories mounted as nfs3.
> Wondering if anyone else is seeing similar problems ?
>
> [...]
>
> Config from rc2 attached, any suggestions are welcome.
Is this a problem with the client or with the server, and have you tried
seeing if the fixes that were merged into -rc3 help?
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, 2012-06-18 at 09:42 -0400, Trond Myklebust wrote:
> [...]
>
> Is this a problem with the client or with the server, and have you tried
> seeing if the fixes that were merged into -rc3 help?
Doh... Ignore the second half of the above sentence. error=ENOCOFFEE.
I would still like to know whether this is a client or server issue
(i.e. whether a downgrade of one or the other to 3.4 fixes the problem).
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, Jun 18, 2012 at 01:45:35PM +0000, Myklebust, Trond wrote:
> >
> > Is this a problem with the client or with the server, and have you tried
> > seeing if the fixes that were merged into -rc3 help?
>
> Doh... Ignore the second half of the above sentence. error=ENOCOFFEE.
>
> I would still like to know whether this is a client or server issue
> (i.e. whether a downgrade of one or the other to 3.4 fixes the problem).
It's the client - the server is still running 3.0.
ĸen
On Mon, 2012-06-18 at 15:18 +0100, Ken Moffat wrote:
> On Mon, Jun 18, 2012 at 01:45:35PM +0000, Myklebust, Trond wrote:
> > >
> > > Is this a problem with the client or with the server, and have you tried
> > > seeing if the fixes that were merged into -rc3 help?
> >
> > Doh... Ignore the second half of the above sentence. error=ENOCOFFEE.
> >
> > I would still like to know whether this is a client or server issue
> > (i.e. whether a downgrade of one or the other to 3.4 fixes the problem).
>
> It's the client - the server is still running 3.0.
OK. You said you had bisected this down to 15 patches? Can you please
tell me which ones?
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, Jun 18, 2012 at 02:53:56PM +0000, Myklebust, Trond wrote:
> On Mon, 2012-06-18 at 15:18 +0100, Ken Moffat wrote:
> > On Mon, Jun 18, 2012 at 01:45:35PM +0000, Myklebust, Trond wrote:
> > > >
> > > > Is this a problem with the client or with the server, and have you tried
> > > > seeing if the fixes that were merged into -rc3 help?
> > >
> > > Doh... Ignore the second half of the above sentence. error=ENOCOFFEE.
> > >
> > > I would still like to know whether this is a client or server issue
> > > (i.e. whether a downgrade of one or the other to 3.4 fixes the problem).
> >
> > It's the client - the server is still running 3.0.
>
> OK. You said you had bisected this down to 15 patches? Can you please
> tell me which ones?
>
Here's the current log:
git bisect start
# bad: [cfaf025112d3856637ff34a767ef785ef5cf2ca9] Linux 3.5-rc2
git bisect bad cfaf025112d3856637ff34a767ef785ef5cf2ca9
# good: [76e10d158efb6d4516018846f60c2ab5501900bc] Linux 3.4
git bisect good 76e10d158efb6d4516018846f60c2ab5501900bc
# good: [3813d4024a75562baf77d3907fb6afbf8f9c8232] Merge tag
# 'ia64-3.5-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
git bisect good 3813d4024a75562baf77d3907fb6afbf8f9c8232
# good: [5723aa993d83803157c22327e90cd59e3dcbe879] x86: use the new
# generic strnlen_user() function
git bisect good 5723aa993d83803157c22327e90cd59e3dcbe879
# bad: [a70f35af4e49f87ba4b6c4b30220fbb66cd74af6] Merge branch
# 'for-3.5/drivers' of git://git.kernel.dk/linux-block
git bisect bad a70f35af4e49f87ba4b6c4b30220fbb66cd74af6
# bad: [53f2c4a8fd882009a2a75c5b72d6898c0808616e] Merge tag
# 'nfs-for-3.5-1' of
# git://git.linux-nfs.org/projects/trondmy/linux-nfs
git bisect bad 53f2c4a8fd882009a2a75c5b72d6898c0808616e
# good: [84a442b9a16ee69243ce7fce5d6f6f9c3fbdee68] Merge tag 'dt2'
# of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 84a442b9a16ee69243ce7fce5d6f6f9c3fbdee68
# bad: [cc0a98436820b161b595b8cc1d2329bcf7328107] NFSv4: Add
# debugging printks to state manager
git bisect bad cc0a98436820b161b595b8cc1d2329bcf7328107
# skip: [db8333519187d5974cf2ff33910c893bf8727d9f] NFS: Let mount
# data parsing set the NFS version
git bisect skip db8333519187d5974cf2ff33910c893bf8727d9f
# skip: [486aa699ffb6ec28adbc147326d62ac9294de8dc] NFS: Create a new
# nfs_try_mount()
git bisect skip 486aa699ffb6ec28adbc147326d62ac9294de8dc
# bad: [8d197a568fc337c66729b289c7fa0f28c14ba5ac] NFS: Always trust
# the PageUptodate flag when we have a delegation
git bisect bad 8d197a568fc337c66729b289c7fa0f28c14ba5ac
# skip: [1763da1234cba663b849476d451bdccac5147859] NFS: rewrite
# directio write to use async coalesce code
git bisect skip 1763da1234cba663b849476d451bdccac5147859
# skip: [df0117481cd94dbb8970f4be9d05b0568fa09ab1] NFS: Prevent
# garbage cinfo->ds from leaking out
git bisect skip df0117481cd94dbb8970f4be9d05b0568fa09ab1
# good: [6c75dc0d498caa402fb17b1bf769835a9db875c8] NFS: merge _full
# and _partial write rpc_ops
git bisect good 6c75dc0d498caa402fb17b1bf769835a9db875c8
# skip: [b58fee2189b17719c846f65ffe9483c2814e6605] NFS:
# pnfs_pageio_init_read() and init_write() need an extra argument
git bisect skip b58fee2189b17719c846f65ffe9483c2814e6605
# skip: [9533da2979757258d3fd5429d830a297013d69ed] NFS: remove
# unused wb_complete field from struct nfs_page
git bisect skip 9533da2979757258d3fd5429d830a297013d69ed
# skip: [2671bfc3beb44e70636bd0208274426db57f73b5] NFS: Remove
# secinfo knowledge out of the generic client
git bisect skip 2671bfc3beb44e70636bd0208274426db57f73b5
# skip: [1825a0d08f22463e5a8f4b1636473efd057a3479] NFS: prepare
# coalesce testing for directio
git bisect skip 1825a0d08f22463e5a8f4b1636473efd057a3479
I'm just about to resume, to see if I can get anywhere - the rc2
version built fine, so I'm guessing that some more of the remaining
versions will also build. Not sure which commits are still
outstanding.
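To keep track of where a bisect like this stands, a rough sketch that counts the verdicts in a saved log. The function name bisect_summary is made up for illustration; it assumes the log is in the `git bisect log` format, where every verdict is a line of the form `git bisect good|bad|skip <sha>`.

```shell
# bisect_summary: read a `git bisect log` on stdin and print how many
# good, bad and skip verdicts it contains. Comment lines ("# ...") and
# the "git bisect start" line are ignored. (Hypothetical helper.)
bisect_summary() {
    awk '$1 == "git" && $2 == "bisect" && $3 ~ /^(good|bad|skip)$/ { n[$3]++ }
         END { printf "good=%d bad=%d skip=%d\n", n["good"], n["bad"], n["skip"] }'
}
```

Possible use: `git bisect log | bisect_summary`. The same saved log can be fed back later with `git bisect replay <file>`.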
ĸen
On Mon, Jun 18, 2012 at 05:00:05PM +0100, Ken Moffat wrote:
> >
> > OK. You said you had bisected this down to 15 patches? Can you please
> > tell me which ones?
> >
>
> I'm just about to resume, to see if I can get anywhere - the rc2
> version built fine, so I'm guessing that some more of the remaining
> versions will also build. Not sure which commits are still
> outstanding.
>
Turned out that almost everything from 3e9e0ca3 failed to compile
until 4f97615d (fix for NFS_V4_1 undefined) resolved that (I only had
one or two still to try when I decided to use 'git bisect visualize').
And it was bad by 4f97615d.
Added all the NFS v4 / 4.1 client options to my config :
3e9e0ca3 is now good, I'll start a fresh bisect between these two.
Actually, I'd better retry 4f97615d to confirm it is still bad with
this config. [ saves this message, tests ... ]. Rude words! If I
enable all the NFS v4 client options, 4f97615d is good.
Looks, to me, as if something in that untested range of patches, or
in 3e9e0ca3 itself, causes the problem *if* nfs v4 is not enabled.
I suppose I'll have to enable v4 in my configs.
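Since the result depends on the config, a quick way to check which of the two cases a given build is in. This is only a sketch: config_has is a hypothetical helper, and it assumes a standard kernel .config where an enabled symbol appears as CONFIG_FOO=y or CONFIG_FOO=m.

```shell
# config_has CONFIG_FILE SYMBOL: succeed if SYMBOL (given without the
# CONFIG_ prefix) is set to y or m in the kernel config file.
# (Hypothetical helper.)
config_has() {
    grep -Eq "^CONFIG_$2=(y|m)\$" "$1"
}
```

Possible use: `config_has .config NFS_V4 || echo "v4 client not enabled"`. To flip the option from the kernel tree, `scripts/config --enable NFS_V4` followed by `make oldconfig` should do it.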
ĸen
On Mon, 2012-06-18 at 21:05 +0100, Ken Moffat wrote:
> [...]
>
> Looks, to me, as if something in that untested range of patches, or
> in 3e9e0ca3 itself, causes the problem *if* nfs v4 is not enabled.
OK. That's helpful... I've been compiling a kernel without CONFIG_NFS_V4
enabled, and have run a few tests using fsx. So far, I haven't managed
to reproduce your issue.
The next time that you see the hang, can you try to run the command
'echo "t" >/proc/sysrq-trigger' as root?
If you can compile with CONFIG_SUNRPC_DEBUG, then an
'echo 0 >/proc/sys/rpc_debug' might also be helpful.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, Jun 18, 2012 at 09:05:25PM +0100, Ken Moffat wrote:
>
> I suppose I'll have to enable v4 in my configs.
>
Just to confirm, -rc3 with CONFIG_NFS_V4 added to my original
.config appears to work fine on nfs v3.
I suppose I could apply 4f97615d19c370d1d907ef37f8bcd9c3672851ca on
top of the commits which failed to compile without v4, if it is worth
investigating this ? It certainly fixes the error in fs/nfs/read.c,
but I also had:
fs/nfs/direct.c:86:29: error: field ‘ds_cinfo’ has incomplete type
in 3e9e0ca3 which would still prevent me testing at least some of
these commits. Seems to have been fixed *somewhere* in that series,
otherwise 4f97615d would not have compiled. Any suggestions for
that ? Alternatively would it be useful if I discovered which
commits are affected for ds_cinfo, or should I just follow the good
Dr.Pangloss and add V4 to my config ? ;-)
Your call. Thanks.
ĸen
On Mon, 2012-06-18 at 22:53 +0100, Ken Moffat wrote:
> On Mon, Jun 18, 2012 at 09:05:25PM +0100, Ken Moffat wrote:
> >
> > I suppose I'll have to enable v4 in my configs.
> >
> Just to confirm, -rc3 with CONFIG_NFS_V4 added to my original
> .config appears to work fine on nfs v3.
>
> I suppose I could apply 4f97615d19c370d1d907ef37f8bcd9c3672851ca on
> top of the commits which failed to compile without v4, if it is worth
> investigating this ? It certainly fixes the error in fs/nfs/read.c,
> but I also had:
>
> fs/nfs/direct.c:86:29: error: field ‘ds_cinfo’ has incomplete type
>
> in 3e9e0ca3 which would still prevent me testing at least some of
> these commits. Seems to have been fixed *somewhere* in that series,
> otherwise 4f97615d would not have compiled. Any suggestions for
> that ? Alternatively would it be useful if I discovered which
> commits are affected for ds_cinfo, or should I just follow the good
> Dr.Pangloss and add V4 to my config ? ;-)
Doesn't 4f97615d19 fix the fs/nfs/direct.c problem too? It should.
Anyhow, if you can apply that on top of the commits that didn't compile,
and then continue the bisection, that would be great. We definitely do
want the !defined(CONFIG_NFS_V4) case to work in 3.5-final...
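One way to apply a build fix on top of each bisection point without disturbing the bisect itself, as a minimal sketch. The helper name with_fix is hypothetical, and the GIT variable exists only so the sequence can be dry-run; the idea is to cherry-pick the fix into the working tree without committing, run the build, then put the tree back before recording the verdict.

```shell
# with_fix FIX_COMMIT COMMAND...: apply FIX_COMMIT to the working tree
# without committing it, run COMMAND, then restore the tree so the
# verdict is recorded against the commit that bisect checked out.
# Returns 125 (the "cannot test" code used by `git bisect run`) if the
# fix does not apply, otherwise the exit status of COMMAND.
# (Hypothetical helper; ${GIT:-git} allows a dry run with GIT=echo.)
with_fix() {
    local fix=$1 status=0
    shift
    if ! ${GIT:-git} cherry-pick --no-commit "$fix"; then
        ${GIT:-git} reset --hard --quiet
        return 125
    fi
    "$@" || status=$?
    ${GIT:-git} reset --hard --quiet
    return "$status"
}
```

Possible use: `with_fix 4f97615d make bzImage modules`, then install and boot the result and mark the commit with `git bisect good` or `git bisect bad`. Note that `git reset --hard` discards any other uncommitted changes in the tree.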
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, Jun 18, 2012 at 10:03:24PM +0000, Myklebust, Trond wrote:
> Doesn't 4f97615d19 fix the fs/nfs/direct.c problem too? It should.
>
> Anyhow, if you can apply that on top of the commits that didn't compile,
> and then continue the bisection, that would be great. We definitely do
> want the !defined(CONFIG_NFS_V4) case to work in 3.5-final...
>
OK (I was assuming errors in different places were from different
causes). I'll do that after I've rerun rc3 without NFS_V4 with
SUNRPC_DEBUG. Thanks.
ĸen
On Mon, Jun 18, 2012 at 08:11:40PM +0000, Myklebust, Trond wrote:
>
> The next time that you see the hang, can you try to run the command
> 'echo "t" >/proc/sysrq-trigger' as root?
> If you can compile with CONFIG_SUNRPC_DEBUG, then an
> 'echo 0 >/proc/sys/rpc_debug' might also be helpful.
>
I'm attaching the bzip2'd output from the trigger [ 151k before
compressing ] - I guess that the lines from 1346 (the backup script)
onwards, and particularly from 1417 (rsync in state D) are the parts
of most interest. The last line is because I killed the backup with
Ctrl-C.
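To pick the blocked tasks out of a dump this size, something like the following might help. It is only a sketch: d_state_tasks is a made-up name, and it assumes each task header in the dump has the layout "comm state address free-stack pid ppid flags", optionally preceded by a printk timestamp. Task names containing spaces would confuse it.

```shell
# d_state_tasks: read a saved SysRq-t task dump on stdin and print
# "name pid" for every task in uninterruptible sleep (state D).
# (Hypothetical helper; the field layout is an assumption, see above.)
d_state_tasks() {
    awk '{
        line = $0
        sub(/^\[ *[0-9]+\.[0-9]+\] /, "", line)   # drop "[ 123.456789] "
        n = split(line, f)
        if (n >= 6 && f[2] == "D") print f[1], f[5]
    }'
}
```

Possible use, as root while the hang is in progress: `echo t >/proc/sysrq-trigger`, save the kernel log to a file, then `d_state_tasks < that-file`. With a dump this large the dmesg ring buffer may not hold all of it, so the syslog copy is the safer source.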
For the debug: I assume /proc/sys/sunrpc/rpc_debug is the thing ?
It didn't seem to do anything (udev-181) - but I only looked at the
log, forgot to look at dmesg.
I've also got an nfs_debug file there. Tried echoing 0 to that,
reran the backup script but it now ends normally and there is still
nothing in dmesg.
Created a short text file in /boot, backup ran normally. Added
another kernel image, used 'rpcdebug -s -m rpc all' and the same for
nfs, this time it hung. I suspect it isn't the rsync itself which
hangs, but updating or touching or deleting a status file. Might be
totally wrong there. Also, 0 in the debug file seems to turn it off.
I've grepped the separate NFS and RPC messages into nfs-only.bz2
and rpc-only.bz2.
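Roughly how the two files were split out, as a sketch rather than the exact commands used. split_debug_log is a hypothetical name, and it assumes the debug messages of interest carry the literal prefixes "NFS:" and "RPC:", which not every debug line does.

```shell
# split_debug_log LOGFILE OUTDIR: copy the lines mentioning "NFS:" to
# OUTDIR/nfs-only and the lines mentioning "RPC:" to OUTDIR/rpc-only.
# (Hypothetical helper; the prefixes are an assumption, see above.)
split_debug_log() {
    local log=$1 outdir=$2
    grep 'NFS:' "$log" > "$outdir/nfs-only"
    grep 'RPC:' "$log" > "$outdir/rpc-only"
    return 0    # an empty result is not an error
}
```

The debugging itself is switched on with `rpcdebug -m rpc -s all` and `rpcdebug -m nfs -s all`, and off again with `-c` in place of `-s`.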
Back to bisection.
ĸen
On Tue, Jun 19, 2012 at 12:38:02AM +0100, Ken Moffat wrote:
> On Mon, Jun 18, 2012 at 08:11:40PM +0000, Myklebust, Trond wrote:
> >
> I'm attaching the bzip2'd output from the trigger [ 151k before
> compressing ] - I guess that the lines from 1346 (the backup script)
> onwards, and particularly from 1417 (rsync in state D) are the parts
> of most interest. The last line is because I killed the backup with
> Ctrl-C.
>
[...]
>
> I've grepped the separate NFS and RPC messages into nfs-only.bz2
> and rpc-only.bz2.
and forgot to attach them. Sorry. All three attached to this.
ĸen
On Mon, Jun 18, 2012 at 11:10:52PM +0100, Ken Moffat wrote:
> On Mon, Jun 18, 2012 at 10:03:24PM +0000, Myklebust, Trond wrote:
> >
> > Doesn't 4f97615d19 fix the fs/nfs/direct.c problem too? It should.
> >
> > Anyhow, if you can apply that on top of the commits that didn't compile,
> > and then continue the bisection, that would be great. We definitely do
> > want the !defined(CONFIG_NFS_V4) case to work in 3.5-final...
> >
> OK (I was assuming errors in different places were from different
> causes). I'll do that after I've rerun rc3 without NFS_V4 with
> SUNRPC_DEBUG. Thanks.
Bisection now points to:
6d74743b088d116e31fe1b73f47e782ee2016b94 is the first bad commit
commit 6d74743b088d116e31fe1b73f47e782ee2016b94
Author: Trond Myklebust <[email protected]>
Date:   Mon Apr 30 13:27:31 2012 -0400

    NFS: Simplify O_DIRECT page referencing

    The O_DIRECT code shouldn't need to hold 2 references to each page. The
    reference held by the struct nfs_page should suffice.

    Signed-off-by: Trond Myklebust <[email protected]>
    Cc: Fred Isaman <[email protected]>

I was going to revert that from 3.5.0-rc3 to confirm that my
problem with backups was gone, and then give it more extended
testing to prove firefox downloads were ok, but 6 of 11 hunks
failed, the code has changed and I'm not familiar with it.
So, for the moment I'm not 100% sure that this is indeed the
problem.
ĸen
On Tue, 2012-06-19 at 02:06 +0100, Ken Moffat wrote:
> Bisection now points to:
>
> 6d74743b088d116e31fe1b73f47e782ee2016b94 is the first bad commit
> commit 6d74743b088d116e31fe1b73f47e782ee2016b94
> Author: Trond Myklebust <[email protected]>
> Date: Mon Apr 30 13:27:31 2012 -0400
>
> NFS: Simplify O_DIRECT page referencing
>
> The O_DIRECT code shouldn't need to hold 2 references to each
> page. The
> reference held by the struct nfs_page should suffice.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> Cc: Fred Isaman <[email protected]>
>
> I was going to revert that from 3.5.0-rc3 to confirm that my
> problem with backups was gone, and then give it more extended
> testing to prove firefox downloads were ok, but 6 of 11 hunks
> failed, the code has changed and I'm not familiar with it.
However you are saying that the problem is there when you compile a
kernel with this commit as the head, and it goes away when you compile a
kernel with commit 3e9e0ca3f19e911ce13c2e6c9858fcb41a37496c as the head?
I'm confused as to how a bug in that patch could depend on
CONFIG_NFS_V4, but I'll see what I can find.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, 2012-06-19 at 12:20 -0400, Trond Myklebust wrote:
> [...]
>
> However you are saying that the problem is there when you compile a
> kernel with this commit as the head, and it goes away when you compile a
> kernel with commit 3e9e0ca3f19e911ce13c2e6c9858fcb41a37496c as the head?
>
> I'm confused as to how a bug in that patch could depend on
> CONFIG_NFS_V4, but I'll see what I can find.
By the way, I thought your test-case was doing firefox downloads. Do
those really use O_DIRECT?
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, Jun 19, 2012 at 04:23:23PM +0000, Myklebust, Trond wrote:
> On Tue, 2012-06-19 at 12:20 -0400, Trond Myklebust wrote:
> >
> > However you are saying that the problem is there when you compile a
> > kernel with this commit as the head, and it goes away when you compile a
> > kernel with commit 3e9e0ca3f19e911ce13c2e6c9858fcb41a37496c as the head?
> >
Provided I apply 4f97615d as well, so that it compiles, yes.
> > I'm confused as to how a bug in that patch could depend on
> > CONFIG_NFS_V4, but I'll see what I can find.
Thanks
>
> By the way, I thought your test-case was doing firefox downloads. Do
> those really use O_DIRECT?
>
I originally saw the problem doing that, but it was on the second
download. Or perhaps third or fourth - I tend not to remember
successful downloads when I've got a lot of packages to check for new
versions. Using my backup script seemed a more reliable way to
trigger a problem (but, only if there is something substantial to
back up, such as a new vmlinuz).
Thinking about this, it is almost certain that between the first
download and the one that failed (several hours later) my backup
script did run, from fcron, so I now think the rsync problem is what
leads to issues when other programs later try to update the same nfs
directory.
ĸen
On Tue, 2012-06-19 at 17:55 +0100, Ken Moffat wrote:
> [...]
>
> Thinking about this, it is almost certain that between the first
> download and the one that failed (several hours later) my backup
> script did run, from fcron, so I now think the rsync problem is what
> leads to issues when other programs later try to update the same nfs
> directory.
Does the following patch make any difference?
You probably want to ensure that you also have commit
906369e43c29001c39c7dfed8a01b9dff24ace75 (which is in 3.5-rc3) since
that corrects a similar issue.
Cheers
Trond
8<------------------------------------------------------------
From ed3b97f9af6421f326de413e6d6556d1ecc3399d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Tue, 19 Jun 2012 13:39:14 -0400
Subject: [PATCH] NFS: Fix a refcounting issue in O_DIRECT

In nfs_direct_write_reschedule(), the requests from nfs_scan_commit_list
have a refcount of 2, whereas the operations in
nfs_direct_write_completion_ops expect them to have a refcount of 1.

This patch adds a call to release the extra references.

Signed-off-by: Trond Myklebust <[email protected]>
---
 fs/nfs/direct.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 3168f6e..9a4cbfc 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -490,6 +490,7 @@ static void nfs_direct_write_reschedule(struct nfs_direct_req *dreq)
 			dreq->error = -EIO;
 			spin_unlock(cinfo.lock);
 		}
+		nfs_release_request(req);
 	}
 	nfs_pageio_complete(&desc);
--
1.7.10.2
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, Jun 19, 2012 at 05:46:28PM +0000, Myklebust, Trond wrote:
>
> Does the following patch make any difference?
>
> You probably want to ensure that you also have commit
> 906369e43c29001c39c7dfed8a01b9dff24ace75 (which is in 3.5-rc3) since
> that corrects a similar issue.
>
> Cheers
> Trond
> 8<------------------------------------------------------------
> From ed3b97f9af6421f326de413e6d6556d1ecc3399d Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Tue, 19 Jun 2012 13:39:14 -0400
> Subject: [PATCH] NFS: Fix a refcounting issue in O_DIRECT
>
> In nfs_direct_write_reschedule(), the requests from nfs_scan_commit_list
> have a refcount of 2, whereas the operations in
> nfs_direct_write_completion_ops expect them to have a refcount of 1.
>
> This patch adds a call to release the extra references.
Unfortunately, no difference (on top of -rc3).
FWIW, after rsync stalled I tried a download from firefox, to a
different nfs mount, and that too appeared to lock up firefox.
ĸen
On Tue, 2012-06-19 at 20:35 +0100, Ken Moffat wrote:
> Unfortunately, no difference (on top of -rc3).
>
> FWIW, after rsync stalled I tried a download from firefox, to a
> different nfs mount, and that too appeared to lock up firefox.
OK, I think I see what the problem is...
Does the following patch work for you?
Cheers
Trond
8<------------------------------------------------------
From 1a0de48ae56b5cdb9a46b3d3a0b578dd7f787f22 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Tue, 19 Jun 2012 18:38:56 -0400
Subject: [PATCH] NFS: Initialise commit_info.rpc_out when
 !defined(CONFIG_NFS_V4)

Signed-off-by: Trond Myklebust <[email protected]>
Cc: Fred Isaman <[email protected]>
---
 fs/nfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e605d69..f729698 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1530,7 +1530,6 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
 	nfsi->delegation_state = 0;
 	init_rwsem(&nfsi->rwsem);
 	nfsi->layout = NULL;
-	atomic_set(&nfsi->commit_info.rpcs_out, 0);
 #endif
 }
@@ -1545,6 +1544,7 @@ static void init_once(void *foo)
 	INIT_LIST_HEAD(&nfsi->commit_info.list);
 	nfsi->npages = 0;
 	nfsi->commit_info.ncommit = 0;
+	atomic_set(&nfsi->commit_info.rpcs_out, 0);
 	atomic_set(&nfsi->silly_count, 1);
 	INIT_HLIST_HEAD(&nfsi->silly_list);
 	init_waitqueue_head(&nfsi->waitqueue);
--
1.7.10.2
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, Jun 19, 2012 at 10:44:38PM +0000, Myklebust, Trond wrote:
> On Tue, 2012-06-19 at 20:35 +0100, Ken Moffat wrote:
> > Unfortunately, no difference (on top of -rc3).
> >
> > FWIW, after rsync stalled I tried a download from firefox, to a
> > different nfs mount, and that too appeared to lock up firefox.
>
> OK, I think I see what the problem is...
>
> Does the following patch work for you?
>
> Cheers
> Trond
> 8<------------------------------------------------------
> From 1a0de48ae56b5cdb9a46b3d3a0b578dd7f787f22 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Tue, 19 Jun 2012 18:38:56 -0400
> Subject: [PATCH] NFS: Initialise commit_info.rpc_out when
> !defined(CONFIG_NFS_V4)
>
> Signed-off-by: Trond Myklebust <[email protected]>
> Cc: Fred Isaman <[email protected]>
> ---
> fs/nfs/inode.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
Yes :)
On top of 3.5-rc3, without the previous patch. I did two backups
(different filesystems) and then three downloads from firefox.
Many thanks!
ĸen