2008-10-17 11:01:25

by Ian Campbell

Subject: RPC retransmission of write requests containing bogus data

(please CC me, I am not currently subscribed to linux-nfs)

Hi,

For some time now I have been tracking down a mysterious host crash when
using Xen with blktap (userspace paravirtual disk implementation) under
stress conditions. I have now managed to create a simple test case with
no Xen components which appears to show RPC retransmissions of write
requests containing invalid data. I'll start at the beginning and
explain the original bug although the test case is much simpler.

blktap is a userspace daemon which provides paravirtualised disk support
to Xen guests. The pages of data for a block write are mapped from the
Xen guest into a domain 0 driver called blktap which has a userspace
component which implements qcow/vhd/etc and writes the data to the
backing file using O_DIRECT and aio in a zero-copy manner. Once the
aio is completed the pages are returned to the guest and unmapped from
domain 0. When a page is unmapped in this way the pte is set not present
and the PFN is mapped to an invalid MFN.
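
For context, the userspace write path boils down to the usual O_DIRECT +
libaio pattern, roughly as below (a minimal illustrative sketch, not the
actual blktap code -- the file name, sizes and lack of error handling are
all mine; build with -laio):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd = open("backing.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        posix_memalign(&buf, 4096, 4096);       /* O_DIRECT wants aligned buffers */
        memset(buf, 0xaa, 4096);                /* stands in for the guest-provided page */

        io_setup(1, &ctx);
        io_prep_pwrite(&cb, fd, buf, 4096, 0);  /* zero copy: buf itself goes down the stack */
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);     /* once this returns, blktap unmaps the page
                                                   and hands it back to the guest */
        io_destroy(ctx);
        close(fd);
        return 0;
}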

We have been seeing a crash in the domain 0 network driver's start_xmit
routine when it attempts to access data from a page where the PFN maps
to an invalid MFN. I added some tracing to the kernel and observed this
sequence of events on an individual page:
tap: 0/41 at 4fac7a165ad6 "rpc_init_task" c8792844 0
tap: 1/41 at 4fac7a1663e2 "nfs_direct_write_schedule" ece19850 2000
tap: 2/41 at 4fac7a166cca "call_start" c8792844 0
tap: 3/41 at 4fac7a167540 "call_reserve" c8792844 0
tap: 4/41 at 4fac7a167de6 "call_reserveresult" c8792844 0
tap: 5/41 at 4fac7a168620 "call_allocate" c8792844 0
tap: 6/41 at 4fac7a168f08 "call_bind" c8792844 0
tap: 7/41 at 4fac7a169712 "call_connect" c8792844 0
tap: 8/41 at 4fac7a169f28 "call_transmit" c8792844 0
tap: 9/41 at 4fac7a16a7f2 "call_encode" c8792844 0
tap: 10/41 at 4fac7a16afd8 "call_header" c8792844 0
tap: 11/41 at 4fac7a16bc6e "xs_tcp_send_request" c8792844 0
tap: 12/41 at 4fac7a16c9d0 "tcp_sendpage" 0 0
tap: 13/41 at 4fac7a16cec2 "do_tcp_sendpages (adding data to skb)" cef08b00 0
tap: 14/41 at 4fac7a16e068 "call_transmit_status" c8792844 0
tap: 15/41 at 4fac7a2ed8f4 "tcp_transmit_skb, skb_clone" c9dca500 c9dca5a8
tap: 16/41 at 4faeeeb9f566 "xprt_timer" c8792844 0
tap: 17/41 at 4faeeeb9ff0e "xprt_timer: !req->rq_received" c8792844 0
tap: 18/41 at 4faeeeba08ec "rpc_make_runnable" c8792844 0
tap: 19/41 at 4faeeec117b8 "call_status" c8792844 ffffff92
tap: 20/41 at 4faeeec11faa "timeout (minor)" c8792844 0
tap: 21/41 at 4faeeec12778 "call_bind" c8792844 0
tap: 22/41 at 4faeeec12ef8 "call_connect" c8792844 0
tap: 23/41 at 4faeeec13678 "call_transmit" c8792844 0
tap: 24/41 at 4faeeec13e46 "call_encode" c8792844 0
tap: 25/41 at 4faeeec145ae "call_header" c8792844 0
tap: 26/41 at 4faeeec15082 "xs_tcp_send_request" c8792844 0
tap: 27/41 at 4faeeec15d1e "tcp_sendpage" 0 0
tap: 28/41 at 4faeeec161ce "do_tcp_sendpages (adding data to skb)" d06afe40 0
tap: 29/41 at 4faeeec172ea "call_transmit_status" c8792844 0
tap: 30/41 at 4faf2e3280d4 "rpc_make_runnable" c8792844 0
tap: 31/41 at 4faf2e3440c4 "call_status" c8792844 88
tap: 32/41 at 4faf2e3449d6 "call_decode" c8792844 0
tap: 33/41 at 4faf2e345240 "call_verify" c8792844 0
tap: 34/41 at 4faf2e345a9e "rpc_exit_task" c8792844 0
tap: 35/41 at 4faf2e34652a "nfs_direct_write_result" ece19850 2000
tap: 36/41 at 4faf2e351000 "nfs_direct_write_release (completing)" ece19850 0
tap: 37/41 at 4faf2e3517ec "nfs_direct_write_complete,DEFAULT w/ iocb" ece19850 0
tap: 38/41 at 4faf2e35205c "nfs_direct_free_writedata" ece19850 0
tap: 39/41 at 4faf2e51026a "fast_flush_area" 0 0
tap: 40/41 at 4faf33e1813a "tcp_transmit_skb, skb_clone" d06afe40 d06afee8
(nb: fast_flush_area is the blktap function which returns the pages to
the guest and unmaps them from domain 0; it is called via ioctl from the
userspace process once the aio write returns successfully. 4fac7.... is
the TSC; the processor is 2.33GHz.)

So what we see is the initial request being constructed and transmitted
(around 11/41-15/41), followed by a timeout ~60s later (16/41-20/41)
which causes us to queue a retransmit (26/41-29/41) but, critically, not
yet actually transmit it. We then get the reply to the original request
and complete the NFS write (35/41-38/41), returning success to userspace,
which causes it to unmap the pages and return them to the guest (39/41).
Only then (40/41) do we attempt to transmit the duplicate request, and we
crash because the pages are no longer present.

By using libnetfilter_queue I was then able to reproduce this in
non-stress situations by introducing delays into the network. Stalling
all network traffic with the NFS server for 65s every 1000 packets seems
to do the trick (probably on the aggressive side, I didn't attempt to
minimise the interruption required to reproduce and I realise that this
represents a pretty crappy network and/or server).
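
The outage injection itself is simple; the core of it is an NFQUEUE
verdict callback along the lines of the sketch below (reconstructed from
memory rather than copied from the attached netoutage.c, and the iptables
rule in the comment is only an example):

/* Every 1000 packets, drop everything queued here for 65 seconds so that
 * TCP has to retransmit. Traffic to/from the NFS server is steered into
 * the queue with something like:
 *   iptables -A OUTPUT -d <server> -j NFQUEUE --queue-num 0
 * Build with: gcc netoutage.c -lnetfilter_queue
 */
#include <stdint.h>
#include <time.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/netfilter.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

static unsigned long npkts;
static time_t stall_until;

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
        struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
        uint32_t id = ph ? ntohl(ph->packet_id) : 0;

        if (time(NULL) < stall_until)           /* inside the outage: drop */
                return nfq_set_verdict(qh, id, NF_DROP, 0, NULL);

        if (++npkts % 1000 == 0)                /* start the next 65s outage */
                stall_until = time(NULL) + 65;

        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
        struct nfq_handle *h = nfq_open();
        struct nfq_q_handle *qh;
        char buf[65536];
        int fd, n;

        nfq_unbind_pf(h, AF_INET);
        nfq_bind_pf(h, AF_INET);
        qh = nfq_create_queue(h, 0, &cb, NULL);
        nfq_set_mode(qh, NFQNL_COPY_META, 0);   /* headers are enough, we never look at payload */

        fd = nfq_fd(h);
        while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
                nfq_handle_packet(h, buf, n);

        nfq_destroy_queue(qh);
        nfq_close(h);
        return 0;
}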

Given the observed sequence of events I then constructed a fairly simple
test program (blktest2.c, attached) using O_DIRECT (but not aio, I don't
think it matters) which shows the same issue without involving any Xen
parts. The test writes a buffer filled with a counter to a file and
immediately after the write() returns fills the buffer with 0xdeadbeef.
By using tcpdump I capture and observe duplicated requests on the wire
containing 0xdeadbeef in the payload and not the expected counter
values. I usually see this in a matter of minutes. I've attached a pcap
of a single request/reply pair which was corrupted.
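
For reference, the guts of blktest2.c amount to the loop below (a sketch
of what the attached file does rather than the file itself; the mount
point is just an example):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSZ 8192

int main(void)
{
        uint32_t *buf;
        uint32_t counter = 0;
        size_t i;
        int fd = open("/mnt/nfs/blktest2.dat",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);

        posix_memalign((void **)&buf, 4096, BUFSZ);

        for (;;) {
                for (i = 0; i < BUFSZ / sizeof(*buf); i++)
                        buf[i] = counter++;     /* recognisable pattern */

                if (write(fd, buf, BUFSZ) != BUFSZ)
                        return 1;

                /* The write has "completed", so userspace is entitled to
                 * reuse the buffer immediately... */
                for (i = 0; i < BUFSZ / sizeof(*buf); i++)
                        buf[i] = 0xdeadbeef;

                /* ...but a queued RPC retransmission may still reference
                 * these pages and put 0xdeadbeef on the wire instead of
                 * the counters. */
        }
}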

Presumably in the case of a decent NFS server the XID request cache
would prevent the bogus data actually reaching the disk but on a
non-decent server I suspect it might actually lead to corruption (AIUI
the request cache is not a hard requirement of the NFS protocol?).
Perhaps even a decent server might have timed out the entry in the cache
after such a delay?

The Xen case and the blktest2 repro was on 2.6.18. I have also
reproduced the blktest2 case on 2.6.27 native but not with Xen since no
domain 0 support exists just yet.

I can think of several possible options to fix this issue:
* Do not return success to userspace until all duplicated requests
have actually hit the wire, even if the response comes back
earlier than that.
* Cancel queued duplicates if a response comes in late.
* Copy the data pages on retransmission.

I think I prefer the 3rd option; the first two are a bit tricky because
the request has been merged into a TCP stream and may already have been
fragmented/segmented etc. I don't think copying the pages on retransmit
has a massive performance impact -- if you are retransmitting at all you
must already be suffering from pretty severe network or server issues!

I have CC'd the maintainer of the bnx2 Ethernet driver (Michael Chan)
because I have so far only been able to reproduce this with the Broadcom
NIC. Even on the same server, if I switch to e1000 I cannot reproduce it.
However, given the analysis above, I'm not convinced it is likely to be a
driver bug.

Ian.


Attachments:
netoutage.c (2.86 kB)
blktest2.c (1.93 kB)
corrupt.pcap (5.11 kB)

2008-10-20 14:25:18

by Ian Campbell

Subject: Re: RPC retransmission of write requests containing bogus data

On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > I don't see how this could be an RPC bug. The networking layer is
> > > supposed to either copy the data sent to the socket, or take a reference
> > > to any pages that are pushed via the ->sendpage() abi.
> > >
> > > IOW: the pages are supposed to be still referenced by the networking
> > > layer even if the NFS layer and page cache have dropped their
> > > references.
> >
> > The pages are still referenced by the networking layer. The problem is
> > that the userspace app has been told that the write has completed so it
> > is free to write new data to those pages.
> >
> > Ian.
>
> OK, I see your point.
>
> Does this happen at all with NFSv4? I ask because the NFSv4 client will
> always ensure that the TCP connection gets broken before a
> retransmission. I wouldn't therefore expect any races between a reply to
> the previous transmission and the new one...

It does seem to happen with NFSv4 too (see attached).

Ian.


Attachments:
corrupt-over-nfsv4.pcap (7.18 kB)

2008-10-17 12:49:41

by Myklebust, Trond

Subject: Re: RPC retransmission of write requests containing bogus data

On Fri, 2008-10-17 at 12:01 +0100, Ian Campbell wrote:
> (please CC me, I am not currently subscribed to linux-nfs)
> ...
> Ian.

I don't see how this could be an RPC bug. The networking layer is
supposed to either copy the data sent to the socket, or take a reference
to any pages that are pushed via the ->sendpage() abi.

IOW: the pages are supposed to be still referenced by the networking
layer even if the NFS layer and page cache have dropped their
references.

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2008-10-17 13:01:15

by Ian Campbell

Subject: Re: RPC retransmission of write requests containing bogus data

On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> I don't see how this could be an RPC bug. The networking layer is
> supposed to either copy the data sent to the socket, or take a reference
> to any pages that are pushed via the ->sendpage() abi.
>
> IOW: the pages are supposed to be still referenced by the networking
> layer even if the NFS layer and page cache have dropped their
> references.

The pages are still referenced by the networking layer. The problem is
that the userspace app has been told that the write has completed so it
is free to write new data to those pages.

Ian.



2008-10-17 13:23:19

by Myklebust, Trond

Subject: Re: RPC retransmission of write requests containing bogus data

On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > I don't see how this could be an RPC bug. The networking layer is
> > supposed to either copy the data sent to the socket, or take a reference
> > to any pages that are pushed via the ->sendpage() abi.
> >
> > IOW: the pages are supposed to be still referenced by the networking
> > layer even if the NFS layer and page cache have dropped their
> > references.
>
> The pages are still referenced by the networking layer. The problem is
> that the userspace app has been told that the write has completed so it
> is free to write new data to those pages.
>
> Ian.

OK, I see your point.

Does this happen at all with NFSv4? I ask because the NFSv4 client will
always ensure that the TCP connection gets broken before a
retransmission. I wouldn't therefore expect any races between a reply to
the previous transmission and the new one...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2008-10-17 13:32:37

by Talpey, Thomas

Subject: Re: RPC retransmission of write requests containing bogus data

At 07:01 AM 10/17/2008, Ian Campbell wrote:
>(please CC me, I am not currently subscribed to linux-nfs)
>...
>Presumably in the case of a decent NFS server the XID request cache
>would prevent the bogus data actually reaching the disk but on a
>non-decent server I suspect it might actually lead to corruption (AIUI
>the request cache is not a hard requirement of the NFS protocol?).
>Perhaps even a decent server might have timed out the entry in the cache
>after such a delay?

Unfortunately no - because 1) your retransmissions are not, in fact,
duplicates since the data has changed and 2) no NFSv3 reply cache
works perfectly, especially under heavy load. The NFSv4.1 session
addresses this, but that's not at issue here.

This is a really nasty race. The whole thing starts with the dropped
TCP segment evidenced at #2 of your trace. Then, the retransmission
appears to have been scheduled prior to the write reply making it back
to the client through the TCP storm, so the retransmit is actually pending
on the wire while the NFS write operation is completed.

The fix here is to break the connection before retrying, a long-standing
pet peeve of mine that NFSv3 historically does not do. Setting the
clnt->cl_discrtry bit in the RPC client struct is all that's required. The
NFSv4 client does this by default, btw.
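
In rpc_create() terms that just means passing RPC_CLNT_CREATE_DISCRTRY in
the creation flags when the NFSv3 client sets up its transport, i.e.
something along these lines (an untested sketch against the current
sunrpc rpc_create_args; server_addr, hostname etc. stand in for whatever
the mount code already has):

        struct rpc_create_args args = {
                .protocol       = XPRT_TRANSPORT_TCP,
                .address        = (struct sockaddr *)&server_addr,
                .addrsize       = sizeof(server_addr),
                .servername     = hostname,
                .program        = &nfs_program,
                .version        = 3,
                .authflavor     = RPC_AUTH_UNIX,
                /* disconnect before each retransmission, as NFSv4 does */
                .flags          = RPC_CLNT_CREATE_DISCRTRY,
        };
        struct rpc_clnt *clnt = rpc_create(&args);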

Tom.


2008-10-17 13:33:59

by Ian Campbell

Subject: Re: RPC retransmission of write requests containing bogus data

On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > I don't see how this could be an RPC bug. The networking layer is
> > > supposed to either copy the data sent to the socket, or take a reference
> > > to any pages that are pushed via the ->sendpage() abi.
> > >
> > > IOW: the pages are supposed to be still referenced by the networking
> > > layer even if the NFS layer and page cache have dropped their
> > > references.
> >
> > The pages are still referenced by the networking layer. The problem is
> > that the userspace app has been told that the write has completed so it
> > is free to write new data to those pages.
> >
> > Ian.
>
> OK, I see your point.
>
> Does this happen at all with NFSv4? I ask because the NFSv4 client will
> always ensure that the TCP connection gets broken before a
> retransmission. I wouldn't therefore expect any races between a reply to
> the previous transmission and the new one...

I believe we've only tested with NFSv3.

Unfortunately I've just lent the box which reproduces this issue to a
colleague for the rest of the day. I'll test with v4 next week.

Ian.


2008-10-17 13:37:39

by Myklebust, Trond

Subject: Re: RPC retransmission of write requests containing bogus data

On Fri, 2008-10-17 at 09:32 -0400, Talpey, Thomas wrote:
> At 07:01 AM 10/17/2008, Ian Campbell wrote:
> >(please CC me, I am not currently subscribed to linux-nfs)
> >...
> >Presumably in the case of a decent NFS server the XID request cache
> >would prevent the bogus data actually reaching the disk but on a
> >non-decent server I suspect it might actually lead to corruption (AIUI
> >the request cache is not a hard requirement of the NFS protocol?).
> >Perhaps even a decent server might have timed out the entry in the cache
> >after such a delay?
>
> Unfortunately no - because 1) your retransmissions are not, in fact,
> duplicates since the data has changed and 2) no NFSv3 reply cache
> works perfectly, especially under heavy load. The NFSv4.1 session
> addresses this, but that's not at issue here.
>
> This is a really nasty race. The whole thing starts with the dropped
> TCP segment evidenced at #2 of your trace. Then, the retransmission
> appears to have been scheduled prior to the write reply making it back
> to the client through the TCP storm, so the retransmit is actually pending
> on the wire while the NFS write operation is completed.
>
> The fix here is to break the connection before retrying, a long-standing
> pet peeve of mine that NFSv3 historically does not do. Setting the
> clnt->cl_discrtry bit in the RPC client struct is all that's required. The
> NFSv4 client does this by default, btw.
>
> Tom.

It's not a perfect fix, which is why we haven't done that for NFSv3.

When you break the connection, there is the chance that a reply to a
non-idempotent request may get lost, and that the server doesn't
recognise the retransmission due to the above mentioned imperfections
with the replay cache. In that case, the client may get a downright
_wrong_ reply (for instance, it may see an EEXIST reply to a mkdir
request that was actually successful).

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2008-10-17 13:51:57

by Talpey, Thomas

Subject: Re: RPC retransmission of write requests containing bogus data

At 09:36 AM 10/17/2008, Trond Myklebust wrote:
>On Fri, 2008-10-17 at 09:32 -0400, Talpey, Thomas wrote:
>> The fix here is to break the connection before retrying, a long-standing
>> pet peeve of mine that NFSv3 historically does not do.
>
>It's not a perfect fix, which is why we haven't done that for NFSv3.
>
>When you break the connection, there is the chance that a reply to a
>non-idempotent request may get lost, and that the server doesn't
>recognise the retransmission due to the above mentioned imperfections
>with the replay cache. In that case, the client may get a downright
>_wrong_ reply (for instance, it may see an EEXIST reply to a mkdir
>request that was actually successful).

Well, the NFSv4 client will suffer the same issue in the above case.
So, it's choose your poison. The antidote is to adopt NFSv4.1 asap. ;-)

Sorry about the crossing replies btw. My own transmission was scheduled
before pulling email to look for other replies! Hmm...

Tom.


2008-11-11 13:08:33

by Ian Campbell

Subject: Re: RPC retransmission of write requests containing bogus data

I cannot find my more recent post in my sent box (although I see it on
gmane.org) so apologies for replying to an older mail instead.

I eventually worked around this issue by not doing retransmissions since
the RPC_CLNT_CREATE_DISCRTRY backport to 2.6.18 was non-trivial. We
handle I/O errors at a higher level in our toolstack anyway so this is
not an issue for us.

Ian.

On Mon, 2008-10-20 at 17:39 +0100, Ian Campbell wrote:
> On Mon, 2008-10-20 at 15:25 +0100, Ian Campbell wrote:
> > On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> > > On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > > > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > > > I don't see how this could be an RPC bug. The networking layer is
> > > > > supposed to either copy the data sent to the socket, or take a reference
> > > > > to any pages that are pushed via the ->sendpage() abi.
> > > > >
> > > > > IOW: the pages are supposed to be still referenced by the networking
> > > > > layer even if the NFS layer and page cache have dropped their
> > > > > references.
> > > >
> > > > The pages are still referenced by the networking layer. The problem is
> > > > that the userspace app has been told that the write has completed so it
> > > > is free to write new data to those pages.
> > > >
> > > > Ian.
> > >
> > > OK, I see your point.
> > >
> > > Does this happen at all with NFSv4? I ask because the NFSv4 client will
> > > always ensure that the TCP connection gets broken before a
> > > retransmission. I wouldn't therefore expect any races between a reply to
> > > the previous transmission and the new one...
> >
> > It does seem to happen with NFSv4 too (see attached).
>
> Actually, that was NFSv4 on 2.6.18; I guess I should test with something
> newer, since it looks like the TCP connection reset behaviour is more
> recent (it's 43d78ef2ba5bec26d0315859e8324bfc0be23766, right?)
>