I'm trying to resolve a fatal bug that happens with Linux 3.2.0-32-generic
(Ubuntu variant of 3.2), and the magic combination of
1. NFSv4
2. AIO from Qemu
3. Xen with upstream qemu DM
4. QCOW plus backing file.
The background is here:
http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
It is completely reproducible on different NFS client hardware. We've
tried other kernels to no avail.
The bug is quite nasty in that dom0 crashes fatally due to a VM action.
Within the link, you'll see references to an issue found by Ian Campbell
a while ago, which turned out to be an NFS issue independent of Xen, though
apparently not present in NFSv4. The links are:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=640941
http://marc.info/?l=linux-nfs&m=122424132729720&w=2
In essence, my understanding of what appears to be happening (which
may be entirely wrong) is:
1. Xen 4.2 HVM domU VM has a PV disk driver
2. domU writes a page
3. Xen maps domU's page into dom0's VM space
4. Xen asks Qemu (userspace) to write a page
5. Qemu's disk (and backing file for the disk) are on NFSv4
6. Qemu uses AIO to write the page to NFS
7. AIO claims the page write is complete
8. Qemu marks the write as complete
9. Xen unmaps the page from dom0's VM space
10. Apparently, the write is not actually complete at this
point
11. TCP retransmit is triggered (not quite sure why, possibly
due to slow filer)
12. TCP goes to resend the page, and finds it's not in dom0
memory.
13. Bang
The Xen folks think this has nothing to do with either Xen or QEMU, and
believe the problem is AIO on NFS. The links to the earlier investigations
suggest this is (or was) true and was fixed, but perhaps not for NFSv4; an
NFSv4 case may have been missed.
Against this explanation:
a) it does not happen in KVM (again with QEMU doing AIO to
NFS) - though here the page mapping fanciness doesn't
happen as KVM VMs share the same memory space as the kernel
as I understand it.
b) it does not happen on Xen without a qcow backing file (though
that may just be what's necessary timing-wise to trigger
the race condition).
Any insight you have would be appreciated.
Specifically, the question I'd ask is as follows. Is it correct behaviour
that Linux+NFSv4 marks an AIO request completed when all the relevant data
may have been sent by TCP but not yet ACK'd? If so, how is Linux meant to
deal with retransmits? Are the pages referenced by the TCP stack meant to
be marked COW or something? What is meant to happen if those pages get
removed from the memory map entirely?
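For concreteness, the I/O pattern in question looks roughly like the sketch
below (my own illustration using libaio, not QEMU's actual code; the path,
alignment and sizes are arbitrary). The buffer is reused as soon as
io_getevents() reports completion, which is exactly the step that is unsafe
if the kernel can still need those pages for a retransmit:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd;

        fd = open("/mnt/nfs/testfile",
                  O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0 || io_setup(1, &ctx) < 0)
                return 1;
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        memset(buf, 0xaa, 4096);

        /* submit one O_DIRECT write via native AIO */
        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1)
                return 1;

        /* AIO says the write is complete... */
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                return 1;

        /* ...so the caller reuses (or, under Xen, unmaps) the buffer.
         * If the kernel still references the page for a TCP/RPC
         * retransmit, the data on the wire can change -- or, if the
         * page has gone entirely, the kernel oopses as in the trace
         * below. */
        memset(buf, 0x55, 4096);

        free(buf);
        io_destroy(ctx);
        close(fd);
        return 0;
}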
As an aside, we're looking for someone to fix this (and things like it) on
a contract basis. Contact me off list if interested.
--
Alex Bligh
Kernel 3.2.0-32-generic on an x86_64
[ 1416.992402] BUG: unable to handle kernel paging request at
ffff88073fee6e00
[ 1416.992902] IP: [<ffffffff81318e2b>] memcpy+0xb/0x120
[ 1416.993244] PGD 1c06067 PUD 7ec73067 PMD 7ee73067 PTE 0
[ 1416.993985] Oops: 0000 [#1] SMP
[ 1416.994433] CPU 4
[ 1416.994587] Modules linked in: xt_physdev xen_pciback xen_netback
xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs veth ip6t_LOG
nf_conntrack_ipv6 nf_
defrag_ipv6 ip6table_filter ip6_tables ipt_LOG xt_limit xt_state
xt_tcpudp nf_conntrack_netlink nfnetlink ebt_ip ebtable_filter
iptable_mangle ipt_MASQUERADE
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
iptable_filter ip_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi ebtable_broute ebtables
x_tables dcdbas psmouse serio_raw amd64_edac_mod usbhid hid edac_core
sp5100_tco i2c_piix
4 edac_mce_amd fam15h_power k10temp igb bnx2 acpi_power_meter mac_hid
dm_multipath bridge 8021q garp stp ixgbe dca mdio nfsd nfs lockd fscache
auth_rpcgss nf
s_acl sunrpc [last unloaded: scsi_transport_iscsi]
[ 1417.005011]
[ 1417.005011] Pid: 0, comm: swapper/4 Tainted: G        W
3.2.0-32-generic #51-Ubuntu Dell Inc. PowerEdge R715/0C5MMK
[ 1417.005011] RIP: e030:[<ffffffff81318e2b>]  [<ffffffff81318e2b>]
memcpy+0xb/0x120
[ 1417.005011] RSP: e02b:ffff880060083b08  EFLAGS: 00010246
[ 1417.005011] RAX: ffff88001e12c9e4 RBX: 0000000000000210 RCX:
0000000000000040
[ 1417.005011] RDX: 0000000000000000 RSI: ffff88073fee6e00 RDI:
ffff88001e12c9e4
[ 1417.005011] RBP: ffff880060083b70 R08: 00000000000002e8 R09:
0000000000000200
[ 1417.005011] R10: ffff88001e12c9e4 R11: 0000000000000280 R12:
00000000000000e8
[ 1417.005011] R13: ffff88004b014c00 R14: ffff88004b532000 R15:
0000000000000001
[ 1417.005011] FS:  00007f1a99089700(0000) GS:ffff880060080000(0000)
knlGS:0000000000000000
[ 1417.005011] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
[ 1417.005011] CR2: ffff88073fee6e00 CR3: 0000000015d22000 CR4:
0000000000040660
[ 1417.005011] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 1417.005011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 1417.005011] Process swapper/4 (pid: 0, threadinfo ffff88004b532000,
task ffff88004b538000)
[ 1417.005011] Stack:
[ 1417.005011]  ffffffff81532c0e 0000000000000000 ffff8800000002e8
ffff880000000200
[ 1417.005011]  ffff88001e12c9e4 0000000000000200 ffff88004b533fd8
ffff880060083ba0
[ 1417.005011]  ffff88004b015800 ffff88004b014c00 ffff88001b142000
00000000000000fc
[ 1417.005011] Call Trace:
[ 1417.005011]  <IRQ>
[ 1417.005011]  [<ffffffff81532c0e>] ? skb_copy_bits+0x16e/0x2c0
[ 1417.005011]  [<ffffffff8153463a>] skb_copy+0x8a/0xb0
[ 1417.005011]  [<ffffffff8154b517>] neigh_probe+0x37/0x80
[ 1417.005011]  [<ffffffff8154b9db>] __neigh_event_send+0xbb/0x210
[ 1417.005011]  [<ffffffff8154bc73>] neigh_resolve_output+0x143/0x1f0
[ 1417.005011]  [<ffffffff8156dde5>] ? nf_hook_slow+0x75/0x150
[ 1417.005011]  [<ffffffff8157a510>] ? ip_fragment+0x810/0x810
[ 1417.005011]  [<ffffffff8157a68e>] ip_finish_output+0x17e/0x2f0
[ 1417.005011]  [<ffffffff81533ddb>] ? __alloc_skb+0x4b/0x240
[ 1417.005011]  [<ffffffff8157b1e8>] ip_output+0x98/0xa0
[ 1417.005011]  [<ffffffff8157a8a4>] ? __ip_local_out+0xa4/0xb0
[ 1417.005011]  [<ffffffff8157a8d9>] ip_local_out+0x29/0x30
[ 1417.005011]  [<ffffffff8157aa3c>] ip_queue_xmit+0x15c/0x410
[ 1417.005011]  [<ffffffff81595840>] ? tcp_retransmit_timer+0x440/0x440
[ 1417.005011]  [<ffffffff81592c69>] tcp_transmit_skb+0x359/0x580
[ 1417.005011]  [<ffffffff81593be1>] tcp_retransmit_skb+0x171/0x310
[ 1417.005011]  [<ffffffff8159561b>] tcp_retransmit_timer+0x21b/0x440
[ 1417.005011]  [<ffffffff81595928>] tcp_write_timer+0xe8/0x110
[ 1417.005011]  [<ffffffff81595840>] ? tcp_retransmit_timer+0x440/0x440
[ 1417.005011]  [<ffffffff81075d36>] call_timer_fn+0x46/0x160
[ 1417.005011]  [<ffffffff81595840>] ? tcp_retransmit_timer+0x440/0x440
[ 1417.005011]  [<ffffffff81077682>] run_timer_softirq+0x132/0x2a0
[ 1417.005011]  [<ffffffff8106e5d8>] __do_softirq+0xa8/0x210
[ 1417.005011]  [<ffffffff813a94b7>] ? __xen_evtchn_do_upcall+0x207/0x250
[ 1417.005011]  [<ffffffff816656ac>] call_softirq+0x1c/0x30
[ 1417.005011]  [<ffffffff81015305>] do_softirq+0x65/0xa0
[ 1417.005011]  [<ffffffff8106e9be>] irq_exit+0x8e/0xb0
[ 1417.005011]  [<ffffffff813ab595>] xen_evtchn_do_upcall+0x35/0x50
[ 1417.005011]  [<ffffffff816656fe>] xen_do_hypervisor_callback+0x1e/0x30
[ 1417.005011]  <EOI>
[ 1417.005011]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[ 1417.005011]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[ 1417.005011]  [<ffffffff8100a2d0>] ? xen_safe_halt+0x10/0x20
[ 1417.005011]  [<ffffffff8101b983>] ? default_idle+0x53/0x1d0
[ 1417.005011]  [<ffffffff81012236>] ? cpu_idle+0xd6/0x120
[ 1417.005011]  [<ffffffff8100ab29>] ? xen_irq_enable_direct_reloc+0x4/0x4
[ 1417.005011]  [<ffffffff8163369c>] ? cpu_bringup_and_idle+0xe/0x10
[ 1417.005011] Code: 58 48 2b 43 50 88 43 4e 48 83 c4 08 5b 5d c3 90 e8
1b fe ff ff eb e6 90 90 90 90 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83
e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c
[ 1417.005011] RIP  [<ffffffff81318e2b>] memcpy+0xb/0x120
[ 1417.005011]  RSP <ffff880060083b08>
[ 1417.005011] CR2: ffff88073fee6e00
[ 1417.005011] ---[ end trace ae4e7f56ea0665fe ]---
[ 1417.005011] Kernel panic - not syncing: Fatal exception in interrupt
[ 1417.005011] Pid: 0, comm: swapper/4 Tainted: G      D W
3.2.0-32-generic #51-Ubuntu
[ 1417.005011] Call Trace:
[ 1417.005011]  <IRQ>  [<ffffffff81642197>] panic+0x91/0x1a4
[ 1417.005011]  [<ffffffff8165c01a>] oops_end+0xea/0xf0
[ 1417.005011]  [<ffffffff81641027>] no_context+0x150/0x15d
[ 1417.005011]  [<ffffffff816411fd>] __bad_area_nosemaphore+0x1c9/0x1e8
[ 1417.005011]  [<ffffffff81640835>] ? pte_offset_kernel+0x13/0x3c
[ 1417.005011]  [<ffffffff8164122f>] bad_area_nosemaphore+0x13/0x15
[ 1417.005011]  [<ffffffff8165ec36>] do_page_fault+0x426/0x520
[ 1417.005011]  [<ffffffff8165b0ce>] ? _raw_spin_lock_irqsave+0x2e/0x40
[ 1417.005011]  [<ffffffff81059d8a>] ? get_nohz_timer_target+0x5a/0xc0
[ 1417.005011]  [<ffffffff8165b04e>] ?
_raw_spin_unlock_irqrestore+0x1e/0x30
[ 1417.005011]  [<ffffffff81077f93>] ? mod_timer_pending+0x113/0x240
[ 1417.005011]  [<ffffffffa0317f34>] ? __nf_ct_refresh_acct+0xd4/0x100
[nf_conntrack]
[ 1417.005011]  [<ffffffff8165b5b5>] page_fault+0x25/0x30
[ 1417.005011]  [<ffffffff81318e2b>] ? memcpy+0xb/0x120
[ 1417.005011]  [<ffffffff81532c0e>] ? skb_copy_bits+0x16e/0x2c0
[ 1417.005011]  [<ffffffff8153463a>] skb_copy+0x8a/0xb0
[ 1417.005011]  [<ffffffff8154b517>] neigh_probe+0x37/0x80
[ 1417.005011]  [<ffffffff8154b9db>] __neigh_event_send+0xbb/0x210
[ 1417.005011]  [<ffffffff8154bc73>] neigh_resolve_output+0x143/0x1f0
[ 1417.005011]  [<ffffffff8156dde5>] ? nf_hook_slow+0x75/0x150
[ 1417.005011]  [<ffffffff8157a510>] ? ip_fragment+0x810/0x810
[ 1417.005011]  [<ffffffff8157a68e>] ip_finish_output+0x17e/0x2f0
[ 1417.005011]  [<ffffffff81533ddb>] ? __alloc_skb+0x4b/0x240
[ 1417.005011]  [<ffffffff8157b1e8>] ip_output+0x98/0xa0
[ 1417.005011]  [<ffffffff8157a8a4>] ? __ip_local_out+0xa4/0xb0
[ 1417.005011]  [<ffffffff8157a8d9>] ip_local_out+0x29/0x30
[ 1417.005011]  [<ffffffff8157aa3c>] ip_queue_xmit+0x15c/0x410
[ 1417.005011]  [<ffffffff81595840>] ? tcp_retransmit_timer+0x440/0x440
[ 1417.005011]  [<ffffffff81592c69>] tcp_transmit_skb+0x359/0x580
[ 1417.005011]  [<ffffffff81593be1>] tcp_retransmit_skb+0x171/0x310
[ 1417.005011]  [<ffffffff8159561b>] tcp_retransmit_timer+0x21b/0x440
[ 1417.005011]  [<ffffffff81595928>] tcp_write_timer+0xe8/0x110
[ 1417.005011]  [<ffffffff81595840>] ? tcp_retransmit_timer+0x440/0x440
[ 1417.005011]  [<ffffffff81075d36>] call_timer_fn+0x46/0x160
[ 1417.005011]  [<ffffffff81595840>] ? tcp_retransmit_timer+0x440/0x440
[ 1417.005011]  [<ffffffff81077682>] run_timer_softirq+0x132/0x2a0
[ 1417.005011]  [<ffffffff8106e5d8>] __do_softirq+0xa8/0x210
[ 1417.005011]  [<ffffffff813a94b7>] ? __xen_evtchn_do_upcall+0x207/0x250
[ 1417.005011]  [<ffffffff816656ac>] call_softirq+0x1c/0x30
[ 1417.005011]  [<ffffffff81015305>] do_softirq+0x65/0xa0
[ 1417.005011]  [<ffffffff8106e9be>] irq_exit+0x8e/0xb0
[ 1417.005011]  [<ffffffff813ab595>] xen_evtchn_do_upcall+0x35/0x50
[ 1417.005011]  [<ffffffff816656fe>] xen_do_hypervisor_callback+0x1e/0x30
[ 1417.005011]  <EOI>  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[ 1417.005011]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[ 1417.005011]  [<ffffffff8100a2d0>] ? xen_safe_halt+0x10/0x20
[ 1417.005011]  [<ffffffff8101b983>] ? default_idle+0x53/0x1d0
[ 1417.005011]  [<ffffffff81012236>] ? cpu_idle+0xd6/0x120
[ 1417.005011]  [<ffffffff8100ab29>] ? xen_irq_enable_direct_reloc+0x4/0x4
[ 1417.005011]  [<ffffffff8163369c>] ? cpu_bringup_and_idle+0xe/0x10
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
Trond,
--On 21 January 2013 15:50:48 +0000 "Myklebust, Trond"
<[email protected]> wrote:
>> I don't think QEMU is actually using O_DIRECT unless I set cache=none
>> on the drive. That causes a different interesting failure which isn't
>> my focus just now!
>
> Then your reference to Ian's bug is a red herring.
>
> If the application is using buffered writes, then the data is
> immediately copied from userspace to the page cache. Once the copy to
> the page cache is done, userspace can do whatever it wants with the
> original buffer, because only the page cache pages are used in the RPC
> calls.
>
> aio doesn't change any of this...
So, just to be clear, if a process is using NFS and AIO with O_DSYNC
(but not O_DIRECT) - which is I think what QEMU is meant to be doing -
then it should *never* be zero copy (even if writes happen to be
appropriately aligned). Is that correct? If so, I can strace the
process and see exactly what flags it is using.
--
Alex Bligh
Trond,
--On 21 January 2013 17:20:36 +0000 "Myklebust, Trond"
<[email protected]> wrote:
> That is correct. If you want zero-copy, then O_DIRECT is your thing
> (with or without aio). Otherwise, the kernel will always write to disk
> by copying through the page cache.
Thanks. From the crash dump it appeared to be accessing mapped memory
somehow, so I'll see whether QEMU is in fact using O_DIRECT contrary
to documentation.
--
Alex Bligh
Trond,
--On 21 January 2013 14:38:20 +0000 "Myklebust, Trond"
<[email protected]> wrote:
> The Oops would be due to a bug in the socket layer: the socket is
> supposed to take a reference count on the page in order to ensure that
> it can copy the contents.
Looking at the original linux-nfs link, you said here:
http://marc.info/?l=linux-nfs&m=122424789508577&w=2
Trond:> I don't see how this could be an RPC bug. The networking
Trond:> layer is supposed to either copy the data sent to the socket,
Trond:> or take a reference to any pages that are pushed via
Trond:> the ->sendpage() abi.
which sounds suspiciously like the same thing.
The conversation then went:
http://marc.info/?l=linux-nfs&m=122424858109731&w=2
Ian:> The pages are still referenced by the networking layer. The problem is
Ian:> that the userspace app has been told that the write has completed so
Ian:> it is free to write new data to those pages.
To which you replied:
http://marc.info/?l=linux-nfs&m=122424984612130&w=2
Trond:> OK, I see your point.
Following the thread, it then seems that Ian's test case did fail on
NFS4 on 2.6.18, but not on 2.6.27.
Note that Ian was seeing something slightly different from me. I think
what he was seeing was alterations made to the page after AIO completed
being retransmitted, when the page contents prior to the alteration should
have been transmitted. That could presumably be fixed by some COW mechanism.
What I'm seeing is more subtle. Xen thinks (because QEMU tells it,
because AIO tells it) that the memory is done with entirely, and
simply unmaps it. I don't think that's Qemu's fault.
If it is a referencing issue, then it seems to me the problem is
that Xen is releasing the grant structure (I don't quite understand
how this bit works) and unmapping memory when the networking stack
still holds a reference to the page concerned. However, even if it
did not do that, wouldn't a retransmit after the write had completed
risk writing the wrong data? I suppose it could mark the page
COW before it released the grant or something.
> As for the O_DIRECT bug, the problem there is that we have no way of
> knowing when the socket is done writing the page. Just because we got an
> answer from the server doesn't mean that the socket is done
> retransmitting the data. It is quite possible that the server is just
> replying to the first transmission.
I don't think QEMU is actually using O_DIRECT unless I set cache=none
on the drive. That causes a different interesting failure which isn't
my focus just now!
> I thought that Ian was working on a fix for this issue. At one point, he
> had a bunch of patches to allow sendpage() to call you back when the
> transmission was done. What happened to those patches?
No idea (I don't work with Ian but have taken the liberty of copying him).
However, what's happened in the intervening years is that Xen has changed
its device model and it's now QEMU doing the writing (the qcow2 driver
specifically). I'm not sure it's even using sendpage.
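For reference, a zero-copy send down the sendpage path looks roughly like
the fragment below (an illustrative sketch, not actual sunrpc or Xen code).
The point is that the page goes into the socket by reference, with no copy
of the data, so whatever is in that page -- or whether it is still mapped
at all -- at (re)transmission time is what matters:

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/mm.h>

/* Illustrative only: hand a caller-owned page to the socket without
 * copying it.  kernel_sendpage() takes its own reference on the page
 * and links it into an skb fragment, so the data is read from this
 * same page again if the segment is later retransmitted. */
static int send_payload_page(struct socket *sock, struct page *page,
                             int offset, size_t len)
{
        return kernel_sendpage(sock, page, offset, len, MSG_MORE);
}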
--
Alex Bligh
On Wed, 2013-01-23 at 19:37 +0000, Alex Bligh wrote:
>
> --On 23 January 2013 18:13:34 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
> >> They can't disappear until they have been successfully transmitted and a
> >> response received. The problem here is that there were two requests
> >> sent or being sent and the page(s) can't be released until everyone,
> >> including TCP and such, are done with them.
> >>
> >> ps
> >
> > Right. The O_DIRECT write() system call will not return until it gets a
> > reply. Similarly, we don't mark an aio/dio request as complete until it
> > too gets a reply. So the data for those requests that need
> > retransmission is still available to be resent through the socket.
>
> I apologise for my stupidity here as I think I must be missing something.
>
> I thought we'd established that Xen's grant system doesn't release the page
> until QEMU says the block I/O is complete. QEMU only states that the block
> I/O is complete when AIO says it is.
That's correct. Xen and qemu maintain the mapping until the kernel says
the I/O is complete. To do otherwise would be a bug.
> What's happening (as far as I can tell
> from the oops) is that the grant system is releasing the page AFTER the aio
> request is complete (and dio may the same), but at that stage the page is
> still referenced by the tcp stack. That contradicts what you say about not
> marking the aio/dio request as complete until it gets a reply, unless it's
> the case that you can get a reply to a request when there is still data
> that the TCP stack can ask to retransmit (I suppose that's conceivable
> if the reply gets sent before the ACK of the data received).
This is exactly what can happen:
1. send request (A)
2. timeout waiting for ACK to (A)
3. queue TCP retransmit of (A) as (B)
4. receive ACK to original (A), sent at #1, and rpc reply to that
request.
5. return success to userspace
6. userspace reuses (or unmaps under Xen) the buffer
7. (B), queued at #3, reaches the head of the queue
8. Try to transmit (B), bug has now happened.
You can also s/TCP/RPC/ and construct a similar issue at the next layer
of the stack, which only happens on NFSv3 AIUI.
> My understanding (which may well be completely wrong) is that the problem
> was that xen was unmapping the page even though it still had kernel
> references to it. This is why the problem does not happen in kvm (which
> does not as I understand it do a similar map/unmap operation). From Ian C I
> understand that just looking at the number of kernel references is not
> sufficient.
Under any userspace process (which includes KVM) you get retransmission
of data which may have changed, because userspace believes the kernel
when it has said it is done with it, and has reused the buffer. All that
is different under Xen is that "changed" can mean "unmapped" which makes
the symptom much worse.
Ian.
Trond,
--On 21 January 2013 17:20:36 +0000 "Myklebust, Trond"
<[email protected]> wrote:
>> So, just to be clear, if a process is using NFS and AIO with O_DSYNC
>> (but not O_DIRECT) - which is I think what QEMU is meant to be doing -
>> then it should *never* be zero copy (even if writes happen to be
>> appropriately aligned). Is that correct? If so, I can strace the
>> process and see exactly what flags it is using.
>>
>
> That is correct. If you want zero-copy, then O_DIRECT is your thing
> (with or without aio). Otherwise, the kernel will always write to disk
> by copying through the page cache.
Just to follow up on this, QEMU (specifically hw/xen_disk.c) was using
O_DIRECT. If O_DIRECT is turned off, we get an additional page copy
but the bug does not appear.
It thus appears that the root of the problem is that if an AIO NFS
request is made with O_DIRECT, AIO can report the request as complete
even though a segment may still need to be retransmitted; and whilst the
TCP stack correctly holds a reference to the page concerned, that does
not currently prevent Xen from unmapping it, as Xen thinks the I/O
has completed.
I believe this problem may apply to iSCSI and for that matter (e.g.)
DRBD too.
--
Alex Bligh
On Wed, 2013-01-23 at 15:22 +0000, Alex Bligh wrote:
> Trond,
>
> --On 21 January 2013 17:20:36 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
> >> So, just to be clear, if a process is using NFS and AIO with O_DSYNC
> >> (but not O_DIRECT) - which is I think what QEMU is meant to be doing -
> >> then it should *never* be zero copy (even if writes happen to be
> >> appropriately aligned). Is that correct? If so, I can strace the
> >> process and see exactly what flags it is using.
> >>
> >
> > That is correct. If you want zero-copy, then O_DIRECT is your thing
> > (with or without aio). Otherwise, the kernel will always write to disk
> > by copying through the page cache.
>
> Just to follow up on this, QEMU (specifically hw/xen_disk.c) was using
> O_DIRECT. If O_DIRECT is turned off, we get an additional page copy
> but the bug does not appear.
>
> It thus appears that the root of the problem is that if an AIO NFS
> request is made with O_DIRECT, AIO can report the request is completed
> even when the segment may need to be retransmitted, and whilst the
> TCP stack correctly holds a reference to the page concerned, this
> is not currently preventing Xen unmapping it as Xen thinks the IO
> has completed.
It is not limited to aio/dio. It can happen with ordinary synchronous
O_DIRECT too.
As I said, it is a known problem and is one of the reasons why we want
to set retransmission timeouts to a high value. The real fix would be to
implement something along the lines of Ian's patchset.
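(For NFS over TCP the relevant knobs are the timeo/retrans mount options;
the values below are purely illustrative, timeo being in tenths of a
second:)

mount -o vers=4,proto=tcp,timeo=600,retrans=2 filer:/export /mnt/nfs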
> I believe this problem may apply to iSCSI and for that matter (e.g.)
> DRBD too.
I've no idea if they do zero copy to the socket in these situations. If
they do, then they probably have similar issues. The problem can be
mitigated by breaking the connection on retransmission; we can't do that
in NFS < NFSv4.1, since the duplicate replay cache is typically indexed
to the port number (and port number reuse is difficult with TCP due to
the existence of the TIME_WAIT state).
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
Trond,
--On 23 January 2013 17:37:08 +0000 "Myklebust, Trond"
<[email protected]> wrote:
>> If you break the connection, and the written data is now not available
>> to dom0 (as it's been mapped out), how would it ever get resent? IE
>> it's not going to be available to the RPC layer either.
>
> There are typically more than 1 outstanding RPC call at any one time.
> Breaking the connection would affect those other RPC calls.
I understand that. I meant 'how does the write which failed ever
get retransmitted now the data to be written has been lost?'
--
Alex Bligh
On Wed, 2013-01-23 at 17:33 +0000, Alex Bligh wrote:
>
> --On 23 January 2013 15:34:42 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
> > I've no idea if they do zero copy to the socket in these situations. If
> > they do, then they probably have similar issues. The problem can be
> > mitigated by breaking the connection on retransmission; we can't do that
> > in NFS < NFSv4.1, since the duplicate replay cache is typically indexed
> > to the port number (and port number reuse is difficult with TCP due to
> > the existence of the TIME_WAIT state).
>
> If you break the connection, and the written data is now not available
> to dom0 (as it's been mapped out), how would it ever get resent? IE
> it's not going to be available to the RPC layer either.
There are typically more than 1 outstanding RPC call at any one time.
Breaking the connection would affect those other RPC calls.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
Ian,
--On 24 January 2013 10:42:11 +0000 Ian Campbell <[email protected]>
wrote:
> This is exactly what can happen:
>
> 1. send request (A)
> 2. timeout waiting for ACK to (A)
> 3. queue TCP retransmit of (A) as (B)
> 4. receive ACK to original (A), sent at #1, and rpc reply to that
> request.
> 5. return success to userspace
> 6. userspace reuses (or unmaps under Xen) the buffer
> 7. (B), queued at #3, reaches the head of the queue
> 8. Try to transmit (B), bug has now happened.
>
> You can also s/TCP/RPC/ and construct a similar issue at the next layer
> of the stack, which only happens on NFSv3 AIUI.
Got it - finally! Thanks for your patience in explaining.
I am guessing a simpler fix for the tcp retransmit problem would be
to copy (or optionally copy) the page(s) for B at step 3. Given tcp
retransmit is infrequent and performance is not going to be good
with tcp retransmissions going on anyway, that might be acceptable.
However, in practice *anything* that causes multiple references to
a page in the networking stack is going to have this problem,
and multiple skbuffs can refer to the same page, which I presume
is why you were fixing this with skbuff reference counting, so that
you know you can do (5) only when the skbuff is entirely unreferenced
(i.e. after (8)).
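Something like the sketch below is the shape of that simpler idea (a rough
illustration of mine, not a patch against any particular tree): before
retransmitting, substitute a private copy for any skb whose paged fragments
still point at caller-owned pages, so that later changes to -- or unmapping
of -- those pages cannot affect what goes on the wire.

#include <linux/skbuff.h>

/* Illustrative only.  skb_copy() duplicates the linear data and every
 * paged fragment into freshly allocated memory, unlike skb_clone(),
 * which shares the data with the original skb.  Returning a private
 * copy here means a retransmit no longer depends on the original pages. */
static struct sk_buff *copy_before_retransmit(struct sk_buff *skb, gfp_t gfp)
{
        if (!skb_shinfo(skb)->nr_frags)
                return skb;     /* no paged (zero-copy) fragments to worry about */

        return skb_copy(skb, gfp);
}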
>> My understanding (which may well be completely wrong) is that the problem
>> was that xen was unmapping the page even though it still had kernel
>> references to it. This is why the problem does not happen in kvm (which
>> does not as I understand it do a similar map/unmap operation). From Ian
>> C I understand that just looking at the number of kernel references is
>> not sufficient.
>
> Under any userspace process (which includes KVM) you get retransmission
> of data which may have changed, because userspace believes the kernel
> when it has said it is done with it, and has reused the buffer. All that
> is different under Xen is that "changed" can mean "unmapped" which makes
> the symptom much worse.
Indeed. And kvm by default does not use O_DIRECT whereas xen_disk.c does,
so kvm will hide the problem, as a copy is performed at (1).
--
Alex Bligh
Hi.
The theory is that the modified pages which need to get written to the server can't be released until they have been successfully written and committed to stable storage on the server.
They can't disappear until they have been successfully transmitted and a response received. The problem here is that there were two requests sent or being sent and the page(s) can't be released until everyone, including TCP and such, are done with them.
ps
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Alex Bligh
Sent: Wednesday, January 23, 2013 12:42 PM
To: Myklebust, Trond
Cc: [email protected]; [email protected]; Alex Bligh
Subject: Re: Fatal crash with NFS, AIO & tcp retransmit
Trond,
--On 23 January 2013 17:37:08 +0000 "Myklebust, Trond"
<[email protected]> wrote:
>> If you break the connection, and the written data is now not
>> available to dom0 (as it's been mapped out), how would it ever get
>> resent? IE it's not going to be available to the RPC layer either.
>
> There are typically more than 1 outstanding RPC call at any one time.
> Breaking the connection would affect those other RPC calls.
I understand that. I meant 'how does the write which failed ever get retransmitted now the data to be written has been lost?'
--
Alex Bligh
On Wed, 2013-01-23 at 12:48 -0500, Peter Staubach wrote:
> Hi.
>
> The theory is that the modified pages which need to get written to the server can't be released until they have been successfully written and committed to stable storage on the server.
>
> They can't disappear until they have been successfully transmitted and a response received. The problem here is that there were two requests sent or being sent and the page(s) can't be released until everyone, including TCP and such, are done with them.
>
> ps
Right. The O_DIRECT write() system call will not return until it gets a
reply. Similarly, we don't mark an aio/dio request as complete until it
too gets a reply. So the data for those requests that need
retransmission is still available to be resent through the socket.
Cheers
Trond
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Alex Bligh
> Sent: Wednesday, January 23, 2013 12:42 PM
> To: Myklebust, Trond
> Cc: [email protected]; [email protected]; Alex Bligh
> Subject: Re: Fatal crash with NFS, AIO & tcp retransmit
>
> Trond,
>
> --On 23 January 2013 17:37:08 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
> >> If you break the connection, and the written data is now not
> >> available to dom0 (as it's been mapped out), how would it ever get
> >> resent? IE it's not going to be available to the RPC layer either.
> >
> > There are typically more than 1 outstanding RPC call at any one time.
> > Breaking the connection would affect those other RPC calls.
>
> I understand that. I meant 'how does the write which failed ever get retransmitted now the data to be written has been lost?'
>
> --
> Alex Bligh
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, 2013-01-21 at 15:10 +0000, Alex Bligh wrote:
> >> I thought that Ian was working on a fix for this issue. At one point, he
> >> had a bunch of patches to allow sendpage() to call you back when the
> >> transmission was done. What happened to those patches?
> >
> > No idea (I don't work with Ian but have taken the liberty of copy him).
I'm afraid I've rather dropped the ball on these patches due to lack of
time.
Ian.
--On 23 January 2013 18:13:34 +0000 "Myklebust, Trond"
<[email protected]> wrote:
>> They can't disappear until they have been successfully transmitted and a
>> response received. The problem here is that there were two requests
>> sent or being sent and the page(s) can't be released until everyone,
>> including TCP and such, are done with them.
>>
>> ps
>
> Right. The O_DIRECT write() system call will not return until it gets a
> reply. Similarly, we don't mark an aio/dio request as complete until it
> too gets a reply. So the data for those requests that need
> retransmission is still available to be resent through the socket.
I apologise for my stupidity here as I think I must be missing something.
I thought we'd established that Xen's grant system doesn't release the page
until QEMU says the block I/O is complete. QEMU only states that the block
I/O is complete when AIO says it is. What's happening (as far as I can tell
from the oops) is that the grant system is releasing the page AFTER the aio
request is complete (and dio may the same), but at that stage the page is
still referenced by the tcp stack. That contradicts what you say about not
marking the aio/dio request as complete until it gets a reply, unless it's
the case that you can get a reply to a request when there is still data
that the TCP stack can ask to retransmit (I suppose that's conceivable
if the reply gets sent before the ACK of the data received).
My understanding (which may well be completely wrong) is that the problem
was that xen was unmapping the page even though it still had kernel
references to it. This is why the problem does not happen in kvm (which
does not as I understand it do a similar map/unmap operation). From Ian C I
understand that just looking at the number of kernel references is not
sufficient.
--
Alex Bligh
--On 23 January 2013 15:34:42 +0000 "Myklebust, Trond"
<[email protected]> wrote:
> I've no idea if they do zero copy to the socket in these situations. If
> they do, then they probably have similar issues. The problem can be
> mitigated by breaking the connection on retransmission; we can't do that
> in NFS < NFSv4.1, since the duplicate replay cache is typically indexed
> to the port number (and port number reuse is difficult with TCP due to
> the existence of the TIME_WAIT state).
If you break the connection, and the written data is now not available
to dom0 (as it's been mapped out), how would it ever get resent? IE
it's not going to be available to the RPC layer either.
--
Alex Bligh
On Mon, 2013-01-21 at 15:50 +0000, Myklebust, Trond wrote:
> On Mon, 2013-01-21 at 15:01 +0000, Alex Bligh wrote:
> > Trond,
> >
> > --On 21 January 2013 14:38:20 +0000 "Myklebust, Trond"
> > <[email protected]> wrote:
> >
> > > The Oops would be due to a bug in the socket layer: the socket is
> > > supposed to take a reference count on the page in order to ensure that
> > > it can copy the contents.
> >
> > Looking at the original linux-nfs link, you said here:
> > http://marc.info/?l=linux-nfs&m=122424789508577&w=2
> >
> > Trond:> I don't see how this could be an RPC bug. The networking
> > Trond:> layer is supposed to either copy the data sent to the socket,
> > Trond:> or take a reference to any pages that are pushed via
> > Trond:> the ->sendpage() abi.
> >
> > which sounds suspiciously like the same thing.
> >
> > The conversation then went:
> > http://marc.info/?l=linux-nfs&m=122424858109731&w=2
> > Ian:> The pages are still referenced by the networking layer. The problem is
> > Ian:> that the userspace app has been told that the write has completed so
> > Ian:> it is free to write new data to those pages.
> >
> > To which you replied:
> > http://marc.info/?l=linux-nfs&m=122424984612130&w=2
> > Trond:> OK, I see your point.
>
> The original thread did not AFAICR involve an Oops. If you are seeing an
> Oops, then that is something new and would be a socket level bug.
The oops would be Xen specific: in the case where on native you would
touch the buffer after a write completed and potentially resend changed
data, on Xen you would see an unmapped address.
The underlying issue is the same, just the consequence on Xen is a bit
more obvious.
[...]
> > I don't think QEMU is actually using O_DIRECT unless I set cache=none
> > on the drive. That causes a different interesting failure which isn't
> > my focus just now!
>
> Then your reference to Ian's bug is a red herring.
Agreed, if there is no zero copy going on then this is a separate issue.
Ian.
On Mon, 2013-01-21 at 17:12 +0000, Alex Bligh wrote:
> Trond,
>
> --On 21 January 2013 15:50:48 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
> >> I don't think QEMU is actually using O_DIRECT unless I set cache=none
> >> on the drive. That causes a different interesting failure which isn't
> >> my focus just now!
> >
> > Then your reference to Ian's bug is a red herring.
> >
> > If the application is using buffered writes, then the data is
> > immediately copied from userspace to the page cache. Once the copy to
> > the page cache is done, userspace can do whatever it wants with the
> > original buffer, because only the page cache pages are used in the RPC
> > calls.
> >
> > aio doesn't change any of this...
>
> So, just to be clear, if a process is using NFS and AIO with O_DSYNC
> (but not O_DIRECT) - which is I think what QEMU is meant to be doing -
> then it should *never* be zero copy (even if writes happen to be
> appropriately aligned). Is that correct? If so, I can strace the
> process and see exactly what flags it is using.
>
That is correct. If you want zero-copy, then O_DIRECT is your thing
(with or without aio). Otherwise, the kernel will always write to disk
by copying through the page cache.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
--On 21 January 2013 15:21:25 +0000 Ian Campbell <[email protected]>
wrote:
>> >> I thought that Ian was working on a fix for this issue. At one point,
>> >> he had a bunch of patches to allow sendpage() to call you back when
>> >> the transmission was done. What happened to those patches?
>> >
>> > No idea (I don't work with Ian but have taken the liberty of copy him).
>
> I'm afraid I've rather dropped the ball on these patches due to lack of
> time.
I have a rebase of your patch set for 3.2 (done by Mel Gorman) which,
whilst not onto the current version, is better than nothing, so I shall
send it along to netdev in case it is useful.
--
Alex Bligh