Date: Mon, 21 Jan 2013 15:01:54 +0000
From: Alex Bligh
To: "Myklebust, Trond"
Cc: linux-nfs@vger.kernel.org, Alex Bligh, ian.campbell@centrix.com
Subject: Re: Fatal crash with NFS, AIO & tcp retransmit

Trond,

--On 21 January 2013 14:38:20 +0000 "Myklebust, Trond" wrote:

> The Oops would be due to a bug in the socket layer: the socket is
> supposed to take a reference count on the page in order to ensure that
> it can copy the contents.

Looking at the original linux-nfs thread, you said here:

  http://marc.info/?l=linux-nfs&m=122424789508577&w=2

Trond:> I don't see how this could be an RPC bug. The networking
Trond:> layer is supposed to either copy the data sent to the socket,
Trond:> or take a reference to any pages that are pushed via
Trond:> the ->sendpage() abi.

which sounds suspiciously like the same thing. The conversation then went:

  http://marc.info/?l=linux-nfs&m=122424858109731&w=2

Ian:> The pages are still referenced by the networking layer. The problem is
Ian:> that the userspace app has been told that the write has completed so
Ian:> it is free to write new data to those pages.

to which you replied:

  http://marc.info/?l=linux-nfs&m=122424984612130&w=2

Trond:> OK, I see your point.

Following the thread, it then seems that Ian's test case did fail on NFS4
on 2.6.18, but not on 2.6.27.

Note that Ian was seeing something slightly different from me. I think what
he was seeing was alterations made to the page after the AIO had completed
being picked up by a retransmit, when the page contents as they stood before
the alteration should have been transmitted. That could presumably be fixed
by some COW mechanism.

What I'm seeing is more subtle. Xen thinks (because QEMU tells it, because
AIO tells it) that the memory is done with entirely, and simply unmaps it.
I don't think that's QEMU's fault. If it is a referencing issue, then it
seems to me the problem is that Xen is releasing the grant structure (I
don't quite understand how this bit works) and unmapping the memory while
the networking stack still holds a reference to the page concerned.
However, even if it did not do that, wouldn't a retransmit after the write
had completed risk writing the wrong data? I suppose it could mark the page
COW before releasing the grant, or something.

> As for the O_DIRECT bug, the problem there is that we have no way of
> knowing when the socket is done writing the page. Just because we got an
> answer from the server doesn't mean that the socket is done
> retransmitting the data. It is quite possible that the server is just
> replying to the first transmission.

I don't think QEMU is actually using O_DIRECT unless I set cache=none on
the drive. That causes a different, interesting failure which isn't my
focus just now!
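To make sure we're describing the same race, here is a minimal userspace
sketch of what I understand the hazard to be. It is an analogy only, not
kernel code, and all the names (fake_sendpage, fake_aio_complete and so on)
are made up: the "network" side keeps only a borrowed pointer to the
caller's page, the caller is told the I/O has completed, and a later
retransmit then picks up whatever the caller has written since.

  /* Userspace analogy of the zero-copy retransmit hazard.
   * All names here are hypothetical; this is not kernel code. */
  #include <stdio.h>
  #include <string.h>

  struct fake_skb {
      const char *page;   /* borrowed pointer: no copy is taken */
      size_t len;
  };

  /* Data the "network" keeps around in case a retransmit is needed. */
  static struct fake_skb retransmit_queue;

  /* Analogue of ->sendpage(): the network layer references the page,
   * it does not copy it. */
  static void fake_sendpage(const char *page, size_t len)
  {
      retransmit_queue.page = page;
      retransmit_queue.len = len;
      /* the first transmission would happen here */
  }

  /* Analogue of the AIO completion: the caller is told the write is
   * done and believes it may now reuse (or unmap) the page. */
  static void fake_aio_complete(void)
  {
  }

  /* Analogue of a TCP retransmit firing after the caller has moved on. */
  static void fake_retransmit(void)
  {
      printf("retransmitting: %.*s\n",
             (int)retransmit_queue.len, retransmit_queue.page);
  }

  int main(void)
  {
      char page[16] = "data-v1";

      fake_sendpage(page, strlen(page));
      fake_aio_complete();

      /* The app was told the write completed, so it reuses the page. */
      strcpy(page, "data-v2");

      /* The retransmit now carries the new contents ("data-v2"). */
      fake_retransmit();
      return 0;
  }

In Ian's case the reuse step is an overwrite, so the retransmit carries the
wrong data; in my case the page has additionally been unmapped by Xen, so
the equivalent of fake_retransmit() touches a page that is no longer there,
hence the fatal crash.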
> I thought that Ian was working on a fix for this issue. At one point, he
> had a bunch of patches to allow sendpage() to call you back when the
> transmission was done. What happened to those patches?

No idea (I don't work with Ian, but I've taken the liberty of copying him).
However, what's happened in the intervening years is that Xen has changed
its device model and it's now QEMU doing the writing (the qcow2 driver
specifically). I'm not sure it's even using sendpage().

--
Alex Bligh