2017-03-09 03:38:22

by NeilBrown

Subject: A NFS mount can still write to the server after 'umount' has completed.


I've been chasing down a problem where a customer has a localhost mount,
and the sequence
  umount -at nfs,nfs4
  stop nfsserver
  sync

hangs on the sync. The 'sync' is trying to write to the NFS filesystem
that has just been unmounted.
I have duplicated the problem on a current mainline kernel.

There are two important facts that lead to the explanation of this.
1/ whenever a 'struct file' is open, an s_active reference is held on
   the superblock, via "open_context" calling nfs_sb_active().
   This doesn't prevent "unmount" from succeeding (i.e. EBUSY isn't
   returned), but does prevent the actual unmount from happening
   (->kill_sb() isn't called).
2/ When a memory mapping of a file is torn down, the file is
   "released", causing the context to be discarded and the sb_active
   reference released, but unlike close(2), file_operations->flush()
   is not called.

Consequently, if you:
  open an NFS file
  mmap some pages PROT_WRITE
  close the file
  modify the pages
  unmap the pages
  unmount the filesystem

the filesystem will remain active, and the pages will remain dirty.
If you then make the nfs server unavailable - e.g. stop it, or tear down
the network connection - and then call 'sync', the sync will hang.

This is surprising, at the least :-)
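
For concreteness, here is a minimal reproducer sketch (the mount point
and file name are illustrative, and error checking is omitted):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          /* assumes an NFS filesystem is mounted at /mnt/nfs */
          int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
          char *p;

          ftruncate(fd, 4096);   /* back a full page with file data */
          p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          close(fd);             /* f_op->flush runs; nothing is dirty yet */
          memset(p, 'x', 4096);  /* dirty the page after the close */
          munmap(p, 4096);       /* f_op->release runs, but no flush */
          /* "umount /mnt/nfs" now succeeds, yet the page stays dirty */
          return 0;
  }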

I have two ideas how it might be fixed.

One is to call nfs_file_flush() from within nfs_file_release().
This is probably simplest (and appears to work).

The other is to add a ".close" to nfs_file_vm_ops. This could trigger a
(partial) flush whenever a page is unmapped. As closing an NFS file
always triggers a flush, it seems reasonable that unmapping a page
would trigger a flush of that page.
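
To make that second option concrete, here is an untested sketch;
nfs_mmap_close() is a name I have made up, while the existing members
of nfs_file_vm_ops are as in fs/nfs/file.c:

  static void nfs_mmap_close(struct vm_area_struct *vma)
  {
          struct file *filp = vma->vm_file;
          loff_t start = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
          loff_t end = start + ((loff_t)vma_pages(vma) << PAGE_SHIFT) - 1;

          /* flush just the range that this mapping covered */
          vfs_fsync_range(filp, start, end, 0);
  }

  static const struct vm_operations_struct nfs_file_vm_ops = {
          .fault = filemap_fault,
          .map_pages = filemap_map_pages,
          .page_mkwrite = nfs_vm_page_mkwrite,
          .close = nfs_mmap_close,   /* the proposed addition */
  };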

Thoughts?

Thanks,
NeilBrown



2017-03-09 07:34:33

by NeilBrown

Subject: Re: A NFS mount can still write to the server after 'umount' has completed.

On Thu, Mar 09 2017, NeilBrown wrote:

> I've been chasing down a problem where a customer has a localhost mount,
> and the sequence
>   umount -at nfs,nfs4
>   stop nfsserver
>   sync
>
> hangs on the sync. The 'sync' is trying to write to the NFS filesystem
> that has just been unmounted.
> I have duplicated the problem on a current mainline kernel.
>
> There are two important facts that lead to the explanation of this.
> 1/ whenever a 'struct file' is open, an s_active reference is held on
>    the superblock, via "open_context" calling nfs_sb_active().
>    This doesn't prevent "unmount" from succeeding (i.e. EBUSY isn't
>    returned), but does prevent the actual unmount from happening
>    (->kill_sb() isn't called).
> 2/ When a memory mapping of a file is torn down, the file is
>    "released", causing the context to be discarded and the sb_active
>    reference released, but unlike close(2), file_operations->flush()
>    is not called.

I realised that there is another piece of the puzzle.
When a page is marked dirty (e.g. nfs_vm_page_mkwrite),
nfs_updatepage() -> nfs_writepage_setup()
creates a 'struct nfs_page' request which holds a reference to
the open context, which in turn holds an active reference to the
superblock.
So as long as there are dirty pages, the superblock will not go
inactive. All the rest still holds.
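
Roughly, the chain that pins the superblock is:

  dirty page --page_private()--> struct nfs_page
             --wb_context------> struct nfs_open_context
             --nfs_sb_active()-> superblock (s_active reference)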

NeilBrown


>
> Consequently, if you:
> open an NFS file
> mmap some pages PROT_WRITE
> close the file
> modify the pages
> unmap the pages
> unmount the filesystem
>
> the filesystem will remain active, and the pages will remain dirty.
> If you then make the nfs server unavailable - e.g. stop it, or tear down
> the network connection - and then call 'sync', the sync will hang.
>
> This is surprising, at the least :-)
>
> I have two ideas how it might be fixed.
>
> One is to call nfs_file_flush() from within nfs_file_release().
> This is probably simplest (and appears to work).
>
> The other is to add a ".close" to nfs_file_vm_ops. This could trigger a
> (partial) flush whenever a page is unmapped. As closing an NFS file
> always triggers a flush, it seems reasonable that unmapping a page
> would trigger a flush of that page.
>
> Thoughts?
>
> Thanks,
> NeilBrown



2017-03-09 13:21:55

by Trond Myklebust

Subject: Re: A NFS mount can still write to the server after 'umount' has completed.

On Thu, 2017-03-09 at 15:46 +1100, NeilBrown wrote:
> On Thu, Mar 09 2017, NeilBrown wrote:
>
> > I've been chasing down a problem where a customer has a localhost
> > mount, and the sequence
> >   umount -at nfs,nfs4
> >   stop nfsserver
> >   sync
> >
> > hangs on the sync.  The 'sync' is trying to write to the NFS
> > filesystem that has just been unmounted.

So why can't you move the sync up one line?

All this is working as expected. The only way to ensure that data
written via mmap() is synced to disk is to explicitly call msync().
This is well documented, and should be well understood by application
developers. If those developers are ignoring the documentation, then I
suggest voting with your feet and getting your software from someone
else.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-03-09 21:38:56

by NeilBrown

Subject: Re: A NFS mount can still write to the server after 'umount' has completed.

On Thu, Mar 09 2017, Trond Myklebust wrote:

> On Thu, 2017-03-09 at 15:46 +1100, NeilBrown wrote:
>> On Thu, Mar 09 2017, NeilBrown wrote:
>>
>> > I've been chasing down a problem where a customer has a localhost
>> > mount, and the sequence
>> >   umount -at nfs,nfs4
>> >   stop nfsserver
>> >   sync
>> >
>> > hangs on the sync.  The 'sync' is trying to write to the NFS
>> > filesystem that has just been unmounted.
>
> So why can't you move the sync up one line?
>
>> > I have duplicated the problem on a current mainline kernel.
>> >
>> > There are two important facts that lead to the explanation of this.
>> > 1/ whenever a 'struct file' is open, an s_active reference is held
>> >    on the superblock, via "open_context" calling nfs_sb_active().
>> >    This doesn't prevent "unmount" from succeeding (i.e. EBUSY isn't
>> >    returned), but does prevent the actual unmount from happening
>> >    (->kill_sb() isn't called).
>> > 2/ When a memory mapping of a file is torn down, the file is
>> >    "released", causing the context to be discarded and the sb_active
>> >    reference released, but unlike close(2), file_operations->flush()
>> >    is not called.
>>
>> I realised that there is another piece of the puzzle.
>> When a page is marked dirty (e.g. nfs_vm_page_mkwrite),
>>  nfs_updatepage() -> nfs_writepage_setup()
>> creates a 'struct nfs_page' request which holds a reference to
>> the open context, which in turn holds an active reference to the
>> superblock.
>> So as long as there are dirty pages, the superblock will not go
>> inactive.  All the rest still holds.
>>
>> NeilBrown
>>
>>
>> >
>> > Consequently, if you:
>> >  open an NFS file
>> >  mmap some pages PROT_WRITE
>> >  close the file
>> >  modify the pages
>> >  unmap the pages
>> >  unmount the filesystem
>> >
>> > the filesystem will remain active, and the pages will remain dirty.
>> > If you then make the nfs server unavailable - e.g. stop it, or tear
>> > down the network connection - and then call 'sync', the sync will
>> > hang.
>> >
>> > This is surprising, at the least :-)
>> >
>> > I have two ideas how it might be fixed.
>> >
>> > One is to call nfs_file_flush() from within nfs_file_release().
>> > This is probably simplest (and appears to work).
>> >
>> > The other is to add a ".close" to nfs_file_vm_ops.  This could
>> > trigger a (partial) flush whenever a page is unmapped.  As closing
>> > an NFS file always triggers a flush, it seems reasonable that
>> > unmapping a page would trigger a flush of that page.
>> >
>> > Thoughts?
>> >
>
> All this is working as expected. The only way to ensure that data
> written via mmap() is synced to disk is to explicitly call msync().
> This is well documented, and should be well understood by application
> developers. If those developers are ignoring the documentation, then I
> suggest voting with your feet and getting your software from someone
> else.

Hi Trond,
I don't understand how you can see this behaviour as acceptable.
For any other filesystem, unmount is a synchronisation point.
Assuming there are no bind mounts or similar, and that MNT_DETACH is
not used, an unmount will flush out any changes. It doesn't matter
whether or not the application has called fsync or msync - the flush
still happens on unmount.
So with any other filesystem you can unmount, then remove the media,
and be sure that everything is consistent and that no further writes
will be attempted.
With NFS as it currently is, unmount does not act as a synchronisation
point, and writes can continue after the unmount. How is that
acceptable?

The patch below explains the problem and provides a simple fix.

Thanks,
NeilBrown

From: NeilBrown <[email protected]>
Date: Fri, 10 Mar 2017 08:19:32 +1100
Subject: [PATCH] NFS: flush out dirty data on final fput().

Any dirty NFS page holds an s_active reference on the superblock,
because page_private() references an nfs_page, which references an
open context, which references the superblock.

So if there are any dirty pages when the filesystem is unmounted, the
unmount will act like a "lazy" unmount and not call ->kill_sb().
Background write-back can then write out the pages *after* the filesystem
unmount has apparently completed.

This contrasts with other filesystems which do not hold extra s_active
references, so ->kill_sb() is reliably called on unmount, and
generic_shutdown_super() will call sync_filesystem() to flush
everything out before the unmount completes.

When open/write/close is used to modify files, the final close causes
f_op->flush to be called, which flushes all dirty pages. However if
open/mmap/close/modify-memory/unmap is used, dirty pages can remain in
memory after the application has dropped all references to the file.

Fix this by calling vfs_fsync() in nfs_file_release (aka
f_op->release()). This means that on the final unmap of a file, all
changes are flushed, and ensures that when unmount is requested there
will be no dirty pages to delay the final unmount.

Without this patch, it is not safe to stop or disconnect the NFS
server after all clients have unmounted. They need to unmount and
call "sync".

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 668213984d68..d20cf9b3019b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp)
 {
 	dprintk("NFS: release(%pD2)\n", filp);
 
+	if (filp->f_mode & FMODE_WRITE)
+		/* Ensure dirty mmapped pages are flushed
+		 * so there will be no dirty pages to
+		 * prevent an unmount from completing.
+		 */
+		vfs_fsync(filp, 0);
 	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
 	nfs_file_clear_open_context(filp);
 	return 0;
--
2.12.0

