2010-12-06 12:21:14

by Spelic

[permalink] [raw]
Subject: NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)

On 12/06/2010 05:09 AM, Dave Chinner wrote:
>> [Files become sparse at nfs-server-side upon hitting ENOSPC if NFS client uses local writeback caching]
>>
>>
>> It's nice that the NFS server does local writeback caching but it
>> should also cache the filesystem's free space (and check it
>> periodically, since nfs-server is presumably not the only process
>> writing in that filesystem) so that it doesn't accept more data than
>> it can really write. Alternatively, when free space drops below 1GB
>> (or a reasonable size based on network speed), nfs-server should
>> turn off filesystem writeback caching.
>>
> This isn't an NFS server problem, or one that can be worked around at
> the server. It's an NFS _client_ problem in that it does not get
> synchronous ENOSPC errors when using writeback caching. There is no
> way for the NFS client to know the server is near ENOSPC conditions
> prior to writing the data to the server as clients operate
> independently.
>
> If you really want your NFS clients to behave correctly when the
> server goes ENOSPC, turn off writeback caching at the client side,
> not the server (i.e. use sync mounts on the client side).
> Write performance will suck, but if you want sane ENOSPC behaviour...
>
>

[adding NFS ML in cc]

Thank you for your very clear explanation.
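
For reference, the client-side sync mount you suggest would be
something like this (server name and paths are just examples):

# mount -o sync server:/export /mnt/nfs

or the equivalent /etc/fstab entry with "sync" in its options field.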

Going without writeback cache is a problem (write performance sucks, as
you say), but guaranteeing that we never reach ENOSPC is hardly
feasible either, especially if humans are logged in at the client side
doing "whatever they want".

I would suggest that either the NFS client poll the server's free
space and disable writeback caching when it is near ENOSPC, or the
server do the polling and, when it detects a near-ENOSPC condition,
send a specific message to the clients warning them so that they can
disable caching.

Doing it at the client side wouldn't change the NFS protocol, and it
could be good enough if one could specify how often free space should
be polled and what the free-space threshold is. Or, with just one
value: specify the maximum speed at which the server disk can fill
(the next polling period can then be inferred from the current free
space), and maybe also a minimum polling period (just in case).
Something like the rough sketch below.
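
A rough sketch of what I mean, done entirely in userspace at the
client (mount point, threshold and interval are just example values):

#!/bin/sh
# Poll free space on the NFS mount and fall back to synchronous
# writes when it drops below a threshold.
MNT=/mnt/nfs
THRESHOLD_MB=1024
INTERVAL=10

while sleep "$INTERVAL"; do
        # df on an NFS mount queries the server for its free space
        avail_mb=$(df -Pm "$MNT" | awk 'NR==2 {print $4}')
        if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
                # drop writeback caching until space is freed again
                mount -o remount,sync "$MNT"
        fi
done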

Regarding the last part of the email, perhaps I was not clear:


> .....
>
>> Holes in a random file!
>> This is data corruption, and nobody is notified of this data
>> corruption: no error at client side or server side!
>> Is it good semantics? How could client get notified of this? Some
>> kind of fsync maybe?
>>
> Use wireshark to determine if the server sends an ENOSPC to the
> client when the first background write fails. I bet it does and that
> your dd write failed with ENOSPC, too. Something stopped it writing
> at 1.9GB....
>

No, in that case I had written 15x100MB, which was more than the
available space but less than available space + writeback cache.
So "cat" exited by itself and never got an ENOSPC error, but the data
never reached the disk at the other side.

However, today I found that by using fsync the problem is fortunately
detected:

# time cat randfile{001..015} | pv -b | dd conv=fsync of=/mnt/nfsram/randfile
1.46GB
dd: fsync failed for `/mnt/nfsram/randfile': Input/output error
3072000+0 records in
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 20.9101 s, 75.2 MB/s

real 0m21.364s
user 0m0.470s
sys 0m11.440s


So, OK, I understand that processes needing guarantees on written data
should use fsync/fdatasync (which is good practice for a local
filesystem too, actually...).
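
For shell pipelines, dd exposes both, e.g. (paths as in my test):

# dd if=randfile001 of=/mnt/nfsram/randfile conv=fdatasync  # flush data
# dd if=randfile001 of=/mnt/nfsram/randfile conv=fsync      # + metadata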

Thank you


2010-12-06 13:34:06

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)

On Mon, 2010-12-06 at 13:20 +0100, Spelic wrote:
> [...]
>
> I would suggest that either the NFS client poll the server's free
> space and disable writeback caching when it is near ENOSPC, or the
> server do the polling and, when it detects a near-ENOSPC condition,
> send a specific message to the clients warning them so that they can
> disable caching.

> Doing it at the client side wouldn't change the NFS protocol, and it
> could be good enough if one could specify how often free space should
> be polled and what the free-space threshold is. Or, with just one
> value: specify the maximum speed at which the server disk can fill
> (the next polling period can then be inferred from the current free
> space), and maybe also a minimum polling period (just in case).

You can just as easily do this at the application level. The kernel
can't do it any more reliably than the application can, so there really
is no point in doing it there.
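
E.g. the application can do the equivalent of a statfs() on the mount
before a large write; df does exactly that, and on an NFS mount the
query goes to the server (the threshold here is just an example):

# avail_mb=$(df -Pm /mnt/nfs | awk 'NR==2 {print $4}')
# [ "$avail_mb" -gt 2048 ] || echo "server is low on space"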

We already ensure that when the server does send us an error, we switch
to synchronous operation until the error clears.

Trond