2012-02-08 02:45:33

by Derek McEachern

Subject: NFS Mount Option 'nofsc'

I joined the mailing list shortly after Neil sent out a request for
a volunteer to update the nfs man page documenting the 'fsc'/'nofsc'
options. I suspect this may stem from a ticket we opened with SUSE
inquiring about these options.

Coming from a Solaris background we typically use the 'forcedirectio'
option for certain mounts and I was looking for the same thing in Linux.
The typical advice seems to be to use 'noac' but the description in the
man page doesn't seem to match what I would expect from 'forcedirectio',
namely no buffering on the client.

Poking around the kernel I found the 'fsc'/'nofsc' options and my
question is does 'nofsc' provide 'forcedirectio' functionality?

Thanks,
Derek





2012-02-09 14:49:49

by Malahal Naineni

Subject: Re: NFS Mount Option 'nofsc'

Harshula [[email protected]] wrote:
> Hi Trond,
>
> Thanks for the reply. Could you please elaborate on the subtleties
> involved that require an application to be rewritten if forcedirectio
> mount option was available?
>
> On Thu, 2012-02-09 at 04:12 +0000, Myklebust, Trond wrote:
> > On Thu, 2012-02-09 at 14:56 +1100, Harshula wrote:
> > >
> > > The "sync" option, depending on the NFS server, may impact the NFS
> > > server's performance when serving many NFS clients. But still worth a
> > > try.
> >
> > What on earth makes you think that directio would be any different?
>
> Like I said, sync is still worth a try. I will do O_DIRECT Vs sync mount
> option runs and see what the numbers look like. A while back the numbers
> for cached Vs direct small random writes showed that as the number of threads
> increased, the cached performance fell well below direct performance. In
> this case I'll be looking at large streaming writes, so completely
> different scenario, but I'd like to verify the numbers first.

directio and sync behavior should be the same on the server side, but it
is a different story on the client. The behavior you described above is
expected on the client.

Thanks, Malahal.


2012-02-08 18:13:25

by Derek McEachern

Subject: Re: NFS Mount Option 'nofsc'



-------- Original Message --------
Subject: Re: NFS Mount Option 'nofsc'
From: Myklebust, Trond <[email protected]>
To: Derek McEachern <[email protected]>
CC: "[email protected]" <[email protected]>
Date: Tuesday, February 07, 2012 10:55:04 PM
> On Tue, 2012-02-07 at 20:45 -0600, Derek McEachern wrote:
>> I joined the mailing list shortly after Neil sent out a request for
>> a volunteer to update the nfs man page documenting the 'fsc'/'nofsc'
>> options. I suspect this may stem from a ticket we opened with SUSE
>> inquiring about these options.
>>
>> Coming from a Solaris background we typically use the 'forcedirectio'
>> option for certain mounts and I was looking for the same thing in Linux.
>> The typical advice seems to be to use 'noac' but the description in the
>> man page doesn't seem to match what I would expect from 'forcedirectio',
>> namely no buffering on the client.
>>
>> Poking around the kernel I found the 'fsc'/'nofsc' options and my
>> question is does 'nofsc' provide 'forcedirectio' functionality?
> No. There is no equivalent to the Solaris "forcedirectio" mount option
> in Linux.
> Applications that need to use uncached i/o are required to use the
> O_DIRECT open() mode instead, since pretty much all of them need to be
> rewritten to deal with the subtleties involved anyway.
>
> Trond

So then what exact functionality is provided by the 'nofsc' option? It
would seem to me from a write perspective that between noac and the sync
option it is pretty close to forcedirectio.

From the man page describing sync "any system call that writes data to
files on that mount point causes that data to be flushed to the server
before the system call returns control to user space."

Maybe I've answered one of my questions as flushing the data to the
server before returning to user space is really what I'm after. The
userspace app should be blocked until the write has been acknowledged by
the server and if the server is an NFS appliance then I don't
necessarily care if it has committed the data to disk as I expect it to
manage its cache properly.

Though I still want to understand what 'nofsc' is doing.

Derek
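
For illustration, a minimal sketch (not from the thread) of the uncached-I/O
pattern Trond refers to; the path and sizes are made up. O_DIRECT requires the
application itself to supply suitably aligned buffers, lengths and offsets,
which is part of why a mount option cannot simply substitute for it:

#define _GNU_SOURCE             /* O_DIRECT is a Linux-specific open flag */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/nfs/data.out";   /* hypothetical NFS path */
        size_t len = 1 << 20;                     /* 1 MiB, alignment-friendly */
        void *buf;

        /* O_DIRECT needs an aligned buffer; 4096 covers most configurations. */
        if (posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 0, len);

        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
                return 1;

        /* The write bypasses the client page cache and only returns once the
         * NFS client has pushed the data to the server. */
        ssize_t n = write(fd, buf, len);

        close(fd);
        free(buf);
        return n == (ssize_t)len ? 0 : 1;
}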

2012-02-10 16:49:04

by Myklebust, Trond

Subject: Re: NFS Mount Option 'nofsc'

On Fri, 2012-02-10 at 19:07 +1100, Harshula wrote:
> On Thu, 2012-02-09 at 15:31 +0000, Myklebust, Trond wrote:
> Thanks. Would it be accurate to say that if there were only either
> streaming writes or (xor) streaming reads to any given file on the NFS
> mount, the application would not need to be rewritten?

That should normally work.

> Do you see forcedirectio as a sharp object that someone could stab
> themselves with?

Yes. It does lead to some very subtle POSIX violations.

> > > There's another scenario, which we talked about a while back, where the
> > > cached async reads of a slowly growing file (tail) were spitting out
> > > non-existent NULLs to user space. The forcedirectio mount option should
> > > prevent that. Furthermore, the "sync" mount option will not help anymore
> > > because you removed nfs_readpage_sync().
> >
> > No. See the points about O_APPEND and serialisation of read() and
> > write() above. You may still end up seeing NUL characters (and indeed
> > worse forms of corruption).
>
> If the NFS client only does cached async reads of a slowly growing file
> (tail), what's the problem? Is nfs_readpage_sync() gone forever, or
> could it be revived?

It wouldn't help at all. The problem is the VM's handling of pages vs
the NFS handling of file size.

The VM basically uses the file size in order to determine how much data
a page contains. If that file size changed between the instance we
finished the READ RPC call, and the instance the VM gets round to
locking the page again, reading the data and then checking the file
size, then the VM may end up copying data beyond the end of that
retrieved by the RPC call.

> -osync also impacts the performance of the entire NFS mount. With the
> aforementioned hack, you can isolate the specific file(s) that need
> their dirty pages to be flushed frequently to avoid hitting the global
> dirty page limit.

So does forcedirectio. ...and it also impacts the performance of reads
for the entire NFS mount.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-02-08 04:55:16

by Myklebust, Trond

Subject: Re: NFS Mount Option 'nofsc'

On Tue, 2012-02-07 at 20:45 -0600, Derek McEachern wrote:
> I joined the mailing list shortly after Neil sent out a request for
> a volunteer to update the nfs man page documenting the 'fsc'/'nofsc'
> options. I suspect this may stem from a ticket we opened with SUSE
> inquiring about these options.
>
> Coming from a Solaris background we typically use the 'forcedirectio'
> option for certain mounts and I was looking for the same thing in Linux.
> The typical advice seems to be to use 'noac' but the description in the
> man page doesn't seem to match what I would expect from 'forcedirectio',
> namely no buffering on the client.
>
> Poking around the kernel I found the 'fsc'/'nofsc' options and my
> question is does 'nofsc' provide 'forcedirectio' functionality?

No. There is no equivalent to the Solaris "forcedirectio" mount option
in Linux.
Applications that need to use uncached i/o are required to use the
O_DIRECT open() mode instead, since pretty much all of them need to be
rewritten to deal with the subtleties involved anyway.

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-02-08 19:52:25

by Derek McEachern

Subject: Re: NFS Mount Option 'nofsc'



-------- Original Message --------
Subject: Re: NFS Mount Option 'nofsc'
From: Chuck Lever <[email protected]>
To: Derek McEachern <[email protected]>
CC: "Myklebust, Trond" <[email protected]>,
"[email protected]" <[email protected]>
Date: Wednesday, February 08, 2012 12:15:37 PM

>> So then what exact functionality is provided by the 'nofsc' option? It would seem to me from a write perspective that between noac and the sync option it is pretty close to forcedirectio.
>>
>> From the man page describing sync "any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space."
>>
>> Maybe I've answered one of my questions as flushing the data to the server before returning to user space is really what I'm after. The userspace app should be blocked until the write has been acknowledged by the server and if the server is an NFS appliance then I don't necessarily care if it has committed the data to disk as I expect it to manage its cache properly.
>>
>> Though I still want to understand what 'nofsc' is doing.
> "nofsc" disables file caching on the client's local disk. It has nothing to do with direct I/O.
>

If 'nofsc' disables file caching on the client's local disk, does that
mean that a write from userspace could go to kernel memory, then
potentially to the client's local disk, before being committed over the
network to the NFS server?

This seems really odd. What would be the use case for this?

Derek


2012-02-08 15:40:56

by Chuck Lever

Subject: Re: NFS Mount Option 'nofsc'


On Feb 8, 2012, at 2:43 AM, Harshula wrote:

> Hi Trond,
>
> On Wed, 2012-02-08 at 04:55 +0000, Myklebust, Trond wrote:
>
>> Applications that need to use uncached i/o are required to use the
>> O_DIRECT open() mode instead, since pretty much all of them need to be
>> rewritten to deal with the subtleties involved anyway.
>
> Could you please expand on the subtleties involved that require an
> application to be rewritten if forcedirectio mount option was available?
>
> A scenario where forcedirectio would be useful is when an application
> reads nearly a TB of data from local disks, processes that data and then
> dumps it to an NFS mount. All that happens while other processes are
> reading/writing to the local disks. The application does not have an
> O_DIRECT option nor is the source code available.
>
> With paged I/O the problem we see is that the NFS client system reaches
> dirty_bytes/dirty_ratio threshold and then blocks/forces all the
> processes to flush dirty pages. This effectively 'locks' up the NFS
> client system while the NFS dirty pages are pushed slowly over the wire
> to the NFS server. Some of the processes that have nothing to do with
> writing to the NFS mount are badly impacted. A forcedirectio mount
> option would be very helpful in this scenario. Do you have any advice on
> alleviating such problems on the NFS client by only using existing
> tunables?

Using direct I/O would be a work-around. The fundamental problem is the architecture of the VM system, and over time we have been making improvements there.

Instead of a mount option, you can fix your application to use direct I/O. Or you can change it to provide the kernel with (better) hints about the disposition of the data it is generating (madvise and fadvise system calls). (On Linux we assume you have source code and can make such changes. I realize this is not true for proprietary applications).

You could try using the "sync" mount option to cause the NFS client to push writes to the server immediately rather than delaying them. This would also slow down applications that aggressively dirty pages on the client.

Meanwhile, you can dial down the dirty_ratio and especially the dirty_background_ratio settings to trigger earlier writeback. We've also found increasing min_free_bytes has positive effects. The exact settings depend on how much memory your client has. Experimenting yourself is pretty harmless, so I won't give exact settings here.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
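
To make the "better hints" suggestion concrete, here is a small sketch (an
illustration only, with placeholder names and sizes) of what an application
with source access could do after each large sequential write: start writeback
of the just-written range early with sync_file_range(), then mark it as not
needed again with posix_fadvise(), so dirty pages do not pile up until the
global dirty thresholds are hit.

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

/* Flush one just-written chunk and let its pages be reclaimed.
 * 'off' and 'len' describe the range covered by the previous write(). */
static void flush_chunk(int fd, off_t off, off_t len)
{
        /* Start asynchronous writeback of this range now instead of waiting
         * for the system-wide dirty thresholds to trigger it. */
        sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

        /* Advise that the range will not be reused, so pages that have
         * already been written back can be dropped from the page cache. */
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}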





2012-02-08 07:43:51

by Harshula

Subject: Re: NFS Mount Option 'nofsc'

Hi Trond,

On Wed, 2012-02-08 at 04:55 +0000, Myklebust, Trond wrote:

> Applications that need to use uncached i/o are required to use the
> O_DIRECT open() mode instead, since pretty much all of them need to be
> rewritten to deal with the subtleties involved anyway.

Could you please expand on the subtleties involved that require an
application to be rewritten if forcedirectio mount option was available?

A scenario where forcedirectio would be useful is when an application
reads nearly a TB of data from local disks, processes that data and then
dumps it to an NFS mount. All that happens while other processes are
reading/writing to the local disks. The application does not have an
O_DIRECT option nor is the source code available.

With paged I/O the problem we see is that the NFS client system reaches
dirty_bytes/dirty_ratio threshold and then blocks/forces all the
processes to flush dirty pages. This effectively 'locks' up the NFS
client system while the NFS dirty pages are pushed slowly over the wire
to the NFS server. Some of the processes that have nothing to do with
writing to the NFS mount are badly impacted. A forcedirectio mount
option would be very helpful in this scenario. Do you have any advice on
alleviating such problems on the NFS client by only using existing
tunables?

Thanks,
#
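
For reference, the writeback tunables mentioned above live under /proc/sys/vm
and can be lowered at run time. A rough sketch follows; the values are only
placeholders, and, as noted, the change affects every writer on the machine,
not just the NFS mount.

#include <stdio.h>

/* Write a value into one of the vm.* sysctls via procfs (needs root). */
static int set_vm_tunable(const char *name, const char *value)
{
        char path[128];
        snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);

        FILE *f = fopen(path, "w");
        if (!f)
                return -1;
        int rc = (fprintf(f, "%s\n", value) > 0) ? 0 : -1;
        fclose(f);
        return rc;
}

int main(void)
{
        /* Placeholder values: start background writeback much earlier. */
        set_vm_tunable("dirty_background_ratio", "5");
        set_vm_tunable("dirty_ratio", "10");
        return 0;
}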


2012-02-08 20:00:30

by Chuck Lever

Subject: Re: NFS Mount Option 'nofsc'


On Feb 8, 2012, at 2:52 PM, Derek McEachern wrote:

>
>
> -------- Original Message --------
> Subject: Re: NFS Mount Option 'nofsc'
> From: Chuck Lever <[email protected]>
> To: Derek McEachern <[email protected]>
> CC: "Myklebust, Trond" <[email protected]>, "[email protected]" <[email protected]>
> Date: Wednesday, February 08, 2012 12:15:37 PM
>
>>> So then what exact functionality is provided by the 'nofsc' option? It would seem to me from a write perspective that between noac and the sync option it is pretty close to forcedirectio.
>>>
>>> From the man page describing sync "any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space."
>>>
>>> Maybe I've answered one of my questions as flushing the data to the server before returning to user space is really what I'm after. The userspace app should be blocked until the write has been acknowledged by the server and if the server is an NFS appliance then I don't necessarily care if it has committed the data to disk as I expect it to manage its cache properly.
>>>
>>> Though I still want to understand what 'nofsc' is doing.
>> "nofsc" disables file caching on the client's local disk. It has nothing to do with direct I/O.
>>
>
> If 'nofsc' disables file caching on the client's local disk, does that mean that a write from userspace could go to kernel memory, then potentially to the client's local disk, before being committed over the network to the NFS server?
>
> This seems really odd. What would be the use case for this?

With "fsc", writes are indeed slower, but reads of a very large file that rarely changes are on average much better. If a file is significantly larger than a client's page cache, a client can cache that file on its local disk, and get local read speeds instead of going over the wire.

Additionally if multiple clients have to access the same large file, it reduces the load on the storage server if they have their own local copies of that file, since the file is too large for the clients to cache in their page cache. This also has the benefit of keeping the file data cached across client reboots.

This feature is an optimization for HPC workloads, where a large number of clients access very large read-mostly datasets on a handful of storage servers. The clients' local fsc absorbs much of the aggregate read workload, allowing storage servers to scale to a larger number of clients.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2012-02-08 21:16:57

by Derek McEachern

Subject: Re: NFS Mount Option 'nofsc'



-------- Original Message --------
Subject: Re: NFS Mount Option 'nofsc'
From: Chuck Lever <[email protected]>
To: Derek McEachern <[email protected]>
CC: "Myklebust, Trond" <[email protected]>,
"[email protected]" <[email protected]>
Date: Wednesday, February 08, 2012 2:00:24 PM
> On Feb 8, 2012, at 2:52 PM, Derek McEachern wrote:
>
>> If 'nofsc' disables file caching on the client's local disk, does that mean that a write from userspace could go to kernel memory, then potentially to the client's local disk, before being committed over the network to the NFS server?
>>
>> This seems really odd. What would be the use case for this?
> With "fsc", writes are indeed slower, but reads of a very large file that rarely changes are on average much better. If a file is significantly larger than a client's page cache, a client can cache that file on its local disk, and get local read speeds instead of going over the wire.
>
> Additionally if multiple clients have to access the same large file, it reduces the load on the storage server if they have their own local copies of that file, since the file is too large for the clients to cache in their page cache. This also has the benefit of keeping the file data cached across client reboots.
>
> This feature is an optimization for HPC workloads, where a large number of clients access very large read-mostly datasets on a handful of storage servers. The clients' local fsc absorbs much of the aggregate read workload, allowing storage servers to scale to a larger number of clients.
>

Thank you, this makes sense for 'fsc'. I'm going to assume then that the
default is 'nofsc' if nothing is specified.

Derek



2012-02-10 08:07:33

by Harshula

Subject: Re: NFS Mount Option 'nofsc'

Hi Trond,

On Thu, 2012-02-09 at 15:31 +0000, Myklebust, Trond wrote:
> On Thu, 2012-02-09 at 16:51 +1100, Harshula wrote:
> > Hi Trond,
> >
> > Thanks for the reply. Could you please elaborate on the subtleties
> > involved that require an application to be rewritten if forcedirectio
> > mount option was available?
>
> Firstly, we don't support O_DIRECT+O_APPEND (since the NFS protocol
> itself doesn't support atomic appends), so that would break a bunch of
> applications.
>
> Secondly, uncached I/O means that read() and write() requests need to be
> serialised by the application itself, since there are no atomicity or
> ordering guarantees at the VFS, NFS or RPC call level. Normally, the
> page cache services read() requests if there are outstanding writes, and
> so provides the atomicity guarantees that POSIX requires.
> IOW: if a write() occurs while you are reading, the application may end
> up retrieving part of the old data, and part of the new data instead of
> either one or the other.
>
> IOW: your application still needs to be aware of the fact that it is
> using O_DIRECT, and you are better off adding explicit support for it
> rather than hacky kludges such as a forcedirectio option.

Thanks. Would it be accurate to say that if there were only either
streaming writes or (xor) streaming reads to any given file on the NFS
mount, the application would not need to be rewritten?

Do you see forcedirectio as a sharp object that someone could stab
themselves with?

> > There's another scenario, which we talked about a while back, where the
> > cached async reads of a slowly growing file (tail) were spitting out
> > non-existent NULLs to user space. The forcedirectio mount option should
> > prevent that. Furthermore, the "sync" mount option will not help anymore
> > because you removed nfs_readpage_sync().
>
> No. See the points about O_APPEND and serialisation of read() and
> write() above. You may still end up seeing NUL characters (and indeed
> worse forms of corruption).

If the NFS client only does cached async reads of a slowly growing file
(tail), what's the problem? Is nfs_readpage_sync() gone forever, or
could it be revived?

> > > > The other hack that seems to work is periodically triggering an
> > > > nfs_getattr(), via ls -l, to force the dirty pages to be flushed to the
> > > > NFS server. Not exactly elegant ...
> > >
> > > ????????????????????????????????
> >
> > int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> > {
> >         struct inode *inode = dentry->d_inode;
> >         int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
> >         int err;
> >
> >         /* Flush out writes to the server in order to update c/mtime. */
> >         if (S_ISREG(inode->i_mode)) {
> >                 err = filemap_write_and_wait(inode->i_mapping);
> >                 if (err)
> >                         goto out;
> >         }
>
> I'm aware of that code. The point is that '-osync' does that for free.

-osync also impacts the performance of the entire NFS mount. With the
aforementioned hack, you can isolate the specific file(s) that need
their dirty pages to be flushed frequently to avoid hitting the global
dirty page limit.

cya,
#


2012-02-09 05:51:50

by Harshula

Subject: Re: NFS Mount Option 'nofsc'

Hi Trond,

Thanks for the reply. Could you please elaborate on the subtleties
involved that require an application to be rewritten if forcedirectio
mount option was available?

On Thu, 2012-02-09 at 04:12 +0000, Myklebust, Trond wrote:
> On Thu, 2012-02-09 at 14:56 +1100, Harshula wrote:
> >
> > The "sync" option, depending on the NFS server, may impact the NFS
> > server's performance when serving many NFS clients. But still worth a
> > try.
>
> What on earth makes you think that directio would be any different?

Like I said, sync is still worth a try. I will do O_DIRECT Vs sync mount
option runs and see what the numbers look like. A while back the numbers
for cached Vs direct small random writes showed that as the number of threads
increased, the cached performance fell well below direct performance. In
this case I'll be looking at large streaming writes, so completely
different scenario, but I'd like to verify the numbers first.

Just to be clear, I am not disagreeing with you. "sync" may be sufficient
for the scenario I described earlier.

> If
> your performance requirements can't cope with 'sync', then they sure as
> hell won't deal well with 'fsc'.

"fsc"?

> Directio is _synchronous_ just like 'sync'. The big difference is that
> with 'sync' then at least those reads are still cached.

There's another scenario, which we talked about a while back, where the
cached async reads of a slowly growing file (tail) were spitting out
non-existent NULLs to user space. The forcedirectio mount option should
prevent that. Furthermore, the "sync" mount option will not help anymore
because you removed nfs_readpage_sync().

> > The other hack that seems to work is periodically triggering an
> > nfs_getattr(), via ls -l, to force the dirty pages to be flushed to the
> > NFS server. Not exactly elegant ...
>
> ????????????????????????????????

int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
        struct inode *inode = dentry->d_inode;
        int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
        int err;

        /* Flush out writes to the server in order to update c/mtime. */
        if (S_ISREG(inode->i_mode)) {
                err = filemap_write_and_wait(inode->i_mapping);
                if (err)
                        goto out;
        }

Thanks,
#
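
The 'ls -l' hack works because a stat() of a regular file on an NFS mount goes
through the nfs_getattr() path quoted above, which calls
filemap_write_and_wait() on that file. A userspace sketch of the same idea,
with a made-up path and interval, would simply be:

#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical file being written to an NFS mount by another process;
         * stat()ing it periodically forces its dirty pages to be flushed to
         * the server via nfs_getattr(). */
        const char *path = "/mnt/nfs/output.dat";
        struct stat st;

        for (;;) {
                stat(path, &st);
                sleep(5);       /* arbitrary interval */
        }
}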


2012-02-08 18:15:46

by Chuck Lever

Subject: Re: NFS Mount Option 'nofsc'


On Feb 8, 2012, at 1:13 PM, Derek McEachern wrote:

>
>
> -------- Original Message --------
> Subject: Re: NFS Mount Option 'nofsc'
> From: Myklebust, Trond <[email protected]>
> To: Derek McEachern <[email protected]>
> CC: "[email protected]" <[email protected]>
> Date: Tuesday, February 07, 2012 10:55:04 PM
>> On Tue, 2012-02-07 at 20:45 -0600, Derek McEachern wrote:
>>> I joined the mailing list shortly after Neil sent out a request for
>>> a volunteer to update the nfs man page documenting the 'fsc'/'nofsc'
>>> options. I suspect this may stem from a ticket we opened with SUSE
>>> inquiring about these options.
>>>
>>> Coming from a Solaris background we typically use the 'forcedirectio'
>>> option for certain mounts and I was looking for the same thing in Linux.
>>> The typical advice seems to be to use 'noac' but the description in the
>>> man page doesn't seem to match what I would expect from 'forcedirectio',
>>> namely no buffering on the client.
>>>
>>> Poking around the kernel I found the 'fsc'/'nofsc' options and my
>>> question is does 'nofsc' provide 'forcedirectio' functionality?
>> No. There is no equivalent to the Solaris "forcedirectio" mount option
>> in Linux.
>> Applications that need to use uncached i/o are required to use the
>> O_DIRECT open() mode instead, since pretty much all of them need to be
>> rewritten to deal with the subtleties involved anyway.
>>
>> Trond
>
> So then what exact functionality is provided by the 'nofsc' option? It would seem to me from a write perspective that between noac and the sync option it is pretty close to forcedirectio.
>
> From the man page describing sync "any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space."
>
> Maybe I've answered one of my questions as flushing the data to the server before returning to user space is really what I'm after. The userspace app should be blocked until the write has been acknowledged by the server and if the server is an NFS appliance then I don't necessarily care if it has committed the data to disk as I expect it to manage its cache properly.
>
> Though I still want to understand what 'nofsc' is doing.

"nofsc" disables file caching on the client's local disk. It has nothing to do with direct I/O.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2012-02-09 04:12:09

by Myklebust, Trond

Subject: Re: NFS Mount Option 'nofsc'

On Thu, 2012-02-09 at 14:56 +1100, Harshula wrote:
> Hi Chuck,
>
> On Wed, 2012-02-08 at 10:40 -0500, Chuck Lever wrote:
> > On Feb 8, 2012, at 2:43 AM, Harshula wrote:
>
> > > Could you please expand on the subtleties involved that require an
> > > application to be rewritten if forcedirectio mount option was available?
> > >
> > > A scenario where forcedirectio would be useful is when an application
> > > reads nearly a TB of data from local disks, processes that data and then
> > > dumps it to an NFS mount. All that happens while other processes are
> > > reading/writing to the local disks. The application does not have an
> > > O_DIRECT option nor is the source code available.

mount -osync works just as well as forcedirectio for this.

> > > With paged I/O the problem we see is that the NFS client system reaches
> > > dirty_bytes/dirty_ratio threshold and then blocks/forces all the
> > > processes to flush dirty pages. This effectively 'locks' up the NFS
> > > client system while the NFS dirty pages are pushed slowly over the wire
> > > to the NFS server. Some of the processes that have nothing to do with
> > > writing to the NFS mount are badly impacted. A forcedirectio mount
> > > option would be very helpful in this scenario. Do you have any advice on
> > > alleviating such problems on the NFS client by only using existing
> > > tunables?
> >
> > Using direct I/O would be a work-around. The fundamental problem is
> > the architecture of the VM system, and over time we have been making
> > improvements there.

The argument above doesn't provide any motive for using directio
(uncached i/o) vs synchronous i/o. I see no reason why forced
synchronous i/o would be a problem here.

> > Instead of a mount option, you can fix your application to use direct
> > I/O. Or you can change it to provide the kernel with (better) hints
> > about the disposition of the data it is generating (madvise and
> > fadvise system calls). (On Linux we assume you have source code and
> > can make such changes. I realize this is not true for proprietary
> > applications).
> >
> > You could try using the "sync" mount option to cause the NFS client to
> > push writes to the server immediately rather than delaying them. This
> > would also slow down applications that aggressively dirty pages on
> > the client.
> >
> > Meanwhile, you can dial down the dirty_ratio and especially the
> > dirty_background_ratio settings to trigger earlier writeback. We've
> > also found increasing min_free_bytes has positive effects. The exact
> > settings depend on how much memory your client has. Experimenting
> > yourself is pretty harmless, so I won't give exact settings here.
>
> Thanks for the reply. Unfortunately, not all vendors provide the source
> code, so using O_DIRECT or fsync is not always an option.

This is what vendor support is for. With closed source software you
generally get what you pay for.

> Lowering dirty_bytes/dirty_ratio and
> dirty_background_bytes/dirty_background_ratio did help as it smoothed
> out the data transfer over the wire by pushing data out to the NFS
> server sooner. Otherwise, I was seeing the data transfer over the wire
> having idle periods while >10GiB of pages were being dirtied by the
> processes, then congestion as soon as the dirty_ratio was reached and
> the frantic flushing of dirty pages to the NFS server. However,
> modifying dirty_* tunables has a system-wide impact, hence it was not
> accepted.
>
> The "sync" option, depending on the NFS server, may impact the NFS
> server's performance when serving many NFS clients. But still worth a
> try.

What on earth makes you think that directio would be any different? If
your performance requirements can't cope with 'sync', then they sure as
hell won't deal well with 'fsc'.

Directio is _synchronous_ just like 'sync'. The big difference is that
with 'sync' then at least those reads are still cached.

> The other hack that seems to work is periodically triggering an
> nfs_getattr(), via ls -l, to force the dirty pages to be flushed to the
> NFS server. Not exactly elegant ...

????????????????????????????????

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-02-09 03:56:23

by Harshula

Subject: Re: NFS Mount Option 'nofsc'

Hi Chuck,

On Wed, 2012-02-08 at 10:40 -0500, Chuck Lever wrote:
> On Feb 8, 2012, at 2:43 AM, Harshula wrote:

> > Could you please expand on the subtleties involved that require an
> > application to be rewritten if forcedirectio mount option was available?
> >
> > A scenario where forcedirectio would be useful is when an application
> > reads nearly a TB of data from local disks, processes that data and then
> > dumps it to an NFS mount. All that happens while other processes are
> > reading/writing to the local disks. The application does not have an
> > O_DIRECT option nor is the source code available.
> >
> > With paged I/O the problem we see is that the NFS client system reaches
> > dirty_bytes/dirty_ratio threshold and then blocks/forces all the
> > processes to flush dirty pages. This effectively 'locks' up the NFS
> > client system while the NFS dirty pages are pushed slowly over the wire
> > to the NFS server. Some of the processes that have nothing to do with
> > writing to the NFS mount are badly impacted. A forcedirectio mount
> > option would be very helpful in this scenario. Do you have any advice on
> > alleviating such problems on the NFS client by only using existing
> > tunables?
>
> Using direct I/O would be a work-around. The fundamental problem is
> the architecture of the VM system, and over time we have been making
> improvements there.
>
> Instead of a mount option, you can fix your application to use direct
> I/O. Or you can change it to provide the kernel with (better) hints
> about the disposition of the data it is generating (madvise and
> fadvise system calls). (On Linux we assume you have source code and
> can make such changes. I realize this is not true for proprietary
> applications).
>
> You could try using the "sync" mount option to cause the NFS client to
> push writes to the server immediately rather than delaying them. This
> would also slow down applications that aggressively dirty pages on
> the client.
>
> Meanwhile, you can dial down the dirty_ratio and especially the
> dirty_background_ratio settings to trigger earlier writeback. We've
> also found increasing min_free_bytes has positive effects. The exact
> settings depend on how much memory your client has. Experimenting
> yourself is pretty harmless, so I won't give exact settings here.

Thanks for the reply. Unfortunately, not all vendors provide the source
code, so using O_DIRECT or fsync is not always an option.

Lowering dirty_bytes/dirty_ratio and
dirty_background_bytes/dirty_background_ratio did help as it smoothed
out the data transfer over the wire by pushing data out to the NFS
server sooner. Otherwise, I was seeing the data transfer over the wire
having idle periods while >10GiB of pages were being dirtied by the
processes, then congestion as soon as the dirty_ratio was reached and
the frantic flushing of dirty pages to the NFS server. However,
modifying dirty_* tunables has a system-wide impact, hence it was not
accepted.

The "sync" option, depending on the NFS server, may impact the NFS
server's performance when serving many NFS clients. But still worth a
try.

The other hack that seems to work is periodically triggering an
nfs_getattr(), via ls -l, to force the dirty pages to be flushed to the
NFS server. Not exactly elegant ...

Thanks,
#


2012-02-09 15:31:37

by Myklebust, Trond

Subject: Re: NFS Mount Option 'nofsc'

On Thu, 2012-02-09 at 16:51 +1100, Harshula wrote:
> Hi Trond,
>
> Thanks for the reply. Could you please elaborate on the subtleties
> involved that require an application to be rewritten if forcedirectio
> mount option was available?

Firstly, we don't support O_DIRECT+O_APPEND (since the NFS protocol
itself doesn't support atomic appends), so that would break a bunch of
applications.

Secondly, uncached I/O means that read() and write() requests need to be
serialised by the application itself, since there are no atomicity or
ordering guarantees at the VFS, NFS or RPC call level. Normally, the
page cache services read() requests if there are outstanding writes, and
so provides the atomicity guarantees that POSIX requires.
IOW: if a write() occurs while you are reading, the application may end
up retrieving part of the old data, and part of the new data instead of
either one or the other.

IOW: your application still needs to be aware of the fact that it is
using O_DIRECT, and you are better off adding explicit support for it
rather than hacky kludges such as a forcedirectio option.

> On Thu, 2012-02-09 at 04:12 +0000, Myklebust, Trond wrote:
> > On Thu, 2012-02-09 at 14:56 +1100, Harshula wrote:
> > >
> > > The "sync" option, depending on the NFS server, may impact the NFS
> > > server's performance when serving many NFS clients. But still worth a
> > > try.
> >
> > What on earth makes you think that directio would be any different?
>
> Like I said, sync is still worth a try. I will do O_DIRECT Vs sync mount
> option runs and see what the numbers look like. A while back the numbers
> for cached Vs direct small random writes showed that as the number of threads
> increased, the cached performance fell well below direct performance. In
> this case I'll be looking at large streaming writes, so completely
> different scenario, but I'd like to verify the numbers first.
>
> Just to be clear, I am not disagreeing with you. "sync" may be sufficient
> for the scenario I described earlier.
>
> > If
> > your performance requirements can't cope with 'sync', then they sure as
> > hell won't deal well with 'fsc'.
>
> "fsc"?
>
> > Directio is _synchronous_ just like 'sync'. The big difference is that
> > with 'sync' then at least those reads are still cached.
>
> There's another scenario, which we talked about a while back, where the
> cached async reads of a slowly growing file (tail) were spitting out
> non-existent NULLs to user space. The forcedirectio mount option should
> prevent that. Furthermore, the "sync" mount option will not help anymore
> because you removed nfs_readpage_sync().

No. See the points about O_APPEND and serialisation of read() and
write() above. You may still end up seeing NUL characters (and indeed
worse forms of corruption).

> > > The other hack that seems to work is periodically triggering an
> > > nfs_getattr(), via ls -l, to force the dirty pages to be flushed to the
> > > NFS server. Not exactly elegant ...
> >
> > ????????????????????????????????
>
> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> {
>         struct inode *inode = dentry->d_inode;
>         int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
>         int err;
>
>         /* Flush out writes to the server in order to update c/mtime. */
>         if (S_ISREG(inode->i_mode)) {
>                 err = filemap_write_and_wait(inode->i_mapping);
>                 if (err)
>                         goto out;
>         }

I'm aware of that code. The point is that '-osync' does that for free.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-02-20 05:36:07

by Harshula

Subject: Re: NFS Mount Option 'nofsc'

Hi Trond,

On Fri, 2012-02-10 at 16:48 +0000, Myklebust, Trond wrote:
> On Fri, 2012-02-10 at 19:07 +1100, Harshula wrote:

> > Do you see forcedirectio as a sharp object that someone could stab
> > themselves with?
>
> Yes. It does lead to some very subtle POSIX violations.

I'm trying out the alternatives. Your list of reasons was convincing. Thanks.

> > If the NFS client only does cached async reads of a slowly growing file
> > (tail), what's the problem? Is nfs_readpage_sync() gone forever, or
> > could it be revived?
>
> It wouldn't help at all. The problem is the VM's handling of pages vs
> the NFS handling of file size.
>
> The VM basically uses the file size in order to determine how much data
> a page contains. If that file size changed between the instance we
> finished the READ RPC call, and the instance the VM gets round to
> locking the page again, reading the data and then checking the file
> size, then the VM may end up copying data beyond the end of that
> retrieved by the RPC call.

nfs_readpage_sync() keeps doing rsize reads (or PAGE_SIZE reads if
rsize > PAGE_SIZE) until the entire page has been filled or EOF is hit.
Since these are synchronous reads, the subsequent READ RPC call is not
sent until the previous READ RPC reply arrives. Hence, the READ RPC reply
contains the latest metadata about the file, from the NFS server, before
deciding whether or not to do more READ RPC calls. That is not the case
with the asynchronous READ RPC calls, which are queued to be sent before
the replies are received. This results in not READing enough data from
the NFS server even when the READ RPC reply explicitly states that the
file has grown. This mismatch of data and file size is then presented to
the VM.

If you look at the nfs_readpage_sync() code, it does not worry about
adjusting the number of bytes to read if it is past the *current* EOF.
Only the async code adjusts the number of bytes to read if it is past
the *current* EOF. Furthermore, testing showed that using -osync (while
nfs_readpage_sync() existed) avoided the NULLs being presented to
userspace.

cya,
#
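
As a purely illustrative userspace analogue of the difference described above
(the helper below is hypothetical, not the removed kernel function): filling a
buffer with one bounded read at a time lets each result decide whether another
read is worthwhile, whereas a batch of asynchronous READs has to be sized
before any reply arrives.

#include <unistd.h>

/* Fill up to 'want' bytes starting at 'off' using reads of at most 'rsize'
 * bytes each; 'rsize' plays the role of the per-READ-RPC size. */
static size_t fill_sequentially(int fd, char *buf, size_t want,
                                size_t rsize, off_t off)
{
        size_t filled = 0;

        while (filled < want) {
                size_t chunk = want - filled;
                if (chunk > rsize)
                        chunk = rsize;

                /* Each read stops at whatever EOF is when it completes, so a
                 * file that grew since the previous read is picked up by the
                 * next iteration. */
                ssize_t n = pread(fd, buf + filled, chunk, off + filled);
                if (n <= 0)             /* EOF (or error) as of this read */
                        break;
                filled += n;
        }
        return filled;
}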