2011-03-12 12:53:42

by Judith Flo Gaya

Subject: problem with nfs latency during high IO

Hello,

I was told some days ago that my problem with my NFS system is related
to this bug, as the problem that I'm experiencing is quite similar.

The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848

The link itself explains my issue quite well: I'm just trying to copy a
big file (36 GB) to my NFS server, and when I run an ls -l command on
the same folder I'm copying data into, the command gets stuck for some
time. The delay ranges from a few seconds to several minutes (9 minutes
is the current record).
I can live with a few seconds of delay, but minutes is quite
unacceptable.

As the NFS server is running on a Red Hat system (an HP IBRIX X9300
with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the
patch suggested in the bug on my clients.

Unfortunately my clients are running Fedora Core 14 (x86_64, kernel
2.6.35.6-45) and I can't find the file they are referring to: the file
fs/nfs/inode.c is not there, and I can't find the RPM that contains it.

As the bug is a very old one, I took it for granted that the fix is
already applied in Fedora, but I wanted to make sure by looking at the
file.

Can you help me with this? Am I wrong in my supposition (is the patch
really applied)? Is it possible that my problem is somewhere else?

Thanks a lot in advance for your help, please let me know if I can
provide any more information.
j




2011-03-16 23:51:15

by Simon Kirby

Subject: Re: problem with nfs latency during high IO

On Wed, Mar 16, 2011 at 12:45:34PM +0100, Judith Flo Gaya wrote:

> I made some tests with a value of 10 for vm.dirty_ratio, and indeed
> the ls-hang-time has decreased a lot, from a 3 min average to 1.5 min.
> I was wondering: what is the minimum value that is safe to use? I'm
> sure that you have already dealt with the side effects/collateral
> damage of this change; I don't want to fix one problem by creating
> another one.

For a while, we were running with this on production NFS clients:

vm/dirty_background_bytes = 1048576
vm/dirty_bytes = 2097152

which is totally crazy, but generally seemed to work for the most part,
and penalized the process creating the pages instead of totally hosing
everything else when somebody was just writing back a huge file. Without
it, we were seeing a single "dd" able to cause the load to hit 200,
simply because 200 other processes got stuck in D waiting for RPC slots
due to the slowdown. With those settings, load would stay around 3-4 and
latency was much better.
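
For anyone who wants to try the same thing, values like these can be
applied at run time with sysctl and made persistent in /etc/sysctl.conf.
The sketch below just reuses the two byte values above and is not a
tuning recommendation; note that the *_bytes and *_ratio knobs are
alternatives, so setting one zeroes the other:

  # apply immediately
  sysctl -w vm.dirty_background_bytes=1048576   # start background writeback after ~1 MB of dirty data
  sysctl -w vm.dirty_bytes=2097152              # throttle the writing process after ~2 MB of dirty data

  # keep the settings across reboots
  echo 'vm.dirty_background_bytes = 1048576' >> /etc/sysctl.conf
  echo 'vm.dirty_bytes = 2097152'            >> /etc/sysctl.conf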

I think I removed it when trying to figure out the other flush issues,
but it seems the same problem still exists.

Simon-

2011-03-16 11:45:36

by Judith Flo Gaya

Subject: Re: problem with nfs latency during high IO

Hello Chuck,

On 03/15/2011 11:10 PM, Chuck Lever wrote:
>
> On Mar 15, 2011, at 5:58 PM, Judith Flo Gaya wrote:
>
>>
>> I saw that the value was 20; I don't know the impact of changing the number by units or by tens... Should I test with 10, or is that too much? I assume the behavior will change immediately, right?
>
> I believe the dirty ratio is the percentage of physical memory that can be consumed by one file's dirty data before the VM starts flushing its pages asynchronously. Or it could be the amount of dirty data allowed across all files... one file or many doesn't make any difference if you are writing a single very large file.
>
> If your client memory is large, a small number should work without problem. One percent of a 16GB client is still quite a bit of memory. The current setting means you can have 20% of said 16GB client, or 3.2GB, of dirty file data on that client before it will even think about flushing it. Along comes "ls -l" and you will have to wait for the client to flush 3.2GB before it can send the GETATTR.
>
> I believe this setting does take effect immediately, but you will have to put the setting in /etc/sysctl.conf to make it last across a reboot.
>

I made some tests with a value of 10 for vm.dirty_ratio, and indeed
the ls-hang-time has decreased a lot, from a 3 min average to 1.5 min.
I was wondering: what is the minimum value that is safe to use? I'm
sure that you have already dealt with the side effects/collateral
damage of this change; I don't want to fix one problem by creating
another one.

Regarding the modification of the inode.c file, what do you think the
next step will be? And how can I apply it to my system? Should I
modify the file myself and recompile the kernel to have the change
applied?

Thanks a lot,
j

2011-03-16 15:18:30

by Jim Rees

Subject: Re: problem with nfs latency during high IO

Chuck Lever wrote:

> I was wondering: what is the minimum value that is safe to use? I'm
> sure that you have already dealt with the side effects/collateral
> damage of this change; I don't want to fix one problem by creating
> another one.

As I said before, you can set it to 1, and that will mean background
flushing kicks in at 1% of your client's physical memory.

I think 5 is the minimum for dirty_ratio, although this doesn't seem to be
documented anywhere. If you want to set it lower, you have to use
dirty_bytes instead of dirty_ratio. See the commit message here:

http://lkml.org/lkml/2008/11/23/160

Documentation is in Documentation/sysctl/vm.txt.
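
To illustrate the difference (the values below are placeholders, not
recommendations): the ratio knobs take whole percentages of physical
memory, while the byte knobs allow arbitrary thresholds, and setting
one form zeroes the other:

  # percentage-based: coarse, whole percent of physical memory
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=5

  # byte-based: finer control, useful below 1% of a large-memory client
  sysctl -w vm.dirty_background_bytes=$((16 * 1024 * 1024))   # 16 MB
  sysctl -w vm.dirty_bytes=$((64 * 1024 * 1024))              # 64 MB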

2011-03-15 18:03:58

by Chuck Lever III

Subject: Re: problem with nfs latency during high IO


On Mar 15, 2011, at 1:25 PM, Judith Flo Gaya wrote:

> Hello Chuck,
>
> On 03/15/2011 05:24 PM, Chuck Lever wrote:
>> Hi Judith-
>>
>> On Mar 12, 2011, at 7:58 AM, Judith Flo Gaya wrote:
>>
>>> Hello,
>>>
>>> I was told some days ago that my problem with my NFS system is related to this bug, as the problem that I'm experiencing is quite similar.
>>>
>>> The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848
>>>
>>> The link itself explains my issue quite well: I'm just trying to copy a big file (36 GB) to my NFS server, and when I run an ls -l command on the same folder I'm copying data into, the command gets stuck for some time. The delay ranges from a few seconds to several minutes (9 minutes is the current record).
>>> I can live with a few seconds of delay, but minutes is quite unacceptable.
>>>
>>> As the NFS server is running on a Red Hat system (an HP IBRIX X9300 with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the patch suggested in the bug on my clients.
>>>
>>> Unfortunately my clients are running Fedora Core 14 (x86_64, kernel 2.6.35.6-45) and I can't find the file they are referring to: the file fs/nfs/inode.c is not there, and I can't find the RPM that contains it.
>>>
>>> As the bug is a very old one, I took it for granted that the fix is already applied in Fedora, but I wanted to make sure by looking at the file.
>>>
>>> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?
>>
>> This sounds like typical behavior.
> But it is not like this when I use RHEL 6 as a client to those servers; in that case, the ls lasts only a few seconds, nothing like the minutes it takes on my Fedora clients.

Which Fedora systems, exactly? The fix I described below is almost certainly in RHEL 6.

>>
>> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>>
> OK, sorry, I know this is a very stupid question, but what do you mean by dirty data?

Dirty data is data that your application has written to the file but which hasn't been flushed to the server's disk. This data resides in the client's page cache, on its way to the server.

> BTW I understand the time issue, but again, if the kernel version that Red Hat ships allows me to get the information quickly, why does a newer kernel in Fedora not?

Sounds like a bug. Fedora kernels newer than 2.6.32 should work as well as, or better than, RHEL 6.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-03-15 17:25:54

by Judith Flo Gaya

Subject: Re: problem with nfs latency during high IO

Hello Chuck,

On 03/15/2011 05:24 PM, Chuck Lever wrote:
> Hi Judith-
>
> On Mar 12, 2011, at 7:58 AM, Judith Flo Gaya wrote:
>
>> Hello,
>>
>> I was told some days ago that my problem with my NFS system is related to this bug, as the problem that I'm experiencing is quite similar.
>>
>> The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848
>>
>> The link itself explains my issue quite well: I'm just trying to copy a big file (36 GB) to my NFS server, and when I run an ls -l command on the same folder I'm copying data into, the command gets stuck for some time. The delay ranges from a few seconds to several minutes (9 minutes is the current record).
>> I can live with a few seconds of delay, but minutes is quite unacceptable.
>>
>> As the NFS server is running on a Red Hat system (an HP IBRIX X9300 with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the patch suggested in the bug on my clients.
>>
>> Unfortunately my clients are running Fedora Core 14 (x86_64, kernel 2.6.35.6-45) and I can't find the file they are referring to: the file fs/nfs/inode.c is not there, and I can't find the RPM that contains it.
>>
>> As the bug is a very old one, I took it for granted that the fix is already applied in Fedora, but I wanted to make sure by looking at the file.
>>
>> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?
>
> This sounds like typical behavior.
But it is not like this when I use RHEL 6 as a client to those servers;
in that case, the ls lasts only a few seconds, nothing like the minutes
it takes on my Fedora clients.
>
> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>
OK, sorry, I know this is a very stupid question, but what do you mean
by dirty data?
BTW I understand the time issue, but again, if the kernel version that
Red Hat ships allows me to get the information quickly, why does a
newer kernel in Fedora not?

> The problem is that Linux caches writes aggressively. That makes flushing before the GETATTR take a long time in some cases. On some versions of Linux, it could be an indefinite amount of time; recently we added a bit of logic to make the GETATTR code path hold up additional application writes so it would be able to squeeze in the GETATTR to get a consistent snapshot of mtime and size.
>
I thought that the purpose of the patch was specifically to allow the
client to get the stat(2) info faster than before, so that this
aggressive behavior doesn't impact the performance of the stat request
so much.
> Another issue is: what if other clients are writing to the file? Those writes won't be seen on your client, either in the form of data changes or mtime/size updates, until your client's attribute cache times out (or the file is unlocked or closed).
>
I didn't consider that; a big issue indeed. Then how does RHEL manage
to avoid the problem?
> The best you can do for now is to lower the amount of dirty data the client allows to be outstanding, thus reducing the amount of time it takes for a flush to complete. This is done with a sysctl, I believe "vm.dirty_ratio," and affects all file systems on the client. Alternately, the client file system in question can be mounted with "sync" to cause writes to go to the server immediately, but that has other significant performance implications.
>
I'll give it a try and let you know how the new tests are doing.
I already considered the sync parameter, but of course the performance
of the copy then drops to unacceptable times (from 6 min to 40 min).

Thanks,
j

2011-03-16 13:25:06

by Chuck Lever III

Subject: Re: problem with nfs latency during high IO


On Mar 16, 2011, at 7:45 AM, Judith Flo Gaya wrote:

> Hello Chuck,
>
> On 03/15/2011 11:10 PM, Chuck Lever wrote:
>>
>> On Mar 15, 2011, at 5:58 PM, Judith Flo Gaya wrote:
>>
>>>
>>> I saw that the value was 20; I don't know the impact of changing the number by units or by tens... Should I test with 10, or is that too much? I assume the behavior will change immediately, right?
>>
>> I believe the dirty ratio is the percentage of physical memory that can be consumed by one file's dirty data before the VM starts flushing its pages asynchronously. Or it could be the amount of dirty data allowed across all files... one file or many doesn't make any difference if you are writing a single very large file.
>>
>> If your client memory is large, a small number should work without problem. One percent of a 16GB client is still quite a bit of memory. The current setting means you can have 20% of said 16GB client, or 3.2GB, of dirty file data on that client before it will even think about flushing it. Along comes "ls -l" and you will have to wait for the client to flush 3.2GB before it can send the GETATTR.
>>
>> I believe this setting does take effect immediately, but you will have to put the setting in /etc/sysctl.conf to make it last across a reboot.
>>
>
> I made some tests with a value of 10 for vm.dirty_ratio, and indeed the ls-hang-time has decreased a lot, from a 3 min average to 1.5 min.
> I was wondering: what is the minimum value that is safe to use? I'm sure that you have already dealt with the side effects/collateral damage of this change; I don't want to fix one problem by creating another one.

As I said before, you can set it to 1, and that will mean background flushing kicks in at 1% of your client's physical memory. I think that's probably safe nearly anywhere, but it may have deleterious effects on workload performance. You need to test various settings with your workload and your clients to see what is the best setting in your environment.

> Regarding the modification of the inode.c file, what do you think the next step will be? And how can I apply it to my system? Should I modify the file myself and recompile the kernel to have the change applied?

I recommend that you file a bug against Fedora 14. See http://bugzilla.redhat.com/ .

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-03-15 21:47:51

by Judith Flo Gaya

Subject: Re: problem with nfs latency during high IO



On 3/15/11 10:28 PM, Chuck Lever wrote:
>>>>>> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?
>>>>> This sounds like typical behavior.
>>>> But it is not like this when I use RHEL 6 as a client to those servers; in that case, the ls lasts only a few seconds, nothing like the minutes it takes on my Fedora clients.
>>> Which Fedora systems, exactly? The fix I described below is almost certainly in RHEL 6.
>> Fedora Core 14, 64 bit, 2.6.35.6-45
> Right, you mentioned that in your OP. Sorry.
no problem
>>>>> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>>>>>
>>>> OK, sorry, I know this is a very stupid question, but what do you mean by dirty data?
>>> Dirty data is data that your application has written to the file but which hasn't been flushed to the server's disk. This data resides in the client's page cache, on its way to the server.
>> OK, understood. Regarding the sysctl change that you suggest: I've been checking both distributions, RHEL 6 and FC14, and they share the same value... I assume from this that changing the value will not "help", am I right?
> It should improve behavior somewhat in both cases, but the delay won't go away entirely. This was a workaround we gave EL5 customers before this bug was addressed. In the Fedora case I wouldn't expect a strongly deterministic improvement, but the average wait for "ls -l" should go down somewhat.
I saw that the value was 20; I don't know the impact of changing the
number by units or by tens... Should I test with 10, or is that too
much? I assume the behavior will change immediately, right?
j

2011-03-16 15:31:31

by Jim Rees

Subject: Re: problem with nfs latency during high IO

Jim Rees wrote:

I think 5 is the minimum for dirty_ratio, although this doesn't seem to be
documented anywhere. If you want to set it lower, you have to use
dirty_bytes instead of dirty_ratio.

Never mind, I just looked at the code and the 5% limit seems to have been
removed later. Last time I looked at this code was 3 years ago.

2011-03-15 21:28:34

by Chuck Lever III

Subject: Re: problem with nfs latency during high IO


On Mar 15, 2011, at 5:33 PM, Judith Flo Gaya wrote:

>
>
> On 3/15/11 7:03 PM, Chuck Lever wrote:
>> On Mar 15, 2011, at 1:25 PM, Judith Flo Gaya wrote:
>>
>>> Hello Chuck,
>>>
>>> On 03/15/2011 05:24 PM, Chuck Lever wrote:
>>>> Hi Judith-
>>>>
>>>> On Mar 12, 2011, at 7:58 AM, Judith Flo Gaya wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I was told some days ago that my problem with my NFS system is related to this bug, as the problem that I'm experiencing is quite similar.
>>>>>
>>>>> The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848
>>>>>
>>>>> The link itself explains my issue quite well: I'm just trying to copy a big file (36 GB) to my NFS server, and when I run an ls -l command on the same folder I'm copying data into, the command gets stuck for some time. The delay ranges from a few seconds to several minutes (9 minutes is the current record).
>>>>> I can live with a few seconds of delay, but minutes is quite unacceptable.
>>>>>
>>>>> As the NFS server is running on a Red Hat system (an HP IBRIX X9300 with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the patch suggested in the bug on my clients.
>>>>>
>>>>> Unfortunately my clients are running Fedora Core 14 (x86_64, kernel 2.6.35.6-45) and I can't find the file they are referring to: the file fs/nfs/inode.c is not there, and I can't find the RPM that contains it.
>>>>>
>>>>> As the bug is a very old one, I took it for granted that the fix is already applied in Fedora, but I wanted to make sure by looking at the file.
>>>>>
>>>>> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?
>>>> This sounds like typical behavior.
>>> But it is not like this when I use RHEL 6 as a client to those servers; in that case, the ls lasts only a few seconds, nothing like the minutes it takes on my Fedora clients.
>> Which Fedora systems, exactly? The fix I described below is almost certainly in RHEL 6.
> Fedora Core 14, 64 bit, 2.6.35.6-45

Right, you mentioned that in your OP. Sorry.

>>>> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>>>>
>>> OK, sorry, I know this is a very stupid question, but what do you mean by dirty data?
>> Dirty data is data that your application has written to the file but which hasn't been flushed to the server's disk. This data resides in the client's page cache, on its way to the server.
> OK, understood. Regarding the sysctl change that you suggest: I've been checking both distributions, RHEL 6 and FC14, and they share the same value... I assume from this that changing the value will not "help", am I right?

It should improve behavior somewhat in both cases, but the delay won't go away entirely. This was a workaround we gave EL5 customers before this bug was addressed. In the Fedora case I wouldn't expect a strongly deterministic improvement, but the average wait for "ls -l" should go down somewhat.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-03-16 16:52:59

by Judith Flo Gaya

Subject: Re: problem with nfs latency during high IO

Thanks a lot to both of you for your suggestions and help.
I have already filed a bug [1] in Bugzilla.
j
[1] https://bugzilla.redhat.com/show_bug.cgi?id=688232

On 03/16/2011 04:31 PM, Jim Rees wrote:
> Jim Rees wrote:
>
> I think 5 is the minimum for dirty_ratio, although this doesn't seem to be
> documented anywhere. If you want to set it lower, you have to use
> dirty_bytes instead of dirty_ratio.
>
> Never mind, I just looked at the code and the 5% limit seems to have been
> removed later. Last time I looked at this code was 3 years ago.

--
Judith Flo Gaya
Systems Administrator IMPPC
e-mail: [email protected]
Tel (+34) 93 554-3079
Fax (+34) 93 465-1472

Institut de Medicina Predictiva i Personalitzada del Càncer
Crta Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona,
Spain
http://www.imppc.org

2011-03-15 16:25:03

by Chuck Lever III

Subject: Re: problem with nfs latency during high IO

Hi Judith-

On Mar 12, 2011, at 7:58 AM, Judith Flo Gaya wrote:

> Hello,
>
> I was told some days ago that my problem with my NFS system is related to this bug, as the problem that I'm experiencing is quite similar.
>
> The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848
>
> The link itself explains my issue quite well: I'm just trying to copy a big file (36 GB) to my NFS server, and when I run an ls -l command on the same folder I'm copying data into, the command gets stuck for some time. The delay ranges from a few seconds to several minutes (9 minutes is the current record).
> I can live with a few seconds of delay, but minutes is quite unacceptable.
>
> As the NFS server is running on a Red Hat system (an HP IBRIX X9300 with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the patch suggested in the bug on my clients.
>
> Unfortunately my clients are running Fedora Core 14 (x86_64, kernel 2.6.35.6-45) and I can't find the file they are referring to: the file fs/nfs/inode.c is not there, and I can't find the RPM that contains it.
>
> As the bug is a very old one, I took it for granted that the fix is already applied in Fedora, but I wanted to make sure by looking at the file.
>
> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?

This sounds like typical behavior.

POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.

The problem is that Linux caches writes aggressively. That makes flushing before the GETATTR take a long time in some cases. On some versions of Linux, it could be an indefinite amount of time; recently we added a bit of logic to make the GETATTR code path hold up additional application writes so it would be able to squeeze in the GETATTR to get a consistent snapshot of mtime and size.

Another issue is: what if other clients are writing to the file? Those writes won't be seen on your client, either in the form of data changes or mtime/size updates, until your client's attribute cache times out (or the file is unlocked or closed).

The best you can do for now is to lower the amount of dirty data the client allows to be outstanding, thus reducing the amount of time it takes for a flush to complete. This is done with a sysctl, I believe "vm.dirty_ratio," and affects all file systems on the client. Alternately, the client file system in question can be mounted with "sync" to cause writes to go to the server immediately, but that has other significant performance implications.
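
As a rough sketch of those two workarounds (the mount point and export
below are placeholders):

  # reduce the amount of dirty data the client may accumulate (takes effect immediately)
  sysctl -w vm.dirty_ratio=10

  # or mount the export synchronously so writes go to the server right away
  mount -o remount,sync /mnt/nfs
  # /etc/fstab equivalent:
  # server:/export  /mnt/nfs  nfs  sync,hard,intr  0 0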

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


2011-03-15 18:16:00

by Chuck Lever III

Subject: Re: problem with nfs latency during high IO


On Mar 15, 2011, at 2:03 PM, Chuck Lever wrote:

>
> On Mar 15, 2011, at 1:25 PM, Judith Flo Gaya wrote:
>
>> Hello Chuck,
>>
>> On 03/15/2011 05:24 PM, Chuck Lever wrote:
>>> Hi Judith-
>>>
>>> On Mar 12, 2011, at 7:58 AM, Judith Flo Gaya wrote:
>>>
>>>> Hello,
>>>>
>>>> I was told some days ago that my problem with my NFS system is related to this bug, as the problem that I'm experiencing is quite similar.
>>>>
>>>> The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848
>>>>
>>>> The link itself explains my issue quite well: I'm just trying to copy a big file (36 GB) to my NFS server, and when I run an ls -l command on the same folder I'm copying data into, the command gets stuck for some time. The delay ranges from a few seconds to several minutes (9 minutes is the current record).
>>>> I can live with a few seconds of delay, but minutes is quite unacceptable.
>>>>
>>>> As the NFS server is running on a Red Hat system (an HP IBRIX X9300 with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the patch suggested in the bug on my clients.
>>>>
>>>> Unfortunately my clients are running Fedora Core 14 (x86_64, kernel 2.6.35.6-45) and I can't find the file they are referring to: the file fs/nfs/inode.c is not there, and I can't find the RPM that contains it.
>>>>
>>>> As the bug is a very old one, I took it for granted that the fix is already applied in Fedora, but I wanted to make sure by looking at the file.
>>>>
>>>> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?
>>>
>>> This sounds like typical behavior.
>> But it is not like this when I use RHEL 6 as a client to those servers; in that case, the ls lasts only a few seconds, nothing like the minutes it takes on my Fedora clients.
>
> Which Fedora systems, exactly? The fix I described below is almost certainly in RHEL 6.
>
>>>
>>> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>>>
>> OK, sorry, I know this is a very stupid question, but what do you mean by dirty data?
>
> Dirty data is data that your application has written to the file but which hasn't been flushed to the server's disk. This data resides in the client's page cache, on its way to the server.
>
>> BTW I understand the time issue, but again, if the kernel version that Red Hat ships allows me to get the information quickly, why does a newer kernel in Fedora not?
>
> Sounds like a bug. Fedora kernels newer than 2.6.32 should work as well as, or better than, RHEL 6.

Looks like commit acdc53b2 "NFS: Replace __nfs_write_mapping with sync_inode()" removes the code that holds i_mutex while trying to flush writes before a GETATTR. This means application writes can possibly starve a stat(2) call. Trond, this seems like a regression...?
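
For anyone following along, whether a given mainline commit is present
in a particular kernel release can be checked from a clone of the
upstream git tree, e.g. (a sketch, assuming a local kernel clone):

  git tag --contains acdc53b2      # lists the releases that include the commit
  git show --stat acdc53b2         # shows which files the commit touched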

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-03-17 01:21:49

by Harshula

Subject: Re: problem with nfs latency during high IO

Hi Chuck & Judith,

On Tue, 2011-03-15 at 14:15 -0400, Chuck Lever wrote:
> On Mar 15, 2011, at 2:03 PM, Chuck Lever wrote:
> > On Mar 15, 2011, at 1:25 PM, Judith Flo Gaya wrote:

> >> BTW I understand the time issue, but again, if the kernel version
> >> that Red Hat ships allows me to get the information quickly, why
> >> does a newer kernel in Fedora not?
> >
> > Sounds like a bug. Fedora kernels newer than 2.6.32 should work as
> > well as, or better than, RHEL 6.
>
> Looks like commit acdc53b2 "NFS: Replace __nfs_write_mapping with
> sync_inode()" removes the code that holds i_mutex while trying to
> flush writes before a GETATTR. This means application writes can
> possibly starve a stat(2) call. Trond, this seems like a
> regression...?

RHEL 6.0 was released with the RH kernel 2.6.32-71, and it *does* contain
the commit acdc53b2 "NFS: Replace __nfs_write_mapping with sync_inode()"
backported to 2.6.32. So I doubt that the reported bad Fedora
performance is due to that commit.

cya,
#


2011-03-16 13:43:31

by peter.staubach

Subject: RE: problem with nfs latency during high IO

Yes, "may have deleterious effects on workload performance", is true. Especially the "may" part. Many applications simply open a file, write it out, and then close it. The aggressive caching really only helps when an application is working within an open file, where the working set fits within the cache, for an extended period. Otherwise, the client may as well start flushing as soon as it can generate full WRITE requests and as long as it is doing that, it may as well throttle the application to match the rate in which pages can be cleaned.

This will help with overall performance by not tying down memory to hold dirty pages which will not get touched again until they are clean. It can also greatly help the "ls -l" problem.
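
A rough way to get that effect for a one-off copy, without touching the
sysctls, is to keep the copy itself from building up a large dirty set;
the paths and block size below are placeholders:

  # bypass the client page cache entirely, so no dirty pages accumulate
  dd if=/local/bigfile of=/mnt/nfs/bigfile bs=1M oflag=direct

  # or write synchronously, flushing each block as it is written (slower, but bounded dirty data)
  dd if=/local/bigfile of=/mnt/nfs/bigfile bs=1M oflag=dsync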

ps


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Chuck Lever
Sent: Wednesday, March 16, 2011 9:25 AM
To: Judith Flo Gaya
Cc: [email protected]
Subject: Re: problem with nfs latency during high IO


On Mar 16, 2011, at 7:45 AM, Judith Flo Gaya wrote:

> Hello Chuck,
>
> On 03/15/2011 11:10 PM, Chuck Lever wrote:
>>
>> On Mar 15, 2011, at 5:58 PM, Judith Flo Gaya wrote:
>>
>>>
>>> I saw that the value was 20; I don't know the impact of changing the number by units or by tens... Should I test with 10, or is that too much? I assume the behavior will change immediately, right?
>>
>> I believe the dirty ratio is the percentage of physical memory that can be consumed by one file's dirty data before the VM starts flushing its pages asynchronously. Or it could be the amount of dirty data allowed across all files... one file or many doesn't make any difference if you are writing a single very large file.
>>
>> If your client memory is large, a small number should work without problem. One percent of a 16GB client is still quite a bit of memory. The current setting means you can have 20% of said 16GB client, or 3.2GB, of dirty file data on that client before it will even think about flushing it. Along comes "ls -l" and you will have to wait for the client to flush 3.2GB before it can send the GETATTR.
>>
>> I believe this setting does take effect immediately, but you will have to put the setting in /etc/sysctl.conf to make it last across a reboot.
>>
>
> I made some tests with a value of 10 for vm.dirty_ratio, and indeed the ls-hang-time has decreased a lot, from a 3 min average to 1.5 min.
> I was wondering: what is the minimum value that is safe to use? I'm sure that you have already dealt with the side effects/collateral damage of this change; I don't want to fix one problem by creating another one.

As I said before, you can set it to 1, and that will mean background flushing kicks in at 1% of your client's physical memory. I think that's probably safe nearly anywhere, but it may have deleterious effects on workload performance. You need to test various settings with your workload and your clients to see what is the best setting in your environment.

> Regarding the modification of the inode.c file, what do you think the next step will be? And how can I apply it to my system? Should I modify the file myself and recompile the kernel to have the change applied?

I recommend that you file a bug against Fedora 14. See http://bugzilla.redhat.com/ .

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



2011-03-15 22:10:50

by Chuck Lever III

Subject: Re: problem with nfs latency during high IO


On Mar 15, 2011, at 5:58 PM, Judith Flo Gaya wrote:

>
>
> On 3/15/11 10:28 PM, Chuck Lever wrote:
>>
>>>>>> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>>>>>>
>>>>> OK, sorry, I know this is a very stupid question, but what do you mean by dirty data?
>>>> Dirty data is data that your application has written to the file but which hasn't been flushed to the server's disk. This data resides in the client's page cache, on its way to the server.
>>> OK, understood. Regarding the sysctl change that you suggest: I've been checking both distributions, RHEL 6 and FC14, and they share the same value... I assume from this that changing the value will not "help", am I right?
>> It should improve behavior somewhat in both cases, but the delay won't go away entirely. This was a workaround we gave EL5 customers before this bug was addressed. In the Fedora case I wouldn't expect a strongly deterministic improvement, but the average wait for "ls -l" should go down somewhat.
> I saw that the value was 20; I don't know the impact of changing the number by units or by tens... Should I test with 10, or is that too much? I assume the behavior will change immediately, right?

I believe the dirty ratio is the percentage of physical memory that can be consumed by one file's dirty data before the VM starts flushing its pages asynchronously. Or it could be the amount of dirty data allowed across all files... one file or many doesn't make any difference if you are writing a single very large file.

If your client memory is large, a small number should work without problem. One percent of a 16GB client is still quite a bit of memory. The current setting means you can have 20% of said 16GB client, or 3.2GB, of dirty file data on that client before it will even think about flushing it. Along comes "ls -l" and you will have to wait for the client to flush 3.2GB before it can send the GETATTR.

I believe this setting does take effect immediately, but you will have to put the setting in /etc/sysctl.conf to make it last across a reboot.
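
A minimal sketch of checking the current thresholds and how much dirty
data is actually outstanding at any moment (the 16 GB figure is just
the example above):

  sysctl vm.dirty_ratio vm.dirty_background_ratio   # current percentage thresholds
  grep -E 'Dirty|Writeback' /proc/meminfo           # dirty data outstanding right now

  # with 16 GB of RAM and vm.dirty_ratio = 20:
  #   0.20 * 16 GB = 3.2 GB of dirty data can build up before the writer is throttled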

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-03-15 21:23:00

by Judith Flo Gaya

Subject: Re: problem with nfs latency during high IO



On 3/15/11 7:03 PM, Chuck Lever wrote:
> On Mar 15, 2011, at 1:25 PM, Judith Flo Gaya wrote:
>
>> Hello Chuck,
>>
>> On 03/15/2011 05:24 PM, Chuck Lever wrote:
>>> Hi Judith-
>>>
>>> On Mar 12, 2011, at 7:58 AM, Judith Flo Gaya wrote:
>>>
>>>> Hello,
>>>>
>>>> I was told some days ago that my problem with my NFS system is related to this bug, as the problem that I'm experiencing is quite similar.
>>>>
>>>> The bug : https://bugzilla.redhat.com/show_bug.cgi?id=469848
>>>>
>>>> The link itself explains my issue quite well: I'm just trying to copy a big file (36 GB) to my NFS server, and when I run an ls -l command on the same folder I'm copying data into, the command gets stuck for some time. The delay ranges from a few seconds to several minutes (9 minutes is the current record).
>>>> I can live with a few seconds of delay, but minutes is quite unacceptable.
>>>>
>>>> As the NFS server is running on a Red Hat system (an HP IBRIX X9300 with Red Hat 5.3 x86_64, kernel 2.6.18-128), I was told to apply the patch suggested in the bug on my clients.
>>>>
>>>> Unfortunately my clients are running Fedora Core 14 (x86_64, kernel 2.6.35.6-45) and I can't find the file they are referring to: the file fs/nfs/inode.c is not there, and I can't find the RPM that contains it.
>>>>
>>>> As the bug is a very old one, I took it for granted that the fix is already applied in Fedora, but I wanted to make sure by looking at the file.
>>>>
>>>> Can you help me with this? Am I wrong in my supposition (is the patch really applied)? Is it possible that my problem is somewhere else?
>>> This sounds like typical behavior.
>> But it is not like this when I use RHEL 6 as a client to those servers; in that case, the ls lasts only a few seconds, nothing like the minutes it takes on my Fedora clients.
> Which Fedora systems, exactly? The fix I described below is almost certainly in RHEL 6.
Fedora Core 14, 64 bit, 2.6.35.6-45
>>> POSIX requires that the mtime and file size returned by stat(2) ('ls -l') reflect the most recent write(2). On NFS, the server sets both of these fields. If a client is caching dirty data, and an application does a stat(2), the client is forced to flush the dirty data so that the server can update mtime and file size appropriately. The client then does a GETATTR, and returns those values to the requesting application.
>>>
>> OK, sorry, I know this is a very stupid question, but what do you mean by dirty data?
> Dirty data is data that your application has written to the file but which hasn't been flushed to the server's disk. This data resides in the client's page cache, on its way to the server.
OK, understood. Regarding the sysctl change that you suggest: I've been
checking both distributions, RHEL 6 and FC14, and they share the same
value... I assume from this that changing the value will not "help", am
I right?
>> BTW I understand the time issue, but again, if the kernel version that Red Hat ships allows me to get the information quickly, why does a newer kernel in Fedora not?
> Sounds like a bug. Fedora kernels newer than 2.6.32 should work as well as, or better than, RHEL 6.
I thought the same, but the tests don't suggest that this is true ;(
I saw your next message about the difference in the code, and that
would make a lot of sense!

Thanks for your help!
j