Hello all,
This is my first post, so please be gentle.... I have been working with a
customer who is attempting to build their product in ClearCase dynamic
views on Linux. When they went from Red Hat Enterprise Linux 4 (Update 5)
to Red Hat Enterprise Linux 5 (Update 2), their build performance degraded
dramatically. When troubleshooting the issue, we noticed that links on
RHEL 5 caused an incredible number of "STABLE" 4 KB NFS writes even though
the storage we were writing to was EXPLICITLY mounted async. (This made
RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
On consultation with some internal resources, we found this change in the
2.6 kernel:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
Here it looks like the NFS client is forcing sync writes any time a
write of less than the NFS write size occurs. We tested this hypothesis by
setting the write size to 2 KB. The "STABLE" writes went away and link
times came back down out of the stratosphere. We built a modified kernel
based on the RHEL 5.2 kernel (one that ONLY backed out this change) and we
got a 33% improvement in overall build speeds. In my case, I see almost
identical build times between the two OSes when we use this modified kernel
on RHEL 5.
Now, why am I posing this to the list? I need to understand *why* that
change was made. On the face of it, simply backing out that patch would be
perfect. I'm paranoid. I want to make sure that this is the ONLY reason:
"/* For single writes, FLUSH_STABLE is more efficient */ "
It seems more accurate to say that they *aren't* more efficient, but
rather are "safer, but slower."
I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4
kernel, and SLES 9 is based on something in the same ballpark. And our
customers see problems when they go to SLES 10/RHEL 5 from the prior major
distro version.
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
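To make the heuristic under discussion concrete, here is a plain-C
paraphrase of the behaviour described above. This is a sketch of my reading
of that commit, not the kernel source; the helper and the numbers are made
up.

/* Paraphrase of the flush heuristic discussed in this thread -- NOT the
 * kernel source. If everything queued for flushing fits in a single WRITE
 * RPC (i.e. it is no larger than wsize), send it as a stable (FILE_SYNC)
 * write instead of an unstable write followed by a COMMIT. */
#include <stdio.h>

enum write_stability { WRITE_UNSTABLE, WRITE_FILE_SYNC };

/* Hypothetical helper: npages dirty pages of page_size bytes each,
 * wsize taken from the mount options. */
static enum write_stability choose_stability(unsigned long npages,
                                             unsigned long page_size,
                                             unsigned long wsize)
{
        /* "For single writes, FLUSH_STABLE is more efficient" -- one RPC
         * instead of WRITE + COMMIT. This is the comment being questioned. */
        if (npages * page_size <= wsize)
                return WRITE_FILE_SYNC;
        return WRITE_UNSTABLE;
}

int main(void)
{
        /* With a 32 KB wsize a single dirty 4 KB page goes out FILE_SYNC;
         * dropping wsize to 2 KB (as in the test above) makes it unstable. */
        printf("wsize=32768: %s\n", choose_stability(1, 4096, 32768) ==
               WRITE_FILE_SYNC ? "FILE_SYNC" : "UNSTABLE");
        printf("wsize=2048:  %s\n", choose_stability(1, 4096, 2048) ==
               WRITE_FILE_SYNC ? "FILE_SYNC" : "UNSTABLE");
        return 0;
}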
Chuck Lever wrote:
>
> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>
>> Hello all,
>>
>> This is my first post, so please be gentle.... I have been working
>> with a
>> customer who is attempting to build their product in ClearCase dynamic
>> views on Linux. When they went from Red hat Enterprise Linux 4
>> (update 5)
>> to Red Hat Enterprise Linux 5 (Update 2), their build performance
>> degraded
>> dramatically. When troubleshooting the issue, we noticed that links on
>> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even
>> though
>> the storage we were writing to was EXPLICITLY mounted async. (This made
>> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
>>
>> On consultation with some internal resources, we found this change in
>> the
>> 2.6 kernel:
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>>
>>
>> In here it looks like the NFS client is forcing sync writes any time a
>> write of less than the NFS write size occurs. We tested this
>> hypothesis by
>> setting the write size to 2KB. The "STABLE" writes went away and link
>> times came back down out of the stratosphere. We built a modified kernel
>> based on the RHEL 5.2 kernel (that ONLY backed out of this change)
>> and we
>> got a 33% improvement in overall build speeds. In my case, I see almost
>> identical build times between the 2 OS's when we use this modified
>> kernel
>> on RHEL 5.
>>
>> Now, why am I posing this to the list? I need to understand *why* that
>> change was made. On the face of it, simply backing out that patch
>> would be
>> perfect. I'm paranoid. I want to make sure that this is the ONLY reason:
>> "/* For single writes, FLUSH_STABLE is more efficient */ "
>>
>> It seems more accurate to say that they *aren't* more efficient, but
>> rather are "safer, but slower."
>
> They are more efficient from the point of view that only a single RPC
> is needed for a complete write. The WRITE and COMMIT are done in a
> single request.
>
> I don't think the issue here is whether the write is stable, but it is
> whether the NFS client has to block the application for it. A stable
> write that is asynchronous to the application is faster than
> WRITE+COMMIT.
>
> So it's not "stable" that is holding you up, it's "synchronous."
> Those are orthogonal concepts.
>
Actually, the "stable" part can be a killer. It depends upon
why and when nfs_flush_inode() is invoked.
I did quite a bit of work on this aspect of RHEL-5 and discovered
that this particular code was leading to some serious slowdowns.
The server would end up doing a very slow FILE_SYNC write when
all that was really required was an UNSTABLE write at the time.
Did anyone actually measure this optimization and if so, what
were the numbers?
Thanx...
ps
On Thu, Apr 30, 2009 at 04:12:19PM -0400, Brian R Cowan wrote:
> Hello all,
>
> This is my first post, so please be gentle.... I have been working with a
> customer who is attempting to build their product in ClearCase dynamic
> views on Linux.
> I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4
> kernel, and SLES 9 is based on something in the same ballpark. And our
> customers see problems when they go to SLES 10/RHEL 5 from the prior major
> distro version.
You should probably complain to the distro vendors if you use distro
kernels. And even when the change might not be directly related, please
reproduce anything posted to upstream projects without binary-only
module junk like clearcase.
On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> Chuck Lever wrote:
> >
> > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> >>
> Actually, the "stable" part can be a killer. It depends upon
> why and when nfs_flush_inode() is invoked.
>
> I did quite a bit of work on this aspect of RHEL-5 and discovered
> that this particular code was leading to some serious slowdowns.
> The server would end up doing a very slow FILE_SYNC write when
> all that was really required was an UNSTABLE write at the time.
>
> Did anyone actually measure this optimization and if so, what
> were the numbers?
As usual, the optimisation is workload dependent. The main type of
workload we're targeting with this patch is the app that opens a file,
writes < 4k and then closes the file. For that case, it's a no-brainer
that you don't need to split a single stable write into an unstable + a
commit.
So if the application isn't doing the above type of short write followed
by close, then exactly what is causing a flush to disk in the first
place? Ordinarily, the client will try to cache writes until the cows
come home (or until the VM tells it to reclaim memory - whichever comes
first)...
Cheers
Trond
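Concretely, the target workload Trond describes is just the sequence
below -- a minimal userspace sketch, with a made-up path on an NFS mount:

/* "Open, write < 4k, close": on close the client flushes a single short
 * write, so one stable WRITE can replace a WRITE plus a COMMIT. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/nfs/tiny.txt";   /* hypothetical NFS mount */
        char buf[512];
        memset(buf, 'x', sizeof(buf));

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                perror("write");

        /* close() flushes the single sub-wsize write. */
        if (close(fd) < 0)
                perror("close");
        return 0;
}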
On Apr 30, 2009, at 4:41 PM, Peter Staubach wrote:
> Chuck Lever wrote:
>>
>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>>
>>> Hello all,
>>>
>>> This is my first post, so please be gentle.... I have been working
>>> with a
>>> customer who is attempting to build their product in ClearCase
>>> dynamic
>>> views on Linux. When they went from Red hat Enterprise Linux 4
>>> (update 5)
>>> to Red Hat Enterprise Linux 5 (Update 2), their build performance
>>> degraded
>>> dramatically. When troubleshooting the issue, we noticed that
>>> links on
>>> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even
>>> though
>>> the storage we were writing to was EXPLICITLY mounted async. (This
>>> made
>>> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
>>>
>>> On consultation with some internal resources, we found this change
>>> in
>>> the
>>> 2.6 kernel:
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>>>
>>>
>>> In here it looks like the NFS client is forcing sync writes any
>>> time a
>>> write of less than the NFS write size occurs. We tested this
>>> hypothesis by
>>> setting the write size to 2KB. The "STABLE" writes went away and
>>> link
>>> times came back down out of the stratosphere. We built a modified
>>> kernel
>>> based on the RHEL 5.2 kernel (that ONLY backed out of this change)
>>> and we
>>> got a 33% improvement in overall build speeds. In my case, I see
>>> almost
>>> identical build times between the 2 OS's when we use this modified
>>> kernel
>>> on RHEL 5.
>>>
>>> Now, why am I posing this to the list? I need to understand *why*
>>> that
>>> change was made. On the face of it, simply backing out that patch
>>> would be
>>> perfect. I'm paranoid. I want to make sure that this is the ONLY
>>> reason:
>>> "/* For single writes, FLUSH_STABLE is more efficient */ "
>>>
>>> It seems more accurate to say that they *aren't* more efficient, but
>>> rather are "safer, but slower."
>>
>> They are more efficient from the point of view that only a single RPC
>> is needed for a complete write. The WRITE and COMMIT are done in a
>> single request.
>>
>> I don't think the issue here is whether the write is stable, but it
>> is
>> whether the NFS client has to block the application for it. A stable
>> write that is asynchronous to the application is faster than
>> WRITE+COMMIT.
>>
>> So it's not "stable" that is holding you up, it's "synchronous."
>> Those are orthogonal concepts.
>>
>
> Actually, the "stable" part can be a killer. It depends upon
> why and when nfs_flush_inode() is invoked.
>
> I did quite a bit of work on this aspect of RHEL-5 and discovered
> that this particular code was leading to some serious slowdowns.
> The server would end up doing a very slow FILE_SYNC write when
> all that was really required was an UNSTABLE write at the time.
If the client is asking for FILE_SYNC when it doesn't need the COMMIT,
then yes, that would hurt performance.
> Did anyone actually measure this optimization and if so, what
> were the numbers?
>
> Thanx...
>
> ps
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> Hello all,
>
> This is my first post, so please be gentle.... I have been working
> with a
> customer who is attempting to build their product in ClearCase dynamic
> views on Linux. When they went from Red hat Enterprise Linux 4
> (update 5)
> to Red Hat Enterprise Linux 5 (Update 2), their build performance
> degraded
> dramatically. When troubleshooting the issue, we noticed that links on
> RHEL 5 caused an incredible number of "STABLE" 4kb nfs writes even
> though
> the storage we were writing to was EXPLICITLY mounted async. (This
> made
> RHEL 5 nearly 5x slower than RHEL 4.5 in this area...)
>
> On consultation with some internal resources, we found this change
> in the
> 2.6 kernel:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>
> In here it looks like the NFS client is forcing sync writes any time a
> write of less than the NFS write size occurs. We tested this
> hypothesis by
> setting the write size to 2KB. The "STABLE" writes went away and link
> times came back down out of the stratosphere. We built a modified
> kernel
> based on the RHEL 5.2 kernel (that ONLY backed out of this change)
> and we
> got a 33% improvement in overall build speeds. In my case, I see
> almost
> identical build times between the 2 OS's when we use this modified
> kernel
> on RHEL 5.
>
> Now, why am I posing this to the list? I need to understand *why* that
> change was made. On the face of it, simply backing out that patch
> would be
> perfect. I'm paranoid. I want to make sure that this is the ONLY
> reason:
> "/* For single writes, FLUSH_STABLE is more efficient */ "
>
> It seems more accurate to say that they *aren't* more efficient, but
> rather are "safer, but slower."
They are more efficient from the point of view that only a single RPC
is needed for a complete write. The WRITE and COMMIT are done in a
single request.
I don't think the issue here is whether the write is stable, but it is
whether the NFS client has to block the application for it. A stable
write that is asynchronous to the application is faster than WRITE
+COMMIT.
So it's not "stable" that is holding you up, it's "synchronous."
Those are orthogonal concepts.
> I know that this is a 3+ year old update, but RHEL 4 is based on a 2.4
> kernel,
Nope, RHEL 4 is 2.6.9. RHEL 3 is 2.4.20-ish.
> and SLES 9 is based on something in the same ballpark. And our
> customers see problems when they go to SLES 10/RHEL 5 from the prior
> major
> distro version.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
[email protected] wrote on 04/30/2009 05:23:07 PM:
> As usual, the optimisation is workload dependent. The main type of
> workload we're targetting with this patch is the app that opens a file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable + a
> commit.
The app impacted most is the gcc linker... I tested by building Samba,
then by linking smbd. We think the linker memory-maps the output file,
but I don't know for sure, since I'm no more familiar with the gcc source
than I am an expert in the Linux NFS implementation. In any event, the linker is
doing all kinds of lseeks and writes as it builds the output executable
based on the various .o files being linked in. All of those writes are
slowed down by this write change. If we were closing the file afterwards,
that would be one thing, but we're not...
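Roughly, that access pattern looks like the sketch below: scattered
lseek()/write() calls on one open output file. This is only an illustration
with made-up offsets and paths, not actual linker code:

/* Scattered lseek()+write() on a single open output file, the way a linker
 * patches sections into place -- each small write dirties a sub-wsize
 * range of some page. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/nfs/view/smbd";  /* hypothetical output on NFS */
        int fd = open(path, O_RDWR | O_CREAT, 0755);
        if (fd < 0) { perror("open"); return 1; }

        char chunk[256];
        memset(chunk, 0, sizeof(chunk));

        /* Jump around the file writing small pieces; offsets are arbitrary. */
        off_t offsets[] = { 0, 65536, 4096, 131072, 8192, 1024 };
        for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
                if (lseek(fd, offsets[i], SEEK_SET) == (off_t)-1) {
                        perror("lseek");
                        break;
                }
                if (write(fd, chunk, sizeof(chunk)) < 0) {
                        perror("write");
                        break;
                }
        }

        close(fd);  /* the application itself never asks for stable storage */
        return 0;
}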
>
> So if the application isn't doing the above type of short write followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever comes
> first)...
We suspect it's the latter (something telling the system to flush memory)
but chasing that looks to be a challenge...
>
> Cheers
> Trond
>
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
On Fri, 2009-05-29 at 16:09 -0400, Brian R Cowan wrote:
> I think you missed the context of my comment... Previous to this
> 4-year-old update, the writes were not sent with STABLE, this update
> forced that behavior. So, before then we sent an UNSTABLE write request.
> This would either give us back the UNSTABLE or FILE_SYNC response. My
> question is this: When the server sends back UNSTABLE, as a response to
> UNSTABLE, exactly what happens? By some chance is there a separate worker
> thread that occasionally sends COMMITs back to the server?
pdflush will do it occasionally, but otherwise the COMMITs are all sent
synchronously by the thread that is flushing out the data.
In this case, the flush is done by the call to nfs_wb_page() in
nfs_readpage(), and it waits synchronously for the unstable WRITE and
the subsequent COMMIT to finish.
Note that there is no way to bypass the wait: if some other thread jumps
in and sends the COMMIT (after the unstable write has returned), then
the caller of nfs_wb_page() still has to wait for that call to complete,
and for nfs_commit_release() to mark the page as clean.
Trond
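Stripped down, the case Trond describes -- a read() landing on a page that
still holds cached dirty data -- looks like the sketch below. The path is
made up, and it assumes the file already exists on the server and has not
yet been cached by the client:

/* Dirty part of a page, then read elsewhere in the same page: before the
 * READ can fetch fresh data, nfs_readpage() has to flush the dirty bytes
 * synchronously (a WRITE, plus a COMMIT if the write went out unstable). */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/nfs/view/foo.o";  /* hypothetical, pre-existing */
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        char out[100], in[100];
        memset(out, 'a', sizeof(out));

        /* Dirty the first 100 bytes of the first page... */
        if (pwrite(fd, out, sizeof(out), 0) < 0)
                perror("pwrite");

        /* ...then read a different range of that page. The page is not
         * fully up to date, so the client must flush, then READ. */
        if (pread(fd, in, sizeof(in), 200) < 0)
                perror("pread");

        close(fd);
        return 0;
}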
So, it is possible that either pdflush is sending the commits or us, or
that the commits are happening when the file closes, giving us one/tens of
commits instead of hundreds or thousands. That's a big difference. The
write RPCs still happen in RHEL 4, they just don't block the linker, or at
least nowhere near as often. Since there is only one application/thread
(the gcc linker) writing this file, the odds of another task getting
stalled here are minimal at best.
This optimization definitely helps server utilization for copies of large
numbers of small files, and I personally don't care which is the default
(though I have a coworker who is of the opinion that async means async,
and if he wanted sync writes, he would either mount with nfsvers=2 or
mount sync). But we need the option to turn it off for cases where it is
thought to cause problems.
You mention that one can set the async export option, but 1) it may not
always be available; and 2) it essentially tells the server to "lie" about
write status, something that can bite us seriously if the server crashes,
hits a disk-full error, etc. And in any event, it's something that only a
particular class of clients is impacted by, and making a change to *all*
clients so that *some* work in the expected manner feels about as graceful
as dynamite fishing...
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <[email protected]>, [email protected],
[email protected], Peter Staubach <[email protected]>
Date:
05/29/2009 04:28 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
On Fri, 2009-05-29 at 16:09 -0400, Brian R Cowan wrote:
> I think you missed the context of my comment... Previous to this
> 4-year-old update, the writes were not sent with STABLE, this update
> forced that behavior. So, before then we sent an UNSTABLE write request.
> This would either give us back the UNSTABLE or FILE_SYNC response. My
> question is this: When the server sends back UNSTABLE, as a response to
> UNSTABLE, exactly what happens? By some chance is there a separate worker
> thread that occasionally sends COMMITs back to the server?
pdflush will do it occasionally, but otherwise the COMMITs are all sent
synchronously by the thread that is flushing out the data.
In this case, the flush is done by the call to nfs_wb_page() in
nfs_readpage(), and it waits synchronously for the unstable WRITE and
the subsequent COMMIT to finish.
Note that there is no way to bypass the wait: if some other thread jumps
in and sends the COMMIT (after the unstable write has returned), then
the caller of nfs_wb_page() still has to wait for that call to complete,
and for nfs_commit_release() to mark the page as clean.
Trond
On Fri, 2009-05-29 at 17:55 -0400, Brian R Cowan wrote:
> So, it is possible that either pdflush is sending the commits or us, or
> that the commits are happening when the file closes, giving us one/tens of
> commits instead of hundreds or thousands. That's a big difference. The
> write RPCs still happen in RHEL 4, they just don't block the linker, or at
> least nowhere near as often. Since there is only one application/thread
> (the gcc linker) writing this file, the odds of another task getting
> stalled here are minimal at best.
No, you're not listening! That COMMIT is _synchronous_ and happens
before you can proceed with the READ request. There is no economy of
scale as you seem to assume.
Trond
I am listening.
Commit is sync. I get that.
The NFS client does Async writes in RHEL 4. They *eventually* get
committed. (Doesn't really matter who causes the commit, does it.)
Read system calls may trigger cache flushing, but since not all of them
are sync writes, the reads don't *always* stall when cache flushes occur.
Builds are fast.
We do sync writes in RHEL 5, so they MUST stop and wait for the NFS server
to come back.
READ system calls stall when the read triggers a flush of one or more
cache pages.
Builds are slow. Links are at least 4x slower.
I am perfectly willing to send you network traces showing the issue. I can
even DEMONSTRATE it for you using the remote meeting software of your
choice. I can even demonstrate the impact of removing that behavior.
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <[email protected]>, [email protected],
[email protected], Peter Staubach <[email protected]>
Date:
05/29/2009 06:06 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
On Fri, 2009-05-29 at 17:55 -0400, Brian R Cowan wrote:
> So, it is possible that either pdflush is sending the commits or us, or
> that the commits are happening when the file closes, giving us one/tens of
> commits instead of hundreds or thousands. That's a big difference. The
> write RPCs still happen in RHEL 4, they just don't block the linker, or at
> least nowhere near as often. Since there is only one application/thread
> (the gcc linker) writing this file, the odds of another task getting
> stalled here are minimal at best.
No, you're not listening! That COMMIT is _synchronous_ and happens
before you can proceed with the READ request. There is no economy of
scale as you seem to assume.
Trond
On Fri, 2009-05-29 at 18:20 -0400, Brian R Cowan wrote:
> I am listening.
>
> Commit is sync. I get that.
>
> The NFS client does Async writes in RHEL 4. They *eventually* get
> committed. (Doesn't really matter who causes the commit, does it.)
> Read system calls may trigger cache flushing, but since not all of them
> are sync writes, the reads don't *always* stall when cache flushes occur.
> Builds are fast.
All reads that trigger writes will trigger _sync_ writes and _sync_
commits. That's true of RHEL-5, RHEL-4, RHEL-3, and all the way back to
the very first 2.4 kernels. There is no deferred commit in that case,
because the cached dirty data needs to be overwritten by a fresh read,
which means that we may lose the data if the server reboots between the
unstable write and the ensuing read.
> We do sync writes in RHEL 5, so they MUST stop and wait for the NFS server
> to come back.
> READ system calls stall whan the read triggers a flush of one or more
> cache pages.
> Builds are slow. Links are at least 4x slower.
>
> I am perfectly willing to send you network traces showing the issue. I can
> even DEMONSTRATE it for you using the remote meeting software of your
> choice. I can even demonstrate the impact of removing that behavior.
Can you demonstrate it using a recent kernel? If it's a problem that is
limited to RHEL-5, then it is up to Peter & co to pull in the fixes from
mainline, but if the slowdown is still present in 2.6.30, then I'm all
ears. However I don't for a minute accept your explanation that this has
something to do with stable vs unstable+commit.
Trond
If you can explain how pulling that ONE change can cause the performance
issue to essentially disappear, I'd be more than happy to *try* to get a
2.6.30 test environment configured. Getting ClearCase to *install* on
kernel.org kernels is a non-trivial operation, requiring modifications to
install scripts, module makefiles, etc. Then there is the issue of
verifying that nothing else is impacted, all before I can begin to do this
test. We're talking days here.
To be blunt, I'd need something I can take to a manager who will ask me
why I'm spending so much time on an issue when we "already have the
cause."
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <[email protected]>, [email protected],
[email protected], Peter Staubach <[email protected]>
Date:
05/29/2009 06:38 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
On Fri, 2009-05-29 at 18:20 -0400, Brian R Cowan wrote:
> I am listening.
>
> Commit is sync. I get that.
>
> The NFS client does Async writes in RHEL 4. They *eventually* get
> committed. (Doesn't really matter who causes the commit, does it.)
> Read system calls may trigger cache flushing, but since not all of them
> are sync writes, the reads don't *always* stall when cache flushes occur.
> Builds are fast.
All reads that trigger writes will trigger _sync_ writes and _sync_
commits. That's true of RHEL-5, RHEL-4, RHEL-3, and all the way back to
the very first 2.4 kernels. There is no deferred commit in that case,
because the cached dirty data needs to be overwritten by a fresh read,
which means that we may lose the data if the server reboots between the
unstable write and the ensuing read.
> We do sync writes in RHEL 5, so they MUST stop and wait for the NFS server
> to come back.
> READ system calls stall whan the read triggers a flush of one or more
> cache pages.
> Builds are slow. Links are at least 4x slower.
>
> I am perfectly willing to send you network traces showing the issue. I can
> even DEMONSTRATE it for you using the remote meeting software of your
> choice. I can even demonstrate the impact of removing that behavior.
Can you demonstrate it using a recent kernel? If it's a problem that is
limited to RHEL-5, then it is up to Peter & co to pull in the fixes from
mainline, but if the slowdown is still present in 2.6.30, then I'm all
ears. However I don't for a minute accept your explanation that this has
something to do with stable vs unstable+commit.
Trond
On Fri, 2009-05-29 at 19:02 -0400, Brian R Cowan wrote:
> If you can explain how pulling that ONE change can cause the performance
> issue to essentially disappear, I'd be more than happy to *try* to get a
> 2.6.30 test environment configured. Getting ClearCase to *install* on
> kernel.org kernels is a non-trivial operation, requiring modifications to
> install scripts, module makefiles, etc. Then there is the issue of
> verifying that nothing else is impacted, all before I can begin to do this
> test. We're talking days here.
>
> To be blunt, I'd need something I can take to a manager who will ask me
> why I'm spending so much time on an issue when we "already have the
> cause."
It's simple: you are the one asking for a change to the established
kernel behaviour, so you get to justify that change. Saying "it breaks
clearcase on RHEL-5" is not a justification, and I won't ack the change.
Trond
On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
<[email protected]> wrote:
> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>>
>
> What are you smoking? There is _NO_DIFFERENCE_ between what the server
> is supposed to do when sent a single stable write, and what it is
> supposed to do when sent an unstable write plus a commit. BOTH cases are
> supposed to result in the server writing the data to stable storage
> before the stable write / commit is allowed to return a reply.
This probably makes no difference to the discussion, but for a Linux
server there is a subtle difference between what the server is
supposed to do and what it actually does.
For a stable WRITE rpc, the Linux server sets O_SYNC in the struct
file during the vfs_writev() call and expects the underlying
filesystem to obey that flag and flush the data to disk. For a COMMIT
rpc, the Linux server uses the underlying filesystem's f_op->fsync
instead. This results in some potential differences:
* The underlying filesystem might be broken in one code path and not
the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently
failing in f_op->fsync). These kinds of bugs tend to be subtle
because in the absence of a crash they affect only the timing of IO
and so they might not be noticed.
* The underlying filesystem might be doing more or better things in
one or the other code paths e.g. optimising allocations.
* The Linux NFS server ignores the byte range in the COMMIT rpc and
flushes the whole file (I suspect this is a historical accident rather
than deliberate policy). If there is other dirty data on that file
server-side, that other data will be written too before the COMMIT
reply is sent. This may have a performance impact, depending on the
workload.
> The extra RPC round trip (+ parsing overhead ++++) due to the commit
> call is the _only_ difference.
This is almost completely true. If the server behaved ideally and
predictably, this would be completely true.
</pedant>
--
Greg.
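The two server-side paths Greg describes have a familiar userspace
analogue: writing through an O_SYNC descriptor versus a plain write
followed by fsync(). A minimal sketch of the pair, with made-up local paths:

/* Userspace analogue of the two knfsd paths described above:
 *   stable WRITE -> write through an O_SYNC file descriptor
 *   COMMIT       -> plain write now, fsync() later
 * Both are supposed to end with the data on stable storage; which path a
 * given filesystem handles better is implementation-dependent. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        /* Path 1: O_SYNC -- the write returns only once the data is stable. */
        int fd1 = open("/tmp/stable-write.dat",
                       O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
        if (fd1 < 0) { perror("open O_SYNC"); return 1; }
        if (write(fd1, buf, sizeof(buf)) < 0)
                perror("write O_SYNC");
        close(fd1);

        /* Path 2: unstable write now, fsync() as the "COMMIT" later. */
        int fd2 = open("/tmp/unstable-write.dat",
                       O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd2 < 0) { perror("open"); return 1; }
        if (write(fd2, buf, sizeof(buf)) < 0)
                perror("write");
        if (fsync(fd2) < 0)           /* flushes the whole file, like COMMIT */
                perror("fsync");
        close(fd2);

        return 0;
}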
On Sat, May 30, 2009 at 10:22:58AM +1000, Greg Banks wrote:
> * The underlying filesystem might be doing more or better things in
> one or the other code paths e.g. optimising allocations.
Which is the case with ext3, which is pretty common. It does reasonably
well on O_SYNC as far as I can see, but has a catastrophic fsync
implementation.
> * The Linux NFS server ignores the byte range in the COMMIT rpc and
> flushes the whole file (I suspect this is a historical accident rather
> than deliberate policy). If there is other dirty data on that file
> server-side, that other data will be written too before the COMMIT
> reply is sent. This may have a performance impact, depending on the
> workload.
Right now we can't actually implement that properly because the fsync
file operation can't actually flush sub-ranges. There have been some
other requests for this, but my ->fsync redesign is on hold until
NFSD stops calling ->fsync without a file struct.
I think the open file cache will help us with that, if we can extend
it to also cache open file structs for directories.
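For comparison, userspace does have a byte-range flush primitive on Linux,
sync_file_range(2), which is roughly the shape a range-respecting COMMIT
would want. A minimal sketch with a made-up path; note it only schedules
and waits for page writeback, with no metadata flush, so it is not a
substitute for fsync():

/* Flush only one byte range of a file's dirty pages with sync_file_range(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/range-flush.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[64 * 1024];
        memset(buf, 'x', sizeof(buf));
        if (write(fd, buf, sizeof(buf)) < 0) {
                perror("write");
                close(fd);
                return 1;
        }

        /* Write back just the first 4 KB and wait for it; the rest of the
         * dirty data stays in the page cache. */
        if (sync_file_range(fd, 0, 4096,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                perror("sync_file_range");

        close(fd);
        return 0;
}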
On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> <[email protected]> wrote:
> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >>
> >
> > What are you smoking? There is _NO_DIFFERENCE_ between what the server
> > is supposed to do when sent a single stable write, and what it is
> > supposed to do when sent an unstable write plus a commit. BOTH cases are
> > supposed to result in the server writing the data to stable storage
> > before the stable write / commit is allowed to return a reply.
>
> This probably makes no difference to the discussion, but for a Linux
> server there is a subtle difference between what the server is
> supposed to do and what it actually does.
>
> For a stable WRITE rpc, the Linux server sets O_SYNC in the struct
> file during the vfs_writev() call and expects the underlying
> filesystem to obey that flag and flush the data to disk. For a COMMIT
> rpc, the Linux server uses the underlying filesystem's f_op->fsync
> instead. This results in some potential differences:
>
> * The underlying filesystem might be broken in one code path and not
> the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently
> failing in f_op->fsync). These kinds of bugs tend to be subtle
> because in the absence of a crash they affect only the timing of IO
> and so they might not be noticed.
>
> * The underlying filesystem might be doing more or better things in
> one or the other code paths e.g. optimising allocations.
>
> * The Linux NFS server ignores the byte range in the COMMIT rpc and
> flushes the whole file (I suspect this is a historical accident rather
> than deliberate policy). If there is other dirty data on that file
> server-side, that other data will be written too before the COMMIT
> reply is sent. This may have a performance impact, depending on the
> workload.
>
> > The extra RPC round trip (+ parsing overhead ++++) due to the commit
> > call is the _only_ difference.
>
> This is almost completely true. If the server behaved ideally and
> predictably, this would be completely true.
>
> </pedant>
>
Firstly, the server only uses O_SYNC if you turn off write gathering
(a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
server is to always try write gathering and hence no O_SYNC.
Secondly, even if it were the case, then this does not justify changing
the client behaviour. The NFS protocol does not mandate, or even
recommend that the server use O_SYNC. All it says is that a stable write
and an unstable write+commit should both have the same result: namely
that the data+metadata must have been flushed to stable storage. The
protocol spec leaves it as an exercise to the server implementer to do
this as efficiently as possible.
Trond
On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
<[email protected]> wrote:
> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
>> <[email protected]> wrote:
>> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>> >>
>>
>
> Firstly, the server only uses O_SYNC if you turn off write gathering
> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> server is to always try write gathering and hence no O_SYNC.
Well, write gathering is a total crock that AFAICS only helps
single-file writes on NFSv2. For today's workloads all it does is
provide a hotspot on the two global variables that track writes in an
attempt to gather them. Back when I worked on a server product,
no_wdelay was one of the standard options for new exports.
> Secondly, even if it were the case, then this does not justify changing
> the client behaviour.
I totally agree, it was just an observation.
In any case, as Christoph points out, the ext3 performance difference
makes an unstable WRITE+COMMIT slower than a stable WRITE, and you
already assumed that.
--
Greg.
Been working this issue with Red Hat, and didn't need to go to the list...
Well, now I do... You mention that "The main type of workload we're
targetting with this patch is the app that opens a file, writes < 4k and
then closes the file." Well, it appears that this issue also impacts
flushing pages from filesystem caches.
The reason this came up in my environment is that our product's build
auditing gives the filesystem cache an interesting workout. When
ClearCase audits a build, the build places data in a few places,
including:
1) a build audit file that usually resides in /tmp. This build audit is
essentially a log of EVERY file open/read/write/delete/rename/etc. that
the programs called by the build script perform in the ClearCase "view"
you're building in. As a result, this file can get pretty large.
2) The build outputs themselves, which in this case are being written to a
remote storage location on a Linux or Solaris server, and
3) a file called .cmake.state, which is a local cache that is written to
after the build script completes containing what is essentially a "Bill of
materials" for the files created during builds in this "view."
We believe that the build audit file access is causing build output to get
flushed out of the filesystem cache. These flushes happen *in 4k chunks.*
This trips over this change since the cache pages appear to get flushed on
an individual basis.
One note is that if the build outputs were going to a clearcase view
stored on an enterprise-level NAS device, there isn't as much of an issue
because many of these return from the stable write request as soon as the
data goes into the battery-backed memory disk cache on the NAS. However,
it really impacts writes to general-purpose OS's that follow Sun's lead in
how they handle "stable" writes. The truly annoying part about this rather
subtle change is that the NFS client is specifically ignoring the client
mount options since we cannot force the "async" mount option to turn off
this behavior.
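Schematically, the workload is one stream of appends to a local audit log
interleaved with small scattered writes to the NFS-backed build output.
A rough model of that mix, with made-up paths and arbitrary sizes:

/* Rough model of the build-audit workload described above: a steadily
 * growing local audit log competes for page cache with build output being
 * written over NFS, so (per the suspicion in this thread) the VM ends up
 * pushing the NFS pages out a few at a time. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        int audit = open("/tmp/build-audit.log",
                         O_WRONLY | O_CREAT | O_APPEND, 0644);
        int out = open("/mnt/nfs/view/obj/foo.o", O_RDWR | O_CREAT, 0644);
        if (audit < 0 || out < 0) { perror("open"); return 1; }

        char rec[512], data[256];
        memset(rec, 'A', sizeof(rec));
        memset(data, 0, sizeof(data));

        for (int i = 0; i < 10000; i++) {
                /* Every file operation made by the build tools gets logged
                 * locally... */
                if (write(audit, rec, sizeof(rec)) < 0) {
                        perror("audit write");
                        break;
                }
                /* ...while the output file keeps taking small scattered
                 * writes. */
                off_t off = (off_t)(i % 64) * 4096 + (i % 7) * 100;
                if (pwrite(out, data, sizeof(data), off) < 0) {
                        perror("pwrite");
                        break;
                }
        }

        close(audit);
        close(out);
        return 0;
}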
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Peter Staubach <[email protected]>
Cc:
Chuck Lever <[email protected]>, Brian R Cowan/Cupertino/IBM@IBMUS,
[email protected]
Date:
04/30/2009 05:23 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by:
[email protected]
On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> Chuck Lever wrote:
> >
> > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> >>
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> >>
> Actually, the "stable" part can be a killer. It depends upon
> why and when nfs_flush_inode() is invoked.
>
> I did quite a bit of work on this aspect of RHEL-5 and discovered
> that this particular code was leading to some serious slowdowns.
> The server would end up doing a very slow FILE_SYNC write when
> all that was really required was an UNSTABLE write at the time.
>
> Did anyone actually measure this optimization and if so, what
> were the numbers?
As usual, the optimisation is workload dependent. The main type of
workload we're targetting with this patch is the app that opens a file,
writes < 4k and then closes the file. For that case, it's a no-brainer
that you don't need to split a single stable write into an unstable + a
commit.
So if the application isn't doing the above type of short write followed
by close, then exactly what is causing a flush to disk in the first
place? Ordinarily, the client will try to cache writes until the cows
come home (or until the VM tells it to reclaim memory - whichever comes
first)...
Cheers
Trond
Look... This happens when you _flush_ the file to stable storage if
there is only a single write < wsize. It isn't the business of the NFS
layer to decide when you flush the file; that's an application
decision...
Trond
On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote:
> Been working this issue with Red hat, and didn't need to go to the list...
> Well, now I do... You mention that "The main type of workload we're
> targetting with this patch is the app that opens a file, writes < 4k and
> then closes the file." Well, it appears that this issue also impacts
> flushing pages from filesystem caches.
>
> The reason this came up in my environment is that our product's build
> auditing gives the the filesystem cache an interesting workout. When
> ClearCase audits a build, the build places data in a few places,
> including:
> 1) a build audit file that usually resides in /tmp. This build audit is
> essentially a log of EVERY file open/read/write/delete/rename/etc. that
> the programs called in the build script make in the clearcase "view"
> you're building in. As a result, this file can get pretty large.
> 2) The build outputs themselves, which in this case are being written to a
> remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is written to
> after the build script completes containing what is essentially a "Bill of
> materials" for the files created during builds in this "view."
>
> We believe that the build audit file access is causing build output to get
> flushed out of the filesystem cache. These flushes happen *in 4k chunks.*
> This trips over this change since the cache pages appear to get flushed on
> an individual basis.
>
> One note is that if the build outputs were going to a clearcase view
> stored on an enterprise-level NAS device, there isn't as much of an issue
> because many of these return from the stable write request as soon as the
> data goes into the battery-backed memory disk cache on the NAS. However,
> it really impacts writes to general-purpose OS's that follow Sun's lead in
> how they handle "stable" writes. The truly annoying part about this rather
> subtle change is that the NFS client is specifically ignoring the client
> mount options since we cannot force the "async" mount option to turn off
> this behavior.
>
> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to [email protected] to be sure your PMR is updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <[email protected]>
> To:
> Peter Staubach <[email protected]>
> Cc:
> Chuck Lever <[email protected]>, Brian R Cowan/Cupertino/IBM@IBMUS,
> [email protected]
> Date:
> 04/30/2009 05:23 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
> Sent by:
> [email protected]
>
>
>
> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> > Chuck Lever wrote:
> > >
> > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> > >>
> > >>
> > >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> > >>
> > Actually, the "stable" part can be a killer. It depends upon
> > why and when nfs_flush_inode() is invoked.
> >
> > I did quite a bit of work on this aspect of RHEL-5 and discovered
> > that this particular code was leading to some serious slowdowns.
> > The server would end up doing a very slow FILE_SYNC write when
> > all that was really required was an UNSTABLE write at the time.
> >
> > Did anyone actually measure this optimization and if so, what
> > were the numbers?
>
> As usual, the optimisation is workload dependent. The main type of
> workload we're targetting with this patch is the app that opens a file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable + a
> commit.
>
> So if the application isn't doing the above type of short write followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever comes
> first)...
>
> Cheers
> Trond
>
>
>
On May 29, 2009, at 11:55 AM, Brian R Cowan wrote:
> Been working this issue with Red hat, and didn't need to go to the
> list...
> Well, now I do... You mention that "The main type of workload we're
> targetting with this patch is the app that opens a file, writes < 4k
> and
> then closes the file." Well, it appears that this issue also impacts
> flushing pages from filesystem caches.
>
> The reason this came up in my environment is that our product's build
> auditing gives the the filesystem cache an interesting workout. When
> ClearCase audits a build, the build places data in a few places,
> including:
> 1) a build audit file that usually resides in /tmp. This build audit
> is
> essentially a log of EVERY file open/read/write/delete/rename/etc.
> that
> the programs called in the build script make in the clearcase "view"
> you're building in. As a result, this file can get pretty large.
> 2) The build outputs themselves, which in this case are being
> written to a
> remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is
> written to
> after the build script completes containing what is essentially a
> "Bill of
> materials" for the files created during builds in this "view."
>
> We believe that the build audit file access is causing build output
> to get
> flushed out of the filesystem cache. These flushes happen *in 4k
> chunks.*
> This trips over this change since the cache pages appear to get
> flushed on
> an individual basis.
So, are you saying that the application is flushing after every 4KB
write(2), or that the application has written a bunch of pages, and VM/
VFS on the client is doing the synchronous page flushes? If it's the
application doing this, then you really do not want to mitigate this
by defeating the STABLE writes -- the application must have some
requirement that the data is permanent.
Unless I have misunderstood something, the previous faster behavior
was due to cheating, and put your data at risk. I can't see how
replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would
cause such a significant performance impact.
> One note is that if the build outputs were going to a clearcase view
> stored on an enterprise-level NAS device, there isn't as much of an
> issue
> because many of these return from the stable write request as soon
> as the
> data goes into the battery-backed memory disk cache on the NAS.
> However,
> it really impacts writes to general-purpose OS's that follow Sun's
> lead in
> how they handle "stable" writes. The truly annoying part about this
> rather
> subtle change is that the NFS client is specifically ignoring the
> client
> mount options since we cannot force the "async" mount option to turn
> off
> this behavior.
You may have a misunderstanding about what exactly "async" does. The
"sync" / "async" mount options control only whether the application
waits for the data to be flushed to permanent storage. They have no
effect, on any file system I know of, on _how_ the data actually gets
moved from the page cache to permanent storage.
> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to [email protected] to be sure your PMR is
> updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <[email protected]>
> To:
> Peter Staubach <[email protected]>
> Cc:
> Chuck Lever <[email protected]>, Brian R Cowan/Cupertino/
> IBM@IBMUS,
> [email protected]
> Date:
> 04/30/2009 05:23 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page
> flushing
> Sent by:
> [email protected]
>
>
>
> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
>> Chuck Lever wrote:
>>>
>>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>>>>
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>>>>
>> Actually, the "stable" part can be a killer. It depends upon
>> why and when nfs_flush_inode() is invoked.
>>
>> I did quite a bit of work on this aspect of RHEL-5 and discovered
>> that this particular code was leading to some serious slowdowns.
>> The server would end up doing a very slow FILE_SYNC write when
>> all that was really required was an UNSTABLE write at the time.
>>
>> Did anyone actually measure this optimization and if so, what
>> were the numbers?
>
> As usual, the optimisation is workload dependent. The main type of
> workload we're targetting with this patch is the app that opens a
> file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable
> + a
> commit.
>
> So if the application isn't doing the above type of short write
> followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever
> comes
> first)...
>
> Cheers
> Trond
>
>
>
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Ah, but I submit that the application isn't making the decision... The OS
is. My testcase is building Samba on Linux using gcc. The gcc linker sure
isn't deciding to flush the file. It's happily seeking/reading and
seeking/writing with no idea what is happening under the covers. When the
build gets audited, the cache gets flushed... No audit, no flush. The only
apparent difference is that we have an audit file getting written to on
the local disk. The linker has no idea it's getting audited.
I'm interested in knowing what kind of performance benefit this
optimization is providing in small-file writes. Unless it's incredibly
dramatic, then I really don't see why we can't do one of the following:
1) get rid of it,
2) find some way to not do it when the OS flushes filesystem cache, or
3) make the "async" mount option turn it off, or
4) create a new mount option to force the optimization on/off.
I just don't see how a single RPC saved is saving all that much time.
Since:
- open
- write (unstable) <write size
- commit
- close
Depends on the commit call to finish writing to disk, and
- open
- write (stable) <write size
- close
Also depends on the time taken to write the data to disk, I can't see the
one less RPC buying that much time, other than perhaps on NAS devices.
This may reduce the server load, but this is ignoring the mount options.
We can't turn this behavior OFF, and that's the biggest issue. I don't
mind the small-file-write optimization itself, as long as I and my
customers are able to CHOOSE whether the optimization is active. It boils
down to this: when I *categorically* say that the mount is async, the OS
should pay attention. There are cases when the OS doesn't know best. If
the OS always knew what would work best, there wouldn't be nearly as many
mount options as there are now.
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <[email protected]>, [email protected],
[email protected], Peter Staubach <[email protected]>
Date:
05/29/2009 12:47 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by:
[email protected]
Look... This happens when you _flush_ the file to stable storage if
there is only a single write < wsize. It isn't the business of the NFS
layer to decide when you flush the file; that's an application
decision...
Trond
On Fri, 2009-05-29 at 11:55 -0400, Brian R Cowan wrote:
> Been working this issue with Red hat, and didn't need to go to the list...
> Well, now I do... You mention that "The main type of workload we're
> targetting with this patch is the app that opens a file, writes < 4k and
> then closes the file." Well, it appears that this issue also impacts
> flushing pages from filesystem caches.
>
> The reason this came up in my environment is that our product's build
> auditing gives the the filesystem cache an interesting workout. When
> ClearCase audits a build, the build places data in a few places,
> including:
> 1) a build audit file that usually resides in /tmp. This build audit is
> essentially a log of EVERY file open/read/write/delete/rename/etc. that
> the programs called in the build script make in the clearcase "view"
> you're building in. As a result, this file can get pretty large.
> 2) The build outputs themselves, which in this case are being written to a
> remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is written to
> after the build script completes containing what is essentially a "Bill of
> materials" for the files created during builds in this "view."
>
> We believe that the build audit file access is causing build output to get
> flushed out of the filesystem cache. These flushes happen *in 4k chunks.*
> This trips over this change since the cache pages appear to get flushed on
> an individual basis.
>
> One note is that if the build outputs were going to a clearcase view
> stored on an enterprise-level NAS device, there isn't as much of an issue
> because many of these return from the stable write request as soon as the
> data goes into the battery-backed memory disk cache on the NAS. However,
> it really impacts writes to general-purpose OS's that follow Sun's lead in
> how they handle "stable" writes. The truly annoying part about this rather
> subtle change is that the NFS client is specifically ignoring the client
> mount options since we cannot force the "async" mount option to turn off
> this behavior.
>
> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to [email protected] to be sure your PMR is updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <[email protected]>
> To:
> Peter Staubach <[email protected]>
> Cc:
> Chuck Lever <[email protected]>, Brian R Cowan/Cupertino/IBM@IBMUS,
> [email protected]
> Date:
> 04/30/2009 05:23 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
> Sent by:
> [email protected]
>
>
>
> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
> > Chuck Lever wrote:
> > >
> > > On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
> > >>
> > >>
> > >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
> > >>
> > Actually, the "stable" part can be a killer. It depends upon
> > why and when nfs_flush_inode() is invoked.
> >
> > I did quite a bit of work on this aspect of RHEL-5 and discovered
> > that this particular code was leading to some serious slowdowns.
> > The server would end up doing a very slow FILE_SYNC write when
> > all that was really required was an UNSTABLE write at the time.
> >
> > Did anyone actually measure this optimization and if so, what
> > were the numbers?
>
> As usual, the optimisation is workload dependent. The main type of
> workload we're targetting with this patch is the app that opens a file,
> writes < 4k and then closes the file. For that case, it's a no-brainer
> that you don't need to split a single stable write into an unstable + a
> commit.
>
> So if the application isn't doing the above type of short write followed
> by close, then exactly what is causing a flush to disk in the first
> place? Ordinarily, the client will try to cache writes until the cows
> come home (or until the VM tells it to reclaim memory - whichever comes
> first)...
>
> Cheers
> Trond
>
>
>
On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> Ah, but I submit that the application isn't making the decision... The OS
> is. My testcase is building Samba on Linux using gcc. The gcc linker sure
> isn't deciding to flush the file. It's happily seeking/reading and
> seeking/writing with no idea what is happening under the covers. When the
> build gets audited, the cache gets flushed... No audit, no flush. The only
> apparent difference is that we have an audit file getting written to on
> the local disk. The linker has no idea it's getting audited.
>
> I'm interested in knowing what kind of performance benefit this
> optimization is providing in small-file writes. Unless it's incredibly
> dramatic, then I really don't see why we can't do one of the following:
> 1) get rid of it,
> 2) find some way to not do it when the OS flushes filesystem cache, or
> 3) make the "async" mount option turn it off, or
> 4) create a new mount option to force the optimization on/off.
>
> I just don't see how a single RPC saved is saving all that much time.
> Since:
> - open
> - write (unstable) <write size
> - commit
> - close
> Depends on the commit call to finish writing to disk, and
> - open
> - write (stable) <write size
> - close
> Also depends on the time taken to write the data to disk, I can't see the
> one less RPC buying that much time, other than perhaps on NAS devices.
>
> This may reduce the server load, but this is ignoring the mount options.
> We can't turn this behavior OFF, and that's the biggest issue. I don't
> mind the small-file-write optimization itself, as long as I and my
> customers are able to CHOOSE whether the optimization is active. It boils
> down to this: when I *categorically* say that the mount is async, the OS
> should pay attention. There are cases when the OS doesn't know best. If
> the OS always knew what would work best, there wouldn't be nearly as many
> mount options as there are now.
What are you smoking? There is _NO_DIFFERENCE_ between what the server
is supposed to do when sent a single stable write, and what it is
supposed to do when sent an unstable write plus a commit. BOTH cases are
supposed to result in the server writing the data to stable storage
before the stable write / commit is allowed to return a reply.
The extra RPC round trip (+ parsing overhead ++++) due to the commit
call is the _only_ difference.
No, you can't turn this behaviour off (unless you use the 'async' export
option on a Linux server), but there is no difference there between the
stable write and the unstable write + commit.
THEY BOTH RESULT IN THE SAME BEHAVIOUR.
Trond
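For concreteness, here is a minimal sketch (not from the thread; the path and
buffer size are invented) of the open / short write / close workload being
discussed above. On close, the client has a single dirty page to flush, and
the disagreement is only about whether that flush is one FILE_SYNC WRITE or
an UNSTABLE WRITE followed by a COMMIT:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[1024];         /* well under a typical 4k wsize */
        int fd;

        fd = open("/mnt/nfs/smallfile.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return 1;
        memset(buf, 'x', sizeof(buf));
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                close(fd);
                return 1;
        }
        /* Flush-on-close: the single dirty page goes to the server here. */
        close(fd);
        return 0;
}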
> You may have a misunderstanding about what exactly "async" does. The
> "sync" / "async" mount options control only whether the application
> waits for the data to be flushed to permanent storage. They have no
> effect on any file system I know of _how_ specifically the data is
> moved from the page cache to permanent storage.
The problem is that the client change seems to cause the application to
stop until this stable write completes... What is interesting is that it's
not always a write operation that the linker gets stuck on. Our best
hypothesis -- from correlating times in strace and tcpdump traces -- is
that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()*
system calls on the output file (that is opened for read/write). We THINK
the read call triggers a FILE_SYNC write if the page is dirty...and that
is why the read calls are taking so long. Seeing writes happening when the
app is waiting for a read is odd to say the least... (In my test, there is
nothing else running on the Virtual machines, so the only thing that could
be triggering the filesystem activity is the build test...)
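A hypothetical reproduction of that pattern, with the path, offsets and sizes
invented purely for illustration: dirty part of a page that has never been
read in, then read() from that same page. If the hypothesis above is right,
the pread() stalls behind a synchronous flush of the dirty page:

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char out[64], in[256];
        int fd = open("/mnt/nfs/output.o", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
                return 1;
        memset(out, 0xab, sizeof(out));

        /* Dirty a small slice in the middle of a page we never read in. */
        if (pwrite(fd, out, sizeof(out), 4096 + 512) < 0)
                return 1;

        /* Read back from the same page: it is dirty but not up to date,
         * so (per the hypothesis) the client flushes it before it can
         * satisfy the read. */
        if (pread(fd, in, sizeof(in), 4096) < 0)
                return 1;

        printf("read completed\n");
        close(fd);
        return 0;
}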
From: Chuck Lever <[email protected]>
To: Brian R Cowan/Cupertino/IBM@IBMUS
Cc: Trond Myklebust <[email protected]>, [email protected],
    [email protected], Peter Staubach <[email protected]>
Date: 05/29/2009 01:02 PM
Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by: [email protected]
On May 29, 2009, at 11:55 AM, Brian R Cowan wrote:
> Been working this issue with Red hat, and didn't need to go to the
> list...
> Well, now I do... You mention that "The main type of workload we're
> targetting with this patch is the app that opens a file, writes < 4k
> and
> then closes the file." Well, it appears that this issue also impacts
> flushing pages from filesystem caches.
>
> The reason this came up in my environment is that our product's build
> auditing gives the filesystem cache an interesting workout. When
> ClearCase audits a build, the build places data in a few places,
> including:
> 1) a build audit file that usually resides in /tmp. This build audit
> is
> essentially a log of EVERY file open/read/write/delete/rename/etc.
> that
> the programs called in the build script make in the clearcase "view"
> you're building in. As a result, this file can get pretty large.
> 2) The build outputs themselves, which in this case are being
> written to a
> remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is
> written to
> after the build script completes containing what is essentially a
> "Bill of
> materials" for the files created during builds in this "view."
>
> We believe that the build audit file access is causing build output
> to get
> flushed out of the filesystem cache. These flushes happen *in 4k
> chunks.*
> This trips over this change since the cache pages appear to get
> flushed on
> an individual basis.
So, are you saying that the application is flushing after every 4KB
write(2), or that the application has written a bunch of pages, and VM/
VFS on the client is doing the synchronous page flushes? If it's the
application doing this, then you really do not want to mitigate this
by defeating the STABLE writes -- the application must have some
requirement that the data is permanent.
Unless I have misunderstood something, the previous faster behavior
was due to cheating, and put your data at risk. I can't see how
replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would
cause such a significant performance impact.
> One note is that if the build outputs were going to a clearcase view
> stored on an enterprise-level NAS device, there isn't as much of an
> issue
> because many of these return from the stable write request as soon
> as the
> data goes into the battery-backed memory disk cache on the NAS.
> However,
> it really impacts writes to general-purpose OS's that follow Sun's
> lead in
> how they handle "stable" writes. The truly annoying part about this
> rather
> subtle change is that the NFS client is specifically ignoring the
> client
> mount options since we cannot force the "async" mount option to turn
> off
> this behavior.
You may have a misunderstanding about what exactly "async" does. The
"sync" / "async" mount options control only whether the application
waits for the data to be flushed to permanent storage. They have no
effect on any file system I know of _how_ specifically the data is
moved from the page cache to permanent storage.
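As a rough illustration of that distinction (a sketch only; the path is
invented and the numbers depend entirely on the mount and the server): on an
"async" mount the write() below should return as soon as the page cache has
been dirtied, and it is the fsync() that actually waits for the data to reach
the server's storage:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
        char buf[4096];
        double t0, t1, t2;
        int fd = open("/mnt/nfs/demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        memset(buf, 0, sizeof(buf));

        t0 = now();
        if (write(fd, buf, sizeof(buf)) < 0)    /* dirties the page cache */
                return 1;
        t1 = now();
        fsync(fd);                              /* here we wait for the server */
        t2 = now();

        printf("write: %.6fs  fsync: %.6fs\n", t1 - t0, t2 - t1);
        close(fd);
        return 0;
}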
On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> > You may have a misunderstanding about what exactly "async" does. The
> > "sync" / "async" mount options control only whether the application
> > waits for the data to be flushed to permanent storage. They have no
> > effect on any file system I know of _how_ specifically the data is
> > moved from the page cache to permanent storage.
>
> The problem is that the client change seems to cause the application to
> stop until this stable write completes... What is interesting is that it's
> not always a write operation that the linker gets stuck on. Our best
> hypothesis -- from correlating times in strace and tcpdump traces -- is
> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()*
> system calls on the output file (that is opened for read/write). We THINK
> the read call triggers a FILE_SYNC write if the page is dirty...and that
> is why the read calls are taking so long. Seeing writes happening when the
> app is waiting for a read is odd to say the least... (In my test, there is
> nothing else running on the Virtual machines, so the only thing that could
> be triggering the filesystem activity is the build test...)
Yes. If the page is dirty, but not up to date, then it needs to be
cleaned before you can overwrite the contents with the results of a
fresh read.
That means flushing the data to disk... Which again means doing either a
stable write or an unstable write+commit. The former is more efficient
than the latter, 'cos it accomplishes the exact same work in a single
RPC call.
Trond
On May 29, 2009, at 1:42 PM, Trond Myklebust wrote:
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing
> either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.
It might be prudent to flush the whole file when such a dirty page is
discovered to get the benefit of write coalescing.
Trond Myklebust wrote:
> Look... This happens when you _flush_ the file to stable storage if
> there is only a single write < wsize. It isn't the business of the NFS
> layer to decide when you flush the file; that's an application
> decision...
>
>
I think that one easy way to show why this optimization is
not quite what we would all like, why there being only a
single write _now_ isn't quite sufficient, is to write a
block of a file and then read it back. Things like
compilers and linkers might do this during their random
access to the file being created. I would guess that the
audit mechanism that Brian has referred to does the same
sort of thing.
ps
ps. Why do we flush dirty pages before they can be read?
I am not even clear why we care about waiting for an
already existing flush to be completed before using the
page to satisfy a read system call.
Trond Myklebust wrote:
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.
In the normal case, we aren't overwriting the contents with the
results of a fresh read. We are simply going to return the
current contents of the page. Given this, why is the normal
data cache consistency mechanism, based on the attribute cache,
not sufficient?
Thanx...
ps
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.
I suspect that the COMMIT RPC's are done somewhere other than in the flush
itself. If the "write + commit" operation was happening in that exact
manner, then the git change linked at the beginning of this thread *would
not have impacted client performance*. I can demonstrate -- at will --
that it does impact performance. So, there is something that keeps track
of the number of writes and issues the commits without slowing down the
application. This git change bypasses that and degrades the linker
performance.
On Fri, 2009-05-29 at 13:42 -0400, Trond Myklebust wrote:
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.
>
> Trond
In fact, I suspect your real gripe is rather with the logic that marks a
page as being up to date (i.e. whether or not it requires a READ call).
I suggest trying kernel 2.6.27 or newer, and seeing if the changes that
are in those kernels fix your problem.
Trond
On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote:
> I suspect that the COMMIT RPC's are done somewhere other than in the flush
> itself. If the "write + commit" operation was happening in that exact
> manner, then the git change linked at the beginning of this thread *would
> not have impacted client performance*. I can demonstrate -- at will --
> that it does impact performance. So, there is something that keeps track
> of the number of writes and issues the commits without slowing down the
> application. This git change bypasses that and degrades the linker
> performance.
If the server gives slower performance for a single stable write, vs.
the same unstable write + commit, then you are demonstrating that the
server is seriously _broken_.
The only other explanation is that the client, prior to that patch being
applied, was somehow failing to send out the COMMIT. If so, then the
client was broken, and the patch is a fix that results in correct
behaviour. That would mean that the rest of the client flush code is
probably still broken, but at least the nfs_wb_page() is now correct.
Those are the only 2 options.
Trond
On Fri, 2009-05-29 at 13:47 -0400, Chuck Lever wrote:
> On May 29, 2009, at 1:42 PM, Trond Myklebust wrote:
>
> > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> >>> You may have a misunderstanding about what exactly "async" does.
> >>> The
> >>> "sync" / "async" mount options control only whether the application
> >>> waits for the data to be flushed to permanent storage. They have no
> >>> effect on any file system I know of _how_ specifically the data is
> >>> moved from the page cache to permanent storage.
> >>
> >> The problem is that the client change seems to cause the
> >> application to
> >> stop until this stable write completes... What is interesting is
> >> that it's
> >> not always a write operation that the linker gets stuck on. Our best
> >> hypothesis -- from correlating times in strace and tcpdump traces
> >> -- is
> >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by
> >> *read()*
> >> system calls on the output file (that is opened for read/write). We
> >> THINK
> >> the read call triggers a FILE_SYNC write if the page is dirty...and
> >> that
> >> is why the read calls are taking so long. Seeing writes happening
> >> when the
> >> app is waiting for a read is odd to say the least... (In my test,
> >> there is
> >> nothing else running on the Virtual machines, so the only thing
> >> that could
> >> be triggering the filesystem activity is the build test...)
> >
> > Yes. If the page is dirty, but not up to date, then it needs to be
> > cleaned before you can overwrite the contents with the results of a
> > fresh read.
> > That means flushing the data to disk... Which again means doing
> > either a
> > stable write or an unstable write+commit. The former is more efficient
> > than the latter, 'cos it accomplishes the exact same work in a single
> > RPC call.
>
> It might be prudent to flush the whole file when such a dirty page is
> discovered to get the benefit of write coalescing.
There are very few workloads where that will help. You basically have to
be modifying the end of a page that has not previously been read in (so
is not already marked up to date) and then writing into the beginning of
the next page, which must also be not up to date.
Trond
There is a third option: that the COMMIT calls are not coming from the
same thread of execution as the write call. The symptoms would seem
to bear that out. As would the fact that the performance degradation
occurs both when the server is Linux itself and when it is Solaris (any
NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it
would be unusual if they were both broken in the same way. The Linux NFS
FAQ says:
-----------------------
* NFS Version 3 introduces the concept of "safe asynchronous writes." A
Version 3 client can specify that the server is allowed to reply before it
has saved the requested data to disk, permitting the server to gather
small NFS write operations into a single efficient disk write operation. A
Version 3 client can also specify that the data must be written to disk
before the server replies, just like a Version 2 write. The client
specifies the type of write by setting the stable_how field in the
arguments of each write operation to UNSTABLE to request a safe
asynchronous write, and FILE_SYNC for an NFS Version 2 style write.
Servers indicate whether the requested data is permanently stored by
setting a corresponding field in the response to each NFS write operation.
A server can respond to an UNSTABLE write request with an UNSTABLE reply
or a FILE_SYNC reply, depending on whether or not the requested data
resides on permanent storage yet. An NFS protocol-compliant server must
respond to a FILE_SYNC request only with a FILE_SYNC reply.
Clients ensure that data that was written using a safe asynchronous write
has been written onto permanent storage using a new operation available in
Version 3 called a COMMIT. Servers do not send a response to a COMMIT
operation until all data specified in the request has been written to
permanent storage. NFS Version 3 clients must protect buffered data that
has been written using a safe asynchronous write but not yet committed. If
a server reboots before a client has sent an appropriate COMMIT, the
server can reply to the eventual COMMIT request in a way that forces the
client to resend the original write operation. Version 3 clients use
COMMIT operations when flushing safe asynchronous writes to the server
during a close(2) or fsync(2) system call, or when encountering memory
pressure.
-----------------------
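For reference, the stable_how values the FAQ describes are carried in each
WRITE call and reply. In the Linux client they are defined roughly as
follows (a sketch; the exact header location varies by kernel version):

enum nfs3_stable_how {
	NFS_UNSTABLE  = 0,	/* server may reply before data is on disk; COMMIT later */
	NFS_DATA_SYNC = 1,	/* data must be on stable storage before the reply */
	NFS_FILE_SYNC = 2,	/* data and metadata must be on stable storage */
};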
Now, what happens in the client when the server comes back with the
UNSTABLE reply?
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <[email protected]>, [email protected],
[email protected], Peter Staubach <[email protected]>
Date:
05/29/2009 02:07 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote:
> > Yes. If the page is dirty, but not up to date, then it needs to be
> > cleaned before you can overwrite the contents with the results of a
> > fresh read.
> > That means flushing the data to disk... Which again means doing either a
> > stable write or an unstable write+commit. The former is more efficient
> > than the latter, 'cos it accomplishes the exact same work in a single
> > RPC call.
>
> I suspect that the COMMIT RPCs are done somewhere other than in the flush
> itself. If the "write + commit" operation was happening in that exact
> manner, then the change in the git at the beginning of this thread *would
> not have impacted client performance*. I can demonstrate -- at will --
> that it does impact performance. So, there is something that keeps track
> of the number of writes and issues the commits without slowing down the
> application. This git change bypasses that and degrades the linker
> performance.
If the server gives slower performance for a single stable write, vs.
the same unstable write + commit, then you are demonstrating that the
server is seriously _broken_.
The only other explanation is if the client prior to that patch being
applied was somehow failing to send out the COMMIT. If so, then the
client was broken, and the patch is a fix that results in correct
behaviour. That would mean that the rest of the client flush code is
probably still broken, but at least the nfs_wb_page() is now correct.
Those are the only 2 options.
Trond
On Fri, 2009-05-29 at 13:48 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
> > Look... This happens when you _flush_ the file to stable storage if
> > there is only a single write < wsize. It isn't the business of the NFS
> > layer to decide when you flush the file; that's an application
> > decision...
> >
> >
>
> I think that one easy way to show why this optimization is
> not quite what we would all like, and why there only being a
> single write _now_ isn't quite sufficient, is to write a
> block of a file and then read it back. Things like
> compilers and linkers might do this during their random
> access to the file being created. I would guess that this
> audit thing that Brian has referred to does the same sort
> of thing.
>
> ps
>
> ps. Why do we flush dirty pages before they can be read?
> I am not even clear why we care about waiting for an
> already existing flush to be completed before using the
> page to satisfy a read system call.
We only do this if the page cannot be marked as up to date. i.e. there
have to be parts of the page which contain valid data on the server, and
that our client hasn't read in yet, and that aren't being overwritten by
our write.
Trond
Peter, this is my point. The application/client-side end result is that
we're making a read wait for a write. We already have the data we need in
the cache, since the application is what put it in there to begin with.
I think this is a classic "unintended consequence" that is being observed
on SuSE 10, Red Hat 5, and I'm sure others.
But since people using my product have only just started moving to Red Hat
5, we're seeing more of these... There aren't too many people who build
across NFS, not when local storage is relatively cheap, and much faster.
But there are companies that do this so the build results are available
even if the build host has been turned off, gone to standby/hibernate, or
is even a virtual machine that no longer exists. The biggest problem here
is that the unavoidable extra filesystem cache load that build auditing
creates appears to trigger the flushing. For whatever reason, those
flushes happen in such a way as to trigger the STABLE writes instead of the
faster UNSTABLE ones.
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Peter Staubach <[email protected]>
To:
Trond Myklebust <[email protected]>
Cc:
Brian R Cowan/Cupertino/IBM@IBMUS, Chuck Lever <[email protected]>,
[email protected], [email protected]
Date:
05/29/2009 01:51 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Trond Myklebust wrote:
> On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
>
>>> You may have a misunderstanding about what exactly "async" does. The
>>> "sync" / "async" mount options control only whether the application
>>> waits for the data to be flushed to permanent storage. They have no
>>> effect on any file system I know of _how_ specifically the data is
>>> moved from the page cache to permanent storage.
>>>
>> The problem is that the client change seems to cause the application to
>> stop until this stable write completes... What is interesting is that it's
>> not always a write operation that the linker gets stuck on. Our best
>> hypothesis -- from correlating times in strace and tcpdump traces -- is
>> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()*
>> system calls on the output file (that is opened for read/write). We THINK
>> the read call triggers a FILE_SYNC write if the page is dirty...and that
>> is why the read calls are taking so long. Seeing writes happening when the
>> app is waiting for a read is odd to say the least... (In my test, there is
>> nothing else running on the Virtual machines, so the only thing that could
>> be triggering the filesystem activity is the build test...)
>>
>
> Yes. If the page is dirty, but not up to date, then it needs to be
> cleaned before you can overwrite the contents with the results of a
> fresh read.
> That means flushing the data to disk... Which again means doing either a
> stable write or an unstable write+commit. The former is more efficient
> than the latter, 'cos it accomplishes the exact same work in a single
> RPC call.
In the normal case, we aren't overwriting the contents with the
results of a fresh read. We are going to simply return the
current contents of the page. Given this, then why is the normal
data cache consistency mechanism, based on the attribute cache,
not sufficient?
Thanx...
ps
On Fri, 2009-05-29 at 14:18 -0400, Brian R Cowan wrote:
> There is a third option: that the COMMIT calls are not coming from the
> same thread of execution as the write call. The symptoms would seem
> to bear that out. As would the fact that the performance degradation
> occurs both when the server is Linux itself and when it is Solaris (any
> NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it
> would be unusual if they were both broken in the same way. The Linux NFS
> FAQ says:
>
> -----------------------
> * NFS Version 3 introduces the concept of "safe asynchronous writes." A
> Version 3 client can specify that the server is allowed to reply before it
> has saved the requested data to disk, permitting the server to gather
> small NFS write operations into a single efficient disk write operation. A
> Version 3 client can also specify that the data must be written to disk
> before the server replies, just like a Version 2 write. The client
> specifies the type of write by setting the stable_how field in the
> arguments of each write operation to UNSTABLE to request a safe
> asynchronous write, and FILE_SYNC for an NFS Version 2 style write.
>
> Servers indicate whether the requested data is permanently stored by
> setting a corresponding field in the response to each NFS write operation.
> A server can respond to an UNSTABLE write request with an UNSTABLE reply
> or a FILE_SYNC reply, depending on whether or not the requested data
> resides on permanent storage yet. An NFS protocol-compliant server must
> respond to a FILE_SYNC request only with a FILE_SYNC reply.
>
> Clients ensure that data that was written using a safe asynchronous write
> has been written onto permanent storage using a new operation available in
> Version 3 called a COMMIT. Servers do not send a response to a COMMIT
> operation until all data specified in the request has been written to
> permanent storage. NFS Version 3 clients must protect buffered data that
> has been written using a safe asynchronous write but not yet committed. If
> a server reboots before a client has sent an appropriate COMMIT, the
> server can reply to the eventual COMMIT request in a way that forces the
> client to resend the original write operation. Version 3 clients use
> COMMIT operations when flushing safe asynchronous writes to the server
> during a close(2) or fsync(2) system call, or when encountering memory
> pressure.
> -----------------------
>
> Now, what happens in the client when the server comes back with the
> UNSTABLE reply?
The server cannot reply with an UNSTABLE reply to a stable write
request. See above.
As for your assertion that the COMMIT comes from some other thread of
execution. I don't see how that can change anything. Some thread,
somewhere has to wait for that COMMIT to complete. If it isn't your
application, then the same burden falls on another application or the
pdflush thread. While that may feel more interactive to you, it still
means that you are making the server + some local process do more work
(extra RPC round trip) for no good reason.
Trond
> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to [email protected] to be sure your PMR is updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <[email protected]>
> To:
> Brian R Cowan/Cupertino/IBM@IBMUS
> Cc:
> Chuck Lever <[email protected]>, [email protected],
> [email protected], Peter Staubach <[email protected]>
> Date:
> 05/29/2009 02:07 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
>
>
>
> On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote:
> > > Yes. If the page is dirty, but not up to date, then it needs to be
> > > cleaned before you can overwrite the contents with the results of a
> > > fresh read.
> > > That means flushing the data to disk... Which again means doing either a
> > > stable write or an unstable write+commit. The former is more efficient
> > > than the latter, 'cos it accomplishes the exact same work in a single
> > > RPC call.
> >
> > I suspect that the COMMIT RPCs are done somewhere other than in the flush
> > itself. If the "write + commit" operation was happening in that exact
> > manner, then the change in the git at the beginning of this thread *would
> > not have impacted client performance*. I can demonstrate -- at will --
> > that it does impact performance. So, there is something that keeps track
> > of the number of writes and issues the commits without slowing down the
> > application. This git change bypasses that and degrades the linker
> > performance.
>
> If the server gives slower performance for a single stable write, vs.
> the same unstable write + commit, then you are demonstrating that the
> server is seriously _broken_.
>
> The only other explanation, is if the client prior to that patch being
> applied was somehow failing to send out the COMMIT. If so, then the
> client was broken, and the patch is a fix that results in correct
> behaviour. That would mean that the rest of the client flush code is
> probably still broken, but at least the nfs_wb_page() is now correct.
>
> Those are the only 2 options.
>
> Trond
>
>
>
On Fri, 2009-05-29 at 13:51 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
> > On Fri, 2009-05-29 at 13:38 -0400, Brian R Cowan wrote:
> >
> >>> You may have a misunderstanding about what exactly "async" does. The
> >>> "sync" / "async" mount options control only whether the application
> >>> waits for the data to be flushed to permanent storage. They have no
> >>> effect on any file system I know of _how_ specifically the data is
> >>> moved from the page cache to permanent storage.
> >>>
> >> The problem is that the client change seems to cause the application to
> >> stop until this stable write completes... What is interesting is that it's
> >> not always a write operation that the linker gets stuck on. Our best
> >> hypothesis -- from correlating times in strace and tcpdump traces -- is
> >> that the FILE_SYNC'ed write NFS RPCs are in fact triggered by *read()*
> >> system calls on the output file (that is opened for read/write). We THINK
> >> the read call triggers a FILE_SYNC write if the page is dirty...and that
> >> is why the read calls are taking so long. Seeing writes happening when the
> >> app is waiting for a read is odd to say the least... (In my test, there is
> >> nothing else running on the Virtual machines, so the only thing that could
> >> be triggering the filesystem activity is the build test...)
> >>
> >
> > Yes. If the page is dirty, but not up to date, then it needs to be
> > cleaned before you can overwrite the contents with the results of a
> > fresh read.
> > That means flushing the data to disk... Which again means doing either a
> > stable write or an unstable write+commit. The former is more efficient
> > than the latter, 'cos it accomplishes the exact same work in a single
> > RPC call.
>
> In the normal case, we aren't overwriting the contents with the
> results of a fresh read. We are going to simply return the
> current contents of the page. Given this, then why is the normal
> data cache consistency mechanism, based on the attribute cache,
> not sufficient?
It is. You would need to look into why the page was not marked with the
PG_uptodate flag when it was being filled. We generally do try to do
that whenever possible.
Trond
I think you missed the context of my comment... Prior to this
4-year-old update, the writes were not sent as STABLE; this update
forced that behavior. So, before then we sent an UNSTABLE write request.
This would either give us back the UNSTABLE or FILE_SYNC response. My
question is this: When the server sends back UNSTABLE, as a response to
UNSTABLE, exactly what happens? By some chance is there a separate worker
thread that occasionally sends COMMITs back to the server?
The performance data we have would seem to bear that out. When we backed
out the force of STABLE writes, the link times came back down and the reads
stopped waiting on the cache flushes. If, as you say, this change had no
impact on how the client actually performed these flushes, backing out the
change would not have made links roughly 4x faster on Red Hat 5. All we did
in our test was back out that change...
I'm willing to discuss this issue in a conference call. I can send the
bridge information to those who are interested, as well as the other
people here in IBM I've been working with... At least one of them is a
regular contributor -- Frank Filz...
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Chuck Lever <[email protected]>, [email protected],
[email protected], Peter Staubach <[email protected]>
Date:
05/29/2009 02:31 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
On Fri, 2009-05-29 at 14:18 -0400, Brian R Cowan wrote:
> There is a third option: that the COMMIT calls are not coming from the
> same thread of execution as the write call. The symptoms would seem
> to bear that out. As would the fact that the performance degradation
> occurs both when the server is Linux itself and when it is Solaris (any
> NFSv3-supporting version). I'm not saying that Solaris is bug-free, but it
> would be unusual if they were both broken in the same way. The Linux NFS
> FAQ says:
>
> -----------------------
> * NFS Version 3 introduces the concept of "safe asynchronous writes." A
> Version 3 client can specify that the server is allowed to reply before it
> has saved the requested data to disk, permitting the server to gather
> small NFS write operations into a single efficient disk write operation. A
> Version 3 client can also specify that the data must be written to disk
> before the server replies, just like a Version 2 write. The client
> specifies the type of write by setting the stable_how field in the
> arguments of each write operation to UNSTABLE to request a safe
> asynchronous write, and FILE_SYNC for an NFS Version 2 style write.
>
> Servers indicate whether the requested data is permanently stored by
> setting a corresponding field in the response to each NFS write operation.
> A server can respond to an UNSTABLE write request with an UNSTABLE reply
> or a FILE_SYNC reply, depending on whether or not the requested data
> resides on permanent storage yet. An NFS protocol-compliant server must
> respond to a FILE_SYNC request only with a FILE_SYNC reply.
>
> Clients ensure that data that was written using a safe asynchronous write
> has been written onto permanent storage using a new operation available in
> Version 3 called a COMMIT. Servers do not send a response to a COMMIT
> operation until all data specified in the request has been written to
> permanent storage. NFS Version 3 clients must protect buffered data that
> has been written using a safe asynchronous write but not yet committed. If
> a server reboots before a client has sent an appropriate COMMIT, the
> server can reply to the eventual COMMIT request in a way that forces the
> client to resend the original write operation. Version 3 clients use
> COMMIT operations when flushing safe asynchronous writes to the server
> during a close(2) or fsync(2) system call, or when encountering memory
> pressure.
> -----------------------
>
> Now, what happens in the client when the server comes back with the
> UNSTABLE reply?
The server cannot reply with an UNSTABLE reply to a stable write
request. See above.
As for your assertion that the COMMIT comes from some other thread of
execution. I don't see how that can change anything. Some thread,
somewhere has to wait for that COMMIT to complete. If it isn't your
application, then the same burden falls on another application or the
pdflush thread. While that may feel more interactive to you, it still
means that you are making the server + some local process do more work
(extra RPC round trip) for no good reason.
Trond
> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to [email protected] to be sure your PMR is updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <[email protected]>
> To:
> Brian R Cowan/Cupertino/IBM@IBMUS
> Cc:
> Chuck Lever <[email protected]>, [email protected],
> [email protected], Peter Staubach <[email protected]>
> Date:
> 05/29/2009 02:07 PM
> Subject:
> Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
>
>
>
> On Fri, 2009-05-29 at 13:55 -0400, Brian R Cowan wrote:
> > > Yes. If the page is dirty, but not up to date, then it needs to be
> > > cleaned before you can overwrite the contents with the results of a
> > > fresh read.
> > > That means flushing the data to disk... Which again means doing either a
> > > stable write or an unstable write+commit. The former is more efficient
> > > than the latter, 'cos it accomplishes the exact same work in a single
> > > RPC call.
> >
> > I suspect that the COMMIT RPCs are done somewhere other than in the flush
> > itself. If the "write + commit" operation was happening in that exact
> > manner, then the change in the git at the beginning of this thread *would
> > not have impacted client performance*. I can demonstrate -- at will --
> > that it does impact performance. So, there is something that keeps track
> > of the number of writes and issues the commits without slowing down the
> > application. This git change bypasses that and degrades the linker
> > performance.
>
> If the server gives slower performance for a single stable write, vs.
> the same unstable write + commit, then you are demonstrating that the
> server is seriously _broken_.
>
> The only other explanation, is if the client prior to that patch being
> applied was somehow failing to send out the COMMIT. If so, then the
> client was broken, and the patch is a fix that results in correct
> behaviour. That would mean that the rest of the client flush code is
> probably still broken, but at least the nfs_wb_page() is now correct.
>
> Those are the only 2 options.
>
> Trond
>
>
>
Actually wdelay is the export default, and I recall the man page saying
something along the lines of doing this to allow the server to coalesce
writes. Somewhere else (I think in another part of this thread) it's
mentioned that the server will sit for up to 10ms waiting for other writes
to this export. The reality is that wdelay+FILE_SYNC = up to a 10ms delay
waiting for the write RPC to come back. That being said, I would rather
leave this alone so that we don't accidentally impact something else.
After all, the no_wdelay export option will work around it nicely in an
all-Linux environment, and file pages don't flush with FILE_SYNC on
2.6.29.
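For anyone who wants to try that no_wdelay workaround, it is just an extra
option on the export line in /etc/exports on the Linux server; the path and
the other options here are only an example:

/export/views    *(rw,async,no_wdelay)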
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Steve Dickson <[email protected]>
To:
Neil Brown <[email protected]>, Greg Banks <[email protected]>
Cc:
Brian R Cowan/Cupertino/IBM@IBMUS, [email protected]
Date:
06/05/2009 07:38 AM
Subject:
Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS
I/O performance degraded by FLUSH_STABLE page flushing
Brian R Cowan wrote:
> Trond Myklebust <[email protected]> wrote on 06/04/2009 02:04:58
> PM:
>
>> Did you try turning off write gathering on the server (i.e. add the
>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>
> Just tried it, this seems to be a very useful workaround as well. The
> FILE_SYNC write calls come back in about the same amount of time as the
> write+commit pairs... Speeds up building regardless of the network
> filesystem (ClearCase MVFS or straight NFS).
Does anybody have the history as to why 'no_wdelay' is an
export default? As Brian mentioned later in this thread
it only helps Linux servers, but that's a good thing, IMHO. ;-)
So I would have no problem changing the default export
options in nfs-utils, but it would be nice to know why
it was there in the first place...
Neil, Greg??
steved.
On Fri, 2009-06-05 at 15:54 -0400, J. Bruce Fields wrote:
> On Fri, Jun 05, 2009 at 12:12:08PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote:
> > > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > > > NFSD stops calling ->fsync without a file struct.
> > > > >
> > > > > I think the open file cache will help us with that, if we can extend
> > > > > it to also cache open file structs for directories.
> > > >
> > > > Krishna Kumar--do you think that'd be a reasonable thing to do?
> > >
> > > Btw, do you have at least the basic open files cache queue for 2.6.31?
> > >
> >
> > Now that _will_ badly screw up the write gathering heuristic...
>
> How?
>
The heuristic looks at inode->i_writecount in order to figure out how
many nfsd threads are currently trying to write to the file. The
reference to i_writecount is held by the struct file.
The problem is that if you start sharing struct file among several nfsd
threads by means of a cache, then the i_writecount will not change, and
so the heuristic fails.
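For context, the heuristic in question is the check in nfsd_vfs_write();
paraphrasing the code that appears in the patch later in this thread, it
looks roughly like:

	if (atomic_read(&inode->i_writecount) > 1
	    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
		/* another writer looks active: defer the sync briefly in
		 * the hope of gathering more writes before going to disk */
		...
	}

so a shared struct file that no longer bumps i_writecount per writer would
defeat the first half of that test.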
While we won't miss it much in NFSv3 and v4, it may change the
performance of the few systems out there that still believe NFSv2 is the
best thing since sliced bread...
Trond
--- linux-2.6.30.i686/fs/nfs/file.c.org
+++ linux-2.6.30.i686/fs/nfs/file.c
@@ -337,15 +337,15 @@ static int nfs_write_begin(struct file *
struct page **pagep, void **fsdata)
{
int ret;
- pgoff_t index;
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
struct page *page;
- index = pos >> PAGE_CACHE_SHIFT;
dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
file->f_path.dentry->d_parent->d_name.name,
file->f_path.dentry->d_name.name,
mapping->host->i_ino, len, (long long) pos);
+start:
/*
* Prevent starvation issues if someone is doing a consistency
* sync-to-disk
@@ -364,6 +364,12 @@ static int nfs_write_begin(struct file *
if (ret) {
unlock_page(page);
page_cache_release(page);
+ } else if ((file->f_mode & FMODE_READ) && !PageUptodate(page) &&
+ ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE)) {
+ ret = nfs_readpage(file, page);
+ page_cache_release(page);
+ if (!ret)
+ goto start;
}
return ret;
}
On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
> Trond Myklebust ([email protected]) wrote on 2 June 2009 13:27:
> >Write gathering relies on waiting an arbitrary length of time in order
> >to see if someone is going to send another write. The protocol offers no
> >guidance as to how long that wait should be, and so (at least on the
> >Linux server) we've coded in a hard wait of 10ms if and only if we see
> >that something else has the file open for writing.
> >One problem with the Linux implementation is that the "something else"
> >could be another nfs server thread that happens to be in nfsd_write(),
> >however it could also be another open NFSv4 stateid, or a NLM lock, or a
> >local process that has the file open for writing.
> >Another problem is that the nfs server keeps a record of the last file
> >that was accessed, and also waits if it sees you are writing again to
> >that same file. Of course it has no idea if this is truly a parallel
> >write, or if it just happens that you are writing again to the same file
> >using O_SYNC...
>
> I think the decision to write or wait doesn't belong to the nfs
> server; it should just send the writes immediately. It's up to the
> fs/block/device layers to do the gathering. I understand that the
> client should try to do the gathering before sending the request to
> the wire
This isn't something that we've just pulled out of a hat. It dates back
to pre-NFSv3 times, when every write had to be synchronously committed
to disk before the RPC call could return.
See, for instance,
http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What
+is+nfs+write
+gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3
The point is that while it is a good idea for NFSv2, we have much better
methods of dealing with multiple writes in NFSv3 and v4...
Trond
On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
>
> Tom Talpey wrote:
> > On 6/5/2009 7:35 AM, Steve Dickson wrote:
> >> Brian R Cowan wrote:
> >>> Trond Myklebust<[email protected]> wrote on 06/04/2009
> >>> 02:04:58
> >>> PM:
> >>>
> >>>> Did you try turning off write gathering on the server (i.e. add the
> >>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> >>> Just tried it, this seems to be a very useful workaround as well. The
> >>> FILE_SYNC write calls come back in about the same amount of time as the
> >>> write+commit pairs... Speeds up building regardless of the network
> >>> filesystem (ClearCase MVFS or straight NFS).
> >>
> >> Does anybody have the history as to why 'no_wdelay' is an
> >> export default?
> >
> > Because "wdelay" is a complete crock?
> >
> > Adding 10ms to every write RPC only helps if there's a steady
> > single-file stream arriving at the server. In most other workloads
> > it only slows things down.
> >
> > The better solution is to continue tuning the clients to issue
> > writes in a more sequential and less all-or-nothing fashion.
> > There are plenty of other less crock-ful things to do in the
> > server, too.
> Ok... So do you think removing it as a default would cause
> any regressions?
It might for NFSv2 clients, since they don't have the option of using
unstable writes. I'd therefore prefer a kernel solution that makes write
gathering an NFSv2 only feature.
Cheers
Trond
On 6/5/2009 7:35 AM, Steve Dickson wrote:
> Brian R Cowan wrote:
>> Trond Myklebust<[email protected]> wrote on 06/04/2009 02:04:58
>> PM:
>>
>>> Did you try turning off write gathering on the server (i.e. add the
>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>> Just tried it, this seems to be a very useful workaround as well. The
>> FILE_SYNC write calls come back in about the same amount of time as the
>> write+commit pairs... Speeds up building regardless of the network
>> filesystem (ClearCase MVFS or straight NFS).
>
> Does anybody have the history as to why 'no_wdelay' is an
> export default?
Because "wdelay" is a complete crock?
Adding 10ms to every write RPC only helps if there's a steady
single-file stream arriving at the server. In most other workloads
it only slows things down.
The better solution is to continue tuning the clients to issue
writes in a more sequential and less all-or-nothing fashion.
There are plenty of other less crock-ful things to do in the
server, too.
Tom.
> As Brian mentioned later in this thread
> it only helps Linux servers, but that's a good thing, IMHO. ;-)
>
> So I would have no problem changing the default export
> options in nfs-utils, but it would be nice to know why
> it was there in the first place...
>
> Neil, Greg??
>
> steved.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Trond Myklebust wrote:
> On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
>
>> Trond Myklebust ([email protected]) wrote on 2 June 2009 13:27:
>> >Write gathering relies on waiting an arbitrary length of time in order
>> >to see if someone is going to send another write. The protocol offers no
>> >guidance as to how long that wait should be, and so (at least on the
>> >Linux server) we've coded in a hard wait of 10ms if and only if we see
>> >that something else has the file open for writing.
>> >One problem with the Linux implementation is that the "something else"
>> >could be another nfs server thread that happens to be in nfsd_write(),
>> >however it could also be another open NFSv4 stateid, or a NLM lock, or a
>> >local process that has the file open for writing.
>> >Another problem is that the nfs server keeps a record of the last file
>> >that was accessed, and also waits if it sees you are writing again to
>> >that same file. Of course it has no idea if this is truly a parallel
>> >write, or if it just happens that you are writing again to the same file
>> >using O_SYNC...
>>
>> I think the decision to write or wait doesn't belong to the nfs
>> server; it should just send the writes immediately. It's up to the
>> fs/block/device layers to do the gathering. I understand that the
>> client should try to do the gathering before sending the request to
>> the wire
>>
Just to be clear, the linux NFS server does not gather the writes.
Writes are passed immediately to the fs. nfsd simply waits 10ms before
sync'ing the writes to disk. This allows the underlying file system
time to do the gathering and sync data in larger chunks. Of course,
this is only for stable writes, and only when wdelay is enabled for the export.
Dean
>
> This isn't something that we've just pulled out of a hat. It dates back
> to pre-NFSv3 times, when every write had to be synchronously committed
> to disk before the RPC call could return.
>
> See, for instance,
>
> http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What
> +is+nfs+write
> +gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3
>
> The point is that while it is a good idea for NFSv2, we have much better
> methods of dealing with multiple writes in NFSv3 and v4...
>
> Trond
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Fri, 2009-06-05 at 07:35 -0400, Steve Dickson wrote:
> Brian R Cowan wrote:
> > Trond Myklebust <[email protected]> wrote on 06/04/2009 02:04:58
> > PM:
> >
> >> Did you try turning off write gathering on the server (i.e. add the
> >> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> >
> > Just tried it, this seems to be a very useful workaround as well. The
> > FILE_SYNC write calls come back in about the same amount of time as the
> > write+commit pairs... Speeds up building regardless of the network
> > filesystem (ClearCase MVFS or straight NFS).
>
> Does anybody have the history as to why 'no_wdelay' is an
> export default? As Brian mentioned later in this thread
> it only helps Linux servers, but that's a good thing, IMHO. ;-)
>
> So I would have no problem changing the default export
> options in nfs-utils, but it would be nice to know why
> it was there in the first place...
It dates back to the days when most Linux clients in use in the field
were NFSv2 only. After all, it has only been 15 years...
Trond
Well, that's a good reason to get rid of those Solaris servers. :-)
Seriously, though, we do _not_ fix server bugs by changing the client.
If we had been doing something that was incorrect, or not recommended
by the NFS spec, then matters would be different...
Trond
On Jun 4, 2009, at 17:30, Brian R Cowan <[email protected]> wrote:
> I'll have to see if/how this impacts the flush behavior. I don't THINK we
> are doing getattrs in the middle of the link, but the trace information
> kind of went astray when the VMs got reverted to the base OS.
>
> Also, your recommended workaround of setting no_wdelay only works if the
> NFS server is Linux; the option isn't available on Solaris or HP-UX. This
> limits its usefulness in heterogeneous environments. Solaris 10 doesn't
> support async NFS exports, and we've already discussed how the small-write
> optimization overrides write behavior on async mounts.
>
> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to [email protected] to be sure your PMR is
> updated in
> case I am not available.
>
>
>
> From:
> Trond Myklebust <[email protected]>
> To:
> Brian R Cowan/Cupertino/IBM@IBMUS
> Cc:
> Carlos Carvalho <[email protected]>, [email protected],
> [email protected]
> Date:
> 06/04/2009 04:57 PM
> Subject:
> Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write
> NFS
> I/O performance degraded by FLUSH_STABLE page flushing
>
>
>
> On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote:
>> What I'm trying to understand is why RHEL 4 is not flushing anywhere near
>> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
>> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
>> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
>> it still flushes a lot more than RHEL 4 does.
>
> Most of that increase is probably mainly due to the changes to the way
> stat() works. More precisely, it would be due to this patch:
>
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a
>
>
> which went into Linux 2.6.16 in order to fix a posix compatibility
> issue.
>
> Trond
>
>
>
Personally, I would leave the default export options alone, simply because
they more or less match the defaults for the other NFS servers.
Also, there may be negative impacts of changing the default export option
to no_wdelay on really busy servers. One possible result is that more CPU
time gets spent waiting on writes to disk.
I'm a bit paranoid when it comes to tuning *server* settings, since they
impact all clients all at once, where client tuning generally only impacts
the one client.
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Steve Dickson <[email protected]>
Cc:
Neil Brown <[email protected]>, Greg Banks <[email protected]>, Brian R
Cowan/Cupertino/IBM@IBMUS, [email protected]
Date:
06/05/2009 08:48 AM
Subject:
Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS
I/O performance degraded by FLUSH_STABLE page flushing
On Fri, 2009-06-05 at 07:35 -0400, Steve Dickson wrote:
> Brian R Cowan wrote:
> > Trond Myklebust <[email protected]> wrote on 06/04/2009 02:04:58
> > PM:
> >
> >> Did you try turning off write gathering on the server (i.e. add the
> >> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> >
> > Just tried it, this seems to be a very useful workaround as well. The
> > FILE_SYNC write calls come back in about the same amount of time as the
> > write+commit pairs... Speeds up building regardless of the network
> > filesystem (ClearCase MVFS or straight NFS).
>
> Does anybody have the history as to why 'no_wdelay' is an
> export default? As Brian mentioned later in this thread
> it only helps Linux servers, but that's a good thing, IMHO. ;-)
>
> So I would have no problem changing the default export
> options in nfs-utils, but it would be nice to know why
> it was there in the first place...
It dates back to the days when most Linux clients in use in the field
were NFSv2 only. After all, it has only been 15 years...
Trond
Peter Staubach <[email protected]> wrote on 06/04/2009 05:07:29 PM:
> > What I'm trying to understand is why RHEL 4 is not flushing anywhere near
> > as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
> > erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
> > seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
> > it still flushes a lot more than RHEL 4 does.
> >
> >
>
> I think that you are making a lot of assumptions here, that
> are not necessarily backed by the evidence. The base cause
> here seems more likely to me to be the setting of PG_uptodate
> being different on the different releases, ie. RHEL-4, RHEL-5,
> and 2.6.29. All of these kernels contain the support to
> write out pages which are not marked as PG_uptodate.
>
> ps
I'm trying to find out why the paging/flushing is happening. It's
incredibly trivial to reproduce, just link something large over NFS. RHEL4
writes to the smbd file about 150x, RHEL 5 writes to it > 500x, and 2.6.29
writes about 340x. I have network traces showing that. I'm now trying to
understand why... So we can determine if there is anything that can be done
about it...
Trond's note about a getattr change that went into 2.6.16 may be important
since we have also seen this slowdown on SuSE 10, which is based on 2.6.16
kernels. I'm just a little unsure of why the gcc linker would be calling
getattr... Time to collect more straces, I guess, and then to see what
happens under the covers... (Be just my luck if the seek eventually causes
nfs_getattr to be called, though it would certainly explain the behavior.)
On Wed, 2009-06-24 at 15:54 -0400, Peter Staubach wrote:
> Hi.
>
> I have a proposal for possibly resolving this issue.
>
> I believe that this situation occurs due to the way that the
> Linux NFS client handles writes which modify partial pages.
>
> The Linux NFS client handles partial page modifications by
> allocating a page from the page cache, copying the data from
> the user level into the page, and then keeping track of the
> offset and length of the modified portions of the page. The
> page is not marked as up to date because there are portions
> of the page which do not contain valid file contents.
>
> When a read call comes in for a portion of the page, the
> contents of the page must be read in from the server.
> However, since the page may already contain some modified
> data, that modified data must be written to the server
> before the file contents can be read back in from the server.
> And, since the writing and reading can not be done atomically,
> the data must be written and committed to stable storage on
> the server for safety purposes. This means either a
> FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT.
> This has been discussed at length previously.
>
> This algorithm could be described as modify-write-read. It
> is most efficient when the application only updates pages
> and does not read them.
>
> My proposed solution is to add a heuristic to decide whether
> to do this modify-write-read algorithm or switch to a read-
> modify-write algorithm when initially allocating the page
> in the write system call path. The heuristic uses the modes
> that the file was opened with, the offset in the page to
> read from, and the size of the region to read.
>
> If the file was opened for reading in addition to writing
> and the page would not be filled completely with data from
> the user level, then read in the old contents of the page
> and mark it as Uptodate before copying in the new data. If
> the page would be completely filled with data from the user
> level, then there would be no reason to read in the old
> contents because they would just be copied over.
>
> This would optimize for applications which randomly access
> and update portions of files. The linkage editor for the
> C compiler is an example of such a thing.
>
> I tested the attached patch by using rpmbuild to build the
> current Fedora rawhide kernel. The kernel without the
> patch generated about 153,000 READ requests and 265,500
> WRITE requests. The modified kernel containing the patch
> generated about 156,000 READ requests and 257,000 WRITE
> requests. Thus, about 3,000 more READ requests were
> generated, but about 8,500 fewer WRITE requests were
> generated. I suspect that many of the WRITE requests which
> were eliminated were probably FILE_SYNC requests to WRITE
> a single page, but I didn't test this theory.
>
> Thanx...
>
> ps
>
> Signed-off-by: Peter Staubach <[email protected]>
> plain text document attachment (read-modify-write.devel)
> --- linux-2.6.30.i686/fs/nfs/file.c.org
> +++ linux-2.6.30.i686/fs/nfs/file.c
> @@ -337,15 +337,15 @@ static int nfs_write_begin(struct file *
> struct page **pagep, void **fsdata)
> {
> int ret;
> - pgoff_t index;
> + pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> struct page *page;
> - index = pos >> PAGE_CACHE_SHIFT;
>
> dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
> file->f_path.dentry->d_parent->d_name.name,
> file->f_path.dentry->d_name.name,
> mapping->host->i_ino, len, (long long) pos);
>
> +start:
> /*
> * Prevent starvation issues if someone is doing a consistency
> * sync-to-disk
> @@ -364,6 +364,12 @@ static int nfs_write_begin(struct file *
> if (ret) {
> unlock_page(page);
> page_cache_release(page);
> + } else if ((file->f_mode & FMODE_READ) && !PageUptodate(page) &&
> + ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE)) {
It might also be nice to put the above test in a little inlined helper
function (called nfs_want_read_modify_write() ?).
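Purely as an illustration of that suggestion -- the name is Trond's, and the
body is just the test from the patch above wrapped up, not tested code:

static inline int nfs_want_read_modify_write(struct file *file,
					     struct page *page,
					     loff_t pos, unsigned int len)
{
	/* Read the old contents only if the caller can read the file,
	 * the page is not already up to date, and the write would leave
	 * part of the page unfilled. */
	return (file->f_mode & FMODE_READ) && !PageUptodate(page) &&
	       ((pos & (PAGE_CACHE_SIZE - 1)) || len != PAGE_CACHE_SIZE);
}

The "} else if (...)" branch in nfs_write_begin() would then become
"} else if (nfs_want_read_modify_write(file, page, pos, len)) {".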
So, a number of questions spring to mind:
1. What if we're extending the file? We might not need to read the
page at all in that case (see nfs_write_end()).
2. What if the page is already dirty or is carrying an uncommitted
unstable write?
3. We might want to try to avoid looping more than once here. If
the kernel is very low on memory, we might just want to write
out the data rather than read the page and risk having the VM
eject it before we can dirty it.
4. Should we be starting an async readahead on the next page?
Single page sized reads can be a nuisance too, if you are
writing huge amounts of data.
> + ret = nfs_readpage(file, page);
> + page_cache_release(page);
> + if (!ret)
> + goto start;
> }
> return ret;
> }
Cheers
Trond
I'll have to see if/how this impacts the flush behavior. I don't THINK we
are doing getattrs in the middle of the link, but the trace information
kind of went astray when the VMs got reverted to the base OS.
Also, your recommended workaround of setting no_wdelay only works if the
NFS server is Linux; the option isn't available on Solaris or HP-UX. This
limits its usefulness in heterogeneous environments. Solaris 10 doesn't
support async NFS exports, and we've already discussed how the small-write
optimization overrides write behavior on async mounts.
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Brian R Cowan/Cupertino/IBM@IBMUS
Cc:
Carlos Carvalho <[email protected]>, [email protected],
[email protected]
Date:
06/04/2009 04:57 PM
Subject:
Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS
I/O performance degraded by FLUSH_STABLE page flushing
On Thu, 2009-06-04 at 16:43 -0400, Brian R Cowan wrote:
> What I'm trying to understand is why RHEL 4 is not flushing anywhere near
> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
> it still flushes a lot more than RHEL 4 does.
Most of that increase is probably mainly due to the changes to the way
stat() works. More precisely, it would be due to this patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=70b9ecbdb9c5fdc731f8780bffd45d9519020c4a
which went into Linux 2.6.16 in order to fix a posix compatibility
issue.
Trond
On Mon, Jun 15, 2009 at 08:55:58PM -0400, bfields wrote:
> Whoops--actually, it's the opposite problem: a bugfix patch that went
> upstream removed this, and I didn't merge that back into my for-2.6.31
> branch. OK, time to do that, and then this is all much simpler....
> Thanks for calling my attention to that!
Having fixed that... the following is what I'm applying (on top of
Trond's).
--b.
Brian R Cowan wrote:
> Trond Myklebust <[email protected]> wrote on 06/04/2009 02:04:58
> PM:
>
>> Did you try turning off write gathering on the server (i.e. add the
>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>
> Just tried it, this seems to be a very useful workaround as well. The
> FILE_SYNC write calls come back in about the same amount of time as the
> write+commit pairs... Speeds up building regardless of the network
> filesystem (ClearCase MVFS or straight NFS).
Does anybody have the history as to why 'no_wdelay' is an
export default? As Brian mentioned later in this thread
it only helps Linux servers, but that's a good thing, IMHO. ;-)
So I would have no problem changing the default export
options in nfs-utils, but it would be nice to know why
it was there in the first place...
Neil, Greg??
steved.
On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> >
> >
> > Trond Myklebust wrote:
> > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > >> Tom Talpey wrote:
> > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > >>>> Brian R Cowan wrote:
> > >>>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
> > >>>>> 02:04:58
> > >>>>> PM:
> > >>>>>
> > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > >>>>> write+commit pairs... Speeds up building regardless of the network
> > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > >>>> Does anybody have the history as to why 'no_wdelay' is an
> > >>>> export default?
> > >>> Because "wdelay" is a complete crock?
> > >>>
> > >>> Adding 10ms to every write RPC only helps if there's a steady
> > >>> single-file stream arriving at the server. In most other workloads
> > >>> it only slows things down.
> > >>>
> > >>> The better solution is to continue tuning the clients to issue
> > >>> writes in a more sequential and less all-or-nothing fashion.
> > >>> There are plenty of other less crock-ful things to do in the
> > >>> server, too.
> > >> Ok... So do you think removing it as a default would cause
> > >> any regressions?
> > >
> > > It might for NFSv2 clients, since they don't have the option of using
> > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > gathering an NFSv2 only feature.
> > Sounds good to me! ;-)
>
> Patch welcomed.--b.
Something like this ought to suffice...
-----------------------------------------------------------------------
From: Trond Myklebust <[email protected]>
NFSD: Make sure that write gathering only applies to NFSv2
NFSv3 and above can use unstable writes whenever they are sending more
than one write, rather than relying on the flaky write gathering
heuristics. More often than not, write gathering is currently getting it
wrong when the NFSv3 clients are sending a single write with FILE_SYNC
for efficiency reasons.
This patch turns off write gathering for NFSv3/v4, and ensures that
it only applies to the one case that can actually benefit: namely NFSv2.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfsd/vfs.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b660435..f30cc4e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -975,6 +975,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
__be32 err = 0;
int host_err;
int stable = *stablep;
+ int use_wgather;
#ifdef MSNFS
err = nfserr_perm;
@@ -993,9 +994,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
* - the sync export option has been set, or
* - the client requested O_SYNC behavior (NFSv3 feature).
* - The file system doesn't support fsync().
- * When gathered writes have been configured for this volume,
+ * When NFSv2 gathered writes have been configured for this volume,
* flushing the data to disk is handled separately below.
*/
+ use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
if (!file->f_op->fsync) {/* COMMIT3 cannot work */
stable = 2;
@@ -1004,7 +1006,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
if (!EX_ISSYNC(exp))
stable = 0;
- if (stable && !EX_WGATHER(exp)) {
+ if (stable && !use_wgather) {
spin_lock(&file->f_lock);
file->f_flags |= O_SYNC;
spin_unlock(&file->f_lock);
@@ -1040,7 +1042,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
* nice and simple solution (IMHO), and it seems to
* work:-)
*/
- if (EX_WGATHER(exp)) {
+ if (use_wgather) {
if (atomic_read(&inode->i_writecount) > 1
|| (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
dprintk("nfsd: write defer %d\n", task_pid_nr(current));
On Thu, 2009-06-04 at 13:42 -0400, Brian R Cowan wrote:
> I've been looking in more detail in the network traces that started all
> this, and doing some additional testing with the 2.6.29 kernel in an
> NFS-only build...
>
> In brief:
> 1) RHEL 5 generates >3x the network write traffic than RHEL4 when linking
> Samba's smbd.
> 2) In RHEL 5, Those unnecessary writes are slowed down by the "FILE_SYNC"
> optimization put in place for small writes.
> 3) That optimization seems to be removed from the kernel somewhere between
> 2.6.18 and 2.6.29.
> 4) Unfortunately the "unnecessary write before read" behavior is still
> present in 2.6.29.
>
> In detail:
> In RHEL 5, I see a lot of reads from offset {whatever} *immediately*
> preceded by a write to *the same offset*. This is obviously a bad thing,
> now the trick is finding out where it is coming from. The
> write-before-read behavior is happening on the smbd file itself (not
> surprising since that's the only file we're writing in this test...). This
> happens with every 2.6.18 and later kernel I've tested to date.
>
> In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take
> something on the order of 10ms to come back. When using a 2.6.29 kernel,
> the TOTAL time for the write+commit rpc set (write rpc, write reply,
> commit rpc, commit reply), to come back is something like 2ms. I guess the
> NFS servers aren't handling FILE_SYNC writes very well. in 2.6.29, ALL the
> write calls appear to be unstable writes, in RHEL5, most are FILE_SYNC
> writes. (Network traces available upon request.)
Did you try turning off write gathering on the server (i.e. add the
'no_wdelay' export option)? As I said earlier, that forces a delay of
10ms per RPC call, which might explain the FILE_SYNC slowness.
> Neither is quite as fast as RHEL 4, because the link under RHEL 4 only
> puts about 150 WRITE rpc's on the wire. RHEL 5 generates more than 500
> when building on NFS, and 2.6.29 puts about 340 write rpc's, plus a
> similar number of COMMITs, on the wire.
>
> The bottom line:
> * If someone can help me find where 2.6 stopped setting small writes to
> FILE_SYNC, I'd appreciate it. It would save me time walking through >50
> commitdiffs in gitweb...
It still does set FILE_SYNC for single page writes.
> * Is this the correct place to start discussing the annoying
> write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29
> continues?
Yes, but you'll need to tell us a bit more about the write patterns. Are
these random writes, or are they sequential? Is there any file locking
involved?
As I've said earlier in this thread, all NFS clients will flush out the
dirty data if a page that is being attempted read also contains
uninitialised areas.
Trond
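For illustration, a sketch of the rule being described (this is not the actual
fs/nfs read path; read_page_sketch is a made-up name, though PageUptodate,
PageDirty and nfs_wb_page are real kernel symbols):

    /*
     * If a READ lands on a page that is not fully up to date but already
     * carries dirty bytes from an earlier short write, the client has to
     * flush those bytes first; filling the page from the server would
     * otherwise overwrite the locally written data.
     */
    static int read_page_sketch(struct inode *inode, struct page *page)
    {
            int err = 0;

            if (!PageUptodate(page) && PageDirty(page))
                    err = nfs_wb_page(inode, page); /* the write-before-read seen on the wire */
            if (err)
                    return err;
            /* ...now it is safe to issue the READ to fill the page... */
            return 0;
    }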
Brian R Cowan wrote:
> Trond Myklebust <[email protected]> wrote on 06/04/2009 02:04:58
> PM:
>
>
>> Did you try turning off write gathering on the server (i.e. add the
>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>
>
> Just tried it, this seems to be a very useful workaround as well. The
> FILE_SYNC write calls come back in about the same amount of time as the
> write+commit pairs... Speeds up building regardless of the network
> filesystem (ClearCase MVFS or straight NFS).
>
>
>>> The bottom line:
>>> * If someone can help me find where 2.6 stopped setting small writes to
>>> FILE_SYNC, I'd appreciate it. It would save me time walking through >50
>>> commitdiffs in gitweb...
>>>
>> It still does set FILE_SYNC for single page writes.
>>
>
> Well, the network trace *seems* to say otherwise, but that could be
> because the 2.6.29 kernel is now reliably following a code path that
> doesn't set up to do FILE_SYNC writes for these flushes... Just like the
> RHEL 5 traces didn't have every "small" write to the link output file go
> out as a FILE_SYNC write.
>
>
>>> * Is this the correct place to start discussing the annoying
>>> write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29
>>> continues?
>>>
>> Yes, but you'll need to tell us a bit more about the write patterns. Are
>> these random writes, or are they sequential? Is there any file locking
>> involved?
>>
>
> Well, it's just a link, so it's random read/write traffic. (read object
> file/library, add stuff to output file, seek somewhere else and update a
> table, etc., etc.) All I did here was build Samba over nfs, remove
> bin/smbd, and then do a "make bin/smbd" to rebuild it. My network traces
> show that the file is opened "UNCHECKED" when doing the build in straight
> NFS, and "EXCLUSIVE" when building in a ClearCase view. This change does
> not seem to impact the behavior. We never lock the output file. The
> write-before-read happens all over the place. And when we did straces and
> lined up the call times, is it a read operation triggering the write.
>
>
>> As I've said earlier in this thread, all NFS clients will flush out the
>> dirty data if a page that is being attempted read also contains
>> uninitialised areas.
>>
>
> What I'm trying to understand is why RHEL 4 is not flushing anywhere near
> as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
> erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
> seen that 2.6.29 flushes less than the Red hat 2.6.18-derived kernels, but
> it still flushes a lot more than RHEL 4 does.
>
>
I think that you are making a lot of assumptions here that
are not necessarily backed by the evidence. The base cause
here seems more likely to me to be the setting of PG_uptodate
being different on the different releases, i.e. RHEL-4, RHEL-5,
and 2.6.29. All of these kernels contain the support to
write out pages which are not marked as PG_uptodate.
ps
> In any event, that doesn't help us here since 1) ClearCase can't work with
> that kernel; 2) Red Hat won't support use of that kernel on RHEL 5; and 3)
> the amount of code review my customer would have to go through to get the
> whole kernel vetted for use in their environment is frightening.
>
Trond Myklebust <[email protected]> wrote on 06/04/2009 02:04:58
PM:
> Did you try turning off write gathering on the server (i.e. add the
> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> 10ms per RPC call, which might explain the FILE_SYNC slowness.
Just tried it, this seems to be a very useful workaround as well. The
FILE_SYNC write calls come back in about the same amount of time as the
write+commit pairs... Speeds up building regardless of the network
filesystem (ClearCase MVFS or straight NFS).
> > The bottom line:
> > * If someone can help me find where 2.6 stopped setting small writes to
> > FILE_SYNC, I'd appreciate it. It would save me time walking through >50
> > commitdiffs in gitweb...
>
> It still does set FILE_SYNC for single page writes.
Well, the network trace *seems* to say otherwise, but that could be
because the 2.6.29 kernel is now reliably following a code path that
doesn't set up to do FILE_SYNC writes for these flushes... Just like the
RHEL 5 traces didn't have every "small" write to the link output file go
out as a FILE_SYNC write.
>
> > * Is this the correct place to start discussing the annoying
> > write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29
> > continues?
>
> Yes, but you'll need to tell us a bit more about the write patterns. Are
> these random writes, or are they sequential? Is there any file locking
> involved?
Well, it's just a link, so it's random read/write traffic. (read object
file/library, add stuff to output file, seek somewhere else and update a
table, etc., etc.) All I did here was build Samba over NFS, remove
bin/smbd, and then do a "make bin/smbd" to rebuild it. My network traces
show that the file is opened "UNCHECKED" when doing the build in straight
NFS, and "EXCLUSIVE" when building in a ClearCase view. This change does
not seem to impact the behavior. We never lock the output file. The
write-before-read happens all over the place. And when we did straces and
lined up the call times, it is a read operation triggering the write.
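As an aside, here is a minimal user-space sketch of that access pattern (the
path is a placeholder); running it against an NFS mount while capturing
traffic should show whether small writes followed by reads of the same region
turn into flushes on the wire:

    /* Interleave small writes and reads at random offsets, roughly what a
     * linker does to its output file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[512] = { 0 };
            int i, fd = open("/mnt/nfs/linktest.out", O_RDWR | O_CREAT, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            for (i = 0; i < 1000; i++) {
                    off_t off = (off_t)(rand() % 4096) * 512;  /* jump around the file */

                    if (pwrite(fd, buf, sizeof(buf), off) < 0) /* small partial-page write */
                            perror("pwrite");
                    if (pread(fd, buf, sizeof(buf), off) < 0)  /* read back the same region */
                            perror("pread");
            }
            close(fd);
            return 0;
    }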
>
> As I've said earlier in this thread, all NFS clients will flush out the
> dirty data if a page that is being attempted read also contains
> uninitialised areas.
What I'm trying to understand is why RHEL 4 is not flushing anywhere near
as often. Either RHEL4 erred on the side of not writing, and RHEL5 is
erring on the opposite side, or RHEL5 is doing unnecessary flushes... I've
seen that 2.6.29 flushes less than the Red Hat 2.6.18-derived kernels, but
it still flushes a lot more than RHEL 4 does.
In any event, that doesn't help us here since 1) ClearCase can't work with
that kernel; 2) Red Hat won't support use of that kernel on RHEL 5; and 3)
the amount of code review my customer would have to go through to get the
whole kernel vetted for use in their environment is frightening.
I've been looking in more detail in the network traces that started all
this, and doing some additional testing with the 2.6.29 kernel in an
NFS-only build...
In brief:
1) RHEL 5 generates >3x the network write traffic of RHEL 4 when linking
Samba's smbd.
2) In RHEL 5, those unnecessary writes are slowed down by the "FILE_SYNC"
optimization put in place for small writes.
3) That optimization seems to have been removed from the kernel somewhere
between 2.6.18 and 2.6.29.
4) Unfortunately the "unnecessary write before read" behavior is still
present in 2.6.29.
In detail:
In RHEL 5, I see a lot of reads from offset {whatever} *immediately*
preceded by a write to *the same offset*. This is obviously a bad thing;
now the trick is finding out where it is coming from. The
write-before-read behavior is happening on the smbd file itself (not
surprising since that's the only file we're writing in this test...). This
happens with every 2.6.18 and later kernel I've tested to date.
In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take
something on the order of 10ms to come back. When using a 2.6.29 kernel,
the TOTAL time for the write+commit RPC set (write RPC, write reply,
commit RPC, commit reply) to come back is something like 2ms. I guess the
NFS servers aren't handling FILE_SYNC writes very well. In 2.6.29, ALL the
write calls appear to be unstable writes; in RHEL 5, most are FILE_SYNC
writes. (Network traces available upon request.)
Neither is quite as fast as RHEL 4, because the link under RHEL 4 only
puts about 150 WRITE rpc's on the wire. RHEL 5 generates more than 500
when building on NFS, and 2.6.29 puts about 340 write rpc's, plus a
similar number of COMMITs, on the wire.
The bottom line:
* If someone can help me find where 2.6 stopped setting small writes to
FILE_SYNC, I'd appreciate it. It would save me time walking through >50
commitdiffs in gitweb...
* Is this the correct place to start discussing the annoying
write-before-almost-every-read behavior that 2.6.18 picked up and 2.6.29
continues?
=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
Please be sure to update your PMR using ESR at
http://www-306.ibm.com/software/support/probsub.html or cc all
correspondence to [email protected] to be sure your PMR is updated in
case I am not available.
From:
Trond Myklebust <[email protected]>
To:
Carlos Carvalho <[email protected]>
Cc:
[email protected]
Date:
06/03/2009 01:10 PM
Subject:
Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by:
[email protected]
On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
> Trond Myklebust ([email protected]) wrote on 2 June 2009 13:27:
> >Write gathering relies on waiting an arbitrary length of time in order
> >to see if someone is going to send another write. The protocol offers no
> >guidance as to how long that wait should be, and so (at least on the
> >Linux server) we've coded in a hard wait of 10ms if and only if we see
> >that something else has the file open for writing.
> >One problem with the Linux implementation is that the "something else"
> >could be another nfs server thread that happens to be in nfsd_write(),
> >however it could also be another open NFSv4 stateid, or a NLM lock, or a
> >local process that has the file open for writing.
> >Another problem is that the nfs server keeps a record of the last file
> >that was accessed, and also waits if it sees you are writing again to
> >that same file. Of course it has no idea if this is truly a parallel
> >write, or if it just happens that you are writing again to the same file
> >using O_SYNC...
>
> I think the decision to write or wait doesn't belong to the nfs
> server; it should just send the writes immediately. It's up to the
> fs/block/device layers to do the gathering. I understand that the
> client should try to do the gathering before sending the request to
> the wire
This isn't something that we've just pulled out of a hat. It dates back
to pre-NFSv3 times, when every write had to be synchronously committed
to disk before the RPC call could return.
See, for instance,
http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What+is+nfs+write+gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3
The point is that while it is a good idea for NFSv2, we have much better
methods of dealing with multiple writes in NFSv3 and v4...
Trond
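A small sketch of the trade-off being described (the helper and enum below are
made up for illustration; the values only mirror the on-the-wire stable_how
enum from RFC 1813):

    /* UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2, as in RFC 1813 */
    enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

    /*
     * A lone small write can go out FILE_SYNC (one RPC, nothing left to
     * COMMIT), while a stream of writes is better sent UNSTABLE and made
     * durable afterwards by a single COMMIT.
     */
    static enum stable_how choose_stable_how(unsigned int writes_in_flight)
    {
            return writes_in_flight <= 1 ? FILE_SYNC : UNSTABLE;
    }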
On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
> + if (host_err >= 0 && stable)
> + wait_for_concurrent_writes(file, use_wgather, &host_err);
>
Surely you want this to be:
if (host_err >= 0 && stable && use_wgather)
host_err = wait_for_concurrent_writes(file);
as
- this is more readable
- setting last_ino and last_dev is pointless when !use_wgather
- we aren't interested in differentiation between non-negative values of
host_err.
NeilBrown
On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote:
> On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > > >
> > > >
> > > > Trond Myklebust wrote:
> > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > > >> Tom Talpey wrote:
> > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > > >>>> Brian R Cowan wrote:
> > > > >>>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
> > > > >>>>> 02:04:58
> > > > >>>>> PM:
> > > > >>>>>
> > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > > >>>> Does anybody had the history as to why 'no_wdelay' is an
> > > > >>>> export default?
> > > > >>> Because "wdelay" is a complete crock?
> > > > >>>
> > > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > > >>> single-file stream arriving at the server. In most other workloads
> > > > >>> it only slows things down.
> > > > >>>
> > > > >>> The better solution is to continue tuning the clients to issue
> > > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > > >>> There are plenty of other less crock-ful things to do in the
> > > > >>> server, too.
> > > > >> Ok... So do you think removing it as a default would cause
> > > > >> any regressions?
> > > > >
> > > > > It might for NFSv2 clients, since they don't have the option of using
> > > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > > gathering an NFSv2 only feature.
> > > > Sounds good to me! ;-)
> > >
> > > Patch welcomed.--b.
> >
> > Something like this ought to suffice...
>
> Thanks, applied.
>
> I'd also like to apply cleanup something like the following--there's
> probably some cleaner way, but it just bothers me to have this
> write-gathering special case take up the bulk of nfsd_vfs_write....
>
> --b.
>
> commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
> Author: J. Bruce Fields <[email protected]>
> Date: Mon Jun 15 16:03:53 2009 -0700
>
> nfsd: Pull write-gathering code out of nfsd_vfs_write
>
> This is a relatively self-contained piece of code that handles a special
> case--move it to its own function.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index a8aac7f..de68557 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
> mutex_unlock(&dentry->d_inode->i_mutex);
> }
>
> +/*
> + * Gathered writes: If another process is currently writing to the file,
> + * there's a high chance this is another nfsd (triggered by a bulk write
> + * from a client's biod). Rather than syncing the file with each write
> + * request, we sleep for 10 msec.
> + *
> + * I don't know if this roughly approximates C. Juszak's idea of
> + * gathered writes, but it's a nice and simple solution (IMHO), and it
> + * seems to work:-)
> + *
> + * Note: we do this only in the NFSv2 case, since v3 and higher have a
> + * better tool (separate unstable writes and commits) for solving this
> + * problem.
> + */
> +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
> +{
> + struct inode *inode = file->f_path.dentry->d_inode;
> + static ino_t last_ino;
> + static dev_t last_dev;
> +
> + if (!use_wgather)
> + goto out;
> + if (atomic_read(&inode->i_writecount) > 1
> + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> + dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> + msleep(10);
> + dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> + }
> +
> + if (inode->i_state & I_DIRTY) {
> + dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> + *host_err = nfsd_sync(file);
> + }
> +out:
> + last_ino = inode->i_ino;
> + last_dev = inode->i_sb->s_dev;
> +}
Shouldn't you also timestamp the last_ino/last_dev? Currently you can
end up waiting even if the last time you referenced this file was 10
minutes ago...
> +
> static __be32
> nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> loff_t offset, struct kvec *vec, int vlen,
> @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
> kill_suid(dentry);
>
> - if (host_err >= 0 && stable) {
> - static ino_t last_ino;
> - static dev_t last_dev;
> -
> - /*
> - * Gathered writes: If another process is currently
> - * writing to the file, there's a high chance
> - * this is another nfsd (triggered by a bulk write
> - * from a client's biod). Rather than syncing the
> - * file with each write request, we sleep for 10 msec.
> - *
> - * I don't know if this roughly approximates
> - * C. Juszak's idea of gathered writes, but it's a
> - * nice and simple solution (IMHO), and it seems to
> - * work:-)
> - */
> - if (use_wgather) {
> - if (atomic_read(&inode->i_writecount) > 1
> - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> - dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> - msleep(10);
> - dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> - }
> -
> - if (inode->i_state & I_DIRTY) {
> - dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> - host_err=nfsd_sync(file);
> - }
> -#if 0
> - wake_up(&inode->i_wait);
> -#endif
> - }
> - last_ino = inode->i_ino;
> - last_dev = inode->i_sb->s_dev;
> - }
> + if (host_err >= 0 && stable)
> + wait_for_concurrent_writes(file, use_wgather, &host_err);
>
> dprintk("nfsd: write complete host_err=%d\n", host_err);
> if (host_err >= 0) {
On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote:
> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
>
> > + if (host_err >= 0 && stable)
> > + wait_for_concurrent_writes(file, use_wgather, &host_err);
> >
>
> Surely you want this to be:
>
> if (host_err >= 0 && stable && use_wgather)
> host_err = wait_for_concurrent_writes(file);
> as
> - this is more readable
> - setting last_ino and last_dev is pointless when !use_wgather
Yep, thanks.
> - we aren't interested in differentiation between non-negative values of
> host_err.
Unfortunately, just below:
if (host_err >= 0) {
err = 0;
*cnt = host_err;
} else
err = nfserrno(host_err);
We could save that count earlier, e.g.:
@@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
int host_err;
int stable = *stablep;
int use_wgather;
+ int bytes;
#ifdef MSNFS
err = nfserr_perm;
@@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
set_fs(oldfs);
if (host_err >= 0) {
nfsdstats.io_write += host_err;
+ bytes = host_err;
fsnotify_modify(file->f_path.dentry);
}
@@ -1063,13 +1064,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fh
if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
kill_suid(dentry);
- if (host_err >= 0 && stable)
- wait_for_concurrent_writes(file, use_wgather, &host_err);
+ if (host_err >= 0 && stable && use_wgather)
+ host_err = wait_for_concurrent_writes(file);
dprintk("nfsd: write complete host_err=%d\n", host_err);
if (host_err >= 0) {
err = 0;
- *cnt = host_err;
+ *cnt = bytes;
} else
err = nfserrno(host_err);
out:
--b.
On Tue, June 16, 2009 10:33 am, J. Bruce Fields wrote:
> On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote:
>> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
>>
>> > + if (host_err >= 0 && stable)
>> > + wait_for_concurrent_writes(file, use_wgather, &host_err);
>> >
>>
>> Surely you want this to be:
>>
>> if (host_err >= 0 && stable && use_wgather)
>> host_err = wait_for_concurrent_writes(file);
>> as
>> - this is more readable
>> - setting last_ino and last_dev is pointless when !use_wgather
>
> Yep, thanks.
>
>> - we aren't interested in differentiation between non-negative values
>> of
>> host_err.
>
> Unfortunately, just below:
>
> if (host_err >= 0) {
> err = 0;
> *cnt = host_err;
> } else
> err = nfserrno(host_err);
>
Ahh.... that must be in code you haven't pushed out yet.
I don't see it in mainline or git.linux-nfs.org
> We could save that count earlier, e.g.:
>
> @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> *fhp,
> int host_err;
> int stable = *stablep;
> int use_wgather;
> + int bytes;
>
> #ifdef MSNFS
> err = nfserr_perm;
> @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> *fhp,
> set_fs(oldfs);
> if (host_err >= 0) {
> nfsdstats.io_write += host_err;
> + bytes = host_err;
> fsnotify_modify(file->f_path.dentry);
Or even
        if (host_err >= 0) {
                bytes = host_err;
                nfsdstats.io_write += bytes;
                ...
And if you did that in whatever patch move the assignment to
*cnt to the bottom of the function, it might be even more readable!
Thanks,
NeilBrown
On Tue, Jun 16, 2009 at 10:50:57AM +1000, NeilBrown wrote:
> On Tue, June 16, 2009 10:33 am, J. Bruce Fields wrote:
> > On Tue, Jun 16, 2009 at 10:21:50AM +1000, NeilBrown wrote:
> >> On Tue, June 16, 2009 9:08 am, J. Bruce Fields wrote:
> >>
> >> > + if (host_err >= 0 && stable)
> >> > + wait_for_concurrent_writes(file, use_wgather, &host_err);
> >> >
> >>
> >> Surely you want this to be:
> >>
> >> if (host_err >= 0 && stable && use_wgather)
> >> host_err = wait_for_concurrent_writes(file);
> >> as
> >> - this is more readable
> >> - setting last_ino and last_dev is pointless when !use_wgather
> >
> > Yep, thanks.
> >
> >> - we aren't interested in differentiation between non-negative values
> >> of
> >> host_err.
> >
> > Unfortunately, just below:
> >
> > if (host_err >= 0) {
> > err = 0;
> > *cnt = host_err;
> > } else
> > err = nfserrno(host_err);
> >
>
> Ahh.... that must be in code you haven't pushed out yet.
> I don't see it in mainline or git.linux-nfs.org
Whoops--actually, it's the opposite problem: a bugfix patch that went
upstream removed this, and I didn't merge that back into my for-2.6.31
branch. OK, time to do that, and then this is all much simpler....
Thanks for calling my attention to that!
--b.
>
> > We could save that count earlier, e.g.:
> >
> > @@ -1014,6 +1013,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> > *fhp,
> > int host_err;
> > int stable = *stablep;
> > int use_wgather;
> > + int bytes;
> >
> > #ifdef MSNFS
> > err = nfserr_perm;
> > @@ -1056,6 +1056,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh
> > *fhp,
> > set_fs(oldfs);
> > if (host_err >= 0) {
> > nfsdstats.io_write += host_err;
> > + bytes = host_err;
> > fsnotify_modify(file->f_path.dentry);
>
> Or even
>
> if (host_err >= 0) {
> bytes = host_err;
> nfsdstats.io_write += bytes
> ...
>
> And if you did that in whatever patch move the assignment to
> *cnt to the bottom of the function, it might be even more readable!
>
> Thanks,
> NeilBrown
>
>
On Mon, Jun 15, 2009 at 05:32:04PM -0700, Trond Myklebust wrote:
> On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote:
> > On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> > > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > > > >
> > > > >
> > > > > Trond Myklebust wrote:
> > > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > > > >> Tom Talpey wrote:
> > > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > > > >>>> Brian R Cowan wrote:
> > > > > >>>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
> > > > > >>>>> 02:04:58
> > > > > >>>>> PM:
> > > > > >>>>>
> > > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > > > >>>> Does anybody had the history as to why 'no_wdelay' is an
> > > > > >>>> export default?
> > > > > >>> Because "wdelay" is a complete crock?
> > > > > >>>
> > > > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > > > >>> single-file stream arriving at the server. In most other workloads
> > > > > >>> it only slows things down.
> > > > > >>>
> > > > > >>> The better solution is to continue tuning the clients to issue
> > > > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > > > >>> There are plenty of other less crock-ful things to do in the
> > > > > >>> server, too.
> > > > > >> Ok... So do you think removing it as a default would cause
> > > > > >> any regressions?
> > > > > >
> > > > > > It might for NFSv2 clients, since they don't have the option of using
> > > > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > > > gathering an NFSv2 only feature.
> > > > > Sounds good to me! ;-)
> > > >
> > > > Patch welcomed.--b.
> > >
> > > Something like this ought to suffice...
> >
> > Thanks, applied.
> >
> > I'd also like to apply cleanup something like the following--there's
> > probably some cleaner way, but it just bothers me to have this
> > write-gathering special case take up the bulk of nfsd_vfs_write....
> >
> > --b.
> >
> > commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
> > Author: J. Bruce Fields <[email protected]>
> > Date: Mon Jun 15 16:03:53 2009 -0700
> >
> > nfsd: Pull write-gathering code out of nfsd_vfs_write
> >
> > This is a relatively self-contained piece of code that handles a special
> > case--move it to its own function.
> >
> > Signed-off-by: J. Bruce Fields <[email protected]>
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index a8aac7f..de68557 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
> > mutex_unlock(&dentry->d_inode->i_mutex);
> > }
> >
> > +/*
> > + * Gathered writes: If another process is currently writing to the file,
> > + * there's a high chance this is another nfsd (triggered by a bulk write
> > + * from a client's biod). Rather than syncing the file with each write
> > + * request, we sleep for 10 msec.
> > + *
> > + * I don't know if this roughly approximates C. Juszak's idea of
> > + * gathered writes, but it's a nice and simple solution (IMHO), and it
> > + * seems to work:-)
> > + *
> > + * Note: we do this only in the NFSv2 case, since v3 and higher have a
> > + * better tool (separate unstable writes and commits) for solving this
> > + * problem.
> > + */
> > +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
> > +{
> > + struct inode *inode = file->f_path.dentry->d_inode;
> > + static ino_t last_ino;
> > + static dev_t last_dev;
> > +
> > + if (!use_wgather)
> > + goto out;
> > + if (atomic_read(&inode->i_writecount) > 1
> > + || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> > + dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> > + msleep(10);
> > + dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> > + }
> > +
> > + if (inode->i_state & I_DIRTY) {
> > + dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > + *host_err = nfsd_sync(file);
> > + }
> > +out:
> > + last_ino = inode->i_ino;
> > + last_dev = inode->i_sb->s_dev;
> > +}
>
> Shouldn't you also timestamp the last_ino/last_dev? Currently you can
> end up waiting even if the last time you referenced this file was 10
> minutes ago...
Maybe, but I don't know that avoiding the delay in the case where
use_wdelay writes are coming rarely is particularly important.
(Note this is just a single static last_ino/last_dev, so the timestamp
would just tell us how long ago there was last a use_wdelay write.)
I'm not as interested in making wdelay work better--someone who uses v2
and wants to benchmark it can do that--as I am interested in just
getting it out of the way so I don't have to look at it again....
--b.
>
> > +
> > static __be32
> > nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> > loff_t offset, struct kvec *vec, int vlen,
> > @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> > if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
> > kill_suid(dentry);
> >
> > - if (host_err >= 0 && stable) {
> > - static ino_t last_ino;
> > - static dev_t last_dev;
> > -
> > - /*
> > - * Gathered writes: If another process is currently
> > - * writing to the file, there's a high chance
> > - * this is another nfsd (triggered by a bulk write
> > - * from a client's biod). Rather than syncing the
> > - * file with each write request, we sleep for 10 msec.
> > - *
> > - * I don't know if this roughly approximates
> > - * C. Juszak's idea of gathered writes, but it's a
> > - * nice and simple solution (IMHO), and it seems to
> > - * work:-)
> > - */
> > - if (use_wgather) {
> > - if (atomic_read(&inode->i_writecount) > 1
> > - || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> > - dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> > - msleep(10);
> > - dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> > - }
> > -
> > - if (inode->i_state & I_DIRTY) {
> > - dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > - host_err=nfsd_sync(file);
> > - }
> > -#if 0
> > - wake_up(&inode->i_wait);
> > -#endif
> > - }
> > - last_ino = inode->i_ino;
> > - last_dev = inode->i_sb->s_dev;
> > - }
> > + if (host_err >= 0 && stable)
> > + wait_for_concurrent_writes(file, use_wgather, &host_err);
> >
> > dprintk("nfsd: write complete host_err=%d\n", host_err);
> > if (host_err >= 0) {
>
>
From: J. Bruce Fields <[email protected]>
Updating last_ino and last_dev probably isn't useful in the !use_wgather
case.
Also remove some pointless ifdef'd-out code.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/nfsd/vfs.c | 25 ++++++++++---------------
1 files changed, 10 insertions(+), 15 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f30cc4e..ebf56c6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1026,7 +1026,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
kill_suid(dentry);
- if (host_err >= 0 && stable) {
+ if (host_err >= 0 && stable && use_wgather) {
static ino_t last_ino;
static dev_t last_dev;
@@ -1042,21 +1042,16 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
* nice and simple solution (IMHO), and it seems to
* work:-)
*/
- if (use_wgather) {
- if (atomic_read(&inode->i_writecount) > 1
- || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
- dprintk("nfsd: write defer %d\n", task_pid_nr(current));
- msleep(10);
- dprintk("nfsd: write resume %d\n", task_pid_nr(current));
- }
+ if (atomic_read(&inode->i_writecount) > 1
+ || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
+ dprintk("nfsd: write defer %d\n", task_pid_nr(current));
+ msleep(10);
+ dprintk("nfsd: write resume %d\n", task_pid_nr(current));
+ }
- if (inode->i_state & I_DIRTY) {
- dprintk("nfsd: write sync %d\n", task_pid_nr(current));
- host_err=nfsd_sync(file);
- }
-#if 0
- wake_up(&inode->i_wait);
-#endif
+ if (inode->i_state & I_DIRTY) {
+ dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+ host_err=nfsd_sync(file);
}
last_ino = inode->i_ino;
last_dev = inode->i_sb->s_dev;
--
1.6.0.4
From: J. Bruce Fields <[email protected]>
There's no need to check host_err >= 0 every time here when we could
check host_err < 0 once, following the usual kernel style.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/nfsd/vfs.c | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6ad76a4..1cf7061 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1053,19 +1053,20 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
oldfs = get_fs(); set_fs(KERNEL_DS);
host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);
set_fs(oldfs);
- if (host_err >= 0) {
- *cnt = host_err;
- nfsdstats.io_write += host_err;
- fsnotify_modify(file->f_path.dentry);
- }
+ if (host_err < 0)
+ goto out_nfserr;
+ *cnt = host_err;
+ nfsdstats.io_write += host_err;
+ fsnotify_modify(file->f_path.dentry);
/* clear setuid/setgid flag after write */
- if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
+ if (inode->i_mode & (S_ISUID | S_ISGID))
kill_suid(dentry);
- if (host_err >= 0 && stable && use_wgather)
+ if (stable && use_wgather)
host_err = wait_for_concurrent_writes(file);
+out_nfserr:
dprintk("nfsd: write complete host_err=%d\n", host_err);
if (host_err >= 0)
err = 0;
--
1.6.0.4
From: J. Bruce Fields <[email protected]>
This is a relatively self-contained piece of code that handles a special
case--move it to its own function.
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/nfsd/vfs.c | 69 ++++++++++++++++++++++++++++++++------------------------
1 files changed, 39 insertions(+), 30 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ebf56c6..6ad76a4 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -963,6 +963,43 @@ static void kill_suid(struct dentry *dentry)
mutex_unlock(&dentry->d_inode->i_mutex);
}
+/*
+ * Gathered writes: If another process is currently writing to the file,
+ * there's a high chance this is another nfsd (triggered by a bulk write
+ * from a client's biod). Rather than syncing the file with each write
+ * request, we sleep for 10 msec.
+ *
+ * I don't know if this roughly approximates C. Juszak's idea of
+ * gathered writes, but it's a nice and simple solution (IMHO), and it
+ * seems to work:-)
+ *
+ * Note: we do this only in the NFSv2 case, since v3 and higher have a
+ * better tool (separate unstable writes and commits) for solving this
+ * problem.
+ */
+static int wait_for_concurrent_writes(struct file *file)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ static ino_t last_ino;
+ static dev_t last_dev;
+ int err = 0;
+
+ if (atomic_read(&inode->i_writecount) > 1
+ || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
+ dprintk("nfsd: write defer %d\n", task_pid_nr(current));
+ msleep(10);
+ dprintk("nfsd: write resume %d\n", task_pid_nr(current));
+ }
+
+ if (inode->i_state & I_DIRTY) {
+ dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+ err = nfsd_sync(file);
+ }
+ last_ino = inode->i_ino;
+ last_dev = inode->i_sb->s_dev;
+ return err;
+}
+
static __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
loff_t offset, struct kvec *vec, int vlen,
@@ -1026,36 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
kill_suid(dentry);
- if (host_err >= 0 && stable && use_wgather) {
- static ino_t last_ino;
- static dev_t last_dev;
-
- /*
- * Gathered writes: If another process is currently
- * writing to the file, there's a high chance
- * this is another nfsd (triggered by a bulk write
- * from a client's biod). Rather than syncing the
- * file with each write request, we sleep for 10 msec.
- *
- * I don't know if this roughly approximates
- * C. Juszak's idea of gathered writes, but it's a
- * nice and simple solution (IMHO), and it seems to
- * work:-)
- */
- if (atomic_read(&inode->i_writecount) > 1
- || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
- dprintk("nfsd: write defer %d\n", task_pid_nr(current));
- msleep(10);
- dprintk("nfsd: write resume %d\n", task_pid_nr(current));
- }
-
- if (inode->i_state & I_DIRTY) {
- dprintk("nfsd: write sync %d\n", task_pid_nr(current));
- host_err=nfsd_sync(file);
- }
- last_ino = inode->i_ino;
- last_dev = inode->i_sb->s_dev;
- }
+ if (host_err >= 0 && stable && use_wgather)
+ host_err = wait_for_concurrent_writes(file);
dprintk("nfsd: write complete host_err=%d\n", host_err);
if (host_err >= 0)
--
1.6.0.4
On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > >
> > >
> > > Trond Myklebust wrote:
> > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > >> Tom Talpey wrote:
> > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > >>>> Brian R Cowan wrote:
> > > >>>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
> > > >>>>> 02:04:58
> > > >>>>> PM:
> > > >>>>>
> > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > >>>> Does anybody had the history as to why 'no_wdelay' is an
> > > >>>> export default?
> > > >>> Because "wdelay" is a complete crock?
> > > >>>
> > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > >>> single-file stream arriving at the server. In most other workloads
> > > >>> it only slows things down.
> > > >>>
> > > >>> The better solution is to continue tuning the clients to issue
> > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > >>> There are plenty of other less crock-ful things to do in the
> > > >>> server, too.
> > > >> Ok... So do you think removing it as a default would cause
> > > >> any regressions?
> > > >
> > > > It might for NFSv2 clients, since they don't have the option of using
> > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > gathering an NFSv2 only feature.
> > > Sounds good to me! ;-)
> >
> > Patch welcomed.--b.
>
> Something like this ought to suffice...
Thanks, applied.
I'd also like to apply a cleanup something like the following--there's
probably some cleaner way, but it just bothers me to have this
write-gathering special case take up the bulk of nfsd_vfs_write....
--b.
commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
Author: J. Bruce Fields <[email protected]>
Date: Mon Jun 15 16:03:53 2009 -0700
nfsd: Pull write-gathering code out of nfsd_vfs_write
This is a relatively self-contained piece of code that handles a special
case--move it to its own function.
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a8aac7f..de68557 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
mutex_unlock(&dentry->d_inode->i_mutex);
}
+/*
+ * Gathered writes: If another process is currently writing to the file,
+ * there's a high chance this is another nfsd (triggered by a bulk write
+ * from a client's biod). Rather than syncing the file with each write
+ * request, we sleep for 10 msec.
+ *
+ * I don't know if this roughly approximates C. Juszak's idea of
+ * gathered writes, but it's a nice and simple solution (IMHO), and it
+ * seems to work:-)
+ *
+ * Note: we do this only in the NFSv2 case, since v3 and higher have a
+ * better tool (separate unstable writes and commits) for solving this
+ * problem.
+ */
+static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ static ino_t last_ino;
+ static dev_t last_dev;
+
+ if (!use_wgather)
+ goto out;
+ if (atomic_read(&inode->i_writecount) > 1
+ || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
+ dprintk("nfsd: write defer %d\n", task_pid_nr(current));
+ msleep(10);
+ dprintk("nfsd: write resume %d\n", task_pid_nr(current));
+ }
+
+ if (inode->i_state & I_DIRTY) {
+ dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+ *host_err = nfsd_sync(file);
+ }
+out:
+ last_ino = inode->i_ino;
+ last_dev = inode->i_sb->s_dev;
+}
+
static __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
loff_t offset, struct kvec *vec, int vlen,
@@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
kill_suid(dentry);
- if (host_err >= 0 && stable) {
- static ino_t last_ino;
- static dev_t last_dev;
-
- /*
- * Gathered writes: If another process is currently
- * writing to the file, there's a high chance
- * this is another nfsd (triggered by a bulk write
- * from a client's biod). Rather than syncing the
- * file with each write request, we sleep for 10 msec.
- *
- * I don't know if this roughly approximates
- * C. Juszak's idea of gathered writes, but it's a
- * nice and simple solution (IMHO), and it seems to
- * work:-)
- */
- if (use_wgather) {
- if (atomic_read(&inode->i_writecount) > 1
- || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
- dprintk("nfsd: write defer %d\n", task_pid_nr(current));
- msleep(10);
- dprintk("nfsd: write resume %d\n", task_pid_nr(current));
- }
-
- if (inode->i_state & I_DIRTY) {
- dprintk("nfsd: write sync %d\n", task_pid_nr(current));
- host_err=nfsd_sync(file);
- }
-#if 0
- wake_up(&inode->i_wait);
-#endif
- }
- last_ino = inode->i_ino;
- last_dev = inode->i_sb->s_dev;
- }
+ if (host_err >= 0 && stable)
+ wait_for_concurrent_writes(file, use_wgather, &host_err);
dprintk("nfsd: write complete host_err=%d\n", host_err);
if (host_err >= 0) {
Tom Talpey wrote:
> On 6/5/2009 7:35 AM, Steve Dickson wrote:
>> Brian R Cowan wrote:
>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
>>> 02:04:58
>>> PM:
>>>
>>>> Did you try turning off write gathering on the server (i.e. add the
>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>> Just tried it, this seems to be a very useful workaround as well. The
>>> FILE_SYNC write calls come back in about the same amount of time as the
>>> write+commit pairs... Speeds up building regardless of the network
>>> filesystem (ClearCase MVFS or straight NFS).
>>
>> Does anybody had the history as to why 'no_wdelay' is an
>> export default?
>
> Because "wdelay" is a complete crock?
>
> Adding 10ms to every write RPC only helps if there's a steady
> single-file stream arriving at the server. In most other workloads
> it only slows things down.
>
> The better solution is to continue tuning the clients to issue
> writes in a more sequential and less all-or-nothing fashion.
> There are plenty of other less crock-ful things to do in the
> server, too.
Ok... So do you think removing it as a default would cause
any regressions?
steved.
Trond Myklebust wrote:
> On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
>> Tom Talpey wrote:
>>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
>>>> Brian R Cowan wrote:
>>>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
>>>>> 02:04:58
>>>>> PM:
>>>>>
>>>>>> Did you try turning off write gathering on the server (i.e. add the
>>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
>>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
>>>>> Just tried it, this seems to be a very useful workaround as well. The
>>>>> FILE_SYNC write calls come back in about the same amount of time as the
>>>>> write+commit pairs... Speeds up building regardless of the network
>>>>> filesystem (ClearCase MVFS or straight NFS).
>>>> Does anybody had the history as to why 'no_wdelay' is an
>>>> export default?
>>> Because "wdelay" is a complete crock?
>>>
>>> Adding 10ms to every write RPC only helps if there's a steady
>>> single-file stream arriving at the server. In most other workloads
>>> it only slows things down.
>>>
>>> The better solution is to continue tuning the clients to issue
>>> writes in a more sequential and less all-or-nothing fashion.
>>> There are plenty of other less crock-ful things to do in the
>>> server, too.
>> Ok... So do you think removing it as a default would cause
>> any regressions?
>
> It might for NFSv2 clients, since they don't have the option of using
> unstable writes. I'd therefore prefer a kernel solution that makes write
> gathering an NFSv2 only feature.
Sounds good to me! ;-)
steved.
On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > NFSD stops calling ->fsync without a file struct.
> >
> > I think the open file cache will help us with that, if we can extend
> > it to also cache open file structs for directories.
>
> Krishna Kumar--do you think that'd be a reasonable thing to do?
Btw, do you have at least the basic open files cache queue for 2.6.31?
On Fri, Jun 05, 2009 at 10:54:50AM -0400, Christoph Hellwig wrote:
> On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > NFSD stops calling ->fsync without a file struct.
> > >
> > > I think the open file cache will help us with that, if we can extend
> > > it to also cache open file structs for directories.
> >
> > Krishna Kumar--do you think that'd be a reasonable thing to do?
>
> Btw, do you have at least the basic open files cache queue for 2.6.31?
No. I'll try to give it a look this afternoon.
--b.
On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
>
>
> Trond Myklebust wrote:
> > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> >> Tom Talpey wrote:
> >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> >>>> Brian R Cowan wrote:
> >>>>> Trond Myklebust<[email protected]> wrote on 06/04/2009
> >>>>> 02:04:58
> >>>>> PM:
> >>>>>
> >>>>>> Did you try turning off write gathering on the server (i.e. add the
> >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> >>>>> Just tried it, this seems to be a very useful workaround as well. The
> >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> >>>>> write+commit pairs... Speeds up building regardless of the network
> >>>>> filesystem (ClearCase MVFS or straight NFS).
> >>>> Does anybody had the history as to why 'no_wdelay' is an
> >>>> export default?
> >>> Because "wdelay" is a complete crock?
> >>>
> >>> Adding 10ms to every write RPC only helps if there's a steady
> >>> single-file stream arriving at the server. In most other workloads
> >>> it only slows things down.
> >>>
> >>> The better solution is to continue tuning the clients to issue
> >>> writes in a more sequential and less all-or-nothing fashion.
> >>> There are plenty of other less crock-ful things to do in the
> >>> server, too.
> >> Ok... So do you think removing it as a default would cause
> >> any regressions?
> >
> > It might for NFSv2 clients, since they don't have the option of using
> > unstable writes. I'd therefore prefer a kernel solution that makes write
> > gathering an NFSv2 only feature.
> Sounds good to me! ;-)
Patch welcomed.--b.
On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote:
> On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > NFSD stops calling ->fsync without a file struct.
> > >
> > > I think the open file cache will help us with that, if we can extend
> > > it to also cache open file structs for directories.
> >
> > Krishna Kumar--do you think that'd be a reasonable thing to do?
>
> Btw, do you have at least the basic open files cache queue for 2.6.31?
>
Now that _will_ badly screw up the write gathering heuristic...
Trond
On Fri, Jun 05, 2009 at 12:12:08PM -0400, Trond Myklebust wrote:
> On Fri, 2009-06-05 at 10:54 -0400, Christoph Hellwig wrote:
> > On Mon, Jun 01, 2009 at 06:30:08PM -0400, J. Bruce Fields wrote:
> > > > NFSD stops calling ->fsync without a file struct.
> > > >
> > > > I think the open file cache will help us with that, if we can extend
> > > > it to also cache open file structs for directories.
> > >
> > > Krishna Kumar--do you think that'd be a reasonable thing to do?
> >
> > Btw, do you have at least the basic open files cache queue for 2.6.31?
> >
>
> Now that _will_ badly screw up the write gathering heuristic...
How?
--b.
On Sat, May 30, 2009 at 03:57:56AM -0400, Christoph Hellwig wrote:
> On Sat, May 30, 2009 at 10:22:58AM +1000, Greg Banks wrote:
> > * The underlying filesystem might be doing more or better things in
> > one or the other code paths e.g. optimising allocations.
>
> Which is the case with ext3 which is pretty common. It does reasonably
> well on O_SYNC as far as I can see, but has a catastrophic fsync
> implementation.
>
> > * The Linux NFS server ignores the byte range in the COMMIT rpc and
> > flushes the whole file (I suspect this is a historical accident rather
> > than deliberate policy). If there is other dirty data on that file
> > server-side, that other data will be written too before the COMMIT
> > reply is sent. This may have a performance impact, depending on the
> > workload.
>
> Right now we can't actually implement that proper because the fsync
> file operation can't actually flush sub ranges. There have been some
> other requests for this, but my ->fsync resdesign in on hold until
> NFSD stops calling ->fsync without a file struct.
>
> I think the open file cache will help us with that, if we can extend
> it to also cache open file structs for directories.
Krishna Kumar--do you think that'd be a reasonable thing to do?
--b.
On Sat, May 30, 2009 at 11:02:47PM +1000, Greg Banks wrote:
> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
> <[email protected]> wrote:
> > On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> >> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> >> <[email protected]> wrote:
> >> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >> >>
> >>
> >
> > Firstly, the server only uses O_SYNC if you turn off write gathering
> > (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> > server is to always try write gathering and hence no O_SYNC.
>
> Well, write gathering is a total crock that AFAICS only helps
> single-file writes on NFSv2. For today's workloads all it does is
> provide a hotspot on the two global variables that track writes in an
> attempt to gather them. Back when I worked on a server product,
> no_wdelay was one of the standard options for new exports.
Should be a simple nfs-utils patch to change the default.
--b.
>
> > Secondly, even if it were the case, then this does not justify changing
> > the client behaviour.
>
> I totally agree, it was just an observation.
>
> In any case, as Christoph points out, the ext3 performance difference
> makes an unstable WRITE+COMMIT slower than a stable WRITE, and you
> already assumed that.
>
> --
> Greg.
On May 30, 2009, at 9:02 AM, Greg Banks wrote:
> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
> <[email protected]> wrote:
>> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
>>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
>>> <[email protected]> wrote:
>>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>>>>>
>>>
>>
>> Firstly, the server only uses O_SYNC if you turn off write gathering
>> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
>> server is to always try write gathering and hence no O_SYNC.
>
> Well, write gathering is a total crock that AFAICS only helps
> single-file writes on NFSv2. For today's workloads all it does is
> provide a hotspot on the two global variables that track writes in an
> attempt to gather them. Back when I worked on a server product,
> no_wdelay was one of the standard options for new exports.
Really? Even for NFSv3/4 FILE_SYNC? I can understand that it
wouldn't have any real effect on UNSTABLE.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Tue, 2009-06-02 at 11:00 -0400, Chuck Lever wrote:
> On May 30, 2009, at 9:02 AM, Greg Banks wrote:
> > On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
> > <[email protected]> wrote:
> >> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> >>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> >>> <[email protected]> wrote:
> >>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >>>>>
> >>>
> >>
> >> Firstly, the server only uses O_SYNC if you turn off write gathering
> >> (a.k.a. the 'wdelay' option). The default behaviour for the Linux nfs
> >> server is to always try write gathering and hence no O_SYNC.
> >
> > Well, write gathering is a total crock that AFAICS only helps
> > single-file writes on NFSv2. For today's workloads all it does is
> > provide a hotspot on the two global variables that track writes in an
> > attempt to gather them. Back when I worked on a server product,
> > no_wdelay was one of the standard options for new exports.
>
> Really? Even for NFSv3/4 FILE_SYNC? I can understand that it
> wouldn't have any real effect on UNSTABLE.
The question is why would a sensible client ever want to send more than
1 NFSv3 write with FILE_SYNC? If you need to send multiple writes in
parallel to the same file, then it makes much more sense to use
UNSTABLE.
Write gathering relies on waiting an arbitrary length of time in order
to see if someone is going to send another write. The protocol offers no
guidance as to how long that wait should be, and so (at least on the
Linux server) we've coded in a hard wait of 10ms if and only if we see
that something else has the file open for writing.
One problem with the Linux implementation is that the "something else"
could be another nfs server thread that happens to be in nfsd_write();
however, it could also be another open NFSv4 stateid, an NLM lock, or a
local process that has the file open for writing.
Another problem is that the nfs server keeps a record of the last file
that was accessed, and also waits if it sees you are writing again to
that same file. Of course it has no idea if this is truly a parallel
write, or if it just happens that you are writing again to the same file
using O_SYNC...
Trond
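To make the heuristic described above concrete, here is a rough C sketch of
the decision as it is described in this thread. The function and variable
names are invented for illustration and do not correspond to the actual nfsd
sources:

#include <linux/fs.h>
#include <linux/delay.h>

/* Illustrative sketch only: the shape of the write-gathering wait
 * described above, not the real nfsd implementation. */
static unsigned long last_ino;	/* last inode we saw a write for */
static dev_t last_dev;

static void maybe_wait_for_more_writes(struct inode *inode)
{
	/* "Something else has the file open for writing": this counts
	 * other nfsd threads, NFSv4 stateids, NLM locks, and local
	 * opens alike, which is the first problem noted above. */
	int others_writing = atomic_read(&inode->i_writecount) > 1;

	/* "Writing again to the last file we saw": the second problem,
	 * since a single O_SYNC writer hitting one file also triggers it. */
	int same_file_again = (inode->i_ino == last_ino &&
			       inode->i_sb->s_dev == last_dev);

	if (others_writing || same_file_again)
		msleep(10);	/* the hard-coded 10ms wait */

	last_ino = inode->i_ino;
	last_dev = inode->i_sb->s_dev;
}

The sketch is only meant to show why both unrelated writers and a lone O_SYNC
writer updating the same file repeatedly end up paying the 10ms wait.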
On Jun 2, 2009, at 1:27 PM, Trond Myklebust wrote:
> On Tue, 2009-06-02 at 11:00 -0400, Chuck Lever wrote:
>> On May 30, 2009, at 9:02 AM, Greg Banks wrote:
>>> On Sat, May 30, 2009 at 10:26 PM, Trond Myklebust
>>> <[email protected]> wrote:
>>>> On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
>>>>> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
>>>>> <[email protected]> wrote:
>>>>>> On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
>>>>>>>
>>>>>
>>>>
>>>> Firstly, the server only uses O_SYNC if you turn off write
>>>> gathering
>>>> (a.k.a. the 'wdelay' option). The default behaviour for the Linux
>>>> nfs
>>>> server is to always try write gathering and hence no O_SYNC.
>>>
>>> Well, write gathering is a total crock that AFAICS only helps
>>> single-file writes on NFSv2. For today's workloads all it does is
>>> provide a hotspot on the two global variables that track writes in
>>> an
>>> attempt to gather them. Back when I worked on a server product,
>>> no_wdelay was one of the standard options for new exports.
>>
>> Really? Even for NFSv3/4 FILE_SYNC? I can understand that it
>> wouldn't have any real effect on UNSTABLE.
>
> The question is why would a sensible client ever want to send more
> than
> 1 NFSv3 write with FILE_SYNC?
A client might behave this way if an application is performing random
4KB synchronous writes to a large file, or if the VM is aggressively
flushing single pages to try to mitigate a low-memory situation. IOW,
it may not be up to the client...
Penalizing FILE_SYNC writes, even a little, by waiting a bit could
also reduce the server's workload by slowing clients that are pounding
a server with synchronous writes.
Not an argument, really... but it seems like there are some scenarios
where delaying synchronous writes could still be useful. The real
question is whether these scenarios occur frequently enough to warrant
the overhead in the server. It would be nice to see some I/O trace
data.
> If you need to send multiple writes in
> parallel to the same file, then it makes much more sense to use
> UNSTABLE.
Yep, agreed.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust ([email protected]) wrote on 2 June 2009 13:27:
>Write gathering relies on waiting an arbitrary length of time in order
>to see if someone is going to send another write. The protocol offers no
>guidance as to how long that wait should be, and so (at least on the
>Linux server) we've coded in a hard wait of 10ms if and only if we see
>that something else has the file open for writing.
>One problem with the Linux implementation is that the "something else"
>could be another nfs server thread that happens to be in nfsd_write();
>however, it could also be another open NFSv4 stateid, an NLM lock, or a
>local process that has the file open for writing.
>Another problem is that the nfs server keeps a record of the last file
>that was accessed, and also waits if it sees you are writing again to
>that same file. Of course it has no idea if this is truly a parallel
>write, or if it just happens that you are writing again to the same file
>using O_SYNC...
I think the decision to write or wait doesn't belong to the nfs
server; it should just send the writes immediately. It's up to the
fs/block/device layers to do the gathering. I understand that the
client should try to do the gathering before sending the request to
the wire.
Dean Hildebrand ([email protected]) wrote on 3 June 2009 17:28:
>Trond Myklebust wrote:
>> On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
>>
>>> Trond Myklebust ([email protected]) wrote on 2 June 2009 13:27:
>>> >Write gathering relies on waiting an arbitrary length of time in order
>>> >to see if someone is going to send another write. The protocol offers no
>>> >guidance as to how long that wait should be, and so (at least on the
>>> >Linux server) we've coded in a hard wait of 10ms if and only if we see
>>> >that something else has the file open for writing.
>>> >One problem with the Linux implementation is that the "something else"
>>> >could be another nfs server thread that happens to be in nfsd_write();
>>> >however, it could also be another open NFSv4 stateid, an NLM lock, or a
>>> >local process that has the file open for writing.
>>> >Another problem is that the nfs server keeps a record of the last file
>>> >that was accessed, and also waits if it sees you are writing again to
>>> >that same file. Of course it has no idea if this is truly a parallel
>>> >write, or if it just happens that you are writing again to the same file
>>> >using O_SYNC...
>>>
>>> I think the decision to write or wait doesn't belong to the nfs
>>> server; it should just send the writes immediately. It's up to the
>>> fs/block/device layers to do the gathering. I understand that the
>>> client should try to do the gathering before sending the request to
>>> the wire
>>>
>Just to be clear, the linux NFS server does not gather the writes.
>Writes are passed immediately to the fs.
Ah! That's much better.
>nfsd simply waits 10ms before
>sync'ing the writes to disk. This allows the underlying file system
>time to do the gathering and sync data in larger chunks.
OK, all is perfectly fine then.
Since syncs seem to be a requirement of the protocol, perhaps the 10ms
delay could be made tunable to allow admins more flexibility. For
example, if we change other timeouts we could adjust the nfs sync one
accordingly. Could be an option to nfsd or, better, a variable in /proc.
Thanks Dean and Trond for the explanations.
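A minimal sketch of the kind of tunable being suggested here could be a
module parameter; the name nfsd_wdelay_msecs and its default are assumptions
made up for this example, not an existing nfsd interface:

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical knob only, shown for illustration. */
static unsigned int nfsd_wdelay_msecs = 10;
module_param(nfsd_wdelay_msecs, uint, 0644);
MODULE_PARM_DESC(nfsd_wdelay_msecs,
		 "write gathering delay in milliseconds (0 disables the wait)");

With 0644 permissions the value would show up under
/sys/module/nfsd/parameters/nfsd_wdelay_msecs, and the server's hard-coded
10ms wait could consult it instead.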
Hi.
I have a proposal for possibly resolving this issue.
I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.
The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page. The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.
When a read call comes in for a portion of the page, the
contents of the page must be read in from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in from the server.
And, since the writing and reading cannot be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes. This means either a
FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.
This algorithm could be described as modify-write-read. It
is most efficient when the application only updates pages
and does not read them.
My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path. The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.
If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data. If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.
This would optimize for applications which randomly access
and update portions of files. The linkage editor for the
C compiler is an example of such an application.
I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel. The kernel without the
patch generated about 269,500 WRITE requests. The modified
kernel containing the patch generated about 261,000 WRITE
requests. Thus, about 8,500 fewer WRITE requests were
generated. I suspect that many of the eliminated
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.
The previous version of this patch caused the NFS client to
generate around 3,000 more READ requests. This version
actually causes the NFS client to generate almost 500 fewer
READ requests.
Thanx...
ps
Signed-off-by: Peter Staubach <[email protected]>
Trond Myklebust wrote:
>
> It might also be nice to put the above test in a little inlined helper
> function (called nfs_want_read_modify_write() ?).
>
>
Good suggestion.
> So, a number of questions spring to mind:
>
> 1. What if we're extending the file? We might not need to read the
> page at all in that case (see nfs_write_end()).
>
Yup.
> 2. What if the page is already dirty or is carrying an uncommitted
> unstable write?
>
Yup.
> 3. We might want to try to avoid looping more than once here. If
> the kernel is very low on memory, we might just want to write
> out the data rather than read the page and risk having the VM
> eject it before we can dirty it.
>
Yup.
> 4. Should we be starting an async readahead on the next page?
> Single page sized reads can be a nuisance too, if you are
> writing huge amounts of data.
This one is tough. It sounds good, but seems difficult to implement.
I think that this could be viewed as an optimization.
ps
On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote:
> Signed-off-by: Peter Staubach <[email protected]>
Please could you send such patches inline, rather than as
attachments? Attachments make it harder to comment on the patch contents...
> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
> + loff_t pos, unsigned len)
> +{
> + unsigned int pglen = nfs_page_length(page);
> + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
> + unsigned int end = offset + len;
> +
> + if ((file->f_mode & FMODE_READ) && /* open for read? */
> + !PageUptodate(page) && /* Uptodate? */
> + !PageDirty(page) && /* Dirty already? */
> + !PagePrivate(page) && /* i/o request already? */
I don't think you need the PageDirty() test. These days we should be
guaranteed to always have PagePrivate() set whenever PageDirty() is
(although the converse is not true). Anything else would be a bug...
> + pglen && /* valid bytes of file? */
> + (end < pglen || offset)) /* replace all valid bytes? */
> + return 1;
> + return 0;
> +}
> +
Trond Myklebust wrote:
> On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote:
>
>
>> Signed-off-by: Peter Staubach <[email protected]>
>>
>
> Please could you send such patches inline, rather than as
> attachments? Attachments make it harder to comment on the patch contents...
>
>
I will investigate how to do this.
>> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
>> + loff_t pos, unsigned len)
>> +{
>> + unsigned int pglen = nfs_page_length(page);
>> + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
>> + unsigned int end = offset + len;
>> +
>> + if ((file->f_mode & FMODE_READ) && /* open for read? */
>> + !PageUptodate(page) && /* Uptodate? */
>> + !PageDirty(page) && /* Dirty already? */
>> + !PagePrivate(page) && /* i/o request already? */
>>
>
> I don't think you need the PageDirty() test. These days we should be
> guaranteed to always have PagePrivate() set whenever PageDirty() is
> (although the converse is not true). Anything else would be a bug...
>
>
Okie doke. It seemed to me that this should be true, but it was
safer to leave both tests.
I will remove that PageDirty test, retest, and then send another
version of the patch. I will be out next week, so it will take a
couple of weeks.
Thanx...
ps
>> + pglen && /* valid bytes of file? */
>> + (end < pglen || offset)) /* replace all valid bytes? */
>> + return 1;
>> + return 0;
>> +}
>> +
>>
>
>
On Fri, Jul 10, 2009 at 11:57:02AM -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>> On Thu, 2009-07-09 at 10:12 -0400, Peter Staubach wrote:
>>
>>
>>> Signed-off-by: Peter Staubach <[email protected]>
>>>
>>
>> Please could you send such patches inline, rather than as
>> attachments? Attachments make it harder to comment on the patch contents...
>>
>>
>
> I will investigate how to do this.
See Documentation/email-clients.txt. (It has an entry for Thunderbird,
for example.)
--b.
>
>>> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
>>> + loff_t pos, unsigned len)
>>> +{
>>> + unsigned int pglen = nfs_page_length(page);
>>> + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
>>> + unsigned int end = offset + len;
>>> +
>>> + if ((file->f_mode & FMODE_READ) && /* open for read? */
>>> + !PageUptodate(page) && /* Uptodate? */
>>> + !PageDirty(page) && /* Dirty already? */
>>> + !PagePrivate(page) && /* i/o request already? */
>>>
>>
>> I don't think you need the PageDirty() test. These days we should be
>> guaranteed to always have PagePrivate() set whenever PageDirty() is
>> (although the converse is not true). Anything else would be a bug...
>>
>>
>
> Okie doke. It seemed to me that this should be true, but it was
> safer to leave both tests.
>
> I will remove that PageDirty test, retest, and then send another
> version of the patch. I will be out next week, so it will take a
> couple of weeks.
>
> Thanx...
>
> ps
>
>>> + pglen && /* valid bytes of file? */
>>> + (end < pglen || offset)) /* replace all valid bytes? */
>>> + return 1;
>>> + return 0;
>>> +}
>>> +
>>>
>>
>>
>
Hi.
I have a proposal for possibly resolving this issue.
I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.
The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page. The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.
When a read call comes in for a portion of the page, the
contents of the page must be read in from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in from the server.
And, since the writing and reading cannot be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes. This means either a
FILE_SYNC WRITE or an UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.
This algorithm could be described as modify-write-read. It
is most efficient when the application only updates pages
and does not read them.
My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path. The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.
If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data. If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.
This would optimize for applications which randomly access
and update portions of files. The linkage editor for the
C compiler is an example of such an application.
I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel. The kernel without the
patch generated about 269,500 WRITE requests. The modified
kernel containing the patch generated about 261,000 WRITE
requests. Thus, about 8,500 fewer WRITE requests were
generated. I suspect that many of the eliminated
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.
The difference between this patch and the previous one was
to remove the unneeded PageDirty() test. I then retested to
ensure that the resulting system continued to behave as
desired.
Thanx...
ps
Signed-off-by: Peter Staubach <[email protected]>
--- linux-2.6.30.i686/fs/nfs/file.c.org
+++ linux-2.6.30.i686/fs/nfs/file.c
@@ -328,6 +328,42 @@ nfs_file_fsync(struct file *file, struct
}
/*
+ * Decide whether a read/modify/write cycle may be more efficient
+ * than a modify/write/read cycle when writing to a page in the
+ * page cache.
+ *
+ * The modify/write/read cycle may occur if a page is read before
+ * being completely filled by the writer. In this situation, the
+ * page must be completely written to stable storage on the server
+ * before it can be refilled by reading in the page from the server.
+ * This can lead to expensive, small, FILE_SYNC mode writes being
+ * done.
+ *
+ * It may be more efficient to read the page first if the file is
+ * open for reading in addition to writing, the page is not marked
+ * as Uptodate, it is not dirty or waiting to be committed,
+ * indicating that it was previously allocated and then modified,
+ * that there were valid bytes of data in that range of the file,
+ * and that the new data won't completely replace the old data in
+ * that range of the file.
+ */
+static int nfs_want_read_modify_write(struct file *file, struct page *page,
+ loff_t pos, unsigned len)
+{
+ unsigned int pglen = nfs_page_length(page);
+ unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
+ unsigned int end = offset + len;
+
+ if ((file->f_mode & FMODE_READ) && /* open for read? */
+ !PageUptodate(page) && /* Uptodate? */
+ !PagePrivate(page) && /* i/o request already? */
+ pglen && /* valid bytes of file? */
+ (end < pglen || offset)) /* replace all valid bytes? */
+ return 1;
+ return 0;
+}
+
+/*
* This does the "real" work of the write. We must allocate and lock the
* page to be sent back to the generic routine, which then copies the
* data from user space.
@@ -340,15 +376,16 @@ static int nfs_write_begin(struct file *
struct page **pagep, void **fsdata)
{
int ret;
- pgoff_t index;
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
struct page *page;
- index = pos >> PAGE_CACHE_SHIFT;
+ int once_thru = 0;
dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
file->f_path.dentry->d_parent->d_name.name,
file->f_path.dentry->d_name.name,
mapping->host->i_ino, len, (long long) pos);
+start:
/*
* Prevent starvation issues if someone is doing a consistency
* sync-to-disk
@@ -367,6 +404,13 @@ static int nfs_write_begin(struct file *
if (ret) {
unlock_page(page);
page_cache_release(page);
+ } else if (!once_thru &&
+ nfs_want_read_modify_write(file, page, pos, len)) {
+ once_thru = 1;
+ ret = nfs_readpage(file, page);
+ page_cache_release(page);
+ if (!ret)
+ goto start;
}
return ret;
}
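As a user-space illustration of the access pattern this heuristic targets (a
linker-style partial-page update of a file opened for both reading and
writing), consider the sketch below. The file path is made up, and the
behaviour described in the comments assumes the page is not already cached on
the client:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char patch[100] = { 0 };
	char buf[4096];
	int fd = open("/mnt/nfs/libfoo.a", O_RDWR);	/* open for read and write */

	if (fd < 0)
		return 1;

	/* Partial-page update: offset 200, length 100. Neither page-aligned
	 * nor page-sized, so with the patch above the client reads the page
	 * from the server first and marks it Uptodate before copying in the
	 * new bytes. */
	pwrite(fd, patch, sizeof(patch), 200);

	/* A later read of the same page is then served from the page cache
	 * instead of forcing a small FILE_SYNC WRITE plus a READ round trip. */
	pread(fd, buf, sizeof(buf), 0);

	close(fd);
	return 0;
}

A page-aligned write that completely fills the page would skip the read,
since the new data would replace all of the old contents anyway.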
On Tue, 2009-08-04 at 13:52 -0400, Peter Staubach wrote:
> Signed-off-by: Peter Staubach <[email protected]>
>
> --- linux-2.6.30.i686/fs/nfs/file.c.org
> +++ linux-2.6.30.i686/fs/nfs/file.c
> @@ -328,6 +328,42 @@ nfs_file_fsync(struct file *file, struct
> }
>
> /*
> + * Decide whether a read/modify/write cycle may be more efficient
> + * than a modify/write/read cycle when writing to a page in the
> + * page cache.
> + *
> + * The modify/write/read cycle may occur if a page is read before
> + * being completely filled by the writer. In this situation, the
> + * page must be completely written to stable storage on the server
> + * before it can be refilled by reading in the page from the server.
> + * This can lead to expensive, small, FILE_SYNC mode writes being
> + * done.
> + *
> + * It may be more efficient to read the page first if the file is
> + * open for reading in addition to writing, the page is not marked
> + * as Uptodate, it is not dirty or waiting to be committed,
> + * indicating that it was previously allocated and then modified,
> + * that there were valid bytes of data in that range of the file,
> + * and that the new data won't completely replace the old data in
> + * that range of the file.
> + */
> +static int nfs_want_read_modify_write(struct file *file, struct page *page,
> + loff_t pos, unsigned len)
> +{
> + unsigned int pglen = nfs_page_length(page);
> + unsigned int offset = pos & (PAGE_CACHE_SIZE - 1);
> + unsigned int end = offset + len;
> +
> + if ((file->f_mode & FMODE_READ) && /* open for read? */
> + !PageUptodate(page) && /* Uptodate? */
> + !PagePrivate(page) && /* i/o request already? */
> + pglen && /* valid bytes of file? */
> + (end < pglen || offset)) /* replace all valid bytes? */
> + return 1;
> + return 0;
> +}
> +
> +/*
> * This does the "real" work of the write. We must allocate and lock the
> * page to be sent back to the generic routine, which then copies the
> * data from user space.
> @@ -340,15 +376,16 @@ static int nfs_write_begin(struct file *
> struct page **pagep, void **fsdata)
> {
> int ret;
> - pgoff_t index;
> + pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> struct page *page;
> - index = pos >> PAGE_CACHE_SHIFT;
> + int once_thru = 0;
>
> dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
> file->f_path.dentry->d_parent->d_name.name,
> file->f_path.dentry->d_name.name,
> mapping->host->i_ino, len, (long long) pos);
>
> +start:
> /*
> * Prevent starvation issues if someone is doing a consistency
> * sync-to-disk
> @@ -367,6 +404,13 @@ static int nfs_write_begin(struct file *
> if (ret) {
> unlock_page(page);
> page_cache_release(page);
> + } else if (!once_thru &&
> + nfs_want_read_modify_write(file, page, pos, len)) {
> + once_thru = 1;
> + ret = nfs_readpage(file, page);
> + page_cache_release(page);
> + if (!ret)
> + goto start;
> }
> return ret;
> }
>
Thanks! Applied...
Trond