We had a customer experience some performance issues after migrating their
MQSeries servers from AIX to Linux. Their performance benchmark basically puts
50000 messages in a message queue, and a tcpdump captured during these tests
would show a ton of very small writes that were sequential but not contiguous.
After doing some investigation with systemtap we determined that when we called
nfs_updatepage() we were not being allowed to extend the write because the
inode->i_flock was not NULL. So then later when we'd arrive at
nfs_try_to_update_request() we would always wind up calling nfs_wb_page().
I gave the customer a test kernel using a patch similar to the one that follows
and the test results were favorable, with far fewer writes, the majority of
which were utilizing the full wsize. For example, the top ten write sizes and
number of occurrences from a tcpdump captured while running the benchmark with
an unpatched kernel:
$ tshark -r before.pcap.gz -R "nfs.opcode==write && nfs.stateid4.hash==0xf09c" \
    -T fields -e nfs.write.data_length | sort | uniq -c | sort -nr | head
5852 512
5575 1024
2262 1035
2160 1121
1661 1023
1460 1074
1413 1073
1394 1152
1244 1055
933 1804
contrasted with a tcpdump captured while running the benchmark with the test
kernel:
$ tshark -r after.pcap.gz -R "nfs.opcode==write && nfs.stateid4.hash==0x9f87" \
    -T fields -e nfs.write.data_length | sort | uniq -c | sort -nr | head
917 65536
76 36864
69 20480
55 53248
32 18432
31 49152
31 4096
31 32768
30 16384
25 65536,4096
Scott Mayhew (1):
NFS: Allow nfs_updatepage to extend a write to cover a full page when
we have a lock that covers the entire file
fs/nfs/write.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--
1.7.11.7
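For context, the "lock that covers the entire file" case is what an
application gets when it takes a POSIX lock with l_start = 0 and l_len = 0.
As far as I know the kernel records such a lock with fl_start == 0 and
fl_end == OFFSET_MAX, which is the pair of values the patch tests for. A
minimal userspace sketch; the helper name lock_whole_file_for_write() is
made up for illustration and is not part of the patch:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Take a POSIX write lock over the whole file (illustrative helper,
 * not part of the patch).  l_start = 0 with l_len = 0 means "from the
 * start of the file to the end, however large the file grows".
 */
static int lock_whole_file_for_write(int fd)
{
	struct flock fl = {
		.l_type   = F_WRLCK,	/* exclusive (write) lock */
		.l_whence = SEEK_SET,	/* l_start is an absolute offset */
		.l_start  = 0,
		.l_len    = 0,		/* 0 = lock through end of file */
	};

	/* Non-blocking request: 0 on success, -1 with errno set on failure. */
	return fcntl(fd, F_SETLK, &fl);
}
```

Whether the benchmark in question locks its queue files exactly this way
is an assumption; the point is only to show which kind of lock request
yields the whole-file range.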
Currently nfs_updatepage allows a write to be extended to cover a full
page only if we don't have a byte range lock on the file... but if we've
got the whole file locked, then we should be allowed to extend the
write.
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/write.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a2c7c28..f35fb4f 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -908,13 +908,16 @@ int nfs_updatepage(struct file *file, struct page *page,
 		file->f_path.dentry->d_name.name, count,
 		(long long)(page_file_offset(page) + offset));
 
-	/* If we're not using byte range locks, and we know the page
+	/* If we're not using byte range locks (or if the range of the
+	 * lock covers the entire file), and we know the page
 	 * is up to date, it may be more efficient to extend the write
 	 * to cover the entire page in order to avoid fragmentation
 	 * inefficiencies.
 	 */
 	if (nfs_write_pageuptodate(page, inode) &&
-			inode->i_flock == NULL &&
+			(inode->i_flock == NULL ||
+			 (inode->i_flock->fl_start == 0 &&
+			  inode->i_flock->fl_end == OFFSET_MAX)) &&
 			!(file->f_flags & O_DSYNC)) {
 		count = max(count + offset, nfs_page_length(page));
 		offset = 0;
--
1.7.11.7
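To make the effect of the two lines under that if () concrete, here is a
small userspace model of the extension arithmetic. It is only a sketch:
struct write_range, extend_write() and the valid_len parameter are names
invented for this example, with valid_len standing in for whatever
nfs_page_length(page) returns (the number of bytes of the page that lie
inside the file).

```c
#include <assert.h>
#include <stddef.h>

/* A dirty byte range within one page (illustrative type). */
struct write_range {
	unsigned int offset;	/* start of the dirty bytes within the page */
	unsigned int count;	/* number of dirty bytes */
};

/*
 * Model of:
 *	count = max(count + offset, nfs_page_length(page));
 *	offset = 0;
 * The range is widened to start at the beginning of the page and to end
 * at whichever is further: the end of the new data or the end of the
 * valid data already in the page.
 */
static struct write_range extend_write(struct write_range w,
				       unsigned int valid_len)
{
	unsigned int end = w.offset + w.count;

	w.count = end > valid_len ? end : valid_len;
	w.offset = 0;
	return w;
}
```

With a 4096-byte page that is fully inside the file, a 1035-byte write at
offset 512 becomes a single 4096-byte range starting at offset 0, so
neighbouring small writes to the same page can be merged into one request
instead of being flushed separately.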
On Thu, 23 May 2013 17:53:41 -0400
Scott Mayhew <[email protected]> wrote:
> Currently nfs_updatepage allows a write to be extended to cover a full
> page only if we don't have a byte range lock on the file... but if we've
> got the whole file locked, then we should be allowed to extend the
> write.
>
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
> fs/nfs/write.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a2c7c28..f35fb4f 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -908,13 +908,16 @@ int nfs_updatepage(struct file *file, struct page *page,
> file->f_path.dentry->d_name.name, count,
> (long long)(page_file_offset(page) + offset));
>
> - /* If we're not using byte range locks, and we know the page
> + /* If we're not using byte range locks (or if the range of the
> + * lock covers the entire file), and we know the page
> * is up to date, it may be more efficient to extend the write
> * to cover the entire page in order to avoid fragmentation
> * inefficiencies.
> */
> if (nfs_write_pageuptodate(page, inode) &&
> - inode->i_flock == NULL &&
> + (inode->i_flock == NULL ||
> + (inode->i_flock->fl_start == 0 &&
> + inode->i_flock->fl_end == OFFSET_MAX)) &&
> !(file->f_flags & O_DSYNC)) {
> count = max(count + offset, nfs_page_length(page));
> offset = 0;
Sounds like a reasonable proposition, but I think you might need to do
more vetting of the locks...
For instance, does it make sense to do this if it's a F_RDLCK? Also,
you're only looking at the first lock in the i_flock list. Might it
make more sense to walk the list and see whether the page might be
entirely covered by a lock that doesn't extend over the whole file?
--
Jeff Layton <[email protected]>
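The list walk suggested above could look roughly like the following
userspace sketch. It is not kernel code: struct range_lock and
page_covered_by_write_lock() are hypothetical stand-ins for struct
file_lock and the i_flock list, lock ownership is not modelled, and only
the case where one single write lock spans the whole page is handled
(several adjacent locks that jointly cover the page are not merged).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum lock_type { LOCK_READ, LOCK_WRITE };

/* Simplified stand-in for one entry on the inode's lock list. */
struct range_lock {
	long long start;		/* first locked byte */
	long long end;			/* last locked byte, inclusive */
	enum lock_type type;
	struct range_lock *next;
};

/*
 * Return true if some single write lock on the list covers every byte
 * of the page that starts at page_start and is page_len bytes long.
 * Read locks are skipped, on the assumption that a write under a read
 * lock should not be extended.
 */
static bool page_covered_by_write_lock(const struct range_lock *head,
				       long long page_start,
				       long long page_len)
{
	long long page_end;

	assert(page_len > 0);
	page_end = page_start + page_len - 1;

	for (const struct range_lock *l = head; l != NULL; l = l->next) {
		if (l->type == LOCK_WRITE &&
		    l->start <= page_start && l->end >= page_end)
			return true;
	}
	return false;
}
```

The cost is linear in the number of locks for every page update, which is
the trade-off to weigh against checking only the first lock.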
On Thu, 23 May 2013 22:30:10 +0000
"Myklebust, Trond" <[email protected]> wrote:
> On Thu, 2013-05-23 at 18:24 -0400, Jeff Layton wrote:
> > On Thu, 23 May 2013 17:53:41 -0400
> > Scott Mayhew <[email protected]> wrote:
> >
> > > Currently nfs_updatepage allows a write to be extended to cover a full
> > > page only if we don't have a byte range lock on the file... but if we've
> > > got the whole file locked, then we should be allowed to extend the
> > > write.
> > >
> > > Signed-off-by: Scott Mayhew <[email protected]>
> > > ---
> > > fs/nfs/write.c | 7 +++++--
> > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index a2c7c28..f35fb4f 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -908,13 +908,16 @@ int nfs_updatepage(struct file *file, struct page *page,
> > > file->f_path.dentry->d_name.name, count,
> > > (long long)(page_file_offset(page) + offset));
> > >
> > > - /* If we're not using byte range locks, and we know the page
> > > + /* If we're not using byte range locks (or if the range of the
> > > + * lock covers the entire file), and we know the page
> > > * is up to date, it may be more efficient to extend the write
> > > * to cover the entire page in order to avoid fragmentation
> > > * inefficiencies.
> > > */
> > > if (nfs_write_pageuptodate(page, inode) &&
> > > - inode->i_flock == NULL &&
> > > + (inode->i_flock == NULL ||
> > > + (inode->i_flock->fl_start == 0 &&
> > > + inode->i_flock->fl_end == OFFSET_MAX)) &&
> > > !(file->f_flags & O_DSYNC)) {
> > > count = max(count + offset, nfs_page_length(page));
> > > offset = 0;
> >
> > Sounds like a reasonable proposition, but I think you might need to do
> > more vetting of the locks...
> >
> > For instance, does it make sense to do this if it's a F_RDLCK? Also,
> > you're only looking at the first lock in the i_flock list. Might it
> > make more sense to walk the list and see whether the page might be
> > entirely covered by a lock that doesn't extend over the whole file?
> >
>
> I'm guessing that the answer to both these questions is "no":
> - Anybody who is writing while holding a F_RDLCK is likely doing
> something wrong.
Right, so I think we ought to be conservative here and not extend the
write if this is an F_RDLCK.
> - Walking the lock list on every write can quickly get painful if we
> have lots of small locks.
>
True, but it's probably still preferable to do that than to do a bunch
of small I/Os to the server. But, that's an optimization that can be
done later. Hardly anyone does real byte-range locking so I'm fine with
this approach for now.
> However it may make a lot of sense to look at whether or not we hold a
> NFSv4 write delegation.
>
Yes, that would be a good thing too. Having a helper function like you
suggested should make it easier to encapsulate that logic sanely.
--
Jeff Layton <[email protected]>
Hi Scott,
On Thu, 2013-05-23 at 17:53 -0400, Scott Mayhew wrote:
> Currently nfs_updatepage allows a write to be extended to cover a full
> page only if we don't have a byte range lock on the file... but if we've
> got the whole file locked, then we should be allowed to extend the
> write.
>
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
> fs/nfs/write.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a2c7c28..f35fb4f 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -908,13 +908,16 @@ int nfs_updatepage(struct file *file, struct page *page,
> file->f_path.dentry->d_name.name, count,
> (long long)(page_file_offset(page) + offset));
>
> - /* If we're not using byte range locks, and we know the page
> + /* If we're not using byte range locks (or if the range of the
> + * lock covers the entire file), and we know the page
> * is up to date, it may be more efficient to extend the write
> * to cover the entire page in order to avoid fragmentation
> * inefficiencies.
> */
> if (nfs_write_pageuptodate(page, inode) &&
> - inode->i_flock == NULL &&
> + (inode->i_flock == NULL ||
> + (inode->i_flock->fl_start == 0 &&
> + inode->i_flock->fl_end == OFFSET_MAX)) &&
> !(file->f_flags & O_DSYNC)) {
Can we put this condition into a helper function? I started with the
"nfs_write_pageuptodate()" thingy, but now we're starting to add in
extra complications...
Thanks!
Trond
> count = max(count + offset, nfs_page_length(page));
> offset = 0;
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Thu, 2013-05-23 at 18:24 -0400, Jeff Layton wrote:
> On Thu, 23 May 2013 17:53:41 -0400
> Scott Mayhew <[email protected]> wrote:
>
> > Currently nfs_updatepage allows a write to be extended to cover a full
> > page only if we don't have a byte range lock on the file... but if we've
> > got the whole file locked, then we should be allowed to extend the
> > write.
> >
> > Signed-off-by: Scott Mayhew <[email protected]>
> > ---
> > fs/nfs/write.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index a2c7c28..f35fb4f 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -908,13 +908,16 @@ int nfs_updatepage(struct file *file, struct page *page,
> > file->f_path.dentry->d_name.name, count,
> > (long long)(page_file_offset(page) + offset));
> >
> > - /* If we're not using byte range locks, and we know the page
> > + /* If we're not using byte range locks (or if the range of the
> > + * lock covers the entire file), and we know the page
> > * is up to date, it may be more efficient to extend the write
> > * to cover the entire page in order to avoid fragmentation
> > * inefficiencies.
> > */
> > if (nfs_write_pageuptodate(page, inode) &&
> > - inode->i_flock == NULL &&
> > + (inode->i_flock == NULL ||
> > + (inode->i_flock->fl_start == 0 &&
> > + inode->i_flock->fl_end == OFFSET_MAX)) &&
> > !(file->f_flags & O_DSYNC)) {
> > count = max(count + offset, nfs_page_length(page));
> > offset = 0;
>
> Sounds like a reasonable proposition, but I think you might need to do
> more vetting of the locks...
>
> For instance, does it make sense to do this if it's a F_RDLCK? Also,
> you're only looking at the first lock in the i_flock list. Might it
> make more sense to walk the list and see whether the page might be
> entirely covered by a lock that doesn't extend over the whole file?
>
I'm guessing that the answer to both these questions is "no":
- Anybody who is writing while holding a F_RDLCK is likely doing
something wrong.
- Walking the lock list on every write can quickly get painful if we
have lots of small locks.
However it may make a lot of sense to look at whether or not we hold a
NFSv4 write delegation.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
From 3938f17ef84f5c4889fd7f827109f89c932df569 Mon Sep 17 00:00:00 2001
From: Scott Mayhew <[email protected]>
Date: Wed, 22 May 2013 17:03:17 -0400
Subject: [PATCH RFC] NFS: Allow nfs_updatepage to extend a write under
additional circumstances
Currently nfs_updatepage allows a write to be extended to cover a full
page only if we don't have a byte range lock on the file... but if
we have a write delegation on the file or if we have the whole file
locked for writing then we should be allowed to extend the write as
well.
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/write.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a2c7c28..c8a1bcc 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -888,6 +888,28 @@ out:
 	return PageUptodate(page) != 0;
 }
 
+/* If we know the page is up to date, and we're not using byte range locks (or
+ * if we have the whole file locked for writing), it may be more efficient to
+ * extend the write to cover the entire page in order to avoid fragmentation
+ * inefficiencies.
+ *
+ * If the file is opened for synchronous writes or if we have a write delegation
+ * from the server then we can just skip the rest of the checks.
+ */
+static int nfs_can_extend_write(struct file *file, struct page *page, struct inode *inode)
+{
+	if (file->f_flags & O_DSYNC)
+		return 0;
+	if (nfs_have_delegation(inode, FMODE_WRITE))
+		return 1;
+	if (nfs_write_pageuptodate(page, inode) && (inode->i_flock == NULL ||
+			(inode->i_flock->fl_start == 0 &&
+			 inode->i_flock->fl_end == OFFSET_MAX &&
+			 inode->i_flock->fl_type != F_RDLCK)))
+		return 1;
+	return 0;
+}
+
 /*
  * Update and possibly write a cached page of an NFS file.
  *
@@ -908,14 +930,7 @@ int nfs_updatepage(struct file *file, struct page *page,
 		file->f_path.dentry->d_name.name, count,
 		(long long)(page_file_offset(page) + offset));
 
-	/* If we're not using byte range locks, and we know the page
-	 * is up to date, it may be more efficient to extend the write
-	 * to cover the entire page in order to avoid fragmentation
-	 * inefficiencies.
-	 */
-	if (nfs_write_pageuptodate(page, inode) &&
-			inode->i_flock == NULL &&
-			!(file->f_flags & O_DSYNC)) {
+	if (nfs_can_extend_write(file, page, inode)) {
 		count = max(count + offset, nfs_page_length(page));
 		offset = 0;
 	}
--
1.7.11.7
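For anyone who wants to poke at the decision order outside the kernel,
the helper's logic can be modelled in userspace along these lines. This
is a sketch only: every model_* name is invented for the example, the
booleans stand in for the O_DSYNC flag, nfs_have_delegation() and
nfs_write_pageuptodate(), and first_lock stands in for inode->i_flock
(only the first lock is examined, as in the patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_OFFSET_MAX 0x7fffffffffffffffLL	/* stand-in for OFFSET_MAX */

enum model_lock_type { MODEL_RDLCK, MODEL_WRLCK };

struct model_lock {
	long long start;
	long long end;
	enum model_lock_type type;
};

struct model_state {
	bool o_dsync;			/* file opened with O_DSYNC */
	bool write_delegation;		/* client holds a write delegation */
	bool page_uptodate;		/* cached page is known to be up to date */
	const struct model_lock *first_lock;	/* NULL if no locks held */
};

/* Mirrors the order of checks in the patch's nfs_can_extend_write(). */
static bool model_can_extend_write(const struct model_state *s)
{
	const struct model_lock *fl;

	assert(s != NULL);
	fl = s->first_lock;

	if (s->o_dsync)
		return false;		/* synchronous writes: never extend */
	if (s->write_delegation)
		return true;		/* delegation: skip remaining checks */
	if (!s->page_uptodate)
		return false;
	if (fl == NULL)
		return true;		/* no byte range locks at all */
	/* Otherwise only a whole-file lock that is not a read lock counts. */
	return fl->start == 0 && fl->end == MODEL_OFFSET_MAX &&
	       fl->type != MODEL_RDLCK;
}
```

Note that in this ordering a write delegation allows the extension
without consulting the page-up-to-date check, which matches the patch as
posted.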
On Tue, 4 Jun 2013 09:21:49 -0400
Scott Mayhew <[email protected]> wrote:
> From: Scott Mayhew <[email protected]>
> To: Jeff Layton <[email protected]>
> Cc: "Myklebust, Trond" <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: [PATCH RFC 1/1] NFS: Allow nfs_updatepage to extend a write to cover a full page when we have a lock that covers the entire file
> Date: Tue, 4 Jun 2013 09:21:49 -0400
> Sender: [email protected]
> User-Agent: Mutt/1.5.20 (2009-06-14)
>
> On Fri, 24 May 2013, Jeff Layton wrote:
>
> > On Thu, 23 May 2013 22:30:10 +0000
> > "Myklebust, Trond" <[email protected]> wrote:
> >
> > > On Thu, 2013-05-23 at 18:24 -0400, Jeff Layton wrote:
> > > > On Thu, 23 May 2013 17:53:41 -0400
> > > > Scott Mayhew <[email protected]> wrote:
> > > >
> > > > > Currently nfs_updatepage allows a write to be extended to cover a full
> > > > > page only if we don't have a byte range lock on the file... but if we've
> > > > > got the whole file locked, then we should be allowed to extend the
> > > > > write.
> > > > >
> > > > > Signed-off-by: Scott Mayhew <[email protected]>
> > > > > ---
> > > > > fs/nfs/write.c | 7 +++++--
> > > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index a2c7c28..f35fb4f 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -908,13 +908,16 @@ int nfs_updatepage(struct file *file, struct page *page,
> > > > > file->f_path.dentry->d_name.name, count,
> > > > > (long long)(page_file_offset(page) + offset));
> > > > >
> > > > > - /* If we're not using byte range locks, and we know the page
> > > > > + /* If we're not using byte range locks (or if the range of the
> > > > > + * lock covers the entire file), and we know the page
> > > > > * is up to date, it may be more efficient to extend the write
> > > > > * to cover the entire page in order to avoid fragmentation
> > > > > * inefficiencies.
> > > > > */
> > > > > if (nfs_write_pageuptodate(page, inode) &&
> > > > > - inode->i_flock == NULL &&
> > > > > + (inode->i_flock == NULL ||
> > > > > + (inode->i_flock->fl_start == 0 &&
> > > > > + inode->i_flock->fl_end == OFFSET_MAX)) &&
> > > > > !(file->f_flags & O_DSYNC)) {
> > > > > count = max(count + offset, nfs_page_length(page));
> > > > > offset = 0;
> > > >
> > > > Sounds like a reasonable proposition, but I think you might need to do
> > > > more vetting of the locks...
> > > >
> > > > For instance, does it make sense to do this if it's a F_RDLCK? Also,
> > > > you're only looking at the first lock in the i_flock list. Might it
> > > > make more sense to walk the list and see whether the page might be
> > > > entirely covered by a lock that doesn't extend over the whole file?
> > > >
> > >
> > > I'm guessing that the answer to both these questions is "no":
> > > - Anybody who is writing while holding a F_RDLCK is likely doing
> > > something wrong.
> >
> > Right, so I think we ought to be conservative here and not extend the
> > write if this is an F_RDLCK.
> >
> > > - Walking the lock list on every write can quickly get painful if we
> > > have lots of small locks.
> > >
> >
> > True, but it's probably still preferable to do that than to do a bunch
> > of small I/Os to the server. But, that's an optimization that can be
> > done later. Hardly anyone does real byte-range locking so I'm fine with
> > this approach for now.
> >
> > > However it may make a lot of sense to look at whether or not we hold a
> > > NFSv4 write delegation.
> > >
> >
> > Yes, that would be a good thing too. Having a helper function like you
> > suggested should make it easier to encapsulate that logic sanely.
> >
> Here's an updated patch that moves the logic to a helper function,
> checks to see if we have a write delegation, and checks the lock type.
>
> -Scott
>
> From 3938f17ef84f5c4889fd7f827109f89c932df569 Mon Sep 17 00:00:00 2001
> From: Scott Mayhew <[email protected]>
> Date: Wed, 22 May 2013 17:03:17 -0400
> Subject: [PATCH RFC] NFS: Allow nfs_updatepage to extend a write under
> additional circumstances
>
> Currently nfs_updatepage allows a write to be extended to cover a full
> page only if we don't have a byte range lock on the file... but if
> we have a write delegation on the file or if we have the whole file
> locked for writing then we should be allowed to extend the write as
> well.
>
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
> fs/nfs/write.c | 31 +++++++++++++++++++++++--------
> 1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a2c7c28..c8a1bcc 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -888,6 +888,28 @@ out:
> return PageUptodate(page) != 0;
> }
>
> +/* If we know the page is up to date, and we're not using byte range locks (or
> + * if we have the whole file locked for writing), it may be more efficient to
> + * extend the write to cover the entire page in order to avoid fragmentation
> + * inefficiencies.
> + *
> + * If the file is opened for synchronous writes or if we have a write delegation
> + * from the server then we can just skip the rest of the checks.
> + */
> +static int nfs_can_extend_write(struct file *file, struct page *page, struct inode *inode)
> +{
> + if (file->f_flags & O_DSYNC)
> + return 0;
> + if (nfs_have_delegation(inode, FMODE_WRITE))
> + return 1;
> + if (nfs_write_pageuptodate(page, inode) && (inode->i_flock == NULL ||
> + (inode->i_flock->fl_start == 0 &&
> + inode->i_flock->fl_end == OFFSET_MAX &&
> + inode->i_flock->fl_type != F_RDLCK)))
> + return 1;
> + return 0;
> +}
> +
> /*
> * Update and possibly write a cached page of an NFS file.
> *
> @@ -908,14 +930,7 @@ int nfs_updatepage(struct file *file, struct page *page,
> file->f_path.dentry->d_name.name, count,
> (long long)(page_file_offset(page) + offset));
>
> - /* If we're not using byte range locks, and we know the page
> - * is up to date, it may be more efficient to extend the write
> - * to cover the entire page in order to avoid fragmentation
> - * inefficiencies.
> - */
> - if (nfs_write_pageuptodate(page, inode) &&
> - inode->i_flock == NULL &&
> - !(file->f_flags & O_DSYNC)) {
> + if (nfs_can_extend_write(file, page, inode)) {
> count = max(count + offset, nfs_page_length(page));
> offset = 0;
> }
Sorry I didn't chime in on this before. Looks sane to me...
Reviewed-by: Jeff Layton <[email protected]>
On Tue, 4 Jun 2013 09:21:49 -0400
Scott Mayhew <[email protected]> wrote:
>
> Currently nfs_updatepage allows a write to be extended to cover a full
> page only if we don't have a byte range lock on the file... but if
> we have a write delegation on the file or if we have the whole file
> locked for writing then we should be allowed to extend the write as
> well.
>
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
> fs/nfs/write.c | 31 +++++++++++++++++++++++--------
> 1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index a2c7c28..c8a1bcc 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -888,6 +888,28 @@ out:
> return PageUptodate(page) != 0;
> }
>
> +/* If we know the page is up to date, and we're not using byte range locks (or
> + * if we have the whole file locked for writing), it may be more efficient to
> + * extend the write to cover the entire page in order to avoid fragmentation
> + * inefficiencies.
> + *
> + * If the file is opened for synchronous writes or if we have a write delegation
> + * from the server then we can just skip the rest of the checks.
> + */
> +static int nfs_can_extend_write(struct file *file, struct page *page, struct inode *inode)
> +{
> + if (file->f_flags & O_DSYNC)
> + return 0;
> + if (nfs_have_delegation(inode, FMODE_WRITE))
> + return 1;
> + if (nfs_write_pageuptodate(page, inode) && (inode->i_flock == NULL ||
> + (inode->i_flock->fl_start == 0 &&
> + inode->i_flock->fl_end == OFFSET_MAX &&
> + inode->i_flock->fl_type != F_RDLCK)))
> + return 1;
> + return 0;
> +}
> +
> /*
> * Update and possibly write a cached page of an NFS file.
> *
> @@ -908,14 +930,7 @@ int nfs_updatepage(struct file *file, struct page *page,
> file->f_path.dentry->d_name.name, count,
> (long long)(page_file_offset(page) + offset));
>
> - /* If we're not using byte range locks, and we know the page
> - * is up to date, it may be more efficient to extend the write
> - * to cover the entire page in order to avoid fragmentation
> - * inefficiencies.
> - */
> - if (nfs_write_pageuptodate(page, inode) &&
> - inode->i_flock == NULL &&
> - !(file->f_flags & O_DSYNC)) {
> + if (nfs_can_extend_write(file, page, inode)) {
> count = max(count + offset, nfs_page_length(page));
> offset = 0;
> }
Looks reasonable to me...
Acked-by: Jeff Layton <[email protected]>