2017-07-26 17:55:43

by Jeffrey Layton

Subject: [PATCH v2 0/4] mm/gfs2: extend file_* API, and convert gfs2 to errseq_t error reporting

From: Jeff Layton <[email protected]>

I sent a small patch earlier this week to make sync_file_range use
errseq_t reporting.

This set respins that patch into a patch that adds a bit more file_*
infrastructure, and then patches to make sync_file_range and fsync
on gfs2 report writeback errors properly.

There's also a small cleanup patch for mm/filemap.c to consolidate
the DAX handling checks in the existing infrastructure.

Jeff Layton (4):
mm: consolidate dax / non-dax checks for writeback
mm: add file_fdatawait_range and file_write_and_wait
fs: convert sync_file_range to use errseq_t based error-tracking
gfs2: convert to errseq_t based writeback error reporting for fsync

fs/gfs2/file.c | 6 +++--
fs/sync.c | 4 +--
include/linux/fs.h | 7 +++++-
mm/filemap.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 77 insertions(+), 11 deletions(-)

--
2.13.3
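
For readers unfamiliar with the errseq_t style of reporting that this series converts to, here is a minimal userspace sketch of the idea: the mapping records writeback errors, each open file keeps its own cursor, and an error is handed to a given open file once. Everything named *_model below is hypothetical and much simpler than the kernel's real errseq_t encoding; in particular, the sketch assumes that errors recorded before a file was opened are not reported to it.

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-in for errseq_t. This is NOT the kernel's encoding. */
struct wb_err_model {
	int err;           /* most recent writeback error (negative errno), 0 if none */
	unsigned int seq;  /* bumped each time a new error is recorded */
};

struct mapping_model {
	struct wb_err_model wb_err;
};

struct file_model {
	struct mapping_model *mapping;
	unsigned int f_wb_err_seq;  /* cursor: the sequence this open file last saw */
};

/* Record a writeback failure against the mapping. */
static void model_set_wb_err(struct mapping_model *m, int err)
{
	assert(err < 0);
	m->wb_err.err = err;
	m->wb_err.seq++;
}

/* Open: sample the current sequence (model assumption, see the lead-in). */
static void model_open(struct file_model *f, struct mapping_model *m)
{
	f->mapping = m;
	f->f_wb_err_seq = m->wb_err.seq;
}

/* Report an error at most once per open file, then advance its cursor. */
static int model_check_and_advance(struct file_model *f)
{
	struct wb_err_model *cur = &f->mapping->wb_err;

	if (f->f_wb_err_seq == cur->seq)
		return 0;
	f->f_wb_err_seq = cur->seq;
	return cur->err;
}
```

In this model, two files opened on the same mapping each see a later -EIO exactly once, which is the property the file_* helpers in this series rely on.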


2017-07-26 17:55:49

by Jeffrey Layton

Subject: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

From: Jeff Layton <[email protected]>

This means that we need to export the new file_fdatawait_range symbol.

Also, fix a place where a writeback error might get dropped in the
gfs2_is_jdata case.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/gfs2/file.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index c2062a108d19..c53ac6efd04c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
 		if (ret)
 			return ret;
 		if (gfs2_is_jdata(ip))
-			filemap_write_and_wait(mapping);
+			ret = file_write_and_wait(file);
+		if (ret)
+			return ret;
 		gfs2_ail_flush(ip->i_gl, 1);
 	}

 	if (mapping->nrpages)
-		ret = filemap_fdatawait_range(mapping, start, end);
+		ret = file_fdatawait_range(file, start, end);

 	return ret ? ret : ret1;
 }
--
2.13.3

2017-07-26 17:55:47

by Jeffrey Layton

Subject: [PATCH v2 1/4] mm: consolidate dax / non-dax checks for writeback

From: Jeff Layton <[email protected]>

We have this complex conditional copied to several places. Turn it into
a helper function.

Signed-off-by: Jeff Layton <[email protected]>
---
mm/filemap.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e1cca770688f..72e46e6f0d9a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -522,12 +522,17 @@ int filemap_fdatawait(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_fdatawait);

+static bool mapping_needs_writeback(struct address_space *mapping)
+{
+	return (!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional);
+}
+
 int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;

-	if ((!dax_mapping(mapping) && mapping->nrpages) ||
-	    (dax_mapping(mapping) && mapping->nrexceptional)) {
+	if (mapping_needs_writeback(mapping)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -566,8 +571,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;

-	if ((!dax_mapping(mapping) && mapping->nrpages) ||
-	    (dax_mapping(mapping) && mapping->nrexceptional)) {
+	if (mapping_needs_writeback(mapping)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
@@ -656,8 +660,7 @@ int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
 	int err = 0, err2;
 	struct address_space *mapping = file->f_mapping;

-	if ((!dax_mapping(mapping) && mapping->nrpages) ||
-	    (dax_mapping(mapping) && mapping->nrexceptional)) {
+	if (mapping_needs_writeback(mapping)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
--
2.13.3
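
As a reading aid for the helper above, the following userspace sketch spells out what the conditional computes: a non-DAX mapping needs writeback when it has pages, and a DAX mapping needs it when it has exceptional entries. The struct and function names are hypothetical stand-ins, and the ternary form is just one equivalent way to read the expression, not something proposed in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in holding only the fields the check reads. */
struct mapping_model {
	bool is_dax;
	unsigned long nrpages;
	unsigned long nrexceptional;
};

/* Same boolean expression as the helper in the patch. */
static bool model_needs_writeback(const struct mapping_model *m)
{
	return (!m->is_dax && m->nrpages) ||
	       (m->is_dax && m->nrexceptional);
}

/* Equivalent reading: pick the counter that matters for this mapping type. */
static bool model_needs_writeback_ternary(const struct mapping_model *m)
{
	return m->is_dax ? m->nrexceptional != 0 : m->nrpages != 0;
}

/* Walk the small truth table and confirm that both forms agree. */
static void model_selfcheck(void)
{
	for (int dax = 0; dax <= 1; dax++)
		for (unsigned long pages = 0; pages <= 1; pages++)
			for (unsigned long exc = 0; exc <= 1; exc++) {
				struct mapping_model m = { dax != 0, pages, exc };

				assert(model_needs_writeback(&m) ==
				       model_needs_writeback_ternary(&m));
			}
}
```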

2017-07-26 17:56:08

by Jeffrey Layton

Subject: [PATCH v2 3/4] fs: convert sync_file_range to use errseq_t based error-tracking

From: Jeff Layton <[email protected]>

sync_file_range doesn't call down into the filesystem directly at all.
It only kicks off writeback of pagecache pages and optionally waits
on the result.

Convert sync_file_range to use errseq_t based error tracking, under the
assumption that most users will prefer this behavior when errors occur.

Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/sync.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 2a54c1f22035..27d6b8bbcb6a 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -342,7 +342,7 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,

 	ret = 0;
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
-		ret = filemap_fdatawait_range(mapping, offset, endbyte);
+		ret = file_fdatawait_range(f.file, offset, endbyte);
 		if (ret < 0)
 			goto out_put;
 	}
@@ -355,7 +355,7 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
 	}

 	if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
-		ret = filemap_fdatawait_range(mapping, offset, endbyte);
+		ret = file_fdatawait_range(f.file, offset, endbyte);

 out_put:
 	fdput(f);
--
2.13.3

2017-07-26 17:56:29

by Jeffrey Layton

Subject: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

From: Jeff Layton <[email protected]>

Some filesystem fsync routines will need these.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/fs.h | 7 ++++++-
mm/filemap.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21e7df1ad613..bc57a79294f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2544,6 +2544,8 @@ extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
 extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
 				  loff_t lend);
+extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
+						loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 					loff_t lstart, loff_t lend);
@@ -2552,11 +2554,14 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 extern int filemap_check_errors(struct address_space *mapping);
-
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
+
+extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
+						loff_t lend);
 extern int __must_check file_check_and_advance_wb_err(struct file *file);
 extern int __must_check file_write_and_wait_range(struct file *file,
 						loff_t start, loff_t end);
+extern int __must_check file_write_and_wait(struct file *file);

/**
* filemap_set_wb_err - set a writeback error on an address_space
diff --git a/mm/filemap.c b/mm/filemap.c
index 72e46e6f0d9a..b904a8dfa43d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -476,6 +476,29 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 EXPORT_SYMBOL(filemap_fdatawait_range);

 /**
+ * file_fdatawait_range - wait for writeback to complete
+ * @file: file pointing to address space structure to wait for
+ * @start_byte: offset in bytes where the range starts
+ * @end_byte: offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the address space that file
+ * refers to, in the given range and wait for all of them. Check error
+ * status of the address space vs. the file->f_wb_err cursor and return it.
+ *
+ * Since the error status of the file is advanced by this function,
+ * callers are responsible for checking the return value and handling and/or
+ * reporting the error.
+ */
+int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	__filemap_fdatawait_range(mapping, start_byte, end_byte);
+	return file_check_and_advance_wb_err(file);
+}
+EXPORT_SYMBOL(file_fdatawait_range);
+
+/**
  * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
  * @mapping: address space structure to wait for
  *
@@ -675,6 +698,39 @@ int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
 EXPORT_SYMBOL(file_write_and_wait_range);

 /**
+ * file_write_and_wait - write out whole file and wait on it and return any
+ * 			 writeback errors since we last checked
+ * @file: file to write back and wait on
+ *
+ * Write back the whole file and wait on its mapping. Afterward, check for
+ * errors that may have occurred since our file->f_wb_err cursor was last
+ * updated.
+ */
+int file_write_and_wait(struct file *file)
+{
+	int err = 0, err2;
+	struct address_space *mapping = file->f_mapping;
+
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
+		err = filemap_fdatawrite(mapping);
+		/* See comment of filemap_write_and_wait() */
+		if (err != -EIO) {
+			loff_t i_size = i_size_read(mapping->host);
+
+			if (i_size != 0)
+				__filemap_fdatawait_range(mapping, 0,
+							  i_size - 1);
+		}
+	}
+	err2 = file_check_and_advance_wb_err(file);
+	if (!err)
+		err = err2;
+	return err;
+}
+EXPORT_SYMBOL(file_write_and_wait);
+
+/**
  * replace_page_cache_page - replace a pagecache page with a new one
  * @old: page to be replaced
  * @new: page to replace with
--
2.13.3

2017-07-26 19:13:12

by Matthew Wilcox

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Wed, Jul 26, 2017 at 01:55:36PM -0400, Jeff Layton wrote:
> +int file_write_and_wait(struct file *file)
> +{
> + int err = 0, err2;
> + struct address_space *mapping = file->f_mapping;
> +
> + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> + (dax_mapping(mapping) && mapping->nrexceptional)) {

Since patch 1 exists, shouldn't this use the new helper?

> + err = filemap_fdatawrite(mapping);
> + /* See comment of filemap_write_and_wait() */
> + if (err != -EIO) {
> + loff_t i_size = i_size_read(mapping->host);
> +
> + if (i_size != 0)
> + __filemap_fdatawait_range(mapping, 0,
> + i_size - 1);
> + }
> + }
> + err2 = file_check_and_advance_wb_err(file);
> + if (!err)
> + err = err2;
> + return err;

Would this be clearer written as:

if (err)
return err;
return err2;

or even ...

return err ? err : err2;
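
To make the comparison concrete, here is a small userspace sketch of the three spellings discussed above. The function names are hypothetical; the point is only that all three return the first non-zero error, so the choice between them is a matter of style rather than behavior.

```c
#include <assert.h>
#include <errno.h>

/* Form used in the patch: keep err unless it is zero, then take err2. */
static int combine_assign(int err, int err2)
{
	if (!err)
		err = err2;
	return err;
}

/* First suggested alternative: early return on the first error. */
static int combine_early_return(int err, int err2)
{
	if (err)
		return err;
	return err2;
}

/* Second suggested alternative: the ternary. */
static int combine_ternary(int err, int err2)
{
	return err ? err : err2;
}

/* Check that the three forms agree for a given pair of error codes. */
static void combine_selfcheck(int err, int err2)
{
	int expected = combine_assign(err, err2);

	assert(combine_early_return(err, err2) == expected);
	assert(combine_ternary(err, err2) == expected);
}
```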

2017-07-26 19:21:11

by Matthew Wilcox

Subject: Re: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

On Wed, Jul 26, 2017 at 01:55:38PM -0400, Jeff Layton wrote:
> @@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
> if (ret)
> return ret;
> if (gfs2_is_jdata(ip))
> - filemap_write_and_wait(mapping);
> + ret = file_write_and_wait(file);
> + if (ret)
> + return ret;
> gfs2_ail_flush(ip->i_gl, 1);
> }

Do we want to skip flushing the AIL if there was an error (possibly
previously encountered)? I'd think we'd want to flush the AIL then report
the error, like this:

if (gfs2_is_jdata(ip))
- filemap_write_and_wait(mapping);
+ ret = file_write_and_wait(file);
gfs2_ail_flush(ip->i_gl, 1);
+ if (ret)
+ return ret;
}

2017-07-26 19:50:28

by Bob Peterson

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

----- Original Message -----
| From: Jeff Layton <[email protected]>
|
| Some filesystem fsync routines will need these.
|
| Signed-off-by: Jeff Layton <[email protected]>
| ---
| include/linux/fs.h | 7 ++++++-
| mm/filemap.c | 56
| ++++++++++++++++++++++++++++++++++++++++++++++++++++++
| 2 files changed, 62 insertions(+), 1 deletion(-)
(snip)
| diff --git a/mm/filemap.c b/mm/filemap.c
| index 72e46e6f0d9a..b904a8dfa43d 100644
| --- a/mm/filemap.c
| +++ b/mm/filemap.c
(snip)
| @@ -675,6 +698,39 @@ int file_write_and_wait_range(struct file *file, loff_t
| lstart, loff_t lend)
| EXPORT_SYMBOL(file_write_and_wait_range);
|
| /**
| + * file_write_and_wait - write out whole file and wait on it and return any
| + * writeback errors since we last checked
| + * @file: file to write back and wait on
| + *
| + * Write back the whole file and wait on its mapping. Afterward, check for
| + * errors that may have occurred since our file->f_wb_err cursor was last
| + * updated.
| + */
| +int file_write_and_wait(struct file *file)
| +{
| + int err = 0, err2;
| + struct address_space *mapping = file->f_mapping;
| +
| + if ((!dax_mapping(mapping) && mapping->nrpages) ||
| + (dax_mapping(mapping) && mapping->nrexceptional)) {

Seems like we should make the new function mapping_needs_writeback more
central (mm.h or fs.h?) and call it here ^.

| + err = filemap_fdatawrite(mapping);
| + /* See comment of filemap_write_and_wait() */
| + if (err != -EIO) {
| + loff_t i_size = i_size_read(mapping->host);
| +
| + if (i_size != 0)
| + __filemap_fdatawait_range(mapping, 0,
| + i_size - 1);
| + }
| + }
| + err2 = file_check_and_advance_wb_err(file);
| + if (!err)
| + err = err2;
| + return err;

In the past, I've seen more elegant constructs like:
return (err ? err : err2);
but I don't know what's considered more ugly or hackish.

Regards,

Bob Peterson
Red Hat File Systems

2017-07-26 22:18:34

by Jeff Layton

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Wed, 2017-07-26 at 12:13 -0700, Matthew Wilcox wrote:
> On Wed, Jul 26, 2017 at 01:55:36PM -0400, Jeff Layton wrote:
> > +int file_write_and_wait(struct file *file)
> > +{
> > + int err = 0, err2;
> > + struct address_space *mapping = file->f_mapping;
> > +
> > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > + (dax_mapping(mapping) && mapping->nrexceptional)) {
>
> Since patch 1 exists, shouldn't this use the new helper?
>

<facepalm>

yes, will fix


> > + err = filemap_fdatawrite(mapping);
> > + /* See comment of filemap_write_and_wait() */
> > + if (err != -EIO) {
> > + loff_t i_size = i_size_read(mapping->host);
> > +
> > + if (i_size != 0)
> > + __filemap_fdatawait_range(mapping, 0,
> > + i_size - 1);
> > + }
> > + }
> > + err2 = file_check_and_advance_wb_err(file);
> > + if (!err)
> > + err = err2;
> > + return err;
>
> Would this be clearer written as:
>
> if (err)
> return err;
> return err2;
>
> or even ...
>
> return err ? err : err2;
>

Meh -- I like it the way I have it. If we don't have an error already,
then just take the one from the check and advance.

That said, I don't have a terribly strong preference here, so if anyone
does, then I can be easily persuaded.

--
Jeff Layton <[email protected]>

2017-07-26 22:22:58

by Jeff Layton

Subject: Re: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

On Wed, 2017-07-26 at 12:21 -0700, Matthew Wilcox wrote:
> On Wed, Jul 26, 2017 at 01:55:38PM -0400, Jeff Layton wrote:
> > @@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
> > if (ret)
> > return ret;
> > if (gfs2_is_jdata(ip))
> > - filemap_write_and_wait(mapping);
> > + ret = file_write_and_wait(file);
> > + if (ret)
> > + return ret;
> > gfs2_ail_flush(ip->i_gl, 1);
> > }
>
> Do we want to skip flushing the AIL if there was an error (possibly
> previously encountered)? I'd think we'd want to flush the AIL then report
> the error, like this:
>

I wondered about that. Note that earlier in the function, we also bail
out without flushing the AIL if sync_inode_metadata fails, so I assumed
that we'd want to do the same here.

I could definitely be wrong and am fine with changing it if so.
Discarding the error like we do today seems wrong though.

Bob, thoughts?


> if (gfs2_is_jdata(ip))
> - filemap_write_and_wait(mapping);
> + ret = file_write_and_wait(file);
> gfs2_ail_flush(ip->i_gl, 1);
> + if (ret)
> + return ret;
> }
--
Jeff Layton <[email protected]>

2017-07-27 08:43:46

by Jan Kara

Subject: Re: [PATCH v2 1/4] mm: consolidate dax / non-dax checks for writeback

On Wed 26-07-17 13:55:35, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> We have this complex conditional copied to several places. Turn it into
> a helper function.
>
> Signed-off-by: Jeff Layton <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/filemap.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e1cca770688f..72e46e6f0d9a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -522,12 +522,17 @@ int filemap_fdatawait(struct address_space *mapping)
> }
> EXPORT_SYMBOL(filemap_fdatawait);
>
> +static bool mapping_needs_writeback(struct address_space *mapping)
> +{
> + return (!dax_mapping(mapping) && mapping->nrpages) ||
> + (dax_mapping(mapping) && mapping->nrexceptional);
> +}
> +
> int filemap_write_and_wait(struct address_space *mapping)
> {
> int err = 0;
>
> - if ((!dax_mapping(mapping) && mapping->nrpages) ||
> - (dax_mapping(mapping) && mapping->nrexceptional)) {
> + if (mapping_needs_writeback(mapping)) {
> err = filemap_fdatawrite(mapping);
> /*
> * Even if the above returned error, the pages may be
> @@ -566,8 +571,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,
> {
> int err = 0;
>
> - if ((!dax_mapping(mapping) && mapping->nrpages) ||
> - (dax_mapping(mapping) && mapping->nrexceptional)) {
> + if (mapping_needs_writeback(mapping)) {
> err = __filemap_fdatawrite_range(mapping, lstart, lend,
> WB_SYNC_ALL);
> /* See comment of filemap_write_and_wait() */
> @@ -656,8 +660,7 @@ int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
> int err = 0, err2;
> struct address_space *mapping = file->f_mapping;
>
> - if ((!dax_mapping(mapping) && mapping->nrpages) ||
> - (dax_mapping(mapping) && mapping->nrexceptional)) {
> + if (mapping_needs_writeback(mapping)) {
> err = __filemap_fdatawrite_range(mapping, lstart, lend,
> WB_SYNC_ALL);
> /* See comment of filemap_write_and_wait() */
> --
> 2.13.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-07-27 08:49:17

by Jan Kara

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> +int file_write_and_wait(struct file *file)
> +{
> + int err = 0, err2;
> + struct address_space *mapping = file->f_mapping;
> +
> + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> + (dax_mapping(mapping) && mapping->nrexceptional)) {
> + err = filemap_fdatawrite(mapping);
> + /* See comment of filemap_write_and_wait() */
> + if (err != -EIO) {
> + loff_t i_size = i_size_read(mapping->host);
> +
> + if (i_size != 0)
> + __filemap_fdatawait_range(mapping, 0,
> + i_size - 1);
> + }
> + }

Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
range and ignore i_size. It is much easier than trying to wrap your head
around possible races with file operations modifying i_size.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-07-27 12:47:14

by Bob Peterson

Subject: Re: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

----- Original Message -----
| On Wed, 2017-07-26 at 12:21 -0700, Matthew Wilcox wrote:
| > On Wed, Jul 26, 2017 at 01:55:38PM -0400, Jeff Layton wrote:
| > > @@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t
| > > start, loff_t end,
| > > if (ret)
| > > return ret;
| > > if (gfs2_is_jdata(ip))
| > > - filemap_write_and_wait(mapping);
| > > + ret = file_write_and_wait(file);
| > > + if (ret)
| > > + return ret;
| > > gfs2_ail_flush(ip->i_gl, 1);
| > > }
| >
| > Do we want to skip flushing the AIL if there was an error (possibly
| > previously encountered)? I'd think we'd want to flush the AIL then report
| > the error, like this:
| >
|
| I wondered about that. Note that earlier in the function, we also bail
| out without flushing the AIL if sync_inode_metadata fails, so I assumed
| that we'd want to do the same here.
|
| I could definitely be wrong and am fine with changing it if so.
| Discarding the error like we do today seems wrong though.
|
| Bob, thoughts?

Hi Jeff, Matthew,

I'm not sure there's a right or wrong answer here. I don't know what's
best from a "correctness" point of view.

I guess I'm leaning toward Jeff's original solution where we don't
call gfs2_ail_flush() on error. The main purpose of ail_flush is to
go through buffer descriptors (bds) attached to the glock and generate
revokes for them in a new transaction. If there's an error condition,
trying to go through more hoops will probably just get us into more
trouble. If the error is -ENOMEM, we don't want to allocate new memory
for the new transaction. If the error is -EIO, we probably don't
want to encourage more writing either.

So on the one hand, it might be good to get rid of the buffer descriptors
so we don't leak memory, but that's probably also done elsewhere.
I have not chased down what happens in that case, but the same thing
would happen in the existing -EIO case a few lines above.

On the other hand, we probably don't want to start a new transaction
and start adding revokes to it, and such, due to the error.

Perhaps Steve Whitehouse can weigh in?

Regards,

Bob Peterson
Red Hat File Systems

2017-07-27 12:48:34

by Jeff Layton

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > +int file_write_and_wait(struct file *file)
> > +{
> > + int err = 0, err2;
> > + struct address_space *mapping = file->f_mapping;
> > +
> > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > + err = filemap_fdatawrite(mapping);
> > + /* See comment of filemap_write_and_wait() */
> > + if (err != -EIO) {
> > + loff_t i_size = i_size_read(mapping->host);
> > +
> > + if (i_size != 0)
> > + __filemap_fdatawait_range(mapping, 0,
> > + i_size - 1);
> > + }
> > + }
>
> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> range and ignore i_size. It is much easier than trying to wrap your head
> around possible races with file operations modifying i_size.
>
> Honza

I'm basically emulating _exactly_ what filemap_write_and_wait does here,
as I'm leery of making subtle behavior changes in the actual writeback
behavior. For example:

-----------------8<----------------
static inline int __filemap_fdatawrite(struct address_space *mapping,
int sync_mode)
{
return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
}

int filemap_fdatawrite(struct address_space *mapping)
{
return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite);
-----------------8<----------------

...which then sets up the wbc with the right ranges and sync mode and
kicks off writepages. But then, it does the i_size_read to figure out
what range it should wait on (with the shortcut for the size == 0 case).

My assumption was that it was intentionally designed that way, but I'm
guessing from your comments that it wasn't? If so, then we can turn
file_write_and_wait into a static inline wrapper around
file_write_and_wait_range.
--
Jeff Layton <[email protected]>
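
A rough userspace sketch of the wrapper idea floated above, for illustration only: the whole-file variant forwards to the ranged variant with a range covering the entire file. The model types and the recording stub are hypothetical, and using 0..LLONG_MAX as the "whole file" range is an assumption borrowed from the __filemap_fdatawrite() snippet quoted in this message, not a statement of what a later revision of the patch does.

```c
#include <assert.h>
#include <limits.h>

typedef long long loff_model_t;  /* stand-in for the kernel's loff_t */

struct file_model {
	loff_model_t last_start;  /* range most recently passed to the stub */
	loff_model_t last_end;
	int pending_err;          /* what the stub should report */
};

/* Stub standing in for file_write_and_wait_range(): records the range. */
static int model_file_write_and_wait_range(struct file_model *file,
					    loff_model_t lstart,
					    loff_model_t lend)
{
	assert(lstart <= lend);
	file->last_start = lstart;
	file->last_end = lend;
	return file->pending_err;
}

/* Whole-file variant as a thin inline wrapper around the ranged call. */
static inline int model_file_write_and_wait(struct file_model *file)
{
	return model_file_write_and_wait_range(file, 0, LLONG_MAX);
}
```

With this shape there is no i_size read in the whole-file path at all, which sidesteps the question of races against operations that change i_size.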

2017-07-28 12:37:15

by Steven Whitehouse

Subject: Re: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

Hi,


On 27/07/17 13:47, Bob Peterson wrote:
> ----- Original Message -----
> | On Wed, 2017-07-26 at 12:21 -0700, Matthew Wilcox wrote:
> | > On Wed, Jul 26, 2017 at 01:55:38PM -0400, Jeff Layton wrote:
> | > > @@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t
> | > > start, loff_t end,
> | > > if (ret)
> | > > return ret;
> | > > if (gfs2_is_jdata(ip))
> | > > - filemap_write_and_wait(mapping);
> | > > + ret = file_write_and_wait(file);
> | > > + if (ret)
> | > > + return ret;
> | > > gfs2_ail_flush(ip->i_gl, 1);
> | > > }
> | >
> | > Do we want to skip flushing the AIL if there was an error (possibly
> | > previously encountered)? I'd think we'd want to flush the AIL then report
> | > the error, like this:
> | >
> |
> | I wondered about that. Note that earlier in the function, we also bail
> | out without flushing the AIL if sync_inode_metadata fails, so I assumed
> | that we'd want to do the same here.
> |
> | I could definitely be wrong and am fine with changing it if so.
> | Discarding the error like we do today seems wrong though.
> |
> | Bob, thoughts?
>
> Hi Jeff, Matthew,
>
> I'm not sure there's a right or wrong answer here. I don't know what's
> best from a "correctness" point of view.
>
> I guess I'm leaning toward Jeff's original solution where we don't
> call gfs2_ail_flush() on error. The main purpose of ail_flush is to
> go through buffer descriptors (bds) attached to the glock and generate
> revokes for them in a new transaction. If there's an error condition,
> trying to go through more hoops will probably just get us into more
> trouble. If the error is -ENOMEM, we don't want to allocate new memory
> for the new transaction. If the error is -EIO, we probably don't
> want to encourage more writing either.
>
> So on the one hand, it might be good to get rid of the buffer descriptors
> so we don't leak memory, but that's probably also done elsewhere.
> I have not chased down what happens in that case, but the same thing
> would happen in the existing -EIO case a few lines above.
>
> On the other hand, we probably don't want to start a new transaction
> and start adding revokes to it, and such, due to the error.
>
> Perhaps Steve Whitehouse can weigh in?
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

Yes, we probably do want to skip the ail flush if there is an error. We
don't know whether the error is permanent or transient at that stage. If
a previous stage of the fsync has failed, then there may be nothing for
the next stage to do anyway, so it is probably not a big deal either
way. So long as the error is reported to the caller, then we should be ok,

Steve.
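
The ordering agreed on in this subthread can be sketched in userspace as follows: if the jdata writeback step reports an error, return it to the caller and do not run the AIL flush. All names here are hypothetical stubs that only mirror the control flow of the patched hunk; none of this is real gfs2 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fsync_model {
	bool is_jdata;
	int writeback_err;  /* what the stubbed write-and-wait returns */
	int ail_flushes;    /* how many times the stubbed AIL flush ran */
};

/* Stub for file_write_and_wait(): just reports the configured error. */
static int model_write_and_wait(struct fsync_model *s)
{
	return s->writeback_err;
}

/* Stub for gfs2_ail_flush(): counts invocations. */
static void model_ail_flush(struct fsync_model *s)
{
	s->ail_flushes++;
}

/* Mirrors the patched branch: bail out before the flush on error. */
static int model_fsync_step(struct fsync_model *s)
{
	int ret = 0;

	assert(s != NULL);
	if (s->is_jdata)
		ret = model_write_and_wait(s);
	if (ret)
		return ret;
	model_ail_flush(s);
	return 0;
}
```

Either ordering reports the error to the caller; the difference is only whether a new transaction gets started after a failure, which is what the replies above argue against.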

2017-07-28 12:47:42

by Jeff Layton

Subject: Re: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

On Fri, 2017-07-28 at 13:37 +0100, Steven Whitehouse wrote:
> Hi,
>
>
> On 27/07/17 13:47, Bob Peterson wrote:
> > ----- Original Message -----
> > > On Wed, 2017-07-26 at 12:21 -0700, Matthew Wilcox wrote:
> > > > On Wed, Jul 26, 2017 at 01:55:38PM -0400, Jeff Layton wrote:
> > > > > @@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t
> > > > > start, loff_t end,
> > > > > if (ret)
> > > > > return ret;
> > > > > if (gfs2_is_jdata(ip))
> > > > > - filemap_write_and_wait(mapping);
> > > > > + ret = file_write_and_wait(file);
> > > > > + if (ret)
> > > > > + return ret;
> > > > > gfs2_ail_flush(ip->i_gl, 1);
> > > > > }
> > > >
> > > > Do we want to skip flushing the AIL if there was an error (possibly
> > > > previously encountered)? I'd think we'd want to flush the AIL then report
> > > > the error, like this:
> > > >
> > >
> > > I wondered about that. Note that earlier in the function, we also bail
> > > out without flushing the AIL if sync_inode_metadata fails, so I assumed
> > > that we'd want to do the same here.
> > >
> > > I could definitely be wrong and am fine with changing it if so.
> > > Discarding the error like we do today seems wrong though.
> > >
> > > Bob, thoughts?
> >
> > Hi Jeff, Matthew,
> >
> > I'm not sure there's a right or wrong answer here. I don't know what's
> > best from a "correctness" point of view.
> >
> > I guess I'm leaning toward Jeff's original solution where we don't
> > call gfs2_ail_flush() on error. The main purpose of ail_flush is to
> > go through buffer descriptors (bds) attached to the glock and generate
> > revokes for them in a new transaction. If there's an error condition,
> > trying to go through more hoops will probably just get us into more
> > trouble. If the error is -ENOMEM, we don't want to allocate new memory
> > for the new transaction. If the error is -EIO, we probably don't
> > want to encourage more writing either.
> >
> > So on the one hand, it might be good to get rid of the buffer descriptors
> > so we don't leak memory, but that's probably also done elsewhere.
> > I have not chased down what happens in that case, but the same thing
> > would happen in the existing -EIO case a few lines above.
> >
> > On the other hand, we probably don't want to start a new transaction
> > and start adding revokes to it, and such, due to the error.
> >
> > Perhaps Steve Whitehouse can weigh in?
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
>
> Yes, we probably do want to skip the ail flush if there is an error. We
> don't know whether the error is permanent or transient at that stage. If
> a previous stage of the fsync has failed, then there may be nothing for
> the next stage to do anyway, so it is probably not a big deal either
> way. So long as the error is reported to the caller, then we should be ok,
>

Ok, cool. I'll plan to carry this patch for now as it depends on an
earlier one in the series. One more question though:

Is it correct in the gfs2_is_jdata case to ignore the range that was
passed in from the caller? ->fsync gets start and end arguments, but
this will always write back the whole range. Is that necessary in this
case?

--
Jeff Layton <[email protected]>

2017-07-28 12:54:46

by Steven Whitehouse

Subject: Re: [PATCH v2 4/4] gfs2: convert to errseq_t based writeback error reporting for fsync

Hi,


On 28/07/17 13:47, Jeff Layton wrote:
> On Fri, 2017-07-28 at 13:37 +0100, Steven Whitehouse wrote:
>> Hi,
>>
>>
>> On 27/07/17 13:47, Bob Peterson wrote:
>>> ----- Original Message -----
>>>> On Wed, 2017-07-26 at 12:21 -0700, Matthew Wilcox wrote:
>>>>> On Wed, Jul 26, 2017 at 01:55:38PM -0400, Jeff Layton wrote:
>>>>>> @@ -668,12 +668,14 @@ static int gfs2_fsync(struct file *file, loff_t
>>>>>> start, loff_t end,
>>>>>> if (ret)
>>>>>> return ret;
>>>>>> if (gfs2_is_jdata(ip))
>>>>>> - filemap_write_and_wait(mapping);
>>>>>> + ret = file_write_and_wait(file);
>>>>>> + if (ret)
>>>>>> + return ret;
>>>>>> gfs2_ail_flush(ip->i_gl, 1);
>>>>>> }
>>>>> Do we want to skip flushing the AIL if there was an error (possibly
>>>>> previously encountered)? I'd think we'd want to flush the AIL then report
>>>>> the error, like this:
>>>>>
>>>> I wondered about that. Note that earlier in the function, we also bail
>>>> out without flushing the AIL if sync_inode_metadata fails, so I assumed
>>>> that we'd want to do the same here.
>>>>
>>>> I could definitely be wrong and am fine with changing it if so.
>>>> Discarding the error like we do today seems wrong though.
>>>>
>>>> Bob, thoughts?
>>> Hi Jeff, Matthew,
>>>
>>> I'm not sure there's a right or wrong answer here. I don't know what's
>>> best from a "correctness" point of view.
>>>
>>> I guess I'm leaning toward Jeff's original solution where we don't
>>> call gfs2_ail_flush() on error. The main purpose of ail_flush is to
>>> go through buffer descriptors (bds) attached to the glock and generate
>>> revokes for them in a new transaction. If there's an error condition,
>>> trying to go through more hoops will probably just get us into more
>>> trouble. If the error is -ENOMEM, we don't want to allocate new memory
>>> for the new transaction. If the error is -EIO, we probably don't
>>> want to encourage more writing either.
>>>
>>> So on the one hand, it might be good to get rid of the buffer descriptors
>>> so we don't leak memory, but that's probably also done elsewhere.
>>> I have not chased down what happens in that case, but the same thing
>>> would happen in the existing -EIO case a few lines above.
>>>
>>> On the other hand, we probably don't want to start a new transaction
>>> and start adding revokes to it, and such, due to the error.
>>>
>>> Perhaps Steve Whitehouse can weigh in?
>>>
>>> Regards,
>>>
>>> Bob Peterson
>>> Red Hat File Systems
>> Yes, we probably do want to skip the ail flush if there is an error. We
>> don't know whether the error is permanent or transient at that stage. If
>> a previous stage of the fsync has failed, then there may be nothing for
>> the next stage to do anyway, so it is probably not a big deal either
>> way. So long as the error is reported to the caller, we should be OK.
>>
> Ok, cool. I'll plan to carry this patch for now as it depends on an
> earlier one in the series. One more question though:
>
> Is it correct in the gfs2_is_jdata case to ignore the range that was
> passed in from the caller? ->fsync gets start and end arguments, but
> this will always write back the whole range. Is that necessary in this
> case?
>
It probably doesn't matter really. We try to discourage the use of jdata
from userspace. There are a few internal files that use it still, and it
is there for backwards compatibility more than anything. So performance
is generally not a problem for that. The ordered write mode is the
important one.

So you are right that it might be better to add the range into that call
too, but it is not likely that anybody will notice the performance
improvement.

Steve.
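
[Editor's sketch] The AIL-flush question discussed in the quoted text above comes down to ordering: both variants report the writeback error to the caller, and they differ only in whether the flush step still runs. The following is a minimal userspace model of those two orderings, not gfs2 code. The names sketch_result, sketch_fsync_bail_early and sketch_fsync_flush_then_report are hypothetical, and the second variant is an illustration of the idea raised in review, not the code that was proposed there.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result record: what the caller sees, and whether we flushed. */
struct sketch_result {
	int ret;      /* value handed back to the fsync caller */
	int flushed;  /* 1 if the (modelled) AIL flush step ran */
};

/* Ordering kept in the v2 patch: stop at the first error, skip the flush. */
struct sketch_result sketch_fsync_bail_early(int writeback_err)
{
	struct sketch_result r = { 0, 0 };

	assert(writeback_err <= 0);   /* kernel-style: 0 or a negative errno */
	r.ret = writeback_err;
	if (r.ret)
		return r;             /* error reported, flush skipped */
	r.flushed = 1;
	return r;
}

/* Alternative ordering: flush regardless, then report the error. */
struct sketch_result sketch_fsync_flush_then_report(int writeback_err)
{
	struct sketch_result r = { 0, 0 };

	assert(writeback_err <= 0);
	r.flushed = 1;                /* flush runs even after a failure */
	r.ret = writeback_err;
	return r;
}
```

In both variants the error reaches the caller, which is the property the thread agrees on; the pre-patch code, which discarded the return value, would correspond to ret always being 0.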

2017-07-31 11:27:07

by Jeff Layton

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > +int file_write_and_wait(struct file *file)
> > > +{
> > > + int err = 0, err2;
> > > + struct address_space *mapping = file->f_mapping;
> > > +
> > > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > + err = filemap_fdatawrite(mapping);
> > > + /* See comment of filemap_write_and_wait() */
> > > + if (err != -EIO) {
> > > + loff_t i_size = i_size_read(mapping->host);
> > > +
> > > + if (i_size != 0)
> > > + __filemap_fdatawait_range(mapping, 0,
> > > + i_size - 1);
> > > + }
> > > + }
> >
> > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > range and ignore i_size. It is much easier than trying to wrap your head
> > around possible races with file operations modifying i_size.
> >
> > Honza
>
> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> as I'm leery of making subtle behavior changes in the actual writeback
> behavior. For example:
>
> -----------------8<----------------
> static inline int __filemap_fdatawrite(struct address_space *mapping,
> int sync_mode)
> {
> return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> }
>
> int filemap_fdatawrite(struct address_space *mapping)
> {
> return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> }
> EXPORT_SYMBOL(filemap_fdatawrite);
> -----------------8<----------------
>
> ...which then sets up the wbc with the right ranges and sync mode and
> kicks off writepages. But then, it does the i_size_read to figure out
> what range it should wait on (with the shortcut for the size == 0 case).
>
> My assumption was that it was intentionally designed that way, but I'm
> guessing from your comments that it wasn't? If so, then we can turn
> file_write_and_wait into a static inline wrapper around
> file_write_and_wait_range.

FWIW, I did a bit of archaeology in the linux-history tree and found
this patch from Marcelo in 2004. Is this optimization still helpful? If
not, then that does simplify the code a bit.

-------------------8<--------------------

[PATCH] small wait_on_page_writeback_range() optimization

filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
parameter. This is not needed since we know the EOF from the inode. Use
that instead.

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
---
mm/filemap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 78e18b7639b6..55fb7b4141e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
*/
int filemap_fdatawait(struct address_space *mapping)
{
- return wait_on_page_writeback_range(mapping, 0, -1);
+ loff_t i_size = i_size_read(mapping->host);
+
+ if (i_size == 0)
+ return 0;
+
+ return wait_on_page_writeback_range(mapping, 0,
+ (i_size - 1) >> PAGE_CACHE_SHIFT);
}
EXPORT_SYMBOL(filemap_fdatawait);
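
[Editor's sketch] The arithmetic in the 2004 patch above can be tried in isolation. This is a userspace model, assuming 4 KiB pages (SKETCH_PAGE_SHIFT is a stand-in for the kernel's per-architecture shift) and a hypothetical helper name; it only shows how the last page index is derived from i_size and why the size == 0 case needs its own shortcut.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The real shift depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Returns 0 and leaves *last untouched when the file is empty, since there
 * is then no page to wait on. Otherwise returns 1 and stores the index of
 * the page that holds the last byte of the file.
 */
int sketch_last_page_index(int64_t i_size, uint64_t *last)
{
	assert(last != NULL);
	if (i_size <= 0)
		return 0;
	/* Byte offsets are 0-based, so the last byte lives at i_size - 1. */
	*last = (uint64_t)(i_size - 1) >> SKETCH_PAGE_SHIFT;
	return 1;
}
```

Without the empty-file shortcut, i_size - 1 would be -1 and the shifted value would describe a huge range rather than an empty one.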

2017-07-31 11:32:39

by Steven Whitehouse

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

Hi,


On 31/07/17 12:27, Jeff Layton wrote:
> On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
>> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
>>> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
>>>> +int file_write_and_wait(struct file *file)
>>>> +{
>>>> + int err = 0, err2;
>>>> + struct address_space *mapping = file->f_mapping;
>>>> +
>>>> + if ((!dax_mapping(mapping) && mapping->nrpages) ||
>>>> + (dax_mapping(mapping) && mapping->nrexceptional)) {
>>>> + err = filemap_fdatawrite(mapping);
>>>> + /* See comment of filemap_write_and_wait() */
>>>> + if (err != -EIO) {
>>>> + loff_t i_size = i_size_read(mapping->host);
>>>> +
>>>> + if (i_size != 0)
>>>> + __filemap_fdatawait_range(mapping, 0,
>>>> + i_size - 1);
>>>> + }
>>>> + }
>>> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
>>> range and ignore i_size. It is much easier than trying to wrap your head
>>> around possible races with file operations modifying i_size.
>>>
>>> Honza
>> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
>> as I'm leery of making subtle behavior changes in the actual writeback
>> behavior. For example:
>>
>> -----------------8<----------------
>> static inline int __filemap_fdatawrite(struct address_space *mapping,
>> int sync_mode)
>> {
>> return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
>> }
>>
>> int filemap_fdatawrite(struct address_space *mapping)
>> {
>> return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
>> }
>> EXPORT_SYMBOL(filemap_fdatawrite);
>> -----------------8<----------------
>>
>> ...which then sets up the wbc with the right ranges and sync mode and
>> kicks off writepages. But then, it does the i_size_read to figure out
>> what range it should wait on (with the shortcut for the size == 0 case).
>>
>> My assumption was that it was intentionally designed that way, but I'm
>> guessing from your comments that it wasn't? If so, then we can turn
>> file_write_and_wait into a static inline wrapper around
>> file_write_and_wait_range.
> FWIW, I did a bit of archaeology in the linux-history tree and found
> this patch from Marcelo in 2004. Is this optimization still helpful? If
> not, then that does simplify the code a bit.
>
> -------------------8<--------------------
>
> [PATCH] small wait_on_page_writeback_range() optimization
>
> filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> parameter. This is not needed since we know the EOF from the inode. Use
> that instead.
>
> Signed-off-by: Marcelo Tosatti <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
> mm/filemap.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 78e18b7639b6..55fb7b4141e4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> */
> int filemap_fdatawait(struct address_space *mapping)
> {
> - return wait_on_page_writeback_range(mapping, 0, -1);
> + loff_t i_size = i_size_read(mapping->host);
> +
> + if (i_size == 0)
> + return 0;
> +
> + return wait_on_page_writeback_range(mapping, 0,
> + (i_size - 1) >> PAGE_CACHE_SHIFT);
> }
> EXPORT_SYMBOL(filemap_fdatawait);
>

Does this ever get called in cases where we would not hold fs locks? In
that case we definitely don't want to rely on i_size.

Steve.

2017-07-31 11:44:21

by Jeff Layton

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> Hi,
>
>
> On 31/07/17 12:27, Jeff Layton wrote:
> > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > +int file_write_and_wait(struct file *file)
> > > > > +{
> > > > > + int err = 0, err2;
> > > > > + struct address_space *mapping = file->f_mapping;
> > > > > +
> > > > > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > + err = filemap_fdatawrite(mapping);
> > > > > + /* See comment of filemap_write_and_wait() */
> > > > > + if (err != -EIO) {
> > > > > + loff_t i_size = i_size_read(mapping->host);
> > > > > +
> > > > > + if (i_size != 0)
> > > > > + __filemap_fdatawait_range(mapping, 0,
> > > > > + i_size - 1);
> > > > > + }
> > > > > + }
> > > >
> > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > around possible races with file operations modifying i_size.
> > > >
> > > > Honza
> > >
> > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > as I'm leery of making subtle behavior changes in the actual writeback
> > > behavior. For example:
> > >
> > > -----------------8<----------------
> > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > int sync_mode)
> > > {
> > > return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > }
> > >
> > > int filemap_fdatawrite(struct address_space *mapping)
> > > {
> > > return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > }
> > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > -----------------8<----------------
> > >
> > > ...which then sets up the wbc with the right ranges and sync mode and
> > > kicks off writepages. But then, it does the i_size_read to figure out
> > > what range it should wait on (with the shortcut for the size == 0 case).
> > >
> > > My assumption was that it was intentionally designed that way, but I'm
> > > guessing from your comments that it wasn't? If so, then we can turn
> > > file_write_and_wait into a static inline wrapper around
> > > file_write_and_wait_range.
> >
> > FWIW, I did a bit of archaeology in the linux-history tree and found
> > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > not, then that does simplify the code a bit.
> >
> > -------------------8<--------------------
> >
> > [PATCH] small wait_on_page_writeback_range() optimization
> >
> > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > parameter. This is not needed since we know the EOF from the inode. Use
> > that instead.
> >
> > Signed-off-by: Marcelo Tosatti <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > Signed-off-by: Linus Torvalds <[email protected]>
> > ---
> > mm/filemap.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 78e18b7639b6..55fb7b4141e4 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > */
> > int filemap_fdatawait(struct address_space *mapping)
> > {
> > - return wait_on_page_writeback_range(mapping, 0, -1);
> > + loff_t i_size = i_size_read(mapping->host);
> > +
> > + if (i_size == 0)
> > + return 0;
> > +
> > + return wait_on_page_writeback_range(mapping, 0,
> > + (i_size - 1) >> PAGE_CACHE_SHIFT);
> > }
> > EXPORT_SYMBOL(filemap_fdatawait);
> >
>
> Does this ever get called in cases where we would not hold fs locks? In
> that case we definitely don't want to be relying on i_size,
>
> Steve.
>

Yes. We can initiate and wait on writeback from any context where you
can sleep, really.

We're just waiting on whole file writeback here, so I don't think
there's anything wrong. As long as the i_size was valid at some point in
time prior to waiting then you're ok.

The question I have is more whether this optimization is still useful.

What we do now is just walk the radix tree and wait_on_page_writeback
for each page. Do we gain anything by avoiding ranges beyond the current
EOF with the pagecache infrastructure of 2017?

--
Jeff Layton <[email protected]>

2017-07-31 12:05:39

by Steven Whitehouse

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

Hi,


On 31/07/17 12:44, Jeff Layton wrote:
> On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
>> Hi,
>>
>>
>> On 31/07/17 12:27, Jeff Layton wrote:
>>> On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
>>>> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
>>>>> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
>>>>>> +int file_write_and_wait(struct file *file)
>>>>>> +{
>>>>>> + int err = 0, err2;
>>>>>> + struct address_space *mapping = file->f_mapping;
>>>>>> +
>>>>>> + if ((!dax_mapping(mapping) && mapping->nrpages) ||
>>>>>> + (dax_mapping(mapping) && mapping->nrexceptional)) {
>>>>>> + err = filemap_fdatawrite(mapping);
>>>>>> + /* See comment of filemap_write_and_wait() */
>>>>>> + if (err != -EIO) {
>>>>>> + loff_t i_size = i_size_read(mapping->host);
>>>>>> +
>>>>>> + if (i_size != 0)
>>>>>> + __filemap_fdatawait_range(mapping, 0,
>>>>>> + i_size - 1);
>>>>>> + }
>>>>>> + }
>>>>> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
>>>>> range and ignore i_size. It is much easier than trying to wrap your head
>>>>> around possible races with file operations modifying i_size.
>>>>>
>>>>> Honza
>>>> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
>>>> as I'm leery of making subtle behavior changes in the actual writeback
>>>> behavior. For example:
>>>>
>>>> -----------------8<----------------
>>>> static inline int __filemap_fdatawrite(struct address_space *mapping,
>>>> int sync_mode)
>>>> {
>>>> return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
>>>> }
>>>>
>>>> int filemap_fdatawrite(struct address_space *mapping)
>>>> {
>>>> return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
>>>> }
>>>> EXPORT_SYMBOL(filemap_fdatawrite);
>>>> -----------------8<----------------
>>>>
>>>> ...which then sets up the wbc with the right ranges and sync mode and
>>>> kicks off writepages. But then, it does the i_size_read to figure out
>>>> what range it should wait on (with the shortcut for the size == 0 case).
>>>>
>>>> My assumption was that it was intentionally designed that way, but I'm
>>>> guessing from your comments that it wasn't? If so, then we can turn
>>>> file_write_and_wait into a static inline wrapper around
>>>> file_write_and_wait_range.
>>> FWIW, I did a bit of archaeology in the linux-history tree and found
>>> this patch from Marcelo in 2004. Is this optimization still helpful? If
>>> not, then that does simplify the code a bit.
>>>
>>> -------------------8<--------------------
>>>
>>> [PATCH] small wait_on_page_writeback_range() optimization
>>>
>>> filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
>>> parameter. This is not needed since we know the EOF from the inode. Use
>>> that instead.
>>>
>>> Signed-off-by: Marcelo Tosatti <[email protected]>
>>> Signed-off-by: Andrew Morton <[email protected]>
>>> Signed-off-by: Linus Torvalds <[email protected]>
>>> ---
>>> mm/filemap.c | 8 +++++++-
>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 78e18b7639b6..55fb7b4141e4 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
>>> */
>>> int filemap_fdatawait(struct address_space *mapping)
>>> {
>>> - return wait_on_page_writeback_range(mapping, 0, -1);
>>> + loff_t i_size = i_size_read(mapping->host);
>>> +
>>> + if (i_size == 0)
>>> + return 0;
>>> +
>>> + return wait_on_page_writeback_range(mapping, 0,
>>> + (i_size - 1) >> PAGE_CACHE_SHIFT);
>>> }
>>> EXPORT_SYMBOL(filemap_fdatawait);
>>>
>> Does this ever get called in cases where we would not hold fs locks? In
>> that case we definitely don't want to be relying on i_size,
>>
>> Steve.
>>
> Yes. We can initiate and wait on writeback from any context where you
> can sleep, really.
>
> We're just waiting on whole file writeback here, so I don't think
> there's anything wrong. As long as the i_size was valid at some point in
> time prior to waiting then you're ok.
>
> The question I have is more whether this optimization is still useful.
>
> What we do now is just walk the radix tree and wait_on_page_writeback
> for each page. Do we gain anything by avoiding ranges beyond the current
> EOF with the pagecache infrastructure of 2017?
>

If this can be called from anywhere without fs locks, then i_size is not
known. That has been a problem in the past, since i_size may have changed
on another node. We avoid that in this case by only changing i_size
under an exclusive lock, and by only having dirty pages when we hold
an exclusive lock. There is another case though: if the inode is a block
device, i_size will be zero. That is the case for the address space that
looks after rgrps for GFS2. We do (luckily!) call
filemap_fdatawait_range() directly in that case. For "normal" inodes
though, the address space for metadata is backed by the block device
inode, so that looks like it might be an issue, since
fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the
metamapping. It might potentially be an issue in other cases too.

Steve.
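
[Editor's sketch] The block-device concern above can be modelled in userspace: if the host inode reports i_size == 0, a wait that is bounded by i_size returns immediately even when pages are still under writeback, while a wait over the whole mapping sees them. This is a toy model under stated assumptions (4 KiB pages, an 8-page mapping, hypothetical names throughout); it counts the pages that would be waited on rather than waiting on anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */
#define SKETCH_NPAGES 8       /* assumption: tiny fixed-size mapping */

struct sketch_mapping {
	int64_t i_size;                          /* size reported by the host inode */
	unsigned char writeback[SKETCH_NPAGES];  /* 1 = page is under writeback */
};

/* Count the pages in [first, last] that are under writeback. */
static size_t sketch_count_range(const struct sketch_mapping *m,
				 uint64_t first, uint64_t last)
{
	size_t waited = 0;

	for (uint64_t i = first; i <= last && i < SKETCH_NPAGES; i++)
		waited += m->writeback[i];
	return waited;
}

/* Mirrors the i_size-bounded wait, including the size == 0 shortcut. */
size_t sketch_wait_bounded_by_i_size(const struct sketch_mapping *m)
{
	assert(m != NULL);
	if (m->i_size == 0)
		return 0;  /* shortcut: nothing is waited on at all */
	return sketch_count_range(m, 0,
			(uint64_t)(m->i_size - 1) >> SKETCH_PAGE_SHIFT);
}

/* Ignores i_size and covers every page in the mapping. */
size_t sketch_wait_whole_mapping(const struct sketch_mapping *m)
{
	assert(m != NULL);
	return sketch_count_range(m, 0, UINT64_MAX);
}
```

With i_size == 0 and two pages under writeback, the bounded variant waits on none of them, which is the situation described for a metamapping backed by the block device inode.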

2017-07-31 12:07:48

by Jan Kara

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Mon 31-07-17 07:44:16, Jeff Layton wrote:
> On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > On 31/07/17 12:27, Jeff Layton wrote:
> > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > +int file_write_and_wait(struct file *file)
> > > > > > +{
> > > > > > + int err = 0, err2;
> > > > > > + struct address_space *mapping = file->f_mapping;
> > > > > > +
> > > > > > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > + err = filemap_fdatawrite(mapping);
> > > > > > + /* See comment of filemap_write_and_wait() */
> > > > > > + if (err != -EIO) {
> > > > > > + loff_t i_size = i_size_read(mapping->host);
> > > > > > +
> > > > > > + if (i_size != 0)
> > > > > > + __filemap_fdatawait_range(mapping, 0,
> > > > > > + i_size - 1);
> > > > > > + }
> > > > > > + }
> > > > >
> > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > around possible races with file operations modifying i_size.
> > > > >
> > > > > Honza
> > > >
> > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > behavior. For example:
> > > >
> > > > -----------------8<----------------
> > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > int sync_mode)
> > > > {
> > > > return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > }
> > > >
> > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > {
> > > > return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > }
> > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > -----------------8<----------------
> > > >
> > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > >
> > > > My assumption was that it was intentionally designed that way, but I'm
> > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > file_write_and_wait into a static inline wrapper around
> > > > file_write_and_wait_range.
> > >
> > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > not, then that does simplify the code a bit.
> > >
> > > -------------------8<--------------------
> > >
> > > [PATCH] small wait_on_page_writeback_range() optimization
> > >
> > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > parameter. This is not needed since we know the EOF from the inode. Use
> > > that instead.
> > >
> > > Signed-off-by: Marcelo Tosatti <[email protected]>
> > > Signed-off-by: Andrew Morton <[email protected]>
> > > Signed-off-by: Linus Torvalds <[email protected]>
> > > ---
> > > mm/filemap.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 78e18b7639b6..55fb7b4141e4 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > */
> > > int filemap_fdatawait(struct address_space *mapping)
> > > {
> > > - return wait_on_page_writeback_range(mapping, 0, -1);
> > > + loff_t i_size = i_size_read(mapping->host);
> > > +
> > > + if (i_size == 0)
> > > + return 0;
> > > +
> > > + return wait_on_page_writeback_range(mapping, 0,
> > > + (i_size - 1) >> PAGE_CACHE_SHIFT);
> > > }
> > > EXPORT_SYMBOL(filemap_fdatawait);
> > >
> >
> > Does this ever get called in cases where we would not hold fs locks? In
> > that case we definitely don't want to be relying on i_size,
> >
> > Steve.
> >
>
> Yes. We can initiate and wait on writeback from any context where you
> can sleep, really.
>
> We're just waiting on whole file writeback here, so I don't think
> there's anything wrong. As long as the i_size was valid at some point in
> time prior to waiting then you're ok.
>
> The question I have is more whether this optimization is still useful.
>
> What we do now is just walk the radix tree and wait_on_page_writeback
> for each page. Do we gain anything by avoiding ranges beyond the current
> EOF with the pagecache infrastructure of 2017?

FWIW I'm not aware of any significant benefit of using i_size in
filemap_fdatawait() - we iterate to the end of the radix tree node anyway,
since pagevec_lookup_tag() does not support range searches (I'm working on
fixing that, but even then the benefit would still be rather marginal).

What Marcelo might have meant back in 2004 is that if we are in the
middle of a truncate, where i_size is already reduced but the page cache is
not yet truncated, then filemap_fdatawait() does not have to wait for
writeback of the truncated pages. That might be a noticeable benefit even
today if such a race happens. However, I'm not sure it's worth optimizing
for, and the surprises arising from randomly snapshotting i_size (which,
especially for clustered filesystems, may be out of date) IMHO outweigh the
possible advantage.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
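
[Editor's sketch] The point about lookups without range support can be illustrated with a small userspace model: a tagged lookup that only takes a start index returns a full batch, and the caller has to discard entries past the end of its range. This is not the kernel's pagevec code. SKETCH_BATCH, sketch_lookup_tag and sketch_wait_tagged_range are hypothetical names, and the model assumes the tagged indexes are sorted ascending and smaller than UINT64_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BATCH 4  /* hypothetical batch size, standing in for a pagevec */

/*
 * Copy up to max tagged indexes >= *start into out[] and advance *start
 * past the last one returned. There is deliberately no upper bound here.
 */
static size_t sketch_lookup_tag(const uint64_t *tagged, size_t ntagged,
				uint64_t *start, uint64_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < ntagged && n < max; i++) {
		if (tagged[i] >= *start)
			out[n++] = tagged[i];
	}
	if (n)
		*start = out[n - 1] + 1;
	return n;
}

struct sketch_stats {
	size_t looked_up;  /* entries fetched by the lookup */
	size_t waited;     /* entries inside [first, end] */
};

struct sketch_stats sketch_wait_tagged_range(const uint64_t *tagged,
					     size_t ntagged,
					     uint64_t first, uint64_t end)
{
	struct sketch_stats st = { 0, 0 };
	uint64_t batch[SKETCH_BATCH];
	uint64_t index = first;
	size_t n;

	assert(tagged != NULL || ntagged == 0);
	while (index <= end &&
	       (n = sketch_lookup_tag(tagged, ntagged, &index,
				      batch, SKETCH_BATCH)) != 0) {
		for (size_t i = 0; i < n; i++) {
			st.looked_up++;
			if (batch[i] > end)
				continue;  /* fetched, but outside the range */
			st.waited++;
		}
	}
	return st;
}
```

Bounding the range by i_size in this model saves at most part of one batch of lookups, which is consistent with the benefit being described as marginal.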

2017-07-31 12:22:49

by Jeff Layton

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Mon, 2017-07-31 at 13:05 +0100, Steven Whitehouse wrote:
> Hi,
>
>
> On 31/07/17 12:44, Jeff Layton wrote:
> > On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > > Hi,
> > >
> > >
> > > On 31/07/17 12:27, Jeff Layton wrote:
> > > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > > +int file_write_and_wait(struct file *file)
> > > > > > > +{
> > > > > > > + int err = 0, err2;
> > > > > > > + struct address_space *mapping = file->f_mapping;
> > > > > > > +
> > > > > > > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > > + err = filemap_fdatawrite(mapping);
> > > > > > > + /* See comment of filemap_write_and_wait() */
> > > > > > > + if (err != -EIO) {
> > > > > > > + loff_t i_size = i_size_read(mapping->host);
> > > > > > > +
> > > > > > > + if (i_size != 0)
> > > > > > > + __filemap_fdatawait_range(mapping, 0,
> > > > > > > + i_size - 1);
> > > > > > > + }
> > > > > > > + }
> > > > > >
> > > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > > around possible races with file operations modifying i_size.
> > > > > >
> > > > > > Honza
> > > > >
> > > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > > behavior. For example:
> > > > >
> > > > > -----------------8<----------------
> > > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > > int sync_mode)
> > > > > {
> > > > > return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > > }
> > > > >
> > > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > > {
> > > > > return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > > }
> > > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > > -----------------8<----------------
> > > > >
> > > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > >
> > > > > My assumption was that it was intentionally designed that way, but I'm
> > > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > > file_write_and_wait into a static inline wrapper around
> > > > > file_write_and_wait_range.
> > > >
> > > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > > not, then that does simplify the code a bit.
> > > >
> > > > -------------------8<--------------------
> > > >
> > > > [PATCH] small wait_on_page_writeback_range() optimization
> > > >
> > > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > > parameter. This is not needed since we know the EOF from the inode. Use
> > > > that instead.
> > > >
> > > > Signed-off-by: Marcelo Tosatti <[email protected]>
> > > > Signed-off-by: Andrew Morton <[email protected]>
> > > > Signed-off-by: Linus Torvalds <[email protected]>
> > > > ---
> > > > mm/filemap.c | 8 +++++++-
> > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 78e18b7639b6..55fb7b4141e4 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > > */
> > > > int filemap_fdatawait(struct address_space *mapping)
> > > > {
> > > > - return wait_on_page_writeback_range(mapping, 0, -1);
> > > > + loff_t i_size = i_size_read(mapping->host);
> > > > +
> > > > + if (i_size == 0)
> > > > + return 0;
> > > > +
> > > > + return wait_on_page_writeback_range(mapping, 0,
> > > > + (i_size - 1) >> PAGE_CACHE_SHIFT);
> > > > }
> > > > EXPORT_SYMBOL(filemap_fdatawait);
> > > >
> > >
> > > Does this ever get called in cases where we would not hold fs locks? In
> > > that case we definitely don't want to be relying on i_size,
> > >
> > > Steve.
> > >
> >
> > Yes. We can initiate and wait on writeback from any context where you
> > can sleep, really.
> >
> > We're just waiting on whole file writeback here, so I don't think
> > there's anything wrong. As long as the i_size was valid at some point in
> > time prior to waiting then you're ok.
> >
> > The question I have is more whether this optimization is still useful.
> >
> > What we do now is just walk the radix tree and wait_on_page_writeback
> > for each page. Do we gain anything by avoiding ranges beyond the current
> > EOF with the pagecache infrastructure of 2017?
> >
>
> If this can be called from anywhere without fs locks, then i_size is not
> known. That has been a problem in the past since i_size may have changed
> on another node. We avoid that in this case due to only changing i_size
> under an exclusive lock, and also only having dirty pages when we have
> an exclusive lock. There is another case though, if the inode is a block
> device, i_size will be zero. That is the case for the address space that
> looks after rgrps for GFS2. We do (luckily!) call
> filemap_fdatawait_range() directly in that case. For "normal" inodes
> though, the address space for metadata is backed by the block device
> inode, so that looks like it might be an issue, since
> fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the
> metamapping. It might potentially be an issue in other cases too,
>
> Steve.
>

Some of those do sound problematic.

Again though, we're only waiting on writeback here, and I assume with
gfs2 that would only be pages that were written on the local node.

Is it possible to have pages that are under writeback and still in the
tree, but that are beyond the current i_size? It seems like that's the
main worrisome case.

--
Jeff Layton <[email protected]>

2017-07-31 12:25:51

by Steven Whitehouse

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

Hi,


On 31/07/17 13:22, Jeff Layton wrote:
> On Mon, 2017-07-31 at 13:05 +0100, Steven Whitehouse wrote:
>> Hi,
>>
>>
>> On 31/07/17 12:44, Jeff Layton wrote:
>>> On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
>>>> Hi,
>>>>
>>>>
>>>> On 31/07/17 12:27, Jeff Layton wrote:
>>>>> On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
>>>>>> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
>>>>>>> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
>>>>>>>> +int file_write_and_wait(struct file *file)
>>>>>>>> +{
>>>>>>>> + int err = 0, err2;
>>>>>>>> + struct address_space *mapping = file->f_mapping;
>>>>>>>> +
>>>>>>>> + if ((!dax_mapping(mapping) && mapping->nrpages) ||
>>>>>>>> + (dax_mapping(mapping) && mapping->nrexceptional)) {
>>>>>>>> + err = filemap_fdatawrite(mapping);
>>>>>>>> + /* See comment of filemap_write_and_wait() */
>>>>>>>> + if (err != -EIO) {
>>>>>>>> + loff_t i_size = i_size_read(mapping->host);
>>>>>>>> +
>>>>>>>> + if (i_size != 0)
>>>>>>>> + __filemap_fdatawait_range(mapping, 0,
>>>>>>>> + i_size - 1);
>>>>>>>> + }
>>>>>>>> + }
>>>>>>> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
>>>>>>> range and ignore i_size. It is much easier than trying to wrap your head
>>>>>>> around possible races with file operations modifying i_size.
>>>>>>>
>>>>>>> Honza
>>>>>> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
>>>>>> as I'm leery of making subtle behavior changes in the actual writeback
>>>>>> behavior. For example:
>>>>>>
>>>>>> -----------------8<----------------
>>>>>> static inline int __filemap_fdatawrite(struct address_space *mapping,
>>>>>> int sync_mode)
>>>>>> {
>>>>>> return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
>>>>>> }
>>>>>>
>>>>>> int filemap_fdatawrite(struct address_space *mapping)
>>>>>> {
>>>>>> return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
>>>>>> }
>>>>>> EXPORT_SYMBOL(filemap_fdatawrite);
>>>>>> -----------------8<----------------
>>>>>>
>>>>>> ...which then sets up the wbc with the right ranges and sync mode and
>>>>>> kicks off writepages. But then, it does the i_size_read to figure out
>>>>>> what range it should wait on (with the shortcut for the size == 0 case).
>>>>>>
>>>>>> My assumption was that it was intentionally designed that way, but I'm
>>>>>> guessing from your comments that it wasn't? If so, then we can turn
>>>>>> file_write_and_wait into a static inline wrapper around
>>>>>> file_write_and_wait_range.
>>>>> FWIW, I did a bit of archaeology in the linux-history tree and found
>>>>> this patch from Marcelo in 2004. Is this optimization still helpful? If
>>>>> not, then that does simplify the code a bit.
>>>>>
>>>>> -------------------8<--------------------
>>>>>
>>>>> [PATCH] small wait_on_page_writeback_range() optimization
>>>>>
>>>>> filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
>>>>> parameter. This is not needed since we know the EOF from the inode. Use
>>>>> that instead.
>>>>>
>>>>> Signed-off-by: Marcelo Tosatti <[email protected]>
>>>>> Signed-off-by: Andrew Morton <[email protected]>
>>>>> Signed-off-by: Linus Torvalds <[email protected]>
>>>>> ---
>>>>> mm/filemap.c | 8 +++++++-
>>>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index 78e18b7639b6..55fb7b4141e4 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
>>>>> */
>>>>> int filemap_fdatawait(struct address_space *mapping)
>>>>> {
>>>>> - return wait_on_page_writeback_range(mapping, 0, -1);
>>>>> + loff_t i_size = i_size_read(mapping->host);
>>>>> +
>>>>> + if (i_size == 0)
>>>>> + return 0;
>>>>> +
>>>>> + return wait_on_page_writeback_range(mapping, 0,
>>>>> + (i_size - 1) >> PAGE_CACHE_SHIFT);
>>>>> }
>>>>> EXPORT_SYMBOL(filemap_fdatawait);
>>>>>
>>>> Does this ever get called in cases where we would not hold fs locks? In
>>>> that case we definitely don't want to be relying on i_size,
>>>>
>>>> Steve.
>>>>
>>> Yes. We can initiate and wait on writeback from any context where you
>>> can sleep, really.
>>>
>>> We're just waiting on whole file writeback here, so I don't think
>>> there's anything wrong. As long as the i_size was valid at some point in
>>> time prior to waiting then you're ok.
>>>
>>> The question I have is more whether this optimization is still useful.
>>>
>>> What we do now is just walk the radix tree and wait_on_page_writeback
>>> for each page. Do we gain anything by avoiding ranges beyond the current
>>> EOF with the pagecache infrastructure of 2017?
>>>
>> If this can be called from anywhere without fs locks, then i_size is not
>> known. That has been a problem in the past since i_size may have changed
>> on another node. We avoid that in this case due to only changing i_size
>> under an exclusive lock, and also only having dirty pages when we have
>> an exclusive lock. There is another case though, if the inode is a block
>> device, i_size will be zero. That is the case for the address space that
>> looks after rgrps for GFS2. We do (luckily!) call
>> filemap_fdatawait_range() directly in that case. For "normal" inodes
>> though, the address space for metadata is backed by the block device
>> inode, so that looks like it might be an issue, since
>> fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the
>> metamapping. It might potentially be an issue in other cases too,
>>
>> Steve.
>>
> Some of those do sound problematic.
>
> Again though, we're only waiting on writeback here, and I assume with
> gfs2 that would only be pages that were written on the local node.
Yes
>
> Is it possible to have pages under writeback and still in the tree,
> but that are beyond the current i_size? It seems like that's the main
> worrisome case.
>
That's what I was wondering too. I'm not 100% sure without some more
detailed investigation. Either way the block device case also seems
problematic, although not impossible to special case I suppose. The real
question is what do we get from this optimisation? Is the pain of
checking correctness worth it for the benefits gained?

Steve.

2017-07-31 12:38:46

by Bob Peterson

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

----- Original Message -----
| > If this can be called from anywhere without fs locks, then i_size is not
| > known. That has been a problem in the past since i_size may have changed
| > on another node. We avoid that in this case due to only changing i_size
| > under an exclusive lock, and also only having dirty pages when we have
| > an exclusive lock. There is another case though, if the inode is a block
| > device, i_size will be zero. That is the case for the address space that
| > looks after rgrps for GFS2. We do (luckily!) call
| > filemap_fdatawait_range() directly in that case. For "normal" inodes
| > though, the address space for metadata is backed by the block device
| > inode, so that looks like it might be an issue, since
| > fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the
| > metamapping. It might potentially be an issue in other cases too,
| >
| > Steve.
| >
|
| Some of those do sound problematic.
|
| Again though, we're only waiting on writeback here, and I assume with
| gfs2 that would only be pages that were written on the local node.
|
| Is it possible to have pages under writeback and still in the tree,
| but that are beyond the current i_size? It seems like that's the main
| worrisome case.
|
| --
| Jeff Layton <[email protected]>

Hi Jeff,

I believe the answer is yes.

I was recently "bitten" by a case where (whether due to a bug or not)
I had blocks allocated in a GFS2 file beyond i_size. I had implemented a
delete algorithm that used i_size, but I found cases where files couldn't
be deleted because of blocks hanging out past EOF. I'm not sure if they
can be in writeback, but possibly. It's already on my "to investigate"
list, but I haven't gotten to it yet. Yes, it seems like a bug. Yes, we
need to fix it. But now there may be lots of legacy file systems out in
the field that have this problem. Not sure if they can get to writeback
until I study the situation more closely.

I believe Ben Marzinski also may have come across a case in which we
can have blocks in writeback that are beyond i_size. See the commit
message on Ben's patch here:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=fd4c5748b8d3f7420e8932ed0bde3d53cc8acc9d

Regards,

Bob Peterson
Red Hat File Systems
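
To make the concern in this sub-thread concrete, here is a small userspace
sketch in C. It is not kernel code: struct sim_mapping, the sim_* helpers
and the 4 KiB page size are all assumptions made up for illustration. It
models a mapping as an array of per-page writeback flags and compares a wait
that is clamped to a snapshot of i_size against one that covers the whole
range up to LLONG_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SIM_PAGE_SHIFT 12   /* 4 KiB pages, an assumption of this model */
#define SIM_NR_PAGES   8

/* Hypothetical stand-in for an address_space: one writeback flag per page. */
struct sim_mapping {
	long long i_size;               /* file size in bytes */
	bool writeback[SIM_NR_PAGES];   /* page is under writeback */
};

/*
 * "Wait" for every page under writeback whose index falls in the byte
 * range [start, end] (inclusive). Returns the number of pages waited on.
 */
int sim_fdatawait_range(struct sim_mapping *m, long long start, long long end)
{
	long long idx = start >> SIM_PAGE_SHIFT;
	long long last = end >> SIM_PAGE_SHIFT;
	int waited = 0;

	for (; idx <= last && idx < SIM_NR_PAGES; idx++) {
		if (m->writeback[idx]) {
			m->writeback[idx] = false;   /* writeback "completed" */
			waited++;
		}
	}
	return waited;
}

/* Variant that clamps the wait to a snapshot of i_size. */
int sim_wait_to_i_size(struct sim_mapping *m)
{
	if (m->i_size == 0)
		return 0;
	return sim_fdatawait_range(m, 0, m->i_size - 1);
}

/* Variant that ignores i_size and covers the whole mapping. */
int sim_wait_whole_file(struct sim_mapping *m)
{
	return sim_fdatawait_range(m, 0, LLONG_MAX);
}

/* Number of pages still under writeback. */
int sim_pages_in_writeback(const struct sim_mapping *m)
{
	int n = 0;

	for (int i = 0; i < SIM_NR_PAGES; i++)
		n += m->writeback[i];
	return n;
}

/* Usage: page 3 is under writeback but lies beyond a 4096-byte i_size. */
void sim_demo(void)
{
	struct sim_mapping m = { .i_size = 4096 };

	m.writeback[0] = true;
	m.writeback[3] = true;
	assert(sim_wait_to_i_size(&m) == 1);       /* page 3 is skipped */
	assert(sim_pages_in_writeback(&m) == 1);
	assert(sim_wait_whole_file(&m) == 1);      /* page 3 is now covered */
	assert(sim_pages_in_writeback(&m) == 0);
}
```

In this model the clamped variant also returns immediately when i_size is
zero, which is the block-device case raised earlier in the thread. Whether
pages beyond EOF can really be under writeback on GFS2 is exactly the open
question here; the sketch only shows what would be missed if they can.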

2017-07-31 13:00:43

by Jeff Layton

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Mon, 2017-07-31 at 14:07 +0200, Jan Kara wrote:
> On Mon 31-07-17 07:44:16, Jeff Layton wrote:
> > On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > > On 31/07/17 12:27, Jeff Layton wrote:
> > > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > > +int file_write_and_wait(struct file *file)
> > > > > > > +{
> > > > > > > + int err = 0, err2;
> > > > > > > + struct address_space *mapping = file->f_mapping;
> > > > > > > +
> > > > > > > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > > + err = filemap_fdatawrite(mapping);
> > > > > > > + /* See comment of filemap_write_and_wait() */
> > > > > > > + if (err != -EIO) {
> > > > > > > + loff_t i_size = i_size_read(mapping->host);
> > > > > > > +
> > > > > > > + if (i_size != 0)
> > > > > > > + __filemap_fdatawait_range(mapping, 0,
> > > > > > > + i_size - 1);
> > > > > > > + }
> > > > > > > + }
> > > > > >
> > > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > > around possible races with file operations modifying i_size.
> > > > > >
> > > > > > Honza
> > > > >
> > > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > > behavior. For example:
> > > > >
> > > > > -----------------8<----------------
> > > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > > int sync_mode)
> > > > > {
> > > > > return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > > }
> > > > >
> > > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > > {
> > > > > return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > > }
> > > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > > -----------------8<----------------
> > > > >
> > > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > >
> > > > > My assumption was that it was intentionally designed that way, but I'm
> > > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > > > file_write_and_wait into a static inline wrapper around
> > > > > file_write_and_wait_range.
> > > >
> > > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > > not, then that does simplify the code a bit.
> > > >
> > > > -------------------8<--------------------
> > > >
> > > > [PATCH] small wait_on_page_writeback_range() optimization
> > > >
> > > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > > parameter. This is not needed since we know the EOF from the inode. Use
> > > > that instead.
> > > >
> > > > Signed-off-by: Marcelo Tosatti <[email protected]>
> > > > Signed-off-by: Andrew Morton <[email protected]>
> > > > Signed-off-by: Linus Torvalds <[email protected]>
> > > > ---
> > > > mm/filemap.c | 8 +++++++-
> > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 78e18b7639b6..55fb7b4141e4 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > > */
> > > > int filemap_fdatawait(struct address_space *mapping)
> > > > {
> > > > - return wait_on_page_writeback_range(mapping, 0, -1);
> > > > + loff_t i_size = i_size_read(mapping->host);
> > > > +
> > > > + if (i_size == 0)
> > > > + return 0;
> > > > +
> > > > + return wait_on_page_writeback_range(mapping, 0,
> > > > + (i_size - 1) >> PAGE_CACHE_SHIFT);
> > > > }
> > > > EXPORT_SYMBOL(filemap_fdatawait);
> > > >
> > >
> > > Does this ever get called in cases where we would not hold fs locks? In
> > > that case we definitely don't want to be relying on i_size,
> > >
> > > Steve.
> > >
> >
> > Yes. We can initiate and wait on writeback from any context where you
> > can sleep, really.
> >
> > We're just waiting on whole file writeback here, so I don't think
> > there's anything wrong. As long as the i_size was valid at some point in
> > time prior to waiting then you're ok.
> >
> > The question I have is more whether this optimization is still useful.
> >
> > What we do now is just walk the radix tree and wait_on_page_writeback
> > for each page. Do we gain anything by avoiding ranges beyond the current
> > EOF with the pagecache infrastructure of 2017?
>
> FWIW I'm not aware of any significant benefit of using i_size in
> filemap_fdatawait() - we iterate to the end of the radix tree node anyway
> since pagevec_lookup_tag() does not support range searches (I'm
> working on fixing that; however, even after that the benefit would still
> be rather marginal).
>
> What Marcelo might have meant even back in 2004 was that if we are in the
> middle of truncate, i_size is already reduced but page cache not truncated
> yet, then filemap_fdatawait() does not have to wait for writeback of
> truncated pages. That might be a noticeable benefit even today if such a
> race happens; however, I'm not sure it's worth optimizing for, and surprises
> arising from randomly snapshotting i_size (which especially for clustered
> filesystems may be out of date) IMHO outweigh the possible advantage.
>
> Honza

Thanks for clarifying.

Given that file_write_and_wait is a new helper function anyway, I'll
just make it a wrapper around file_write_and_wait_range. Since it might
be racy, should we remove this optimization from the "legacy"
filemap_fdatawait / filemap_fdatawait_keep_errors calls?

--
Jeff Layton <[email protected]>

2017-07-31 13:32:49

by Jan Kara

Subject: Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

On Mon 31-07-17 09:00:37, Jeff Layton wrote:
> On Mon, 2017-07-31 at 14:07 +0200, Jan Kara wrote:
> > On Mon 31-07-17 07:44:16, Jeff Layton wrote:
> > > On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > > > On 31/07/17 12:27, Jeff Layton wrote:
> > > > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > > > +int file_write_and_wait(struct file *file)
> > > > > > > > +{
> > > > > > > > + int err = 0, err2;
> > > > > > > > + struct address_space *mapping = file->f_mapping;
> > > > > > > > +
> > > > > > > > + if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > > > + (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > > > + err = filemap_fdatawrite(mapping);
> > > > > > > > + /* See comment of filemap_write_and_wait() */
> > > > > > > > + if (err != -EIO) {
> > > > > > > > + loff_t i_size = i_size_read(mapping->host);
> > > > > > > > +
> > > > > > > > + if (i_size != 0)
> > > > > > > > + __filemap_fdatawait_range(mapping, 0,
> > > > > > > > + i_size - 1);
> > > > > > > > + }
> > > > > > > > + }
> > > > > > >
> > > > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > > > around possible races with file operations modifying i_size.
> > > > > > >
> > > > > > > Honza
> > > > > >
> > > > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > > > behavior. For example:
> > > > > >
> > > > > > -----------------8<----------------
> > > > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > > > int sync_mode)
> > > > > > {
> > > > > > return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > > > }
> > > > > >
> > > > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > > > {
> > > > > > return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > > > }
> > > > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > > > -----------------8<----------------
> > > > > >
> > > > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > > >
> > > > > > My assumption was that it was intentionally designed that way, but I'm
> > > > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > > > > file_write_and_wait into a static inline wrapper around
> > > > > > file_write_and_wait_range.
> > > > >
> > > > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > > > not, then that does simplify the code a bit.
> > > > >
> > > > > -------------------8<--------------------
> > > > >
> > > > > [PATCH] small wait_on_page_writeback_range() optimization
> > > > >
> > > > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > > > parameter. This is not needed since we know the EOF from the inode. Use
> > > > > that instead.
> > > > >
> > > > > Signed-off-by: Marcelo Tosatti <[email protected]>
> > > > > Signed-off-by: Andrew Morton <[email protected]>
> > > > > Signed-off-by: Linus Torvalds <[email protected]>
> > > > > ---
> > > > > mm/filemap.c | 8 +++++++-
> > > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 78e18b7639b6..55fb7b4141e4 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > > > */
> > > > > int filemap_fdatawait(struct address_space *mapping)
> > > > > {
> > > > > - return wait_on_page_writeback_range(mapping, 0, -1);
> > > > > + loff_t i_size = i_size_read(mapping->host);
> > > > > +
> > > > > + if (i_size == 0)
> > > > > + return 0;
> > > > > +
> > > > > + return wait_on_page_writeback_range(mapping, 0,
> > > > > + (i_size - 1) >> PAGE_CACHE_SHIFT);
> > > > > }
> > > > > EXPORT_SYMBOL(filemap_fdatawait);
> > > > >
> > > >
> > > > Does this ever get called in cases where we would not hold fs locks? In
> > > > that case we definitely don't want to be relying on i_size,
> > > >
> > > > Steve.
> > > >
> > >
> > > Yes. We can initiate and wait on writeback from any context where you
> > > can sleep, really.
> > >
> > > We're just waiting on whole file writeback here, so I don't think
> > > there's anything wrong. As long as the i_size was valid at some point in
> > > time prior to waiting then you're ok.
> > >
> > > The question I have is more whether this optimization is still useful.
> > >
> > > What we do now is just walk the radix tree and wait_on_page_writeback
> > > for each page. Do we gain anything by avoiding ranges beyond the current
> > > EOF with the pagecache infrastructure of 2017?
> >
> > FWIW I'm not aware of any significant benefit of using i_size in
> > filemap_fdatawait() - we iterate to the end of the radix tree node anyway
> > since pagevec_lookup_tag() does not support range searches (I'm
> > working on fixing that; however, even after that the benefit would still
> > be rather marginal).
> >
> > What Marcelo might have meant even back in 2004 was that if we are in the
> > middle of truncate, i_size is already reduced but page cache not truncated
> > yet, then filemap_fdatawait() does not have to wait for writeback of
> > truncated pages. That might be a noticeable benefit even today if such a
> > race happens; however, I'm not sure it's worth optimizing for, and surprises
> > arising from randomly snapshotting i_size (which especially for clustered
> > filesystems may be out of date) IMHO outweigh the possible advantage.
> >
> > Honza
>
> Thanks for clarifying.
>
> Given that file_write_and_wait is a new helper function anyway, I'll
> just make it a wrapper around file_write_and_wait_range. Since it might

Agreed.

> be racy, should we remove this optimization from the "legacy"
> filemap_fdatawait / filemap_fdatawait_keep_errors calls?

I'm for it.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-07-31 16:49:31

by Jeffrey Layton

Subject: [PATCH v3] mm: add file_fdatawait_range and file_write_and_wait

From: Jeff Layton <[email protected]>

Necessary now for gfs2_fsync and sync_file_range, but there will
eventually be other callers.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/fs.h | 11 ++++++++++-
mm/filemap.c | 23 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 1 deletion(-)

v3: make file_write_and_wait a wrapper around file_write_and_wait_range

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 526b6a9f30d4..909210bd6366 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2549,6 +2549,8 @@ static inline int filemap_fdatawait(struct address_space *mapping)

extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
loff_t lend);
+extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
+ loff_t lend);
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
@@ -2557,12 +2559,19 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
extern int filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end);
extern int filemap_check_errors(struct address_space *mapping);
-
extern void __filemap_set_wb_err(struct address_space *mapping, int err);
+
+extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
+ loff_t lend);
extern int __must_check file_check_and_advance_wb_err(struct file *file);
extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);

+static inline int file_write_and_wait(struct file *file)
+{
+ return file_write_and_wait_range(file, 0, LLONG_MAX);
+}
+
/**
* filemap_set_wb_err - set a writeback error on an address_space
* @mapping: mapping in which to set writeback error
diff --git a/mm/filemap.c b/mm/filemap.c
index 953804b29a75..85dfe3bee324 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -476,6 +476,29 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
EXPORT_SYMBOL(filemap_fdatawait_range);

/**
+ * file_fdatawait_range - wait for writeback to complete
+ * @file: file pointing to address space structure to wait for
+ * @start_byte: offset in bytes where the range starts
+ * @end_byte: offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the address space that file
+ * refers to, in the given range and wait for all of them. Check error
+ * status of the address space vs. the file->f_wb_err cursor and return it.
+ *
+ * Since the error status of the file is advanced by this function,
+ * callers are responsible for checking the return value and handling and/or
+ * reporting the error.
+ */
+int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ __filemap_fdatawait_range(mapping, start_byte, end_byte);
+ return file_check_and_advance_wb_err(file);
+}
+EXPORT_SYMBOL(file_fdatawait_range);
+
+/**
* filemap_fdatawait_keep_errors - wait for writeback without clearing errors
* @mapping: address space structure to wait for
*
--
2.13.3
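
The kerneldoc above says the helper checks the mapping's error status
against the file->f_wb_err cursor and advances it. The userspace sketch
below illustrates that per-file cursor idea only. It is a deliberately
simplified model, not the kernel's errseq_t: the es_* names and the
separate err/seq fields are assumptions invented here, and the real
implementation is more involved.

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-in for errseq_t: the last error plus a change counter. */
struct es_errseq {
	int err;            /* 0 or a negative errno value */
	unsigned int seq;   /* bumped each time an error is recorded */
};

/* Hypothetical mapping that carries the writeback error state. */
struct es_mapping {
	struct es_errseq wb_err;
};

/* Hypothetical open file with its own cursor into that state. */
struct es_file {
	struct es_mapping *mapping;
	unsigned int f_wb_err;   /* last sequence value this file has seen */
};

/* Sample the current sequence at open time. */
void es_open(struct es_file *f, struct es_mapping *m)
{
	f->mapping = m;
	f->f_wb_err = m->wb_err.seq;
}

/* Record a writeback failure against the mapping. */
void es_set_wb_err(struct es_mapping *m, int err)
{
	m->wb_err.err = err;
	m->wb_err.seq++;
}

/*
 * Report an error only if one was recorded since this file's cursor was
 * last advanced, then advance the cursor so the same error is reported
 * once per open file rather than once per mapping.
 */
int es_check_and_advance(struct es_file *f)
{
	struct es_mapping *m = f->mapping;

	if (f->f_wb_err == m->wb_err.seq)
		return 0;
	f->f_wb_err = m->wb_err.seq;
	return m->wb_err.err;
}

/* Usage: two files open before the error, one opened after it. */
void es_demo(void)
{
	struct es_mapping m = { { 0, 0 } };
	struct es_file a, b, late;

	es_open(&a, &m);
	es_open(&b, &m);
	es_set_wb_err(&m, -EIO);
	es_open(&late, &m);

	assert(es_check_and_advance(&a) == -EIO);   /* first report to a */
	assert(es_check_and_advance(&a) == 0);      /* a's cursor advanced */
	assert(es_check_and_advance(&b) == -EIO);   /* b still sees it */
	assert(es_check_and_advance(&late) == 0);   /* sampled after the error */
}
```

Because the cursor is advanced as a side effect of the check, a caller in
this model that discards the return value loses the error for that file,
which is the reason the kerneldoc asks callers to check and report it.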

2017-08-01 09:52:35

by Jan Kara

Subject: Re: [PATCH v3] mm: add file_fdatawait_range and file_write_and_wait

On Mon 31-07-17 12:49:25, Jeff Layton wrote:
> From: Jeff Layton <[email protected]>
>
> Necessary now for gfs2_fsync and sync_file_range, but there will
> eventually be other callers.
>
> Signed-off-by: Jeff Layton <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/fs.h | 11 ++++++++++-
> mm/filemap.c | 23 +++++++++++++++++++++++
> 2 files changed, 33 insertions(+), 1 deletion(-)
>
> v3: make file_write_and_wait a wrapper around file_write_and_wait_range
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 526b6a9f30d4..909210bd6366 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2549,6 +2549,8 @@ static inline int filemap_fdatawait(struct address_space *mapping)
>
> extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
> loff_t lend);
> +extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
> + loff_t lend);
> extern int filemap_write_and_wait(struct address_space *mapping);
> extern int filemap_write_and_wait_range(struct address_space *mapping,
> loff_t lstart, loff_t lend);
> @@ -2557,12 +2559,19 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
> extern int filemap_fdatawrite_range(struct address_space *mapping,
> loff_t start, loff_t end);
> extern int filemap_check_errors(struct address_space *mapping);
> -
> extern void __filemap_set_wb_err(struct address_space *mapping, int err);
> +
> +extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
> + loff_t lend);
> extern int __must_check file_check_and_advance_wb_err(struct file *file);
> extern int __must_check file_write_and_wait_range(struct file *file,
> loff_t start, loff_t end);
>
> +static inline int file_write_and_wait(struct file *file)
> +{
> + return file_write_and_wait_range(file, 0, LLONG_MAX);
> +}
> +
> /**
> * filemap_set_wb_err - set a writeback error on an address_space
> * @mapping: mapping in which to set writeback error
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 953804b29a75..85dfe3bee324 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -476,6 +476,29 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
> EXPORT_SYMBOL(filemap_fdatawait_range);
>
> /**
> + * file_fdatawait_range - wait for writeback to complete
> + * @file: file pointing to address space structure to wait for
> + * @start_byte: offset in bytes where the range starts
> + * @end_byte: offset in bytes where the range ends (inclusive)
> + *
> + * Walk the list of under-writeback pages of the address space that file
> + * refers to, in the given range and wait for all of them. Check error
> + * status of the address space vs. the file->f_wb_err cursor and return it.
> + *
> + * Since the error status of the file is advanced by this function,
> + * callers are responsible for checking the return value and handling and/or
> + * reporting the error.
> + */
> +int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
> +{
> + struct address_space *mapping = file->f_mapping;
> +
> + __filemap_fdatawait_range(mapping, start_byte, end_byte);
> + return file_check_and_advance_wb_err(file);
> +}
> +EXPORT_SYMBOL(file_fdatawait_range);
> +
> +/**
> * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
> * @mapping: address space structure to wait for
> *
> --
> 2.13.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR