2017-04-12 12:06:25

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 00/17] fs: introduce new writeback error reporting and convert existing API as a wrapper around it

This is the second posting of this patchset. The basic idea with this
set is to tighten up how we handle errors that occur during writeback.
There are many places where writeback errors in the kernel can be lost
today, which can cause silent data corruption.

Also, the current method of clearing errors out of the mapping on fsync
has always been something of a problem for userland. This replaces it
with better defined semantics where userland can ensure that writeback
errors are reported on all open file descriptors.

In addition to incorporating review feedback (mostly from Neil and
Willy), this reworks the existing API for tracking writeback errors into
wrappers around the new API. That will greatly reduce the amount of code
we'll need to change in filesystem specific drivers if it turns out to
be an acceptable approach. If that does seem reasonable, then we can
probably get rid of the new API and just drop it in as replacements for
the old one.

Despite the small diffstat here this is a fairly major change to how
writeback errors are handled, and will need careful review and testing.

This works as expected with some really basic by-hand testing, and it of
course works when there are no errors in play. Once the patchset is a
bit more settled, I do still plan to roll some xfstests to validate that
this works and keeps working as expected.

Comments and feedback are welcome here.

Jeff Layton (17):
mm: drop "wait" parameter from write_one_page
mm: fix mapping_set_error call in me_pagecache_dirty
buffer: use mapping_set_error instead of setting the flag
ext2: don't test/clear AS_EIO flag
orangefs: don't call filemap_write_and_wait from fsync
mm: doc comment for scary spot in write_one_page
fs: new infrastructure for writeback error handling and reporting
fs: retrofit old error reporting API onto new infrastructure
mm: remove AS_EIO and AS_ENOSPC flags
dax: set errors in mapping when writeback fails
nilfs2: set the mapping error when calling SetPageError on writeback
mm: ensure that we set mapping error if writeout() fails
mm: don't TestClearPageError in __filemap_fdatawait_range
9p: set mapping error when writeback fails in launder_page
fuse: set mapping error in writepage_locked when it fails
cifs: set mapping error when page writeback fails in writepage or
launder_pages
cifs: remove some unneeded mapping_set_error calls

Documentation/filesystems/vfs.txt | 13 +-
fs/9p/vfs_addr.c | 5 +-
fs/btrfs/file.c | 10 +-
fs/btrfs/tree-log.c | 9 +-
fs/buffer.c | 2 +-
fs/cifs/cifsfs.c | 4 +-
fs/cifs/file.c | 14 +--
fs/cifs/inode.c | 6 +-
fs/dax.c | 4 +-
fs/exofs/dir.c | 2 +-
fs/ext2/dir.c | 2 +-
fs/ext2/file.c | 8 +-
fs/f2fs/file.c | 3 +
fs/f2fs/node.c | 6 +-
fs/fuse/file.c | 8 +-
fs/jfs/jfs_metapage.c | 4 +-
fs/minix/dir.c | 2 +-
fs/nilfs2/segment.c | 1 +
fs/open.c | 3 +
fs/orangefs/file.c | 5 +-
fs/sync.c | 5 +-
fs/sysv/dir.c | 2 +-
fs/ufs/dir.c | 2 +-
include/linux/fs.h | 24 +++-
include/linux/mm.h | 2 +-
include/linux/pagemap.h | 17 +--
ipc/shm.c | 5 +-
mm/filemap.c | 258 ++++++++++++++++++++++++++++++++------
mm/memory-failure.c | 2 +-
mm/migrate.c | 6 +-
mm/page-writeback.c | 20 +--
31 files changed, 341 insertions(+), 113 deletions(-)

--
2.9.3


2017-04-12 12:06:29

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 02/17] mm: fix mapping_set_error call in me_pagecache_dirty

The error code should be negative. Since this ends up in the default
case anyway, this is harmless, but it's less confusing to negate it.
Also, later patches will require a negative error code here.

Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
---
mm/memory-failure.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 27f7210e7fab..4b56e53e5378 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -674,7 +674,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
* the first EIO, but we're not worse than other parts
* of the kernel.
*/
- mapping_set_error(mapping, EIO);
+ mapping_set_error(mapping, -EIO);
}

return me_pagecache_clean(p, pfn);
--
2.9.3

2017-04-12 12:06:35

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

Not sure what to do here just yet.

Signed-off-by: Jeff Layton <[email protected]>
---
mm/page-writeback.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index de0dbf12e2c1..3ac8399dc984 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
ret = mapping->a_ops->writepage(page, &wbc);
if (ret == 0) {
wait_on_page_writeback(page);
+ /*
+ * FIXME: is this racy? What guarantees that PG_error
+ * will still be set once we get around to checking it?
+ * What if writeback fails, but then a read is issued
+ * before we check this, and that calls ClearPageError?
+ */
if (PageError(page))
ret = -EIO;
}
--
2.9.3

2017-04-12 12:06:45

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 12/17] mm: ensure that we set mapping error if writeout() fails

If writepage fails during a page migration, then we need to ensure that
fsync will see it by flagging the mapping.

Signed-off-by: Jeff Layton <[email protected]>
---
mm/migrate.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index ed97c2c14fa8..a77b9e769d0b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -792,7 +792,11 @@ static int writeout(struct address_space *mapping, struct page *page)
/* unlocked. Relock */
lock_page(page);

- return (rc < 0) ? -EIO : -EAGAIN;
+ if (rc < 0) {
+ mapping_set_error(mapping, rc);
+ return -EIO;
+ }
+ return -EAGAIN;
}

/*
--
2.9.3

2017-04-12 12:06:41

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 04/17] ext2: don't test/clear AS_EIO flag

ext2 does a test+clear of AS_EIO flag. With the new error reporting
infrastructure, we don't need to clear anything. Sample the file's
current error code, and after writeback just report whether any
errors have been seen since then.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/ext2/file.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b21891a6bfca..0ca77d337c94 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -177,7 +177,12 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;

ret = generic_file_fsync(file, start, end, datasync);
- if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
+ if (ret == -EIO)
+ ret = -EIO;
+ else
+ ret = filemap_check_errors(mapping);
+
+ if (ret) {
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
"detected IO error when writing metadata buffers");
--
2.9.3

2017-04-12 12:06:47

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 13/17] mm: don't TestClearPageError in __filemap_fdatawait_range

The -EIO returned here can end up overriding whatever error is marked in
the address space, and be returned at fsync time, even when there is a
more appropriate error stored in the mapping.

Read errors are also sometimes tracked on a per-page level using
PG_error. Suppose we have a read error on a page, and then that page is
subsequently dirtied by overwriting the whole page. Writeback doesn't
clear PG_error, so we can then end up successfully writing back that
page and still return -EIO on fsync.

Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
return from that is silently discarded. Any subsystem that is relying on
PG_error to report errors during fsync can easily lose writeback errors
due to this. All you need is a stray sync() call on the box at the wrong
time and you've lost the error.

Since the handling of the PG_error flag is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there are
writeback errors. Change the TestClearPageError call to ClearPageError,
and make __filemap_fdatawait_range a void return function.

Signed-off-by: Jeff Layton <[email protected]>
---
mm/filemap.c | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 525dddc15abb..b43975ca7a2e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -362,17 +362,16 @@ int filemap_flush(struct address_space *mapping)
}
EXPORT_SYMBOL(filemap_flush);

-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
loff_t start_byte, loff_t end_byte)
{
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
- int ret = 0;

if (end_byte < start_byte)
- goto out;
+ return;

pagevec_init(&pvec, 0);
while ((index <= end) &&
@@ -389,14 +388,11 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
continue;

wait_on_page_writeback(page);
- if (TestClearPageError(page))
- ret = -EIO;
+ ClearPageError(page);
}
pagevec_release(&pvec);
cond_resched();
}
-out:
- return ret;
}

/**
@@ -416,15 +412,10 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
{
- int ret, ret2;
wb_err_t since = READ_ONCE(mapping->wb_err);

- ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
- ret2 = filemap_check_wb_error(mapping, since);
- if (!ret)
- ret = ret2;
-
- return ret;
+ __filemap_fdatawait_range(mapping, start_byte, end_byte);
+ return filemap_check_wb_error(mapping, since);
}
EXPORT_SYMBOL(filemap_fdatawait_range);

--
2.9.3

2017-04-12 12:07:11

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 15/17] fuse: set mapping error in writepage_locked when it fails

This ensures that we see errors on fsync when writeback fails.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/fuse/file.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7972fdb189bf..c098baf668e6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1672,6 +1672,7 @@ static int fuse_writepage_locked(struct page *page)
err_free:
fuse_request_free(req);
err:
+ mapping_set_error(page->mapping, error);
end_page_writeback(page);
return error;
}
--
2.9.3

2017-04-12 12:06:39

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

Now that we have a better way to store and report errors that occur
during writeback, we need to convert the existing codebase to use it. We
could just adapt all of the filesystem code and related infrastructure
to the new API, but that's a lot of churn.

When it comes to setting errors in the mapping, filemap_set_wb_error is
a drop-in replacement for mapping_set_error. Turn that function into a
simple wrapper around the new one.

Because we want to ensure that writeback errors are always reported at
fsync time, inject filemap_report_wb_error calls much closer to the
syscall boundary. For fsync calls (and things like the nfsd equivalent),
we either return the error that the fsync operation returns, or the one
returned by filemap_report_wb_error. In both cases, we advance the
file->f_wb_err to the latest value.

The final piece of the puzzle is what to do about filemap_check_errors
calls that are being called directly or via filemap_* functions. Here,
we must take a little "creative license".

Since we now handle advancing the file->f_wb_err value at the generic
filesystem layer, we no longer need those callers to do anything like
that, or clear errors out of the mapping. A lot of the existing codebase
relies on being getting an error back from those functions when there is
a writeback problem, so we do still want to have them report writeback
errors somehow.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can at least report
the latest error that has occurred since an arbitrary point in time
(represented as a sampled wb_err_t value).

In the case where we don't have a struct file to work with, this patch
just has the wrappers sample the mapping->wb_err value, and use that as
an arbitrary point from which to check for errors.

That's probably not ideal, particularly in the case of something like
filemap_fdatawait, but I'm not sure it's any worse than what we
already have, and this gives us a basis from which to work.

A lot of those callers will likely want to change to a model where they
sample a wb_err_t much earlier (perhaps when starting a transaction) and
then use that when checking to see if an error occurred later. I'm open
to adding new functions to the API to enable that in later patches.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/vfs.txt | 9 ++++-----
fs/btrfs/file.c | 10 ++--------
fs/btrfs/tree-log.c | 9 ++-------
fs/ext2/file.c | 3 ++-
fs/f2fs/file.c | 3 +++
fs/f2fs/node.c | 6 +-----
fs/fuse/file.c | 7 +++++--
fs/sync.c | 5 ++++-
include/linux/fs.h | 1 -
include/linux/pagemap.h | 7 +------
ipc/shm.c | 5 ++++-
mm/filemap.c | 32 ++++++++++----------------------
12 files changed, 38 insertions(+), 59 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 1e0119f1c46a..493bf052fa33 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -577,7 +577,7 @@ written at any point after PG_Dirty is clear. Once it is known to be
safe, PG_Writeback is cleared.

If there is an error during writeback, then the address_space should be
-marked with an error (typically using filemap_set_wb_error), in order to
+marked with an error (typically using mapping_set_error), in order to
ensure that the error can later be reported to the application when an
fsync is issued.

@@ -895,10 +895,9 @@ otherwise noted.

release: called when the last reference to an open file is closed

- fsync: called by the fsync(2) system call. Filesystems that use the
- pagecache should call filemap_report_wb_error before returning
- to ensure that any errors that occurred during writeback are
- reported and the file's error sequence advanced.
+ fsync: called by the fsync(2) system call. Errors that were previously
+ recorded using mapping_set_error will automatically be returned to
+ the application and the file's error sequence advanced.

fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..8ae0884350dd 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1962,6 +1962,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
int ret = 0;
bool full_sync = 0;
u64 len;
+ wb_err_t wb_since = READ_ONCE(file->f_wb_err);

/*
* The range length can be represented by u64, we have to do the typecasts
@@ -2079,14 +2080,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
*/
clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
&BTRFS_I(inode)->runtime_flags);
- /*
- * An ordered extent might have started before and completed
- * already with io errors, in which case the inode was not
- * updated and we end up here. So check the inode's mapping
- * flags for any errors that might have happened while doing
- * writeback of file data.
- */
- ret = filemap_check_errors(inode->i_mapping);
+ ret = filemap_check_wb_error(inode->i_mapping, wb_since);
inode_unlock(inode);
goto out;
}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a59674c3e69e..b7f5dabae999 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3972,12 +3972,6 @@ static int wait_ordered_extents(struct btrfs_trans_handle *trans,
test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));

if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
- /*
- * Clear the AS_EIO/AS_ENOSPC flags from the inode's
- * i_mapping flags, so that the next fsync won't get
- * an outdated io error too.
- */
- filemap_check_errors(inode->i_mapping);
*ordered_io_error = true;
break;
}
@@ -4171,6 +4165,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
u64 test_gen;
int ret = 0;
int num = 0;
+ wb_err_t since = filemap_sample_wb_error(inode->vfs_inode.i_mapping);

INIT_LIST_HEAD(&extents);

@@ -4214,7 +4209,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
* without writing to the log tree and the fsync must report the
* file data write error and not commit the current transaction.
*/
- ret = filemap_check_errors(inode->vfs_inode.i_mapping);
+ ret = filemap_check_wb_error(inode->vfs_inode.i_mapping, since);
if (ret)
ctx->io_err = ret;
process:
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca77d337c94..bfb87969ad9b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -175,12 +175,13 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
int ret;
struct super_block *sb = file->f_mapping->host->i_sb;
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ wb_err_t since = READ_ONCE(file->f_wb_err);

ret = generic_file_fsync(file, start, end, datasync);
if (ret == -EIO)
ret = -EIO;
else
- ret = filemap_check_errors(mapping);
+ ret = filemap_check_wb_error(mapping, since);

if (ret) {
/* We don't really know where the IO error happened... */
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5f7317875a67..2ab4a28aa933 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
.nr_to_write = LONG_MAX,
.for_reclaim = 0,
};
+ wb_err_t since = READ_ONCE(file->f_wb_err);

if (unlikely(f2fs_readonly(inode->i_sb)))
return 0;
@@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
}

ret = wait_on_node_pages_writeback(sbi, ino);
+ if (ret == 0)
+ ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
if (ret)
goto out;

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 481aa8dc79f4..b3ef9504fd8b 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1630,7 +1630,7 @@ int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino)
{
pgoff_t index = 0, end = ULONG_MAX;
struct pagevec pvec;
- int ret2, ret = 0;
+ int ret = 0;

pagevec_init(&pvec, 0);

@@ -1658,10 +1658,6 @@ int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino)
pagevec_release(&pvec);
cond_resched();
}
-
- ret2 = filemap_check_errors(NODE_MAPPING(sbi));
- if (!ret)
- ret = ret2;
return ret;
}

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ec238fb5a584..7972fdb189bf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -398,6 +398,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
struct fuse_req *req;
struct fuse_flush_in inarg;
int err;
+ wb_err_t since = READ_ONCE(file->f_wb_err);

if (is_bad_inode(inode))
return -EIO;
@@ -413,7 +414,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
fuse_sync_writes(inode);
inode_unlock(inode);

- err = filemap_check_errors(file->f_mapping);
+ err = filemap_check_wb_error(file->f_mapping, since);
if (err)
return err;

@@ -446,6 +447,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
FUSE_ARGS(args);
struct fuse_fsync_in inarg;
int err;
+ wb_err_t since;

if (is_bad_inode(inode))
return -EIO;
@@ -461,6 +463,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
if (err)
goto out;

+ since = READ_ONCE(file->f_wb_err);
fuse_sync_writes(inode);

/*
@@ -468,7 +471,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
* filemap_write_and_wait_range() does not catch errors.
* We have to do this directly after fuse_sync_writes()
*/
- err = filemap_check_errors(file->f_mapping);
+ err = filemap_check_wb_error(file->f_mapping, since);
if (err)
goto out;

diff --git a/fs/sync.c b/fs/sync.c
index 11ba023434b1..cfde7a2b88ec 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -182,6 +182,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
*/
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
+ int ret, ret2;
struct inode *inode = file->f_mapping->host;

if (!file->f_op->fsync)
@@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
spin_unlock(&inode->i_lock);
mark_inode_dirty_sync(inode);
}
- return call_fsync(file, start, end, datasync);
+ ret = call_fsync(file, start, end, datasync);
+ ret2 = filemap_report_wb_error(file);
+ return ret ? : ret2;
}
EXPORT_SYMBOL(vfs_fsync_range);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bcc791d43c6e..bd2cfee92b4f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2532,7 +2532,6 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end, int sync_mode);
extern int filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end);
-extern int filemap_check_errors(struct address_space *mapping);
extern void __filemap_set_wb_error(struct address_space *mapping, int err);
extern wb_err_t filemap_sample_wb_error(struct address_space *mapping);
extern int filemap_report_wb_error(struct file *file);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 84943e8057ef..458caa916244 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -30,12 +30,7 @@ enum mapping_flags {

static inline void mapping_set_error(struct address_space *mapping, int error)
{
- if (unlikely(error)) {
- if (error == -ENOSPC)
- set_bit(AS_ENOSPC, &mapping->flags);
- else
- set_bit(AS_EIO, &mapping->flags);
- }
+ return filemap_set_wb_error(mapping, error);
}

static inline void mapping_set_unevictable(struct address_space *mapping)
diff --git a/ipc/shm.c b/ipc/shm.c
index 481d2a9c298a..5d8df57cfc76 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -448,11 +448,14 @@ static int shm_release(struct inode *ino, struct file *file)

static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
{
+ int ret, ret2;
struct shm_file_data *sfd = shm_file_data(file);

if (!sfd->file->f_op->fsync)
return -EINVAL;
- return call_fsync(sfd->file, start, end, datasync);
+ ret = call_fsync(sfd->file, start, end, datasync);
+ ret2 = filemap_report_wb_error(file);
+ return ret ? : ret2;
}

static long shm_fallocate(struct file *file, int mode, loff_t offset,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4966e9dea945..525dddc15abb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -295,20 +295,6 @@ void delete_from_page_cache(struct page *page)
}
EXPORT_SYMBOL(delete_from_page_cache);

-int filemap_check_errors(struct address_space *mapping)
-{
- int ret = 0;
- /* Check for outstanding write errors */
- if (test_bit(AS_ENOSPC, &mapping->flags) &&
- test_and_clear_bit(AS_ENOSPC, &mapping->flags))
- ret = -ENOSPC;
- if (test_bit(AS_EIO, &mapping->flags) &&
- test_and_clear_bit(AS_EIO, &mapping->flags))
- ret = -EIO;
- return ret;
-}
-EXPORT_SYMBOL(filemap_check_errors);
-
/**
* __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
* @mapping: address space structure to write
@@ -431,9 +417,10 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
{
int ret, ret2;
+ wb_err_t since = READ_ONCE(mapping->wb_err);

ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
- ret2 = filemap_check_errors(mapping);
+ ret2 = filemap_check_wb_error(mapping, since);
if (!ret)
ret = ret2;

@@ -489,6 +476,7 @@ EXPORT_SYMBOL(filemap_fdatawait);
int filemap_write_and_wait(struct address_space *mapping)
{
int err = 0;
+ wb_err_t since = READ_ONCE(mapping->wb_err);

if ((!dax_mapping(mapping) && mapping->nrpages) ||
(dax_mapping(mapping) && mapping->nrexceptional)) {
@@ -500,12 +488,12 @@ int filemap_write_and_wait(struct address_space *mapping)
* thing (e.g. bug) happened, so we avoid waiting for it.
*/
if (err != -EIO) {
- int err2 = filemap_fdatawait(mapping);
+ filemap_fdatawait_keep_errors(mapping);
if (!err)
- err = err2;
+ err = filemap_check_wb_error(mapping, since);
}
} else {
- err = filemap_check_errors(mapping);
+ err = filemap_check_wb_error(mapping, since);
}
return err;
}
@@ -526,6 +514,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
int err = 0;
+ wb_err_t since = READ_ONCE(mapping->wb_err);

if ((!dax_mapping(mapping) && mapping->nrpages) ||
(dax_mapping(mapping) && mapping->nrexceptional)) {
@@ -533,13 +522,12 @@ int filemap_write_and_wait_range(struct address_space *mapping,
WB_SYNC_ALL);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
- int err2 = filemap_fdatawait_range(mapping,
- lstart, lend);
+ __filemap_fdatawait_range(mapping, lstart, lend);
if (!err)
- err = err2;
+ err = filemap_check_wb_error(mapping, since);
}
} else {
- err = filemap_check_errors(mapping);
+ err = filemap_check_wb_error(mapping, since);
}
return err;
}
--
2.9.3

2017-04-12 12:06:34

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 01/17] mm: drop "wait" parameter from write_one_page

The callers all set it to 1. Also, make it clear that this function will
not set any sort of AS_* error, and that the caller must do so if
necessary. No existing caller uses this on normal files, so none of them
need it.

Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
---
fs/exofs/dir.c | 2 +-
fs/ext2/dir.c | 2 +-
fs/jfs/jfs_metapage.c | 4 ++--
fs/minix/dir.c | 2 +-
fs/sysv/dir.c | 2 +-
fs/ufs/dir.c | 2 +-
include/linux/mm.h | 2 +-
mm/page-writeback.c | 14 +++++++-------
8 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 42f9a0a0c4ca..e163ed980c20 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -72,7 +72,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
set_page_dirty(page);

if (IS_DIRSYNC(dir))
- err = write_one_page(page, 1);
+ err = write_one_page(page);
else
unlock_page(page);

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d9650c9508e4..e2709695b177 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -100,7 +100,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
}

if (IS_DIRSYNC(dir)) {
- err = write_one_page(page, 1);
+ err = write_one_page(page);
if (!err)
err = sync_inode_metadata(dir, 1);
} else {
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 489aaa1403e5..744fa3c079e6 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
get_page(page);
lock_page(page);
set_page_dirty(page);
- write_one_page(page, 1);
+ write_one_page(page);
clear_bit(META_forcewrite, &mp->flag);
put_page(page);
}
@@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
set_page_dirty(page);
if (test_bit(META_sync, &mp->flag)) {
clear_bit(META_sync, &mp->flag);
- write_one_page(page, 1);
+ write_one_page(page);
lock_page(page); /* write_one_page unlocks the page */
}
} else if (mp->lsn) /* discard_metapage doesn't remove it */
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index 7edc9b395700..baa9721f1299 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -57,7 +57,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
- err = write_one_page(page, 1);
+ err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 5bdae85ceef7..f5191cb2c947 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -45,7 +45,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
- err = write_one_page(page, 1);
+ err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index de01b8f2aa78..48609f1d9580 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -53,7 +53,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
- err = write_one_page(page, 1);
+ err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00a8fa7e366a..f25b76486645 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
extern int filemap_page_mkwrite(struct vm_fault *vmf);

/* mm/page-writeback.c */
-int write_one_page(struct page *page, int wait);
+int write_one_page(struct page *page);
void task_dirty_inc(struct task_struct *tsk);

/* readahead.c */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d8ac2a7fb9e7..de0dbf12e2c1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2361,15 +2361,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
}

/**
- * write_one_page - write out a single page and optionally wait on I/O
+ * write_one_page - write out a single page and wait on I/O
* @page: the page to write
- * @wait: if true, wait on writeout
*
* The page must be locked by the caller and will be unlocked upon return.
*
- * write_one_page() returns a negative error code if I/O failed.
+ * write_one_page() returns a negative error code if I/O failed. Note that
+ * the address_space is not marked for error. The caller must do this if
+ * needed.
*/
-int write_one_page(struct page *page, int wait)
+int write_one_page(struct page *page)
{
struct address_space *mapping = page->mapping;
int ret = 0;
@@ -2380,13 +2381,12 @@ int write_one_page(struct page *page, int wait)

BUG_ON(!PageLocked(page));

- if (wait)
- wait_on_page_writeback(page);
+ wait_on_page_writeback(page);

if (clear_page_dirty_for_io(page)) {
get_page(page);
ret = mapping->a_ops->writepage(page, &wbc);
- if (ret == 0 && wait) {
+ if (ret == 0) {
wait_on_page_writeback(page);
if (PageError(page))
ret = -EIO;
--
2.9.3

2017-04-12 12:08:03

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 16/17] cifs: set mapping error when page writeback fails in writepage or launder_pages

Signed-off-by: Jeff Layton <[email protected]>
---
fs/cifs/file.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index aa3debbba826..d9a3657b8d43 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct writeback_control *wbc)
set_page_writeback(page);
retry_write:
rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
- if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL)
+ if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) {
goto retry_write;
- else if (rc == -EAGAIN)
+ } else if (rc == -EAGAIN) {
redirty_page_for_writepage(wbc, page);
- else if (rc != 0)
+ } else if (rc != 0) {
SetPageError(page);
- else
+ mapping_set_error(page->mapping, rc);
+ } else {
SetPageUptodate(page);
+ }
end_page_writeback(page);
put_page(page);
free_xid(xid);
--
2.9.3

2017-04-12 12:08:01

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 17/17] cifs: remove some unneeded mapping_set_error calls

Now that writeback errors are handled on a per-file basis using the new
sequence counter method at the vfs layer, we no longer need to re-set
errors in the mapping after doing writeback in non-fsync codepaths.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/cifs/cifsfs.c | 4 +---
fs/cifs/file.c | 4 +---
fs/cifs/inode.c | 6 ++----
3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index dd3f5fabfdf6..017a2d1d02c7 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -829,10 +829,8 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)
if (!CIFS_CACHE_READ(CIFS_I(inode)) && inode->i_mapping &&
inode->i_mapping->nrpages != 0) {
rc = filemap_fdatawait(inode->i_mapping);
- if (rc) {
- mapping_set_error(inode->i_mapping, rc);
+ if (rc)
return rc;
- }
}
/*
* Some applications poll for the file length in this strange
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index d9a3657b8d43..563c298c2a2a 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -722,9 +722,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
cinode = CIFS_I(inode);

if (can_flush) {
- rc = filemap_write_and_wait(inode->i_mapping);
- mapping_set_error(inode->i_mapping, rc);
-
+ filemap_write_and_wait(inode->i_mapping);
if (tcon->unix_ext)
rc = cifs_get_inode_info_unix(&inode, full_path,
inode->i_sb, xid);
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index b261db34103c..f789d44d4b73 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2178,8 +2178,7 @@ cifs_setattr_unix(struct dentry *direntry, struct iattr *attrs)
* will be truncated anyway? Also, should we error out here if
* the flush returns error?
*/
- rc = filemap_write_and_wait(inode->i_mapping);
- mapping_set_error(inode->i_mapping, rc);
+ filemap_write_and_wait(inode->i_mapping);
rc = 0;

if (attrs->ia_valid & ATTR_SIZE) {
@@ -2321,8 +2320,7 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs)
* will be truncated anyway? Also, should we error out here if
* the flush returns error?
*/
- rc = filemap_write_and_wait(inode->i_mapping);
- mapping_set_error(inode->i_mapping, rc);
+ filemap_write_and_wait(inode->i_mapping);
rc = 0;

if (attrs->ia_valid & ATTR_SIZE) {
--
2.9.3

2017-04-12 12:08:34

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 10/17] dax: set errors in mapping when writeback fails

In order to get proper error codes from fsync, we must set an error in
the mapping range when writeback fails.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/dax.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85abd741253d..9b6b04030c3f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,

ret = dax_writeback_one(bdev, mapping, indices[i],
pvec.pages[i]);
- if (ret < 0)
+ if (ret < 0) {
+ mapping_set_error(mapping, ret);
return ret;
+ }
}
}
return 0;
--
2.9.3

2017-04-12 12:08:36

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 05/17] orangefs: don't call filemap_write_and_wait from fsync

Orangefs doesn't do buffered writes yet, so there's no point in
initiating and waiting for writeback.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/orangefs/file.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index e6bbc8083d77..17ab42c4db52 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -646,14 +646,11 @@ static int orangefs_fsync(struct file *file,
loff_t end,
int datasync)
{
- int ret = -EINVAL;
+ int ret;
struct orangefs_inode_s *orangefs_inode =
ORANGEFS_I(file_inode(file));
struct orangefs_kernel_op_s *new_op = NULL;

- /* required call */
- filemap_write_and_wait_range(file->f_mapping, start, end);
-
new_op = op_alloc(ORANGEFS_VFS_OP_FSYNC);
if (!new_op)
return -ENOMEM;
--
2.9.3

2017-04-12 12:09:11

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 14/17] 9p: set mapping error when writeback fails in launder_page

launder_page is just writeback under the page lock. We still need to
mark the mapping for errors there when they occur.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/9p/vfs_addr.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index adaf6f6dd858..7af6e6501698 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -223,8 +223,11 @@ static int v9fs_launder_page(struct page *page)
v9fs_fscache_wait_on_page_write(inode, page);
if (clear_page_dirty_for_io(page)) {
retval = v9fs_vfs_writepage_locked(page);
- if (retval)
+ if (retval) {
+ if (retval != -EAGAIN)
+ mapping_set_error(page->mapping, retval);
return retval;
+ }
}
return 0;
}
--
2.9.3

2017-04-12 12:09:58

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 11/17] nilfs2: set the mapping error when calling SetPageError on writeback

In a later patch, we're going to want to make the fsync codepath not do
a TestClearPageError call as that can override the error set in the
address space. To do that though, we need to ensure that filesystems
that are relying on the PG_error bit for reporting writeback errors
also set an error in the address space.

Ensure that this is set in nilfs2.

Cc: Ryusuke Konishi <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nilfs2/segment.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index febed1217b3f..612d4b446793 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1745,6 +1745,7 @@ static void nilfs_end_page_io(struct page *page, int err)
} else {
__set_page_dirty_nobuffers(page);
SetPageError(page);
+ mapping_set_error(page_mapping(page), err);
}

end_page_writeback(page);
--
2.9.3

2017-04-12 12:10:20

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 09/17] mm: remove AS_EIO and AS_ENOSPC flags

They're no longer used.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/pagemap.h | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 458caa916244..533591ff7c9c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -19,13 +19,11 @@
* Bits in mapping->flags.
*/
enum mapping_flags {
- AS_EIO = 0, /* IO error on async write */
- AS_ENOSPC = 1, /* ENOSPC on async write */
- AS_MM_ALL_LOCKS = 2, /* under mm_take_all_locks() */
- AS_UNEVICTABLE = 3, /* e.g., ramdisk, SHM_LOCK */
- AS_EXITING = 4, /* final truncate in progress */
+ AS_MM_ALL_LOCKS = 0, /* under mm_take_all_locks() */
+ AS_UNEVICTABLE = 1, /* e.g., ramdisk, SHM_LOCK */
+ AS_EXITING = 2, /* final truncate in progress */
/* writeback related tags are not used */
- AS_NO_WRITEBACK_TAGS = 5,
+ AS_NO_WRITEBACK_TAGS = 3,
};

static inline void mapping_set_error(struct address_space *mapping, int error)
--
2.9.3

2017-04-12 12:10:40

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, there is no reason
for me to assume that it is because some previous writeback failed.

This patch adds a small bit of new infrastructure for setting and
reporting errors during pagecache writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fd may not be associated with one another in any way. They could even be
in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds a wb_error field and a sequence counter to the
address_space, and a corresponding sequence counter in the struct file.
When errors are reported during writeback, we set the error field in the
mapping and increment the sequence counter.

When fsync or flush is called, we check the sequence in the file vs. the
one in the mapping. If the file's counter is behind the one in the
mapping, then we update the sequence counter in the file to the value of
the one in the mapping and report the error. If the file is "caught up"
then we just report 0.

This changes the semantics of fsync such that applications can now use
it to determine whether there were any writeback errors since fsync(fd)
was last called (or since the file was opened in the case of fsync
having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The basic idea here is for filesystems to use filemap_set_wb_error to
set the error in the mapping when there are writeback errors, and then
have the fsync and flush operations call filemap_report_wb_error just
before returning to ensure that those errors get reported properly.

Eventually, it may make sense to move the reporting into the generic
vfs_fsync_range helper, but doing it this way for now makes it simpler
to convert filesystems to the new API individually.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/vfs.txt | 14 ++-
fs/open.c | 3 +
include/linux/fs.h | 23 +++++
mm/filemap.c | 209 ++++++++++++++++++++++++++++++++++++++
4 files changed, 247 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94dd27ef4a76..1e0119f1c46a 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,6 +576,11 @@ should clear PG_Dirty and set PG_Writeback. It can be actually
written at any point after PG_Dirty is clear. Once it is known to be
safe, PG_Writeback is cleared.

+If there is an error during writeback, then the address_space should be
+marked with an error (typically using filemap_set_wb_error), in order to
+ensure that the error can later be reported to the application when an
+fsync is issued.
+
Writeback makes use of a writeback_control structure...

struct address_space_operations
@@ -884,11 +889,16 @@ otherwise noted.
"private_data" member in the file structure if you want to point
to a device structure

- flush: called by the close(2) system call to flush a file
+ flush: called by the close(2) system call to flush a file. Writeback
+ errors not previously reported via fsync are allowed to be reported
+ here as you would for fsync.

release: called when the last reference to an open file is closed

- fsync: called by the fsync(2) system call
+ fsync: called by the fsync(2) system call. Filesystems that use the
+ pagecache should call filemap_report_wb_error before returning
+ to ensure that any errors that occurred during writeback are
+ reported and the file's error sequence advanced.

fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..88bfed8d3c88 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
f->f_inode = inode;
f->f_mapping = inode->i_mapping;

+ /* Ensure that we skip any errors that predate opening of the file */
+ f->f_wb_err = filemap_sample_wb_error(f->f_mapping);
+
if (unlikely(f->f_flags & O_PATH)) {
f->f_mode = FMODE_PATH;
f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..bcc791d43c6e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -376,6 +376,16 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);

+/*
+ * A wb_err_t is a combination of error value, sequence counter and flag that
+ * is used to track errors that occur during writeback. When a new writeback
+ * error occurs, we set the error field in it and increment the sequence
+ * counter if the current value has been fetched since it was last set.
+ *
+ * See the filemap_*_wb_error functions for details.
+ */
+typedef u32 wb_err_t;
+
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
@@ -394,6 +404,7 @@ struct address_space {
gfp_t gfp_mask; /* implicit gfp mask for allocations */
struct list_head private_list; /* ditto */
void *private_data; /* ditto */
+ wb_err_t wb_err;
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -846,6 +857,7 @@ struct file {
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
+ wb_err_t f_wb_err;
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
@@ -2521,6 +2533,17 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
extern int filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end);
extern int filemap_check_errors(struct address_space *mapping);
+extern void __filemap_set_wb_error(struct address_space *mapping, int err);
+extern wb_err_t filemap_sample_wb_error(struct address_space *mapping);
+extern int filemap_report_wb_error(struct file *file);
+extern int filemap_check_wb_error(struct address_space *mapping, wb_err_t since);
+
+static inline void filemap_set_wb_error(struct address_space *mapping, int err)
+{
+ /* Optimize for the common case of no error */
+ if (unlikely(err))
+ __filemap_set_wb_error(mapping, err);
+}

extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623a6289..4966e9dea945 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -545,6 +545,215 @@ int filemap_write_and_wait_range(struct address_space *mapping,
}
EXPORT_SYMBOL(filemap_write_and_wait_range);

+/*
+ * The wb_err field in the address_space provides a place to store writeback
+ * errors. We endeavor to deliver writeback errors to fsync on all open file
+ * descriptors that were open at the time that the error was caught. We do
+ * this using a 32-bit value to store the error, with the upper bits as a
+ * sequence counter. We can store any error up to MAX_ERRNO.
+ *
+ * Additionally, we reserve one bit to indicate whether any fd has grabbed the
+ * value to record in its struct file. If nothing has, then we don't really
+ * need to increment the counter.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define WB_ERR_SHIFT ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define WB_ERR_SEEN (1 << WB_ERR_SHIFT)
+
+/* Increment the counter by this much to ensure that we don't touch earlier
+ * values */
+#define WB_ERR_CTR_INC (1 << (WB_ERR_SHIFT + 1))
+
+/**
+ * __filemap_set_wb_error - set the wb error in the mapping for later reporting
+ * @mapping: mapping in which the error should be set
+ * @err: error to set. must be negative value but not less than -MAX_ERRNO
+ *
+ * When an error occurs during writeback of inode data, we must report that
+ * error during fsync. This function sets the writeback error field in the
+ * mapping, and increments the sequence counter. When fsync or close is later
+ * performed, the caller can then check the sequence in the mapping against
+ * the one in the file to determine whether the error should be reported.
+ *
+ * Because there are so few bits for the counter, we try to avoid incrementing
+ * it unless someone is going to record the value for later comparison. This
+ * is tracked by a bit in the 32 bit word that we use as a "seen" flag.
+ *
+ * Note that we always use the latest writeback error, since POSIX states
+ * that when there are multiple errors (e.g. -EIO followed by -ENOSPC),
+ * that any possible error may be returned.
+ *
+ * Most callers will want to use the filemap_set_wb_error wrapper to
+ * efficiently handle the common case where err is 0.
+ */
+void __filemap_set_wb_error(struct address_space *mapping, int err)
+{
+ wb_err_t old;
+
+ /* MAX_ERRNO must be able to serve as a mask */
+ BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
+
+ /*
+ * Ensure the error code actually fits where we want it to go. If it
+ * doesn't then just throw a warning and don't record anything. We
+ * also don't accept zero here as that would effectively clear a
+ * previous error.
+ */
+ if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
+ "err = %d\n", err))
+ return;
+
+ old = READ_ONCE(mapping->wb_err);
+ for (;;) {
+ wb_err_t new, cur;
+
+ /* Clear out error bits and set new error */
+ new = (old & ~(MAX_ERRNO|WB_ERR_SEEN)) | -err;
+
+ /* Only increment if someone has looked at it */
+ if (old & WB_ERR_SEEN)
+ new += WB_ERR_CTR_INC;
+
+ /* If there would be no change, then call it done */
+ if (new == old)
+ break;
+
+ /* Try to swap the new value into place */
+ cur = cmpxchg(&mapping->wb_err, old, new);
+
+ /*
+ * Call it success if we did the swap or someone else beat us
+ * to it for the same value.
+ */
+ if (likely(cur == old || cur == new))
+ break;
+
+ /* Raced with an update, try again */
+ old = cur;
+ }
+}
+EXPORT_SYMBOL(__filemap_set_wb_error);
+
+/**
+ * filemap_sample_wb_error - grab current wb_err_t value for mapping
+ * @mapping: the mapping from which to sample the error
+ *
+ * This function allows callers to "sample" the wb_err_t value from the
+ * mapping, marking it as "seen" if required.
+ *
+ * Note that we handle the common case where we've had no writeback errors
+ * as a special case. We don't need to mark the SEEN bit in that case since
+ * an all-zeroed out wb_err_t will only ever exist at first initialization.
+ */
+wb_err_t filemap_sample_wb_error(struct address_space *mapping)
+{
+ wb_err_t old = READ_ONCE(mapping->wb_err);
+ wb_err_t new = old;
+
+ /*
+ * For the common case of no errors ever having been set, we can skip
+ * marking the SEEN bit. Once an error has been set, the value will
+ * never go back to zero.
+ */
+ if (old != 0) {
+ new |= WB_ERR_SEEN;
+ if (old != new)
+ cmpxchg(&mapping->wb_err, old, new);
+ }
+ return new;
+}
+EXPORT_SYMBOL(filemap_sample_wb_error);
+
+/**
+ * filemap_report_wb_error - report wb error (if any) that was previously set
+ * @file: struct file on which the error is being reported
+ *
+ * When userland calls fsync (or something like nfsd does the equivalent), we
+ * want to report any writeback errors that occurred since the last fsync (or
+ * since the file was opened if there haven't been any).
+ *
+ * Grab the wb_err from the mapping. If it matches what we have in the file,
+ * then just quickly return 0. The file is all caught up.
+ *
+ * If it doesn't match, then take the mapping value, set the "seen" flag in
+ * it and try to swap it into place. If it works, or another task beat us
+ * to it with the new value, then update the f_wb_err and return the error
+ * portion. The error at this point must be reported via proper channels
+ * (a'la fsync, or NFS COMMIT operation, etc.).
+ *
+ * While we handle mapping->wb_err with atomic operations, the f_wb_err
+ * value is protected by the f_lock since we must ensure that it reflects
+ * the latest value swapped in for this file descriptor.
+ */
+int filemap_report_wb_error(struct file *file)
+{
+ int err = 0;
+ struct address_space *mapping = file->f_mapping;
+ wb_err_t old, new;
+
+ /*
+ * Catch the common case where nothing has changed without locking
+ *
+ * We always store values with the "seen" bit set (except in the case
+ * where the entire thing is 0), so if this matches what we already
+ * have, then we can call it done. There is nothing to update in that
+ * case.
+ */
+ if (likely(READ_ONCE(mapping->wb_err) == READ_ONCE(file->f_wb_err)))
+ goto out;
+
+ /* Something changed, must use slow path */
+ spin_lock(&file->f_lock);
+ /* Fetch again to make sure we get the latest */
+ old = READ_ONCE(mapping->wb_err);
+
+ if (likely(old != file->f_wb_err)) {
+ /*
+ * Set the flag and try to swap it into place if it has
+ * changed.
+ *
+ * We don't care about the outcome of the swap here. If the
+ * swap doesn't occur, then it has either been updated by a
+ * writer who is bumping the seq count anyway, or another
+ * reader who is just setting the "seen" flag. Either outcome
+ * is OK here, and we can advance f_wb_err and return an
+ * error based on what we have.
+ */
+ new = old | WB_ERR_SEEN;
+ if (new != old)
+ cmpxchg(&mapping->wb_err, old, new);
+ file->f_wb_err = new;
+ err = -(new & MAX_ERRNO);
+ }
+ spin_unlock(&file->f_lock);
+out:
+ return err;
+}
+EXPORT_SYMBOL(filemap_report_wb_error);
+
+/**
+ * filemap_check_wb_error - has an error occurred since the mark was sampled?
+ * @mapping: mapping to check for writeback errors
+ * @since: previously-sampled wb_err_t
+ *
+ * Grab the wb_err_t value from the mapping, and see if it has changed "since"
+ * the given value was sampled.
+ *
+ * If it has then report the latest error set, otherwise return 0.
+ */
+int filemap_check_wb_error(struct address_space *mapping, wb_err_t since)
+{
+ wb_err_t cur = READ_ONCE(mapping->wb_err);
+
+ if (likely(cur == since))
+ return 0;
+ return -(cur & MAX_ERRNO);
+}
+EXPORT_SYMBOL(filemap_check_wb_error);
+
/**
* replace_page_cache_page - replace a pagecache page with a new one
* @old: page to be replaced
--
2.9.3

2017-04-12 12:11:42

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 03/17] buffer: use mapping_set_error instead of setting the flag

Signed-off-by: Jeff Layton <[email protected]>
---
fs/buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9196f2a270da..70638941066d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -483,7 +483,7 @@ static void __remove_assoc_queue(struct buffer_head *bh)
list_del_init(&bh->b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
if (buffer_write_io_error(bh))
- set_bit(AS_EIO, &bh->b_assoc_map->flags);
+ mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
}

--
2.9.3

2017-04-12 12:15:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 01/17] mm: drop "wait" parameter from write_one_page

On Wed 12-04-17 08:05:58, Jeff Layton wrote:
> The callers all set it to 1. Also, make it clear that this function will
> not set any sort of AS_* error, and that the caller must do so if
> necessary. No existing caller uses this on normal files, so none of them
> need it.
>
> Signed-off-by: Jeff Layton <[email protected]>
> Reviewed-by: Ross Zwisler <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/exofs/dir.c | 2 +-
> fs/ext2/dir.c | 2 +-
> fs/jfs/jfs_metapage.c | 4 ++--
> fs/minix/dir.c | 2 +-
> fs/sysv/dir.c | 2 +-
> fs/ufs/dir.c | 2 +-
> include/linux/mm.h | 2 +-
> mm/page-writeback.c | 14 +++++++-------
> 8 files changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
> index 42f9a0a0c4ca..e163ed980c20 100644
> --- a/fs/exofs/dir.c
> +++ b/fs/exofs/dir.c
> @@ -72,7 +72,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
> set_page_dirty(page);
>
> if (IS_DIRSYNC(dir))
> - err = write_one_page(page, 1);
> + err = write_one_page(page);
> else
> unlock_page(page);
>
> diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
> index d9650c9508e4..e2709695b177 100644
> --- a/fs/ext2/dir.c
> +++ b/fs/ext2/dir.c
> @@ -100,7 +100,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
> }
>
> if (IS_DIRSYNC(dir)) {
> - err = write_one_page(page, 1);
> + err = write_one_page(page);
> if (!err)
> err = sync_inode_metadata(dir, 1);
> } else {
> diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
> index 489aaa1403e5..744fa3c079e6 100644
> --- a/fs/jfs/jfs_metapage.c
> +++ b/fs/jfs/jfs_metapage.c
> @@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
> get_page(page);
> lock_page(page);
> set_page_dirty(page);
> - write_one_page(page, 1);
> + write_one_page(page);
> clear_bit(META_forcewrite, &mp->flag);
> put_page(page);
> }
> @@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
> set_page_dirty(page);
> if (test_bit(META_sync, &mp->flag)) {
> clear_bit(META_sync, &mp->flag);
> - write_one_page(page, 1);
> + write_one_page(page);
> lock_page(page); /* write_one_page unlocks the page */
> }
> } else if (mp->lsn) /* discard_metapage doesn't remove it */
> diff --git a/fs/minix/dir.c b/fs/minix/dir.c
> index 7edc9b395700..baa9721f1299 100644
> --- a/fs/minix/dir.c
> +++ b/fs/minix/dir.c
> @@ -57,7 +57,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
> mark_inode_dirty(dir);
> }
> if (IS_DIRSYNC(dir))
> - err = write_one_page(page, 1);
> + err = write_one_page(page);
> else
> unlock_page(page);
> return err;
> diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
> index 5bdae85ceef7..f5191cb2c947 100644
> --- a/fs/sysv/dir.c
> +++ b/fs/sysv/dir.c
> @@ -45,7 +45,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
> mark_inode_dirty(dir);
> }
> if (IS_DIRSYNC(dir))
> - err = write_one_page(page, 1);
> + err = write_one_page(page);
> else
> unlock_page(page);
> return err;
> diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
> index de01b8f2aa78..48609f1d9580 100644
> --- a/fs/ufs/dir.c
> +++ b/fs/ufs/dir.c
> @@ -53,7 +53,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
> mark_inode_dirty(dir);
> }
> if (IS_DIRSYNC(dir))
> - err = write_one_page(page, 1);
> + err = write_one_page(page);
> else
> unlock_page(page);
> return err;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 00a8fa7e366a..f25b76486645 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
> extern int filemap_page_mkwrite(struct vm_fault *vmf);
>
> /* mm/page-writeback.c */
> -int write_one_page(struct page *page, int wait);
> +int write_one_page(struct page *page);
> void task_dirty_inc(struct task_struct *tsk);
>
> /* readahead.c */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d8ac2a7fb9e7..de0dbf12e2c1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2361,15 +2361,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> }
>
> /**
> - * write_one_page - write out a single page and optionally wait on I/O
> + * write_one_page - write out a single page and wait on I/O
> * @page: the page to write
> - * @wait: if true, wait on writeout
> *
> * The page must be locked by the caller and will be unlocked upon return.
> *
> - * write_one_page() returns a negative error code if I/O failed.
> + * write_one_page() returns a negative error code if I/O failed. Note that
> + * the address_space is not marked for error. The caller must do this if
> + * needed.
> */
> -int write_one_page(struct page *page, int wait)
> +int write_one_page(struct page *page)
> {
> struct address_space *mapping = page->mapping;
> int ret = 0;
> @@ -2380,13 +2381,12 @@ int write_one_page(struct page *page, int wait)
>
> BUG_ON(!PageLocked(page));
>
> - if (wait)
> - wait_on_page_writeback(page);
> + wait_on_page_writeback(page);
>
> if (clear_page_dirty_for_io(page)) {
> get_page(page);
> ret = mapping->a_ops->writepage(page, &wbc);
> - if (ret == 0 && wait) {
> + if (ret == 0) {
> wait_on_page_writeback(page);
> if (PageError(page))
> ret = -EIO;
> --
> 2.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-04-12 12:16:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 02/17] mm: fix mapping_set_error call in me_pagecache_dirty

On Wed 12-04-17 08:05:59, Jeff Layton wrote:
> The error code should be negative. Since this ends up in the default
> case anyway, this is harmless, but it's less confusing to negate it.
> Also, later patches will require a negative error code here.
>
> Signed-off-by: Jeff Layton <[email protected]>
> Reviewed-by: Ross Zwisler <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/memory-failure.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 27f7210e7fab..4b56e53e5378 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -674,7 +674,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
> * the first EIO, but we're not worse than other parts
> * of the kernel.
> */
> - mapping_set_error(mapping, EIO);
> + mapping_set_error(mapping, -EIO);
> }
>
> return me_pagecache_clean(p, pfn);
> --
> 2.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-04-12 12:17:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 03/17] buffer: use mapping_set_error instead of setting the flag

On Wed 12-04-17 08:06:00, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/buffer.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9196f2a270da..70638941066d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -483,7 +483,7 @@ static void __remove_assoc_queue(struct buffer_head *bh)
> list_del_init(&bh->b_assoc_buffers);
> WARN_ON(!bh->b_assoc_map);
> if (buffer_write_io_error(bh))
> - set_bit(AS_EIO, &bh->b_assoc_map->flags);
> + mapping_set_error(bh->b_assoc_map, -EIO);
> bh->b_assoc_map = NULL;
> }
>
> --
> 2.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-04-12 12:29:32

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 04/17] ext2: don't test/clear AS_EIO flag

On Wed 12-04-17 08:06:01, Jeff Layton wrote:
> ext2 does a test+clear of AS_EIO flag. With the new error reporting
> infrastructure, we don't need to clear anything. Sample the file's
> current error code, and after writeback just report whether any
> errors have been seen since then.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/ext2/file.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index b21891a6bfca..0ca77d337c94 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -177,7 +177,12 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
>
> ret = generic_file_fsync(file, start, end, datasync);
> - if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> + if (ret == -EIO)
> + ret = -EIO;
> + else
> + ret = filemap_check_errors(mapping);
> +

IMO making __generic_file_fsync() perform the filemap_check_errors() after
sync_mapping_buffers() is better and deals with all filesystem using
generic_file_fsync(). Then we can just remove the AS_EIO check here.

Honza

> + if (ret) {
> /* We don't really know where the IO error happened... */
> ext2_error(sb, __func__,
> "detected IO error when writing metadata buffers");
> --
> 2.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-04-12 12:30:40

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 04/17] ext2: don't test/clear AS_EIO flag

On Wed, 2017-04-12 at 14:29 +0200, Jan Kara wrote:
> On Wed 12-04-17 08:06:01, Jeff Layton wrote:
> > ext2 does a test+clear of AS_EIO flag. With the new error reporting
> > infrastructure, we don't need to clear anything. Sample the file's
> > current error code, and after writeback just report whether any
> > errors have been seen since then.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/ext2/file.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> > index b21891a6bfca..0ca77d337c94 100644
> > --- a/fs/ext2/file.c
> > +++ b/fs/ext2/file.c
> > @@ -177,7 +177,12 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
> >
> > ret = generic_file_fsync(file, start, end, datasync);
> > - if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> > + if (ret == -EIO)
> > + ret = -EIO;
> > + else
> > + ret = filemap_check_errors(mapping);
> > +
>
> IMO making __generic_file_fsync() perform the filemap_check_errors() after
> sync_mapping_buffers() is better and deals with all filesystem using
> generic_file_fsync(). Then we can just remove the AS_EIO check here.
>
> Honza
>

Thanks, I was just looking at this patch and thinking that it wasn't
quite right. I'll take your approach on the next set.

> > + if (ret) {
> > /* We don't really know where the IO error happened... */
> > ext2_error(sb, __func__,
> > "detected IO error when writing metadata buffers");
> > --
> > 2.9.3
> >

--
Jeff Layton <[email protected]>

2017-04-12 13:01:40

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
> Not sure what to do here just yet.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> mm/page-writeback.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index de0dbf12e2c1..3ac8399dc984 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
> ret = mapping->a_ops->writepage(page, &wbc);
> if (ret == 0) {
> wait_on_page_writeback(page);
> + /*
> + * FIXME: is this racy? What guarantees that PG_error
> + * will still be set once we get around to checking it?
> + * What if writeback fails, but then a read is issued
> + * before we check this, and that calls ClearPageError?
> + */
> if (PageError(page))
> ret = -EIO;
> }

Ahh, we are always under the page lock here, and this is generally used
for writing out directory pages anyway. I'm fine with dropping this
patch unless someone else sees a problem here.
--
Jeff Layton <[email protected]>

2017-04-12 14:27:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 01/17] mm: drop "wait" parameter from write_one_page

On Wed, Apr 12, 2017 at 08:05:58AM -0400, Jeff Layton wrote:
> The callers all set it to 1. Also, make it clear that this function will
> not set any sort of AS_* error, and that the caller must do so if
> necessary. No existing caller uses this on normal files, so none of them
> need it.

So ... anyone who doesn't check the error code loses an error indication.

> +++ b/fs/jfs/jfs_metapage.c
> @@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
> get_page(page);
> lock_page(page);
> set_page_dirty(page);
> - write_one_page(page, 1);
> + write_one_page(page);
> clear_bit(META_forcewrite, &mp->flag);
> put_page(page);
> }
> @@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
> set_page_dirty(page);
> if (test_bit(META_sync, &mp->flag)) {
> clear_bit(META_sync, &mp->flag);
> - write_one_page(page, 1);
> + write_one_page(page);
> lock_page(page); /* write_one_page unlocks the page */
> }
> } else if (mp->lsn) /* discard_metapage doesn't remove it */

This looks quite bad. If my reading is right, these pages are part of
the journal. I think somebody who knows JFS needs to figure out what
should happen here ...

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 00a8fa7e366a..f25b76486645 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
> extern int filemap_page_mkwrite(struct vm_fault *vmf);
>
> /* mm/page-writeback.c */
> -int write_one_page(struct page *page, int wait);
> +int write_one_page(struct page *page);
> void task_dirty_inc(struct task_struct *tsk);
>
> /* readahead.c */

Can we mark this as __must_check so JFS picks up a couple of warnings?

Reviewed-by: Matthew Wilcox <[email protected]>

2017-04-12 14:28:17

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 02/17] mm: fix mapping_set_error call in me_pagecache_dirty

On Wed, Apr 12, 2017 at 08:05:59AM -0400, Jeff Layton wrote:
> The error code should be negative. Since this ends up in the default
> case anyway, this is harmless, but it's less confusing to negate it.
> Also, later patches will require a negative error code here.
>
> Signed-off-by: Jeff Layton <[email protected]>
> Reviewed-by: Ross Zwisler <[email protected]>

Reviewed-by: Matthew Wilcox <[email protected]>

2017-04-12 14:29:21

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 03/17] buffer: use mapping_set_error instead of setting the flag

On Wed, Apr 12, 2017 at 08:06:00AM -0400, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <[email protected]>

Reviewed-by: Matthew Wilcox <[email protected]>

2017-04-12 14:34:54

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 01/17] mm: drop "wait" parameter from write_one_page

On Wed, 2017-04-12 at 07:27 -0700, Matthew Wilcox wrote:
> On Wed, Apr 12, 2017 at 08:05:58AM -0400, Jeff Layton wrote:
> > The callers all set it to 1. Also, make it clear that this function will
> > not set any sort of AS_* error, and that the caller must do so if
> > necessary. No existing caller uses this on normal files, so none of them
> > need it.
>
> So ... anyone who doesn't check the error code loses an error indication.
>
> > +++ b/fs/jfs/jfs_metapage.c
> > @@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
> > get_page(page);
> > lock_page(page);
> > set_page_dirty(page);
> > - write_one_page(page, 1);
> > + write_one_page(page);
> > clear_bit(META_forcewrite, &mp->flag);
> > put_page(page);
> > }
> > @@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
> > set_page_dirty(page);
> > if (test_bit(META_sync, &mp->flag)) {
> > clear_bit(META_sync, &mp->flag);
> > - write_one_page(page, 1);
> > + write_one_page(page);
> > lock_page(page); /* write_one_page unlocks the page */
> > }
> > } else if (mp->lsn) /* discard_metapage doesn't remove it */
>
> This looks quite bad. If my reading is right, these pages are part of
> the journal. I think somebody who knows JFS needs to figure out what
> should happen here ...
>
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 00a8fa7e366a..f25b76486645 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
> > extern int filemap_page_mkwrite(struct vm_fault *vmf);
> >
> > /* mm/page-writeback.c */
> > -int write_one_page(struct page *page, int wait);
> > +int write_one_page(struct page *page);
> > void task_dirty_inc(struct task_struct *tsk);
> >
> > /* readahead.c */
>
> Can we mark this as __must_check so JFS picks up a couple of warnings?
>
> Reviewed-by: Matthew Wilcox <[email protected]>

Good idea -- I'll roll that in with this patch.
--
Jeff Layton <[email protected]>

2017-04-12 14:39:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

On Wed, Apr 12, 2017 at 09:01:34AM -0400, Jeff Layton wrote:
> On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
> > Not sure what to do here just yet.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > mm/page-writeback.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index de0dbf12e2c1..3ac8399dc984 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
> > ret = mapping->a_ops->writepage(page, &wbc);
> > if (ret == 0) {
> > wait_on_page_writeback(page);
> > + /*
> > + * FIXME: is this racy? What guarantees that PG_error
> > + * will still be set once we get around to checking it?
> > + * What if writeback fails, but then a read is issued
> > + * before we check this, and that calls ClearPageError?
> > + */
> > if (PageError(page))
> > ret = -EIO;
> > }
>
> Ahh, we are always under the page lock here, and this is generally used
> for writing out directory pages anyway. I'm fine with dropping this
> patch unless someone else sees a problem here.

->writepage drops the page lock. We're still holding a refcount on this
page, but that's not going to prevent read being called. But maybe the
filesystem won't call read on a page that's marked as PageError?

2017-04-12 15:13:05

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH v2 01/17] mm: drop "wait" parameter from write_one_page

On 04/12/2017 09:27 AM, Matthew Wilcox wrote:
> On Wed, Apr 12, 2017 at 08:05:58AM -0400, Jeff Layton wrote:
>> The callers all set it to 1. Also, make it clear that this function will
>> not set any sort of AS_* error, and that the caller must do so if
>> necessary. No existing caller uses this on normal files, so none of them
>> need it.
>
> So ... anyone who doesn't check the error code loses an error indication.
>
>> +++ b/fs/jfs/jfs_metapage.c
>> @@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
>> get_page(page);
>> lock_page(page);
>> set_page_dirty(page);
>> - write_one_page(page, 1);
>> + write_one_page(page);
>> clear_bit(META_forcewrite, &mp->flag);
>> put_page(page);
>> }
>> @@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
>> set_page_dirty(page);
>> if (test_bit(META_sync, &mp->flag)) {
>> clear_bit(META_sync, &mp->flag);
>> - write_one_page(page, 1);
>> + write_one_page(page);
>> lock_page(page); /* write_one_page unlocks the page */
>> }
>> } else if (mp->lsn) /* discard_metapage doesn't remove it */
>
> This looks quite bad. If my reading is right, these pages are part of
> the journal. I think somebody who knows JFS needs to figure out what
> should happen here ...

Actually, these are pages containing file system metadata, so we
shouldn't be silently ignoring errors. Probably the only real recovery
JFS can make at this point is report the error and mark the file system
dirty.

>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 00a8fa7e366a..f25b76486645 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
>> extern int filemap_page_mkwrite(struct vm_fault *vmf);
>>
>> /* mm/page-writeback.c */
>> -int write_one_page(struct page *page, int wait);
>> +int write_one_page(struct page *page);
>> void task_dirty_inc(struct task_struct *tsk);
>>
>> /* readahead.c */
>
> Can we mark this as __must_check so JFS picks up a couple of warnings?

Sure. that'll make me fix it.

>
> Reviewed-by: Matthew Wilcox <[email protected]>
>

2017-04-12 15:53:00

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

On Wed, 2017-04-12 at 07:38 -0700, Matthew Wilcox wrote:
> On Wed, Apr 12, 2017 at 09:01:34AM -0400, Jeff Layton wrote:
> > On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
> > > Not sure what to do here just yet.
> > >
> > > Signed-off-by: Jeff Layton <[email protected]>
> > > ---
> > > mm/page-writeback.c | 6 ++++++
> > > 1 file changed, 6 insertions(+)
> > >
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index de0dbf12e2c1..3ac8399dc984 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
> > > ret = mapping->a_ops->writepage(page, &wbc);
> > > if (ret == 0) {
> > > wait_on_page_writeback(page);
> > > + /*
> > > + * FIXME: is this racy? What guarantees that PG_error
> > > + * will still be set once we get around to checking it?
> > > + * What if writeback fails, but then a read is issued
> > > + * before we check this, and that calls ClearPageError?
> > > + */
> > > if (PageError(page))
> > > ret = -EIO;
> > > }
> >
> > Ahh, we are always under the page lock here, and this is generally used
> > for writing out directory pages anyway. I'm fine with dropping this
> > patch unless someone else sees a problem here.
>
> ->writepage drops the page lock. We're still holding a refcount on this
> page, but that's not going to prevent read being called. But maybe the
> filesystem won't call read on a page that's marked as PageError?

Hard to be sure there. I really wonder if that check is needed at all,
the more I look at it. After all, we are calling writepage with
WB_SYNC_ALL so we should get an error there.

Is it also possible these pages could be written back before that point
(due to memory pressure or something) and that fail?

Maybe we should just have a call to filemap_check_errors on exiting
this function?

With the the wb_err_t based stuff, we could change it to sample the
wb_err early, and then use that to see if an error has occurred since
then. Maybe we should even allow callers to pass a wb_err_t in here, so
we can report errors that have occurred since a known point?
--
Jeff Layton <[email protected]>

2017-04-12 18:42:06

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

My apologies, this patch in particular should have gotten an updated
changelog. Here's a revised patch. The only real difference in this is
the updated changelog.

----------------------------8<--------------------------------

[PATCH] fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, there is no reason
for me to assume that it is because some previous writeback failed. The
fact that it also clears out the error such that a subsequent fsync
returns 0 is a bug, IMO, and a nasty one since that's potentially silent
data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set defines a new 32-bit value (wb_err_t) that encompasses an
error code (up to MAX_ERROR), a sequence counter and a "seen" flag.

One of these is added to struct address_space, and a corresponding one
is added to struct file. When errors are reported during writeback, we
set the error field, and clear the seen flag and increment the sequence
counter if the seen flag is set.

On fsync we can check the file's value against what's in the mapping and
quickly return 0 if it hasn't changed. If it has changed, we'll set the
seen flag if it's not already set, update the value in the struct file
to the latest and return an error.

This changes the semantics of fsync such that applications can now use
it to determine whether there were any writeback errors since fsync(fd)
was last called (or since the file was opened in the case of fsync
having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The basic idea here is for filesystems to use filemap_set_wb_error to
set the error in the mapping when there are writeback errors, and then
have the fsync and flush operations call filemap_report_wb_error just
before returning to ensure that those errors get reported properly.

Eventually, it may make sense to move the reporting into the generic
vfs_fsync_range helper, but doing it this way for now makes it simpler
to convert filesystems to the new API individually.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/vfs.txt | 10 +-
fs/open.c | 3 +
include/linux/fs.h | 23 +++++
mm/filemap.c | 209 ++++++++++++++++++++++++++++++++++++++
4 files changed, 244 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94dd27ef4a76..ed06fb39822b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,6 +576,11 @@ should clear PG_Dirty and set PG_Writeback. It can be actually
written at any point after PG_Dirty is clear. Once it is known to be
safe, PG_Writeback is cleared.

+If there is an error during writeback, then the address_space should be
+marked with an error (typically using filemap_set_wb_error), in order to
+ensure that the error can later be reported to the application when an
+fsync is issued.
+
Writeback makes use of a writeback_control structure...

struct address_space_operations
@@ -888,7 +893,10 @@ otherwise noted.

release: called when the last reference to an open file is closed

- fsync: called by the fsync(2) system call
+ fsync: called by the fsync(2) system call. Filesystems that use the
+ pagecache should call filemap_report_wb_error before returning
+ to ensure that any errors that occurred during writeback are
+ reported and the file's error sequence advanced.

fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..88bfed8d3c88 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
f->f_inode = inode;
f->f_mapping = inode->i_mapping;

+ /* Ensure that we skip any errors that predate opening of the file */
+ f->f_wb_err = filemap_sample_wb_error(f->f_mapping);
+
if (unlikely(f->f_flags & O_PATH)) {
f->f_mode = FMODE_PATH;
f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..c5ab4c982839 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -376,6 +376,16 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);

+/*
+ * A wb_err_t is a combination of error value, sequence counter and flag that
+ * is used to track errors that occur during writeback. When a new writeback
+ * error occurs, we set the error field in it and increment the sequence
+ * counter if the current value has been fetched since it was last set.
+ *
+ * See the filemap_*_wb_error functions for details.
+ */
+typedef u32 wb_err_t;
+
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
@@ -394,6 +404,7 @@ struct address_space {
gfp_t gfp_mask; /* implicit gfp mask for allocations */
struct list_head private_list; /* ditto */
void *private_data; /* ditto */
+ wb_err_t wb_err;
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -846,6 +857,7 @@ struct file {
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
+ wb_err_t f_wb_err;
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
@@ -2521,6 +2533,17 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
extern int filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end);
extern int filemap_check_errors(struct address_space *mapping);
+extern void __filemap_set_wb_error(struct address_space *mapping, int err);
+extern wb_err_t filemap_sample_wb_error(struct address_space *mapping);
+extern int __must_check filemap_report_wb_error(struct file *file);
+extern int filemap_check_wb_error(struct address_space *mapping, wb_err_t since);
+
+static inline void filemap_set_wb_error(struct address_space *mapping, int err)
+{
+ /* Optimize for the common case of no error */
+ if (unlikely(err))
+ __filemap_set_wb_error(mapping, err);
+}

extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623a6289..4966e9dea945 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -545,6 +545,215 @@ int filemap_write_and_wait_range(struct address_space *mapping,
}
EXPORT_SYMBOL(filemap_write_and_wait_range);

+/*
+ * The wb_err field in the address_space provides a place to store writeback
+ * errors. We endeavor to deliver writeback errors to fsync on all open file
+ * descriptors that were open at the time that the error was caught. We do
+ * this using a 32-bit value to store the error, with the upper bits as a
+ * sequence counter. We can store any error up to MAX_ERRNO.
+ *
+ * Additionally, we reserve one bit to indicate whether any fd has grabbed the
+ * value to record in its struct file. If nothing has, then we don't really
+ * need to increment the counter.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define WB_ERR_SHIFT ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define WB_ERR_SEEN (1 << WB_ERR_SHIFT)
+
+/* Increment the counter by this much to ensure that we don't touch earlier
+ * values */
+#define WB_ERR_CTR_INC (1 << (WB_ERR_SHIFT + 1))
+
+/**
+ * __filemap_set_wb_error - set the wb error in the mapping for later reporting
+ * @mapping: mapping in which the error should be set
+ * @err: error to set. must be negative value but not less than -MAX_ERRNO
+ *
+ * When an error occurs during writeback of inode data, we must report that
+ * error during fsync. This function sets the writeback error field in the
+ * mapping, and increments the sequence counter. When fsync or close is later
+ * performed, the caller can then check the sequence in the mapping against
+ * the one in the file to determine whether the error should be reported.
+ *
+ * Because there are so few bits for the counter, we try to avoid incrementing
+ * it unless someone is going to record the value for later comparison. This
+ * is tracked by a bit in the 32 bit word that we use as a "seen" flag.
+ *
+ * Note that we always use the latest writeback error, since POSIX states
+ * that when there are multiple errors (e.g. -EIO followed by -ENOSPC),
+ * that any possible error may be returned.
+ *
+ * Most callers will want to use the filemap_set_wb_error wrapper to
+ * efficiently handle the common case where err is 0.
+ */
+void __filemap_set_wb_error(struct address_space *mapping, int err)
+{
+ wb_err_t old;
+
+ /* MAX_ERRNO must be able to serve as a mask */
+ BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
+
+ /*
+ * Ensure the error code actually fits where we want it to go. If it
+ * doesn't then just throw a warning and don't record anything. We
+ * also don't accept zero here as that would effectively clear a
+ * previous error.
+ */
+ if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
+ "err = %d\n", err))
+ return;
+
+ old = READ_ONCE(mapping->wb_err);
+ for (;;) {
+ wb_err_t new, cur;
+
+ /* Clear out error bits and set new error */
+ new = (old & ~(MAX_ERRNO|WB_ERR_SEEN)) | -err;
+
+ /* Only increment if someone has looked at it */
+ if (old & WB_ERR_SEEN)
+ new += WB_ERR_CTR_INC;
+
+ /* If there would be no change, then call it done */
+ if (new == old)
+ break;
+
+ /* Try to swap the new value into place */
+ cur = cmpxchg(&mapping->wb_err, old, new);
+
+ /*
+ * Call it success if we did the swap or someone else beat us
+ * to it for the same value.
+ */
+ if (likely(cur == old || cur == new))
+ break;
+
+ /* Raced with an update, try again */
+ old = cur;
+ }
+}
+EXPORT_SYMBOL(__filemap_set_wb_error);
+
+/**
+ * filemap_sample_wb_error - grab current wb_err_t value for mapping
+ * @mapping: the mapping from which to sample the error
+ *
+ * This function allows callers to "sample" the wb_err_t value from the
+ * mapping, marking it as "seen" if required.
+ *
+ * Note that we handle the common case where we've had no writeback errors
+ * as a special case. We don't need to mark the SEEN bit in that case since
+ * an all-zeroed out wb_err_t will only ever exist at first initialization.
+ */
+wb_err_t filemap_sample_wb_error(struct address_space *mapping)
+{
+ wb_err_t old = READ_ONCE(mapping->wb_err);
+ wb_err_t new = old;
+
+ /*
+ * For the common case of no errors ever having been set, we can skip
+ * marking the SEEN bit. Once an error has been set, the value will
+ * never go back to zero.
+ */
+ if (old != 0) {
+ new |= WB_ERR_SEEN;
+ if (old != new)
+ cmpxchg(&mapping->wb_err, old, new);
+ }
+ return new;
+}
+EXPORT_SYMBOL(filemap_sample_wb_error);
+
+/**
+ * filemap_report_wb_error - report wb error (if any) that was previously set
+ * @file: struct file on which the error is being reported
+ *
+ * When userland calls fsync (or something like nfsd does the equivalent), we
+ * want to report any writeback errors that occurred since the last fsync (or
+ * since the file was opened if there haven't been any).
+ *
+ * Grab the wb_err from the mapping. If it matches what we have in the file,
+ * then just quickly return 0. The file is all caught up.
+ *
+ * If it doesn't match, then take the mapping value, set the "seen" flag in
+ * it and try to swap it into place. If it works, or another task beat us
+ * to it with the new value, then update the f_wb_err and return the error
+ * portion. The error at this point must be reported via proper channels
+ * (a'la fsync, or NFS COMMIT operation, etc.).
+ *
+ * While we handle mapping->wb_err with atomic operations, the f_wb_err
+ * value is protected by the f_lock since we must ensure that it reflects
+ * the latest value swapped in for this file descriptor.
+ */
+int filemap_report_wb_error(struct file *file)
+{
+ int err = 0;
+ struct address_space *mapping = file->f_mapping;
+ wb_err_t old, new;
+
+ /*
+ * Catch the common case where nothing has changed without locking
+ *
+ * We always store values with the "seen" bit set (except in the case
+ * where the entire thing is 0), so if this matches what we already
+ * have, then we can call it done. There is nothing to update in that
+ * case.
+ */
+ if (likely(READ_ONCE(mapping->wb_err) == READ_ONCE(file->f_wb_err)))
+ goto out;
+
+ /* Something changed, must use slow path */
+ spin_lock(&file->f_lock);
+ /* Fetch again to make sure we get the latest */
+ old = READ_ONCE(mapping->wb_err);
+
+ if (likely(old != file->f_wb_err)) {
+ /*
+ * Set the flag and try to swap it into place if it has
+ * changed.
+ *
+ * We don't care about the outcome of the swap here. If the
+ * swap doesn't occur, then it has either been updated by a
+ * writer who is bumping the seq count anyway, or another
+ * reader who is just setting the "seen" flag. Either outcome
+ * is OK here, and we can advance f_wb_err and return an
+ * error based on what we have.
+ */
+ new = old | WB_ERR_SEEN;
+ if (new != old)
+ cmpxchg(&mapping->wb_err, old, new);
+ file->f_wb_err = new;
+ err = -(new & MAX_ERRNO);
+ }
+ spin_unlock(&file->f_lock);
+out:
+ return err;
+}
+EXPORT_SYMBOL(filemap_report_wb_error);
+
+/**
+ * filemap_check_wb_error - has an error occurred since the mark was sampled?
+ * @mapping: mapping to check for writeback errors
+ * @since: previously-sampled wb_err_t
+ *
+ * Grab the wb_err_t value from the mapping, and see if it has changed "since"
+ * the given value was sampled.
+ *
+ * If it has then report the latest error set, otherwise return 0.
+ */
+int filemap_check_wb_error(struct address_space *mapping, wb_err_t since)
+{
+ wb_err_t cur = READ_ONCE(mapping->wb_err);
+
+ if (likely(cur == since))
+ return 0;
+ return -(cur & MAX_ERRNO);
+}
+EXPORT_SYMBOL(filemap_check_wb_error);
+
/**
* replace_page_cache_page - replace a pagecache page with a new one
* @old: page to be replaced
--
2.9.3

--
Jeff Layton <[email protected]>

2017-04-12 21:37:02

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

On Wed, Apr 12 2017, Jeff Layton wrote:

> On Wed, 2017-04-12 at 07:38 -0700, Matthew Wilcox wrote:
>> On Wed, Apr 12, 2017 at 09:01:34AM -0400, Jeff Layton wrote:
>> > On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
>> > > Not sure what to do here just yet.
>> > >
>> > > Signed-off-by: Jeff Layton <[email protected]>
>> > > ---
>> > > mm/page-writeback.c | 6 ++++++
>> > > 1 file changed, 6 insertions(+)
>> > >
>> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> > > index de0dbf12e2c1..3ac8399dc984 100644
>> > > --- a/mm/page-writeback.c
>> > > +++ b/mm/page-writeback.c
>> > > @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
>> > > ret = mapping->a_ops->writepage(page, &wbc);
>> > > if (ret == 0) {
>> > > wait_on_page_writeback(page);
>> > > + /*
>> > > + * FIXME: is this racy? What guarantees that PG_error
>> > > + * will still be set once we get around to checking it?
>> > > + * What if writeback fails, but then a read is issued
>> > > + * before we check this, and that calls ClearPageError?
>> > > + */
>> > > if (PageError(page))
>> > > ret = -EIO;
>> > > }
>> >
>> > Ahh, we are always under the page lock here, and this is generally used
>> > for writing out directory pages anyway. I'm fine with dropping this
>> > patch unless someone else sees a problem here.
>>
>> ->writepage drops the page lock. We're still holding a refcount on this
>> page, but that's not going to prevent read being called. But maybe the
>> filesystem won't call read on a page that's marked as PageError?
>
> Hard to be sure there. I really wonder if that check is needed at all,
> the more I look at it. After all, we are calling writepage with
> WB_SYNC_ALL so we should get an error there.

WB_SYNC_ALL doesn't cause writepage to wait. It might case it to ask
for REQ_SYNC, so the write requests gets priority in the block layer.
WB_SYNC_ALL does cause writepages (with an 's') to wait.
(At least, that is how I read the code).

>
> Is it also possible these pages could be written back before that point
> (due to memory pressure or something) and that fail?

Probably, in which case clear_page_dirty_for_io() will fail and
write_one_page() will just unlock the page.

>
> Maybe we should just have a call to filemap_check_errors on exiting
> this function?

I'm leaning in that direction.

>
> With the the wb_err_t based stuff, we could change it to sample the
> wb_err early, and then use that to see if an error has occurred since
> then. Maybe we should even allow callers to pass a wb_err_t in here, so
> we can report errors that have occurred since a known point?

That feels to me like over-engineering. We would need to
unconditionally call writepage() for that to work.

We seem to be agreed that write errors for buffered writes are reported
per-address-space. To get per-page errors you have to use direct IO.
Let's focus on that policy and make it work.

Thanks,
NeilBrown


Attachments:
signature.asc (832.00 B)

2017-04-12 21:56:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

On Wed, Apr 12 2017, Jeff Layton wrote:


> +void __filemap_set_wb_error(struct address_space *mapping, int err)

I was really hoping that this would be

void __set_wb_error(wb_err_t *wb_err, int err)

so

Then nfs_context_set_write_error could become

static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
{
__set_wb_error(&ctx->wb_err, error);
}

and filemap_set_sb_error() would be:

static inline void filemap_set_wb_error(struct address_space *mapping, int err)
{
/* Optimize for the common case of no error */
if (unlikely(err))
__set_wb_error(&mapping->f_wb_err, err);
}

Similarly we would have
wb_err_t sample_wb_error(wb_err_t *wb_err)
{
...
}

and

wb_err_t filemap_sample_wb_error(struct address_space *mapping)
{
return sample_wb_error(&mapping->f_wb_err);
}

so nfs_file_fsync_commit() could have
ret = sample_wb_error(&ctx->wb_err);
in place of
ret = xchg(&ctx->error, 0);

int filemap_report_wb_error(struct file *file)

would become

int filemap_report_wb_error(struct file *file, wb_err_t *err)

or something.

The address space is just one (obvious) place where the wb error can be
stored. The filesystem might have a different place with finer
granularity (nfs already does).


> +wb_err_t filemap_sample_wb_error(struct address_space *mapping)
> +{
> + wb_err_t old = READ_ONCE(mapping->wb_err);
> + wb_err_t new = old;
> +
> + /*
> + * For the common case of no errors ever having been set, we can skip
> + * marking the SEEN bit. Once an error has been set, the value will
> + * never go back to zero.
> + */
> + if (old != 0) {
> + new |= WB_ERR_SEEN;
> + if (old != new)
> + cmpxchg(&mapping->wb_err, old, new);
> + }
> + return new;
> +}

I do like how the use of cmpxchg work out here - no looping!

Thanks
NeilBrown


Attachments:
signature.asc (832.00 B)

2017-04-12 22:15:15

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Wed, Apr 12 2017, Jeff Layton wrote:

> Now that we have a better way to store and report errors that occur
> during writeback, we need to convert the existing codebase to use it. We
> could just adapt all of the filesystem code and related infrastructure
> to the new API, but that's a lot of churn.
>
> When it comes to setting errors in the mapping, filemap_set_wb_error is
> a drop-in replacement for mapping_set_error. Turn that function into a
> simple wrapper around the new one.
>
> Because we want to ensure that writeback errors are always reported at
> fsync time, inject filemap_report_wb_error calls much closer to the
> syscall boundary. For fsync calls (and things like the nfsd equivalent),
> we either return the error that the fsync operation returns, or the one
> returned by filemap_report_wb_error. In both cases, we advance the
> file->f_wb_err to the latest value.
>
> The final piece of the puzzle is what to do about filemap_check_errors
> calls that are being called directly or via filemap_* functions. Here,
> we must take a little "creative license".
>
> Since we now handle advancing the file->f_wb_err value at the generic
> filesystem layer, we no longer need those callers to do anything like
> that, or clear errors out of the mapping. A lot of the existing codebase
> relies on being getting an error back from those functions when there is
> a writeback problem, so we do still want to have them report writeback
> errors somehow.
>
> When reporting writeback errors, we will always report errors that have
> occurred since a particular point in time. With the old writeback error
> reporting, the time we used was "since it was last tested/cleared" which
> is entirely arbitrary and potentially racy. Now, we can at least report
> the latest error that has occurred since an arbitrary point in time
> (represented as a sampled wb_err_t value).
>
> In the case where we don't have a struct file to work with, this patch
> just has the wrappers sample the mapping->wb_err value, and use that as
> an arbitrary point from which to check for errors.
>
> That's probably not ideal, particularly in the case of something like
> filemap_fdatawait, but I'm not sure it's any worse than what we
> already have, and this gives us a basis from which to work.

This aspect of the patch looked rather odd -- sampling a wb_err_t at
some arbitrary time, and then comparing it later. So that for going to
the trouble of explaining it.

I suspect that the filemap_check_wb_error() will need to be moved
into some parent of the current call site, which is essentially what you
suggest below. It would be nice if we could do that first, rather than
having the current rather odd code. But maybe this way is an easier
transition. It isn't obviously wrong, it just isn't obviously right
either.

> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + int ret, ret2;
> struct inode *inode = file->f_mapping->host;
>
> if (!file->f_op->fsync)
> @@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> spin_unlock(&inode->i_lock);
> mark_inode_dirty_sync(inode);
> }
> - return call_fsync(file, start, end, datasync);
> + ret = call_fsync(file, start, end, datasync);
> + ret2 = filemap_report_wb_error(file);
> + return ret ? : ret2;
> }
> EXPORT_SYMBOL(vfs_fsync_range);
>
....

> static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + int ret, ret2;
> struct shm_file_data *sfd = shm_file_data(file);
>
> if (!sfd->file->f_op->fsync)
> return -EINVAL;
> - return call_fsync(sfd->file, start, end, datasync);
> + ret = call_fsync(sfd->file, start, end, datasync);
> + ret2 = filemap_report_wb_error(file);
> + return ret ? : ret2;
> }

These are the only two places that call_fsync() is called.
Did you consider moving the filemap_report_wb_error() call into
call_fsync() ??

Thanks,
NeilBrown


Attachments:
signature.asc (832.00 B)

2017-04-12 22:41:28

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
>
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> >
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> >
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary. For fsync calls (and things like the nfsd equivalent),
> > we either return the error that the fsync operation returns, or the one
> > returned by filemap_report_wb_error. In both cases, we advance the
> > file->f_wb_err to the latest value.
> >
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> >
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to do anything like
> > that, or clear errors out of the mapping. A lot of the existing codebase
> > relies on being getting an error back from those functions when there is
> > a writeback problem, so we do still want to have them report writeback
> > errors somehow.
> >
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled wb_err_t value).
> >
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the mapping->wb_err value, and use that as
> > an arbitrary point from which to check for errors.
> >
> > That's probably not ideal, particularly in the case of something like
> > filemap_fdatawait, but I'm not sure it's any worse than what we
> > already have, and this gives us a basis from which to work.
>
> This aspect of the patch looked rather odd -- sampling a wb_err_t at
> some arbitrary time, and then comparing it later. So that for going to
> the trouble of explaining it.
>
> I suspect that the filemap_check_wb_error() will need to be moved
> into some parent of the current call site, which is essentially what you
> suggest below. It would be nice if we could do that first, rather than
> having the current rather odd code. But maybe this way is an easier
> transition. It isn't obviously wrong, it just isn't obviously right
> either.
>

Yeah. It's just such a daunting task to have to change so much of the
existing code. I'm looking for ways to make this simpler.

I think it probably is reasonable for filemap_write_and_wait* to just
sample it as early as possible in those functions. filemap_fdatawait is
the real questionable one, as you may have already had some writebacks
complete with errors.

In any case, my thinking was that the old code is not obviously correct
either, so while this shortens the "error capture window" on these
calls, it seems like a reasonable place to start improving things.

I'm happy to with the filesystem implementors to pick more sensible
places to grab the wb_err_t, and/or add APIs that help enable doing
this correctly.

> > int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> > {
> > + int ret, ret2;
> > struct inode *inode = file->f_mapping->host;
> >
> > if (!file->f_op->fsync)
> > @@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> > spin_unlock(&inode->i_lock);
> > mark_inode_dirty_sync(inode);
> > }
> > - return call_fsync(file, start, end, datasync);
> > + ret = call_fsync(file, start, end, datasync);
> > + ret2 = filemap_report_wb_error(file);
> > + return ret ? : ret2;
> > }
> > EXPORT_SYMBOL(vfs_fsync_range);
> >
>
> ....
>
> > static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> > {
> > + int ret, ret2;
> > struct shm_file_data *sfd = shm_file_data(file);
> >
> > if (!sfd->file->f_op->fsync)
> > return -EINVAL;
> > - return call_fsync(sfd->file, start, end, datasync);
> > + ret = call_fsync(sfd->file, start, end, datasync);
> > + ret2 = filemap_report_wb_error(file);
> > + return ret ? : ret2;
> > }
>
> These are the only two places that call_fsync() is called.
> Did you consider moving the filemap_report_wb_error() call into
> call_fsync() ??
>

I did, and then I thought "do I really want that part in the static
inline?" Of course, that's effectively the same thing, so I'll move it
into call_fsync in the next patch.

--
Jeff Layton <[email protected]>

2017-04-12 22:55:55

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

On Thu, 2017-04-13 at 07:36 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
>
> > On Wed, 2017-04-12 at 07:38 -0700, Matthew Wilcox wrote:
> > > On Wed, Apr 12, 2017 at 09:01:34AM -0400, Jeff Layton wrote:
> > > > On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
> > > > > Not sure what to do here just yet.
> > > > >
> > > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > > ---
> > > > > mm/page-writeback.c | 6 ++++++
> > > > > 1 file changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index de0dbf12e2c1..3ac8399dc984 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
> > > > > ret = mapping->a_ops->writepage(page, &wbc);
> > > > > if (ret == 0) {
> > > > > wait_on_page_writeback(page);
> > > > > + /*
> > > > > + * FIXME: is this racy? What guarantees that PG_error
> > > > > + * will still be set once we get around to checking it?
> > > > > + * What if writeback fails, but then a read is issued
> > > > > + * before we check this, and that calls ClearPageError?
> > > > > + */
> > > > > if (PageError(page))
> > > > > ret = -EIO;
> > > > > }
> > > >
> > > > Ahh, we are always under the page lock here, and this is generally used
> > > > for writing out directory pages anyway. I'm fine with dropping this
> > > > patch unless someone else sees a problem here.
> > >
> > > ->writepage drops the page lock. We're still holding a refcount on this
> > > page, but that's not going to prevent read being called. But maybe the
> > > filesystem won't call read on a page that's marked as PageError?
> >
> > Hard to be sure there. I really wonder if that check is needed at all,
> > the more I look at it. After all, we are calling writepage with
> > WB_SYNC_ALL so we should get an error there.
>
> WB_SYNC_ALL doesn't cause writepage to wait. It might case it to ask
> for REQ_SYNC, so the write requests gets priority in the block layer.
> WB_SYNC_ALL does cause writepages (with an 's') to wait.
> (At least, that is how I read the code).
>

Yeah, I realized that after I sent it. If it waited, we wouldn't need
to wait_on_page_writeback there.

> >
> > Is it also possible these pages could be written back before that point
> > (due to memory pressure or something) and that fail?
>
> Probably, in which case clear_page_dirty_for_io() will fail and
> write_one_page() will just unlock the page.
>

Right. So we're already potentially missing errors here for writeback
if it happened to occur before we called clear_page_dirty_for_io.

> >
> > Maybe we should just have a call to filemap_check_errors on exiting
> > this function?
>
> I'm leaning in that direction.
>
> > With the the wb_err_t based stuff, we could change it to sample the
> > wb_err early, and then use that to see if an error has occurred since
> > then. Maybe we should even allow callers to pass a wb_err_t in here, so
> > we can report errors that have occurred since a known point?
>
> That feels to me like over-engineering. We would need to
> unconditionally call writepage() for that to work.
>
> We seem to be agreed that write errors for buffered writes are reported
> per-address-space. To get per-page errors you have to use direct IO.
> Let's focus on that policy and make it work.
>

That makes sense. I don't think we'd need to always call writepage
there though. We just need to call filemap_check_wb_error even in the
case where writepage wasn't called. IOW, we want to know if there was a
writeback error on the inode, even if it occurred before we tried to
clear the dirty flag.

So, with the new API, we just have to ensure that we sample the
wb_err_t at an earlier time, and then pass that value into
write_one_page so that it can use it to detect the errors since then.

For something like JFS (which apparently uses this function for its fs
journal), the "since" value could be the value at the time that the
journal was last flushed. Then you know whether something failed
regardless of whether it was synced out to disk before this function
was called.

Most of the other callers use this for synchronous directory changes,
so those might want to keep a wb_err_t value in their inode structures,
depending on what they're looking for there. Again, I think this is a
place where we need some input from the fs maintainers.
--
Jeff Layton <[email protected]>

2017-04-12 23:02:05

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

On Thu, 2017-04-13 at 07:55 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
>
>
> > +void __filemap_set_wb_error(struct address_space *mapping, int err)
>
> I was really hoping that this would be
>
> void __set_wb_error(wb_err_t *wb_err, int err)
>
> so
>
> Then nfs_context_set_write_error could become
>
> static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> {
> __set_wb_error(&ctx->wb_err, error);
> }
>
> and filemap_set_sb_error() would be:
>
> static inline void filemap_set_wb_error(struct address_space *mapping, int err)
> {
> /* Optimize for the common case of no error */
> if (unlikely(err))
> __set_wb_error(&mapping->f_wb_err, err);
> }
>
> Similarly we would have
> wb_err_t sample_wb_error(wb_err_t *wb_err)
> {
> ...
> }
>
> and
>
> wb_err_t filemap_sample_wb_error(struct address_space *mapping)
> {
> return sample_wb_error(&mapping->f_wb_err);
> }
>
> so nfs_file_fsync_commit() could have
> ret = sample_wb_error(&ctx->wb_err);
> in place of
> ret = xchg(&ctx->error, 0);
>
> int filemap_report_wb_error(struct file *file)
>
> would become
>
> int filemap_report_wb_error(struct file *file, wb_err_t *err)
>
> or something.
>
> The address space is just one (obvious) place where the wb error can be
> stored. The filesystem might have a different place with finer
> granularity (nfs already does).
>
>

I think it'd be much simpler to adapt NFS over to use the new
infrastructure (I have a draft patch for that already). You'd lose the
ability to track a different error for each nfs_open_context, but I'm
not sure how valuable that is anyway. I'll need to think about that
one...

> > +wb_err_t filemap_sample_wb_error(struct address_space *mapping)
> > +{
> > + wb_err_t old = READ_ONCE(mapping->wb_err);
> > + wb_err_t new = old;
> > +
> > + /*
> > + * For the common case of no errors ever having been set, we can skip
> > + * marking the SEEN bit. Once an error has been set, the value will
> > + * never go back to zero.
> > + */
> > + if (old != 0) {
> > + new |= WB_ERR_SEEN;
> > + if (old != new)
> > + cmpxchg(&mapping->wb_err, old, new);
> > + }
> > + return new;
> > +}
>
> I do like how the use of cmpxchg work out here - no looping!
>

Me too. :)
--
Jeff Layton <[email protected]>

2017-04-17 15:17:56

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
>
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> >
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> >
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary. For fsync calls (and things like the nfsd equivalent),
> > we either return the error that the fsync operation returns, or the one
> > returned by filemap_report_wb_error. In both cases, we advance the
> > file->f_wb_err to the latest value.
> >
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> >
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to do anything like
> > that, or clear errors out of the mapping. A lot of the existing codebase
> > relies on being getting an error back from those functions when there is
> > a writeback problem, so we do still want to have them report writeback
> > errors somehow.
> >
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled wb_err_t value).
> >
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the mapping->wb_err value, and use that as
> > an arbitrary point from which to check for errors.
> >
> > That's probably not ideal, particularly in the case of something like
> > filemap_fdatawait, but I'm not sure it's any worse than what we
> > already have, and this gives us a basis from which to work.
>
> This aspect of the patch looked rather odd -- sampling a wb_err_t at
> some arbitrary time, and then comparing it later. So that for going to
> the trouble of explaining it.
>
> I suspect that the filemap_check_wb_error() will need to be moved
> into some parent of the current call site, which is essentially what you
> suggest below. It would be nice if we could do that first, rather than
> having the current rather odd code. But maybe this way is an easier
> transition. It isn't obviously wrong, it just isn't obviously right
> either.
>

I thought about this a bit over the weekend, and I think we have a
couple of other options here when we don't have a struct file to work
with.

1) we could pass in a zeroed out value for "since" when we don't have
struct file. The way the implementation works, that tells you whether
there has ever been a writeback error since the inode was first
instantiated. The downside there is that we don't have a way to clear
this out until the inode is evicted, so that's a clear behavior change
from how AS_* flags work today.

2) we could store a second wb_err_t value in the mapping. That one
would basically act as the wb_err_t "cursor" for these cases. We have
to do something like filemap_report_wb_error but would swap the value
into the mapping's cursor instead of dealing with struct file. That's
also not perfectly like what we have with AS_EIO/AS_ENOSPC flags, but
is probably close enough not to matter. Doing that with atomics may be
tricky though.

Alternately, we could just try to plumb in reasonable "since" values
for all of the callers. Some of those won't be too hard to do (and some
don't even really need the errors at all), but some I really have no
clue on.

Option #2 above gives us a stopgap way to move those callers over
without having to have them worry about the details as much.

> > int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> > {
> > + int ret, ret2;
> > struct inode *inode = file->f_mapping->host;
> >
> > if (!file->f_op->fsync)
> > @@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> > spin_unlock(&inode->i_lock);
> > mark_inode_dirty_sync(inode);
> > }
> > - return call_fsync(file, start, end, datasync);
> > + ret = call_fsync(file, start, end, datasync);
> > + ret2 = filemap_report_wb_error(file);
> > + return ret ? : ret2;
> > }
> > EXPORT_SYMBOL(vfs_fsync_range);
> >
>
> ....
>
> > static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> > {
> > + int ret, ret2;
> > struct shm_file_data *sfd = shm_file_data(file);
> >
> > if (!sfd->file->f_op->fsync)
> > return -EINVAL;
> > - return call_fsync(sfd->file, start, end, datasync);
> > + ret = call_fsync(sfd->file, start, end, datasync);
> > + ret2 = filemap_report_wb_error(file);
> > + return ret ? : ret2;
> > }
>
> These are the only two places that call_fsync() is called.
> Did you consider moving the filemap_report_wb_error() call into
> call_fsync() ??
>
> Thanks,
> NeilBrown

--
Jeff Layton <[email protected]>

2017-04-17 22:54:16

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

On Wed, Apr 12 2017, Jeff Layton wrote:

> On Thu, 2017-04-13 at 07:55 +1000, NeilBrown wrote:
>> On Wed, Apr 12 2017, Jeff Layton wrote:
>>
>>
>> > +void __filemap_set_wb_error(struct address_space *mapping, int err)
>>
>> I was really hoping that this would be
>>
>> void __set_wb_error(wb_err_t *wb_err, int err)
>>
>> so
>>
>> Then nfs_context_set_write_error could become
>>
>> static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>> {
>> __set_wb_error(&ctx->wb_err, error);
>> }
>>
>> and filemap_set_sb_error() would be:
>>
>> static inline void filemap_set_wb_error(struct address_space *mapping, int err)
>> {
>> /* Optimize for the common case of no error */
>> if (unlikely(err))
>> __set_wb_error(&mapping->f_wb_err, err);
>> }
>>
>> Similarly we would have
>> wb_err_t sample_wb_error(wb_err_t *wb_err)
>> {
>> ...
>> }
>>
>> and
>>
>> wb_err_t filemap_sample_wb_error(struct address_space *mapping)
>> {
>> return sample_wb_error(&mapping->f_wb_err);
>> }
>>
>> so nfs_file_fsync_commit() could have
>> ret = sample_wb_error(&ctx->wb_err);
>> in place of
>> ret = xchg(&ctx->error, 0);
>>
>> int filemap_report_wb_error(struct file *file)
>>
>> would become
>>
>> int filemap_report_wb_error(struct file *file, wb_err_t *err)
>>
>> or something.
>>
>> The address space is just one (obvious) place where the wb error can be
>> stored. The filesystem might have a different place with finer
>> granularity (nfs already does).
>>
>>
>
> I think it'd be much simpler to adapt NFS over to use the new
> infrastructure (I have a draft patch for that already). You'd lose the
> ability to track a different error for each nfs_open_context, but I'm
> not sure how valuable that is anyway. I'll need to think about that
> one...

From a technical perspective, it might be "simpler" but I contest "much
simpler". I think it would be easy to put one wb_err_t per
nfs_open_context, if the former were designed well (which itself would
be easy).

From a political perspective, I doubt it would be simple. NFS is the
way it is for a reason, and convincing an author that their reason is
not valid tends to be harder than most technical issues.
(looking to history...
the 'error' field was added to the nfs_open_context in
Commit: 6caf69feb23a ("NFSv2/v3/v4: Place NFS nfs_page shared data into a single structure that hangs off filp->private_data. As a side effect, this also cleans up the NFSv4 private file state info.")

in 2.6.12. Prior to that file->f_error was used.
Prior to commit 9ffb8c3a1955 ("Import 2.2.3pre1") (which has no comment)
errors were ... interesting. Look for nfs_check_error in
commit d9c0ffee4db7 ("Import 2.1.128") and notice the use of current->pid!!
All commits from the history.git tree.
)

It is quite possible for an NFS server to return different errors to
different users. It might be odd, but it is possible. Should an error
that affects one user pollute all other users?

Thanks,
NeilBrown


Attachments:
signature.asc (832.00 B)

2017-04-17 22:56:25

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Wed, Apr 12 2017, Jeff Layton wrote:

> On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
>>
>> I suspect that the filemap_check_wb_error() will need to be moved
>> into some parent of the current call site, which is essentially what you
>> suggest below. It would be nice if we could do that first, rather than
>> having the current rather odd code. But maybe this way is an easier
>> transition. It isn't obviously wrong, it just isn't obviously right
>> either.
>>
>
> Yeah. It's just such a daunting task to have to change so much of the
> existing code. I'm looking for ways to make this simpler.
>
> I think it probably is reasonable for filemap_write_and_wait* to just
> sample it as early as possible in those functions. filemap_fdatawait is
> the real questionable one, as you may have already had some writebacks
> complete with errors.
>
> In any case, my thinking was that the old code is not obviously correct
> either, so while this shortens the "error capture window" on these
> calls, it seems like a reasonable place to start improving things.

I agree. It wouldn't hurt to add a note to this effect in the patch
comment so that people understand that the code isn't seen to be
"correct" but only "no worse" with clear direction on what sort of
improvement might be appropriate.

Thanks,
NeilBrown


Attachments:
signature.asc (832.00 B)

2017-04-21 12:47:18

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
>
> > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> > >
> > > I suspect that the filemap_check_wb_error() will need to be moved
> > > into some parent of the current call site, which is essentially what you
> > > suggest below. It would be nice if we could do that first, rather than
> > > having the current rather odd code. But maybe this way is an easier
> > > transition. It isn't obviously wrong, it just isn't obviously right
> > > either.
> > >
> >
> > Yeah. It's just such a daunting task to have to change so much of the
> > existing code. I'm looking for ways to make this simpler.
> >
> > I think it probably is reasonable for filemap_write_and_wait* to just
> > sample it as early as possible in those functions. filemap_fdatawait is
> > the real questionable one, as you may have already had some writebacks
> > complete with errors.
> >
> > In any case, my thinking was that the old code is not obviously correct
> > either, so while this shortens the "error capture window" on these
> > calls, it seems like a reasonable place to start improving things.
>
> I agree. It wouldn't hurt to add a note to this effect in the patch
> comment so that people understand that the code isn't seen to be
> "correct" but only "no worse" with clear direction on what sort of
> improvement might be appropriate.
>

I've got a cleaned-up set that is getting close to ready for
reposting. Before I do though, I think there is another option here
that's worth discussing.

We could store a second wb_err_t (aka errseq_t in the new set) in the
mapping that would would basically act as a "cursor" for these cases.
filemap_check_errors would need to do something like
filemap_report_wb_error, but it would swap the value into the mapping's
cursor instead of dealing with the one in struct file.

I don't really like adding yet another field here, but the struct
address_space definition has this:

__attribute__((aligned(sizeof(long))));

Adding the wb_err field means that we end up growing the struct by 8
bytes on x86_64 anyway. Adding another 4 bytes would just consume the
pad, so it wouldn't cost anything there. YMMV on other arches of
course.

That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
flags, but is probably close enough not to matter.

So...this would let us limp along for even longer with the model of
reporting since last check. I'm not sure that's a good thing though. A
long term goal here is to have kernel code that's dealing with
writeback be more deliberate about the point from which it's checking
errors, and this doesn't help promote that.
--
Jeff Layton <[email protected]>

2017-04-23 22:39:10

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Fri, Apr 21 2017, Jeff Layton wrote:

> On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
>> On Wed, Apr 12 2017, Jeff Layton wrote:
>>
>> > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
>> > >
>> > > I suspect that the filemap_check_wb_error() will need to be moved
>> > > into some parent of the current call site, which is essentially what you
>> > > suggest below. It would be nice if we could do that first, rather than
>> > > having the current rather odd code. But maybe this way is an easier
>> > > transition. It isn't obviously wrong, it just isn't obviously right
>> > > either.
>> > >
>> >
>> > Yeah. It's just such a daunting task to have to change so much of the
>> > existing code. I'm looking for ways to make this simpler.
>> >
>> > I think it probably is reasonable for filemap_write_and_wait* to just
>> > sample it as early as possible in those functions. filemap_fdatawait is
>> > the real questionable one, as you may have already had some writebacks
>> > complete with errors.
>> >
>> > In any case, my thinking was that the old code is not obviously correct
>> > either, so while this shortens the "error capture window" on these
>> > calls, it seems like a reasonable place to start improving things.
>>
>> I agree. It wouldn't hurt to add a note to this effect in the patch
>> comment so that people understand that the code isn't seen to be
>> "correct" but only "no worse" with clear direction on what sort of
>> improvement might be appropriate.
>>
>
> I've got a cleaned-up set that is getting close to ready for
> reposting. Before I do though, I think there is another option here
> that's worth discussing.
>
> We could store a second wb_err_t (aka errseq_t in the new set) in the
> mapping that would would basically act as a "cursor" for these cases.
> filemap_check_errors would need to do something like
> filemap_report_wb_error, but it would swap the value into the mapping's
> cursor instead of dealing with the one in struct file.
>
> I don't really like adding yet another field here, but the struct
> address_space definition has this:
>
> __attribute__((aligned(sizeof(long))));
>
> Adding the wb_err field means that we end up growing the struct by 8
> bytes on x86_64 anyway. Adding another 4 bytes would just consume the
> pad, so it wouldn't cost anything there. YMMV on other arches of
> course.
>
> That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
> flags, but is probably close enough not to matter.
>
> So...this would let us limp along for even longer with the model of
> reporting since last check. I'm not sure that's a good thing though. A
> long term goal here is to have kernel code that's dealing with
> writeback be more deliberate about the point from which it's checking
> errors, and this doesn't help promote that.

I think this question needs some input from filesystem developers who
might be affected by the answer.

My preference is to not add this field. I think we would eventually
want to remove it again, and it is easier to ensure it doesn't stay
forever if it is never added.
The version without this field isn't (I think) too bad, but maybe it is
bad enough to motivate fs developers to create a better solution in each
individual case.

If some filesystem developer says they don't like that sort of social
engineering, or objects for any other reason, I will bow to the superior
stake they hold.

NeilBrown


Attachments:
signature.asc (832.00 B)

2017-04-24 11:50:36

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 08/17] fs: retrofit old error reporting API onto new infrastructure

On Mon, 2017-04-24 at 08:38 +1000, NeilBrown wrote:
> On Fri, Apr 21 2017, Jeff Layton wrote:
>
> > On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
> > > On Wed, Apr 12 2017, Jeff Layton wrote:
> > >
> > > > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> > > > >
> > > > > I suspect that the filemap_check_wb_error() will need to be moved
> > > > > into some parent of the current call site, which is essentially what you
> > > > > suggest below. It would be nice if we could do that first, rather than
> > > > > having the current rather odd code. But maybe this way is an easier
> > > > > transition. It isn't obviously wrong, it just isn't obviously right
> > > > > either.
> > > > >
> > > >
> > > > Yeah. It's just such a daunting task to have to change so much of the
> > > > existing code. I'm looking for ways to make this simpler.
> > > >
> > > > I think it probably is reasonable for filemap_write_and_wait* to just
> > > > sample it as early as possible in those functions. filemap_fdatawait is
> > > > the real questionable one, as you may have already had some writebacks
> > > > complete with errors.
> > > >
> > > > In any case, my thinking was that the old code is not obviously correct
> > > > either, so while this shortens the "error capture window" on these
> > > > calls, it seems like a reasonable place to start improving things.
> > >
> > > I agree. It wouldn't hurt to add a note to this effect in the patch
> > > comment so that people understand that the code isn't seen to be
> > > "correct" but only "no worse" with clear direction on what sort of
> > > improvement might be appropriate.
> > >
> >
> > I've got a cleaned-up set that is getting close to ready for
> > reposting. Before I do though, I think there is another option here
> > that's worth discussing.
> >
> > We could store a second wb_err_t (aka errseq_t in the new set) in the
> > mapping that would would basically act as a "cursor" for these cases.
> > filemap_check_errors would need to do something like
> > filemap_report_wb_error, but it would swap the value into the mapping's
> > cursor instead of dealing with the one in struct file.
> >
> > I don't really like adding yet another field here, but the struct
> > address_space definition has this:
> >
> > __attribute__((aligned(sizeof(long))));
> >
> > Adding the wb_err field means that we end up growing the struct by 8
> > bytes on x86_64 anyway. Adding another 4 bytes would just consume the
> > pad, so it wouldn't cost anything there. YMMV on other arches of
> > course.
> >
> > That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
> > flags, but is probably close enough not to matter.
> >
> > So...this would let us limp along for even longer with the model of
> > reporting since last check. I'm not sure that's a good thing though. A
> > long term goal here is to have kernel code that's dealing with
> > writeback be more deliberate about the point from which it's checking
> > errors, and this doesn't help promote that.
>
> I think this question needs some input from filesystem developers who
> might be affected by the answer.
>
> My preference is to not add this field. I think we would eventually
> want to remove it again, and it is easier to ensure it doesn't stay
> forever if it is never added.
> The version without this field isn't (I think) too bad, but maybe it is
> bad enough to motivate fs developers to create a better solution in each
> individual case.
>
> If some filesystem developer says they don't like that sort of social
> engineering, or objects for any other reason, I will bow to the superior
> stake they hold.
>
>

That's pretty much my view too. I just figured I needed to throw the
option out there in the interest of full disclosure.

I think keeping a per-mapping cursor like this does make sense in some
situations though. For instance, there does seem to be quite a bit of
local fs journaling code that goes through the pagecache. For those, I
could see keeping the cursor in some sort of per-journal structure, and
doing a check-and-advance against that in appropriate places.

This is an option we can bring up for folks who do want to continue to
use a similar error tracking model in these situations though.
--
Jeff Layton <[email protected]>