I recently did some work to wire up -ENOSPC handling in ceph, and found
I could get back -EIO errors in some cases when I should have instead
gotten -ENOSPC. The problem was that the ceph writeback code would set
PG_error on a writeback error, and that error would clobber the mapping
error.
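(For context: mapping_set_error() can only record one of two bits in the
mapping, roughly:

static inline void mapping_set_error(struct address_space *mapping, int error)
{
        if (unlikely(error)) {
                if (error == -ENOSPC)
                        set_bit(AS_ENOSPC, &mapping->flags);
                else
                        set_bit(AS_EIO, &mapping->flags);
        }
}

...while a page found with PG_error set in the fsync wait path is turned
into a flat -EIO that takes precedence over an AS_ENOSPC report.)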
While I fixed that problem by simply not setting that bit on errors,
that led me down a rabbit hole of looking at how PG_error is being
handled in the kernel.
This patch series is a few fixes for things that I noticed by
inspection. I don't have a great way to test these since they involve
error handling, but if they look plausible up front, I can certainly
doctor up a kernel to inject errors in this code and test by hand.
Jeff Layton (3):
nilfs2: set the mapping error when calling SetPageError on writeback
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: set mapping error when launder_pages fails
fs/nilfs2/segment.c | 1 +
mm/filemap.c | 19 ++++---------------
mm/truncate.c | 6 +++++-
3 files changed, 10 insertions(+), 16 deletions(-)
--
2.9.3
[PATCH 3/3] mm: set mapping error when launder_pages fails

If launder_page fails, then we hit a problem writing back some inode
data. Ensure that we communicate that fact in a subsequent fsync since
another task could still have it open for write.
Signed-off-by: Jeff Layton <[email protected]>
---
mm/truncate.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index dd7b24e083c5..49ad4e2a6cb6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -593,11 +593,15 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
static int do_launder_page(struct address_space *mapping, struct page *page)
{
+ int ret;
+
if (!PageDirty(page))
return 0;
if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
return 0;
- return mapping->a_ops->launder_page(page);
+ ret = mapping->a_ops->launder_page(page);
+ mapping_set_error(mapping, ret);
+ return ret;
}
/**
--
2.9.3
[PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback

In a later patch, we're going to want to make the fsync codepath not do
a TestClearPageError call as that can override the error set in the
address space. To do that though, we need to ensure that filesystems
that are relying on the PG_error bit for reporting writeback errors
also set an error in the address space.
The only place I've found that looks potentially problematic is this
spot in nilfs2. Ensure that it sets an error in the mapping in addition
to setting PageError.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nilfs2/segment.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index bedcae2c28e6..c1041b07060e 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1743,6 +1743,7 @@ static void nilfs_end_page_io(struct page *page, int err)
} else {
__set_page_dirty_nobuffers(page);
SetPageError(page);
+ mapping_set_error(page_mapping(page), err);
}
end_page_writeback(page);
--
2.9.3
On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> I recently did some work to wire up -ENOSPC handling in ceph, and found
> I could get back -EIO errors in some cases when I should have instead
> gotten -ENOSPC. The problem was that the ceph writeback code would set
> PG_error on a writeback error, and that error would clobber the mapping
> error.
>
I should also note that relying on PG_error to report writeback errors
is inherently unreliable as well. If someone calls sync() before your
fsync gets in there, then you'll likely lose it anyway.
filemap_fdatawait_keep_errors will preserve the error in the mapping,
but not the individual PG_error flags, so I think we do want to ensure
that the mapping error is set when there is a writeback error and not
rely on the PG_error bit for that.
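For reference, that helper is roughly:

void filemap_fdatawait_keep_errors(struct address_space *mapping)
{
        loff_t i_size = i_size_read(mapping->host);

        if (i_size == 0)
                return;

        /* wait for writeback, but deliberately skip the
         * filemap_check_errors() call, leaving AS_EIO/AS_ENOSPC
         * set for a later fsync to report */
        __filemap_fdatawait_range(mapping, 0, i_size - 1);
}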
> While I fixed that problem by simply not setting that bit on errors,
> that led me down a rabbit hole of looking at how PG_error is being
> handled in the kernel.
>
> This patch series is a few fixes for things that I 100% noticed by
> inspection. I don't have a great way to test these since they involve
> error handling. I can certainly doctor up a kernel to inject errors
> in this code and test by hand however if these look plausible up front.
>
> Jeff Layton (3):
> nilfs2: set the mapping error when calling SetPageError on writeback
> mm: don't TestClearPageError in __filemap_fdatawait_range
> mm: set mapping error when launder_pages fails
>
> fs/nilfs2/segment.c | 1 +
> mm/filemap.c | 19 ++++---------------
> mm/truncate.c | 6 +++++-
> 3 files changed, 10 insertions(+), 16 deletions(-)
>
(cc'ing Ross...)
Just when I thought that only NILFS2 needed a little work here, I see
another spot...
I think we also need to fix dax_writeback_mapping_range to
set a mapping error on writeback failure as well. It looks like that's not
happening today. Something like the patch below (obviously untested).
I'll also plan to follow up with a patch to vfs.txt to outline how
writeback errors should be handled by filesystems, assuming that this
patchset isn't completely off base.
-------------------8<-----------------------
[PATCH] dax: set error in mapping when writeback fails
In order to get proper error codes from fsync, we must set an error in
the mapping range when writeback fails.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/dax.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index c45598b912e1..9005d90deeda 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
ret = dax_writeback_one(bdev, mapping, indices[i],
pvec.pages[i]);
- if (ret < 0)
+ if (ret < 0) {
+ mapping_set_error(mapping, ret);
return ret;
+ }
}
}
return 0;
--
2.9.3
[PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range

The -EIO returned here can end up overriding whatever error is marked in
the address space. This means that an -ENOSPC error (AS_ENOSPC) can end
up being turned into -EIO if a page gets PG_error set on it during error
handling. Arguably, that's a bug in the writeback code, but...
Read errors are also tracked on a per page level using PG_error. Suppose
we have a read error on a page, and then that page is subsequently
dirtied by overwriting the whole page. Writeback doesn't clear PG_error,
so we can then end up successfully writing back that page and still
return -EIO on fsync.
Since the handling of this bit is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there
are writeback errors.
Signed-off-by: Jeff Layton <[email protected]>
---
mm/filemap.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 3f9afded581b..2b0b4ff4668b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -375,17 +375,16 @@ int filemap_flush(struct address_space *mapping)
}
EXPORT_SYMBOL(filemap_flush);
-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
loff_t start_byte, loff_t end_byte)
{
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
- int ret = 0;
if (end_byte < start_byte)
- goto out;
+ return;
pagevec_init(&pvec, 0);
while ((index <= end) &&
@@ -402,14 +401,10 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
continue;
wait_on_page_writeback(page);
- if (TestClearPageError(page))
- ret = -EIO;
}
pagevec_release(&pvec);
cond_resched();
}
-out:
- return ret;
}
/**
@@ -429,14 +424,8 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
{
- int ret, ret2;
-
- ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
- ret2 = filemap_check_errors(mapping);
- if (!ret)
- ret = ret2;
-
- return ret;
+ __filemap_fdatawait_range(mapping, start_byte, end_byte);
+ return filemap_check_errors(mapping);
}
EXPORT_SYMBOL(filemap_fdatawait_range);
--
2.9.3
On Sun, Mar 05 2017, Jeff Layton wrote:
> I recently did some work to wire up -ENOSPC handling in ceph, and found
> I could get back -EIO errors in some cases when I should have instead
> gotten -ENOSPC. The problem was that the ceph writeback code would set
> PG_error on a writeback error, and that error would clobber the mapping
> error.
>
> While I fixed that problem by simply not setting that bit on errors,
> that led me down a rabbit hole of looking at how PG_error is being
> handled in the kernel.
Speaking of rabbit holes... I thought to wonder how IO errors propagate
up from NFS.
It doesn't use SetPageError or mapping_set_error() for files (except in
one case that looks a bit odd).
It has an "nfs_open_context" and store the latest error in ctx->error.
So when you get around to documenting how this is supposed to work, it
would be worthwhile describing the required observable behaviour, and
noting that while filesystems can use mapping_set_error() to achieve this,
they don't have to.
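(The write error path in fs/nfs/write.c stashes it roughly like so:

static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
{
        ctx->error = error;
        smp_wmb();
        set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
}

and the fsync path later picks ctx->error up when it sees
NFS_CONTEXT_ERROR_WRITE set.)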
I notice that
drivers/staging/lustre/lustre/llite/rw.c
fs/afs/write.c
fs/btrfs/extent_io.c
fs/cifs/file.c
fs/jffs2/file.c
fs/jfs/jfs_metapage.c
fs/ntfs/aops.c
(and possibly others) all have SetPageError() calls that seem to be
in response to a write error to a file, but don't appear to have
matching mapping_set_error() calls. Did you look at these? Did I miss
something?
Thanks,
NeilBrown
On Mon, 2017-03-06 at 14:06 +1100, NeilBrown wrote:
> I notice that
> drivers/staging/lustre/lustre/llite/rw.c
> fs/afs/write.c
> fs/btrfs/extent_io.c
> fs/cifs/file.c
> fs/jffs2/file.c
> fs/jfs/jfs_metapage.c
> fs/ntfs/aops.c
>
> (and possibly others) all have SetPageError() calls that seem to be
> in response to a write error to a file, but don't appear to have
> matching mapping_set_error() calls. Did you look at these? Did I miss
> something?
>
Those are all in writepage implementations, and the callers of writepage
all call mapping_set_error if it returns an error. The exception is
write_one_page, which is typically used for writing out dir info and
such, and so it's not really necessary there.
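For instance, generic_writepages() funnels everything through this helper
in mm/page-writeback.c, which looks roughly like:

static int __writepage(struct page *page, struct writeback_control *wbc,
                       void *data)
{
        struct address_space *mapping = data;
        int ret = mapping->a_ops->writepage(page, wbc);

        /* any error from ->writepage lands in the mapping here */
        mapping_set_error(mapping, ret);
        return ret;
}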
Now that I look though, I think I may have gotten the page migration
codepath wrong. I had convinced myself it was ok before, but looking
again, I think we probably need to add a mapping_set_error call to
writeout() as well. I'll go over that more carefully in a little while.
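Probably just something like this in writeout() there (untested sketch):

        rc = mapping->a_ops->writepage(page, &wbc);
        if (rc < 0)
                /* record the failure so a later fsync can see it */
                mapping_set_error(mapping, rc);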
--
Jeff Layton <[email protected]>
On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> (cc'ing Ross...)
>
> Just when I thought that only NILFS2 needed a little work here, I see
> another spot...
>
> I think we also need to fix dax_writeback_mapping_range to
> set a mapping error on writeback failure as well. It looks like that's not
> happening today. Something like the patch below (obviously untested).
>
> I'll also plan to follow up with a patch to vfs.txt to outline how
> writeback errors should be handled by filesystems, assuming that this
> patchset isn't completely off base.
>
> -------------------8<-----------------------
>
> [PATCH] dax: set error in mapping when writeback fails
>
> In order to get proper error codes from fsync, we must set an error in
> the mapping range when writeback fails.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/dax.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index c45598b912e1..9005d90deeda 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>
> ret = dax_writeback_one(bdev, mapping, indices[i],
> pvec.pages[i]);
> - if (ret < 0)
> + if (ret < 0) {
> + mapping_set_error(mapping, ret);
> return ret;
> + }
(Adding Jan)
I tested this a bit, and for the DAX case at least I don't think this does
what you want. The current code already returns -EIO if dax_writeback_one()
hits an error, which bubbles up through the call stack and makes the fsync()
call in userspace fail with EIO, as we want. With both ext4 and xfs this
patch (applied to v4.10) makes it so that we fail the current fsync() due to
the return value of -EIO, then we fail the next fsync() as well because only
then do we actually process the AS_EIO flag inside of filemap_check_errors().
I think maybe the missing piece is that our normal DAX fsync call stack
doesn't include a call to filemap_check_errors() if we return -EIO. Here's
our stack in xfs:
dax_writeback_mapping_range+0x32/0x70
xfs_vm_writepages+0x8c/0xf0
do_writepages+0x21/0x30
__filemap_fdatawrite_range+0xc6/0x100
filemap_write_and_wait_range+0x44/0x90
xfs_file_fsync+0x7a/0x2c0
vfs_fsync_range+0x4b/0xb0
? trace_hardirqs_on_caller+0xf5/0x1b0
do_fsync+0x3d/0x70
SyS_fsync+0x10/0x20
entry_SYSCALL_64_fastpath+0x1f/0xc2
On the subsequent fsync() call we *do* end up calling filemap_check_errors()
via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
mapping:
filemap_fdatawait_range+0x3b/0x80
filemap_write_and_wait_range+0x5a/0x90
xfs_file_fsync+0x7a/0x2c0
vfs_fsync_range+0x4b/0xb0
? trace_hardirqs_on_caller+0xf5/0x1b0
do_fsync+0x3d/0x70
SyS_fsync+0x10/0x20
entry_SYSCALL_64_fastpath+0x1f/0xc2
Was your concern just that you didn't think that fsync() was properly
returning an error when dax_writeback_one() hit an error? Or is there another
path by which we need to report the error, where it is actually important that
we set AS_EIO? If it's the latter, then I think we need to rework the fsync
call path so that we both generate and consume AS_EIO on the same call,
probably in filemap_write_and_wait_range().
On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> (Adding Jan)
>
> I tested this a bit, and for the DAX case at least I don't think this does
> what you want. The current code already returns -EIO if dax_writeback_one()
> hits an error, which bubbles up through the call stack and makes the fsync()
> call in userspace fail with EIO, as we want. With both ext4 and xfs this
> patch (applied to v4.10) makes it so that we fail the current fsync() due to
> the return value of -EIO, then we fail the next fsync() as well because only
> then do we actually process the AS_EIO flag inside of filemap_check_errors().
>
> I think maybe the missing piece is that our normal DAX fsync call stack
> doesn't include a call to filemap_check_errors() if we return -EIO. Here's
> our stack in xfs:
>
> dax_writeback_mapping_range+0x32/0x70
> xfs_vm_writepages+0x8c/0xf0
> do_writepages+0x21/0x30
> __filemap_fdatawrite_range+0xc6/0x100
> filemap_write_and_wait_range+0x44/0x90
> xfs_file_fsync+0x7a/0x2c0
> vfs_fsync_range+0x4b/0xb0
> ? trace_hardirqs_on_caller+0xf5/0x1b0
> do_fsync+0x3d/0x70
> SyS_fsync+0x10/0x20
> entry_SYSCALL_64_fastpath+0x1f/0xc2
>
> On the subsequent fsync() call we *do* end up calling filemap_check_errors()
> > via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
> mapping:
>
> filemap_fdatawait_range+0x3b/0x80
> filemap_write_and_wait_range+0x5a/0x90
> xfs_file_fsync+0x7a/0x2c0
> vfs_fsync_range+0x4b/0xb0
> ? trace_hardirqs_on_caller+0xf5/0x1b0
> do_fsync+0x3d/0x70
> SyS_fsync+0x10/0x20
> entry_SYSCALL_64_fastpath+0x1f/0xc2
>
> Was your concern just that you didn't think that fsync() was properly
> returning an error when dax_writeback_one() hit an error? Or is there another
> path by which we need to report the error, where it is actually important that
> we set AS_EIO? If it's the latter, then I think we need to rework the fsync
> call path so that we both generate and consume AS_EIO on the same call,
> probably in filemap_write_and_wait_range().
So I believe this is due to the special handling of EIO inside
filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
filemap_fdatawait() there; however, that does not happen for EIO returned from
filemap_fdatawrite(). In that case we bail out immediately. So I think
Jeff's patch is correct, but we need to change filemap_write_and_wait() to
also call filemap_check_errors() directly on EIO from filemap_fdatawrite().
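For reference, filemap_write_and_wait_range() currently looks roughly
like this:

int filemap_write_and_wait_range(struct address_space *mapping,
                                 loff_t lstart, loff_t lend)
{
        int err = 0;

        if ((!dax_mapping(mapping) && mapping->nrpages) ||
            (dax_mapping(mapping) && mapping->nrexceptional)) {
                err = __filemap_fdatawrite_range(mapping, lstart, lend,
                                                 WB_SYNC_ALL);
                /* an EIO from fdatawrite skips the wait below - and
                 * with it the filemap_check_errors() call in there */
                if (err != -EIO) {
                        int err2 = filemap_fdatawait_range(mapping,
                                                           lstart, lend);
                        if (!err)
                                err = err2;
                }
        } else {
                err = filemap_check_errors(mapping);
        }
        return err;
}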
On a more general note (DAX is actually fine here), I find the current
practice of clearing page dirty bits on error and reporting it just once
problematic. It keeps the system running but data is lost and possibly
without getting the error anywhere where it is useful. We get away with
this because it is a rare event but it seems like a problematic behavior.
But this is more for the discussion at LSF.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, 2017-03-07 at 11:26 +0100, Jan Kara wrote:
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
> filemap_fdatawait() there; however, that does not happen for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct, but we need to change filemap_write_and_wait() to
> also call filemap_check_errors() directly on EIO from filemap_fdatawrite().
>
Yes that makes total sense. I've got a filemap_write_and_wait patch in
my pile now that does this. I'll run what I have through an xfstests
run, and see how it does, and will plan to post a v2 set once I do.
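(Roughly, adding an else branch like this, so a stale AS_EIO doesn't
linger in the mapping only to be reported by the *next* fsync:

                if (err != -EIO) {
                        int err2 = filemap_fdatawait_range(mapping,
                                                           lstart, lend);
                        if (!err)
                                err = err2;
                } else {
                        /* we're returning -EIO already; just clear any
                         * error bits recorded in the mapping */
                        filemap_check_errors(mapping);
                }

...but I'll post the real thing once it has survived xfstests.)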
> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.
>
That really is the crux of the matter. Unfortunately, that's sort of how
the POSIX write/fsync model is designed. If we want to change that, then
I think that we have to consider what a new interface for this would
look like. Maybe we can do something there with new sync_file_range
flags?
I think this probably also dovetails with Kevin Wolf's proposed LSF
topic too, so maybe we can discuss all of this together there.
--
Jeff Layton <[email protected]>
On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
case, but we definitely want to make changes that keep them consistent.
> filemap_fdatawait() there; however, that does not happen for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct, but we need to change filemap_write_and_wait() to
> also call filemap_check_errors() directly on EIO from filemap_fdatawrite().
So I guess my question was: why is it important that we set AS_EIO, if the EIO
is already being reported correctly? Is it just for consistency with the
buffered fsync case, or is there currently a path where the -EIO from DAX will
be missed, and we actually need AS_EIO to be set in mapping->flags so that we
correctly report an error?
On Tue 07-03-17 08:59:16, Ross Zwisler wrote:
> > So I believe this is due to the special handling of EIO inside
> > filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
>
> s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
> case, but we definitely want to make changes that keep them consistent.
>
> > filemap_fdatawait() there; however, that does not happen for EIO returned from
> > filemap_fdatawrite(). In that case we bail out immediately. So I think
> > Jeff's patch is correct, but we need to change filemap_write_and_wait() to
> > also call filemap_check_errors() directly on EIO from filemap_fdatawrite().
>
> So I guess my question was: why is it important that we set AS_EIO, if the EIO
> is already being reported correctly? Is it just for consistency with the
> buffered fsync case, or is there currently a path where the -EIO from DAX will
> be missed, and we actually need AS_EIO to be set in mapping->flags so that we
> correctly report an error?
It is just for consistency and future-proofing. E.g. if we ever decided to do
some background flushing of CPU caches, current DAX behavior would
suddenly result in missed reports of EIO errors.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Sun, 5 Mar 2017 08:35:33 -0500, Jeff Layton <[email protected]> wrote:
> In a later patch, we're going to want to make the fsync codepath not do
> a TestClearPageError call as that can override the error set in the
> address space. To do that though, we need to ensure that filesystems
> that are relying on the PG_error bit for reporting writeback errors
> also set an error in the address space.
>
> The only place I've found that looks potentially problematic is this
> spot in nilfs2. Ensure that it sets an error in the mapping in addition
> to setting PageError.
>
> Signed-off-by: Jeff Layton <[email protected]>
Acked-by: Ryusuke Konishi <[email protected]>
Agreed that nilfs2 needs this if the subsequent patch is applied.
Thanks,
Ryusuke Konishi
On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.
I'm actually running into this in the last day or two because some MM
folks at $WORK have been trying to push hard for GFP_NOFS removal in
ext4 (at least when we are holding some mutex/semaphore like
i_data_sem), since otherwise it's possible for the OOM killer to be
unable to kill processes that are blocked on locks which
ext4 is holding.
I've done some initial investigation, and while it's not that hard to
remove GFP_NOFS from certain parts of the writepages() codepath (which
is where we had been running into problems), a really, REALLY big
problem is that if any_filesystem->writepages() returns ENOMEM, it causes
silent data loss, because the pages are marked clean, and so data
written using buffered writeback goes *poof*.
I confirmed this by creating a test kernel with a simple patch such
that if the ext4 file system is mounted with -o debug, there was a 1
in 16 chance that ext4_writepages will immediately return with ENOMEM
(and printk the inode number, so I knew which inodes had gotten the
ENOMEM treatment). The result was **NOT** pretty.
What I think we should strongly consider is at the very least, special
case ENOMEM being returned by writepages() during background
writeback, and *not* mark the pages clean, and make sure the inode
stays on the dirty inode list, so we can retry the write later. This
is especially important since the process that issued the write may
have gone away, so there might not even be a userspace process to
complain to. Converting certain page allocations (most notably in
ext4_mb_load_buddy) from GFP_NOFS to GFP_KERNEL allows us to
release the i_data_sem lock and return an error. This should allow
the OOM killer to do its dirty deed, and hopefully we can retry
the writepages() for that inode later.
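In other words, a filesystem's writeback error path would do something
like this (sketch only, untested):

        if (err == -ENOMEM) {
                /* transient: keep the page dirty (and the inode on
                 * the dirty list) so writeback retries it later */
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;
        }
        /* hard failure: record it so a later fsync sees it */
        SetPageError(page);
        mapping_set_error(page->mapping, err);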
In the case of a data integrity sync being issued by fsync() or
umount(), we could allow ENOMEM to get returned to userspace there as
well. I'm not convinced all userspace code will handle an
ENOMEM correctly or sanely, but at least people will be less
likely to blame file system developers. :-)
The real problem that's going on here, by the way, is that people are
trying to run programs in insanely tight containers, and then when the
kernel locks up, they blame the mm developers. But if there is silent
data corruption, they will blame the fs developers instead. And while
kernel lockups are temporary (all you have to do is let the watchdog
reboot the system :-), silent data corruption is *forever*. So what
we really need to do is to allow the OOM killer to do its work, and if
job owners are unhappy that their processes are getting OOM killed,
maybe they will be suitably incentivized to pay for more memory in
their containers....
- Ted
P.S. Michael Hocko, apologies for not getting back to you with your
GFP_NOFS removal patches. But the possibility of fs malfunctions that
might lead to silent data corruption is why I'm being very cautious,
and I now have rather strong confirmation that this is not just an
irrational concern on my part. (This is also performance review
season, FAST conference was last week, and Usenix ATC program
committee reviews are due this week. So apologies for any reply
latency.)
On Wed 08-03-17 21:57:25, Ted Tso wrote:
> What I think we should strongly consider is at the very least, special
> case ENOMEM being returned by writepages() during background
> writeback, and *not* mark the pages clean, and make sure the inode
> stays on the dirty inode list, so we can retry the write later. This
> is especially important since the process that issued the write may
> have gone away, so there might not even be a userspace process to
> complain to. Converting certain page allocations (most notably in
> ext4_mb_load_buddy) from GFP_NOFS to GFP_KERNEL allows us to
> release the i_data_sem lock and return an error. This should allow
> the OOM killer to do its dirty deed, and hopefully we can retry
> the writepages() for that inode later.
Yeah, so if we can hope the error is transient, keeping pages dirty and
retrying the write is definitely a better option. For a start, we can say
that ENOMEM, EINTR, EAGAIN, and ENOSPC errors are transient; anything else
means there's no hope of getting the data to disk, and so we just discard
it. It will be a somewhat rough distinction, but probably better than what
we have now.
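(As a purely hypothetical helper, the classification would be something
like:

static bool writeback_error_is_transient(int err)
{
        switch (err) {
        case -ENOMEM:
        case -EINTR:
        case -EAGAIN:
        case -ENOSPC:
                return true;    /* keep pages dirty and retry later */
        default:
                return false;   /* no hope; discard the data */
        }
}

...with the exact set of errors certainly up for debate.)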
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> Yeah, so if we can hope the error is transient, keeping pages dirty and
> retrying the write is definitely a better option. For a start, we can say
> that ENOMEM, EINTR, EAGAIN, and ENOSPC errors are transient; anything else
> means there's no hope of getting the data to disk, and so we just discard
> it. It will be a somewhat rough distinction, but probably better than what
> we have now.
>
> Honza
I'm not sure about ENOSPC there. That's a return code that is
specifically expected to be returned by fsync. It seems like that ought
not be considered a transient error?
--
Jeff Layton <[email protected]>
On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > On a more general note (DAX is actually fine here), I find the current
> > > > practice of clearing page dirty bits on error and reporting it just once
> > > > problematic. It keeps the system running but data is lost and possibly
> > > > without getting the error anywhere where it is useful. We get away with
> > > > this because it is a rare event but it seems like a problematic behavior.
> > > > But this is more for the discussion at LSF.
> > >
> > > I'm actually running into this in the last day or two because some MM
> > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > ext4 (at least when we are holding some mutex/semaphore like
> > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > unable to kill processes because they are holding on to locks that
> > > ext4 is holding.
> > >
> > > I've done some initial investigation, and while it's not that hard to
> > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > is where we had been are running into problems), a really, REALLY big
> > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > silent data loss, because the pages are marked clean, and so data
> > > written using buffered writeback goes *poof*.
> > >
> > > I confirmed this by creating a test kernel with a simple patch such
> > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > (and printk the inode number, so I knew which inodes had gotten the
> > > ENOMEM treatment). The result was **NOT** pretty.
> > >
> > > What I think we should strongly consider is at the very least, special
> > > case ENOMEM being returned by writepages() during background
> > > writeback, and *not* mark the pages clean, and make sure the inode
> > > stays on the dirty inode list, so we can retry the write later. This
> > > is especially important since the process that issued the write may
> > > have gone away, so there might not even be a userspace process to
> > > complain to. By converting certain page allocations (most notably in
> > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KERNEL, this allows us to
> > > release the i_data_sem lock and return an error. This should allow
> > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > the writepages() for that inode later.
> >
> > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > retrying the write is definitely a better option. For a start we can say that
> > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > there's no hope of getting data to disk and so we just discard them. It
> > will be a somewhat rough distinction but probably better than what we have
> > now.
> >
> > Honza
>
> I'm not sure about ENOSPC there. That's a return code that is
> specifically expected to be returned by fsync. It seems like that ought
> not to be considered a transient error?
Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
Long term, we may need to make at least some ENOSPC situations behave as
transient so that thin-provisioned storage does not lose data in case the
admin does not supply additional space fast enough (i.e., before ENOSPC is
actually hit).
EIO is actually in a similar bucket although probably more on the "hard
failure" side - I can imagine there can be types of storage and situations
where the loss of connectivity to the storage is only transient. But for
a start I would not bother with this.
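
As a rough sketch of that classification (the helper name is made up,
and ENOSPC deliberately stays on the hard side for now):

static bool wb_err_is_transient(int err)
{
        /*
         * First cut: these errors give us some hope that a later
         * retry will succeed, so keep the pages dirty and try again.
         * Anything else is a hard failure and the data is dropped.
         */
        switch (err) {
        case -ENOMEM:
        case -EINTR:
        case -EAGAIN:
                return true;
        default:
                return false;
        }
}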
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > without getting the error anywhere it is useful. We get away with
> > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > But this is more for the discussion at LSF.
> > > >
> > > > I'm actually running into this in the last day or two because some MM
> > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > unable to kill processes because they are holding on to locks that
> > > > ext4 is holding.
> > > >
> > > > I've done some initial investigation, and while it's not that hard to
> > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > is where we had been running into problems), a really, REALLY big
> > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > > silent data loss, because the pages are marked clean, and so data
> > > > written using buffered writeback goes *poof*.
> > > >
> > > > I confirmed this by creating a test kernel with a simple patch such
> > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > ENOMEM treatment). The result was **NOT** pretty.
> > > >
> > > > What I think we should strongly consider is at the very least, special
> > > > case ENOMEM being returned by writepages() during background
> > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > stays on the dirty inode list, so we can retry the write later. This
> > > > is especially important since the process that issued the write may
> > > > have gone away, so there might not even be a userspace process to
> > > > complain to. By converting certain page allocations (most notably in
> > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KERNEL, this allows us to
> > > > release the i_data_sem lock and return an error. This should allow
> > > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > > the writepages() for that inode later.
> > >
> > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > retrying the write is definitely a better option. For a start we can say that
> > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > there's no hope of getting data to disk and so we just discard them. It
> > > will be a somewhat rough distinction but probably better than what we have
> > > now.
> > >
> > > Honza
> >
> > I'm not sure about ENOSPC there. That's a return code that is
> > specifically expected to be returned by fsync. It seems like that ought
> > not to be considered a transient error?
>
> Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
> Long term, we may need to make at least some ENOSPC situations behave as
> transient so that thin-provisioned storage does not lose data in case the
> admin does not supply additional space fast enough (i.e., before ENOSPC is
> actually hit).
>
Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
transient error? Just have it hang until we get enough space when that
tunable is enabled?
> EIO is actually in a similar bucket although probably more on the "hard
> failure" side - I can imagine there can be types of storage and situations
> where the loss of connectivity to the storage is only transient. But for
> a start I would not bother with this.
>
> Honza
I don't see what we can reasonably do with -EIO other than return a hard
error. If we want to deal with loss of connectivity to storage as a
transient failure, I think that we'd need to ensure that the lower
layers return more distinct error codes in those cases (ENODEV or ENXIO
maybe? Or declare a new kernel-internal code -- EDEVGONE?).
In any case, I think that the basic idea of marking certain
writepage/writepages/launder_page errors as transient might be a
reasonable approach to handling this sanely.
The problem with all of this though is that we have a pile of existing
code that will likely need to be reworked for the new error handling. I
expect that we'll have to walk all of the
writepage/writepages/launder_page implementations and fix them up one by
one once we sort out the rules for this.
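
For each writepage implementation the fixed-up pattern might end up
looking roughly like this (sketch only -- example_writepage and
example_do_write are hypothetical, and wb_err_is_transient is the sort
of classification helper sketched earlier in the thread):

static int example_writepage(struct page *page,
                             struct writeback_control *wbc)
{
        int err = example_do_write(page);

        if (err) {
                if (wb_err_is_transient(err)) {
                        /* keep the data around; writeback retries later */
                        redirty_page_for_writepage(wbc, page);
                } else {
                        /* hard failure: report it at the next fsync */
                        mapping_set_error(page->mapping, err);
                }
        }
        unlock_page(page);
        return err;
}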
--
Jeff Layton <[email protected]>
On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> > On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > > without getting the error anywhere it is useful. We get away with
> > > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > > But this is more for the discussion at LSF.
> > > > >
> > > > > I'm actually running into this in the last day or two because some MM
> > > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > > unable to kill processes because they are holding on to locks that
> > > > > ext4 is holding.
> > > > >
> > > > > I've done some initial investigation, and while it's not that hard to
> > > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > > is where we had been running into problems), a really, REALLY big
> > > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > > > silent data loss, because the pages are marked clean, and so data
> > > > > written using buffered writeback goes *poof*.
> > > > >
> > > > > I confirmed this by creating a test kernel with a simple patch such
> > > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > > ENOMEM treatment). The result was **NOT** pretty.
> > > > >
> > > > > What I think we should strongly consider is at the very least, special
> > > > > case ENOMEM being returned by writepages() during background
> > > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > > stays on the dirty inode list, so we can retry the write later. This
> > > > > is especially important since the process that issued the write may
> > > > > have gone away, so there might not even be a userspace process to
> > > > > complain to. By converting certain page allocations (most notably in
> > > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KERNEL, this allows us to
> > > > > release the i_data_sem lock and return an error. This should allow
> > > > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > > > the writepages() for that inode later.
> > > >
> > > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > > retrying the write is definitely a better option. For a start we can say that
> > > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > > there's no hope of getting data to disk and so we just discard them. It
> > > > will be a somewhat rough distinction but probably better than what we have
> > > > now.
> > > >
> > > > Honza
> > >
> > > I'm not sure about ENOSPC there. That's a return code that is
> > > specifically expected to be returned by fsync. It seems like that ought
> > > not to be considered a transient error?
> >
> > Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
> > Long term, we may need to make at least some ENOSPC situations behave as
> > transient so that thin-provisioned storage does not lose data in case the
> > admin does not supply additional space fast enough (i.e., before ENOSPC is
> > actually hit).
> >
>
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?
>
Just FYI, XFS has a similar error configuration mechanism that we use
for essentially this purpose when dealing with metadata buffer I/O
errors. The original motivation was to help us deal with the varying
requirements for thin provisioning. I.e., whether a particular error is
permanent or transient depends on the admin's preference and affects
whether the filesystem continuously retries failed I/Os in anticipation
of future success or gives up quickly and shuts down (or something in
between).
See roughly commits 192852be8b5 through e6b3bb7896 for the initial code,
/sys/fs/xfs/<dev>/error on an XFS filesystem for the user interface, and
the "Error handling" section of Documentation/filesystems/xfs.txt for
information on how the interface works.
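
For example, the per-error knobs let you express the transient vs.
permanent policy directly (paths from memory, so double-check against
the docs):

  # keep retrying metadata EIO forever, i.e. treat it as transient:
  echo -1 > /sys/fs/xfs/<dev>/error/metadata/EIO/max_retries

  # or fail immediately, i.e. treat it as permanent:
  echo 0 > /sys/fs/xfs/<dev>/error/metadata/EIO/max_retries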
Brian
> > EIO is actually in a similar bucket although probably more on the "hard
> > failure" side - I can imagine there can be types of storage and situations
> > where the loss of connectivity to the storage is only transient. But for
> > a start I would not bother with this.
> >
> > Honza
>
> I don't see what we can reasonably do with -EIO other than return a hard
> error. If we want to deal with loss of connectivity to storage as a
> transient failure, I think that we'd need to ensure that the lower
> layers return more distinct error codes in those cases (ENODEV or ENXIO
> maybe? Or declare a new kernel-internal code -- EDEVGONE?).
>
> In any case, I think that the basic idea of marking certain
> writepage/writepages/launder_page errors as transient might be a
> reasonable approach to handling this sanely.
>
> The problem with all of this though is that we have a pile of existing
> code that will likely need to be reworked for the new error handling. I
> expect that we'll have to walk all of the
> writepage/writepages/launder_page implementations and fix them up one by
> one once we sort out the rules for this.
>
> --
> Jeff Layton <[email protected]>
On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
>
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?
Or maybe we need a new kernel-internal errno (à la ERESTARTSYS) which
means it's a "soft ENOSPC"? It would get translated to ENOSPC if it
gets propagated to userspace, but that way for devices like dm-thin or
other storage arrays with thin volumes, it could send back a soft
ENOSPC, while for file systems where "ENOSPC means ENOSPC", we can
treat those as a hard ENOSPC.
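
Something like this, say (sketch only; the name and the numeric value
are made up and would need to be checked against include/linux/errno.h):

/* kernel-internal only, never visible to userspace (a la ERESTARTSYS) */
#define ENOSPC_SOFT	530	/* hypothetical value -- must not clash */

static inline int wb_err_to_user(int err)
{
	/* translate the soft variant back to plain ENOSPC at the boundary */
	return err == -ENOSPC_SOFT ? -ENOSPC : err;
}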
- Ted