2010-11-09 16:45:07

by Rik van Riel

[permalink] [raw]
Subject: [PATCH] clear PageError bit in msync & fsync

Temporary IO failures, eg. due to loss of both multipath paths, can
permanently leave the PageError bit set on a page, resulting in
msync or fsync returning -EIO over and over again, even if IO is
now getting to the disk correctly.

We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in
the filemap_fdatawait_range function. Also clearing the PageError
bit on the page allows subsequent msync or fsync calls on this file
to return without an error, if the subsequent IO succeeds.

Unfortunately data written out in the msync or fsync call that
returned -EIO can still get lost, because the page dirty bit appears
to not get restored on IO error. However, the alternative could be
potentially all of memory filling up with uncleanable dirty pages,
hanging the system, so there is no nice choice here...

Signed-off-by: Rik van Riel <[email protected]>

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5f38c46..4805fde 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,7 +198,7 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
struct page; /* forward declaration */

TESTPAGEFLAG(Locked, locked) TESTSETFLAG(Locked, locked)
-PAGEFLAG(Error, error)
+PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
diff --git a/mm/filemap.c b/mm/filemap.c
index 61ba5e4..ba27b83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -296,7 +296,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
continue;

wait_on_page_writeback(page);
- if (PageError(page))
+ if (TestClearPageError(page))
ret = -EIO;
}
pagevec_release(&pvec);


2010-11-09 18:09:42

by Valerie Aurora

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Tue, Nov 09, 2010 at 11:44:22AM -0500, Rik van Riel wrote:
> Temporary IO failures, eg. due to loss of both multipath paths, can
> permanently leave the PageError bit set on a page, resulting in
> msync or fsync returning -EIO over and over again, even if IO is
> now getting to the disk correctly.
>
> We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in
> the filemap_fdatawait_range function. Also clearing the PageError
> bit on the page allows subsequent msync or fsync calls on this file
> to return without an error, if the subsequent IO succeeds.
>
> Unfortunately data written out in the msync or fsync call that
> returned -EIO can still get lost, because the page dirty bit appears
> to not get restored on IO error. However, the alternative could be
> potentially all of memory filling up with uncleanable dirty pages,
> hanging the system, so there is no nice choice here...
>
> Signed-off-by: Rik van Riel <[email protected]>
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5f38c46..4805fde 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -198,7 +198,7 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
> struct page; /* forward declaration */
>
> TESTPAGEFLAG(Locked, locked) TESTSETFLAG(Locked, locked)
> -PAGEFLAG(Error, error)
> +PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
> PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
> PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
> PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 61ba5e4..ba27b83 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -296,7 +296,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
> continue;
>
> wait_on_page_writeback(page);
> - if (PageError(page))
> + if (TestClearPageError(page))
> ret = -EIO;
> }
> pagevec_release(&pvec);
>

I don't think losing the dirty bit is a problem. If the msync/fsync
fails with EIO, it's userspace's job to reissue the write, not the
kernel's.

Returning EIO only once per actual error looks correct to me.

Acked-by: Valerie Aurora <[email protected]>

-VAL

2010-11-09 19:20:13

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Tue, 9 Nov 2010 11:44:22 -0500
Rik van Riel <[email protected]> wrote:

> Temporary IO failures, eg. due to loss of both multipath paths, can
> permanently leave the PageError bit set on a page, resulting in
> msync or fsync returning -EIO over and over again, even if IO is
> now getting to the disk correctly.
>
> We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in
> the filemap_fdatawait_range function. Also clearing the PageError
> bit on the page allows subsequent msync or fsync calls on this file
> to return without an error, if the subsequent IO succeeds.
>
> Unfortunately data written out in the msync or fsync call that
> returned -EIO can still get lost, because the page dirty bit appears
> to not get restored on IO error. However, the alternative could be
> potentially all of memory filling up with uncleanable dirty pages,
> hanging the system, so there is no nice choice here...
>
> Signed-off-by: Rik van Riel <[email protected]>
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5f38c46..4805fde 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -198,7 +198,7 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
> struct page; /* forward declaration */
>
> TESTPAGEFLAG(Locked, locked) TESTSETFLAG(Locked, locked)
> -PAGEFLAG(Error, error)
> +PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
> PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
> PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
> PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 61ba5e4..ba27b83 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -296,7 +296,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
> continue;
>
> wait_on_page_writeback(page);
> - if (PageError(page))
> + if (TestClearPageError(page))
> ret = -EIO;
> }
> pagevec_release(&pvec);
>

This does leave the page in sort of a funky state. The uptodate bit
will still probably be set, but the dirty bit won't be. The page will
be effectively "disconnected" from the backing store until someone
writes to it.

I suppose though that this is the best that can reasonably be done in
this situation however...

Acked-by: Jeff Layton <[email protected]>

2010-11-09 19:33:44

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/09/2010 02:21 PM, Jeff Layton wrote:

> This does leave the page in sort of a funky state. The uptodate bit
> will still probably be set, but the dirty bit won't be. The page will
> be effectively "disconnected" from the backing store until someone
> writes to it.
>
> I suppose though that this is the best that can reasonably be done in
> this situation however...

I spent a few days looking for alternatives, and indeed I found
nothing better...

There are essentially two possibilities:
1) the VM can potentially be filled up with uncleanable dirty pages, or
2) pages that hit an IO error are left in a clean state, so they can
be reclaimed under memory pressure

Alternative 1 could cause the entire system to deadlock, while
option 2 puts the onus on userland apps to rewrite the data
from a failed msync/fsync.

Currently the VM has behaviour #2 which is preserved with my
patch.

The only difference with my patch is, we won't keep returning
-EIO on subsequent, error free, msync or fsync calls to files
that had an IO error at some previous point in the past.

--
All rights reversed

2010-11-09 21:07:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Tue, Nov 09, 2010 at 02:33:29PM -0500, Rik van Riel wrote:
>
> There are essentially two possibilities:
> 1) the VM can potentially be filled up with uncleanable dirty pages, or
> 2) pages that hit an IO error are left in a clean state, so they can
> be reclaimed under memory pressure
>
> Alternative 1 could cause the entire system to deadlock, while
> option 2 puts the onus on userland apps to rewrite the data
> from a failed msync/fsync.
>
> Currently the VM has behaviour #2 which is preserved with my
> patch.
>
> The only difference with my patch is, we won't keep returning
> -EIO on subsequent, error free, msync or fsync calls to files
> that had an IO error at some previous point in the past.

Do we guarantee that the application will get EIO at least once? I
thought there were issues where the error bit could get lost if the
page writeback was triggered by sync() run by a third-party
application.


- Ted

2010-11-09 21:16:06

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/09/2010 04:07 PM, Ted Ts'o wrote:
> On Tue, Nov 09, 2010 at 02:33:29PM -0500, Rik van Riel wrote:
>>
>> There are essentially two possibilities:
>> 1) the VM can potentially be filled up with uncleanable dirty pages, or
>> 2) pages that hit an IO error are left in a clean state, so they can
>> be reclaimed under memory pressure
>>
>> Alternative 1 could cause the entire system to deadlock, while
>> option 2 puts the onus on userland apps to rewrite the data
>> from a failed msync/fsync.
>>
>> Currently the VM has behaviour #2 which is preserved with my
>> patch.
>>
>> The only difference with my patch is, we won't keep returning
>> -EIO on subsequent, error free, msync or fsync calls to files
>> that had an IO error at some previous point in the past.
>
> Do we guarantee that the application will get EIO at least once? I
> thought there were issues where the error bit could get lost if the
> page writeback was triggered by sync() run by a third-party
> application.

There is no such guarantee in the current kernel, either
with or without my patch.

A third application calling fsync or msync can get the
EIO cleared, so the application that did the write does
not see it.

The VM could also reclaim the PageError page due to
memory pressure, so the application calling fsync or
msync does not see it.

I see no good way in which we could guarantee that
every process calling msync or fsync on a file that
had an IO error in the past gets EIO once - at least,
not without every one of them always getting EIO on
the file even after the IO path is good again...

--
All rights reversed

2010-11-09 21:24:49

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/09/2010 04:21 PM, Zan Lynx wrote:
> On 11/9/10 12:33 PM, Rik van Riel wrote:
>> On 11/09/2010 02:21 PM, Jeff Layton wrote:
>>
>>> This does leave the page in sort of a funky state. The uptodate bit
>>> will still probably be set, but the dirty bit won't be. The page will
>>> be effectively "disconnected" from the backing store until someone
>>> writes to it.
>>>
>>> I suppose though that this is the best that can reasonably be done in
>>> this situation however...
>>
>> I spent a few days looking for alternatives, and indeed I found
>> nothing better...
>
> Just an off the top of my head crazy idea...
>
> Could you leave the error bit set on the page and treat it as a dirty
> bit during a future msync, clearing the error bit at that point.
>
> The general idea would be to leave the error set unless an explicit
> write was requested.

The problem with that is that the page will be unreclaimable,
and the VM could get filled with PageError pages and be unable
to make further progress (if the IO path does not come back).

--
All rights reversed

2010-11-09 21:39:43

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Tue 09-11-10 11:44:22, Rik van Riel wrote:
> Temporary IO failures, eg. due to loss of both multipath paths, can
> permanently leave the PageError bit set on a page, resulting in
> msync or fsync returning -EIO over and over again, even if IO is
> now getting to the disk correctly.
>
> We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in
> the filemap_fdatawait_range function. Also clearing the PageError
> bit on the page allows subsequent msync or fsync calls on this file
> to return without an error, if the subsequent IO succeeds.
>
> Unfortunately data written out in the msync or fsync call that
> returned -EIO can still get lost, because the page dirty bit appears
> to not get restored on IO error. However, the alternative could be
> potentially all of memory filling up with uncleanable dirty pages,
> hanging the system, so there is no nice choice here...
>
> Signed-off-by: Rik van Riel <[email protected]>
The patch looks OK to me so feel free to add
Acked-by: Jan Kara <[email protected]>

But I'd like to talk about what you write in the last paragraph because
that is bugging me for some time as well and I believe we *should* find a
reasonable answer to that.

As you write above, it's not a big deal to keep page with IO error dirty
but we have to come up with a reasonable way for kernel to eventually get
rid of dirty pages it is unable to write. I think it is possible to handle
without any in-kernel magic. The most common case is when the device just
does not respond for some time - all processes writing to the device would
soon block in balance_dirty_pages() so they don't bring the kernel OOM.
Then we just need to provide for userspace a way to discard all pages
backed by given device including dirty pages - that way userspace could
tell, when there's no way the device is going back and processes would get
unblocked and start getting EIO. Only userspace would have to be careful
not to get blocked in balance_dirty_pages() while trying to prune the
cache.

I think this would be good enough, what do you think?

Honza

> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5f38c46..4805fde 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -198,7 +198,7 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
> struct page; /* forward declaration */
>
> TESTPAGEFLAG(Locked, locked) TESTSETFLAG(Locked, locked)
> -PAGEFLAG(Error, error)
> +PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
> PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
> PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
> PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 61ba5e4..ba27b83 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -296,7 +296,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
> continue;
>
> wait_on_page_writeback(page);
> - if (PageError(page))
> + if (TestClearPageError(page))
> ret = -EIO;
> }
> pagevec_release(&pvec);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-11-09 21:42:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Tue, 09 Nov 2010 16:15:52 -0500
Rik van Riel <[email protected]> wrote:

> On 11/09/2010 04:07 PM, Ted Ts'o wrote:
> > On Tue, Nov 09, 2010 at 02:33:29PM -0500, Rik van Riel wrote:
> >>
> >> There are essentially two possibilities:
> >> 1) the VM can potentially be filled up with uncleanable dirty pages, or
> >> 2) pages that hit an IO error are left in a clean state, so they can
> >> be reclaimed under memory pressure
> >>
> >> Alternative 1 could cause the entire system to deadlock, while
> >> option 2 puts the onus on userland apps to rewrite the data
> >> from a failed msync/fsync.
> >>
> >> Currently the VM has behaviour #2 which is preserved with my
> >> patch.
> >>
> >> The only difference with my patch is, we won't keep returning
> >> -EIO on subsequent, error free, msync or fsync calls to files
> >> that had an IO error at some previous point in the past.
> >
> > Do we guarantee that the application will get EIO at least once? I
> > thought there were issues where the error bit could get lost if the
> > page writeback was triggered by sync() run by a third-party
> > application.
>
> There is no such guarantee in the current kernel, either
> with or without my patch.
>
> A third application calling fsync or msync can get the
> EIO cleared, so the application that did the write does
> not see it.

yup. It's a userspace bug, really. Although that bug might be
expressed as "userspace didn't know about linux-specific EIO
behaviour".

> The VM could also reclaim the PageError page due to
> memory pressure, so the application calling fsync or
> msync does not see it.

That would be a kernel bug, methinks. The page's end_io handler should
set the address_space's AS_EIO flag (see mpage_end_io_write()), to be
later returned to (and cleared by) the fsync/msync caller.

It wouldn't surprise me if lots of end_io handlers got that wrong.

> I see no good way in which we could guarantee that
> every process calling msync or fsync on a file that
> had an IO error in the past gets EIO once - at least,
> not without every one of them always getting EIO on
> the file even after the IO path is good again...

Yes, there's no obviously good design here.

And a lot of these problems also apply to ENOSPC, and an ENOSPC
condition most certainly does magically fix itself up in real time...

2010-11-09 21:42:48

by Zan Lynx

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/9/10 12:33 PM, Rik van Riel wrote:
> On 11/09/2010 02:21 PM, Jeff Layton wrote:
>
>> This does leave the page in sort of a funky state. The uptodate bit
>> will still probably be set, but the dirty bit won't be. The page will
>> be effectively "disconnected" from the backing store until someone
>> writes to it.
>>
>> I suppose though that this is the best that can reasonably be done in
>> this situation however...
>
> I spent a few days looking for alternatives, and indeed I found
> nothing better...

Just an off the top of my head crazy idea...

Could you leave the error bit set on the page and treat it as a dirty
bit during a future msync, clearing the error bit at that point.

The general idea would be to leave the error set unless an explicit
write was requested.

--
Zan Lynx
[email protected]

"Knowledge is Power. Power Corrupts. Study Hard. Be Evil."

2010-11-09 21:44:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Tue 09-11-10 16:15:52, Rik van Riel wrote:
> On 11/09/2010 04:07 PM, Ted Ts'o wrote:
> >On Tue, Nov 09, 2010 at 02:33:29PM -0500, Rik van Riel wrote:
> >>
> >>There are essentially two possibilities:
> >>1) the VM can potentially be filled up with uncleanable dirty pages, or
> >>2) pages that hit an IO error are left in a clean state, so they can
> >> be reclaimed under memory pressure
> >>
> >>Alternative 1 could cause the entire system to deadlock, while
> >>option 2 puts the onus on userland apps to rewrite the data
> >>from a failed msync/fsync.
> >>
> >>Currently the VM has behaviour #2 which is preserved with my
> >>patch.
> >>
> >>The only difference with my patch is, we won't keep returning
> >>-EIO on subsequent, error free, msync or fsync calls to files
> >>that had an IO error at some previous point in the past.
> >
> >Do we guarantee that the application will get EIO at least once? I
> >thought there were issues where the error bit could get lost if the
> >page writeback was triggered by sync() run by a third-party
> >application.
>
> There is no such guarantee in the current kernel, either
> with or without my patch.
>
> A third application calling fsync or msync can get the
> EIO cleared, so the application that did the write does
> not see it.
>
> The VM could also reclaim the PageError page due to
> memory pressure, so the application calling fsync or
> msync does not see it.
What should be done in this case is that we set AS_EIO in the
page->mapping->flags like we currently do in fs/buffer.c.

> I see no good way in which we could guarantee that
> every process calling msync or fsync on a file that
> had an IO error in the past gets EIO once - at least,
> not without every one of them always getting EIO on
> the file even after the IO path is good again...
Yes, this is rather hard to do...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-11-11 16:32:09

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/09/2010 04:07 PM, Ted Ts'o wrote:

> Do we guarantee that the application will get EIO at least once? I
> thought there were issues where the error bit could get lost if the
> page writeback was triggered by sync() run by a third-party
> application.

Andrew, both the current kernel before and after my patch does
not guarantee the behaviour Ted has pointed out is required.

Johannes came up with a way to guarantee that - I'll send in
a patch for that today. That patch will obsolete the patch
I sent before (I don't think you've applied it yet).

--
All rights reversed

2010-11-12 04:36:54

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/09/2010 04:41 PM, Andrew Morton wrote:

> yup. It's a userspace bug, really. Although that bug might be
> expressed as "userspace didn't know about linux-specific EIO
> behaviour".

Looking at this some more, I am not convinced this is a userspace
bug.

First, let me describe the problem scenario:
1) process A calls write
2) process B calls write
3) process A calls fsync, runs into an IO error, returns -EIO
4) process B calls fsync, returns success
(even though data could have been lost!)

Common sense, as well as these snippets from the fsync man
page, suggest that this behaviour is incorrect:

DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device
...
RETURN VALUE
On success, these system calls return zero. On error, -1 is
returned, and errno is set appropriately.

Johannes had the idea of storing an IO error sequence idea
in the file descriptor and the address_space (or inode).
That way when an IO error happens, we can increment the
IO error sequence counter (mapping->io_error_seq).

When doing an fsync or msync, we can see whether the
inode's IO error sequence has been incremented since the
last time we looked. If it has, we set our own counter
to the same number and return -EIO to userspace.

The problem here is that the same file descriptor may be
shared between multiple processes. For example, on fork
the child process inherits file descriptors from the parent.
If the child reads 4kB, the file position indicator (f_pos)
will be changed for the parent (and siblings), too.

This is an odd quirk of Unix semantics, maybe inherited
from vfork (I suspect it would have been easier for everybody
if the child got a _copy_ of each file descriptor, instead of
sharing them with parent and siblings), but chances are there
is some program out there by now that relies on this behaviour.

Given that the file descriptors are shared, we will run into
the same race above, when process A and B are siblings or
parent & child.

That leaves the solution of putting the IO error sequence
number in its own little array pointed to by struct files_struct.
Not only will that complicate expand_files, it will also require
that we tell vfs_sync and similar functions either the file
descriptor number or a pointer to the io error sequence counter,
because the "struct file" does not tell us the file descriptor
number...

I'm not sure anyone will like the complexity of the above
solution (hi Al).

There is another "solution" to consider - redirtying pages
when write IO fails. This has the issue of potentially filling
up system memory with uncleanable dirty pages and deadlocking
the whole system.

There does not seem to be a nice solution to this problem, but
it does look like a problem we may want to fix.

--
All rights reversed

2010-11-12 15:53:05

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Thu, 11 Nov 2010 23:36:39 -0500
Rik van Riel <[email protected]> wrote:

> On 11/09/2010 04:41 PM, Andrew Morton wrote:
>
> > yup. It's a userspace bug, really. Although that bug might be
> > expressed as "userspace didn't know about linux-specific EIO
> > behaviour".
>
> Looking at this some more, I am not convinced this is a userspace
> bug.
>
> First, let me describe the problem scenario:
> 1) process A calls write
> 2) process B calls write
> 3) process A calls fsync, runs into an IO error, returns -EIO
> 4) process B calls fsync, returns success
> (even though data could have been lost!)
>
> Common sense, as well as these snippets from the fsync man
> page, suggest that this behaviour is incorrect:
>
> DESCRIPTION
> fsync() transfers ("flushes") all modified in-core data of (i.e.,
> modified buffer cache pages for) the file referred to by the file
> descriptor fd to the disk device
> ...
> RETURN VALUE
> On success, these system calls return zero. On error, -1 is
> returned, and errno is set appropriately.
>

I'll agree that that situation sucks for userspace but I'm not sure
that problem scenario is technically wrong. The error got reported to
userspace after all, just not to both processes that had done writes.

The root cause here is that we don't track the file descriptor that was
used to dirty specific pages. The reason is simple, IMO -- it would be
an unmanageable rabbit-hole.

Here's another related "problem" scenario (for purposes of argument):

Suppose between steps 2 and 3, the VM decides to flush out the pages
dirtied by process A, but not the ones from process B. That succeeds,
but just afterward the disk goes toes-up.

Now, process A issues an fsync. He gets an error but his data was
flushed to disk just fine. Is that also incorrect behavior?

--
Jeff Layton <[email protected]>

2010-11-12 17:05:00

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/12/2010 10:52 AM, Jeff Layton wrote:

> Now, process A issues an fsync. He gets an error but his data was
> flushed to disk just fine. Is that also incorrect behavior?

I suspect it is better for fsync to return an error when
it wasn't process A's error (but there was an error), than
to pretend everything was just fine when in fact an error
did happen.

When getting an error, the program can retry the write
(to redirty the pages) and retry the IO by calling fsync
again.

If no real error happened, at worst it gets to do the
IO twice.

--
All rights reversed

2010-11-12 20:52:30

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On 11/09/2010 03:24 PM, Rik van Riel wrote:
> On 11/09/2010 04:21 PM, Zan Lynx wrote:
>> On 11/9/10 12:33 PM, Rik van Riel wrote:
>>> On 11/09/2010 02:21 PM, Jeff Layton wrote:
>>>
>>>> This does leave the page in sort of a funky state. The uptodate bit
>>>> will still probably be set, but the dirty bit won't be. The page will
>>>> be effectively "disconnected" from the backing store until someone
>>>> writes to it.
>>>>
>>>> I suppose though that this is the best that can reasonably be done in
>>>> this situation however...
>>>
>>> I spent a few days looking for alternatives, and indeed I found
>>> nothing better...
>>
>> Just an off the top of my head crazy idea...
>>
>> Could you leave the error bit set on the page and treat it as a dirty
>> bit during a future msync, clearing the error bit at that point.
>>
>> The general idea would be to leave the error set unless an explicit
>> write was requested.
>
> The problem with that is that the page will be unreclaimable,
> and the VM could get filled with PageError pages and be unable
> to make further progress (if the IO path does not come back).

As a further crazy idea ;) what if it only persisted for "X" write
attempts? Maybe (sigh) a tunable?

That way several fsyncs get the chance to see it, but eventually
enough writebacks will go off to give up and clear it. Hacky,
but an idea ...

-Eric

2010-11-12 21:36:47

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] clear PageError bit in msync & fsync

On Fri, 12 Nov 2010 14:51:51 -0600
Eric Sandeen <[email protected]> wrote:

> On 11/09/2010 03:24 PM, Rik van Riel wrote:
> > On 11/09/2010 04:21 PM, Zan Lynx wrote:
> >> On 11/9/10 12:33 PM, Rik van Riel wrote:
> >>> On 11/09/2010 02:21 PM, Jeff Layton wrote:
> >>>
> >>>> This does leave the page in sort of a funky state. The uptodate bit
> >>>> will still probably be set, but the dirty bit won't be. The page will
> >>>> be effectively "disconnected" from the backing store until someone
> >>>> writes to it.
> >>>>
> >>>> I suppose though that this is the best that can reasonably be done in
> >>>> this situation however...
> >>>
> >>> I spent a few days looking for alternatives, and indeed I found
> >>> nothing better...
> >>
> >> Just an off the top of my head crazy idea...
> >>
> >> Could you leave the error bit set on the page and treat it as a dirty
> >> bit during a future msync, clearing the error bit at that point.
> >>
> >> The general idea would be to leave the error set unless an explicit
> >> write was requested.
> >
> > The problem with that is that the page will be unreclaimable,
> > and the VM could get filled with PageError pages and be unable
> > to make further progress (if the IO path does not come back).
>
> As a further crazy idea ;) what if it only persisted for "X" write
> attempts? Maybe (sigh) a tunable?
>
> That way several fsyncs get the chance to see it, but eventually
> enough writebacks will go off to give up and clear it. Hacky,
> but an idea ...

That is an interesting idea. Not losing your dirty data in the face of
a transient error would certainly be a nice-to-have. One has to
consider that applications using mmap might have a hard time reissuing
the writes. Keeping the dirty bit set might be less problematic in that
situation.

Blue-skying for a min...

1) you could instead or in addition allow some method for discarding
the dirty pages that are backed by this device manually. Some magical
file under /sys maybe? That way you have some way to get rid of the data
when you know that the device isn't coming back. Doing that manually
might be safer than relying on a certain number of retries (though it
does require someone to know what they're doing in order to clear the
problem).

2) Could you prevent new pages that are backed by this device from
being dirtied or mmapped until the problem is cleared? Not exactly sure
how to implement that, but it might keep someone from making things
worse when this sort of problem occurs.

--
Jeff Layton <[email protected]>