2021-02-05 23:14:15

by Matthew Wilcox

Subject: [RFC] Better page cache error handling

Scenario:

You have a disk with a bad sector. This disk will take 30 seconds of
trying progressively harder to recover the sector before timing the
I/O out and returning BLK_STS_MEDIUM to the filesystem. Unfortunately
for you, this bad sector happens to have landed in index.html on your
webserver which gets one hit per second. You have configured 500
threads on your webserver.

Today:

We allocate a page and try to read it. 29 threads pile up waiting
for the page lock in filemap_update_page() (or whatever variant of
that you're looking at; I'm talking about linux-next). The original
requester gets the error and returns -EIO to userspace. One of the
lucky 29 waiting threads sends another read. 30 more threads pile up
while it's processing. Eventually, all 500 threads of your webserver
are sleeping waiting for their turn to get an EIO.

With the below patch:

We allocate a page and try to read it. 29 threads pile up waiting
for the page lock in filemap_update_page(). The error returned by the
original I/O is shared between all 29 waiters as well as being returned
to the requesting thread. The next request for index.html will send
another I/O, and more waiters will pile up trying to get the page lock,
but at no time will more than 30 threads be waiting for the I/O to fail.

----

Individual filesystems will have to be modified to call unlock_page_err()
to take advantage of this. Unconverted filesystems will continue to
behave as in the first scenario.

I've only tested it lightly, but it doesn't seem to break anything.
It needs some targeted testing with error injection, which I haven't
done yet. It probably also needs some refinement so that it doesn't
report errors from readahead. I also need to audit the other callers of
put_and_wait_on_page_locked(). This patch interferes with the page
folio work, so I'm not planning on pushing it for a couple of releases.

----

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 16a1e82e3aeb..faeb6c4af7fd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -183,7 +183,7 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
}

if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
- unlock_page(page);
+ unlock_page_err(page, error);
}

static void
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fda84e88b2ba..e750881bedfe 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -564,11 +564,13 @@ struct wait_page_key {
struct page *page;
int bit_nr;
int page_match;
+ int err;
};

struct wait_page_queue {
struct page *page;
int bit_nr;
+ int err;
wait_queue_entry_t wait;
};

@@ -591,6 +593,7 @@ extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
extern void unlock_page(struct page *page);
+extern void unlock_page_err(struct page *page, int err);
extern void unlock_page_fscache(struct page *page);

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 97ff7294516e..515e0136f00f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1077,6 +1077,7 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
* afterwards to avoid any races. This store-release pairs
* with the load-acquire in wait_on_page_bit_common().
*/
+ wait_page->err = key->err;
smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
wake_up_state(wait->private, mode);

@@ -1094,7 +1095,7 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}

-static void wake_up_page_bit(struct page *page, int bit_nr)
+static void wake_up_page_bit(struct page *page, int bit_nr, int err)
{
wait_queue_head_t *q = page_waitqueue(page);
struct wait_page_key key;
@@ -1103,6 +1104,7 @@ static void wake_up_page_bit(struct page *page, int bit_nr)

key.page = page;
key.bit_nr = bit_nr;
+ key.err = err;
key.page_match = 0;

bookmark.flags = 0;
@@ -1152,7 +1154,7 @@ static void wake_up_page(struct page *page, int bit)
{
if (!PageWaiters(page))
return;
- wake_up_page_bit(page, bit);
+ wake_up_page_bit(page, bit, 0);
}

/*
@@ -1214,6 +1216,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
wait->func = wake_page_function;
wait_page.page = page;
wait_page.bit_nr = bit_nr;
+ wait_page.err = 0;

repeat:
wait->flags = 0;
@@ -1325,8 +1328,10 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
*/
if (behavior == EXCLUSIVE)
return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;
+ if (behavior != DROP)
+ wait_page.err = 0;

- return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
+ return wait->flags & WQ_FLAG_WOKEN ? wait_page.err : -EINTR;
}

void wait_on_page_bit(struct page *page, int bit_nr)
@@ -1408,8 +1413,9 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
#endif

/**
- * unlock_page - unlock a locked page
+ * unlock_page_err - unlock a locked page
* @page: the page
+ * @err: errno to be communicated to waiters
*
* Unlocks the page and wakes up sleepers in wait_on_page_locked().
* Also wakes sleepers in wait_on_page_writeback() because the wakeup
@@ -1422,13 +1428,19 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
* portably (architectures that do LL/SC can test any bit, while x86 can
* test the sign bit).
*/
-void unlock_page(struct page *page)
+void unlock_page_err(struct page *page, int err)
{
BUILD_BUG_ON(PG_waiters != 7);
page = compound_head(page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
- wake_up_page_bit(page, PG_locked);
+ wake_up_page_bit(page, PG_locked, err);
+}
+EXPORT_SYMBOL(unlock_page_err);
+
+void unlock_page(struct page *page)
+{
+ unlock_page_err(page, 0);
}
EXPORT_SYMBOL(unlock_page);

@@ -1446,7 +1458,7 @@ void unlock_page_fscache(struct page *page)
page = compound_head(page);
VM_BUG_ON_PAGE(!PagePrivate2(page), page);
clear_bit_unlock(PG_fscache, &page->flags);
- wake_up_page_bit(page, PG_fscache);
+ wake_up_page_bit(page, PG_fscache, 0);
}
EXPORT_SYMBOL(unlock_page_fscache);

@@ -2298,8 +2310,11 @@ static int filemap_update_page(struct kiocb *iocb,
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
return -EAGAIN;
if (!(iocb->ki_flags & IOCB_WAITQ)) {
- put_and_wait_on_page_locked(page, TASK_KILLABLE);
- return AOP_TRUNCATED_PAGE;
+ error = put_and_wait_on_page_locked(page,
+ TASK_KILLABLE);
+ if (!error)
+ return AOP_TRUNCATED_PAGE;
+ return error;
}
error = __lock_page_async(page, iocb->ki_waitq);
if (error)


2021-02-24 13:17:59

by Jan Kara

Subject: Re: [RFC] Better page cache error handling

Hi Matthew!

On Fri 05-02-21 16:11:42, Matthew Wilcox wrote:
> Scenario:
>
> You have a disk with a bad sector. This disk will take 30 seconds of
> trying progressively harder to recover the sector before timing the
> I/O out and returning BLK_STS_MEDIUM to the filesystem. Unfortunately
> for you, this bad sector happens to have landed in index.html on your
> webserver which gets one hit per second. You have configured 500
> threads on your webserver.
>
> Today:
>
> We allocate a page and try to read it. 29 threads pile up waiting
> for the page lock in filemap_update_page() (or whatever variant of
> that you're looking at; I'm talking about linux-next). The original
> requester gets the error and returns -EIO to userspace. One of the
> lucky 29 waiting threads sends another read. 30 more threads pile up
> while it's processing. Eventually, all 500 threads of your webserver
> are sleeping waiting for their turn to get an EIO.
>
> With the below patch:
>
> We allocate a page and try to read it. 29 threads pile up waiting
> for the page lock in filemap_update_page(). The error returned by the
> original I/O is shared between all 29 waiters as well as being returned
> to the requesting thread. The next request for index.html will send
> another I/O, and more waiters will pile up trying to get the page lock,
> but at no time will more than 30 threads be waiting for the I/O to fail.

Interesting idea. It certainly improves the current behavior. I just wonder
whether this isn't a partial solution to the problem, and whether a full
solution would have to go in a different direction? I mean, it just seems
wrong that each reader (let's assume they just won't overlap) has to retry
the failed IO and wait for the HW to figure out it's not going to work.
Shouldn't we cache the error state with the page? And I understand that we
then also have to deal with the problem of how to invalidate the error state
when the block might eventually become readable again (for stuff like
temporary IO failures). That would need some signalling from the driver to
the page cache, maybe in the form of an error recovery sequence counter or
something like that. For stuff like iSCSI, multipath, or NBD it could be
doable, I believe...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-02-24 17:49:43

by Jan Kara

Subject: Re: [RFC] Better page cache error handling

On Wed 24-02-21 13:41:15, Matthew Wilcox wrote:
> On Wed, Feb 24, 2021 at 01:38:48PM +0100, Jan Kara wrote:
> > > We allocate a page and try to read it. 29 threads pile up waiting
> > > for the page lock in filemap_update_page(). The error returned by the
> > > original I/O is shared between all 29 waiters as well as being returned
> > > to the requesting thread. The next request for index.html will send
> > > another I/O, and more waiters will pile up trying to get the page lock,
> > > but at no time will more than 30 threads be waiting for the I/O to fail.
> >
> > Interesting idea. It certainly improves current behavior. I just wonder
> > whether this isn't a partial solution to a problem and a full solution of
> > it would have to go in a different direction? I mean it just seems
> > wrong that each reader (let's assume they just won't overlap) has to retry
> > the failed IO and wait for the HW to figure out it's not going to work.
> > Shouldn't we cache the error state with the page? And I understand that we
> > then also have to deal with the problem how to invalidate the error state
> > when the block might eventually become readable (for stuff like temporary
> > IO failures). That would need some signalling from the driver to the page
> > cache, maybe in a form of some error recovery sequence counter or something
> > like that. For stuff like iSCSI, multipath, or NBD it could be doable I
> > believe...
>
> That felt like a larger change than I wanted to make. I already have
> a few big projects on my plate!

I can understand that ;)

> Also, it's not clear to me that the host can necessarily figure out when
> a device has fixed an error -- certainly for the three cases you list
> it can be done. I think we'd want a timer to indicate that it's worth
> retrying instead of returning the error.
>
> Anyway, that seems like a lot of data to cram into a struct page. So I
> think my proposal is still worth pursuing while waiting for someone to
> come up with a perfect solution.

Yes, a timer could be a fallback. Or we could just schedule work to discard
all 'error' pages in the fs in an hour or so. Not perfect but more or less
workable, I'd say. Also, I don't think we need to cram this directly into
struct page - I think it is perfectly fine to kmalloc() the structure we need
for caching when we hit an error, and just not cache if the allocation fails.
Then we might just reference it from the appropriate place... I didn't put
too much thought into this...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-02-25 00:56:13

by Matthew Wilcox

Subject: Re: [RFC] Better page cache error handling

On Wed, Feb 24, 2021 at 01:38:48PM +0100, Jan Kara wrote:
> > We allocate a page and try to read it. 29 threads pile up waiting
> > for the page lock in filemap_update_page(). The error returned by the
> > original I/O is shared between all 29 waiters as well as being returned
> > to the requesting thread. The next request for index.html will send
> > another I/O, and more waiters will pile up trying to get the page lock,
> > but at no time will more than 30 threads be waiting for the I/O to fail.
>
> Interesting idea. It certainly improves current behavior. I just wonder
> whether this isn't a partial solution to a problem and a full solution of
> it would have to go in a different direction? I mean it just seems
> wrong that each reader (let's assume they just won't overlap) has to retry
> the failed IO and wait for the HW to figure out it's not going to work.
> Shouldn't we cache the error state with the page? And I understand that we
> then also have to deal with the problem how to invalidate the error state
> when the block might eventually become readable (for stuff like temporary
> IO failures). That would need some signalling from the driver to the page
> cache, maybe in a form of some error recovery sequence counter or something
> like that. For stuff like iSCSI, multipath, or NBD it could be doable I
> believe...

That felt like a larger change than I wanted to make. I already have
a few big projects on my plate!

Also, it's not clear to me that the host can necessarily figure out when
a device has fixed an error -- certainly for the three cases you list
it can be done. I think we'd want a timer to indicate that it's worth
retrying instead of returning the error.

Anyway, that seems like a lot of data to cram into a struct page. So I
think my proposal is still worth pursuing while waiting for someone to
come up with a perfect solution.

2021-02-25 04:40:23

by Andreas Dilger

Subject: Re: [RFC] Better page cache error handling

On Feb 24, 2021, at 6:41 AM, Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Feb 24, 2021 at 01:38:48PM +0100, Jan Kara wrote:
>>> We allocate a page and try to read it. 29 threads pile up waiting
>>> for the page lock in filemap_update_page(). The error returned by the
>>> original I/O is shared between all 29 waiters as well as being returned
>>> to the requesting thread. The next request for index.html will send
>>> another I/O, and more waiters will pile up trying to get the page lock,
>>> but at no time will more than 30 threads be waiting for the I/O to fail.
>>
>> Interesting idea. It certainly improves current behavior. I just wonder
>> whether this isn't a partial solution to a problem and a full solution of
>> it would have to go in a different direction? I mean it just seems
>> wrong that each reader (let's assume they just won't overlap) has to retry
>> the failed IO and wait for the HW to figure out it's not going to work.
>> Shouldn't we cache the error state with the page? And I understand that we
>> then also have to deal with the problem how to invalidate the error state
>> when the block might eventually become readable (for stuff like temporary
>> IO failures). That would need some signalling from the driver to the page
>> cache, maybe in a form of some error recovery sequence counter or something
>> like that. For stuff like iSCSI, multipath, or NBD it could be doable I
>> believe...
>
> That felt like a larger change than I wanted to make. I already have
> a few big projects on my plate!
>
> Also, it's not clear to me that the host can necessarily figure out when
> a device has fixed an error -- certainly for the three cases you list
> it can be done. I think we'd want a timer to indicate that it's worth
> retrying instead of returning the error.
>
> Anyway, that seems like a lot of data to cram into a struct page. So I
> think my proposal is still worth pursuing while waiting for someone to
> come up with a perfect solution.

Since you would know that the page is bad at this point (not uptodate,
does not contain valid data) you could potentially re-use some other
fields in struct page, or potentially store something in the page itself?
That would avoid bloating struct page with fields that are only rarely
needed. Userspace shouldn't be able to read the page at that point if
it is not marked uptodate, but it could overwrite it, so you wouldn't
want to store any kind of complex data structure there, but you _could_
store a magic, an error value, and a timeout, that are only valid if
!uptodate (and cleared if the page were totally overwritten by userspace).

Yes, it's nasty, but better than growing struct page, and better than
blocking userspace threads for tens of minutes when a block is bad.

Cheers, Andreas

2021-02-25 04:45:58

by Matthew Wilcox

Subject: Re: [RFC] Better page cache error handling

On Wed, Feb 24, 2021 at 04:41:26PM -0700, Andreas Dilger wrote:
> Since you would know that the page is bad at this point (not uptodate,
> does not contain valid data) you could potentially re-use some other

Oh, we don't know that. We know _a_ read has failed. There could be
up to 128 blocks that comprise this (64kB) page, and we don't want
reads of the other blocks in the page to fail unnecessarily.

> fields in struct page, or potentially store something in the page itself?
> That would avoid bloating struct page with fields that are only rarely
> needed. Userspace shouldn't be able to read the page at that point if
> it is not marked uptodate, but they could overwrite it, so you wouldn't
> want to store any kind of complex data structure there, but you _could_
> store a magic, an error value, and a timeout, that are only valid if
> !uptodate (cleared if the page were totally overwritten by userspace).
>
> Yes, it's nasty, but better than growing struct page, and better than
> blocking userspace threads for tens of minutes when a block is bad.

The current state blocks threads for tens of minutes. I'm proposing
reducing that to 30 seconds. I'd want to see a more concrete
proposal than this ...

(also, a per-page data structure might blow up nastily if the entire
drive is inaccessible, rather than just a single bad block)