2020-06-17 16:33:03

by Boris Burkov

[permalink] [raw]
Subject: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. They should still be
protected by TREE_REF, but if the system is doing significant reclaim
while simultaneously accessing the extent_buffers heavily, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner that attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself), then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

CPU0                                      CPU1                                      CPU2
reada_for_search                          reada_for_search
  readahead_tree_block                      readahead_tree_block
    find_create_tree_block                    find_create_tree_block
      alloc_extent_buffer                       alloc_extent_buffer
                                                  find_extent_buffer // not found
                                                  allocates eb
                                                  lock pages
                                                  associate pages to eb
                                                  insert eb into radix tree
                                                  set TREE_REF, refs == 2
                                                  unlock pages
                                                read_extent_buffer_pages // WAIT_NONE
                                                  not uptodate (brand new eb)
                                                  lock_page
                                                  if !trylock_page
                                                    goto unlock_exit // not an error
                                                free_extent_buffer
                                                  release_extent_buffer
                                                    atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                    try_release_extent_buffer
                                                                                      take refs_lock
                                                                                      reads refs == 1; no io
          atomic_inc_not_zero refs to 2
        mark_extent_buffer_accessed
          check_buffer_tree_ref
            // not STALE, won't take refs_lock
            refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                      clear TREE_REF
                                                                                      release_extent_buffer
                                                                                        atomic_dec_and_test refs to 1
                                                                                    unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
release_extent_buffer
  dec refs to 0
  delete from radix tree
  btrfs_release_extent_buffer_pages
    BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_buffer_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. There is no race in write_one_eb that we know of, but for
consistency, apply it there too. All the codepaths that clear TREE_REF
check for io, so they would not be able to clear it after this point.

Signed-off-by: Boris Burkov <[email protected]>
---
fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
num_pages = num_extent_pages(eb);
atomic_set(&eb->io_pages, num_pages);
+ /*
+ * It is possible for releasepage to clear the TREE_REF bit before we
+ * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+ */
+ check_buffer_tree_ref(eb);

/* set btree blocks beyond nritems with 0 to avoid stale content. */
nritems = btrfs_header_nritems(eb);
@@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
static void check_buffer_tree_ref(struct extent_buffer *eb)
{
int refs;
- /* the ref bit is tricky. We have to make sure it is set
- * if we have the buffer dirty. Otherwise the
- * code to free a buffer can end up dropping a dirty
- * page
+ /*
+ * The TREE_REF bit is first set when the extent_buffer is added
+ * to the radix tree. It is also reset, if unset, when a new reference
+ * is created by find_extent_buffer.
*
- * Once the ref bit is set, it won't go away while the
- * buffer is dirty or in writeback, and it also won't
- * go away while we have the reference count on the
- * eb bumped.
+ * It is only cleared in two cases: freeing the last non-tree
+ * reference to the extent_buffer when its STALE bit is set or
+ * calling releasepage when the tree reference is the only reference.
*
- * We can't just set the ref bit without bumping the
- * ref on the eb because free_extent_buffer might
- * see the ref bit and try to clear it. If this happens
- * free_extent_buffer might end up dropping our original
- * ref by mistake and freeing the page before we are able
- * to add one more ref.
+ * In both cases, care is taken to ensure that the extent_buffer's
+ * pages are not under io. However, releasepage can be concurrently
+ * called with creating new references, which is prone to race
+ * conditions between the calls to check_tree_ref in those codepaths
+ * and clearing TREE_REF in try_release_extent_buffer.
*
- * So bump the ref count first, then set the bit. If someone
- * beat us to it, drop the ref we added.
+ * The actual lifetime of the extent_buffer in the radix tree is
+ * adequately protected by the refcount, but the TREE_REF bit and
+ * its corresponding reference are not. To protect against this
+ * class of races, we call check_buffer_tree_ref from the codepaths
+ * which trigger io after they set eb->io_pages. Note that once io is
+ * initiated, TREE_REF can no longer be cleared, so that is the
+ * moment at which any such race is best fixed.
*/
refs = atomic_read(&eb->refs);
if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
eb->read_mirror = 0;
atomic_set(&eb->io_pages, num_reads);
+ /*
+ * It is possible for releasepage to clear the TREE_REF bit before we
+ * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+ */
+ check_buffer_tree_ref(eb);
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];

--
2.24.1


2020-06-17 17:25:14

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <[email protected]> wrote:
>
> Under somewhat convoluted conditions, it is possible to attempt to
> release an extent_buffer that is under io, which triggers a BUG_ON in
> btrfs_release_extent_buffer_pages.
>
> This relies on a few different factors. First, extent_buffer reads done
> as readahead for searching use WAIT_NONE, so they free the local extent
> buffer reference while the io is outstanding. However, they should still
> be protected by TREE_REF. However, if the system is doing significant
> reclaim, and simultaneously heavily accessing the extent_buffers, it is
> possible for releasepage to race with two concurrent readahead attempts
> in a way that leaves TREE_REF unset when the readahead extent buffer is
> released.
>
> Essentially, if two tasks race to allocate a new extent_buffer, but the
> winner who attempts the first io is rebuffed by a page being locked
> (likely by the reclaim itself) then the loser will still go ahead with
> issuing the readahead. The loser's call to find_extent_buffer must also
> race with the reclaim task reading the extent_buffer's refcount as 1 in
> a way that allows the reclaim to re-clear the TREE_REF checked by
> find_extent_buffer.
>
> The following represents an example execution demonstrating the race:
>
> CPU0 CPU1 CPU2
> reada_for_search reada_for_search
> readahead_tree_block readahead_tree_block
> find_create_tree_block find_create_tree_block
> alloc_extent_buffer alloc_extent_buffer
> find_extent_buffer // not found
> allocates eb
> lock pages
> associate pages to eb
> insert eb into radix tree
> set TREE_REF, refs == 2
> unlock pages
> read_extent_buffer_pages // WAIT_NONE
> not uptodate (brand new eb)
> lock_page
> if !trylock_page
> goto unlock_exit // not an error
> free_extent_buffer
> release_extent_buffer
> atomic_dec_and_test refs to 1
> find_extent_buffer // found
> try_release_extent_buffer
> take refs_lock
> reads refs == 1; no io
> atomic_inc_not_zero refs to 2
> mark_buffer_accessed
> check_buffer_tree_ref
> // not STALE, won't take refs_lock
> refs == 2; TREE_REF set // no action
> read_extent_buffer_pages // WAIT_NONE
> clear TREE_REF
> release_extent_buffer
> atomic_dec_and_test refs to 1
> unlock_page
> still not uptodate (CPU1 read failed on trylock_page)
> locks pages
> set io_pages > 0
> submit io
> return
> release_extent_buffer

Small detail, missing the call to free_extent_buffer() from
readahead_tree_block(), which is what ends up calling
release_extent_buffer().

> dec refs to 0
> delete from radix tree
> btrfs_release_extent_buffer_pages
> BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by introducing some spurious delays
> and by introducing probabilistic trylock_page failures.
>
> To fix it, we apply check_tree_ref at a point where it could not
> possibly be unset by a competing task: after io_pages has been
> incremented. There is no race in write_one_eb, that we know of, but for
> consistency, apply it there too. All the codepaths that clear TREE_REF
> check for io, so they would not be able to clear it after this point.
>
> Signed-off-by: Boris Burkov <[email protected]>

Btw, if you have a stack trace, it would be good to include it in the
change log for grep-ability in case someone runs into the same
problem.

> ---
> fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
> 1 file changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c59e07360083..f6758ebbb6a2 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> num_pages = num_extent_pages(eb);
> atomic_set(&eb->io_pages, num_pages);
> + /*
> + * It is possible for releasepage to clear the TREE_REF bit before we
> + * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> + */
> + check_buffer_tree_ref(eb);

This is a whole different case from the one described in the
changelog, as this is in the write path.
Why do we need this one?

At this point the eb pages are marked with the dirty bit, and
btree_releasepage() returns 0 if the page is dirty and we don't end up
calling try_release_extent_buffer().
So I don't understand why this hunk is needed, what variant of the
problem it's trying to solve.

>
> /* set btree blocks beyond nritems with 0 to avoid stale content. */
> nritems = btrfs_header_nritems(eb);
> @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> static void check_buffer_tree_ref(struct extent_buffer *eb)
> {
> int refs;
> - /* the ref bit is tricky. We have to make sure it is set
> - * if we have the buffer dirty. Otherwise the
> - * code to free a buffer can end up dropping a dirty
> - * page
> + /*
> + * The TREE_REF bit is first set when the extent_buffer is added
> + * to the radix tree. It is also reset, if unset, when a new reference
> + * is created by find_extent_buffer.
> *
> - * Once the ref bit is set, it won't go away while the
> - * buffer is dirty or in writeback, and it also won't
> - * go away while we have the reference count on the
> - * eb bumped.
> + * It is only cleared in two cases: freeing the last non-tree
> + * reference to the extent_buffer when its STALE bit is set or
> + * calling releasepage when the tree reference is the only reference.
> *
> - * We can't just set the ref bit without bumping the
> - * ref on the eb because free_extent_buffer might
> - * see the ref bit and try to clear it. If this happens
> - * free_extent_buffer might end up dropping our original
> - * ref by mistake and freeing the page before we are able
> - * to add one more ref.
> + * In both cases, care is taken to ensure that the extent_buffer's
> + * pages are not under io. However, releasepage can be concurrently
> + * called with creating new references, which is prone to race
> + * conditions between the calls to check_tree_ref in those codepaths
> + * and clearing TREE_REF in try_release_extent_buffer.
> *
> - * So bump the ref count first, then set the bit. If someone
> - * beat us to it, drop the ref we added.
> + * The actual lifetime of the extent_buffer in the radix tree is
> + * adequately protected by the refcount, but the TREE_REF bit and
> + * its corresponding reference are not. To protect against this
> + * class of races, we call check_buffer_tree_ref from the codepaths
> + * which trigger io after they set eb->io_pages. Note that once io is
> + * initiated, TREE_REF can no longer be cleared, so that is the
> + * moment at which any such race is best fixed.
> */
> refs = atomic_read(&eb->refs);
> if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
> clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> eb->read_mirror = 0;
> atomic_set(&eb->io_pages, num_reads);
> + /*
> + * It is possible for releasepage to clear the TREE_REF bit before we
> + * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> + */
> + check_buffer_tree_ref(eb);

This is the hunk that fixes the problem described in the change log,
and it looks good to me.

Thanks.

> for (i = 0; i < num_pages; i++) {
> page = eb->pages[i];
>
> --
> 2.24.1
>


--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2020-06-17 17:39:41

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana <[email protected]> wrote:
>
> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <[email protected]> wrote:
> >
> > Under somewhat convoluted conditions, it is possible to attempt to
> > release an extent_buffer that is under io, which triggers a BUG_ON in
> > btrfs_release_extent_buffer_pages.
> >
> > This relies on a few different factors. First, extent_buffer reads done
> > as readahead for searching use WAIT_NONE, so they free the local extent
> > buffer reference while the io is outstanding. However, they should still
> > be protected by TREE_REF. However, if the system is doing significant
> > reclaim, and simultaneously heavily accessing the extent_buffers, it is
> > possible for releasepage to race with two concurrent readahead attempts
> > in a way that leaves TREE_REF unset when the readahead extent buffer is
> > released.
> >
> > Essentially, if two tasks race to allocate a new extent_buffer, but the
> > winner who attempts the first io is rebuffed by a page being locked
> > (likely by the reclaim itself) then the loser will still go ahead with
> > issuing the readahead. The loser's call to find_extent_buffer must also
> > race with the reclaim task reading the extent_buffer's refcount as 1 in
> > a way that allows the reclaim to re-clear the TREE_REF checked by
> > find_extent_buffer.
> >
> > The following represents an example execution demonstrating the race:
> >
> > CPU0 CPU1 CPU2
> > reada_for_search reada_for_search
> > readahead_tree_block readahead_tree_block
> > find_create_tree_block find_create_tree_block
> > alloc_extent_buffer alloc_extent_buffer
> > find_extent_buffer // not found
> > allocates eb
> > lock pages
> > associate pages to eb
> > insert eb into radix tree
> > set TREE_REF, refs == 2
> > unlock pages
> > read_extent_buffer_pages // WAIT_NONE
> > not uptodate (brand new eb)
> > lock_page
> > if !trylock_page
> > goto unlock_exit // not an error
> > free_extent_buffer
> > release_extent_buffer
> > atomic_dec_and_test refs to 1
> > find_extent_buffer // found
> > try_release_extent_buffer
> > take refs_lock
> > reads refs == 1; no io
> > atomic_inc_not_zero refs to 2
> > mark_buffer_accessed
> > check_buffer_tree_ref
> > // not STALE, won't take refs_lock
> > refs == 2; TREE_REF set // no action
> > read_extent_buffer_pages // WAIT_NONE
> > clear TREE_REF
> > release_extent_buffer
> > atomic_dec_and_test refs to 1
> > unlock_page
> > still not uptodate (CPU1 read failed on trylock_page)
> > locks pages
> > set io_pages > 0
> > submit io
> > return
> > release_extent_buffer
>
> Small detail, missing the call to free_extent_buffer() from
> readahead_tree_block(), which is what ends calling
> release_extent_buffer().
>
> > dec refs to 0
> > delete from radix tree
> > btrfs_release_extent_buffer_pages
> > BUG_ON(io_pages > 0)!!!
> >
> > We observe this at a very low rate in production and were also able to
> > reproduce it in a test environment by introducing some spurious delays
> > and by introducing probabilistic trylock_page failures.
> >
> > To fix it, we apply check_tree_ref at a point where it could not
> > possibly be unset by a competing task: after io_pages has been
> > incremented. There is no race in write_one_eb, that we know of, but for
> > consistency, apply it there too. All the codepaths that clear TREE_REF
> > check for io, so they would not be able to clear it after this point.
> >
> > Signed-off-by: Boris Burkov <[email protected]>
>
> Btw, if you have a stack trace, it would be good to include it in the
> change log for grep-ability in case someone runs into the same
> problem.
>
> > ---
> > fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
> > 1 file changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index c59e07360083..f6758ebbb6a2 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> > clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> > num_pages = num_extent_pages(eb);
> > atomic_set(&eb->io_pages, num_pages);
> > + /*
> > + * It is possible for releasepage to clear the TREE_REF bit before we
> > + * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > + */
> > + check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?
>
> At this point the eb pages are marked with the dirty bit, and
> btree_releasepage() returns 0 if the page is dirty and we don't end up
> calling try_release_extent_buffer().

In addition to that, when write_one_eb() is called we have already
incremented the eb's ref count, and EXTENT_BUFFER_WRITEBACK is set on
it, so try_release_extent_buffer() should return early and never call
release_extent_buffer(), since the refcount is > 1 and
extent_buffer_under_io() returns true.

> So I don't understand why this hunk is needed, what variant of the
> problem it's trying to solve.
>
> >
> > /* set btree blocks beyond nritems with 0 to avoid stale content. */
> > nritems = btrfs_header_nritems(eb);
> > @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> > static void check_buffer_tree_ref(struct extent_buffer *eb)
> > {
> > int refs;
> > - /* the ref bit is tricky. We have to make sure it is set
> > - * if we have the buffer dirty. Otherwise the
> > - * code to free a buffer can end up dropping a dirty
> > - * page
> > + /*
> > + * The TREE_REF bit is first set when the extent_buffer is added
> > + * to the radix tree. It is also reset, if unset, when a new reference
> > + * is created by find_extent_buffer.
> > *
> > - * Once the ref bit is set, it won't go away while the
> > - * buffer is dirty or in writeback, and it also won't
> > - * go away while we have the reference count on the
> > - * eb bumped.
> > + * It is only cleared in two cases: freeing the last non-tree
> > + * reference to the extent_buffer when its STALE bit is set or
> > + * calling releasepage when the tree reference is the only reference.
> > *
> > - * We can't just set the ref bit without bumping the
> > - * ref on the eb because free_extent_buffer might
> > - * see the ref bit and try to clear it. If this happens
> > - * free_extent_buffer might end up dropping our original
> > - * ref by mistake and freeing the page before we are able
> > - * to add one more ref.
> > + * In both cases, care is taken to ensure that the extent_buffer's
> > + * pages are not under io. However, releasepage can be concurrently
> > + * called with creating new references, which is prone to race
> > + * conditions between the calls to check_tree_ref in those codepaths
> > + * and clearing TREE_REF in try_release_extent_buffer.
> > *
> > - * So bump the ref count first, then set the bit. If someone
> > - * beat us to it, drop the ref we added.
> > + * The actual lifetime of the extent_buffer in the radix tree is
> > + * adequately protected by the refcount, but the TREE_REF bit and
> > + * its corresponding reference are not. To protect against this
> > + * class of races, we call check_buffer_tree_ref from the codepaths
> > + * which trigger io after they set eb->io_pages. Note that once io is
> > + * initiated, TREE_REF can no longer be cleared, so that is the
> > + * moment at which any such race is best fixed.
> > */
> > refs = atomic_read(&eb->refs);
> > if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> > @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
> > clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> > eb->read_mirror = 0;
> > atomic_set(&eb->io_pages, num_reads);
> > + /*
> > + * It is possible for releasepage to clear the TREE_REF bit before we
> > + * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > + */
> > + check_buffer_tree_ref(eb);
>
> This is the hunk that fixes the problem described in the change log,
> and it looks good to me.
>
> Thanks.
>
> > for (i = 0; i < num_pages; i++) {
> > page = eb->pages[i];
> >
> > --
> > 2.24.1
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2020-06-17 17:48:24

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On 17 Jun 2020, at 13:20, Filipe Manana wrote:

> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <[email protected]> wrote:
>
>> ---
>> fs/btrfs/extent_io.c | 45
>> ++++++++++++++++++++++++++++----------------
>> 1 file changed, 29 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index c59e07360083..f6758ebbb6a2 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
>> write_one_eb(struct extent_buffer *eb,
>> clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>> num_pages = num_extent_pages(eb);
>> atomic_set(&eb->io_pages, num_pages);
>> + /*
>> + * It is possible for releasepage to clear the TREE_REF bit
>> before we
>> + * set io_pages. See check_buffer_tree_ref for a more
>> detailed comment.
>> + */
>> + check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?

This was Josef’s idea, but I really like the symmetry. You set
io_pages, you do the tree_ref dance. Everyone fiddling with the
writeback bit right now correctly clears writeback after doing the
atomic_dec on io_pages, but the race is tiny and prone to being exposed
again by code getting shifted around. Tree ref checks around io_pages
are the most reliable way to prevent this bug from coming back later.

-chris

2020-06-17 18:28:58

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On Wed, Jun 17, 2020 at 7:19 PM Josef Bacik <[email protected]> wrote:
>
> On 6/17/20 2:11 PM, Filipe Manana wrote:
> > On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <[email protected]> wrote:
> >>
> >> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
> >>
> >>> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <[email protected]> wrote:
> >>>
> >>>> ---
> >>>> fs/btrfs/extent_io.c | 45
> >>>> ++++++++++++++++++++++++++++----------------
> >>>> 1 file changed, 29 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >>>> index c59e07360083..f6758ebbb6a2 100644
> >>>> --- a/fs/btrfs/extent_io.c
> >>>> +++ b/fs/btrfs/extent_io.c
> >>>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
> >>>> write_one_eb(struct extent_buffer *eb,
> >>>> clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >>>> num_pages = num_extent_pages(eb);
> >>>> atomic_set(&eb->io_pages, num_pages);
> >>>> + /*
> >>>> + * It is possible for releasepage to clear the TREE_REF bit
> >>>> before we
> >>>> + * set io_pages. See check_buffer_tree_ref for a more
> >>>> detailed comment.
> >>>> + */
> >>>> + check_buffer_tree_ref(eb);
> >>>
> >>> This is a whole different case from the one described in the
> >>> changelog, as this is in the write path.
> >>> Why do we need this one?
> >>
> >> This was Josef’s idea, but I really like the symmetry. You set
> >> io_pages, you do the tree_ref dance. Everyone fiddling with the write
> >> back bit right now correctly clears writeback after doing the atomic_dec
> >> on io_pages, but the race is tiny and prone to getting exposed again by
> >> shifting code around. Tree ref checks around io_pages are the most
> >> reliable way to prevent this bug from coming back again later.
> >
> > Ok, but that still doesn't answer my question.
> > Is there an actual race/problem this hunk solves?
> >
> > Before calling write_one_eb() we increment the ref on the eb and we
> > also call lock_extent_buffer_for_io(),
> > which clears the dirty bit and sets the writeback bit on the eb while
> > holding its ref_locks spin_lock.
> >
> > Even if we get to try_release_extent_buffer, it calls
> > extent_buffer_under_io(eb) while holding the ref_locks spin_lock,
> > so at any time it should return true, as either the dirty or the
> > writeback bit is set.
> >
> > Is this purely a safety guard that is being introduced?
> >
> > Can we at least describe in the changelog why we are adding this hunk
> > in the write path?
> > All it mentions is a race between reading and releasing pages, there's
> > nothing mentioned about races with writeback.
> >
>
> I think maybe we make that bit a separate patch, and leave the panic fix on its
> own. I suggested this because I have lofty ideas of killing the refs_lock, and
> this would at least keep us consistent in our usage of TREE_REF to save us from
> freeing stuff that may be currently under IO.
>
> I'm super not happy with our reference handling coupled with releasepage, but
> these are the kind of hoops we're going to have to jump through until we have
> some better infrastructure in place to handle metadata. Thanks,

Ok, so it's just a safety guard then.
Yes, either a separate patch or at the very least mention why we are
adding that in the change log.

Thanks.

>
> Josef



--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2020-06-17 18:40:39

by Boris Burkov

[permalink] [raw]
Subject: [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. They should still be
protected by TREE_REF, but if the system is doing significant reclaim
while simultaneously accessing the extent_buffers heavily, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner that attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself), then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

CPU0                                      CPU1                                      CPU2
reada_for_search                          reada_for_search
  readahead_tree_block                      readahead_tree_block
    find_create_tree_block                    find_create_tree_block
      alloc_extent_buffer                       alloc_extent_buffer
                                                  find_extent_buffer // not found
                                                  allocates eb
                                                  lock pages
                                                  associate pages to eb
                                                  insert eb into radix tree
                                                  set TREE_REF, refs == 2
                                                  unlock pages
                                                read_extent_buffer_pages // WAIT_NONE
                                                  not uptodate (brand new eb)
                                                  lock_page
                                                  if !trylock_page
                                                    goto unlock_exit // not an error
                                                free_extent_buffer
                                                  release_extent_buffer
                                                    atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                    try_release_extent_buffer
                                                                                      take refs_lock
                                                                                      reads refs == 1; no io
          atomic_inc_not_zero refs to 2
        mark_extent_buffer_accessed
          check_buffer_tree_ref
            // not STALE, won't take refs_lock
            refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                      clear TREE_REF
                                                                                      release_extent_buffer
                                                                                        atomic_dec_and_test refs to 1
                                                                                    unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
free_extent_buffer
  release_extent_buffer
    dec refs to 0
    delete from radix tree
    btrfs_release_extent_buffer_pages
      BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_buffer_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.

stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

Signed-off-by: Boris Burkov <[email protected]>
---
fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..95313bb7fe40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5086,25 +5086,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
static void check_buffer_tree_ref(struct extent_buffer *eb)
{
int refs;
- /* the ref bit is tricky. We have to make sure it is set
- * if we have the buffer dirty. Otherwise the
- * code to free a buffer can end up dropping a dirty
- * page
+ /*
+ * The TREE_REF bit is first set when the extent_buffer is added
+ * to the radix tree. It is also reset, if unset, when a new reference
+ * is created by find_extent_buffer.
*
- * Once the ref bit is set, it won't go away while the
- * buffer is dirty or in writeback, and it also won't
- * go away while we have the reference count on the
- * eb bumped.
+ * It is only cleared in two cases: freeing the last non-tree
+ * reference to the extent_buffer when its STALE bit is set or
+ * calling releasepage when the tree reference is the only reference.
*
- * We can't just set the ref bit without bumping the
- * ref on the eb because free_extent_buffer might
- * see the ref bit and try to clear it. If this happens
- * free_extent_buffer might end up dropping our original
- * ref by mistake and freeing the page before we are able
- * to add one more ref.
+ * In both cases, care is taken to ensure that the extent_buffer's
+ * pages are not under io. However, releasepage can be concurrently
+ * called with creating new references, which is prone to race
+ * conditions between the calls to check_buffer_tree_ref in those
+ * codepaths and clearing TREE_REF in try_release_extent_buffer.
*
- * So bump the ref count first, then set the bit. If someone
- * beat us to it, drop the ref we added.
+ * The actual lifetime of the extent_buffer in the radix tree is
+ * adequately protected by the refcount, but the TREE_REF bit and
+ * its corresponding reference are not. To protect against this
+ * class of races, we call check_buffer_tree_ref from the codepaths
+ * which trigger io after they set eb->io_pages. Note that once io is
+ * initiated, TREE_REF can no longer be cleared, so that is the
+ * moment at which any such race is best fixed.
*/
refs = atomic_read(&eb->refs);
if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5555,6 +5558,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
eb->read_mirror = 0;
atomic_set(&eb->io_pages, num_reads);
+ /*
+ * It is possible for releasepage to clear the TREE_REF bit before we
+ * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+ */
+ check_buffer_tree_ref(eb);
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];

--
2.24.1

2020-06-17 18:56:31

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On Wed, Jun 17, 2020 at 7:39 PM Boris Burkov <[email protected]> wrote:
>
> Under somewhat convoluted conditions, it is possible to attempt to
> release an extent_buffer that is under io, which triggers a BUG_ON in
> btrfs_release_extent_buffer_pages.
>
> This relies on a few different factors. First, extent_buffer reads done
> as readahead for searching use WAIT_NONE, so they free the local extent
> buffer reference while the io is outstanding. However, they should still
> be protected by TREE_REF. However, if the system is doing signficant
> reclaim, and simultaneously heavily accessing the extent_buffers, it is
> possible for releasepage to race with two concurrent readahead attempts
> in a way that leaves TREE_REF unset when the readahead extent buffer is
> released.
>
> Essentially, if two tasks race to allocate a new extent_buffer, but the
> winner who attempts the first io is rebuffed by a page being locked
> (likely by the reclaim itself) then the loser will still go ahead with
> issuing the readahead. The loser's call to find_extent_buffer must also
> race with the reclaim task reading the extent_buffer's refcount as 1 in
> a way that allows the reclaim to re-clear the TREE_REF checked by
> find_extent_buffer.
>
> The following represents an example execution demonstrating the race:
>
> CPU0 CPU1 CPU2
> reada_for_search reada_for_search
> readahead_tree_block readahead_tree_block
> find_create_tree_block find_create_tree_block
> alloc_extent_buffer alloc_extent_buffer
> find_extent_buffer // not found
> allocates eb
> lock pages
> associate pages to eb
> insert eb into radix tree
> set TREE_REF, refs == 2
> unlock pages
> read_extent_buffer_pages // WAIT_NONE
> not uptodate (brand new eb)
> lock_page
> if !trylock_page
> goto unlock_exit // not an error
> free_extent_buffer
> release_extent_buffer
> atomic_dec_and_test refs to 1
> find_extent_buffer // found
> try_release_extent_buffer
> take refs_lock
> reads refs == 1; no io
> atomic_inc_not_zero refs to 2
> mark_buffer_accessed
> check_buffer_tree_ref
> // not STALE, won't take refs_lock
> refs == 2; TREE_REF set // no action
> read_extent_buffer_pages // WAIT_NONE
> clear TREE_REF
> release_extent_buffer
> atomic_dec_and_test refs to 1
> unlock_page
> still not uptodate (CPU1 read failed on trylock_page)
> locks pages
> set io_pages > 0
> submit io
> return
> free_extent_buffer
> release_extent_buffer
> dec refs to 0
> delete from radix tree
> btrfs_release_extent_buffer_pages
> BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by introducing some spurious delays
> and by introducing probabilistic trylock_page failures.
>
> To fix it, we apply check_buffer_tree_ref at a point where it could not
> possibly be unset by a competing task: after io_pages has been
> incremented. All the codepaths that clear TREE_REF check for io, so they
> would not be able to clear it after this point until the io is done.
>
> stack trace, for reference:
> [...]
> Signed-off-by: Boris Burkov <[email protected]>

Reviewed-by: Filipe Manana <[email protected]>

Looks good, thanks!

> ---
> fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++----------------
> 1 file changed, 24 insertions(+), 16 deletions(-)
>
> [...]
>


--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2020-06-17 19:05:49

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <[email protected]> wrote:
>
> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
>
> > On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <[email protected]> wrote:
> >
> >> ---
> >> fs/btrfs/extent_io.c | 45
> >> ++++++++++++++++++++++++++++----------------
> >> 1 file changed, 29 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >> index c59e07360083..f6758ebbb6a2 100644
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
> >> write_one_eb(struct extent_buffer *eb,
> >> clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >> num_pages = num_extent_pages(eb);
> >> atomic_set(&eb->io_pages, num_pages);
> >> + /*
> >> + * It is possible for releasepage to clear the TREE_REF bit
> >> before we
> >> + * set io_pages. See check_buffer_tree_ref for a more
> >> detailed comment.
> >> + */
> >> + check_buffer_tree_ref(eb);
> >
> > This is a whole different case from the one described in the
> > changelog, as this is in the write path.
> > Why do we need this one?
>
> This was Josef’s idea, but I really like the symmetry. You set
> io_pages, you do the tree_ref dance. Everyone fiddling with the write
> back bit right now correctly clears writeback after doing the atomic_dec
> on io_pages, but the race is tiny and prone to getting exposed again by
> shifting code around. Tree ref checks around io_pages are the most
> reliable way to prevent this bug from coming back again later.

Ok, but that still doesn't answer my question.
Is there an actual race/problem this hunk solves?

Before calling write_one_eb() we increment the ref on the eb and we
also call lock_extent_buffer_for_io(),
which clears the dirty bit and sets the writeback bit on the eb while
holding its refs_lock spin_lock.

Even if we get to try_release_extent_buffer, it calls
extent_buffer_under_io(eb) while holding the refs_lock spin_lock,
so at any time it should return true, as either the dirty or the
writeback bit is set.

Is this purely a safety guard that is being introduced?

Can we at least describe in the changelog why we are adding this hunk
in the write path?
All it mentions is a race between reading and releasing pages, there's
nothing mentioned about races with writeback.

Thanks.

>
> -chris



--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2020-06-17 19:07:04

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On 6/17/20 2:11 PM, Filipe Manana wrote:
> On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <[email protected]> wrote:
>>
>> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
>>
>>> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <[email protected]> wrote:
>>>
>>>> ---
>>>> fs/btrfs/extent_io.c | 45
>>>> ++++++++++++++++++++++++++++----------------
>>>> 1 file changed, 29 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index c59e07360083..f6758ebbb6a2 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
>>>> write_one_eb(struct extent_buffer *eb,
>>>> clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>>>> num_pages = num_extent_pages(eb);
>>>> atomic_set(&eb->io_pages, num_pages);
>>>> + /*
>>>> + * It is possible for releasepage to clear the TREE_REF bit
>>>> before we
>>>> + * set io_pages. See check_buffer_tree_ref for a more
>>>> detailed comment.
>>>> + */
>>>> + check_buffer_tree_ref(eb);
>>>
>>> This is a whole different case from the one described in the
>>> changelog, as this is in the write path.
>>> Why do we need this one?
>>
>> This was Josef’s idea, but I really like the symmetry. You set
>> io_pages, you do the tree_ref dance. Everyone fiddling with the write
>> back bit right now correctly clears writeback after doing the atomic_dec
>> on io_pages, but the race is tiny and prone to getting exposed again by
>> shifting code around. Tree ref checks around io_pages are the most
>> reliable way to prevent this bug from coming back again later.
>
> Ok, but that still doesn't answer my question.
> Is there an actual race/problem this hunk solves?
>
> Before calling write_one_eb() we increment the ref on the eb and we
> also call lock_extent_buffer_for_io(),
> which clears the dirty bit and sets the writeback bit on the eb while
> holding its refs_lock spin_lock.
>
> Even if we get to try_release_extent_buffer, it calls
> extent_buffer_under_io(eb) while holding the refs_lock spin_lock,
> so at any time it should return true, as either the dirty or the
> writeback bit is set.
>
> Is this purely a safety guard that is being introduced?
>
> Can we at least describe in the changelog why we are adding this hunk
> in the write path?
> All it mentions is a race between reading and releasing pages, there's
> nothing mentioned about races with writeback.
>

I think maybe we make that bit a separate patch, and leave the panic fix on its
own. I suggested this because I have lofty ideas of killing the refs_lock, and
this would at least keep us consistent in our usage of TREE_REF to save us from
freeing stuff that may be currently under IO.

I'm super not happy with our reference handling coupled with releasepage, but
these are the kind of hoops we're going to have to jump through until we have
some better infrastructure in place to handle metadata. Thanks,

Josef

2020-06-17 21:50:26

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

Hi Boris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.8-rc1 next-20200617]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-fix-fatal-extent_buffer-readahead-vs-releasepage-race/20200618-002941
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-rhel (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce (this is a W=1 build):
# save the attached .config to linux build tree
make W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>, old ones prefixed by <<):

In file included from fs/btrfs/extent_io.c:19:
fs/btrfs/ctree.h:2216:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2216 | size_t __const btrfs_get_num_csums(void);
| ^~~~~~~
fs/btrfs/extent_io.c: In function 'write_one_eb':
>> fs/btrfs/extent_io.c:3934:2: error: implicit declaration of function 'check_buffer_tree_ref' [-Werror=implicit-function-declaration]
3934 | check_buffer_tree_ref(eb);
| ^~~~~~~~~~~~~~~~~~~~~
fs/btrfs/extent_io.c: At top level:
>> fs/btrfs/extent_io.c:5091:13: warning: conflicting types for 'check_buffer_tree_ref'
5091 | static void check_buffer_tree_ref(struct extent_buffer *eb)
| ^~~~~~~~~~~~~~~~~~~~~
>> fs/btrfs/extent_io.c:5091:13: error: static declaration of 'check_buffer_tree_ref' follows non-static declaration
fs/btrfs/extent_io.c:3934:2: note: previous implicit declaration of 'check_buffer_tree_ref' was here
3934 | check_buffer_tree_ref(eb);
| ^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/check_buffer_tree_ref +3934 fs/btrfs/extent_io.c

3915
3916 static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
3917 struct writeback_control *wbc,
3918 struct extent_page_data *epd)
3919 {
3920 u64 offset = eb->start;
3921 u32 nritems;
3922 int i, num_pages;
3923 unsigned long start, end;
3924 unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
3925 int ret = 0;
3926
3927 clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
3928 num_pages = num_extent_pages(eb);
3929 atomic_set(&eb->io_pages, num_pages);
3930 /*
3931 * It is possible for releasepage to clear the TREE_REF bit before we
3932 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
3933 */
> 3934 check_buffer_tree_ref(eb);
3935
3936 /* set btree blocks beyond nritems with 0 to avoid stale content. */
3937 nritems = btrfs_header_nritems(eb);
3938 if (btrfs_header_level(eb) > 0) {
3939 end = btrfs_node_key_ptr_offset(nritems);
3940
3941 memzero_extent_buffer(eb, end, eb->len - end);
3942 } else {
3943 /*
3944 * leaf:
3945 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
3946 */
3947 start = btrfs_item_nr_offset(nritems);
3948 end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
3949 memzero_extent_buffer(eb, start, end - start);
3950 }
3951
3952 for (i = 0; i < num_pages; i++) {
3953 struct page *p = eb->pages[i];
3954
3955 clear_page_dirty_for_io(p);
3956 set_page_writeback(p);
3957 ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
3958 p, offset, PAGE_SIZE, 0,
3959 &epd->bio,
3960 end_bio_extent_buffer_writepage,
3961 0, 0, 0, false);
3962 if (ret) {
3963 set_btree_ioerr(p);
3964 if (PageWriteback(p))
3965 end_page_writeback(p);
3966 if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
3967 end_extent_buffer_writeback(eb);
3968 ret = -EIO;
3969 break;
3970 }
3971 offset += PAGE_SIZE;
3972 update_nr_written(wbc, 1);
3973 unlock_page(p);
3974 }
3975
3976 if (unlikely(ret)) {
3977 for (; i < num_pages; i++) {
3978 struct page *p = eb->pages[i];
3979 clear_page_dirty_for_io(p);
3980 unlock_page(p);
3981 }
3982 }
3983
3984 return ret;
3985 }
3986

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (4.52 kB)
.config.gz (43.81 kB)
Download all attachments

2020-06-17 23:28:53

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

Hi Boris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.8-rc1 next-20200617]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-fix-fatal-extent_buffer-readahead-vs-releasepage-race/20200618-002941
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-allyesconfig (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 487ca07fcc75d52755c9fe2ee05bcb3b6eeeec44)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install x86_64 cross compiling tool for clang build
# apt-get install binutils-x86-64-linux-gnu
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>, old ones prefixed by <<):

In file included from fs/btrfs/extent_io.c:19:
fs/btrfs/ctree.h:2216:8: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
size_t __const btrfs_get_num_csums(void);
^~~~~~~~
>> fs/btrfs/extent_io.c:3934:2: error: implicit declaration of function 'check_buffer_tree_ref' [-Werror,-Wimplicit-function-declaration]
check_buffer_tree_ref(eb);
^
>> fs/btrfs/extent_io.c:5091:13: error: static declaration of 'check_buffer_tree_ref' follows non-static declaration
static void check_buffer_tree_ref(struct extent_buffer *eb)
^
fs/btrfs/extent_io.c:3934:2: note: previous implicit declaration is here
check_buffer_tree_ref(eb);
^
1 warning and 2 errors generated.

vim +/check_buffer_tree_ref +3934 fs/btrfs/extent_io.c


---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (4.57 kB)
.config.gz (72.24 kB)
Download all attachments

2020-06-18 20:36:59

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

On Wed, Jun 17, 2020 at 11:35:19AM -0700, Boris Burkov wrote:
[...]

> Signed-off-by: Boris Burkov <[email protected]>

Added to misc-next, thanks.