2016-10-11 15:23:07

by Dave Jones

Subject: btrfs bio linked list corruption.

This is from Linus' current tree, with Al's iovec fixups on top.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
Call Trace:
[<ffffffff8d32007c>] dump_stack+0x4f/0x73
[<ffffffff8d07a6c1>] __warn+0xc1/0xe0
[<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
[<ffffffff8d33e689>] __list_add+0x89/0xb0
[<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
[<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240
[<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240
[<ffffffff8d2fdb98>] submit_bio+0x78/0x150
[<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0
[<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs]
[<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs]
[<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs]
[<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs]
[<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs]
[<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs]
[<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs]
[<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs]
[<ffffffff8d7c2d6c>] ? _raw_write_unlock+0x2c/0x50
[<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs]
[<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs]
[<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs]
[<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs]
[<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs]
[<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20
[<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50
[<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs]
[<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs]
[<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs]
[<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs]
[<ffffffff8d1f4092>] notify_change+0x252/0x440
[<ffffffff8d1d164e>] do_truncate+0x6e/0xc0
[<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170
[<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20
[<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10
[<ffffffff8d00259c>] do_syscall_64+0x5c/0x170
[<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace 906673a2f703b373 ]---


2016-10-11 15:11:50

by Al Viro

Subject: Re: btrfs bio linked list corruption.

On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.

Those iovec fixups are in the current tree... TBH, I don't see anything
in splice-related stuff that could come anywhere near that (short of
some general memory corruption having random effects of that sort).

Could you try to bisect that sucker, or is it too hard to reproduce?

2016-10-11 15:20:51

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
>
> Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

> TBH, I don't see anything
> in splice-related stuff that could come anywhere near that (short of
> some general memory corruption having random effects of that sort).
>
> Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to
reproduce today.

Dave

2016-10-11 15:21:18

by Chris Mason

Subject: Re: btrfs bio linked list corruption.



On 10/11/2016 11:19 AM, Dave Jones wrote:
> On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > Those iovec fixups are in the current tree...
>
> ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.
>
> > TBH, I don't see anything
> > in splice-related stuff that could come anywhere near that (short of
> > some general memory corruption having random effects of that sort).
> >
> > Could you try to bisect that sucker, or is it too hard to reproduce?
>
> Only hit it the once overnight so far. Will see if I can find a better way to
> reproduce today.

This call trace is reading metadata so we can finish the truncate. I'd
say adding more memory pressure would make it happen more often.

I'll try to trigger.

-chris

2016-10-11 15:50:49

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Tue, Oct 11, 2016 at 11:20:41AM -0400, Chris Mason wrote:
>
>
> On 10/11/2016 11:19 AM, Dave Jones wrote:
> > On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> > > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > >
> > > Those iovec fixups are in the current tree...
> >
> > ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.
> >
> > > TBH, I don't see anything
> > > in splice-related stuff that could come anywhere near that (short of
> > > some general memory corruption having random effects of that sort).
> > >
> > > Could you try to bisect that sucker, or is it too hard to reproduce?
> >
> > Only hit it the once overnight so far. Will see if I can find a better way to
> > reproduce today.
>
> This call trace is reading metadata so we can finish the truncate. I'd
> say adding more memory pressure would make it happen more often.

That story checks out. There were a bunch of oom's in the log before this.

Dave

2016-10-11 16:25:32

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
>
>
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> > Call Trace:
> > [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> > [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> > [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> > [<ffffffff8d33e689>] __list_add+0x89/0xb0
> > [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
>
> 	/*
> 	 * A task plug currently exists. Since this is completely lockless,
> 	 * utilize that to temporarily store requests until the task is
> 	 * either done or scheduled away.
> 	 */
> 	plug = current->plug;
> 	if (plug) {
> 		blk_mq_bio_to_request(rq, bio);
> 		if (!request_count)
> 			trace_block_plug(q);
>
> 		blk_mq_put_ctx(data.ctx);
>
> 		if (request_count >= BLK_MAX_REQUEST_COUNT) {
> 			blk_flush_plug_list(plug, false);
> 			trace_block_plug(q);
> 		}
>
> 		list_add_tail(&rq->queuelist, &plug->mq_list);
> 		^^^^^^^^^^^^^^^^^^^^^^
>
> Dave, is this where we're crashing? This seems strange.

According to objdump -S ..


ffffffff8130a1b7: 48 8b 70 50 mov 0x50(%rax),%rsi
list_add_tail(&rq->queuelist, &ctx->rq_list);
ffffffff8130a1bb: 48 8d 50 48 lea 0x48(%rax),%rdx
ffffffff8130a1bf: 48 89 45 a8 mov %rax,-0x58(%rbp)
ffffffff8130a1c3: e8 38 44 03 00 callq ffffffff8133e600 <__list_add>
blk_mq_hctx_mark_pending(hctx, ctx);
ffffffff8130a1c8: 48 8b 45 a8 mov -0x58(%rbp),%rax
ffffffff8130a1cc: 4c 89 ff mov %r15,%rdi

That looks like the list_add_tail from __blk_mq_insert_req_list

Dave

2016-10-11 16:25:46

by Chris Mason

Subject: Re: btrfs bio linked list corruption.



On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> Call Trace:
> [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> [<ffffffff8d33e689>] __list_add+0x89/0xb0
> [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350

	/*
	 * A task plug currently exists. Since this is completely lockless,
	 * utilize that to temporarily store requests until the task is
	 * either done or scheduled away.
	 */
	plug = current->plug;
	if (plug) {
		blk_mq_bio_to_request(rq, bio);
		if (!request_count)
			trace_block_plug(q);

		blk_mq_put_ctx(data.ctx);

		if (request_count >= BLK_MAX_REQUEST_COUNT) {
			blk_flush_plug_list(plug, false);
			trace_block_plug(q);
		}

		list_add_tail(&rq->queuelist, &plug->mq_list);
		^^^^^^^^^^^^^^^^^^^^^^

Dave, is this where we're crashing? This seems strange.

-chris
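
For context on the list being added to here: plug->mq_list is a list head embedded
in a struct blk_plug, and callers of the plugging API keep that structure on their
own stack. A minimal sketch of the usual pattern follows (illustrative only, not
code from this thread; it assumes nothing beyond the standard blk_start_plug() /
blk_finish_plug() interface):

	#include <linux/blkdev.h>

	/* Typical block-plugging usage: the plug, and therefore the mq_list head
	 * that blk_sq_make_request() appends requests to, lives on the caller's
	 * stack for the lifetime of the plug. */
	static void submit_with_plug(struct bio *bio)
	{
		struct blk_plug plug;

		blk_start_plug(&plug);		/* initializes plug.mq_list (on the stack) */
		submit_bio(bio);		/* may queue requests onto plug.mq_list */
		blk_finish_plug(&plug);		/* flushes and empties the list */
	}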

2016-10-12 13:48:26

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
>
>
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880

I hit this again overnight, it's the same trace, the only difference
being slightly different addresses in the list pointers:

[42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).

I'm actually a little surprised that ->next was the same across two
reboots on two different kernel builds. That might be a sign this is
more repeatable than I'd thought, even if it does take hours of runtime
right now to trigger it. I'll try and narrow the scope of what trinity
is doing to see if I can make it happen faster.

Dave

2016-10-12 14:40:26

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> >
> >
> > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > This is from Linus' current tree, with Al's iovec fixups on top.
> > >
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
>
> I hit this again overnight, it's the same trace, the only difference
> being slightly different addresses in the list pointers:
>
> [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
>
> I'm actually a little surprised that ->next was the same across two
> reboots on two different kernel builds. That might be a sign this is
> more repeatable than I'd thought, even if it does take hours of runtime
> right now to trigger it. I'll try and narrow the scope of what trinity
> is doing to see if I can make it happen faster.

.. and of course the first thing that happens is a completely different
btrfs trace..


WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a

Call Trace:
[<ffffffffb731ff3c>] dump_stack+0x4f/0x73
[<ffffffffb707a6c1>] __warn+0xc1/0xe0
[<ffffffffb707a7e8>] warn_slowpath_null+0x18/0x20
[<ffffffffc01d312a>] start_transaction+0x40a/0x440 [btrfs]
[<ffffffffc01a6215>] ? btrfs_alloc_path+0x15/0x20 [btrfs]
[<ffffffffc01d31b2>] btrfs_join_transaction+0x12/0x20 [btrfs]
[<ffffffffc01d92cf>] cow_file_range_inline+0xef/0x830 [btrfs]
[<ffffffffc01d9d75>] cow_file_range.isra.64+0x365/0x480 [btrfs]
[<ffffffffb77c24cc>] ? _raw_spin_unlock+0x2c/0x50
[<ffffffffc01f229f>] ? release_extent_buffer+0x9f/0x110 [btrfs]
[<ffffffffc01da299>] run_delalloc_nocow+0x409/0xbd0 [btrfs]
[<ffffffffb70c6109>] ? get_lock_stats+0x19/0x50
[<ffffffffc01dadea>] run_delalloc_range+0x38a/0x3e0 [btrfs]
[<ffffffffc01f4aba>] writepage_delalloc.isra.47+0x10a/0x190 [btrfs]
[<ffffffffc01f7678>] __extent_writepage+0xd8/0x2c0 [btrfs]
[<ffffffffc01f7b2e>] extent_write_cache_pages.isra.44.constprop.63+0x2ce/0x430 [btrfs]
[<ffffffffb733e497>] ? debug_smp_processor_id+0x17/0x20
[<ffffffffb70c6109>] ? get_lock_stats+0x19/0x50
[<ffffffffc01f8278>] extent_writepages+0x58/0x80 [btrfs]
[<ffffffffc01d7a80>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffc01d4a63>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffffb7162e9c>] do_writepages+0x1c/0x30
[<ffffffffb71550f1>] __filemap_fdatawrite_range+0xc1/0x100
[<ffffffffb71551ee>] filemap_fdatawrite_range+0xe/0x10
[<ffffffffc01eb2fb>] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
[<ffffffffc01f0820>] btrfs_wait_ordered_range+0x40/0x100 [btrfs]
[<ffffffffc01eb5d5>] btrfs_sync_file+0x285/0x390 [btrfs]
[<ffffffffb7207626>] vfs_fsync_range+0x46/0xa0
[<ffffffffb72076d8>] do_fsync+0x38/0x60
[<ffffffffb720795b>] SyS_fsync+0xb/0x10
[<ffffffffb700259c>] do_syscall_64+0x5c/0x170
[<ffffffffb77c2e4b>] entry_SYSCALL64_slow_path+0x25/0x25

2016-10-12 14:43:32

by Chris Mason

Subject: Re: btrfs bio linked list corruption.

On 10/12/2016 10:40 AM, Dave Jones wrote:
> On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > >
> > >
> > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > >
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> >
> > I hit this again overnight, it's the same trace, the only difference
> > being slightly different addresses in the list pointers:
> >
> > [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
> >
> > I'm actually a little surprised that ->next was the same across two
> > reboots on two different kernel builds. That might be a sign this is
> > more repeatable than I'd thought, even if it does take hours of runtime
> > right now to trigger it. I'll try and narrow the scope of what trinity
> > is doing to see if I can make it happen faster.
>
> .. and of course the first thing that happens is a completely different
> btrfs trace..
>
>
> WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a

This isn't even IO. Uuughhhh. We're going to need a fast enough test
that we can bisect.

-chris

2016-10-13 18:17:10

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> On 10/12/2016 10:40 AM, Dave Jones wrote:
> > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > > >
> > > >
> > > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > > >
> > > > > ------------[ cut here ]------------
> > > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > >
> > > I hit this again overnight, it's the same trace, the only difference
> > > being slightly different addresses in the list pointers:
> > >
> > > [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
> > >
> > > I'm actually a little surprised that ->next was the same across two
> > > reboots on two different kernel builds. That might be a sign this is
> > > more repeatable than I'd thought, even if it does take hours of runtime
> > > right now to trigger it. I'll try and narrow the scope of what trinity
> > > is doing to see if I can make it happen faster.
> >
> > .. and of course the first thing that happens is a completely different
> > btrfs trace..
> >
> >
> > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
>
> This isn't even IO. Uuughhhh. We're going to need a fast enough test
> that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
Mix them together though, and something goes awry.

Dave

2016-10-13 21:20:14

by Chris Mason

Subject: Re: btrfs bio linked list corruption.

On 10/13/2016 02:16 PM, Dave Jones wrote:
> On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> > On 10/12/2016 10:40 AM, Dave Jones wrote:
> > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > > > >
> > > > >
> > > > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > > > >
> > > > > > ------------[ cut here ]------------
> > > > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > > > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > > > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > > > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > > > > > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > > >
> > > > I hit this again overnight, it's the same trace, the only difference
> > > > being slightly different addresses in the list pointers:
> > > >
> > > > [42572.777196] list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc90000647cd8. (prev=ffff880503a0ba00).
> > > >
> > > > I'm actually a little surprised that ->next was the same across two
> > > > reboots on two different kernel builds. That might be a sign this is
> > > > more repeatable than I'd thought, even if it does take hours of runtime
> > > > right now to trigger it. I'll try and narrow the scope of what trinity
> > > > is doing to see if I can make it happen faster.
> > >
> > > .. and of course the first thing that happens is a completely different
> > > btrfs trace..
> > >
> > >
> > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> >
> > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > that we can bisect.
>
> Progress...
> I've found that this combination of syscalls..
>
> ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
>
> hits one of these two bugs in a few minutes runtime.
>
> Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> Mix them together though, and something goes awry.
>

Hasn't triggered here yet. I'll leave it running though.

-chris

2016-10-13 22:24:05

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

> > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > >
> > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
> >
>
> Hasn't triggered here yet. I'll leave it running though.

With that combo of params I triggered it 3-4 times in a row within minutes.. Then
as soon as I posted, it stopped being so easy to repro.

There's some other variable I haven't figured out yet (maybe the random way that files
get opened in fds/testfiles.c), but it does seem to point at the xattr changes.

I'll poke at it some more tomorrow.

Dave

2016-10-16 00:42:55

by Dave Jones

Subject: Re: btrfs bio linked list corruption.

On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

> > > > .. and of course the first thing that happens is a completely different
> > > > btrfs trace..
> > > >
> > > >
> > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > >
> > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
> >
> Hasn't triggered here yet. I'll leave it running though.

The hits keep coming..

BUG: Bad page state in process kworker/u8:12 pfn:4988fa
page:ffffea0012623e80 count:0 mapcount:0 mapping:ffff8804450456e0 index:0x9
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 2 PID: 1388 Comm: kworker/u8:12 Not tainted 4.8.0-think+ #18
Workqueue: writeback wb_workfn (flush-btrfs-1)
 ffffc90000aef7e8 ffffffff81320e7c ffffea0012623e80 ffffffff819fe6ec
 ffffc90000aef810 ffffffff81159b3f 0000000000000000 ffffea0012623e80
 400000000000000c ffffc90000aef820 ffffffff81159bfa ffffc90000aef868
Call Trace:
[<ffffffff81320e7c>] dump_stack+0x4f/0x73
[<ffffffff81159b3f>] bad_page+0xbf/0x120
[<ffffffff81159bfa>] free_pages_check_bad+0x5a/0x70
[<ffffffff8115c0fb>] free_hot_cold_page+0x20b/0x270
[<ffffffff8115c41b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff81165062>] release_pages+0x2d2/0x380
[<ffffffff811665d2>] __pagevec_release+0x22/0x30
[<ffffffffa009f810>] extent_write_cache_pages.isra.48.constprop.63+0x350/0x430 [btrfs]
[<ffffffff8133f487>] ? debug_smp_processor_id+0x17/0x20
[<ffffffff810c6999>] ? get_lock_stats+0x19/0x50
[<ffffffffa009fce8>] extent_writepages+0x58/0x80 [btrfs]
[<ffffffffa007f150>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007c0d3>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8116370c>] do_writepages+0x1c/0x30
[<ffffffff81202d63>] __writeback_single_inode+0x33/0x180
[<ffffffff8120357b>] writeback_sb_inodes+0x2cb/0x5d0
[<ffffffff8120390d>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff81203c03>] wb_writeback+0x203/0x210
[<ffffffff81204197>] wb_workfn+0xe7/0x2a0
[<ffffffff810c8b7f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[<ffffffff8109458a>] process_one_work+0x1da/0x4b0
[<ffffffff8109452a>] ? process_one_work+0x17a/0x4b0
[<ffffffff810948a9>] worker_thread+0x49/0x490
[<ffffffff81094860>] ? process_one_work+0x4b0/0x4b0
[<ffffffff81094860>] ? process_one_work+0x4b0/0x4b0

2016-10-18 01:08:27

by Chris Mason

Subject: Re: btrfs bio linked list corruption.

On Sat, Oct 15, 2016 at 08:42:40PM -0400, Dave Jones wrote:
>On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:
>
> > > > > .. and of course the first thing that happens is a completely different
> > > > > btrfs trace..
> > > > >
> > > > >
> > > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > > > > ffffc900019076a8 ffffffffb731ff3c 0000000000000000 0000000000000000
> > > > > ffffc900019076e8 ffffffffb707a6c1 000001e9f5806ce0 ffff8804f74c4d98
> > > > > 0000000000000801 ffff880501cfa2a8 000000000000008a 000000000000008a
> > > >
> > > > This isn't even IO. Uuughhhh. We're going to need a fast enough test
> > > > that we can bisect.
> > >
> > > Progress...
> > > I've found that this combination of syscalls..
> > >
> > > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> > >
> > > hits one of these two bugs in a few minutes runtime.
> > >
> > > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > > Mix them together though, and something goes awry.
> > >
> > Hasn't triggered here yet. I'll leave it running though.
>
>The hits keep coming..
>
>BUG: Bad page state in process kworker/u8:12 pfn:4988fa
>page:ffffea0012623e80 count:0 mapcount:0 mapping:ffff8804450456e0 index:0x9

Hmpf, I've had this running since Friday without failing. Can you send
me your .config please?

-chris

2016-10-18 22:42:24

by Dave Jones

Subject: Re: bio linked list corruption.

On Tue, Oct 11, 2016 at 10:45:07AM -0400, Dave Jones wrote:

> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> Call Trace:
> [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> [<ffffffff8d33e689>] __list_add+0x89/0xb0
> [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
> [<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240
> [<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240
> [<ffffffff8d2fdb98>] submit_bio+0x78/0x150
> [<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0
> [<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs]
> [<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs]
> [<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs]
> [<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs]
> [<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs]
> [<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs]
> [<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs]
> [<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs]
> [<ffffffff8d7c2d6c>] ? _raw_write_unlock+0x2c/0x50
> [<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs]
> [<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs]
> [<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs]
> [<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs]
> [<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs]
> [<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20
> [<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50
> [<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs]
> [<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs]
> [<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs]
> [<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs]
> [<ffffffff8d1f4092>] notify_change+0x252/0x440
> [<ffffffff8d1d164e>] do_truncate+0x6e/0xc0
> [<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170
> [<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20
> [<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10
> [<ffffffff8d00259c>] do_syscall_64+0x5c/0x170
> [<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25

So Chris had me do a run on ext4 just for giggles. It took a while, but
eventually this fell out...


WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0).
CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1
ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000
ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740
ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548
Call Trace:
[<ffffffff81320e3c>] dump_stack+0x4f/0x73
[<ffffffff8107a711>] __warn+0xc1/0xe0
[<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80
[<ffffffff8133f499>] __list_add+0x89/0xb0
[<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350
[<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240
[<ffffffff812fe6e9>] generic_make_request+0xf9/0x240
[<ffffffff812fe8a8>] submit_bio+0x78/0x150
[<ffffffff8120bde6>] ? __find_get_block+0x126/0x130
[<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0
[<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20
[<ffffffff8120d958>] ll_rw_block+0xa8/0xb0
[<ffffffff8120da0f>] __breadahead+0x3f/0x70
[<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0
[<ffffffff8126806d>] ext4_iget+0x8d/0xb90
[<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700
[<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30
[<ffffffff81273cd6>] ext4_lookup+0x136/0x250
[<ffffffff811e118d>] lookup_slow+0x12d/0x220
[<ffffffff811e3897>] walk_component+0x1e7/0x310
[<ffffffff811e33f8>] ? path_init+0x4d8/0x520
[<ffffffff811e4022>] path_lookupat+0x62/0x120
[<ffffffff811e4f22>] ? getname_flags+0x32/0x180
[<ffffffff811e5278>] filename_lookup+0xa8/0x130
[<ffffffff81352526>] ? strncpy_from_user+0x46/0x170
[<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180
[<ffffffff811e53d1>] user_path_at_empty+0x31/0x40
[<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0
[<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[<ffffffff811da30e>] SYSC_newstat+0x2e/0x60
[<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20
[<ffffffff811da499>] SyS_newstat+0x9/0x10
[<ffffffff8100259c>] do_syscall_64+0x5c/0x170
[<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25

So this one isn't a btrfs specific problem as I first thought.

This sometimes reproduces within minutes, sometimes hours, which makes
it a pain to bisect. It only started showing up this merge window though.

Dave

2016-10-18 23:13:02

by Jens Axboe

Subject: Re: bio linked list corruption.

On 10/18/2016 04:42 PM, Dave Jones wrote:
> On Tue, Oct 11, 2016 at 10:45:07AM -0400, Dave Jones wrote:
>
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (ffffe8ffff806648), but was ffffc9000067fcd8. (prev=ffff880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > ffffc90000d87458 ffffffff8d32007c ffffc90000d874a8 0000000000000000
> > ffffc90000d87498 ffffffff8d07a6c1 0000002100000246 ffff88050388e880
> > ffff880503878b80 ffffe8ffff806648 ffffe8ffffc06600 ffff880502808008
> > Call Trace:
> > [<ffffffff8d32007c>] dump_stack+0x4f/0x73
> > [<ffffffff8d07a6c1>] __warn+0xc1/0xe0
> > [<ffffffff8d07a73a>] warn_slowpath_fmt+0x5a/0x80
> > [<ffffffff8d33e689>] __list_add+0x89/0xb0
> > [<ffffffff8d30a1c8>] blk_sq_make_request+0x2f8/0x350
> > [<ffffffff8d2fd9cc>] ? generic_make_request+0xec/0x240
> > [<ffffffff8d2fd9d9>] generic_make_request+0xf9/0x240
> > [<ffffffff8d2fdb98>] submit_bio+0x78/0x150
> > [<ffffffff8d349c05>] ? __percpu_counter_add+0x85/0xb0
> > [<ffffffffc03627de>] btrfs_map_bio+0x19e/0x330 [btrfs]
> > [<ffffffffc03289ca>] btree_submit_bio_hook+0xfa/0x110 [btrfs]
> > [<ffffffffc034ff15>] submit_one_bio+0x65/0xa0 [btrfs]
> > [<ffffffffc0358cb0>] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs]
> > [<ffffffffc0327020>] ? free_root_pointers+0x60/0x60 [btrfs]
> > [<ffffffffc03283c8>] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs]
> > [<ffffffffc0328bcd>] read_tree_block+0x2d/0x50 [btrfs]
> > [<ffffffffc03080a4>] read_block_for_search.isra.33+0x134/0x330 [btrfs]
> > [<ffffffff8d7c2d6c>] ? _raw_write_unlock+0x2c/0x50
> > [<ffffffffc0302fec>] ? unlock_up+0x16c/0x1a0 [btrfs]
> > [<ffffffffc030a3d0>] btrfs_search_slot+0x450/0xa40 [btrfs]
> > [<ffffffffc0324983>] btrfs_del_csums+0xe3/0x2e0 [btrfs]
> > [<ffffffffc03134fd>] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs]
> > [<ffffffffc03178b3>] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs]
> > [<ffffffff8d33e5d7>] ? debug_smp_processor_id+0x17/0x20
> > [<ffffffff8d0c6109>] ? get_lock_stats+0x19/0x50
> > [<ffffffffc031b32c>] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs]
> > [<ffffffffc033d628>] btrfs_truncate_inode_items+0x888/0xda0 [btrfs]
> > [<ffffffffc033dc25>] btrfs_truncate+0xe5/0x2b0 [btrfs]
> > [<ffffffffc033e569>] btrfs_setattr+0x249/0x360 [btrfs]
> > [<ffffffff8d1f4092>] notify_change+0x252/0x440
> > [<ffffffff8d1d164e>] do_truncate+0x6e/0xc0
> > [<ffffffff8d1d1a4c>] do_sys_ftruncate.constprop.19+0x10c/0x170
> > [<ffffffff8d33e5f3>] ? __this_cpu_preempt_check+0x13/0x20
> > [<ffffffff8d1d1ad9>] SyS_ftruncate+0x9/0x10
> > [<ffffffff8d00259c>] do_syscall_64+0x5c/0x170
> > [<ffffffff8d7c2f8b>] entry_SYSCALL64_slow_path+0x25/0x25
>
> So Chris had me do a run on ext4 just for giggles. It took a while, but
> eventually this fell out...
>
>
> WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0).
> CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1
> ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000
> ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740
> ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548
> Call Trace:
> [<ffffffff81320e3c>] dump_stack+0x4f/0x73
> [<ffffffff8107a711>] __warn+0xc1/0xe0
> [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80
> [<ffffffff8133f499>] __list_add+0x89/0xb0
> [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350
> [<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240
> [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240
> [<ffffffff812fe8a8>] submit_bio+0x78/0x150
> [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130
> [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0
> [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20
> [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0
> [<ffffffff8120da0f>] __breadahead+0x3f/0x70
> [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0
> [<ffffffff8126806d>] ext4_iget+0x8d/0xb90
> [<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700
> [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30
> [<ffffffff81273cd6>] ext4_lookup+0x136/0x250
> [<ffffffff811e118d>] lookup_slow+0x12d/0x220
> [<ffffffff811e3897>] walk_component+0x1e7/0x310
> [<ffffffff811e33f8>] ? path_init+0x4d8/0x520
> [<ffffffff811e4022>] path_lookupat+0x62/0x120
> [<ffffffff811e4f22>] ? getname_flags+0x32/0x180
> [<ffffffff811e5278>] filename_lookup+0xa8/0x130
> [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170
> [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180
> [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40
> [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0
> [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
> [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60
> [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20
> [<ffffffff811da499>] SyS_newstat+0x9/0x10
> [<ffffffff8100259c>] do_syscall_64+0x5c/0x170
> [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25
>
> So this one isn't a btrfs specific problem as I first thought.
>
> This sometimes reproduces within minutes, sometimes hours, which makes
> it a pain to bisect. It only started showing up this merge window though.

Chinner reported the same thing on XFS, I'll look into it asap.

--
Jens Axboe

2016-10-18 23:34:20

by Chris Mason

Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote:
>On 10/18/2016 04:42 PM, Dave Jones wrote:
>>So Chris had me do a run on ext4 just for giggles. It took a while, but
>>eventually this fell out...
>>
>>
>>WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0
>>list_add corruption. prev->next should be next (ffffe8ffffc05648), but was ffffc9000028bcd8. (prev=ffff880503a145c0).
>>CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1
>> ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000
>> ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740
>> ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548
>>Call Trace:
>> [<ffffffff81320e3c>] dump_stack+0x4f/0x73
>> [<ffffffff8107a711>] __warn+0xc1/0xe0
>> [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80
>> [<ffffffff8133f499>] __list_add+0x89/0xb0
>> [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350
>> [<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240
>> [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240
>> [<ffffffff812fe8a8>] submit_bio+0x78/0x150
>> [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130
>> [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0
>> [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20
>> [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0
>> [<ffffffff8120da0f>] __breadahead+0x3f/0x70
>> [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0
>> [<ffffffff8126806d>] ext4_iget+0x8d/0xb90
>> [<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700
>> [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30
>> [<ffffffff81273cd6>] ext4_lookup+0x136/0x250
>> [<ffffffff811e118d>] lookup_slow+0x12d/0x220
>> [<ffffffff811e3897>] walk_component+0x1e7/0x310
>> [<ffffffff811e33f8>] ? path_init+0x4d8/0x520
>> [<ffffffff811e4022>] path_lookupat+0x62/0x120
>> [<ffffffff811e4f22>] ? getname_flags+0x32/0x180
>> [<ffffffff811e5278>] filename_lookup+0xa8/0x130
>> [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170
>> [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180
>> [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40
>> [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0
>> [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
>> [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60
>> [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20
>> [<ffffffff811da499>] SyS_newstat+0x9/0x10
>> [<ffffffff8100259c>] do_syscall_64+0x5c/0x170
>> [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25
>>
>>So this one isn't a btrfs specific problem as I first thought.
>>
>>This sometimes reproduces within minutes, sometimes hours, which makes
>>it a pain to bisect. It only started showing up this merge window though.
>
>Chinner reported the same thing on XFS, I'll look into it asap.

Jens, not sure if you saw the whole thread. This has triggered bad page
state errors, and also corrupted a btrfs list. It hurts me to say, but
it might not actually be your fault.

-chris

2016-10-18 23:36:48

by Jens Axboe

Subject: Re: bio linked list corruption.

On 10/18/2016 05:31 PM, Chris Mason wrote:
> On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote:
>> On 10/18/2016 04:42 PM, Dave Jones wrote:
>>> So Chris had me do a run on ext4 just for giggles. It took a while, but
>>> eventually this fell out...
>>>
>>>
>>> WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0
>>> list_add corruption. prev->next should be next (ffffe8ffffc05648),
>>> but was ffffc9000028bcd8. (prev=ffff880503a145c0).
>>> CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1
>>> ffffc90000a6b7b8 ffffffff81320e3c ffffc90000a6b808 0000000000000000
>>> ffffc90000a6b7f8 ffffffff8107a711 0000002100000246 ffff8805039f1740
>>> ffff880503a145c0 ffffe8ffffc05648 ffffe8ffffa05600 ffff880502c39548
>>> Call Trace:
>>> [<ffffffff81320e3c>] dump_stack+0x4f/0x73
>>> [<ffffffff8107a711>] __warn+0xc1/0xe0
>>> [<ffffffff8107a78a>] warn_slowpath_fmt+0x5a/0x80
>>> [<ffffffff8133f499>] __list_add+0x89/0xb0
>>> [<ffffffff8130af88>] blk_sq_make_request+0x2f8/0x350
>>> [<ffffffff812fe6dc>] ? generic_make_request+0xec/0x240
>>> [<ffffffff812fe6e9>] generic_make_request+0xf9/0x240
>>> [<ffffffff812fe8a8>] submit_bio+0x78/0x150
>>> [<ffffffff8120bde6>] ? __find_get_block+0x126/0x130
>>> [<ffffffff8120cbff>] submit_bh_wbc+0x16f/0x1e0
>>> [<ffffffff8120a400>] ? __end_buffer_read_notouch+0x20/0x20
>>> [<ffffffff8120d958>] ll_rw_block+0xa8/0xb0
>>> [<ffffffff8120da0f>] __breadahead+0x3f/0x70
>>> [<ffffffff81264ffc>] __ext4_get_inode_loc+0x37c/0x3d0
>>> [<ffffffff8126806d>] ext4_iget+0x8d/0xb90
>>> [<ffffffff811f0759>] ? d_alloc_parallel+0x329/0x700
>>> [<ffffffff81268b9a>] ext4_iget_normal+0x2a/0x30
>>> [<ffffffff81273cd6>] ext4_lookup+0x136/0x250
>>> [<ffffffff811e118d>] lookup_slow+0x12d/0x220
>>> [<ffffffff811e3897>] walk_component+0x1e7/0x310
>>> [<ffffffff811e33f8>] ? path_init+0x4d8/0x520
>>> [<ffffffff811e4022>] path_lookupat+0x62/0x120
>>> [<ffffffff811e4f22>] ? getname_flags+0x32/0x180
>>> [<ffffffff811e5278>] filename_lookup+0xa8/0x130
>>> [<ffffffff81352526>] ? strncpy_from_user+0x46/0x170
>>> [<ffffffff811e4f3e>] ? getname_flags+0x4e/0x180
>>> [<ffffffff811e53d1>] user_path_at_empty+0x31/0x40
>>> [<ffffffff811d9df1>] vfs_fstatat+0x61/0xc0
>>> [<ffffffff810c8b9f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
>>> [<ffffffff811da30e>] SYSC_newstat+0x2e/0x60
>>> [<ffffffff8133f403>] ? __this_cpu_preempt_check+0x13/0x20
>>> [<ffffffff811da499>] SyS_newstat+0x9/0x10
>>> [<ffffffff8100259c>] do_syscall_64+0x5c/0x170
>>> [<ffffffff817c27cb>] entry_SYSCALL64_slow_path+0x25/0x25
>>>
>>> So this one isn't a btrfs specific problem as I first thought.
>>>
>>> This sometimes reproduces within minutes, sometimes hours, which makes
>>> it a pain to bisect. It only started showing up this merge window
>>> though.
>>
>> Chinner reported the same thing on XFS, I'll look into it asap.
>
> Jens, not sure if you saw the whole thread. This has triggered bad page
> state errors, and also corrupted a btrfs list. It hurts me to say, but
> it might not actually be your fault.

Just went through the block changes again, pretty basic this round and
nothing that touches plugging at all. General memory corruption could
certainly explain it...

--
Jens Axboe

2016-10-18 23:39:33

by Linus Torvalds

Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <[email protected]> wrote:
>
> Jens, not sure if you saw the whole thread. This has triggered bad page
> state errors, and also corrupted a btrfs list. It hurts me to say, but it
> might not actually be your fault.

Where is that thread, and what is the "this" that triggers problems?

Looking at the "->mq_list" users, I'm not seeing any changes there in
the last year or so. So I don't think it's the list itself.

Linus

2016-10-18 23:46:15

by Chris Mason

Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote:
>On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <[email protected]> wrote:
>>
>> Jens, not sure if you saw the whole thread. This has triggered bad page
>> state errors, and also corrupted a btrfs list. It hurts me to say, but it
>> might not actually be your fault.
>
>Where is that thread, and what is the "this" that triggers problems?
>
>Looking at the "->mq_list" users, I'm not seeing any changes there in
>the last year or so. So I don't think it's the list itself.

Seems to be the whole thing:

http://www.gossamer-threads.com/lists/linux/kernel/2545792

My guess is xattr, but I don't have a good reason for that.

-chris

2016-10-19 00:11:05

by Linus Torvalds

Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <[email protected]> wrote:
>
> Seems to be the whole thing:

Ahh. On lkml, so I do have it in my mailbox, but Dave changed the
subject line when he tested on ext4 rather than btrfs..

Anyway, the corrupted address is somewhat interesting. As Dave Jones
said, he saw

list_add corruption. prev->next should be next (ffffe8ffff806648),
but was ffffc9000067fcd8. (prev=ffff880503878b80).
list_add corruption. prev->next should be next (ffffe8ffffc05648),
but was ffffc9000028bcd8. (prev=ffff880503a145c0).

and Dave Chinner reports

list_add corruption. prev->next should be next (ffffe8ffffc02808),
but was ffffc90005f6bda8. (prev=ffff88013363bb80).

and it's worth noting that the "but was" is a remarkably consistent
vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's
identical across two boots for DaveJ in the low 14 bits, and fairly
high up in those low 14 bits (0x3cd8).

DaveC has a different address, but it's also in the vmalloc space, and
also looks like it is fairly high up in 14 bits (0x3da8). So in both
cases it's almost certainly a stack address with a fairly empty stack.
The differences are presumably due to different kernel configurations
and/or just different filesystems calling the same function that does
the same bad thing but now at different depths in the stack.

Adding Andy to the cc, because this *might* be triggered by the
vmalloc stack code itself. Maybe the re-use of stacks showing some
problem? Maybe Chris (who can't see the problem) doesn't have
CONFIG_VMAP_STACK enabled?

Andy - this is on lkml, under

Dave Chinner:
[regression, 4.9-rc1] blk-mq: list corruption in request queue

Dave Jones:
btrfs bio linked list corruption.
Re: bio linked list corruption.

and they are definitely the same thing across three different
filesystems (xfs, btrfs and ext4), and they are consistent enough that
there is almost certainly a single very specific memory corrupting
issue that overwrites something with a stack pointer.

Linus
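
For readers unfamiliar with the x86_64 layout: on 4-level paging the vmalloc/ioremap
window runs from 0xffffc90000000000 to 0xffffe8ffffffffff, and CONFIG_VMAP_STACK
allocates task stacks there as well, which is why the ffffc9... prefix points at a
stack. A trivial sketch of that check, roughly what the kernel's is_vmalloc_addr()
helper already does (illustrative only, not code from this thread):

	#include <linux/mm.h>		/* is_vmalloc_addr() */

	/* On x86_64, VMALLOC_START is 0xffffc90000000000, so the ffffc9...
	 * "but was" values in the warnings above fall in vmalloc space --
	 * with CONFIG_VMAP_STACK, most likely a task stack. */
	static bool addr_is_vmalloc(const void *x)
	{
		unsigned long addr = (unsigned long)x;

		return addr >= VMALLOC_START && addr < VMALLOC_END;
	}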

2016-10-19 00:20:12

by Chris Mason

Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 05:10:56PM -0700, Linus Torvalds wrote:
>On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <[email protected]> wrote:
>>
>> Seems to be the whole thing:
>
>Ahh. On lkml, so I do have it in my mailbox, but Dave changed the
>subject line when he tested on ext4 rather than btrfs..
>
>Anyway, the corrupted address is somewhat interesting. As Dave Jones
>said, he saw
>
> list_add corruption. prev->next should be next (ffffe8ffff806648),
>but was ffffc9000067fcd8. (prev=ffff880503878b80).
> list_add corruption. prev->next should be next (ffffe8ffffc05648),
>but was ffffc9000028bcd8. (prev=ffff880503a145c0).
>
>and Dave Chinner reports
>
> list_add corruption. prev->next should be next (ffffe8ffffc02808),
>but was ffffc90005f6bda8. (prev=ffff88013363bb80).
>
>and it's worth noting that the "but was" is a remarkably consistent
>vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's
>identical across two boots for DaveJ in the low 14 bits, and fairly
>high up in those low 14 bits (0x3cd8).
>
>DaveC has a different address, but it's also in the vmalloc space, and
>also looks like it is fairly high up in 14 bits (0x3da8). So in both
>cases it's almost certainly a stack address with a fairly empty stack.
>The differences are presumably due to different kernel configurations
>and/or just different filesystems calling the same function that does
>the same bad thing but now at different depths in the stack.
>
>Adding Andy to the cc, because this *might* be triggered by the
>vmalloc stack code itself. Maybe the re-use of stacks showing some
>problem? Maybe Chris (who can't see the problem) doesn't have
>CONFIG_VMAP_STACK enabled?

CONFIG_VMAP_STACK=y, but maybe I just need to hammer on process creation
more. I'm testing in a hugely stripped down VM, so Dave might have more
background stuff going on.

-chris

2016-10-19 00:29:11

by Linus Torvalds

Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds
<[email protected]> wrote:
>
> Adding Andy to the cc, because this *might* be triggered by the
> vmalloc stack code itself. Maybe the re-use of stacks showing some
> problem? Maybe Chris (who can't see the problem) doesn't have
> CONFIG_VMAP_STACK enabled?

I bet it's the plug itself that is the stack address. In fact, it's
probably that mq_list head pointer

I think every single users of block plugging uses the pattern

struct blk_plug plug;

blk_start_plug(&plug);

and then we'll have

INIT_LIST_HEAD(&plug->mq_list);

which initializes that mq_list head with the stack addresses pointing to itself.

So when we see something like this:

list_add corruption. prev->next should be next (ffffe8ffff806648),
but was ffffc9000067fcd8. (prev=ffff880503878b80)

and it comes from

list_add_tail(&rq->queuelist, &plug->mq_list);

which will expand to

__list_add(new, head->prev, head)

which in this case *should* be:

__list_add(&rq->queuelist, plug->mq_list.prev, &plug->mq_list);

so in fact we *should* have "next" be a stack address.

So that debug message is really really odd. I would expect that "next"
is the stack address (because we're adding to the tail of the list, so
"next" is the list head itself), but the debug message corruption
printout says that "was" is the stack address, but next isn't.

Weird. The "but was" value actually looks like the right address should
look, but the actual address (which *should* be just "&plug->mq_list"
and really should be on the stack) looks bogus.

I'm now very confused.

Linus
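
To make the expansion above concrete, here is a simplified sketch of list_add_tail()
and of the CONFIG_DEBUG_LIST check in lib/list_debug.c that produces the warning
quoted throughout this thread (paraphrased, not copied from any particular kernel
version):

	static inline void list_add_tail(struct list_head *new, struct list_head *head)
	{
		/* for list_add_tail(&rq->queuelist, &plug->mq_list), "head" (and
		 * therefore "next" below) is &plug->mq_list -- a stack address */
		__list_add(new, head->prev, head);
	}

	void __list_add(struct list_head *new, struct list_head *prev,
			struct list_head *next)
	{
		/* this is the check that fires in the traces above: */
		WARN(prev->next != next,
		     "list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
		     next, prev->next, prev);
		next->prev = new;
		new->next = next;
		new->prev = prev;
		prev->next = new;
	}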

2016-10-19 01:06:03

by Andy Lutomirski

Subject: Re: bio linked list corruption.

On 10/18/2016 05:10 PM, Linus Torvalds wrote:
> On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason <[email protected]> wrote:
>>
>> Seems to be the whole thing:
>
> Ahh. On lkml, so I do have it in my mailbox, but Dave changed the
> subject line when he tested on ext4 rather than btrfs..
>
> Anyway, the corrupted address is somewhat interesting. As Dave Jones
> said, he saw
>
> list_add corruption. prev->next should be next (ffffe8ffff806648),
> but was ffffc9000067fcd8. (prev=ffff880503878b80).
> list_add corruption. prev->next should be next (ffffe8ffffc05648),
> but was ffffc9000028bcd8. (prev=ffff880503a145c0).
>
> and Dave Chinner reports
>
> list_add corruption. prev->next should be next (ffffe8ffffc02808),
> but was ffffc90005f6bda8. (prev=ffff88013363bb80).
>
> and it's worth noting that the "but was" is a remarkably consistent
> vmalloc address (the ffffc9000.. pattern gives it away). In fact, it's
> identical across two boots for DaveJ in the low 14 bits, and fairly
> high up in those low 14 bits (0x3cd8).
>
> DaveC has a different address, but it's also in the vmalloc space, and
> also looks like it is fairly high up in 14 bits (0x3da8). So in both
> cases it's almost certainly a stack address with a fairly empty stack.
> The differences are presumably due to different kernel configurations
> and/or just different filesystems calling the same function that does
> the same bad thing but now at different depths in the stack.
>
> Adding Andy to the cc, because this *might* be triggered by the
> vmalloc stack code itself. Maybe the re-use of stacks showing some
> problem? Maybe Chris (who can't see the problem) doesn't have
> CONFIG_VMAP_STACK enabled?

Wouldn't this cause the exact opposite problem? If the warning is to be
believed, then prev is *not* on the stack but somehow prev->next ended
up pointing to the stack. If stack reuse caused something to corrupt a
value on the stack, then how would this cause a stack address to be
written to a non-stack location? All I can think of is that "prev"
itself is corrupted somehow.

One possible debugging approach would be to change:

#define NR_CACHED_STACKS 2

to

#define NR_CACHED_STACKS 0

in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
force an immediate TLB flush after vfree.

Also, CONFIG_DEBUG_VIRTUAL=y can be quite helpful for debugging stack
issues. I'm tempted to do something equivalent to hardwiring that
option on for a while if CONFIG_VMAP_STACK=y.
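
As an illustration of the kind of bug those two settings are meant to expose (a purely
hypothetical example, not a claim about where this particular corruption lives): a list
node on a vmap'd stack that stays linked after its frame is gone. With NR_CACHED_STACKS
set to 0 and CONFIG_DEBUG_PAGEALLOC=y the freed stack is unmapped immediately, so the
stale pointer faults instead of silently corrupting whatever reuses that memory.

	/* Hypothetical use-after-return: an on-stack list node left linked into a
	 * longer-lived list after the function returns. Whatever later reuses that
	 * stack memory then shows up as "impossible" list corruption elsewhere. */
	static LIST_HEAD(pending);

	struct work_item {
		struct list_head link;
		int payload;
	};

	static void buggy_enqueue(void)
	{
		struct work_item item = { .payload = 42 };	/* lives on this stack */

		list_add_tail(&item.link, &pending);
		/* BUG: returns without list_del(&item.link); "pending" now points
		 * into a dead stack frame */
	}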

2016-10-19 17:09:53

by Philipp Hahn

Subject: Re: bio linked list corruption.

Hello,

Am 19.10.2016 um 01:42 schrieb Chris Mason:
> On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote:
>> On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason <[email protected]> wrote:
>>>
>>> Jens, not sure if you saw the whole thread. This has triggered bad page
>>> state errors, and also corrupted a btrfs list. It hurts me to say,
>>> but it
>>> might not actually be your fault.
>>
>> Where is that thread, and what is the "this" that triggers problems?
>>
>> Looking at the "->mq_list" users, I'm not seeing any changes there in
>> the last year or so. So I don't think it's the list itself.
>
> Seems to be the whole thing:
>
> http://www.gossamer-threads.com/lists/linux/kernel/2545792
>
> My guess is xattr, but I don't have a good reason for that.

Nearly a month ago I reported also a "list_add corruption", but with 4.1.6:
<http://marc.info/?l=linux-kernel&m=147508265316854&w=2>

That server runs Samba4, which is also a heavy user of xattr.

It might be related.

Philipp Hahn

2016-10-19 17:44:03

by Linus Torvalds

Subject: Re: bio linked list corruption.

On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn <[email protected]> wrote:
>
> Nearly a month ago I reported also a "list_add corruption", but with 4.1.6:
> <http://marc.info/?l=linux-kernel&m=147508265316854&w=2>
>
> That server runs Samba4, which is also a heavy user of xattr.

That one looks very different. In fact, the list that got corrupted
for you has since been changed to a hlist (which is *similar* to our
doubly-linked list, but has a smaller head and does not allow adding
to the end of the list).

Also, the "should be" and "was" values are very close, and switched:

should be ffffffff81ab3ca8, but was ffffffff81ab3cc8
should be ffffffff81ab3cc8, but was ffffffff81ab3ca8

so it actually looks like it was the same data structure. In
particular, it looks like enqueue_timer() ended up racing on adding an
entry to one index in the "base->vectors[]" array, while hitting an
entry that was pointing to another index near-by.
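
For reference, the "should be ... but was" text comes from the list debugging check that fired here; at the time it looked roughly like this (a trimmed sketch of lib/list_debug.c's __list_add(), not the verbatim source):

void __list_add(struct list_head *new, struct list_head *prev,
                struct list_head *next)
{
        /* prev and next are the neighbours the new entry is spliced between */
        WARN(next->prev != prev,
             "list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
             prev, next->prev, next);
        WARN(prev->next != next,
             "list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
             next, prev->next, prev);
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}

Two contexts racing to splice entries onto nearby heads can each observe the other's half-finished pointer updates, which is how the two warnings can end up reporting each other's addresses.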

So I don't think it's related. Yours looks like some subtle timer base
race. It smells like a locking problem with timers. I'm not seeing
what it might be, but it *might* have been fixed by doing the
TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018).

Adding some timer people just in case, but I don't think your 4.1
report is related.

Linus

2016-10-20 06:52:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: bio linked list corruption.


* Linus Torvalds <[email protected]> wrote:

> On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn <[email protected]> wrote:
> >
> > Nearly a month ago I reported also a "list_add corruption", but with 4.1.6:
> > <http://marc.info/?l=linux-kernel&m=147508265316854&w=2>
> >
> > That server rungs Samba4, which also is a heavy user of xattr.
>
> That one looks very different. In fact, the list that got corrupted
> for you has since been changed to a hlist (which is *similar* to our
> doubly-linked list, but has a smaller head and does not allow adding
> to the end of the list).
>
> Also, the "should be" and "was" values are very close, and switched:
>
> should be ffffffff81ab3ca8, but was ffffffff81ab3cc8
> should be ffffffff81ab3cc8, but was ffffffff81ab3ca8
>
> so it actually looks like it was the same data structure. In
> particular, it looks like enqueue_timer() ended up racing on adding an
> entry to one index in the "base->vectors[]" array, while hitting an
> entry that was pointing to another index near-by.
>
> So I don't think it's related. Yours looks like some subtle timer base
> race. It smells like a locking problem with timers. I'm not seeing
> what it might be, but it *might* have been fixed by doing the
> TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018).
>
> Adding some timer people just in case, but I don't think your 4.1
> report is related.

Side note: in case timer callback related corruption is suspected, a very
efficient debugging method is to enable debugobjects tracking+checking:

CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_SELFTEST=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_WORK=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y
CONFIG_DEBUG_KOBJECT_RELEASE=y

( Appending these to the .config and running 'make oldconfig' should enable all of
these. )

If the problem is in any of these areas then a debug warning could trigger at a
more convenient place than some later 'some unrelated bits got corrupted' warning.

Thanks,

Ingo

2016-10-20 07:21:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Thu, 20 Oct 2016, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
> > So I don't think it's related. Yours looks like some subtle timer base
> > race. It smells like a locking problem with timers. I'm not seeing
> > what it might be, but it *might* have been fixed by doing the
> > TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018).

4.1 does not have the newfangled timer wheel. So 22b886dd1018 is
irrelevant.

> > Adding some timer people just in case, but I don't think your 4.1
> > report is related.

This looks like one of the classic issues where an enqueued timer gets
freed/reinitialized or otherwise manipulated. So yes, Ingo's recommendation
for debugging is the right thing to do.

Thanks,

tglx
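
A minimal illustration of the failure mode Thomas describes — the struct and function below are invented for the example, not taken from any real driver:

#include <linux/slab.h>
#include <linux/timer.h>

struct foo {
        struct timer_list timer;    /* may still be queued on the timer wheel */
        int data;
};

static void foo_destroy(struct foo *f)
{
        /* BUG: no del_timer_sync(&f->timer) before the free, so the wheel
         * keeps pointing into memory that is about to be reused; the next
         * enqueue or expiry on that bucket then walks freed memory.       */
        kfree(f);
}

With CONFIG_DEBUG_OBJECTS_TIMERS=y and CONFIG_DEBUG_OBJECTS_FREE=y, the free of a still-active timer like this is flagged at the kfree() itself, rather than as a later list corruption.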

2016-10-20 22:48:29

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 05:28:44PM -0700, Linus Torvalds wrote:
> On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Adding Andy to the cc, because this *might* be triggered by the
> > vmalloc stack code itself. Maybe the re-use of stacks showing some
> > problem? Maybe Chris (who can't see the problem) doesn't have
> > CONFIG_VMAP_STACK enabled?
>
> I bet it's the plug itself that is the stack address. In fact, it's
> probably that mq_list head pointer

So I've done a few experiments the last couple days.

1, I see some kind of disaster happen with every filesystem:
ext4, btrfs, xfs. For some reason I can repro it faster on btrfs
(though xfs blew up pretty quickly too, but I don't know if it's
the same as this list corruption bug).

2, I ran for 24 hours with VMAP_STACK turned off. I saw some
_different_ btrfs problems, but I never hit that list debug corruption
once.

3, I turned vmap stacks back on, and got this pretty quickly.
Another new flavor of crash, but Chris recommended I post this one
because it looks interesting.

[ 3943.514961] BUG: Bad page state in process kworker/u8:14 pfn:482244
[ 3943.532400] page:ffffea0012089100 count:0 mapcount:0 mapping:ffff8804c40d6ae0 index:0x2f
[ 3943.551865] flags: 0x4000000000000008(uptodate)
[ 3943.561652] page dumped because: non-NULL mapping
[ 3943.587698] CPU: 2 PID: 26823 Comm: kworker/u8:14 Not tainted 4.9.0-rc1-think+ #9
[ 3943.607409] Workqueue: writeback wb_workfn
[ 3943.617194] (flush-btrfs-2)
[ 3943.617260] ffffc90001bf7870
[ 3943.627007] ffffffff8130c93c
[ 3943.627075] ffffea0012089100
[ 3943.627112] ffffffff819ff37c
[ 3943.627149] ffffc90001bf7898
[ 3943.636918] ffffffff81150fef
[ 3943.636985] 0000000000000000
[ 3943.637021] ffffea0012089100
[ 3943.637059] 4000000000000008
[ 3943.646965] ffffc90001bf78a8
[ 3943.647041] ffffffff811510aa
[ 3943.647081] ffffc90001bf78f0
[ 3943.647126] Call Trace:
[ 3943.657068] [<ffffffff8130c93c>] dump_stack+0x4f/0x73
[ 3943.666996] [<ffffffff81150fef>] bad_page+0xbf/0x120
[ 3943.676839] [<ffffffff811510aa>] free_pages_check_bad+0x5a/0x70
[ 3943.686646] [<ffffffff8115355b>] free_hot_cold_page+0x20b/0x270
[ 3943.696402] [<ffffffff8115387b>] free_hot_cold_page_list+0x2b/0x50
[ 3943.706092] [<ffffffff8115c1fd>] release_pages+0x2bd/0x350
[ 3943.715726] [<ffffffff8115d732>] __pagevec_release+0x22/0x30
[ 3943.725358] [<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[ 3943.735126] [<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs]
[ 3943.744808] [<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[ 3943.754457] [<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs]
[ 3943.764085] [<ffffffff8115a91c>] do_writepages+0x1c/0x30
[ 3943.773667] [<ffffffff811f65f3>] __writeback_single_inode+0x33/0x180
[ 3943.783233] [<ffffffff811f6de8>] writeback_sb_inodes+0x2a8/0x5b0
[ 3943.792870] [<ffffffff811f733b>] wb_writeback+0xeb/0x1f0
[ 3943.802326] [<ffffffff811f7972>] wb_workfn+0xd2/0x280
[ 3943.811673] [<ffffffff810906e5>] process_one_work+0x1d5/0x490
[ 3943.821044] [<ffffffff81090685>] ? process_one_work+0x175/0x490
[ 3943.830447] [<ffffffff810909e9>] worker_thread+0x49/0x490
[ 3943.839756] [<ffffffff810909a0>] ? process_one_work+0x490/0x490
[ 3943.849074] [<ffffffff810909a0>] ? process_one_work+0x490/0x490
[ 3943.858264] [<ffffffff81095b5e>] kthread+0xee/0x110
[ 3943.867451] [<ffffffff81095a70>] ? kthread_park+0x60/0x60
[ 3943.876616] [<ffffffff81095a70>] ? kthread_park+0x60/0x60
[ 3943.885624] [<ffffffff81095a70>] ? kthread_park+0x60/0x60
[ 3943.894580] [<ffffffff81790492>] ret_from_fork+0x22/0x30

This feels like chasing a moving target, because the crash keeps changing..
I'm going to spend some time trying to at least pin down a selection
of syscalls that trinity can reproduce this with quickly.

Early on, it seemed like this was xattr related, but now I'm not so sure.
Once or twice, I was able to repro it within a few minutes using just
writev, fsync, lsetxattr and lremovexattr. Then a day later, I found I
could run for a day before seeing it. Position of the moon or something.
Or it could have been entirely unrelated to the actual syscalls being run,
and based just on how contended the cpu/memory was.

Dave

2016-10-20 22:50:35

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:

> One possible debugging approach would be to change:
>
> #define NR_CACHED_STACKS 2
>
> to
>
> #define NR_CACHED_STACKS 0
>
> in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> force an immediate TLB flush after vfree.

I can give that idea some runtime, but it sounds like this is a case where
we're trying to prove a negative, and that'll just run and run? In which case I
might do this when I'm travelling on Sunday.

> Also, CONFIG_DEBUG_VIRTUAL=y can be quite helpful for debugging stack
> issues. I'm tempted to do something equivalent to hardwiring that
> option on for a while if CONFIG_VMAP_STACK=y.

This one I had on. Nothing interesting jumped out.

Dave

2016-10-20 23:01:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <[email protected]> wrote:
> On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:
>
> > One possible debugging approach would be to change:
> >
> > #define NR_CACHED_STACKS 2
> >
> > to
> >
> > #define NR_CACHED_STACKS 0
> >
> > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> > force an immediate TLB flush after vfree.
>
> I can give that idea some runtime, but it sounds like this a case where
> we're trying to prove a negative, and that'll just run and run ? In which case I
> might do this when I'm travelling on Sunday.

The idea is that the stack will be free and unmapped immediately upon
process exit if configured like this so that bogus stack accesses (by
the CPU, not DMA) would OOPS immediately.

2016-10-20 23:03:52

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote:
> On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <[email protected]> wrote:
> > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:
> >
> > > One possible debugging approach would be to change:
> > >
> > > #define NR_CACHED_STACKS 2
> > >
> > > to
> > >
> > > #define NR_CACHED_STACKS 0
> > >
> > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> > > force an immediate TLB flush after vfree.
> >
> > I can give that idea some runtime, but it sounds like this a case where
> > we're trying to prove a negative, and that'll just run and run ? In which case I
> > might do this when I'm travelling on Sunday.
>
> The idea is that the stack will be free and unmapped immediately upon
> process exit if configured like this so that bogus stack accesses (by
> the CPU, not DMA) would OOPS immediately.

oh, misparsed. ok, I can definitely get behind that idea then.
I'll do that next.

Dave

2016-10-20 23:23:56

by Andy Lutomirski

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones <[email protected]> wrote:
> On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote:
> > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <[email protected]> wrote:
> > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:
> > >
> > > > One possible debugging approach would be to change:
> > > >
> > > > #define NR_CACHED_STACKS 2
> > > >
> > > > to
> > > >
> > > > #define NR_CACHED_STACKS 0
> > > >
> > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> > > > force an immediate TLB flush after vfree.
> > >
> > > I can give that idea some runtime, but it sounds like this a case where
> > > we're trying to prove a negative, and that'll just run and run ? In which case I
> > > might do this when I'm travelling on Sunday.
> >
> > The idea is that the stack will be free and unmapped immediately upon
> > process exit if configured like this so that bogus stack accesses (by
> > the CPU, not DMA) would OOPS immediately.
>
> oh, misparsed. ok, I can definitely get behind that idea then.
> I'll do that next.
>

It could be worth trying this, too:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8

It occurred to me that the current code is a little bit fragile.

--Andy

2016-10-21 20:03:09

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Thu, Oct 20, 2016 at 04:23:32PM -0700, Andy Lutomirski wrote:
> On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones <[email protected]> wrote:
> > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote:
> > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <[email protected]> wrote:
> > > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:
> > > >
> > > > > One possible debugging approach would be to change:
> > > > >
> > > > > #define NR_CACHED_STACKS 2
> > > > >
> > > > > to
> > > > >
> > > > > #define NR_CACHED_STACKS 0
> > > > >
> > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> > > > > force an immediate TLB flush after vfree.
> > > >
> > > > I can give that idea some runtime, but it sounds like this a case where
> > > > we're trying to prove a negative, and that'll just run and run ? In which case I
> > > > might do this when I'm travelling on Sunday.
> > >
> > > The idea is that the stack will be free and unmapped immediately upon
> > > process exit if configured like this so that bogus stack accesses (by
> > > the CPU, not DMA) would OOPS immediately.
> >
> > oh, misparsed. ok, I can definitely get behind that idea then.
> > I'll do that next.
> >
>
> It could be worth trying this, too:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
>
> It occurred to me that the current code is a little bit fragile.

It's been nearly 24hrs with the above changes, and it's been pretty much
silent the whole time.

The only thing of note over that time period has been a btrfs lockdep
warning that's been around for a while, and occasional btrfs checksum
failures, which I've been seeing for a while, but seem to have gotten
worse since 4.8.

I'm pretty confident in the disk being ok in this machine, so I think
the checksum warnings are bogus. Chris suggested they may be the result
of memory corruption, but there's little else going on.


BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513
BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262
BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825
BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722
BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073
BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005


A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets
differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files,
but it struck me as surprising.

btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the
test files that trinity generates. If so, it might point to a race somewhere.

Dave

2016-10-21 20:18:32

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/21/2016 04:02 PM, Dave Jones wrote:
> On Thu, Oct 20, 2016 at 04:23:32PM -0700, Andy Lutomirski wrote:
> > On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones <[email protected]> wrote:
> > > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote:
> > > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones <[email protected]> wrote:
> > > > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote:
> > > > >
> > > > > > One possible debugging approach would be to change:
> > > > > >
> > > > > > #define NR_CACHED_STACKS 2
> > > > > >
> > > > > > to
> > > > > >
> > > > > > #define NR_CACHED_STACKS 0
> > > > > >
> > > > > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will
> > > > > > force an immediate TLB flush after vfree.
> > > > >
> > > > > I can give that idea some runtime, but it sounds like this a case where
> > > > > we're trying to prove a negative, and that'll just run and run ? In which case I
> > > > > might do this when I'm travelling on Sunday.
> > > >
> > > > The idea is that the stack will be free and unmapped immediately upon
> > > > process exit if configured like this so that bogus stack accesses (by
> > > > the CPU, not DMA) would OOPS immediately.
> > >
> > > oh, misparsed. ok, I can definitely get behind that idea then.
> > > I'll do that next.
> > >
> >
> > It could be worth trying this, too:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> >
> > It occurred to me that the current code is a little bit fragile.
>
> It's been nearly 24hrs with the above changes, and it's been pretty much
> silent the whole time.
>
> The only thing of note over that time period has been a btrfs lockdep
> warning that's been around for a while, and occasional btrfs checksum
> failures, which I've been seeing for a while, but seem to have gotten
> worse since 4.8.

Meaning you hit them with v4.8 or not?

>
> I'm pretty confident in the disk being ok in this machine, so I think
> the checksum warnings are bogus. Chris suggested they may be the result
> of memory corruption, but there's little else going on.
>
>
> BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513
> BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262
> BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825
> BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073
> BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722
> BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073
> BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073
> BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073
> BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073
> BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005
>
>
> A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets
> differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files,
> but it struck me as surprising.
>
> btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the
> test files that trinity generates. If so, it might point to a race somewhere.

btrfs inspect inode 130654 mntpoint

-chris

2016-10-21 20:23:37

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote:

> > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513
> > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262
> > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825
> > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722
> > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073
> > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005
> >
> >
> > A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets
> > differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files,
> > but it struck me as surprising.
> >
> > btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the
> > test files that trinity generates. If so, it might point to a race somewhere.
>
> btrfs inspect inode 130654 mntpoint

Interesting, they all return

ERROR: ino paths ioctl: No such file or directory

So these files got deleted perhaps ?

Dave

2016-10-21 20:43:52

by Josef Bacik

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/21/2016 04:38 PM, Chris Mason wrote:
>
>
> On 10/21/2016 04:23 PM, Dave Jones wrote:
>> On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote:
>>
>> > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073
>> expected csum 3008371513
>> > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum
>> 3563910319 expected csum 738595262
>> > > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum
>> 1344477721 expected csum 441864825
>> > > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum
>> 3576232181 expected csum 2566472073
>> > > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239
>> expected csum 2646577722
>> > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320
>> expected csum 3828807800
>> > > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum
>> 3475108475 expected csum 2566472073
>> > > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum
>> 142982740 expected csum 2566472073
>> > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320
>> expected csum 3828807800
>> > > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum
>> 3138898528 expected csum 2566472073
>> > > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum
>> 2169165042 expected csum 2566472073
>> > > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum
>> 2914965650 expected csum 1425742005
>> > >
>> > >
>> > > A curious thing: the expected csum 2566472073 turns up a number of times
>> for different inodes, and gets
>> > > differing actual csums each time. I suppose this could be something like
>> a block of all zeros in multiple files,
>> > > but it struck me as surprising.
>> > >
>> > > btrfs people: is there an easy way to map those inodes to a filename ?
>> I'm betting those are the
>> > > test files that trinity generates. If so, it might point to a race
>> somewhere.
>> >
>> > btrfs inspect inode 130654 mntpoint
>>
>> Interesting, they all return
>>
>> ERROR: ino paths ioctl: No such file or directory
>>
>> So these files got deleted perhaps ?
>>
> Yeah, they must have.
>

So one thing that will cause spurious csum errors is if you do things like
change the memory while it is in flight during O_DIRECT. Does trinity do that?
If so then that would explain it. If not we should probably dig into it. Thanks,

Josef
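
For illustration, the pattern Josef describes boils down to something like this userspace sketch (the file name, sizes and scribbler thread are made up; build with -pthread):

#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char *buf;

static void *scribble(void *arg)
{
        for (;;)                    /* keep changing the buffer ...           */
                memset(buf, rand() & 0xff, 4096);
        return NULL;
}

int main(void)
{
        pthread_t t;
        int fd = open("testfile", O_CREAT | O_WRONLY | O_DIRECT, 0600);

        if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
                return 1;           /* O_DIRECT needs an aligned buffer       */
        pthread_create(&t, NULL, scribble, NULL);
        write(fd, buf, 4096);       /* ... while the write is in flight, so
                                       the bytes the fs checksummed need not
                                       match what actually reaches the disk   */
        close(fd);
        return 0;
}

If trinity opens its test files with O_DIRECT and keeps mutating the buffers it is writing from, the data the filesystem checksummed need not match what lands on disk, and the csum warnings would be harmless noise rather than real corruption.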

2016-10-21 20:46:40

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/21/2016 04:23 PM, Dave Jones wrote:
> On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote:
>
> > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513
> > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319 expected csum 738595262
> > > BTRFS warning (device sda3): csum failed ino 131176 off 4096 csum 1344477721 expected csum 441864825
> > > BTRFS warning (device sda3): csum failed ino 131241 off 245760 csum 3576232181 expected csum 2566472073
> > > BTRFS warning (device sda3): csum failed ino 131429 off 0 csum 1494450239 expected csum 2646577722
> > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> > > BTRFS warning (device sda3): csum failed ino 131471 off 4096 csum 3475108475 expected csum 2566472073
> > > BTRFS warning (device sda3): csum failed ino 131471 off 958464 csum 142982740 expected csum 2566472073
> > > BTRFS warning (device sda3): csum failed ino 131471 off 0 csum 3949539320 expected csum 3828807800
> > > BTRFS warning (device sda3): csum failed ino 131532 off 270336 csum 3138898528 expected csum 2566472073
> > > BTRFS warning (device sda3): csum failed ino 131532 off 1249280 csum 2169165042 expected csum 2566472073
> > > BTRFS warning (device sda3): csum failed ino 131649 off 16384 csum 2914965650 expected csum 1425742005
> > >
> > >
> > > A curious thing: the expected csum 2566472073 turns up a number of times for different inodes, and gets
> > > differing actual csums each time. I suppose this could be something like a block of all zeros in multiple files,
> > > but it struck me as surprising.
> > >
> > > btrfs people: is there an easy way to map those inodes to a filename ? I'm betting those are the
> > > test files that trinity generates. If so, it might point to a race somewhere.
> >
> > btrfs inspect inode 130654 mntpoint
>
> Interesting, they all return
>
> ERROR: ino paths ioctl: No such file or directory
>
> So these files got deleted perhaps ?
>
Yeah, they must have.

-chris

2016-10-21 21:11:38

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Fri, Oct 21, 2016 at 04:41:09PM -0400, Josef Bacik wrote:

> >> >
> >> > btrfs inspect inode 130654 mntpoint
> >>
> >> Interesting, they all return
> >>
> >> ERROR: ino paths ioctl: No such file or directory
> >>
> >> So these files got deleted perhaps ?
> >>
> > Yeah, they must have.
> >
>
> So one thing that will cause spurious csum errors is if you do things like
> change the memory while it is in flight during O_DIRECT. Does trinity do that?
> If so then that would explain it. If not we should probably dig into it. Thanks,

Yeah, that's definitely possible. And it wasn't that long ago I added
some code to always open testfiles multiple times with different modes,
so the likelihood of O_DIRECT went up. That would explain why I've started
seeing this more.

Dave

2016-10-22 15:20:46

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:

> > It could be worth trying this, too:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> >
> > It occurred to me that the current code is a little bit fragile.
>
> It's been nearly 24hrs with the above changes, and it's been pretty much
> silent the whole time.
>
> The only thing of note over that time period has been a btrfs lockdep
> warning that's been around for a while, and occasional btrfs checksum
> failures, which I've been seeing for a while, but seem to have gotten
> worse since 4.8.
>
> I'm pretty confident in the disk being ok in this machine, so I think
> the checksum warnings are bogus. Chris suggested they may be the result
> of memory corruption, but there's little else going on.

The only interesting thing from last night's run was this..

BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
Workqueue: writeback wb_workfn (flush-btrfs-2)
ffffc90001f97828
ffffffff8130d07c
ffffea00138adc00
ffffffff819ff524
ffffc90001f97850
ffffffff8115117f
0000000000000000
ffffea00138adc00
400000000000000c
ffffc90001f97860
ffffffff8115123a
ffffc90001f978a8
Call Trace:
[<ffffffff8130d07c>] dump_stack+0x4f/0x73
[<ffffffff8115117f>] bad_page+0xbf/0x120
[<ffffffff8115123a>] free_pages_check_bad+0x5a/0x70
[<ffffffff81153b38>] free_hot_cold_page+0x248/0x290
[<ffffffff81153e3b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff8115c84d>] release_pages+0x2bd/0x350
[<ffffffff8115dd82>] __pagevec_release+0x22/0x30
[<ffffffffa009cd4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[<ffffffffa009d199>] extent_writepages+0x49/0x60 [btrfs]
[<ffffffffa007d840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007a993>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8115af6c>] do_writepages+0x1c/0x30
[<ffffffff811f6d33>] __writeback_single_inode+0x33/0x180
[<ffffffff811f7528>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f78bd>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff811f7b73>] wb_writeback+0x1e3/0x1f0
[<ffffffff811f80b2>] wb_workfn+0xd2/0x280
[<ffffffff81090875>] process_one_work+0x1d5/0x490
[<ffffffff81090815>] ? process_one_work+0x175/0x490
[<ffffffff81090b79>] worker_thread+0x49/0x490
[<ffffffff81090b30>] ? process_one_work+0x490/0x490
[<ffffffff81095cee>] kthread+0xee/0x110
[<ffffffff81095c00>] ? kthread_park+0x60/0x60
[<ffffffff81790bd2>] ret_from_fork+0x22/0x30

2016-10-23 21:32:47

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/22/2016 11:20 AM, Dave Jones wrote:
> On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
>
> > > It could be worth trying this, too:
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> > >
> > > It occurred to me that the current code is a little bit fragile.
> >
> > It's been nearly 24hrs with the above changes, and it's been pretty much
> > silent the whole time.
> >
> > The only thing of note over that time period has been a btrfs lockdep
> > warning that's been around for a while, and occasional btrfs checksum
> > failures, which I've been seeing for a while, but seem to have gotten
> > worse since 4.8.
> >
> > I'm pretty confident in the disk being ok in this machine, so I think
> > the checksum warnings are bogus. Chris suggested they may be the result
> > of memory corruption, but there's little else going on.
>
> The only interesting thing last nights run was this..
>
> BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
> page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
> flags: 0x400000000000000c(referenced|uptodate)
> page dumped because: non-NULL mapping
> CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
> Workqueue: writeback wb_workfn (flush-btrfs-2)

Well crud, we're back to wondering if this is Btrfs or the stack
corruption. Since the pagevecs are on the stack and this is a new
crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
should make sure.

-chris
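
For context, the pagevec Chris refers to is an on-stack structure in the writepages path; the 4.8-era pattern is roughly this (trimmed, error handling dropped):

struct pagevec pvec;                /* lives on the writeback worker's stack */
pgoff_t index = 0;
unsigned int i, nr_pages;

pagevec_init(&pvec, 0);
while ((nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
                                      PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE))) {
        for (i = 0; i < nr_pages; i++) {
                /* lock, tag-check and write out pvec.pages[i] ... */
        }
        pagevec_release(&pvec);     /* drops the refs; frees pages through
                                       release_pages(), which is where the
                                       bad-page checks above are tripping    */
        cond_resched();
}

If the worker's stack were being scribbled on, the page pointers held in pvec could be bogus by the time pagevec_release() runs, which would line up with the "Bad page state ... non-NULL mapping" reports.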

2016-10-24 04:41:05

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote:
>
>
> On 10/22/2016 11:20 AM, Dave Jones wrote:
> > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
> >
> > > > It could be worth trying this, too:
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> > > >
> > > > It occurred to me that the current code is a little bit fragile.
> > >
> > > It's been nearly 24hrs with the above changes, and it's been pretty much
> > > silent the whole time.
> > >
> > > The only thing of note over that time period has been a btrfs lockdep
> > > warning that's been around for a while, and occasional btrfs checksum
> > > failures, which I've been seeing for a while, but seem to have gotten
> > > worse since 4.8.
> > >
> > > I'm pretty confident in the disk being ok in this machine, so I think
> > > the checksum warnings are bogus. Chris suggested they may be the result
> > > of memory corruption, but there's little else going on.
> >
> > The only interesting thing last nights run was this..
> >
> > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
> > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
> > flags: 0x400000000000000c(referenced|uptodate)
> > page dumped because: non-NULL mapping
> > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
> > Workqueue: writeback wb_workfn (flush-btrfs-2)
>
> Well crud, we're back to wondering if this is Btrfs or the stack
> corruption. Since the pagevecs are on the stack and this is a new
> crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
> should make sure.

Here's an interesting one from today, pointing the finger at xattrs again.


[69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
[69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
[69943.468077] RIP: 0010:[<ffffffff810c3f6b>]
[69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
[69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086
[69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000
[69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000
[69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000
[69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740
[69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000
[69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0
[69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000
[69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[69943.534771] Stack:
[69943.540023] ffff880507bd74c0
[69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08
[69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c
[69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace:
[69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70
[69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
[69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80
[69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
[69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70
[69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0
[69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100
[69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0
[69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170
[69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170
[69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0
[69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70
[69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
[69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30
[69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170
[69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170
[69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170
[69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
[69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50
[69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0
[69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20
[69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0
[69943.704791] [<ffffffff811ecda1>] ? __mnt_want_write+0x61/0x90
[69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0
[69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25
[69943.731521] Code:
[69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85 [69943.745506] RIP
[69943.752432] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
RSP <ffffc9000b10b9e8>
[69943.766856] CR2: ffffffff81789d28


2016-10-24 13:44:04

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/24/2016 12:40 AM, Dave Jones wrote:
> On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote:
> >
> >
> > On 10/22/2016 11:20 AM, Dave Jones wrote:
> > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
> > >
> > > > > It could be worth trying this, too:
> > > > >
> > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> > > > >
> > > > > It occurred to me that the current code is a little bit fragile.
> > > >
> > > > It's been nearly 24hrs with the above changes, and it's been pretty much
> > > > silent the whole time.
> > > >
> > > > The only thing of note over that time period has been a btrfs lockdep
> > > > warning that's been around for a while, and occasional btrfs checksum
> > > > failures, which I've been seeing for a while, but seem to have gotten
> > > > worse since 4.8.
> > > >
> > > > I'm pretty confident in the disk being ok in this machine, so I think
> > > > the checksum warnings are bogus. Chris suggested they may be the result
> > > > of memory corruption, but there's little else going on.
> > >
> > > The only interesting thing last nights run was this..
> > >
> > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
> > > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
> > > flags: 0x400000000000000c(referenced|uptodate)
> > > page dumped because: non-NULL mapping
> > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
> > > Workqueue: writeback wb_workfn (flush-btrfs-2)
> >
> > Well crud, we're back to wondering if this is Btrfs or the stack
> > corruption. Since the pagevecs are on the stack and this is a new
> > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
> > should make sure.
>
> Here's an interesting one from today, pointing the finger at xattrs again.
>
>
> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
> [69943.468077] RIP: 0010:[<ffffffff810c3f6b>]

Was this btrfs?

-chris

2016-10-24 20:07:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Sun, Oct 23, 2016 at 9:40 PM, Dave Jones <[email protected]> wrote:
> On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote:
> >
> >
> > On 10/22/2016 11:20 AM, Dave Jones wrote:
> > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote:
> > >
> > > > > It could be worth trying this, too:
> > > > >
> > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8
> > > > >
> > > > > It occurred to me that the current code is a little bit fragile.
> > > >
> > > > It's been nearly 24hrs with the above changes, and it's been pretty much
> > > > silent the whole time.
> > > >
> > > > The only thing of note over that time period has been a btrfs lockdep
> > > > warning that's been around for a while, and occasional btrfs checksum
> > > > failures, which I've been seeing for a while, but seem to have gotten
> > > > worse since 4.8.
> > > >
> > > > I'm pretty confident in the disk being ok in this machine, so I think
> > > > the checksum warnings are bogus. Chris suggested they may be the result
> > > > of memory corruption, but there's little else going on.
> > >
> > > The only interesting thing last nights run was this..
> > >
> > > BUG: Bad page state in process kworker/u8:1 pfn:4e2b70
> > > page:ffffea00138adc00 count:0 mapcount:0 mapping:ffff88046e9fc2e0 index:0xdf0
> > > flags: 0x400000000000000c(referenced|uptodate)
> > > page dumped because: non-NULL mapping
> > > CPU: 3 PID: 24234 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
> > > Workqueue: writeback wb_workfn (flush-btrfs-2)
> >
> > Well crud, we're back to wondering if this is Btrfs or the stack
> > corruption. Since the pagevecs are on the stack and this is a new
> > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
> > should make sure.
>
> Here's an interesting one from today, pointing the finger at xattrs again.
>
>
> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC

This is an unhandled kernel page fault. The string "Oops" is so helpful :-/

> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
> [69943.468077] RIP: 0010:[<ffffffff810c3f6b>]
> [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
> [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086
> [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000
> [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000
> [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000
> [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740
> [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000
> [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0
> [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000
> [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [69943.534771] Stack:
> [69943.540023] ffff880507bd74c0
> [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08
> [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c
> [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace:
> [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70
> [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
> [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80
> [69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
> [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70
> [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0
> [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100
> [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0
> [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170
> [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170
> [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0
> [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70
> [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
> [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30
> [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170
> [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170
> [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170
> [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
> [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50
> [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0
> [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0
> [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20
> [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0
> [69943.704791] [<ffffffff811ecda1>] ? __mnt_want_write+0x61/0x90
> [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0
> [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25
> [69943.731521] Code:
> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85

That's lock incl 0x198(%rax). I think this is:

atomic_inc((atomic_t *)&class->ops);

I suppose this could be stack corruption at work, but after a fair
amount of staring, I still haven't found anything in the vmap_stack
code that would cause stack corruption.

2016-10-24 20:46:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> This is an unhandled kernel page fault. The string "Oops" is so helpful :-/

I think there was a line above it that DaveJ just didn't include.

>
>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
>> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
>> [69943.468077] RIP: 0010:[<ffffffff810c3f6b>]
>> [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
>> [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086
>> [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000
>> [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>> [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000
>> [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000
>> [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740
>> [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000
>> [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0
>> [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000
>> [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> [69943.534771] Stack:
>> [69943.540023] ffff880507bd74c0
>> [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08
>> [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c
>> [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace:
>> [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70
>> [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
>> [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80
>> [69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
>> [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70
>> [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0
>> [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100
>> [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0
>> [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170
>> [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170
>> [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0
>> [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70
>> [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
>> [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30
>> [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170
>> [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170
>> [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170
>> [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
>> [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50
>> [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0
>> [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0
>> [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20
>> [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0
>> [69943.704791] [<ffffffff811ecda1>] ? __mnt_want_write+0x61/0x90
>> [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0
>> [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
>> [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25
>> [69943.731521] Code:
>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
>
> That's lock incl 0x198(%rax). I think this is:
>
> atomic_inc((atomic_t *)&class->ops);
>
> I suppose this could be stack corruption at work, but after a fair
> amount of staring, I still haven't found anything in the vmap_stack
> code that would cause stack corruption.

Well, it is intriguing that what faults is this:

finish_wait(shmem_falloc_waitq, &shmem_fault_wait);

where 'shmem_fault_wait' is an on-stack wait queue. So it really looks
very much like stack corruption.

What strikes me is that "finish_wait()" does this optimistic "has my
entry been removed" without holding the waitqueue lock (and uses
list_empty_careful() to make sure it does that "safely").

It has that big comment too:

/*
* shmem_falloc_waitq points into the shmem_fallocate()
* stack of the hole-punching task: shmem_falloc_waitq
* is usually invalid by the time we reach here, but
* finish_wait() does not dereference it in that case;
* though i_lock needed lest racing with wake_up_all().
*/

the stack it comes from is the wait queue head from shmem_fallocate(),
which will do "wake_up_all()" under the inode lock.

On the face of it, the inode lock should make that safe and serialize
everything. And yes, finish_wait() does not touch the unsafe stuff if
the wait-queue (in the local stack) is empty, which wake_up_all()
*should* have guaranteed. It's just a regular wait-queue entry (that's
what DEFINE_WAIT() sets up), so it uses the normal
autoremove_wake_function() that removes things on successful wakeup:

int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
        int ret = default_wake_function(wait, mode, sync, key);

        if (ret)
                list_del_init(&wait->task_list);
        return ret;
}

So the only issue is "did default_wake_function() return true"? That's
try_to_wake_up(TASK_NORMAL, 0), and I note that it can return zero
(and thus *not* remove the entry - leaving the invalid entry there)
if

if (!(p->state & state))
goto out;

but "prepare_to_wait()" (which also ran with the inode->i_lock held,
and also takes the wait-queue lock) did set p->state to
TASK_UNINTERRUPTIBLE.

So this is all some really subtle code, but I'm not seeing that it
would be wrong.

Linus
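
The code being discussed is roughly the following (mm/shmem.c of that era, heavily trimmed; the hole-punch bookkeeping and range checks are omitted):

/* hole-punching task, shmem_fallocate(): the waitqueue head is on ITS stack */
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
/* ... publish the falloc info via inode->i_private, punch the hole ... */
spin_lock(&inode->i_lock);
inode->i_private = NULL;
wake_up_all(&shmem_falloc_waitq);   /* every waiter must be off the queue here */
spin_unlock(&inode->i_lock);
/* shmem_fallocate() returns; its stack frame (and the waitqueue head) goes away */

/* faulting task, shmem_fault(): only the wait entry is on this task's stack */
wait_queue_head_t *shmem_falloc_waitq = shmem_falloc->waitq; /* -> fallocate's stack */
DEFINE_WAIT(shmem_fault_wait);

prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
schedule();
spin_lock(&inode->i_lock);
finish_wait(shmem_falloc_waitq, &shmem_fault_wait);  /* head may already be invalid */
spin_unlock(&inode->i_lock);

The subtlety is that the waitqueue head lives on the hole-punching task's stack, so everything depends on wake_up_all() emptying the queue before shmem_fallocate() returns and that stack gets reused.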

2016-10-24 21:17:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds
<[email protected]> wrote:
>
> So this is all some really subtle code, but I'm not seeing that it
> would be wrong.

Ahh... Except maybe..

The vmalloc/vfree code itself is a bit scary. In particular, we have a
rather insane model of TLB flushing. We leave the virtual area on a
lazy purge-list, and we delay flushing the TLB and actually freeing
the virtual memory for it so that we can batch things up.

But we've free'd the physical pages that are *mapped* by that area
when we do the vfree(). So there can be stale TLB entries that point
to pages that have gotten re-used. They shouldn't matter, because
nothing should be writing to those pages, but it strikes me that this
may also be hurting the DEBUG_PAGEALLOC thing. Maybe we're not getting
the page faults that we *should* be getting, and there are hidden
reads and writes to those pages that already got free'd.

There was some nasty reason why we did that insane thing. I think it
was just that there are a few high-frequency vmalloc/vfree users and
the TLB flushing was killing some performance.

But it does strike me that we are playing very fast and loose with the
TLB on the vmalloc area.

So maybe all the new VMAP code is fine, and it's really vmalloc/vfree
that has been subtly broken but nobody has ever cared before?

Linus

2016-10-24 21:50:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds
<[email protected]> wrote:
>
> The vmalloc/vfree code itself is a bit scary. In particular, we have a
> rather insane model of TLB flushing. We leave the virtual area on a
> lazy purge-list, and we delay flushing the TLB and actually freeing
> the virtual memory for it so that we can batch things up.

Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
vmap_debug_free_range() should have forced a synchronous TLB flush for
vmalloc ranges too, so that doesn't explain it either.

Linus

2016-10-24 22:03:01

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/24/2016 05:50 PM, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> The vmalloc/vfree code itself is a bit scary. In particular, we have a
>> rather insane model of TLB flushing. We leave the virtual area on a
>> lazy purge-list, and we delay flushing the TLB and actually freeing
>> the virtual memory for it so that we can batch things up.
>
> Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
> vmap_debug_free_range() should have forced a synchronous TLB flush fro
> vmalloc ranges too, so that doesn't explain it either.
>

My big fear here is that we're just triggering an old stack corruption
more reliably. I wonder if we can make it happen much faster by
restricting the stacks to a static list and cycling through them in LIFO
fashion?

-chris

2016-10-24 22:42:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski <[email protected]> wrote:
>>>
>>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>
>> This is an unhandled kernel page fault. The string "Oops" is so helpful :-/
>
> I think there was a line above it that DaveJ just didn't include.
>
>>
>>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
>>> [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
>>> [69943.468077] RIP: 0010:[<ffffffff810c3f6b>]
>>> [69943.472704] [<ffffffff810c3f6b>] __lock_acquire.isra.32+0x6b/0x8c0
>>> [69943.477489] RSP: 0018:ffffc9000b10b9e8 EFLAGS: 00010086
>>> [69943.482368] RAX: ffffffff81789b90 RBX: ffff8804f8dd3740 RCX: 0000000000000000
>>> [69943.487410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> [69943.492515] RBP: ffffc9000b10ba18 R08: 0000000000000001 R09: 0000000000000000
>>> [69943.497666] R10: 0000000000000001 R11: 00003f9cfa7f4e73 R12: 0000000000000000
>>> [69943.502880] R13: 0000000000000000 R14: ffffc9000af7bd48 R15: ffff8804f8dd3740
>>> [69943.508163] FS: 00007f64904a2b40(0000) GS:ffff880507a00000(0000) knlGS:0000000000000000
>>> [69943.513591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [69943.518917] CR2: ffffffff81789d28 CR3: 00000004a8f16000 CR4: 00000000001406e0
>>> [69943.524253] DR0: 00007f5b97fd4000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [69943.529488] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>>> [69943.534771] Stack:
>>> [69943.540023] ffff880507bd74c0
>>> [69943.545317] ffff8804f8dd3740 0000000000000046 0000000000000286[69943.545456] ffffc9000af7bd08
>>> [69943.550930] 0000000000000100 ffffc9000b10ba50 ffffffff810c4b68[69943.551069] ffffffff810ba40c
>>> [69943.556657] ffff880400000000 0000000000000000 ffffc9000af7bd48[69943.556796] Call Trace:
>>> [69943.562465] [<ffffffff810c4b68>] lock_acquire+0x58/0x70
>>> [69943.568354] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
>>> [69943.574306] [<ffffffff8178fef2>] _raw_spin_lock_irqsave+0x42/0x80
>>> [69943.580335] [<ffffffff810ba40c>] ? finish_wait+0x3c/0x70
>>> [69943.586237] [<ffffffff810ba40c>] finish_wait+0x3c/0x70
>>> [69943.591992] [<ffffffff81169727>] shmem_fault+0x167/0x1b0
>>> [69943.597807] [<ffffffff810ba6c0>] ? prepare_to_wait_event+0x100/0x100
>>> [69943.603741] [<ffffffff8117b46d>] __do_fault+0x6d/0x1b0
>>> [69943.609743] [<ffffffff8117f168>] handle_mm_fault+0xc58/0x1170
>>> [69943.615822] [<ffffffff8117e553>] ? handle_mm_fault+0x43/0x1170
>>> [69943.621971] [<ffffffff81044982>] __do_page_fault+0x172/0x4e0
>>> [69943.628184] [<ffffffff81044d10>] do_page_fault+0x20/0x70
>>> [69943.634449] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
>>> [69943.640784] [<ffffffff81791f3f>] page_fault+0x1f/0x30
>>> [69943.647170] [<ffffffff8133d69c>] ? strncpy_from_user+0x5c/0x170
>>> [69943.653480] [<ffffffff8133d686>] ? strncpy_from_user+0x46/0x170
>>> [69943.659632] [<ffffffff811f22a7>] setxattr+0x57/0x170
>>> [69943.665846] [<ffffffff8132a897>] ? debug_smp_processor_id+0x17/0x20
>>> [69943.672172] [<ffffffff810c1f09>] ? get_lock_stats+0x19/0x50
>>> [69943.678558] [<ffffffff810a58f6>] ? sched_clock_cpu+0xb6/0xd0
>>> [69943.685007] [<ffffffff810c40cf>] ? __lock_acquire.isra.32+0x1cf/0x8c0
>>> [69943.691542] [<ffffffff8132a8b3>] ? __this_cpu_preempt_check+0x13/0x20
>>> [69943.698130] [<ffffffff8109b9bc>] ? preempt_count_add+0x7c/0xc0
>>> [69943.704791] [<ffffffff811ecda1>] ? __mnt_want_write+0x61/0x90
>>> [69943.711519] [<ffffffff811f2638>] SyS_fsetxattr+0x78/0xa0
>>> [69943.718300] [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
>>> [69943.724949] [<ffffffff81790a4b>] entry_SYSCALL64_slow_path+0x25/0x25
>>> [69943.731521] Code:
>>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00 <f0> ff 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
>>
>> That's lock incl 0x198(%rax). I think this is:
>>
>> atomic_inc((atomic_t *)&class->ops);
>>
>> I suppose this could be stack corruption at work, but after a fair
>> amount of staring, I still haven't found anything in the vmap_stack
>> code that would cause stack corruption.
>
> Well, it is intriguing that what faults is this:
>
> finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
>
> where 'shmem_fault_wait' is an on-stack wait queue. So it really looks
> very much like stack corruption.
>
> What strikes me is that "finish_wait()" does this optimistic "has my
> entry been removed" without holding the waitqueue lock (and uses
> list_empty_careful() to make sure it does that "safely").
>
> It has that big comment too:
>
> /*
> * shmem_falloc_waitq points into the shmem_fallocate()
> * stack of the hole-punching task: shmem_falloc_waitq
> * is usually invalid by the time we reach here, but
> * finish_wait() does not dereference it in that case;
> * though i_lock needed lest racing with wake_up_all().
> */
>
> the stack it comes from is the wait queue head from shmem_fallocate(),
> which will do "wake_up_all()" under the inode lock.

Here's my theory: I think you're looking at the right code but the
wrong stack. shmem_fault_wait is fine, but shmem_falloc_waitq looks
really dicey. Consider:

fallocate calls wake_up_all(), which calls autoremove_wake_function().
That wakes up the shmem_fault thread. Suppose that the shmem_fault
thread gets moving really quickly (presumably because it never went to
sleep in the first place -- suppose it hadn't made it to schedule(),
so schedule() turns into a no-op). It calls finish_wait() *before*
autoremove_wake_function() does the list_del_init(). finish_wait()
gets to the list_empty_careful() call and it returns true.

Now the fallocate thread catches up and *exits*. Dave's test makes a
new thread that reuses the stack (the vmap area or the backing store).

Now the shmem_fault thread continues on its merry way and takes
q->lock. But oh crap, q->lock is pointing at some random spot on some
other thread's stack. Kaboom!
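
For reference, the finish_wait() path in question is roughly the
following -- paraphrased from memory of kernel/sched/wait.c in this
era, so treat it as a sketch rather than the literal source:

void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
{
        unsigned long flags;

        __set_current_state(TASK_RUNNING);
        /* unlocked peek: only take q->lock if our entry still looks queued */
        if (!list_empty_careful(&wait->task_list)) {
                spin_lock_irqsave(&q->lock, flags);
                list_del_init(&wait->task_list);
                spin_unlock_irqrestore(&q->lock, flags);
        }
}

The q->lock it takes there is shmem_falloc_waitq->lock, i.e. memory
that lives on the fallocate thread's (possibly recycled) stack.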

If I'm right, actually blowing up due to this bug would have been
very, very unlikely on 4.8 and below, and it has nothing to do with
vmap stacks per se -- it was the RCU removal. We used to wait for an
RCU grace period before reusing a stack, and there couldn't have been
an RCU grace period in the middle of finish_wait(), so we were safer.
(Not totally safe, because the thread that called fallocate() could
have made it somewhere else that used the same stack depth, but I can
see that being less likely.)

IOW, I think that autoremove_wake_function() is buggy. It should
remove the wait entry, then try to wake the thread, then re-add the
wait entry if the wakeup fails. Or maybe it should use some atomic
flag in the wait entry or have some other locking to make sure that
finish_wait does not touch q->lock if autoremove_wake_function
succeeds.
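
As a very rough sketch of that first option (hypothetical -- today's
wake function doesn't even get the waitqueue head passed in, so the
signature below is made up):

static int autoremove_wake_function_sketch(wait_queue_head_t *q,
                                           wait_queue_t *wait,
                                           unsigned mode, int sync, void *key)
{
        int ret;

        /* detach first, so the waiter's finish_wait() fast path sees an
         * empty entry and never goes for q->lock, which may live on the
         * waker's soon-to-be-recycled stack */
        list_del_init(&wait->task_list);

        ret = default_wake_function(wait, mode, sync, key);
        if (!ret)
                /* the wakeup didn't happen: put the entry back on the queue */
                list_add_tail(&wait->task_list, &q->task_list);

        return ret;
}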

--Andy

2016-10-25 00:00:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
>
> Here's my theory: I think you're looking at the right code but the
> wrong stack. shmem_fault_wait is fine, but shmem_falloc_waitq looks
> really dicey.

Hmm.

> Consider:
>
> fallocate calls wake_up_all(), which calls autoremove_wake_function().
> That wakes up the shmem_fault thread. Suppose that the shmem_fault
> thread gets moving really quickly (presumably because it never went to
> sleep in the first place -- suppose it hadn't made it to schedule(),
> so schedule() turns into a no-op). It calls finish_wait() *before*
> autoremove_wake_function() does the list_del_init(). finish_wait()
> gets to the list_empty_careful() call and it returns true.

All of this happens under inode->i_lock, so the different parts are
serialized. So if this happens before the wakeup, then finish_wait()
will see that the wait-queue entry is not empty (it points to the wait
queue head in shmem_falloc_waitq).

But then it will just remove itself carefully with list_del_init()
under the waitqueue lock, and things are fine.

Yes, it uses the waitqueue lock on the other stack, but that stack is
still ok since
(a) we're serialized by inode->i_lock
(b) this code ran before the fallocate thread catches up and exits.

In other words, your scenario seems to be dependent on those two
threads being able to race. But they both hold inode->i_lock in the
critical region you are talking about.

> Now the fallocate thread catches up and *exits*. Dave's test makes a
> new thread that reuses the stack (the vmap area or the backing store).
>
> Now the shmem_fault thread continues on its merry way and takes
> q->lock. But oh crap, q->lock is pointing at some random spot on some
> other thread's stack. Kaboom!

Note that q->lock should be entirely immaterial, since inode->i_lock
nests outside of it in all uses.

Now, if there is some code that runs *without* the inode->i_lock, then
that would be a big bug.

But I'm not seeing it.

I do agree that some race on some stack data structure could easily be
the cause of these issues. And yes, the vmap code obviously starts
reusing the stack much earlier, and would trigger problems that would
essentially be hidden by the fact that the kernel stack used to stay
around not just until exit(), but until the process was reaped.

I just think that in this case i_lock really looks like it should
serialize things correctly.

Or are you seeing something I'm not?

Linus

2016-10-25 01:09:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Oct 24, 2016 5:00 PM, "Linus Torvalds" <[email protected]> wrote:
>
> On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
>
> > Now the fallocate thread catches up and *exits*. Dave's test makes a
> > new thread that reuses the stack (the vmap area or the backing store).
> >
> > Now the shmem_fault thread continues on its merry way and takes
> > q->lock. But oh crap, q->lock is pointing at some random spot on some
> > other thread's stack. Kaboom!
>
> Note that q->lock should be entirely immaterial, since inode->i_lock
> nests outside of it in all uses.
>
> Now, if there is some code that runs *without* the inode->i_lock, then
> that would be a big bug.
>
> But I'm not seeing it.
>
> I do agree that some race on some stack data structure could easily be
> the cause of these issues. And yes, the vmap code obviously starts
> reusing the stack much earlier, and would trigger problems that would
> essentially be hidden by the fact that the kernel stack used to stay
> around not just until exit(), but until the process was reaped.
>
> I just think that in this case i_lock really looks like it should
> serialize things correctly.
>
> Or are you seeing something I'm not?

No, I missed that.

--Andy

2016-10-26 00:28:07

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 24, 2016 at 09:42:39AM -0400, Chris Mason wrote:

> > > Well crud, we're back to wondering if this is Btrfs or the stack
> > > corruption. Since the pagevecs are on the stack and this is a new
> > > crash, my guess is you'll be able to trigger it on xfs/ext4 too. But we
> > > should make sure.
> >
> > Here's an interesting one from today, pointing the finger at xattrs again.
> >
> >
> > [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 4.9.0-rc1-think+ #11
> > [69943.463510] task: ffff8804f8dd3740 task.stack: ffffc9000b108000
> > [69943.468077] RIP: 0010:[<ffffffff810c3f6b>]
>
> Was this btrfs?

I already told you elsewhere, but for the benefit of everyone else, yes, it was.

At Chris' behest, I gave ext4 some more air-time with this workload.
It ran for 1 day 6 hrs without incident before I got bored and tried
something else. I threw XFS on the test partition, restarted the test,
and got the warnings below across two reboots.

DaveC: Do these look like real problems, or is this more "looks like
random memory corruption" ? It's been a while since I did some stress
testing on XFS, so these might not be new..



XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2938
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:113!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6
task: ffff8804f4658040 task.stack: ffff88050568c000
RIP: 0010:[<ffffffffa02d3e2b>]
[<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs]
RSP: 0000:ffff88050568f9e8 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: 0000000000000046 RCX: 0000000000000001
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa02fe34d
RBP: ffff88050568f9e8 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: ffff88050568fb44
R13: 00000000000000f3 R14: ffff8804f292bf88 R15: 000ffffffffe0046
FS: 00007fe2ddfdfb40(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe2dbabd000 CR3: 00000004f461f000 CR4: 00000000001406e0
Stack:
ffff88050568fa88 ffffffffa027ccee fffffffffffffff9 ffff8804f16fd8b0
0000000000003ffa 0000000000000032 ffff8804f292bf40 0000000000004976
000ffffffffe0008 00000000000004fd ffff880400000000 0000000000005107
Call Trace:
[<ffffffffa027ccee>] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs]
[<ffffffffa027f2d4>] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs]
[<ffffffffa02cadd7>] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs]
[<ffffffffa02cb0d1>] xfs_file_iomap_begin+0x181/0x270 [xfs]
[<ffffffffa02ca13e>] ? xfs_file_iomap_end+0x9e/0xe0 [xfs]
[<ffffffff8122e573>] iomap_apply+0x53/0x100
[<ffffffff8122df10>] ? iomap_write_end+0x70/0x70
[<ffffffff8122e68b>] iomap_file_buffered_write+0x6b/0x90
[<ffffffff8122df10>] ? iomap_write_end+0x70/0x70
[<ffffffffa02c1dd8>] xfs_file_buffered_aio_write+0xe8/0x1d0 [xfs]
[<ffffffff810c3b7f>] ? __lock_acquire.isra.32+0x1cf/0x8c0
[<ffffffffa02c1f45>] xfs_file_write_iter+0x85/0x120 [xfs]
[<ffffffff811c8c98>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811c9622>] do_readv_writev+0x172/0x210
[<ffffffffa02c1ec0>] ? xfs_file_buffered_aio_write+0x1d0/0x1d0 [xfs]
[<ffffffff811e9794>] ? __fdget_pos+0x44/0x50
[<ffffffff8178b2f2>] ? mutex_lock_nested+0x272/0x3f0
[<ffffffff811e9794>] ? __fdget_pos+0x44/0x50
[<ffffffff811e9794>] ? __fdget_pos+0x44/0x50
[<ffffffff811c98ea>] vfs_writev+0x3a/0x50
[<ffffffff811c9950>] do_writev+0x50/0xd0
[<ffffffff811ca9fb>] SyS_writev+0xb/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff8178ff4b>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 48 c7 c7 65 e3 2f a0 e8 74 37 da e0 5d c3 66 90 55 48 89 f1 41 89 d0 48 c7 c6 18 93 30 a0 48 89 fa 48 89 e5 31 ff e8 65 fa ff ff <0f> 0b 0f 1f 00 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 48 89 e5
RIP
[<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs]
RSP <ffff88050568f9e8>



XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309
kernel BUG at fs/xfs/xfs_message.c:113!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 7309 Comm: kworker/u8:1 Not tainted 4.9.0-rc1-think+ #11
Workqueue: writeback wb_workfn
(flush-8:0)
task: ffff88025eb98040 task.stack: ffffc9000a914000
RIP: 0010:[<ffffffffa0571e2b>]
[<ffffffffa0571e2b>] assfail+0x1b/0x20 [xfs]
RSP: 0018:ffffc9000a917410 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff8804538d22b8 RCX: 0000000000000001
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa059c34d
RBP: ffffc9000a917410 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: ffffffffffffffff
R13: ffff88047c765698 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880507800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000004c56e7000 CR4: 00000000001406f0
DR0: 00007fec5e3c9000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffffc9000a917438 ffffffffa057abe1 ffffc9000a917510 ffffc9000a917510
ffffc9000a917510 ffffc9000a917460 ffffffffa0548eff ffffc9000a917510
0000000000000001 ffffc9000a917510 ffffc9000a917480 ffffffffa050aa3d
Call Trace:
[<ffffffffa057abe1>] xfs_trans_mod_sb+0x241/0x280 [xfs]
[<ffffffffa0548eff>] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs]
[<ffffffffa050aa3d>] xfs_alloc_ag_vextent+0x23d/0x300 [xfs]
[<ffffffffa050bb1b>] xfs_alloc_vextent+0x5fb/0x6d0 [xfs]
[<ffffffffa051c1b4>] xfs_bmap_btalloc+0x304/0x8e0 [xfs]
[<ffffffffa054648e>] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs]
[<ffffffffa051c8db>] xfs_bmap_alloc+0x2b/0x40 [xfs]
[<ffffffffa051dc30>] xfs_bmapi_write+0x640/0x1210 [xfs]
[<ffffffffa0569326>] xfs_iomap_write_allocate+0x166/0x350 [xfs]
[<ffffffffa05540b0>] xfs_map_blocks+0x1b0/0x260 [xfs]
[<ffffffffa0554beb>] xfs_do_writepage+0x23b/0x730 [xfs]
[<ffffffff81159ef8>] ? clear_page_dirty_for_io+0x128/0x210
[<ffffffff81159e71>] ? clear_page_dirty_for_io+0xa1/0x210
[<ffffffff8115a1b6>] write_cache_pages+0x1d6/0x4a0
[<ffffffffa05549b0>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[<ffffffffa0554419>] xfs_vm_writepages+0x59/0x80 [xfs]
[<ffffffff8115af6c>] do_writepages+0x1c/0x30
[<ffffffff811f6d33>] __writeback_single_inode+0x33/0x180
[<ffffffff811f7528>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f78bd>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff811f7b73>] wb_writeback+0x1e3/0x1f0
[<ffffffff811f80b2>] wb_workfn+0xd2/0x280
[<ffffffff81090875>] process_one_work+0x1d5/0x490
[<ffffffff81090815>] ? process_one_work+0x175/0x490
[<ffffffff81090b79>] worker_thread+0x49/0x490
[<ffffffff81090b30>] ? process_one_work+0x490/0x490
[<ffffffff81090b30>] ? process_one_work+0x490/0x490
[<ffffffff81095cee>] kthread+0xee/0x110
[<ffffffff81095c00>] ? kthread_park+0x60/0x60
[<ffffffff81790bd2>] ret_from_fork+0x22/0x30
Code: 48 c7 c7 65 c3 59 a0 e8 c4 5c b0 e0 5d c3 66 90 55 48 89 f1 41 89 d0 48 c7 c6 18 73 5a a0 48 89 fa 48 89 e5 31 ff e8 65 fa ff ff <0f> 0b 0f 1f 00 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 48 89 e5
RIP
[<ffffffffa0571e2b>] assfail+0x1b/0x20 [xfs]
RSP <ffffc9000a917410>





2016-10-26 01:33:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Oct 25, 2016 at 5:27 PM, Dave Jones <[email protected]> wrote:
>
> DaveC: Do these look like real problems, or is this more "looks like
> random memory corruption" ? It's been a while since I did some stress
> testing on XFS, so these might not be new..

Andy, do you think we could just do some poisoning of the stack as we
free it, to see if that catches anything?

Something truly stupid like just

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -218,6 +218,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
unsigned long flags;
int i;

+ memset(tsk->stack_vm_area->addr, 0xd0, THREAD_SIZE);
local_irq_save(flags);
for (i = 0; i < NR_CACHED_STACKS; i++) {
if (this_cpu_read(cached_stacks[i]))

or similar?

It seems like DaveJ had an easier time triggering these problems with
the stack cache, but they clearly didn't go away when the stack cache
was disabled. So maybe the stack cache just made the reuse more likely
and faster, making the problem show up faster too. But if we actively
poison things, we'll corrupt the free'd stack *immediately* if there
is some stale use..

Completely untested. Maybe there's some reason we can't write to the
whole thing like that?

Linus

2016-10-26 01:39:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds
<[email protected]> wrote:
>
> Completely untested. Maybe there's some reason we can't write to the
> whole thing like that?

That hack boots and seems to work for me, but doesn't show anything.

Dave, mind just trying that oneliner?

Linus

2016-10-26 16:30:29

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Oct 25, 2016 at 06:39:03PM -0700, Linus Torvalds wrote:
> On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Completely untested. Maybe there's some reason we can't write to the
> > whole thing like that?
>
> That hack boots and seems to work for me, but doesn't show anything.
>
> Dave, mind just trying that oneliner?

I gave this a go last thing last night. It crashed within 5 minutes,
but it was one we've already seen (the bad page map trace) with nothing
additional that looked interesting. I rebooted, and tried again and
went to bed.

12 hours later, it's still doing its thing.

Heisenbugs man, literally the worst.

Dave

2016-10-26 16:48:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones <[email protected]> wrote:
>
> I gave this a go last thing last night. It crashed within 5 minutes,
> but it was one we've already seen (the bad page map trace) with nothing
> additional that looked interesting.

Did the bad page map trace have any registers that looked like they
had 0xd0d0d0d0d0d0 in them?

I assume not, but worth checking.

> Heisenbugs man, literally the worst.

I know you already had this in some email, but I lost it. I think you
narrowed it down to a specific set of system calls that seems to
trigger this best. fallocate and xattrs or something?

Linus

2016-10-26 18:18:45

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones <[email protected]> wrote:
> >
> > I gave this a go last thing last night. It crashed within 5 minutes,
> > but it was one we've already seen (the bad page map trace) with nothing
> > additional that looked interesting.
>
> Did the bad page map trace have any registers that looked like they
> had 0xd0d0d0d0d0d0 in them?
>
> I assume not, but worth checking.

Sadly not. In case I did overlook something, here's last night's..

BUG: Bad page state in process kworker/u8:13 pfn:4dae31
page:ffffea00136b8c40 count:0 mapcount:0 mapping:ffff8804f011d6e0 index:0xd1530
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 1207 Comm: kworker/u8:13 Not tainted 4.9.0-rc2-think+ #3
Workqueue: writeback wb_workfn
(flush-btrfs-1)
ffffc900006fb870
ffffffff8130cf3c
ffffea00136b8c40
ffffffff819ff54c
ffffc900006fb898
ffffffff811511af
0000000000000000
ffffea00136b8c40
400000000000000c
ffffc900006fb8a8
ffffffff8115126a
ffffc900006fb8f0
Call Trace:
[<ffffffff8130cf3c>] dump_stack+0x4f/0x73
[<ffffffff811511af>] bad_page+0xbf/0x120
[<ffffffff8115126a>] free_pages_check_bad+0x5a/0x70
[<ffffffff81153b68>] free_hot_cold_page+0x248/0x290
[<ffffffff81153e6b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff8115c87d>] release_pages+0x2bd/0x350
[<ffffffff8115ddb2>] __pagevec_release+0x22/0x30
[<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs]
[<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8115af9c>] do_writepages+0x1c/0x30
[<ffffffff811f6d73>] __writeback_single_inode+0x33/0x180
[<ffffffff811f7568>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f7abb>] wb_writeback+0xeb/0x1f0
[<ffffffff811f80f2>] wb_workfn+0xd2/0x280
[<ffffffff810908a5>] process_one_work+0x1d5/0x490
[<ffffffff81090845>] ? process_one_work+0x175/0x490
[<ffffffff81090ba9>] worker_thread+0x49/0x490
[<ffffffff81090b60>] ? process_one_work+0x490/0x490
[<ffffffff81095d1e>] kthread+0xee/0x110
[<ffffffff81095c30>] ? kthread_park+0x60/0x60
[<ffffffff81790a52>] ret_from_fork+0x22/0x30


> > Heisenbugs man, literally the worst.
>
> I know you already had this in some email, but I lost it. I think you
> narrowed it down to a specific set of system calls that seems to
> trigger this best. fallocate and xattrs or something?

I did. Or so I thought. Then iirc, it ran for a really long time
with those. I'll revisit that just to be sure.

Something else I tried: Trinity is very heavily stressing
the fork path, with new children forking off and doing crazy shit
all the time, segfaulting, repeat. I wrote a small test app that
models all of that sans the "do crazy shit with syscalls", and that
didn't do anything useful in terms of reproducing this.
I'd hoped that would pan out, because I could totally see the relationship
between reusing vmap'd stacks and heavy fork() use.
Perhaps I'm still missing something non-obvious.
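
For what it's worth, the test app was nothing fancier than a tight
fork/segfault/reap loop along these lines (an illustrative sketch, not
the actual code):

#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        for (;;) {
                pid_t pid = fork();

                if (pid == 0) {
                        /* child: touch a bit of stack, then die on a segfault */
                        volatile char buf[8192];

                        buf[4096] = 1;
                        *(volatile int *)0 = buf[4096];
                        _exit(1);               /* not reached */
                }
                if (pid > 0)
                        waitpid(pid, NULL, 0);
        }
        return 0;
}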

Dave

2016-10-26 18:42:18

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote:

> I know you already had this in some email, but I lost it. I think you
> narrowed it down to a specific set of system calls that seems to
> trigger this best. fallocate and xattrs or something?

So I was about to give that a shot again. That this had been running
for 24hrs was bugging me. I ctrl-c'd the current run, and trinity
just sat there, because all of its child processes are stuck in D state.

The stacks show nearly all of them are stuck in sync_inodes_sb

iotop & vmstat shows there is _zero_ io actually happening.
So it's spent most of the night doing next to nothing useful afaict.

Chris ?

Here's the /proc/pid/stack's of those children..


[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
[<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
[<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff811fc294>] utimes_common+0xd4/0x190
[<ffffffff811fc457>] do_utimes+0x107/0x120
[<ffffffff811fc641>] SyS_futimesat+0xa1/0xd0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa008ed32>] btrfs_fallocate+0xb2/0xfd0 [btrfs]
[<ffffffff811c6c3e>] vfs_fallocate+0x13e/0x220
[<ffffffff811c79f3>] SyS_fallocate+0x43/0x80
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
[<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
[<ffffffff811fbcdb>] sync_fs_one_sb+0x1b/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdd0>] sys_sync+0x50/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa008c0f4>] btrfs_file_llseek+0x34/0x290 [btrfs]
[<ffffffff811c9ab5>] SyS_lseek+0x85/0xa0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090a0e>] btrfs_sync_file+0x7e/0x360 [btrfs]
[<ffffffff811fbbe6>] vfs_fsync_range+0x46/0xa0
[<ffffffffa00910e0>] btrfs_file_write_iter+0x3f0/0x550 [btrfs]
[<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811ca162>] do_readv_writev+0x172/0x210
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50
[<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
[<ffffffff811cb592>] SyS_pwritev2+0x12/0x20
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811ea2d4>] __fdget_pos+0x44/0x50
[<ffffffff811de8ec>] SyS_getdents+0x6c/0x110
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff811c752a>] do_truncate+0x4a/0x90
[<ffffffff811c790c>] do_sys_ftruncate.constprop.19+0x10c/0x170
[<ffffffff811c7999>] SyS_ftruncate+0x9/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa0095413>] btrfs_wait_ordered_extents+0x1e3/0x2e0 [btrfs]
[<ffffffffa0095647>] btrfs_wait_ordered_roots+0x137/0x200 [btrfs]
[<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
[<ffffffff811fbcdb>] sync_fs_one_sb+0x1b/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdd0>] sys_sync+0x50/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff81149fbf>] wait_on_page_bit+0xaf/0xc0
[<ffffffff8114a121>] __filemap_fdatawait_range+0x151/0x170
[<ffffffff8114d79c>] filemap_fdatawait_keep_errors+0x1c/0x20
[<ffffffff811f59b3>] sync_inodes_sb+0x273/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs]
[<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811ca162>] do_readv_writev+0x172/0x210
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50
[<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
[<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811ea2d4>] __fdget_pos+0x44/0x50
[<ffffffff811c9a48>] SyS_lseek+0x18/0xa0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs]
[<ffffffff811c8e64>] __vfs_write+0xc4/0x120
[<ffffffff811c9f03>] vfs_write+0xb3/0x1a0
[<ffffffff811cb3d4>] SyS_pwrite64+0x74/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
[<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa009576b>] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
[<ffffffffa008bf5d>] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
[<ffffffffa008d1e8>] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
[<ffffffffa0090e60>] btrfs_file_write_iter+0x170/0x550 [btrfs]
[<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
[<ffffffff811ca162>] do_readv_writev+0x172/0x210
[<ffffffff811ca42a>] vfs_writev+0x3a/0x50
[<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
[<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffa0090d50>] btrfs_file_write_iter+0x60/0x550 [btrfs]
[<ffffffff811c8e64>] __vfs_write+0xc4/0x120
[<ffffffff811c998d>] __kernel_write+0x4d/0xf0
[<ffffffff811f995d>] write_pipe_buf+0x6d/0x80
[<ffffffff811f9cdf>] __splice_from_pipe+0x12f/0x1b0
[<ffffffff811fabdc>] splice_from_pipe+0x4c/0x70
[<ffffffff811fac34>] default_file_splice_write+0x14/0x20
[<ffffffff811f9081>] direct_splice_actor+0x31/0x40
[<ffffffff811f972c>] splice_direct_to_actor+0xcc/0x1e0
[<ffffffff811f98d0>] do_splice_direct+0x90/0xb0
[<ffffffff811cad30>] do_sendfile+0x1b0/0x390
[<ffffffff811cb7cf>] SyS_sendfile64+0x5f/0xd0
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff811f5806>] sync_inodes_sb+0xc6/0x300
[<ffffffff811fbac0>] sync_inodes_one_sb+0x10/0x20
[<ffffffff811cdd3f>] iterate_supers+0xaf/0x100
[<ffffffff811fbdb0>] sys_sync+0x30/0x90
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff81086eee>] do_sigtimedwait+0x16e/0x260
[<ffffffff81087084>] SYSC_rt_sigtimedwait+0xa4/0x100
[<ffffffff810870e9>] SyS_rt_sigtimedwait+0x9/0x10
[<ffffffff8100255c>] do_syscall_64+0x5c/0x170
[<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff


2016-10-26 19:06:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <[email protected]> wrote:
>
> The stacks show nearly all of them are stuck in sync_inodes_sb

That's just wb_wait_for_completion(), and it means that some IO isn't
completing.

There's also a lot of processes waiting for inode_lock(), and a few
waiting for mnt_want_write()

Ignoring those, we have

> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

Don't know this one. There's a couple of them. Could there be some
ABBA deadlock on the ordered roots waiting?

> [<ffffffff8131ae87>] call_rwsem_down_write_failed+0x17/0x30
> [<ffffffffa008ed32>] btrfs_fallocate+0xb2/0xfd0 [btrfs]
> [<ffffffff811c6c3e>] vfs_fallocate+0x13e/0x220
> [<ffffffff811c79f3>] SyS_fallocate+0x43/0x80
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

This one is also inode_lock(), and is interesting only because it's
fallocate(), which has shown up so many times before..

But there are other threads blocked on do_truncate, or
btrfs_file_write_iter instead, or on lseek, so this is not different
for any other reason.

> [<ffffffff81149fbf>] wait_on_page_bit+0xaf/0xc0
> [<ffffffff8114a121>] __filemap_fdatawait_range+0x151/0x170
> [<ffffffff8114d79c>] filemap_fdatawait_keep_errors+0x1c/0x20
> [<ffffffff811f59b3>] sync_inodes_sb+0x273/0x300
> [<ffffffff811fbd37>] sync_filesystem+0x57/0xa0
> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

This is actually waiting on the page. Possibly this is the IO that is
never completing, and keeps the inode lock.

> [<ffffffffa009576b>] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
> [<ffffffffa008bf5d>] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
> [<ffffffffa008d1e8>] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
> [<ffffffffa0090e60>] btrfs_file_write_iter+0x170/0x550 [btrfs]
> [<ffffffff811c97d8>] do_iter_readv_writev+0xa8/0x100
> [<ffffffff811ca162>] do_readv_writev+0x172/0x210
> [<ffffffff811ca42a>] vfs_writev+0x3a/0x50
> [<ffffffff811ca5c0>] do_pwritev+0xb0/0xd0
> [<ffffffff811cb57c>] SyS_pwritev+0xc/0x10
> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25

Hmm. This is the one that *started* the ordered extents (as opposed to
the ones waiting for it).

I dunno. There might be a lost IO. More likely it's the same
corruption that causes it; it just didn't result in an oops this time.

Linus

2016-10-26 20:01:11

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/26/2016 03:06 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <[email protected]> wrote:
>>
>> The stacks show nearly all of them are stuck in sync_inodes_sb
>
> That's just wb_wait_for_completion(), and it means that some IO isn't
> completing.
>
> There's also a lot of processes waiting for inode_lock(), and a few
> waiting for mnt_want_write()
>
> Ignoring those, we have
>
>> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
>> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
>> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
>> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
>> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Don't know this one. There's a couple of them. Could there be some
> ABBA deadlock on the ordered roots waiting?

It's always possible, but we haven't changed anything here.

I've tried a long list of things to reproduce this on my test boxes,
including days of trinity runs and a kernel module to exercise vmalloc,
and thread creation.

Today I turned off every CONFIG_DEBUG_* except for list debugging, and
ran dbench 2048:

[ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
[ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
[ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
[ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
[ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
[ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
[ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
[ 2759.127444] Call Trace:
[ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74
[ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
[ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120
[ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
[ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
[ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580
[ 2759.127444] [<ffffffff814d5b44>] generic_make_request+0x104/0x200
[ 2759.127444] [<ffffffff814d5ca5>] submit_bio+0x65/0x130
[ 2759.127444] [<ffffffff8152a946>] ? __percpu_counter_add+0x96/0xd0
[ 2759.127444] [<ffffffff814260bc>] btrfs_map_bio+0x23c/0x310
[ 2759.127444] [<ffffffff813f42b3>] btrfs_submit_bio_hook+0xd3/0x190
[ 2759.127444] [<ffffffff814117ad>] submit_one_bio+0x6d/0xa0
[ 2759.127444] [<ffffffff8141182e>] flush_epd_write_bio+0x4e/0x70
[ 2759.127444] [<ffffffff81418d8d>] extent_writepages+0x5d/0x70
[ 2759.127444] [<ffffffff813f84e0>] ? btrfs_releasepage+0x50/0x50
[ 2759.127444] [<ffffffff81220ffe>] ? wbc_attach_and_unlock_inode+0x6e/0x170
[ 2759.127444] [<ffffffff813f5047>] btrfs_writepages+0x27/0x30
[ 2759.127444] [<ffffffff81178690>] do_writepages+0x20/0x30
[ 2759.127444] [<ffffffff81167d85>] __filemap_fdatawrite_range+0xb5/0x100
[ 2759.127444] [<ffffffff81168263>] filemap_fdatawrite_range+0x13/0x20
[ 2759.127444] [<ffffffff81405a7b>] btrfs_fdatawrite_range+0x2b/0x70
[ 2759.127444] [<ffffffff81405ba8>] btrfs_sync_file+0x88/0x490
[ 2759.127444] [<ffffffff810751e2>] ? group_send_sig_info+0x42/0x80
[ 2759.127444] [<ffffffff8107527d>] ? kill_pid_info+0x5d/0x90
[ 2759.127444] [<ffffffff8107564a>] ? SYSC_kill+0xba/0x1d0
[ 2759.127444] [<ffffffff811f2638>] ? __sb_end_write+0x58/0x80
[ 2759.127444] [<ffffffff81225b9c>] vfs_fsync_range+0x4c/0xb0
[ 2759.127444] [<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
[ 2759.127444] [<ffffffff81225c1c>] vfs_fsync+0x1c/0x20
[ 2759.127444] [<ffffffff81225c5d>] do_fsync+0x3d/0x70
[ 2759.127444] [<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
[ 2759.127444] [<ffffffff81225cc0>] SyS_fsync+0x10/0x20
[ 2759.127444] [<ffffffff81002b65>] do_syscall_64+0x55/0xd0
[ 2759.127444] [<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
[ 2759.127444] [<ffffffff819ad986>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2759.150635] ---[ end trace 3b5b7e2ef61c3d02 ]---

I put a variant of your suggested patch in place, but my printk never
triggered. Now that I've made it happen once, I'll make sure I can do it
over and over again. This doesn't have the patches that Andy asked DaveJ to
try out yet, but I'll try them once I have a reliable reproducer.

diff --git a/kernel/fork.c b/kernel/fork.c
index 623259f..de95e19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,7 +165,7 @@ void __weak arch_release_thread_stack(unsigned long *stack)
* vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
* flush. Try to minimize the number of calls by caching stacks.
*/
-#define NR_CACHED_STACKS 2
+#define NR_CACHED_STACKS 256
static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
#endif

@@ -173,7 +173,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
#ifdef CONFIG_VMAP_STACK
void *stack;
+ char *p;
int i;
+ int j;

local_irq_disable();
for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -183,7 +185,15 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
continue;
this_cpu_write(cached_stacks[i], NULL);

+ p = s->addr;
+ for (j = 0; j < THREAD_SIZE; j++) {
+ if (p[j] != 'c') {
+ printk_ratelimited(KERN_CRIT "bad poison %c byte %d\n", p[j], j);
+ break;
+ }
+ }
tsk->stack_vm_area = s;
+
local_irq_enable();
return s->addr;
}
@@ -219,6 +229,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
int i;

local_irq_save(flags);
+ memset(tsk->stack_vm_area->addr, 'c', THREAD_SIZE);
for (i = 0; i < NR_CACHED_STACKS; i++) {
if (this_cpu_read(cached_stacks[i]))
continue;

2016-10-26 21:53:23

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/26/2016 04:00 PM, Chris Mason wrote:
>
>
> On 10/26/2016 03:06 PM, Linus Torvalds wrote:
>> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <[email protected]> wrote:
>>>
>>> The stacks show nearly all of them are stuck in sync_inodes_sb
>>
>> That's just wb_wait_for_completion(), and it means that some IO isn't
>> completing.
>>
>> There's also a lot of processes waiting for inode_lock(), and a few
>> waiting for mnt_want_write()
>>
>> Ignoring those, we have
>>
>>> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>>> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
>>> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
>>> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
>>> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
>>> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Don't know this one. There's a couple of them. Could there be some
>> ABBA deadlock on the ordered roots waiting?
>
> It's always possible, but we haven't changed anything here.
>
> I've tried a long list of things to reproduce this on my test boxes,
> including days of trinity runs and a kernel module to exercise vmalloc,
> and thread creation.
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>

This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes.
I've done 30 minutes each with XFS and Ext4 without luck.

This is all in a virtual machine that I can copy on to a bunch of hosts. So I'll get some
parallel tests going tonight to narrow it down.

------------[ cut here ]------------
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
Call Trace:
[<ffffffff814fde0f>] dump_stack+0x53/0x74
[<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
[<ffffffff810648cf>] __warn+0xff/0x120
[<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
[<ffffffff8151c46e>] __list_add+0xbe/0xd0
[<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
[<ffffffff814d5444>] generic_make_request+0x104/0x200
[<ffffffff814d55a5>] submit_bio+0x65/0x130
[<ffffffff8152a256>] ? __percpu_counter_add+0x96/0xd0
[<ffffffff81425c7c>] btrfs_map_bio+0x23c/0x310
[<ffffffff813f3e73>] btrfs_submit_bio_hook+0xd3/0x190
[<ffffffff8141136d>] submit_one_bio+0x6d/0xa0
[<ffffffff814113ee>] flush_epd_write_bio+0x4e/0x70
[<ffffffff8141894d>] extent_writepages+0x5d/0x70
[<ffffffff813f80a0>] ? btrfs_releasepage+0x50/0x50
[<ffffffff81220d0e>] ? wbc_attach_and_unlock_inode+0x6e/0x170
[<ffffffff813f4c07>] btrfs_writepages+0x27/0x30
[<ffffffff811783a0>] do_writepages+0x20/0x30
[<ffffffff81167a95>] __filemap_fdatawrite_range+0xb5/0x100
[<ffffffff81167f73>] filemap_fdatawrite_range+0x13/0x20
[<ffffffff8140563b>] btrfs_fdatawrite_range+0x2b/0x70
[<ffffffff81405768>] btrfs_sync_file+0x88/0x490
[<ffffffff81074ef2>] ? group_send_sig_info+0x42/0x80
[<ffffffff81074f8d>] ? kill_pid_info+0x5d/0x90
[<ffffffff8107535a>] ? SYSC_kill+0xba/0x1d0
[<ffffffff811f2348>] ? __sb_end_write+0x58/0x80
[<ffffffff812258ac>] vfs_fsync_range+0x4c/0xb0
[<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
[<ffffffff8122592c>] vfs_fsync+0x1c/0x20
[<ffffffff8122596d>] do_fsync+0x3d/0x70
[<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
[<ffffffff812259d0>] SyS_fsync+0x10/0x20
[<ffffffff81002b65>] do_syscall_64+0x55/0xd0
[<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
[<ffffffff819ad246>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---

2016-10-26 22:07:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <[email protected]> wrote:
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>
> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
> [ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
> [ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
> [ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
> [ 2759.127444] Call Trace:
> [ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74
> [ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
> [ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120
> [ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
> [ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
> [ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580

Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very
different loads (dbench looks nothing like trinity) really makes me
think that maybe the problem really is in the block plugging after
all.

It very much does not smell like random stack corruption. It's simply
not random enough.

And I just noticed something: I originally thought that this is the
"list_add_tail()" to the plug - which is the only "list_add()" variant
in that function.

But that never made sense, because the whole "but was" isn't a stack
address, and "next" in "list_add_tail()" is basically fixed, and would
have to be the stack.

But I now notice that there's actually another "list_add()" variant
there, and it's the one from __blk_mq_insert_request() that gets
inlined into blk_mq_insert_request(), which then gets inlined into
blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.

And there it makes perfect sense that the "next should be" value is
*not* on the stack.

Chris, if you built with debug info, you can try

./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I
think it's the

list_add_tail(&rq->queuelist, &ctx->rq_list);

in __blk_mq_insert_req_list() (when it's inlined from
blk_sq_make_request(), "at_head" will be false).

So it smells like "&ctx->rq_list" might be corrupt.

And that actually seems much more likely than the "plug" list, because
while the plug is entirely thread-local (and thus shouldn't have any
races), the ctx->rq_list very much is not.

Jens?

For example, should we have a

BUG_ON(ctx != rq->mq_ctx);

in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then
__blk_mq_insert_request() will insert things onto the queues of
rq->mq_ctx.

blk_mq_insert_requests() has similar issues, but there has that BUG_ON().

The locking there really is *very* messy. All the lockers do

spin_lock(&ctx->lock);
...
spin_unlock(&ctx->lock);

but then __blk_mq_insert_request() and __blk_mq_insert_req_list don't
act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those
wrong, you're completely dead.

Now, I'm not seeing why they'd be wrong, and why they'd be associated
with the VMAP_STACK thing, but it could just be an unlucky timing
thing.
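
For reference, the pattern being described looks roughly like this (a
sketch put together from the function and field names in this thread,
not the exact 4.9 source; the BUG_ON is the check suggested above):

	/* callers take one software-queue lock... */
	spin_lock(&ctx->lock);
	BUG_ON(ctx != rq->mq_ctx);	/* suggested sanity check */
	__blk_mq_insert_request(hctx, rq, false);
	spin_unlock(&ctx->lock);

	/* ...but the insert helper walks the request's own ctx: */
	static void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
					     struct request *rq, bool at_head)
	{
		struct blk_mq_ctx *ctx = rq->mq_ctx;	/* not necessarily the locked ctx */

		if (at_head)
			list_add(&rq->queuelist, &ctx->rq_list);
		else
			list_add_tail(&rq->queuelist, &ctx->rq_list);
	}

If rq->mq_ctx ever points at a different per-cpu ctx than the one that
was locked, that list_add_tail() modifies ctx->rq_list with no lock held
on it, which is exactly the sort of race the list debugging reports as
"prev->next should be next".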

Linus

2016-10-26 22:23:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason <[email protected]> wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some
list corruption on ctx->rq_list due to some locking problem, I really
would expect CONFIG_VMAP_STACK to be entirely irrelevant, except
perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
> ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
> ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
> ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
> Call Trace:
> [<ffffffff814fde0f>] dump_stack+0x53/0x74
> [<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
> [<ffffffff810648cf>] __warn+0xff/0x120
> [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
> [<ffffffff8151c46e>] __list_add+0xbe/0xd0
> [<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
> [<ffffffff814d5444>] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think
this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

- a number of tests to verify that 'rq->queuelist' isn't already on
some queue when it is added to a queue

- one test to verify that rq->mq_ctx is the same ctx that we have locked.

I may be completely full of shit, and this patch may be pure garbage
or "obviously will never trigger", but humor me.

Linus


Attachments:
patch.diff (2.01 kB)

2016-10-26 22:40:49

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote:

> Could you try the attached patch? It adds a couple of sanity tests:
>
> - a number of tests to verify that 'rq->queuelist' isn't already on
> some queue when it is added to a queue
>
> - one test to verify that rq->mq_ctx is the same ctx that we have locked.
>
> I may be completely full of shit, and this patch may be pure garbage
> or "obviously will never trigger", but humor me.

I gave it a shot too for shits & giggles.
This falls out during boot.

[ 9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[ 9.271391] ------------[ cut here ]------------
[ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0
[ 9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4
[ 9.300106] ffffc90000013848
[ 9.307353] ffffffff8130d27c
[ 9.307420] 0000000000000000
[ 9.307456] 0000000000000000
[ 9.307494] ffffc90000013888
[ 9.314780] ffffffff81077a41
[ 9.314847] 0000049d00013898
[ 9.314884] 0000000000000001
[ 9.314922] 0000000000000002
[ 9.322213] ffffe8ffffc06600
[ 9.322280] ffffe8ffff606600
[ 9.322317] ffff8805015c2e80
[ 9.322355] Call Trace:
[ 9.329689] [<ffffffff8130d27c>] dump_stack+0x4f/0x73
[ 9.337208] [<ffffffff81077a41>] __warn+0xc1/0xe0
[ 9.344730] [<ffffffff81077b18>] warn_slowpath_null+0x18/0x20
[ 9.352282] [<ffffffff812f7975>] blk_sq_make_request+0x465/0x4a0
[ 9.359816] [<ffffffff812eb44a>] ? generic_make_request+0xca/0x210
[ 9.367367] [<ffffffff812eb457>] generic_make_request+0xd7/0x210
[ 9.374931] [<ffffffff812eb5f9>] submit_bio+0x69/0x120
[ 9.382488] [<ffffffff812004ef>] ? submit_bh_wbc+0x16f/0x1e0
[ 9.390083] [<ffffffff812004ef>] submit_bh_wbc+0x16f/0x1e0
[ 9.397673] [<ffffffff81200bfe>] submit_bh+0xe/0x10
[ 9.405205] [<ffffffff81255eac>] __ext4_get_inode_loc+0x1ac/0x3d0
[ 9.412795] [<ffffffff81258f4b>] ext4_iget+0x6b/0xb70
[ 9.420366] [<ffffffff811d6018>] ? lookup_slow+0xf8/0x1f0
[ 9.427947] [<ffffffff81259a7a>] ext4_iget_normal+0x2a/0x30
[ 9.435541] [<ffffffff81264734>] ext4_lookup+0x114/0x230

2016-10-26 22:51:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones <[email protected]> wrote:
>
> I gave it a shot too for shits & giggles.
> This falls out during boot.
>
> [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0

Hmm. That's the

WARN_ON_ONCE(rq->mq_ctx != ctx);

that I added to blk_mq_merge_queue_io(), and I really think that
warning is valid, and the fact that it triggers shows that something
is wrong with locking.

We just did a

spin_lock(&ctx->lock);

and that lock is *supposed* to protect the __blk_mq_insert_request(),
but that uses rq->mq_ctx.

So if rq->mq_ctx != ctx, then we're locking the wrong context.

Jens - please explain to me why I'm wrong.

Or maybe I actually might have found the problem? In which case please
send me a patch that fixes it ;)

Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two, since right now it can trigger both
for the

blk_mq_bio_to_request(rq, bio);

path _and_ for the

if (!blk_mq_attempt_merge(q, ctx, bio)) {
blk_mq_bio_to_request(rq, bio);
goto insert_rq;

path. If you split it into two: one before that "insert_rq:" label,
and one before the "goto insert_rq" thing, then we could see if it is
just one of the blk_mq_merge_queue_io() cases (or both) that is
broken..
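
Something like this, roughly (a sketch of the two paths from the
fragments quoted above; the surrounding function body is an assumption,
not copied from the source):

	if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) {
		blk_mq_bio_to_request(rq, bio);
		spin_lock(&ctx->lock);
		WARN_ON_ONCE(rq->mq_ctx != ctx);	/* copy 1: direct insert path */
insert_rq:
		__blk_mq_insert_request(hctx, rq, false);
		spin_unlock(&ctx->lock);
		return false;
	} else {
		struct request_queue *q = hctx->queue;

		spin_lock(&ctx->lock);
		if (!blk_mq_attempt_merge(q, ctx, bio)) {
			blk_mq_bio_to_request(rq, bio);
			WARN_ON_ONCE(rq->mq_ctx != ctx);	/* copy 2: failed-merge path */
			goto insert_rq;
		}
		/* bio merged into an existing request */
		spin_unlock(&ctx->lock);
		return true;
	}

That way the line number in the warning tells us which of the two paths
handed over a request with a stale mq_ctx.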

Linus

2016-10-26 22:54:17

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 04:40 PM, Dave Jones wrote:
> On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote:
>
> > Could you try the attached patch? It adds a couple of sanity tests:
> >
> > - a number of tests to verify that 'rq->queuelist' isn't already on
> > some queue when it is added to a queue
> >
> > - one test to verify that rq->mq_ctx is the same ctx that we have locked.
> >
> > I may be completely full of shit, and this patch may be pure garbage
> > or "obviously will never trigger", but humor me.
>
> I gave it a shot too for shits & giggles.
> This falls out during boot.
>
> [ 9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
> [ 9.271391] ------------[ cut here ]------------
> [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0
> [ 9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4

Very odd, don't immediately see how that can happen. For testing, can
you try and add the below patch? Just curious if that fixes the list
corruption. Thing is, I don't see how ->mq_ctx and ctx are different in
this path, but I can debug that on the side.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..73b9462aa21f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1165,9 +1165,10 @@ static inline bool hctx_allow_merges(struct
blk_mq_hw_ctx *hctx)
}

static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_ctx *ctx,
struct request *rq, struct bio *bio)
{
+ struct blk_mq_ctx *ctx = rq->mq_ctx;
+
if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) {
blk_mq_bio_to_request(rq, bio);
spin_lock(&ctx->lock);
@@ -1338,7 +1339,7 @@ static blk_qc_t blk_mq_make_request(struct
request_queue *q, struct bio *bio)
goto done;
}

- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+ if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
* an ASYNC request, just ensure that we run it later on. The
@@ -1416,7 +1417,7 @@ static blk_qc_t blk_sq_make_request(struct
request_queue *q, struct bio *bio)
return cookie;
}

- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+ if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
* an ASYNC request, just ensure that we run it later on. The

--
Jens Axboe

2016-10-26 22:56:46

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 04:51 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones <[email protected]> wrote:
>>
>> I gave it a shot too for shits & giggles.
>> This falls out during boot.
>>
>> [ 9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0
>
> Hmm. That's the
>
> WARN_ON_ONCE(rq->mq_ctx != ctx);
>
> that I added to blk_mq_merge_queue_io(), and I really think that
> warning is valid, and the fact that it triggers shows that something
> is wrong with locking.
>
> We just did a
>
> spin_lock(&ctx->lock);
>
> and that lock is *supposed* to protect the __blk_mq_insert_request(),
> but that uses rq->mq_ctx.
>
> So if rq->mq_ctx != ctx, then we're locking the wrong context.
>
> Jens - please explain to me why I'm wrong.
>
> Or maybe I actually might have found the problem? In which case please
> send me a patch that fixes it ;)

I think you're pretty close, the two should not be different and I don't
immediately see how. I'll run some testing here, should be easier with
this knowledge.

> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two, since right now it can trigger both
> for the
>
> blk_mq_bio_to_request(rq, bio);
>
> path _and_ for the
>
> if (!blk_mq_attempt_merge(q, ctx, bio)) {
> blk_mq_bio_to_request(rq, bio);
> goto insert_rq;

And just in case I can't trigger, would be interesting to add a call to
blk_rq_dump_flags() as well, in case this is some special request.


--
Jens Axboe

2016-10-26 22:58:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
<[email protected]> wrote:
>
> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two

I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and that's what I have too, so I'm not sure
what the required trigger would be.

Linus

2016-10-26 23:01:23

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote:
> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two, since right now it can trigger both
> for the
>
> blk_mq_bio_to_request(rq, bio);
>
> path _and_ for the
>
> if (!blk_mq_attempt_merge(q, ctx, bio)) {
> blk_mq_bio_to_request(rq, bio);
> goto insert_rq;
>
> path. If you split it into two: one before that "insert_rq:" label,
> and one before the "goto insert_rq" thing, then we could see if it is
> just one of the blk_mq_merge_queue_io() cases (or both) that is
> broken..

It's the latter of the two.

[ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191
blk_sq_make_request+0x320/0x4d0

Dave


2016-10-26 23:05:44

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 05:01 PM, Dave Jones wrote:
> On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote:
> > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> > blk_mq_merge_queue_io() into two, since right now it can trigger both
> > for the
> >
> > blk_mq_bio_to_request(rq, bio);
> >
> > path _and_ for the
> >
> > if (!blk_mq_attempt_merge(q, ctx, bio)) {
> > blk_mq_bio_to_request(rq, bio);
> > goto insert_rq;
> >
> > path. If you split it into two: one before that "insert_rq:" label,
> > and one before the "goto insert_rq" thing, then we could see if it is
> > just one of the blk_mq_merge_queue_io() cases (or both) that is
> > broken..
>
> It's the latter of the two.
>
> [ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191
> blk_sq_make_request+0x320/0x4d0

I would expect that - so normal request, since we know merging is on. I
just sent out a patch, can you give that a whirl?

Keep Linus' debugging patch, and apply it on top. Below as well, for
reference.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..80a9c45a9235 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct
request_queue *q,
blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

- hctx->queued++;
- data->hctx = hctx;
- data->ctx = ctx;
+ data->hctx->queued++;
return rq;
}


--
Jens Axboe

2016-10-26 23:05:49

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 04:58 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
>> blk_mq_merge_queue_io() into two
>
> I did that myself too, since Dave sees this during boot.
>
> But I'm not getting the warning ;(
>
> Dave gets it with ext4, and thats' what I have too, so I'm not sure
> what the required trigger would be.

Actually, I think I see what might trigger it. You are on nvme, iirc,
and that has a deep queue. Dave, are you testing on a sata drive or
something similar with a shallower queue depth? If we end up sleeping
for a request, I think we could trigger data->ctx being different.
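
The way I'm reading it (an assumption on my part, not verified against
the source): if __blk_mq_alloc_request() has to wait for a tag, the
allocator re-reads the per-cpu ctx/hctx after waking up and stores them
in alloc_data, but blk_mq_map_request() then hands back the stale locals
it sampled before going to sleep:

	ctx = blk_mq_get_ctx(q);		/* per-cpu software queue */
	...
	blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
	rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
	/* on a shallow queue this can block; after waking (possibly on
	 * another CPU) alloc_data.ctx/alloc_data.hctx are updated and
	 * rq->mq_ctx is set from the new ctx... */

	data->hctx = hctx;	/* ...but these are the stale locals */
	data->ctx = ctx;	/* != rq->mq_ctx if we slept */

so the caller ends up locking data->ctx while the insert puts the
request on rq->mq_ctx's list.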

Dave, can you hit the warnings with this? Totally untested...


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..80a9c45a9235 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct
request_queue *q,
blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

- hctx->queued++;
- data->hctx = hctx;
- data->ctx = ctx;
+ data->hctx->queued++;
return rq;
}


--
Jens Axboe

2016-10-26 23:07:14

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:
> On 10/26/2016 04:58 PM, Linus Torvalds wrote:
> > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
> > <[email protected]> wrote:
> >>
> >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> >> blk_mq_merge_queue_io() into two
> >
> > I did that myself too, since Dave sees this during boot.
> >
> > But I'm not getting the warning ;(
> >
> > Dave gets it with ext4, and thats' what I have too, so I'm not sure
> > what the required trigger would be.
>
> Actually, I think I see what might trigger it. You are on nvme, iirc,
> and that has a deep queue. Dave, are you testing on a sata drive or
> something similar with a shallower queue depth? If we end up sleeping
> for a request, I think we could trigger data->ctx being different.

yeah, just regular sata. I've been meaning to put an ssd in that box for
a while, but now it sounds like my procrastination may have paid off.

> Dave, can you hit the warnings with this? Totally untested...

Coming up..

Dave

2016-10-26 23:08:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <[email protected]> wrote:
>
> Actually, I think I see what might trigger it. You are on nvme, iirc,
> and that has a deep queue.

Yes. I have long since moved on from slow disks, so all my systems are
not just flash, but m.2 nvme ssd's.

So at least that could explain why Dave sees it at bootup but I don't.

Linus

2016-10-26 23:15:28

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote:
>On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <[email protected]> wrote:
>>
>> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
>> ran dbench 2048:
>>
>> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
>> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
>> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
>> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
>> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
>> [ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
>> [ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
>> [ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
>> [ 2759.127444] Call Trace:
>> [ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74
>> [ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
>> [ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120
>> [ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
>> [ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
>> [ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580
>
>Ok, that's definitely the same one that Dave started out seeing.
>
>The fact that it is that reliable - two different machines, two very
>different loads (dbench looks nothing like trinity) really makes me
>think that maybe the problem really is in the block plugging after
>all.
>
>It very much does not smell like random stack corruption. It's simply
>not random enough.

Agreed. I'd feel better if I could trigger this outside of btrfs, even
though Dave Chinner hit something very similar on xfs. I'll peel off
another test machine for a long XFS run.

>
>And I just noticed something: I originally thought that this is the
>"list_add_tail()" to the plug - which is the only "list_add()" variant
>in that function.
>
>But that never made sense, because the whole "but was" isn't a stack
>address, and "next" in "list_add_tail()" is basically fixed, and would
>have to be the stack.
>
>But I now notice that there's actually another "list_add()" variant
>there, and it's the one from __blk_mq_insert_request() that gets
>inlined into blk_mq_insert_request(), which then gets inlined into
>blk_mq_make_request().
>
>And that actually makes some sense just looking at the offsets too:
>
> blk_sq_make_request+0x388/0x580
>
>so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.
>
>And there it makes perfect sense that the "next should be" value is
>*not* on the stack.
>
>Chris, if you built with debug info, you can try
>
> ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388
>
>to get what line that blk_sq_make_request+0x388 address actually is. I
>think it's the
>
> list_add_tail(&rq->queuelist, &ctx->rq_list);
>
>in __blk_mq_insert_req_list() (when it's inlined from
>blk_sq_make_request(), "at_head" will be false.
>
>So it smells like "&ctx->rq_list" might be corrupt.

I'm running your current git here, so these line numbers should line up
for you:

blk_sq_make_request+0x388/0x578:
__blk_mq_insert_request at block/blk-mq.c:1049
(inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175
(inlined by) blk_sq_make_request at block/blk-mq.c:1419

The fsync path in the WARN doesn't have any plugs that I can see, so it's
not surprising that we're not in the plugging path. I'm here:

if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
* an ASYNC request, just ensure that we run it later on. The
* latter allows for merging opportunities and more efficient
* dispatching.
*/

I'll try the debugging patch you sent in the other email.

-chris

2016-10-26 23:20:39

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:
>On 10/26/2016 04:58 PM, Linus Torvalds wrote:
>>On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
>><[email protected]> wrote:
>>>
>>>Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
>>>blk_mq_merge_queue_io() into two
>>
>>I did that myself too, since Dave sees this during boot.
>>
>>But I'm not getting the warning ;(
>>
>>Dave gets it with ext4, and thats' what I have too, so I'm not sure
>>what the required trigger would be.
>
>Actually, I think I see what might trigger it. You are on nvme, iirc,
>and that has a deep queue. Dave, are you testing on a sata drive or
>something similar with a shallower queue depth? If we end up sleeping
>for a request, I think we could trigger data->ctx being different.
>
>Dave, can you hit the warnings with this? Totally untested...

Confirmed, totally untested ;) Don't try this one at home folks
(working this out with Jens offlist)

BUG: unable to handle kernel paging request at 0000000002411200
IP: [<ffffffff819acff2>] _raw_spin_lock+0x22/0x40
PGD 12840a067
PUD 128446067
PMD 0

Oops: 0002 [#1] PREEMPT SMP
Modules linked in: virtio_blk(+)
CPU: 4 PID: 125 Comm: modprobe Not tainted
4.9.0-rc2-00041-g811d54d-dirty #320
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24
04/01/2014
task: ffff88013849aac0 task.stack: ffff8801293d8000
RIP: 0010:[<ffffffff819acff2>] [<ffffffff819acff2>]
_raw_spin_lock+0x22/0x40
RSP: 0018:ffff8801293db278 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000002411200 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88013a5c1048 RDI: 0000000000000000
RBP: ffff8801293db288 R08: 0000000000000005 R09: ffff880128449380
R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000076 R15: ffff8801293b6a80
FS: 00007f1a2a9cdb40(0000) GS:ffff88013fd00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002411200 CR3: 000000013a5d1000 CR4: 00000000000406e0
Stack:
ffff8801293db2d0 ffff880128488000 ffff8801293db348 ffffffff814debff
00ff8801293db2c8 ffff8801293db338 ffff8801284888c0 ffff8801284888b8
000060fec00004f9 0000000002411200 ffff880128f810c0 ffff880128f810c0
Call Trace:
[<ffffffff814debff>] blk_sq_make_request+0x34f/0x580
[<ffffffff8116b005>] ? mempool_alloc_slab+0x15/0x20
[<ffffffff814d5444>] generic_make_request+0x104/0x200
[<ffffffff814d55a5>] submit_bio+0x65/0x130
[<ffffffff8122a06e>] submit_bh_wbc+0x16e/0x210
[<ffffffff8122a123>] submit_bh+0x13/0x20
[<ffffffff8122b075>] block_read_full_page+0x205/0x3d0
[<ffffffff8122cf00>] ? I_BDEV+0x20/0x20
[<ffffffff8117a1fe>] ? lru_cache_add+0xe/0x10
[<ffffffff81167502>] ? add_to_page_cache_lru+0x92/0xf0
[<ffffffff81166c41>] ? __page_cache_alloc+0xd1/0xe0
[<ffffffff8122df38>] blkdev_readpage+0x18/0x20
[<ffffffff8116ada6>] do_read_cache_page+0x1c6/0x380
[<ffffffff8122df20>] ? blkdev_writepages+0x10/0x10
[<ffffffff811c6662>] ? alloc_pages_current+0xb2/0x1c0
[<ffffffff8116af92>] read_cache_page+0x12/0x20
[<ffffffff814e6b11>] read_dev_sector+0x31/0xb0
[<ffffffff814eb31d>] read_lba+0xbd/0x130
[<ffffffff814eb682>] find_valid_gpt+0xa2/0x580
[<ffffffff814ebb60>] ? find_valid_gpt+0x580/0x580
[<ffffffff814ebbc7>] efi_partition+0x67/0x3d0
[<ffffffff81509cfa>] ? vsnprintf+0x2aa/0x470
[<ffffffff81509f64>] ? snprintf+0x34/0x40
[<ffffffff814ebb60>] ? find_valid_gpt+0x580/0x580
[<ffffffff814e8f46>] check_partition+0x106/0x1e0
[<ffffffff814e741c>] rescan_partitions+0x8c/0x270
[<ffffffff8122ef98>] __blkdev_get+0x328/0x3f0
[<ffffffff8122f0b4>] blkdev_get+0x54/0x320
[<ffffffff8120be7a>] ? unlock_new_inode+0x5a/0x80
[<ffffffff8122dc0f>] ? bdget+0xff/0x110
[<ffffffff814e4d16>] device_add_disk+0x3c6/0x450
[<ffffffff8151970a>] ? ioread8+0x1a/0x40
[<ffffffff815bc68e>] ? vp_get+0x4e/0x70
[<ffffffffa0001540>] virtblk_probe+0x460/0x708 [virtio_blk]
[<ffffffff815bc556>] ? vp_finalize_features+0x36/0x50
[<ffffffff815b8c82>] virtio_dev_probe+0x132/0x1e0
[<ffffffff81619709>] driver_probe_device+0x1a9/0x2d0
[<ffffffff819aa9e4>] ? mutex_lock+0x24/0x50
[<ffffffff816198ed>] __driver_attach+0xbd/0xc0
[<ffffffff81619830>] ? driver_probe_device+0x2d0/0x2d0
[<ffffffff81619830>] ? driver_probe_device+0x2d0/0x2d0
[<ffffffff816178aa>] bus_for_each_dev+0x8a/0xb0
[<ffffffff8161920e>] driver_attach+0x1e/0x20
[<ffffffff81618bf6>] bus_add_driver+0x1b6/0x230
[<ffffffff8161a200>] driver_register+0x60/0xe0
[<ffffffff815b8f50>] register_virtio_driver+0x20/0x40
[<ffffffffa0004057>] init+0x57/0x81 [virtio_blk]
[<ffffffffa0004000>] ? 0xffffffffa0004000
[<ffffffffa0004000>] ? 0xffffffffa0004000
[<ffffffff810003a6>] do_one_initcall+0x46/0x150
[<ffffffff810e923a>] do_init_module+0x6a/0x210
[<ffffffff811b32b7>] ? vfree+0x37/0x90
[<ffffffff810ebe68>] load_module+0x1638/0x1860
[<ffffffff810e83f0>] ? do_free_init+0x30/0x30
[<ffffffff811f6da4>] ? kernel_read_file_from_fd+0x54/0x90
[<ffffffff810ec152>] SYSC_finit_module+0xc2/0xd0
[<ffffffff810ec16e>] SyS_finit_module+0xe/0x10
[<ffffffff819ad1a0>] entry_SYSCALL_64_fastpath+0x13/0x94
Code: 89 df e8 a2 52 70 ff eb e6 55 48 89 e5 53 48 83 ec 08 66 66 66 66
90 48 89 fb bf 01 00 00 00 e8 95 53 6e ff 31 c0 ba 01 00 00 00 <f0> 0f
b1 13 85 c0 75 07 48 83 c4 08 5b c9 c3 89 c6 48 89 df e8
RIP [<ffffffff819acff2>] _raw_spin_lock+0x22/0x40
RSP <ffff8801293db278>
CR2: 0000000002411200
---[ end trace e8cb117e64947621 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

-chris

2016-10-26 23:21:32

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 05:08 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <[email protected]> wrote:
>>
>> Actually, I think I see what might trigger it. You are on nvme, iirc,
>> and that has a deep queue.
>
> Yes. I have long since moved on from slow disks, so all my systems are
> not just flash, but m.2 nvme ssd's.
>
> So at least that could explain why Dave sees it at bootup but I don't.

Yep, you'd never sleep during normal boot. The original patch had a
problem, though... This one should be better. Too many 'data's; we
still need to assign ctx/hctx, but we should use the current ones, not
the original ones.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..e56fec187ba6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct
request_queue *q,
blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

- hctx->queued++;
- data->hctx = hctx;
- data->ctx = ctx;
+ data->hctx = alloc_data.hctx;
+ data->ctx = alloc_data.ctx;
+ data->hctx->queued++;
return rq;
}


--
Jens Axboe

2016-10-26 23:23:05

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 05:19 PM, Chris Mason wrote:
> On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:
>> On 10/26/2016 04:58 PM, Linus Torvalds wrote:
>>> On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>>
>>>> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
>>>> blk_mq_merge_queue_io() into two
>>>
>>> I did that myself too, since Dave sees this during boot.
>>>
>>> But I'm not getting the warning ;(
>>>
>>> Dave gets it with ext4, and thats' what I have too, so I'm not sure
>>> what the required trigger would be.
>>
>> Actually, I think I see what might trigger it. You are on nvme, iirc,
>> and that has a deep queue. Dave, are you testing on a sata drive or
>> something similar with a shallower queue depth? If we end up sleeping
>> for a request, I think we could trigger data->ctx being different.
>>
>> Dave, can you hit the warnings with this? Totally untested...
>
> Confirmed, totally untested ;) Don't try this one at home folks
> (working this out with Jens offlist)

Ahem, I did say untested! The one I just sent in reply to Linus should
be better. Though that one is completely untested as well...
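
For what it's worth, the crash above is consistent with the first patch
leaving data->hctx/data->ctx uninitialized: the caller passes an
on-stack struct, and the assignments that patch deleted were the only
thing filling it in. Roughly (a sketch, not the exact source):

	struct blk_map_ctx data;		/* uninitialized on the caller's stack */
	struct request *rq;

	rq = blk_mq_map_request(q, bio, &data);	/* v1 no longer fills data in */
	...
	/* the inlined spin_lock(&ctx->lock) in here then runs on a
	 * garbage pointer, hence the fault in _raw_spin_lock above */
	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
		...
	}

The second patch assigns data->hctx/data->ctx from alloc_data before
they are used, which should avoid both that oops and the original
ctx mismatch.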

--
Jens Axboe

2016-10-26 23:38:45

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 05:20:01PM -0600, Jens Axboe wrote:
>On 10/26/2016 05:08 PM, Linus Torvalds wrote:
>>On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe <[email protected]> wrote:
>>>
>>>Actually, I think I see what might trigger it. You are on nvme, iirc,
>>>and that has a deep queue.
>>
>>Yes. I have long since moved on from slow disks, so all my systems are
>>not just flash, but m.2 nvme ssd's.
>>
>>So at least that could explain why Dave sees it at bootup but I don't.
>
>Yep, you'd never sleep during normal boot. The original patch had a
>problem, though... This one should be better. Too many 'data's, we'll
>still need to assign ctx/hctx, we should just use the current ones, not
>the original ones.
>
>diff --git a/block/blk-mq.c b/block/blk-mq.c
>index ddc2eed64771..e56fec187ba6 100644
>--- a/block/blk-mq.c
>+++ b/block/blk-mq.c
>@@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct
>request_queue *q,
> blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
> rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
>
>- hctx->queued++;
>- data->hctx = hctx;
>- data->ctx = ctx;
>+ data->hctx = alloc_data.hctx;
>+ data->ctx = alloc_data.ctx;
>+ data->hctx->queued++;
> return rq;
> }

This made it through an entire dbench 2048 run on btrfs. My script has
it running in a loop, but this is farther than I've gotten before.
Looking great so far.

-chris

2016-10-26 23:48:01

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:

> >- hctx->queued++;
> >- data->hctx = hctx;
> >- data->ctx = ctx;
> >+ data->hctx = alloc_data.hctx;
> >+ data->ctx = alloc_data.ctx;
> >+ data->hctx->queued++;
> > return rq;
> > }
>
> This made it through an entire dbench 2048 run on btrfs. My script has
> it running in a loop, but this is farther than I've gotten before.
> Looking great so far.

Fixed the splat during boot for me too.
Now the fun part, let's see if it fixed the 'weird shit' that Trinity
was stumbling on.

Dave


2016-10-27 00:01:12

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/26/2016 05:47 PM, Dave Jones wrote:
> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:
>
> > >- hctx->queued++;
> > >- data->hctx = hctx;
> > >- data->ctx = ctx;
> > >+ data->hctx = alloc_data.hctx;
> > >+ data->ctx = alloc_data.ctx;
> > >+ data->hctx->queued++;
> > > return rq;
> > > }
> >
> > This made it through an entire dbench 2048 run on btrfs. My script has
> > it running in a loop, but this is farther than I've gotten before.
> > Looking great so far.
>
> Fixed the splat during boot for me too.
> Now the fun part, let's see if it fixed the 'weird shit' that Trinity
> was stumbling on.

Let's let the testing simmer overnight, then I'll turn this into a real
patch tomorrow and get it submitted.

--
Jens Axboe

2016-10-27 05:41:41

by Dave Chinner

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote:
> DaveC: Do these look like real problems, or is this more "looks like
> random memory corruption" ? It's been a while since I did some stress
> testing on XFS, so these might not be new..
>
> XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2938
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:113!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6
> task: ffff8804f4658040 task.stack: ffff88050568c000
> RIP: 0010:[<ffffffffa02d3e2b>]
> [<ffffffffa02d3e2b>] assfail+0x1b/0x20 [xfs]
> RSP: 0000:ffff88050568f9e8 EFLAGS: 00010282
> RAX: 00000000ffffffea RBX: 0000000000000046 RCX: 0000000000000001
> RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa02fe34d
> RBP: ffff88050568f9e8 R08: 0000000000000000 R09: 0000000000000000
> R10: 000000000000000a R11: f000000000000000 R12: ffff88050568fb44
> R13: 00000000000000f3 R14: ffff8804f292bf88 R15: 000ffffffffe0046
> FS: 00007fe2ddfdfb40(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fe2dbabd000 CR3: 00000004f461f000 CR4: 00000000001406e0
> Stack:
> ffff88050568fa88 ffffffffa027ccee fffffffffffffff9 ffff8804f16fd8b0
> 0000000000003ffa 0000000000000032 ffff8804f292bf40 0000000000004976
> 000ffffffffe0008 00000000000004fd ffff880400000000 0000000000005107
> Call Trace:
> [<ffffffffa027ccee>] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs]
> [<ffffffffa027f2d4>] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs]
> [<ffffffffa02cadd7>] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs]
> [<ffffffffa02cb0d1>] xfs_file_iomap_begin+0x181/0x270 [xfs]
> [<ffffffff8122e573>] iomap_apply+0x53/0x100
> [<ffffffff8122e68b>] iomap_file_buffered_write+0x6b/0x90

All this iomap code is new, so it's entirely possible that this is a
new bug. The assert failure is indicative that the delalloc extent's
metadata reservation grew when we expected it to stay the same or
shrink.

> XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 309
...
> [<ffffffffa057abe1>] xfs_trans_mod_sb+0x241/0x280 [xfs]
> [<ffffffffa0548eff>] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs]
> [<ffffffffa050aa3d>] xfs_alloc_ag_vextent+0x23d/0x300 [xfs]
> [<ffffffffa050bb1b>] xfs_alloc_vextent+0x5fb/0x6d0 [xfs]
> [<ffffffffa051c1b4>] xfs_bmap_btalloc+0x304/0x8e0 [xfs]
> [<ffffffffa054648e>] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs]
> [<ffffffffa051c8db>] xfs_bmap_alloc+0x2b/0x40 [xfs]
> [<ffffffffa051dc30>] xfs_bmapi_write+0x640/0x1210 [xfs]
> [<ffffffffa0569326>] xfs_iomap_write_allocate+0x166/0x350 [xfs]
> [<ffffffffa05540b0>] xfs_map_blocks+0x1b0/0x260 [xfs]
> [<ffffffffa0554beb>] xfs_do_writepage+0x23b/0x730 [xfs]

And that's indicative of a delalloc metadata reservation being
being too small and so we're allocating unreserved blocks.

Different symptoms, same underlying cause, I think.

I see the latter assert from time to time in my testing, but it's
not common (maybe once a month) and I've never been able to track it
down. However, it doesn't affect production systems unless they hit
ENOSPC hard enough that it causes the critical reserve pool to be
exhausted iand so the allocation fails. That's extremely rare -
usually takes a several hundred processes all trying to write as had
as they can concurrently and to all slip through the ENOSPC
detection without the correct metadata reservations and all require
multiple metadata blocks to be allocated durign writeback...

If you've got a way to trigger it quickly and reliably, that would
be helpful...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-27 06:34:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: bio linked list corruption.

> Dave, can you hit the warnings with this? Totally untested...

Can we just kill off the unhelpful blk_map_ctx structure, e.g.:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed..d74a74a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1190,21 +1190,15 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
}
}

-struct blk_map_ctx {
- struct blk_mq_hw_ctx *hctx;
- struct blk_mq_ctx *ctx;
-};
-
static struct request *blk_mq_map_request(struct request_queue *q,
struct bio *bio,
- struct blk_map_ctx *data)
+ struct blk_mq_alloc_data *data)
{
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct request *rq;
int op = bio_data_dir(bio);
int op_flags = 0;
- struct blk_mq_alloc_data alloc_data;

blk_queue_enter_live(q);
ctx = blk_mq_get_ctx(q);
@@ -1214,12 +1208,10 @@ static struct request *blk_mq_map_request(struct request_queue *q,
op_flags |= REQ_SYNC;

trace_block_getrq(q, bio, op);
- blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
- rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
+ blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+ rq = __blk_mq_alloc_request(data, op, op_flags);

- hctx->queued++;
- data->hctx = hctx;
- data->ctx = ctx;
+ data->hctx->queued++;
return rq;
}

@@ -1267,7 +1259,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = rw_is_sync(bio_op(bio), bio->bi_opf);
const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
- struct blk_map_ctx data;
+ struct blk_mq_alloc_data data;
struct request *rq;
unsigned int request_count = 0;
struct blk_plug *plug;
@@ -1363,7 +1355,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
struct blk_plug *plug;
unsigned int request_count = 0;
- struct blk_map_ctx data;
+ struct blk_mq_alloc_data data;
struct request *rq;
blk_qc_t cookie;


2016-10-27 14:09:27

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.



On 10/26/2016 08:00 PM, Jens Axboe wrote:
> On 10/26/2016 05:47 PM, Dave Jones wrote:
>> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:
>>
>> > >- hctx->queued++;
>> > >- data->hctx = hctx;
>> > >- data->ctx = ctx;
>> > >+ data->hctx = alloc_data.hctx;
>> > >+ data->ctx = alloc_data.ctx;
>> > >+ data->hctx->queued++;
>> > > return rq;
>> > > }
>> >
>> > This made it through an entire dbench 2048 run on btrfs. My script
>> has
>> > it running in a loop, but this is farther than I've gotten before.
>> > Looking great so far.
>>
>> Fixed the splat during boot for me too.
>> Now the fun part, let's see if it fixed the 'weird shit' that Trinity
>> was stumbling on.
>
> Let's let the testing simmer overnight, then I'll turn this into a real
> patch tomorrow and get it submitted.
>

I ran all night on both btrfs and xfs. XFS came out clean, but btrfs hit
the WARN_ON below. I hit it a few times with Jens' patch, always the
same warning. It's pretty obviously a btrfs bug; we're not cleaning up
this list properly during fsync. I tried a v1 of a btrfs fix overnight,
but I see where it was incomplete now and will re-run.

For the blk-mq bug, I think we got it!

Tested-by: always-blaming-jens-from-now-on <[email protected]>

WARNING: CPU: 5 PID: 16163 at lib/list_debug.c:62 __list_del_entry+0x86/0xd0
list_del corruption. next->prev should be ffff8801196d3be0, but was ffff88010fc63308
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper i2c_piix4 lrw i2c_core gf128mul ablk_helper virtio_net serio_raw button pcspkr floppy cryptd sch_fq_codel autofs4 virtio_blk
CPU: 5 PID: 16163 Comm: dbench Not tainted 4.9.0-rc2-00041-g811d54d-dirty #322
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
ffff8801196d3a68 ffffffff814fde3f ffffffff8151c356 ffff8801196d3ac8
ffff8801196d3ac8 0000000000000000 ffff8801196d3ab8 ffffffff810648cf
dead000000000100 0000003e813bfc4a ffff8801196d3b98 ffff880122b5c800
Call Trace:
[<ffffffff814fde3f>] dump_stack+0x53/0x74
[<ffffffff8151c356>] ? __list_del_entry+0x86/0xd0
[<ffffffff810648cf>] __warn+0xff/0x120
[<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
[<ffffffff8151c356>] __list_del_entry+0x86/0xd0
[<ffffffff8143618d>] btrfs_sync_log+0x75d/0xbd0
[<ffffffff8143cfa7>] ? btrfs_log_inode_parent+0x547/0xbb0
[<ffffffff819ad01b>] ? _raw_spin_lock+0x1b/0x40
[<ffffffff8108d1a3>] ? __might_sleep+0x53/0xa0
[<ffffffff812095c5>] ? dput+0x65/0x280
[<ffffffff8143d717>] ? btrfs_log_dentry_safe+0x77/0x90
[<ffffffff81405b04>] btrfs_sync_file+0x424/0x490
[<ffffffff8107535a>] ? SYSC_kill+0xba/0x1d0
[<ffffffff811f2348>] ? __sb_end_write+0x58/0x80
[<ffffffff812258ac>] vfs_fsync_range+0x4c/0xb0
[<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
[<ffffffff8122592c>] vfs_fsync+0x1c/0x20
[<ffffffff8122596d>] do_fsync+0x3d/0x70
[<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
[<ffffffff812259d0>] SyS_fsync+0x10/0x20
[<ffffffff81002b65>] do_syscall_64+0x55/0xd0
[<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
[<ffffffff819ad286>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace c93288442a6424aa ]---

2016-10-27 16:34:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig <[email protected]> wrote:
>> Dave, can you hit the warnings with this? Totally untested...
>
> Can we just kill off the unhelpful blk_map_ctx structure, e.g.:

Yeah, I found that hard to read too. The difference between
blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use
the latter everywhere.

But please separate that patch from the patch that fixes the ctx list
corruption.

And Jens - can I have at least the corruption fix asap? Due to KS
travel, I'm going to do rc3 on Saturday, and I'd like to have the fix
in at least a day before that..

Linus

2016-10-27 16:37:15

by Jens Axboe

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 10/27/2016 10:34 AM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig <[email protected]> wrote:
>>> Dave, can you hit the warnings with this? Totally untested...
>>
>> Can we just kill off the unhelpful blk_map_ctx structure, e.g.:
>
> Yeah, I found that hard to read too. The difference between
> blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use
> the latter everywhere.
>
> But please separate that patch from the patch that fixes the ctx list
> corruption.

Agree

> And Jens - can I have at least the corruption fix asap? Due to KS
> travel, I'm going to do rc3 on Saturday, and I'd like to have the fix
> in at least a day before that..

Just committed half an hour ago with the acks etc, I'll send it to you
shortly.

--
Jens Axboe

2016-10-27 17:24:22

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote:

> And that's indicative of a delalloc metadata reservation being
> being too small and so we're allocating unreserved blocks.
>
> Different symptoms, same underlying cause, I think.
>
> I see the latter assert from time to time in my testing, but it's
> not common (maybe once a month) and I've never been able to track it
> down. However, it doesn't affect production systems unless they hit
> ENOSPC hard enough that it causes the critical reserve pool to be
> exhausted iand so the allocation fails. That's extremely rare -
> usually takes a several hundred processes all trying to write as had
> as they can concurrently and to all slip through the ENOSPC
> detection without the correct metadata reservations and all require
> multiple metadata blocks to be allocated durign writeback...
>
> If you've got a way to trigger it quickly and reliably, that would
> be helpful...

Seems pretty quickly reproducible for me in some shape or form.
Run trinity with --enable-fds=testfile and spawn enough children
to create a fair bit of contention (for me -C64 seems a good fit on
spinning rust, but if you're running on shiny nvme you might have to pump it up a bit).

Dave

2016-10-31 18:55:26

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 07:47:51PM -0400, Dave Jones wrote:
> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:
>
> > >- hctx->queued++;
> > >- data->hctx = hctx;
> > >- data->ctx = ctx;
> > >+ data->hctx = alloc_data.hctx;
> > >+ data->ctx = alloc_data.ctx;
> > >+ data->hctx->queued++;
> > > return rq;
> > > }
> >
> > This made it through an entire dbench 2048 run on btrfs. My script has
> > it running in a loop, but this is farther than I've gotten before.
> > Looking great so far.
>
> Fixed the splat during boot for me too.
> Now the fun part, let's see if it fixed the 'weird shit' that Trinity
> was stumbling on.

It took a while, but.. bad news.


BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
flags: 0x400000000000000c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 1586 Comm: kworker/u8:12 Not tainted 4.9.0-rc3-think+ #1
Workqueue: writeback wb_workfn
(flush-btrfs-2)
ffffc90000777828
ffffffff8130d04c
ffffea0013838e40
ffffffff819ff654
ffffc90000777850
ffffffff81150e5f
0000000000000000
ffffea0013838e40
400000000000000c
ffffc90000777860
ffffffff81150f1a
ffffc900007778a8
Call Trace:
[<ffffffff8130d04c>] dump_stack+0x4f/0x73
[<ffffffff81150e5f>] bad_page+0xbf/0x120
[<ffffffff81150f1a>] free_pages_check_bad+0x5a/0x70
[<ffffffff81153818>] free_hot_cold_page+0x248/0x290
[<ffffffff81153b1b>] free_hot_cold_page_list+0x2b/0x50
[<ffffffff8115c52d>] release_pages+0x2bd/0x350
[<ffffffff8115da62>] __pagevec_release+0x22/0x30
[<ffffffffa00a0d4e>] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs]
[<ffffffffa00a1199>] extent_writepages+0x49/0x60 [btrfs]
[<ffffffffa0081840>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[<ffffffffa007e993>] btrfs_writepages+0x23/0x30 [btrfs]
[<ffffffff8115ac4c>] do_writepages+0x1c/0x30
[<ffffffff811f69f3>] __writeback_single_inode+0x33/0x180
[<ffffffff811f71e8>] writeback_sb_inodes+0x2a8/0x5b0
[<ffffffff811f757d>] __writeback_inodes_wb+0x8d/0xc0
[<ffffffff811f7833>] wb_writeback+0x1e3/0x1f0
[<ffffffff811f7d72>] wb_workfn+0xd2/0x280
[<ffffffff810908d5>] process_one_work+0x1d5/0x490
[<ffffffff81090875>] ? process_one_work+0x175/0x490
[<ffffffff81090bd9>] worker_thread+0x49/0x490
[<ffffffff81090b90>] ? process_one_work+0x490/0x490
[<ffffffff81095d4e>] kthread+0xee/0x110
[<ffffffff81095c60>] ? kthread_park+0x60/0x60
[<ffffffff81790b12>] ret_from_fork+0x22/0x30



2016-10-31 19:35:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <[email protected]> wrote:
>
> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> flags: 0x400000000000000c(referenced|uptodate)
> page dumped because: non-NULL mapping

Hmm. So this seems to be btrfs-specific, right?

I searched for all your "non-NULL mapping" cases, and they all seem to
have basically the same call trace, with some work thread doing
writeback and going through btrfs_writepages().

Sounds like it's a race with either fallocate hole-punching or
truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
clearly ran other filesystems too but I am not seeing this backtrace
for anything else.

Linus

2016-10-31 19:46:21

by Chris Mason

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
>On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <[email protected]> wrote:
>>
>> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
>> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
>> flags: 0x400000000000000c(referenced|uptodate)
>> page dumped because: non-NULL mapping
>
>Hmm. So this seems to be btrfs-specific, right?
>
>I searched for all your "non-NULL mapping" cases, and they all seem to
>have basically the same call trace, with some work thread doing
>writeback and going through btrfs_writepages().
>
>Sounds like it's a race with either fallocate hole-punching or
>truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
>clearly ran other filesystems too but I am not seeing this backtrace
>for anything else.

Agreed, I think this is a separate bug, almost certainly btrfs specific.
I'll work with Dave on a better reproducer.

-chris

2016-11-06 16:55:53

by Dave Jones

[permalink] [raw]
Subject: Re: btrfs btree_ctree_super fault

<subject changed, hopefully we're done with bio corruption for now>

On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
> >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <[email protected]> wrote:
> >>
> >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> >> flags: 0x400000000000000c(referenced|uptodate)
> >> page dumped because: non-NULL mapping
> >
> >Hmm. So this seems to be btrfs-specific, right?
> >
> >I searched for all your "non-NULL mapping" cases, and they all seem to
> >have basically the same call trace, with some work thread doing
> >writeback and going through btrfs_writepages().
> >
> >Sounds like it's a race with either fallocate hole-punching or
> >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
> >clearly ran other filesystems too but I am not seeing this backtrace
> >for anything else.
>
> Agreed, I think this is a separate bug, almost certainly btrfs specific.
> I'll work with Dave on a better reproducer.

Still refining my 'capture ftrace when trinity detects taint' feature,
but in the meantime, here's a variant I don't think we've seen before:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 3 PID: 1913 Comm: trinity-c51 Not tainted 4.9.0-rc3-think+ #3
task: ffff880503350040 task.stack: ffffc90000240000
RIP: 0010:[<ffffffffa007c2f6>]
[<ffffffffa007c2f6>] write_ctree_super+0x96/0xb30 [btrfs]
RSP: 0018:ffffc90000243c90 EFLAGS: 00010286
RAX: dae05adadadad000 RBX: 0000000000000000 RCX: 0000000000000002
RDX: ffff8804fdfcc000 RSI: ffff8804edcee313 RDI: ffff8804edcee1c3
RBP: ffffc90000243d00 R08: 0000000000000003 R09: ffff880000000000
R10: 0000000000000001 R11: 0000000000000100 R12: ffff88045151c548
R13: 0000000000000000 R14: ffff8804ee5122a8 R15: ffff8804572267e8
FS: 00007f25c3e0eb40(0000) GS:ffff880507e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f25c1560d44 CR3: 0000000454e20000 CR4: 00000000001406e0
DR0: 00007fee93506000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
0000000000000001
ffff88050227b3f8
ffff8804fff01b28
00000001810b7f35
ffffffffa007c265
0000000000000001
ffffc90000243cb8
000000002c9a8645
ffff8804fff01b28
ffff8804fff01b28
ffff88045151c548
0000000000000000
Call Trace:
[<ffffffffa007c265>] ? write_ctree_super+0x5/0xb30 [btrfs]
[<ffffffffa00d2956>] btrfs_sync_log+0x886/0xa60 [btrfs]
[<ffffffffa009f4f9>] btrfs_sync_file+0x479/0x4d0 [btrfs]
[<ffffffff812789ab>] vfs_fsync_range+0x4b/0xb0
[<ffffffff81260755>] ? __fget_light+0x5/0x60
[<ffffffff81278a6d>] do_fsync+0x3d/0x70
[<ffffffff81278a35>] ? do_fsync+0x5/0x70
[<ffffffff81278d20>] SyS_fsync+0x10/0x20
[<ffffffff81002d81>] do_syscall_64+0x61/0x170
[<ffffffff81894a8b>] entry_SYSCALL64_slow_path+0x25/0x25
Code: c7 48 8b 42 30 4c 8b 08 48 b8 00 00 00 00 00 16 00 00 49 03 81 a0 01 00 00 49 b9 00 00 00 00 00 88 ff ff 48 c1 f8 06 48 c1 e0 0c <4a> 8b 44 08 50 48 39 46 08 0f 84 8d 08 00 00 49 63 c0 48 8d 0c
RIP
[<ffffffffa007c2f6>] write_ctree_super+0x96/0xb30 [btrfs]
RSP <ffffc90000243c90>

All code
========
0: c7 (bad)
1: 48 8b 42 30 mov 0x30(%rdx),%rax
5: 4c 8b 08 mov (%rax),%r9
8: 48 b8 00 00 00 00 00 movabs $0x160000000000,%rax
f: 16 00 00
12: 49 03 81 a0 01 00 00 add 0x1a0(%r9),%rax
19: 49 b9 00 00 00 00 00 movabs $0xffff880000000000,%r9
20: 88 ff ff
23: 48 c1 f8 06 sar $0x6,%rax
27: 48 c1 e0 0c shl $0xc,%rax
2b:* 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax <-- trapping instruction
30: 48 39 46 08 cmp %rax,0x8(%rsi)
34: 0f 84 8d 08 00 00 je 0x8c7
3a: 49 63 c0 movslq %r8d,%rax
3d: 48 rex.W
3e: 8d .byte 0x8d
3f: 0c .byte 0xc

Code starting with the faulting instruction
===========================================
0: 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax
5: 48 39 46 08 cmp %rax,0x8(%rsi)
9: 0f 84 8d 08 00 00 je 0x89c
f: 49 63 c0 movslq %r8d,%rax
12: 48 rex.W
13: 8d .byte 0x8d
14: 0c .byte 0xc


According to objdump -S, it looks like this is an inlined copy of backup_super_roots

root_backup = info->super_for_commit->super_roots + last_backup;
2706: 48 8d b8 2b 0b 00 00 lea 0xb2b(%rax),%rdi
270d: 48 63 c1 movslq %ecx,%rax
2710: 48 8d 34 80 lea (%rax,%rax,4),%rsi
2714: 48 8d 04 b0 lea (%rax,%rsi,4),%rax
2718: 48 8d 34 c7 lea (%rdi,%rax,8),%rsi
btrfs_header_generation(info->tree_root->node))
271c: 48 8b 42 30 mov 0x30(%rdx),%rax
2720: 4c 8b 08 mov (%rax),%r9
2723: 48 b8 00 00 00 00 00 movabs $0x160000000000,%rax
272a: 16 00 00
272d: 49 03 81 a0 01 00 00 add 0x1a0(%r9),%rax
if (btrfs_backup_tree_root_gen(root_backup) ==
2734: 49 b9 00 00 00 00 00 movabs $0xffff880000000000,%r9
273b: 88 ff ff
273e: 48 c1 f8 06 sar $0x6,%rax
2742: 48 c1 e0 0c shl $0xc,%rax
2746: 4a 8b 44 08 50 mov 0x50(%rax,%r9,1),%rax <-- trapping instruction
274b: 48 39 46 08 cmp %rax,0x8(%rsi)
274f: 0f 84 8d 08 00 00 je 2fe2 <write_ctree_super+0x932>
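
Stitching those interleaved source lines back together, the logic the listing
points at is roughly the following (paraphrased from memory of
fs/btrfs/disk-io.c of this era, so treat it as a sketch for orientation rather
than the exact source); the trapping instruction belongs to the
btrfs_header_generation(info->tree_root->node) read:

	/* sketch of the inlined backup_super_roots() logic implicated above */
	int next_backup = info->backup_root_index;
	int last_backup = (next_backup + BTRFS_NUM_BACKUP_ROOTS - 1) %
			  BTRFS_NUM_BACKUP_ROOTS;
	struct btrfs_root_backup *root_backup;

	root_backup = info->super_for_commit->super_roots + last_backup;

	/* reuse the previous backup slot if we're still at the same generation;
	 * the fault happens while reading the header generation through
	 * info->tree_root->node */
	if (btrfs_backup_tree_root_gen(root_backup) ==
	    btrfs_header_generation(info->tree_root->node))
		next_backup = last_backup;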

2016-11-08 14:59:27

by Dave Jones

[permalink] [raw]
Subject: Re: btrfs btree_ctree_super fault

On Sun, Nov 06, 2016 at 11:55:39AM -0500, Dave Jones wrote:
> <subject changed, hopefully we're done with bio corruption for now>
>
> On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
> > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <[email protected]> wrote:
> > >>
> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> > >> flags: 0x400000000000000c(referenced|uptodate)
> > >> page dumped because: non-NULL mapping
> > >
> > >Hmm. So this seems to be btrfs-specific, right?
> > >
> > >I searched for all your "non-NULL mapping" cases, and they all seem to
> > >have basically the same call trace, with some work thread doing
> > >writeback and going through btrfs_writepages().
> > >
> > >Sounds like it's a race with either fallocate hole-punching or
> > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
> > >clearly ran other filesystems too but I am not seeing this backtrace
> > >for anything else.
> >
> > Agreed, I think this is a separate bug, almost certainly btrfs specific.
> > I'll work with Dave on a better reproducer.
>
> Still refining my 'capture ftrace when trinity detects taint' feature,
> but in the meantime, here's a variant I don't think we've seen before:

And another new one:

kernel BUG at fs/btrfs/ctree.c:3172!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1
task: ffff8804ffde37c0 task.stack: ffffc90002188000
RIP: 0010:[<ffffffffa00576b9>]  [<ffffffffa00576b9>] btrfs_set_item_key_safe+0x179/0x190 [btrfs]
RSP: 0000:ffffc9000218b8a8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8804fddcf348 RCX: 0000000000001000
RDX: 0000000000000000 RSI: ffffc9000218b9ce RDI: ffffc9000218b8c7
RBP: ffffc9000218b908 R08: 0000000000004000 R09: ffffc9000218b8c8
R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000218b8b6
R13: ffffc9000218b9ce R14: 0000000000000001 R15: ffff880480684a88
FS: 00007f7c7f998b40(0000) GS:ffff880507800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000044f15f000 CR4: 00000000001406f0
DR0: 00007f4ce439d000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff880501430000 d305ffffa00a2245 006c000000000002 0500000000000010
6c000000000002d3 0000000000001000 000000006427eebb ffff880480684a88
0000000000000000 ffff8804fddcf348 0000000000002000 0000000000000000
Call Trace:
[<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs]
[<ffffffff8116c80c>] ? function_trace_call+0x13c/0x190
[<ffffffffa009c4f5>] ? __btrfs_drop_extents+0x5/0xe30 [btrfs]
[<ffffffff810e2b00>] ? do_raw_write_lock+0xb0/0xc0
[<ffffffffa00cd43d>] btrfs_log_changed_extents+0x35d/0x630 [btrfs]
[<ffffffffa00a6a74>] ? release_extent_buffer+0xa4/0x110 [btrfs]
[<ffffffffa00cd0e5>] ? btrfs_log_changed_extents+0x5/0x630 [btrfs]
[<ffffffffa00d1085>] btrfs_log_inode+0xb05/0x11d0 [btrfs]
[<ffffffff8116536c>] ? trace_function+0x6c/0x80
[<ffffffffa00d0580>] ? log_directory_changes+0xc0/0xc0 [btrfs]
[<ffffffffa00d1a20>] ? btrfs_log_inode_parent+0x240/0x940 [btrfs]
[<ffffffff8116c80c>] ? function_trace_call+0x13c/0x190
[<ffffffffa00d1a20>] btrfs_log_inode_parent+0x240/0x940 [btrfs]
[<ffffffffa00d17e5>] ? btrfs_log_inode_parent+0x5/0x940 [btrfs]
[<ffffffff81259131>] ? dget_parent+0x71/0x150
[<ffffffffa00d3102>] btrfs_log_dentry_safe+0x62/0x80 [btrfs]
[<ffffffffa009f404>] btrfs_sync_file+0x344/0x4d0 [btrfs]
[<ffffffff81278a1b>] vfs_fsync_range+0x4b/0xb0
[<ffffffff812607c5>] ? __fget_light+0x5/0x60
[<ffffffff81278add>] do_fsync+0x3d/0x70
[<ffffffff81278aa5>] ? do_fsync+0x5/0x70
[<ffffffff81278db3>] SyS_fdatasync+0x13/0x20
[<ffffffff81002d81>] do_syscall_64+0x61/0x170
[<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 48 8b 45 b7 48 8d 7d bf 4c 89 ee 48 89 45 c8 0f b6 45 b6 88 45 c7 48 8b 45 ae 48 89 45 bf e8 af f2 ff ff 85 c0 0f 8f 43 ff ff ff <0f> 0b 0f 0b e8 ee f3 02 e1 0f 1f 40 00 66 2e 0f 1f 84 00 00 00



Unfortunately, because this was a BUG_ON, it locked up the box so it
didn't save any additional debug info. Tempted to see if making BUG_ON
a no-op will at least let it live long enough to save the ftrace buffer.
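
A minimal way to try that, as a local debugging-only hack (a sketch, nothing
upstreamable), would be to demote the assertion in the affected file:

	/* debugging-only: demote the assertion to a warning so the box
	 * survives long enough to flush the ftrace buffer */
	#undef BUG_ON
	#define BUG_ON(condition) WARN_ON(condition)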

Given this seems to be mutating every time I see something go wrong,
I'm wondering if this is fallout from memory corruption again.

Dave

2016-11-08 15:08:46

by Chris Mason

[permalink] [raw]
Subject: Re: btrfs btree_ctree_super fault



On 11/08/2016 09:59 AM, Dave Jones wrote:
> On Sun, Nov 06, 2016 at 11:55:39AM -0500, Dave Jones wrote:
> > <subject changed, hopefully we're done with bio corruption for now>
> >
> > On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> > > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
> > > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <[email protected]> wrote:
> > > >>
> > > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> > > >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> > > >> flags: 0x400000000000000c(referenced|uptodate)
> > > >> page dumped because: non-NULL mapping
> > > >
> > > >Hmm. So this seems to be btrfs-specific, right?
> > > >
> > > >I searched for all your "non-NULL mapping" cases, and they all seem to
> > > >have basically the same call trace, with some work thread doing
> > > >writeback and going through btrfs_writepages().
> > > >
> > > >Sounds like it's a race with either fallocate hole-punching or
> > > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
> > > >clearly ran other filesystems too but I am not seeing this backtrace
> > > >for anything else.
> > >
> > > Agreed, I think this is a separate bug, almost certainly btrfs specific.
> > > I'll work with Dave on a better reproducer.
> >
> > Still refining my 'capture ftrace when trinity detects taint' feature,
> > but in the meantime, here's a variant I don't think we've seen before:
>
> And another new one:
>
> kernel BUG at fs/btrfs/ctree.c:3172!
> invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1
> task: ffff8804ffde37c0 task.stack: ffffc90002188000
> RIP: 0010:[<ffffffffa00576b9>]
> [<ffffffffa00576b9>] btrfs_set_item_key_safe+0x179/0x190 [btrfs]
> RSP: 0000:ffffc9000218b8a8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff8804fddcf348 RCX: 0000000000001000
> RDX: 0000000000000000 RSI: ffffc9000218b9ce RDI: ffffc9000218b8c7
> RBP: ffffc9000218b908 R08: 0000000000004000 R09: ffffc9000218b8c8
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000218b8b6
> R13: ffffc9000218b9ce R14: 0000000000000001 R15: ffff880480684a88
> FS: 00007f7c7f998b40(0000) GS:ffff880507800000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000044f15f000 CR4: 00000000001406f0
> DR0: 00007f4ce439d000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Stack:
> ffff880501430000 d305ffffa00a2245 006c000000000002 0500000000000010
> 6c000000000002d3 0000000000001000 000000006427eebb ffff880480684a88
> 0000000000000000 ffff8804fddcf348 0000000000002000 0000000000000000
> Call Trace:
> [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs]

We've been hunting this one for at least two years. It's the white
whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I
think it's not the same as the pagevec based problems you reported earlier.

-chris

2016-11-10 14:35:35

by Dave Jones

[permalink] [raw]
Subject: Re: btrfs btree_ctree_super fault

On Tue, Nov 08, 2016 at 10:08:04AM -0500, Chris Mason wrote:

> > And another new one:
> >
> > kernel BUG at fs/btrfs/ctree.c:3172!
> >
> > Call Trace:
> > [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs]
>
> We've been hunting this one for at least two years. It's the white
> whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I
> think it's not the same as the pagevec based problems you reported earlier.

Great, now for whatever reason, I'm hitting this over and over.

Even better, after the last time I hit it, it rebooted and this happened during boot..

BTRFS info (device sda6): disk space caching is enabled
BTRFS info (device sda6): has skinny extents
BTRFS info (device sda3): disk space caching is enabled
------------[ cut here ]------------
WARNING: CPU: 1 PID: 443 at fs/btrfs/file.c:546 btrfs_drop_extent_cache+0x411/0x420 [btrfs]
CPU: 1 PID: 443 Comm: mount Not tainted 4.9.0-rc4-think+ #1
ffffc90000c4b468 ffffffff813b66bc 0000000000000000 0000000000000000
ffffc90000c4b4a8 ffffffff81086d2b 0000022200c4b488 000000000002f265
40c8dded1afd6000 ffff8804ff5cddc8 ffff8804ef26f2b8 40c8dded1afd5000
Call Trace:
[<ffffffff813b66bc>] dump_stack+0x4f/0x73
[<ffffffff81086d2b>] __warn+0xcb/0xf0
[<ffffffff81086e5d>] warn_slowpath_null+0x1d/0x20
[<ffffffffa009c0f1>] btrfs_drop_extent_cache+0x411/0x420 [btrfs]
[<ffffffff81215923>] ? alloc_debug_processing+0x73/0x1b0
[<ffffffffa009c93f>] __btrfs_drop_extents+0x44f/0xe30 [btrfs]
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffff8121842a>] ? kmem_cache_alloc+0x2aa/0x330
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffffa009e399>] btrfs_drop_extents+0x79/0xa0 [btrfs]
[<ffffffffa00ce4d1>] replay_one_extent+0x1e1/0x710 [btrfs]
[<ffffffffa00cec6d>] replay_one_buffer+0x26d/0x7e0 [btrfs]
[<ffffffff8121732c>] ? ___slab_alloc.constprop.83+0x27c/0x5c0
[<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffff813d5b87>] ? debug_smp_processor_id+0x17/0x20
[<ffffffffa00ca3db>] walk_up_log_tree+0xeb/0x240 [btrfs]
[<ffffffffa00ca5d6>] walk_log_tree+0xa6/0x1d0 [btrfs]
[<ffffffffa00d32fc>] btrfs_recover_log_trees+0x1dc/0x460 [btrfs]
[<ffffffffa00cea00>] ? replay_one_extent+0x710/0x710 [btrfs]
[<ffffffffa0081f65>] open_ctree+0x2575/0x2670 [btrfs]
[<ffffffffa005144b>] btrfs_mount+0xd0b/0xe10 [btrfs]
[<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660
[<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
[<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50
[<ffffffff81243794>] mount_fs+0x14/0xa0
[<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150
[<ffffffffa0050a08>] btrfs_mount+0x2c8/0xe10 [btrfs]
[<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660
[<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
[<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
[<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50
[<ffffffff81243794>] mount_fs+0x14/0xa0
[<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150
[<ffffffff81265bd2>] do_mount+0x1c2/0xda0
[<ffffffff811d41c0>] ? memdup_user+0x60/0x90
[<ffffffff81266ac3>] SyS_mount+0x83/0xd0
[<ffffffff81002d81>] do_syscall_64+0x61/0x170
[<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace d3fa03bb9c115bbe ]---
BTRFS: error (device sda3) in btrfs_replay_log:2491: errno=-17 Object already exists (Failed to recover log tree)
BTRFS error (device sda3): cleaner transaction attach returned -30
BTRFS error (device sda3): open_ctree failed


Guess I'll hit it with btrfsck and hope for the best..

Dave

2016-11-10 15:28:31

by Chris Mason

[permalink] [raw]
Subject: Re: btrfs btree_ctree_super fault

On 11/10/2016 09:35 AM, Dave Jones wrote:
> On Tue, Nov 08, 2016 at 10:08:04AM -0500, Chris Mason wrote:
>
> > > And another new one:
> > >
> > > kernel BUG at fs/btrfs/ctree.c:3172!
> > >
> > > Call Trace:
> > > [<ffffffffa009cff0>] __btrfs_drop_extents+0xb00/0xe30 [btrfs]
> >
> > We've been hunting this one for at least two years. It's the white
> > whale of btrfs bugs. Josef has a semi-reliable reproducer now, but I
> > think it's not the same as the pagevec based problems you reported earlier.
>
> Great, now for whatever reason, I'm hitting this over and over.
>
> Even better, after the last time I hit it, it rebooted and this happened during boot..
>
> BTRFS info (device sda6): disk space caching is enabled
> BTRFS info (device sda6): has skinny extents
> BTRFS info (device sda3): disk space caching is enabled
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 443 at fs/btrfs/file.c:546 btrfs_drop_extent_cache+0x411/0x420 [btrfs]
> CPU: 1 PID: 443 Comm: mount Not tainted 4.9.0-rc4-think+ #1
> ffffc90000c4b468 ffffffff813b66bc 0000000000000000 0000000000000000
> ffffc90000c4b4a8 ffffffff81086d2b 0000022200c4b488 000000000002f265
> 40c8dded1afd6000 ffff8804ff5cddc8 ffff8804ef26f2b8 40c8dded1afd5000
> Call Trace:
> [<ffffffff813b66bc>] dump_stack+0x4f/0x73
> [<ffffffff81086d2b>] __warn+0xcb/0xf0
> [<ffffffff81086e5d>] warn_slowpath_null+0x1d/0x20
> [<ffffffffa009c0f1>] btrfs_drop_extent_cache+0x411/0x420 [btrfs]
> [<ffffffff81215923>] ? alloc_debug_processing+0x73/0x1b0
> [<ffffffffa009c93f>] __btrfs_drop_extents+0x44f/0xe30 [btrfs]
> [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
> [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
> [<ffffffff8121842a>] ? kmem_cache_alloc+0x2aa/0x330
> [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
> [<ffffffffa009e399>] btrfs_drop_extents+0x79/0xa0 [btrfs]
> [<ffffffffa00ce4d1>] replay_one_extent+0x1e1/0x710 [btrfs]
> [<ffffffffa00cec6d>] replay_one_buffer+0x26d/0x7e0 [btrfs]
> [<ffffffff8121732c>] ? ___slab_alloc.constprop.83+0x27c/0x5c0
> [<ffffffffa005426a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
> [<ffffffff813d5b87>] ? debug_smp_processor_id+0x17/0x20
> [<ffffffffa00ca3db>] walk_up_log_tree+0xeb/0x240 [btrfs]
> [<ffffffffa00ca5d6>] walk_log_tree+0xa6/0x1d0 [btrfs]
> [<ffffffffa00d32fc>] btrfs_recover_log_trees+0x1dc/0x460 [btrfs]
> [<ffffffffa00cea00>] ? replay_one_extent+0x710/0x710 [btrfs]
> [<ffffffffa0081f65>] open_ctree+0x2575/0x2670 [btrfs]
> [<ffffffffa005144b>] btrfs_mount+0xd0b/0xe10 [btrfs]
> [<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660
> [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
> [<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50
> [<ffffffff81243794>] mount_fs+0x14/0xa0
> [<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150
> [<ffffffffa0050a08>] btrfs_mount+0x2c8/0xe10 [btrfs]
> [<ffffffff811da804>] ? pcpu_alloc+0x2d4/0x660
> [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
> [<ffffffff810dce41>] ? lockdep_init_map+0x61/0x200
> [<ffffffff810d39eb>] ? __init_waitqueue_head+0x3b/0x50
> [<ffffffff81243794>] mount_fs+0x14/0xa0
> [<ffffffff8126305b>] vfs_kern_mount+0x6b/0x150
> [<ffffffff81265bd2>] do_mount+0x1c2/0xda0
> [<ffffffff811d41c0>] ? memdup_user+0x60/0x90
> [<ffffffff81266ac3>] SyS_mount+0x83/0xd0
> [<ffffffff81002d81>] do_syscall_64+0x61/0x170
> [<ffffffff81894ccb>] entry_SYSCALL64_slow_path+0x25/0x25
> ---[ end trace d3fa03bb9c115bbe ]---
> BTRFS: error (device sda3) in btrfs_replay_log:2491: errno=-17 Object already exists (Failed to recover log tree)
> BTRFS error (device sda3): cleaner transaction attach returned -30
> BTRFS error (device sda3): open_ctree failed
>
>
> Guess I'll hit it with btrfsck and hope for the best..

You can zero the log if you need to. Josef has a ton of tracing around
this right now, so I'm hoping we nail it down very soon.

-chris

2016-11-23 19:34:33

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
> On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
> >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones <[email protected]> wrote:
> >>
> >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39
> >> page:ffffea0013838e40 count:0 mapcount:0 mapping:ffff8804a20310e0 index:0x100c
> >> flags: 0x400000000000000c(referenced|uptodate)
> >> page dumped because: non-NULL mapping
> >
> >Hmm. So this seems to be btrfs-specific, right?
> >
> >I searched for all your "non-NULL mapping" cases, and they all seem to
> >have basically the same call trace, with some work thread doing
> >writeback and going through btrfs_writepages().
> >
> >Sounds like it's a race with either fallocate hole-punching or
> >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
> >clearly ran other filesystems too but I am not seeing this backtrace
> >for anything else.
>
> Agreed, I think this is a separate bug, almost certainly btrfs specific.
> I'll work with Dave on a better reproducer.

Ok, let's try this..
Here's the usual dump, from the current tree:


[ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
[ 317.702235] page:ffffea001363f500 count:0 mapcount:0 mapping:ffff8804eef8e0e0 index:0x29
[ 317.718068] flags: 0x400000000000000c(referenced|uptodate)
[ 317.731086] page dumped because: non-NULL mapping
[ 317.744118] CPU: 3 PID: 1558 Comm: kworker/u8:8 Not tainted 4.9.0-rc6-think+ #1
[ 317.770404] Workqueue: writeback wb_workfn
[ 317.783490] (flush-btrfs-3)
[ 317.783581] ffffc90000ff3798
[ 317.796651] ffffffff813b69bc
[ 317.796742] ffffea001363f500
[ 317.796798] ffffffff81c474ee
[ 317.796854] ffffc90000ff37c0
[ 317.809910] ffffffff811b30c4
[ 317.810001] 0000000000000000
[ 317.810055] ffffea001363f500
[ 317.810112] 0000000000000003
[ 317.823230] ffffc90000ff37d0
[ 317.823320] ffffffff811b318f
[ 317.823376] ffffc90000ff3818
[ 317.823431] Call Trace:
[ 317.836645] [<ffffffff813b69bc>] dump_stack+0x4f/0x73
[ 317.850063] [<ffffffff811b30c4>] bad_page+0xc4/0x130
[ 317.863426] [<ffffffff811b318f>] free_pages_check_bad+0x5f/0x70
[ 317.876805] [<ffffffff811b61f5>] free_hot_cold_page+0x305/0x3b0
[ 317.890132] [<ffffffff811b660e>] free_hot_cold_page_list+0x7e/0x170
[ 317.903410] [<ffffffff811c08b7>] release_pages+0x2e7/0x3a0
[ 317.916600] [<ffffffff811c2067>] __pagevec_release+0x27/0x40
[ 317.929817] [<ffffffffa00b0a55>] extent_write_cache_pages.isra.44.constprop.59+0x355/0x430 [btrfs]
[ 317.943073] [<ffffffffa00b0fcd>] extent_writepages+0x5d/0x90 [btrfs]
[ 317.956033] [<ffffffffa008ea60>] ? btrfs_releasepage+0x40/0x40 [btrfs]
[ 317.968902] [<ffffffffa008b6a8>] btrfs_writepages+0x28/0x30 [btrfs]
[ 317.981744] [<ffffffff811be6d1>] do_writepages+0x21/0x30
[ 317.994524] [<ffffffff812727cf>] __writeback_single_inode+0x7f/0x6e0
[ 318.007333] [<ffffffff81894cc1>] ? _raw_spin_unlock+0x31/0x50
[ 318.020090] [<ffffffff8127356d>] writeback_sb_inodes+0x31d/0x6c0
[ 318.032844] [<ffffffff812739a2>] __writeback_inodes_wb+0x92/0xc0
[ 318.045623] [<ffffffff81273e97>] wb_writeback+0x3f7/0x530
[ 318.058304] [<ffffffff81274a38>] wb_workfn+0x148/0x620
[ 318.070949] [<ffffffff810a4b54>] ? process_one_work+0x194/0x690
[ 318.083549] [<ffffffff812748f5>] ? wb_workfn+0x5/0x620
[ 318.096160] [<ffffffff810a4bf2>] process_one_work+0x232/0x690
[ 318.108722] [<ffffffff810a4b54>] ? process_one_work+0x194/0x690
[ 318.121225] [<ffffffff810a509e>] worker_thread+0x4e/0x490
[ 318.133680] [<ffffffff810a5050>] ? process_one_work+0x690/0x690
[ 318.146126] [<ffffffff810a5050>] ? process_one_work+0x690/0x690
[ 318.158507] [<ffffffff810aa7e2>] kthread+0x102/0x120
[ 318.170880] [<ffffffff810aa6e0>] ? kthread_park+0x60/0x60
[ 318.183254] [<ffffffff81895822>] ret_from_fork+0x22/0x30


This time, I managed to get my ftrace magic working, so we have this
trace from just before this happened. Does this shed any light ?

https://codemonkey.org.uk/junk/trace.txt

(note that some of the spinlock/rcu calls will be missing here. I chose
to add them to ftrace's exclude list because they dominate the trace
buffer so much, especially in debug builds)
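
For reference, a sketch of how that exclusion can be done programmatically;
the debugfs path and the pattern names below are assumptions about the usual
tracing interface, not the exact list used here:

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	static void ftrace_notrace(const char *pattern)
	{
		/* opening without O_TRUNC appends to the notrace list
		 * instead of replacing it */
		int fd = open("/sys/kernel/debug/tracing/set_ftrace_notrace",
			      O_WRONLY);

		if (fd >= 0) {
			write(fd, pattern, strlen(pattern));
			write(fd, "\n", 1);
			close(fd);
		}
	}

	/* e.g.: ftrace_notrace("_raw_spin_*"); ftrace_notrace("rcu_*"); */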

Dave

2016-11-23 19:59:20

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:

> [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
> trace from just before this happened. Does this shed any light ?
>
> https://codemonkey.org.uk/junk/trace.txt

crap, I just noticed the timestamps in the trace come from quite a bit
later. I'll tweak the code to do the taint checking/ftrace stop after
every syscall, that should narrow the window some more.
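
A minimal sketch of that post-syscall check (not trinity's actual code; the
/proc and debugfs paths are just the usual ones):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	/* nonzero once the kernel has tainted itself (oops, warning, ...) */
	static int kernel_tainted(void)
	{
		char buf[32] = "";
		int fd = open("/proc/sys/kernel/tainted", O_RDONLY);

		if (fd < 0)
			return 0;
		if (read(fd, buf, sizeof(buf) - 1) < 0)
			buf[0] = '0';
		close(fd);
		return atoi(buf) != 0;
	}

	/* freeze the ftrace ring buffer so the interesting trace survives */
	static void stop_tracing(void)
	{
		int fd = open("/sys/kernel/debug/tracing/tracing_on", O_WRONLY);

		if (fd >= 0) {
			write(fd, "0", 1);
			close(fd);
		}
	}

	/* called after every syscall the fuzzer issues */
	void check_taint_after_syscall(void)
	{
		if (kernel_tainted())
			stop_tracing();
	}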

Getting closer..

Dave

2016-12-01 15:32:36

by Dave Jones

[permalink] [raw]
Subject: Re: btrfs_destroy_inode warn (outstanding extents)

On Wed, Nov 23, 2016 at 02:58:45PM -0500, Dave Jones wrote:
> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>
> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
> > trace from just before this happened. Does this shed any light ?
> >
> > https://codemonkey.org.uk/junk/trace.txt
>
> crap, I just noticed the timestamps in the trace come from quite a bit
> later. I'll tweak the code to do the taint checking/ftrace stop after
> every syscall, that should narrow the window some more.
>
> Getting closer..

Ok, this is getting more like it.
http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt

Also same bug, different run, but a different traceview http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt

(function-graph screws up the RIP for some reason, 'return_to_handler'
should actually be btrfs_destroy_inode)


Anyways, I've got some code that works pretty well for dumping the
ftrace buffer now when things go awry. I just need to run it enough
times that I hit that bad page state instead of this, or a lockdep bug first.

Dave

2016-12-03 16:50:51

by Dave Jones

[permalink] [raw]
Subject: Re: btrfs_destroy_inode warn (outstanding extents)

On Thu, Dec 01, 2016 at 10:32:09AM -0500, Dave Jones wrote:
> http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt
>
> Also same bug, different run, but a different traceview http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt
>
> (function-graph screws up the RIP for some reason, 'return_to_handler'
> should actually be btrfs_destroy_inode)

Chris pointed me at a pending patch that took care of this warning.

> Anyways, I've got some code that works pretty well for dumping the
> ftrace buffer now when things go awry. I just need to run it enough
> times that I hit that bad page state instead of this, or a lockdep bug first.

Which allowed me to run long enough to get this trace..
http://codemonkey.org.uk/junk/bad-page-state.txt

Does this shed any light ?
The interesting process here seems to be kworker/u8:17, and the trace
captures some of what that was doing before that bad page was hit.

I've got another run going now, so I'll compare that trace when it
happens to see if it matches my current theory that it's something to
do with that btrfs_scrubparity_helper. I've seen that show up in stack
traces a few times while chasing this.

Dave

2016-12-04 23:04:51

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 23 November 2016 at 20:58, Dave Jones <[email protected]> wrote:
> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>
> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
> > trace from just before this happened. Does this shed any light ?
> >
> > https://codemonkey.org.uk/junk/trace.txt
>
> crap, I just noticed the timestamps in the trace come from quite a bit
> later. I'll tweak the code to do the taint checking/ftrace stop after
> every syscall, that should narrow the window some more.

FWIW I hit this as well:

BUG: unable to handle kernel paging request at ffffffff81ff08b7
IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
PGD 461e067 PUD 461f063
PMD 1e001e1
Oops: 0003 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff8801ee924080 task.stack: ffff8801bab88000
RIP: 0010:[<ffffffff8135f2ea>]  [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082
RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffffff85ae7d00
RBP: ffff8801bab8f7b0 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8801e727fc40 R11: fffffbfff0b54ced R12: ffffffff85ae7d00
R13: ffffffff84912920 R14: ffff8801ee924080 R15: 0000000000000000
FS: 00007f37ee653700(0000) GS:ffff8801f6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff81ff08b7 CR3: 00000001daa70000 CR4: 00000000000006f0
DR0: 00007f37ee465000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff8801ee9247d0 0000000000000000 0000000100000000 ffff8801ee924080
ffff8801f6a201c0 ffff8801f6a201c0 0000000000000000 0000000000000001
ffff880100000000 ffff880100000000 ffff8801e727fc40 ffff8801ee924080
Call Trace:
[<ffffffff81361751>] lock_acquire+0x141/0x2b0
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff83c95b29>] _raw_spin_lock_irqsave+0x49/0x60
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff813530c0>] finish_wait+0xb0/0x180
[<ffffffff81576227>] shmem_fault+0x4c7/0x6b0
[<ffffffff83a9b7cd>] ? p9_client_rpc+0x13d/0xf40
[<ffffffff81575d60>] ? shmem_getpage_gfp+0x1c90/0x1c90
[<ffffffff81fbe777>] ? radix_tree_next_chunk+0x4f7/0x840
[<ffffffff81352150>] ? wake_atomic_t_function+0x210/0x210
[<ffffffff815ad316>] __do_fault+0x206/0x410
[<ffffffff815ad110>] ? do_page_mkwrite+0x320/0x320
[<ffffffff815b9bcf>] handle_mm_fault+0x1cef/0x2a60
[<ffffffff815b8012>] ? handle_mm_fault+0x132/0x2a60
[<ffffffff815b7ee0>] ? __pmd_alloc+0x370/0x370
[<ffffffff81692e7e>] ? inode_add_bytes+0x10e/0x160
[<ffffffff8162de11>] ? memset+0x31/0x40
[<ffffffff815cba10>] ? find_vma+0x30/0x150
[<ffffffff812373a2>] __do_page_fault+0x452/0x9f0
[<ffffffff81ff071f>] ? iov_iter_init+0xaf/0x1d0
[<ffffffff81237bf5>] trace_do_page_fault+0x1e5/0x3a0
[<ffffffff8122a007>] do_async_page_fault+0x27/0xa0
[<ffffffff83c97618>] async_page_fault+0x28/0x30
[<ffffffff82059341>] ? strnlen_user+0x91/0x1a0
[<ffffffff8205931e>] ? strnlen_user+0x6e/0x1a0
[<ffffffff8157e038>] strndup_user+0x28/0xb0
[<ffffffff81d83c17>] SyS_add_key+0xc7/0x370
[<ffffffff81d83b50>] ? key_get_type_from_user.constprop.6+0xd0/0xd0
[<ffffffff815143ea>] ? __context_tracking_exit.part.4+0x3a/0x1e0
[<ffffffff81d83b50>] ? key_get_type_from_user.constprop.6+0xd0/0xd0
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c96534>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
RIP [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30

I didn't read all the emails in this thread; the crash site looks
identical to one of the earlier traces, although the caller may be
different.

I think you can rule out btrfs in any case, probably block layer as
well, since it looks like this comes from shmem.


Vegard

2016-12-05 11:10:36

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 00:04, Vegard Nossum <[email protected]> wrote:
> FWIW I hit this as well:
>
> BUG: unable to handle kernel paging request at ffffffff81ff08b7
> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
[...]

> I think you can rule out btrfs in any case, probably block layer as
> well, since it looks like this comes from shmem.

I should rather say that the VM runs on a 9p root filesystem and it
doesn't use/mount any block devices or disk-based filesystems.

I have all the trinity logs for the crash if that's useful. I tried a
couple of runs with just the (at the time) in-progress syscalls but it
didn't turn up anything interesting. Otherwise it seems like a lot of
data to go through by hand.

The crashing child seems to have just started, though, if that's relevant:

[child56:21744] [0]
open(filename="/sys/block/loop2/power/runtime_active_time",
flags=0x777b01, mode=666) = -1 (Not a directory)
[child56:21744] [1] [32BIT] sched_getattr(pid=1, param=0x7f37ec26c000,
size=3415) = -1 (Invalid argument)
[child56:21744] [2] [32BIT]
access(filename="/proc/2/task/2/net/stat/arp_cache", mode=2000) = -1
(Invalid argument)
[child56:21744] [3] getegid() = 0xfffe
[child56:21744] [4]
swapoff(path="/proc/721/task/721/net/dev_snmp6/tunl0") = -1 (Operation
not permitted)
[child56:21744] [5] timerfd_create(clockid=0x0, flags=0x0) = 439
[child56:21744] [6] pkey_mprotect(start=0x7f37ee656000, len=0,
prot=0x1000008, key=0x600000000000000) = 0
[child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0
[child56:21744] [8] flock(fd=168, cmd=0xffffffc191f30b0c) = -1
(Invalid argument)
[child56:21744] [9] add_key(_type=0x437a15,
_description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0,
ringid=0xfffffffffffffff8)

The other logfiles end thusly:

==> trinity-child0.log <==
[child0:21593] [311] faccessat(dfd=246,
filename="/proc/983/task/983/net/protocols", mode=2000) = -1 (Invalid
argument)
[child0:21593] [312] renameat(olddfd=246,
oldname="/proc/13/task/13/attr/sockcreate", newdfd=377,
newname="/proc/16/task/16/net/stat/rt_cache") = -1 (Permission denied)
[child0:21593] [313] [32BIT] readv(fd=289, vec=0x2e1a3d0, vlen=215) = 0

==> trinity-child100.log <==
[child100:21536] [439] setgid(gid=0x2a000200) = -1 (Operation not permitted)
[child100:21536] [440] waitid(which=175, upid=21587, infop=0x4,
options=3542, ru=0x7f37ec76c000) = -1 (Invalid argument)
[child100:21536] [441]
getxattr(pathname="/proc/980/task/980/net/ptype", name=0x7f37ee466000,
value=0x7f37ec26c000, size=49) = -1 (Operation not supported)

==> trinity-child101.log <==
[child101:21537] [55] getcwd(buf=0x7f37ee466000, size=4096) = 39
[child101:21537] [56] [32BIT] munlock(addr=0x7f37ee658000, len=0) = 0
[child101:21537] [57] semctl(semid=0xbd851e2b40e7df,
semnum=0x1b1b1b1b1b, cmd=0x2000000000, arg=0xcacacacaca) = -1 (Invalid
argument)

==> trinity-child102.log <==
[child102:21542] [11] readahead(fd=353, offset=2, count=249) = -1
(Invalid argument)
[child102:21542] [12] add_key(_type=0x43793f,
_description=0x7f37ec46c000, _payload=0x7f37ee658000, plen=32,
ringid=0xfffffffffffffffa) = -1 (Invalid argument)
[child102:21542] [13] time(tloc=0x7f37ee466000) = 0x584474e0

==> trinity-child103.log <==
[child103:21543] [45] dup(fildes=183) = 512
[child103:21543] [46] rt_sigpending(set=0x7f37ec86c000, sigsetsize=32)
= -1 (Invalid argument)
[child103:21543] [47] newstat(filename="/proc/587/task/587/gid_map",
statbuf=0x7f37ee466000) = 0

==> trinity-child104.log <==
[child104:21546] [49] getdents(fd=162, dirent=0x0, count=127) = -1
(Not a directory)
[child104:21546] [50] [32BIT] clock_adjtime(which_clock=0, utx=0x4) =
-1 (Bad address)
[child104:21546] [51] setsid() = 0x542a

==> trinity-child105.log <==
[child105:21547] [523] epoll_wait(epfd=244, events=0x8, maxevents=246,
timeout=-1) = -1 (Invalid argument)
[child105:21547] [524] dup2(oldfd=244, newfd=244) = 244
[child105:21547] [525] acct(name=0x7f37ec26c000) = -1 (Operation not permitted)

==> trinity-child106.log <==
[child106:19910] [136] getegid() = 0xfffe
[child106:19910] [137] munmap(addr=0x7f37ee65a000, len=4096) = 0
[child106:19910] [138] clock_nanosleep(which_clock=0x1, flags=0x1,
rqtp=0x7f37ec06c000, rmtp=0x7f37ee466000)
==> trinity-child107.log <==
[child107:21224] [994] copy_file_range(fd_in=373, off_in=0x2400e210,
fd_out=373, off_out=8, len=8, flags=0x0) = -1 (Bad file descriptor)
[child107:21224] [995] kcmp(pid1=1, pid2=21453, type=0x5,
idx1=0x787878787878, idx2=0xffffff6060606060) = -1 (Operation not
permitted)
[child107:21224] [996] [32BIT] readv(fd=365, vec=0x2e27e10, vlen=36) = 0

==> trinity-child108.log <==
[child108:21226] [759] recvfrom(fd=219, ubuf=0x7f37ec26c000, size=8,
flags=0x0, addr=0x2e1ed80, addr_len=110) = -1 (Bad file descriptor)
[child108:21226] [760] shmat(shmid=-4097, shmaddr=0x7f37ee465000,
shmflg=-195) = -1 (Invalid argument)
[child108:21226] [761] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child109.log <==
[child109:21552] [241] fremovexattr(fd=133, name=0x4) = -1 (Bad address)
[child109:21552] [242] sched_setattr(pid=0, uattr=0x1) = -1 (Invalid argument)
[child109:21552] [243] msgget(key=0x71717171717171,
msgflg=0xec00000000) = 0x2d805a

==> trinity-child10.log <==
[child10:21619] [120] creat(pathname="/proc/6/attr/fscreate", mode=0) = 521
[child10:21619] [121] [32BIT] sethostname(name=0x7f37ec06c000,
len=4096) = -1 (Operation not permitted)
[child10:21619] [122] add_key(_type=0x43793f,
_description=0x7f37ec66c000, _payload=0x7f37ec06c000, plen=8,
ringid=0xfffffffffffffffe) = -1 (Invalid argument)

==> trinity-child110.log <==
[child110:21554] [289] accept(fd=366, upeer_sockaddr=0x2e09f10,
upeer_addrlen=12) = -1 (Socket operation on non-socket)
[child110:21554] [290] unlinkat(dfd=147,
pathname="/proc/16/task/16/net/atm/br2684", flag=-62) = -1 (Invalid
argument)
[child110:21554] [291] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child111.log <==
[child111:21556] [102] getsockname(fd=292, usockaddr=0x2e10190,
usockaddr_len=110) = -1 (Socket operation on non-socket)
[child111:21556] [103] munmap(addr=0x7f37ee65a000, len=4096) = 0
[child111:21556] [104] personality(personality=1) = 0

==> trinity-child112.log <==
[child112:21241] [848] setpgid(pid=21456, pgid=9) = -1 (No such process)
[child112:21241] [849] splice(fd_in=182, off_in=0x7f37ec76c000,
fd_out=182, off_out=0x7f37ec76c008, len=4, flags=0x7) = -1 (Bad file
descriptor)
[child112:21241] [850] setgid(gid=0) = -1 (Operation not permitted)

==> trinity-child113.log <==
[child113:21560] [453] memfd_create(uname=0x7f37ee657000, flag=0x1) = 530
[child113:21560] [454] pwritev2(fd=342, vec=0x2e1e000, vlen=133,
pos_l=0xffffff1d, pos_h=-4, flags=0x0) = -1 (Illegal seek)
[child113:21560] [455] readahead(fd=338, offset=0xe000, count=4096) =
-1 (Invalid argument)

==> trinity-child114.log <==
[child114:21562] [27] [32BIT] getgroups16(gidsetsize=0xf12ec06dec,
grouplist=0x7f37ee465000) = -1 (Bad address)
[child114:21562] [28] keyctl(cmd=0x14, arg2=0xcecececece, arg3=8192,
arg4=0x7cc57bd0d8cf577, arg5=0x898989898989) = -1 (Invalid argument)
[child114:21562] [29] readlink(path="/sys/block/loop1/mq/0/pending",
buf=0x7f37ee467000, bufsiz=0x1565a3aa) = -1 (Invalid argument)

==> trinity-child115.log <==
[child115:21563] [392] eventfd(count=8) = 523
[child115:21563] [393] getegid() = 0xfffe
[child115:21563] [394] open(filename="/proc/587/stat", flags=0x795ac0,
mode=666) = -1 (Invalid argument)

==> trinity-child116.log <==
[child116:21567] [80] mbind(start=0x7f37ee65a000, len=0, mode=0x3,
nmask=0x0, maxnode=0x8000, flags=0x5) = -1 (Operation not permitted)
[child116:21567] [81] flock(fd=357, cmd=0xffffff0fe2220862) = 0
[child116:21567] [82] clock_adjtime(which_clock=15, utx=0x8) = -1
(Invalid argument)

==> trinity-child117.log <==
[child117:21570] [333] pipe(fildes=0x7f37ee466000) = 0
[child117:21570] [334] get_robust_list(pid=21439, head_ptr=0x1,
len_ptr=0x7f37ebc18000) = -1 (Operation not permitted)
[child117:21570] [335] execve(name="/proc/550/net/rt_cache",
argv=0x2e230d0, envp=0x2ba92a0)
==> trinity-child118.log <==
[child118:21572] [455] fstatat64(dfd=244,
filename="/proc/1297/task/1297/fdinfo/110", statbuf=0x8, flag=97) = -1
(Invalid argument)
[child118:21572] [456] flistxattr(fd=244, list=0x7f37ec26c000, size=8) = 0
[child118:21572] [457] ioprio_set(which=0x5d5d5d5d5d5d,
who=0x222734221582, ioprio=0xffffff5b5b5b5b5b) = -1 (Invalid argument)

==> trinity-child119.log <==
[child119:21574] [400] timerfd_create(clockid=0x0, flags=0x0) = 522
[child119:21574] [401] execveat(fd=338, name="/proc/1024/net/can-bcm",
argv=0x2bda9f0, envp=0x2e21130, flags=0x100) = 522
[child119:21574] [402] execveat(fd=244,
name="/sys/devices/platform/serial8250/tty/ttyS3/power/autosuspend_delay_ms",
argv=0x2e09b30, envp=0x2e1d0f0, flags=0x1000)
==> trinity-child11.log <==
[child11:21621] [352]
symlink(oldname="/sys/devices/virtual/sound/timer/uevent",
newname="/proc/549/task/549/net/rpc/auth.unix.gid") = -1 (File exists)
[child11:21621] [353] [32BIT] setgroups16(gidsetsize=0xffffff18,
grouplist=0x8) = -1 (Operation not permitted)
[child11:21621] [354] semtimedop(semid=0x100000de63c0d6e9, tsops=0x4,
nsops=0, timeout=0x8)
==> trinity-child120.log <==
[child120:21578] [141] execveat(fd=147,
name="/proc/2/task/2/net/llc/socket", argv=0x2e0a250, envp=0x2e10190,
flags=0x1000) = -1 (Permission denied)
[child120:21578] [142] readlinkat(dfd=330,
pathname="/sys/devices/virtual/misc/ocfs2_control/power/runtime_active_kids",
buf=0x7f37ee656000, bufsiz=4) = -1 (Invalid argument)
[child120:21578] [143] fremovexattr(fd=330, name=0x7f37ec26c000) = -1
(Permission denied)

==> trinity-child121.log <==
[child121:21263] [586] pread64(fd=237, buf=0x7f37ec76c000, count=3746,
pos=75) = 0
[child121:21263] [587] inotify_add_watch(fd=237,
pathname="/proc/974/task/974/net/atm/pvc", mask=0x5) = -1 (Invalid
argument)
[child121:21263] [588]
removexattr(pathname="/proc/974/task/974/net/sco",
name=0x7f37ec26c000) = -1 (Permission denied)

==> trinity-child122.log <==
[child122:21580] [57] getrusage(who=0x3a2a97385da52b25, ru=0x0) = -1
(Invalid argument)
[child122:21580] [58] getdents(fd=373, dirent=0x1, count=0x6cfbd4f4) =
-1 (Bad file descriptor)
[child122:21580] [59] msgsnd(msqid=-9, msgp=0x8, msgsz=59, msgflg=0x0)
= -1 (Bad address)

==> trinity-child123.log <==
[child123:21582] [289] recvmsg(fd=244, msg=0x7f37ec26c000,
flags=0x5f058) = -1 (Socket operation on non-socket)
[child123:21582] [290] poll(ufds=0x2d25880, nfds=7, timeout_msecs=0) = 5
[child123:21582] [291] getppid() = 1297

==> trinity-child124.log <==
[child124:20960] [493] uname(name=0x7f37ee466000) = 0
[child124:20960] [494] sysfs(option=0x1, arg1=-3,
arg2=0x5982b49806138008) = -1 (Bad address)
[child124:20960] [495]
execve(name="/proc/sys/net/ipv4/neigh/sit0/base_reachable_time_ms",
argv=0x2e122d0, envp=0x2bc95a0)
==> trinity-child125.log <==
[child125:21586] [15] setgroups(gidsetsize=12, grouplist=0x0) = -1
(Operation not permitted)
[child125:21586] [16] [32BIT] oldumount(name=0x7f37ec46c000) = -1
(Operation not permitted)
[child125:21586] [17] utimes(filename="/proc/5/net/anycast6",
utimes=0x8) = -1 (Bad address)

==> trinity-child126.log <==
[child126:21587] [29] shmdt(shmaddr=0x7f37ec66c000) = -1 (Invalid argument)
[child126:21587] [30]
setxattr(pathname="/sys/devices/pci0000:00/0000:00:03.0/net/eth0/power/runtime_active_kids",
name=0x4, value=0x7f37ec46c000, size=0xfffd, flags=0x1) = -1 (Bad
address)
[child126:21587] [31] chroot(filename="/proc/1240/net/hci") = -1 (Not
a directory)

==> trinity-child127.log <==
[child127:21588] [290] clock_gettime(which_clock=6156, tp=0x0) = -1
(Invalid argument)
[child127:21588] [291] statfs(pathname="/proc/1252/net/ipv6_route",
buf=0x7f37ee659000) = 0
[child127:21588] [292] msync(start=0x7f37ee658000, len=0, flags=0x6) = 0

==> trinity-child12.log <==
[child12:21625] [54] getsid(pid=21655) = 0
[child12:21625] [55] mlock2(start=0x7f37ec06c000, len=0x136000) = -1
(Cannot allocate memory)
[child12:21625] [56] dup3(oldfd=180, newfd=180, flags=0x80000) = -1
(Invalid argument)

==> trinity-child13.log <==
[child13:21627] [32] add_key(_type=0x437a15,
_description=0x7f37ee466001, _payload=0x7f37ee466000, plen=0x40000,
ringid=0xfffffffffffffffe) = -1 (Invalid argument)
[child13:21627] [33] mknodat(dfd=243,
filename="/proc/553/task/553/attr/fscreate", mode=0xf63d64273f4d42,
dev=-10) = -1 (Operation not permitted)
[child13:21627] [34] mq_timedreceive(mqdes=328,
u_msg_ptr=0x7f37ee466000, msg_len=0xd1d1d1, u_msg_prio=0x7f37ec06c000,
u_abs_timeout=0x7f37ee466000) = -1 (Invalid argument)

==> trinity-child14.log <==
[child14:21633] [194] msgsnd(msqid=0x4e251795f0, msgp=0x0, msgsz=8,
msgflg=0x0) = -1 (Bad address)
[child14:21633] [195] ppoll(ufds=0x2c4b530, nfds=7, tsp=0x2e09fc0,
sigmask=0x7f37ee465000, sigsetsize=128) = -1 (Invalid argument)
[child14:21633] [196] faccessat(dfd=251,
filename="/proc/15/task/15/net/stat/nf_conntrack", mode=40) = -1
(Invalid argument)

==> trinity-child15.log <==
[child15:21634] [210] timer_gettime(timer_id=1, setting=0x4) = -1
(Invalid argument)
[child15:21634] [211] lgetxattr(pathname="/proc/551/setgroups",
name=0x4, value=0xc, size=1) = -1 (Bad address)
[child15:21634] [212] splice(fd_in=265, off_in=0x7f37ec26c000,
fd_out=265, off_out=0x7f37ec26c008, len=1, flags=0x4) = -1 (Bad file
descriptor)

==> trinity-child16.log <==
[child16:21640] [307]
sched_get_priority_max(policy=0x7fffffff08000000) = -1 (Invalid
argument)
[child16:21640] [308] getcpu(cpup=0x7f37ee465000,
nodep=0x7f37ee465004, unused=0x7f37ee465008) = 0
[child16:21640] [309] sendfile(out_fd=367, in_fd=367, offset=0x0,
count=0) = -1 (Bad file descriptor)

==> trinity-child17.log <==
[child17:21642] [227]
llistxattr(pathname="/proc/16/task/16/net/route", list=0x8, size=4) =
0
[child17:21642] [228] msync(start=0x7f37ec66c000, len=16384, flags=0x6) = 0
[child17:21642] [229] [32BIT]
execve(name="/proc/1024/net/atm/devices", argv=0x2c32e20,
envp=0x2dfb5a0) = 0

==> trinity-child18.log <==
[child18:21644] [55] execve(name="/proc/971/task/971/net/icmp",
argv=0x2d25880, envp=0x2b996a0) = -1 (Permission denied)
[child18:21644] [56] utime(filename="/proc/974/net/raw",
times=0x7f37ee465000) = -1 (Operation not permitted)
[child18:21644] [57] umask(mask=0xfffffff7) = 18

==> trinity-child19.log <==
[child19:21651] [208] [32BIT] move_pages(pid=21604, nr_pages=145,
pages=0x2e1a3d0, nodes=0x2e1b3e0, status=0x2e1b630, flags=0x2) = -1
(Bad address)
[child19:21651] [209] timer_getoverrun(timer_id=-2) = -1 (Invalid argument)
[child19:21651] [210] rename(oldname="/proc/14/cmdline",
newname="/proc/1041/task/1041/net/can/rcvlist_all") = -1 (No such file
or directory)

==> trinity-child1.log <==
[child1:19992] [1361]
lsetxattr(pathname="/proc/1297/task/1297/fdinfo/96", name=0x1,
value=0x7f37ee466000, size=0, flags=0x0) = -1 (Bad address)
[child1:19992] [1362] setgid(gid=0xf0f0f0f0f) = -1 (Operation not permitted)
[child1:19992] [1363] clock_nanosleep(which_clock=0x1, flags=0x0,
rqtp=0x7f37ee659000, rmtp=0x7f37ee465000)
==> trinity-child20.log <==
[child20:21027] [1852] sendto(fd=244, buff=0x2df7410, len=1,
flags=0x800436c8, addr=0x2df7460, addr_len=16) = -1 (Socket operation
on non-socket)
[child20:21027] [1853] setresgid(rgid=0, egid=0x4848, sgid=0xdddddddd)
= -1 (Operation not permitted)
[child20:21027] [1854] poll(ufds=0x2e1fb80, nfds=8, timeout_msecs=0) = 6

==> trinity-child21.log <==
[child21:21330] [663]
statfs(pathname="/sys/block/md0/queue/scheduler", buf=0x7f37ec66c000)
= 0
[child21:21330] [664] sched_setaffinity(pid=1, len=1028,
user_mask_ptr=0x7f37ec46c000) = -1 (Operation not permitted)
[child21:21330] [665] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child22.log <==
[child22:21652] [61] pkey_alloc(flags=0, init_val=0x0) = -1 (No space
left on device)
[child22:21652] [62] fsync(fd=330) = 0
[child22:21652] [63] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child23.log <==
[child23:21655] [237] sync_file_range(fd=244, offset=191, nbytes=1,
flags=0xffffffffffffff9) = -1 (Invalid argument)
[child23:21655] [238] io_submit(ctx_id=4088, nr=1,
iocbpp=0x7f37ee658000) = -1 (Invalid argument)
[child23:21655] [239] flistxattr(fd=367, list=0x7f37ee656000, size=8)
==> trinity-child24.log <==
[child24:21657] [86] [32BIT] personality(personality=0x3fde3e4912ec) =
0x82604841
[child24:21657] [87]
mkdir(pathname="/proc/924/task/924/net/atalk/interface", mode=0) = -1
(File exists)
[child24:21657] [88] [32BIT] fchownat(dfd=381,
filename="/sys/module/nf_conntrack/parameters/nf_conntrack_helper",
user=0x1804802345, group=0xffffffff, flag=-3) = -1 (Invalid argument)

==> trinity-child25.log <==
[child25:21658] [157] timerfd_settime(ufd=201, flags=0x80800,
utmr=0x1, otmr=0x9) = -1 (Bad address)
[child25:21658] [158] rt_tgsigqueueinfo(tgid=0xc0da20044837,
pid=21330, sig=7099, uinfo=0x7f37ee466000) = -1 (Bad address)
[child25:21658] [159] pkey_mprotect(start=0x7f37ec76c000, len=0x79000,
prot=0x0, key=0x5120a54c676961) = -1 (Invalid argument)

==> trinity-child26.log <==
[child26:21665] [20] fdatasync(fd=217) = -1 (Invalid argument)
[child26:21665] [21] fsetxattr(fd=137, name=0x4, value=0x7f37ec26c000,
size=-9, flags=0x2) = -1 (Bad address)
[child26:21665] [22] open_by_handle_at(mountdirfd=137,
handle=0x7f37ee465000, flags=0x5b33c0) = -1 (Operation not permitted)

==> trinity-child27.log <==
[child27:21668] [190] preadv(fd=381, vec=0x2e1b3e0, vlen=233,
pos_l=0x710702af1dfa, pos_h=0xc976c19d4f) = 0
[child27:21668] [191] [32BIT] mlock(addr=0x7f37ec76c000, len=0xce000)
= -1 (Cannot allocate memory)
[child27:21668] [192] [32BIT] getsockopt(fd=381,
level=0x54100210202640, optname=0x9a9a9a9a9a9a, optval=0x7f37ee466000,
optlen=8) = -1 (Socket operation on non-socket)

==> trinity-child28.log <==
[child28:1813] [202] sched_setscheduler(pid=0, policy=0x5,
param=0x7f37ee465000) = 0
[child28:1813] [203] setresuid(ruid=0x80020293a4aa482, euid=247,
suid=0xffffffff81000002) = -1 (Operation not permitted)
[child28:1813] [204] epoll_ctl(epfd=332, op=0x3, fd=332,
event=0x2e0a150) [child28:1813] [204]
newstat(filename="/proc/9/task/9/net/atalk", statbuf=0x8)
[child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk",
statbuf=0x8) [child28:1813] [204]
statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000)
[child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk",
statbuf=0x8) [child28:1813] [204]
statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000)
[child28:1813] [204] timerfd_create(clockid=0x1, flags=0x800)
[child28:1813] [204] newstat(filename="/proc/9/task/9/net/atalk",
statbuf=0x8) [child28:1813] [204]
statfs(pathname="/proc/989/task/989/attr/prev", buf=0x7f37ee466000)
[child28:1813] [204] timerfd_create(clockid=0x1, flags=0x800)
[child28:1813] [204] process_vm_readv(pid=1, lvec=0x2e21d30,
liovcnt=48, rvec=0x2e22040, riovcnt=23, flags=0x0)
==> trinity-child29.log <==
= 0
[child29:1329] [68] [32BIT] getitimer(which=0xfffffffb,
value=0x7f37ee465000) [child29:1329] [68]
setpriority(which=0x8000000000, who=4, niceval=-2) [child29:1329] [68]
rt_sigprocmask(how=-9, set=0x0, oset=0x7f37ee466000, sigsetsize=128) =
-1 (Invalid argument)
[child29:1329] [69] shmat(shmid=0xfffffff9, shmaddr=0x1, shmflg=0x55d10db3)
==> trinity-child2.log <==
[child2:21596] [87] accept4(fd=189, upeer_sockaddr=0x2bda970,
upeer_addrlen=110, flags=0x80800) = -1 (Bad file descriptor)
[child2:21596] [88]
access(filename="/proc/1134/task/1134/net/llc/socket", mode=2060) = -1
(Invalid argument)
[child2:21596] [89] sendfile(out_fd=189, in_fd=172,
offset=0x7f37ee465000, count=4096) = -1 (Bad file descriptor)

==> trinity-child30.log <==
[child30:9344] [334] sendmsg(fd=0, msg=0x2c4b530, flags=0x5ba83) = -1
(Socket operation on non-socket)
[child30:9344] [335] signalfd(ufd=382, user_mask=0x7f37ee65a000,
sizemask=4) = -1 (Invalid argument)
[child30:9344] [336] listen(fd=206, backlog=0x4649119809)
[child30:9344] [336] io_getevents(ctx_id=0x372b076d67, min_nr=0xfff8,
nr=4096, events=0x7f37ec86c000, timeout=0x7f37ee65a000)
==> trinity-child31.log <==
[child31:2333] [416] munlockall() = 0
[child31:2333] [417] fgetxattr(fd=189, name=0x0, value=0x7f37ec76c000,
size=252) = -1 (Bad file descriptor)
[child31:2333] [418] fgetxattr(fd=308, name=0x4, value=0x7f37ec26c000,
size=8) [child31:2333] [418] clock_adjtime(which_clock=15, utx=0x1)
[child31:2333] [418] clock_adjtime(which_clock=15, utx=0x1)
[child31:2333] [418] timer_gettime(timer_id=0xf9f9f9f9f9f9,
setting=0x1) [child31:2333] [418] socket(family=32, type=0x80800,
protocol=10) [child31:2333] [418] socket(family=32, type=0x80800,
protocol=10) [child31:2333] [418] dup2(oldfd=371, newfd=363)
[child31:2333] [418] socket(family=32, type=0x80800, protocol=10)
[child31:2333] [418] dup2(oldfd=371, newfd=363) [child31:2333] [418]
unlink(pathname="/proc/1209/task/1209/net/atm/pvc") [child31:2333]
[418] openat(dfd=213, filename="/proc/971/task/971/net/atm",
flags=0x141a00, mode=712) [child31:2333] [418]
timer_gettime(timer_id=0x900040400186, setting=0x4) [child31:2333]
[418] getxattr(pathname="/proc/1023/net/stat/ndisc_cache",
name=0x7f37ec86c000, value=0x7f37ec86c001, size=259) [child31:2333]
[418] getxattr(pathname="/proc/1023/net/stat/ndisc_cache",
name=0x7f37ec86c000, value=0x7f37ec86c001, size=259) [child31:2333]
[418] open(filename="/sys/module/l2tp_ip", flags=0x80400, mode=211)
[child31:2333] [418] umask(mask=0xa10000004d483090) [child31:2333]
[418] umask(mask=0xa10000004d483090) [child31:2333] [418]
setgroups(gidsetsize=0, grouplist=0x4) [child31:2333] [418]
getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333]
[418] getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8)
[child31:2333] [418] io_cancel(ctx_id=0xffffffff00000004,
iocb=0x7f37ee465000, result=0x4fc0000000) [child31:2333] [418]
getgroups(gidsetsize=0xd4d4d4d4d4d4, grouplist=0x8) [child31:2333]
[418] io_cancel(ctx_id=0xffffffff00000004, iocb=0x7f37ee465000,
result=0x4fc0000000) [child31:2333] [418] ioperm(from=221,
num=0x1786c7a03485c3c0, turn_on=176) [child31:2333] [418]
fadvise64(fd=271, offset=0x2afb1db, len=449, advice=0x5)
[child31:2333] [418] getpriority(which=0x1, who=2622) [child31:2333]
[418] signalfd(ufd=285, user_mask=0x7f37ee466000, sizemask=0)
[child31:2333] [418] setresuid(ruid=40, euid=0xffff, suid=-7)
==> trinity-child32.log <==
[child32:21672] [199] getresuid(ruid=0x7f37ec26c000,
euid=0x7f37ee65a000, suid=0x7f37ee465000) = 0
[child32:21672] [200] getrlimit(resource=0xa, rlim=0x4) = -1 (Bad address)
[child32:21672] [201] [32BIT] signalfd(ufd=367,
user_mask=0x7f37ec46c000, sizemask=32)
==> trinity-child33.log <==
[child33:21677] [231] mq_getsetattr(mqdes=244,
u_mqstat=0x7f37ee658000, u_omqstat=0x7f37ee658008) = -1 (Invalid
argument)
[child33:21677] [232] gettid() = 0x54ad
[child33:21677] [233] timer_create(which_clock=161,
timer_event_spec=0x7f37ec86c000, create_timer_id=0x7f37ec86c008)
==> trinity-child34.log <==
[child34:21679] [186] sched_rr_get_interval(pid=0, interval=0x7f37ee466000) = 0
[child34:21679] [187] [32BIT] epoll_pwait(epfd=244,
events=0x7f37ee657000, maxevents=0x40300000a, timeout=562) = -1
(Invalid argument)
[child34:21679] [188] getpriority(which=0x0, who=1)
==> trinity-child35.log <==
[child35:14725] [92] getrandom(buf=0x7f37ee656000, count=0, flags=0x3)
= -1 (Invalid argument)
[child35:14725] [93] setgid(gid=-4) = -1 (Operation not permitted)
[child35:14725] [94] execveat(fd=210, name="/proc/942/net/hidp",
argv=0x2bda970, envp=0x2e10190, flags=0x0) = -1 (Operation not
permitted)

==> trinity-child36.log <==
[child36:21681] [167] sethostname(name=0x7f37ec76c000, len=227) = -1
(Operation not permitted)
[child36:21681] [168] munlock(addr=0x7f37ec86c000, len=0x48000) = 0
[child36:21681] [169] execve(name="/proc/980/task/980/net/udplite",
argv=0x2e0a1a0, envp=0x2e10190)
==> trinity-child37.log <==
[child37:21682] [33] setgid(gid=0x6c6c6c6c) = -1 (Operation not permitted)
[child37:21682] [34] syslog(type=0x3, buf=0x7f37ee656000, len=0) = 0
[child37:21682] [35] pwrite64(fd=335, buf=0x2e204b0, count=1220,
pos=56) = -1 (Bad file descriptor)

==> trinity-child38.log <==
[child38:21689] [107] setpriority(which=0xafafafafafafafaf,
who=0xffffffffffff0000, niceval=0x400000000) = -1 (Invalid argument)
[child38:21689] [108] newstat(filename="/proc/983/cmdline",
statbuf=0x7f37ec66c000) = 0
[child38:21689] [109] mq_getsetattr(mqdes=367, u_mqstat=0x0, u_omqstat=0x0)
==> trinity-child39.log <==
[child39:20422] [556] adjtimex(txc_p=0x7f37ee466000) = -1 (Operation
not permitted)
[child39:20422] [557] ioperm(from=0, num=0xd000, turn_on=55) = -1
(Operation not permitted)
[child39:20422] [558] clock_nanosleep(which_clock=0x2, flags=0x0,
rqtp=0x7f37ec66c000, rmtp=0x7f37ec66c000)
==> trinity-child3.log <==
[child3:21597] [146] prlimit64(pid=1, resource=12, new_rlim=0x4,
old_rlim=0xc) = -1 (Bad address)
[child3:21597] [147] mq_unlink(u_name=0x7f37ec46c000) = -1 (No such
file or directory)
[child3:21597] [148] [32BIT] rt_sigprocmask(how=0x1616161616161617,
set=0x7f37ec76c000, oset=0x7f37ee465000, sigsetsize=128) = -1 (Invalid
argument)

==> trinity-child40.log <==
[child40:21692] [101] getgid() = 0xfffe
[child40:21692] [102]
lsetxattr(pathname="/proc/14/task/14/net/udplite6",
name=0x7f37ee657000, value=0x7f37ee465000, size=4096, flags=0x0) = -1
(Permission denied)
[child40:21692] [103] pwritev2(fd=367, vec=0x2e1a3d0, vlen=92,
pos_l=15, pos_h=0xffa4cdcdd0ee0601, flags=0x5)
==> trinity-child41.log <==
[child41:3589] [16] sched_get_priority_max(policy=0x34000000000000c3)
[child41:3589] [16] timerfd_settime(ufd=308, flags=0x80800,
utmr=0x7f37ec66c000, otmr=0x7f37ee466000) [child41:3589] [16]
timerfd_settime(ufd=326, flags=0x0, utmr=0x7f37ee465000,
otmr=0x7f37ec86c000) = -1 (File exists)
[child41:3589] [16] sched_get_priority_max(policy=0x34000000000000c3)
[child41:3589] [16] timerfd_settime(ufd=308, flags=0x80800,
utmr=0x7f37ec66c000, otmr=0x7f37ee466000) [child41:3589] [16]
timerfd_settime(ufd=326, flags=0x0, utmr=0x7f37ee465000,
otmr=0x7f37ec86c000) [child41:3589] [16]
getresuid(ruid=0x7f37ee467000, euid=0x7f37ee467000,
suid=0x7f37ec06c000) [child41:3589] [16] getrusage(who=0x703a9e50fc,
ru=0x8) [child41:3589] [16] munlockall() = 0
[child41:3589] [17]
chmod(filename="/proc/sys/net/ipv6/route/gc_min_interval", mode=4)
==> trinity-child42.log <==
[child42:21696] [165] getrusage(who=-5, ru=0x0) = -1 (Invalid argument)
[child42:21696] [166] fremovexattr(fd=168, name=0x7f37ee465000) = -1
(Numerical result out of range)
[child42:21696] [167] madvise(start=0x7f37ec06c000, len_in=0xdb000,
advice=0x9) = 0

==> trinity-child43.log <==
[child43:21700] [106] readv(fd=244, vec=0x2e1a3d0, vlen=246) = 0
[child43:21700] [107]
swapoff(path="/sys/devices/virtual/tty/tty26/power/runtime_usage") =
-1 (Operation not permitted)
[child43:21700] [108] linkat(olddfd=244,
oldname="/sys/devices/virtual/workqueue/writeback/nice", newdfd=244,
newname="/proc/1014/net/nfsfs/volumes", flags=0x0) = -1 (File exists)

==> trinity-child44.log <==
[child44:21703] [94] mkdir(pathname="/sys/module/pstore/parameters",
mode=0) = -1 (File exists)
[child44:21703] [95] sendfile(out_fd=367, in_fd=137, offset=0x1,
count=8) = -1 (Bad address)
[child44:21703] [96] [32BIT] sync_file_range(fd=367,
offset=0x707070707, nbytes=0x8000, flags=0xffffefff)
==> trinity-child45.log <==
[child45:21704] [1] fdatasync(fd=381) = -1 (Invalid argument)
[child45:21704] [2] munmap(addr=0x7f37ee656000, len=4096) = 0
[child45:21704] [3] pipe2(fildes=0x7f37ec46c000, flags=0x4800) = 0

==> trinity-child46.log <==
[child46:21705] [82]
rmdir(pathname="/sys/devices/tracepoint/power/runtime_active_kids") =
-1 (Permission denied)
[child46:21705] [83] chroot(filename="/proc/15/task/15/net/tcp") = -1
(Not a directory)
[child46:21705] [84] [32BIT] ioprio_set(which=-2, who=2048,
ioprio=0x646464646464) = -1 (Invalid argument)

==> trinity-child47.log <==
[child47:21710] [127] setrlimit(resource=0x7f37ee465000, rlim=16384) =
-1 (Bad address)
[child47:21710] [128] setns(fd=244, nstype=0x0) = -1 (Invalid argument)
[child47:21710] [129] pkey_alloc(flags=0, init_val=0x2)
==> trinity-child48.log <==
[child48:21717] [39]
access(filename="/proc/10/task/10/net/ip_vs_stats", mode=700) = -1
(Invalid argument)
[child48:21717] [40] [32BIT] sigprocmask(how=233, set=0x7f37ec46c000,
oset=-9) = -1 (Bad address)
[child48:21717] [41] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> trinity-child49.log <==
[child49:21718] [81] sendto(fd=168, buff=0x2e0a540, len=1,
flags=0x41110, addr=0x2ba92a0, addr_len=28) = -1 (Socket operation on
non-socket)
[child49:21718] [82] clock_gettime(which_clock=0x430d4f678edd, tp=0x8)
= -1 (Invalid argument)
[child49:21718] [83] removexattr(pathname="/proc/13/attr/prev",
name=0x7f37ec06c000)
==> trinity-child4.log <==
[child4:21602] [155] getgroups(gidsetsize=0, grouplist=0x1) = 1
[child4:21602] [156] sendmsg(fd=0, msg=0x2c22c50, flags=0x40010883) =
-1 (Socket operation on non-socket)
[child4:21602] [157] fchown(fd=330, user=0x20240098000041,
group=0x20000) = -1 (Operation not permitted)

==> trinity-child50.log <==
[child50:21719] [41] shmctl(shmid=0x4c4c4c4c4c4c4c, cmd=0x2, buf=0x0)
= -1 (Invalid argument)
[child50:21719] [42] timer_delete(timer_id=0x80000000) = -1 (Invalid argument)
[child50:21719] [43] madvise(start=0x7f37ec86c000, len_in=0x20000, advice=0x2)
==> trinity-child51.log <==
[child51:21723] [91] msgctl(msqid=0x9f054e8b21, cmd=0x3,
buf=0x7f37ee466000) = 1763
[child51:21723] [92] time(tloc=0x7f37ee466000) = 0x584474e5
[child51:21723] [93] timer_gettime(timer_id=1, setting=0x7f37ee466000)
==> trinity-child52.log <==
[child52:21728] [70] epoll_ctl(epfd=244, op=0x3, fd=244,
event=0x2e0a2f0) = -1 (Operation not permitted)
[child52:21728] [71] [32BIT] statfs(pathname="/proc/548/net/atm/vc",
buf=0x7f37ec46c000) = -1 (Bad address)
[child52:21728] [72] rt_sigaction(sig=52, act=0x7f37ee465000,
oact=0x7f37ec06c000, sigsetsize=128) = -1 (Invalid argument)

==> trinity-child53.log <==
[child53:21734] [64] lseek(fd=244, offset=0, whence=0x1) = 33
[child53:21734] [65] preadv(fd=244, vec=0x2e10190, vlen=17,
pos_l=0x2714319808dd0822, pos_h=128) = 0
[child53:21734] [66] epoll_create1(flags=0x0) = 516

==> trinity-child54.log <==
[child54:21738] [50] ioprio_get(which=0, who=10023) = -1 (Invalid argument)
[child54:21738] [51] [32BIT] setregid(rgid=0x1000000000000,
egid=0xffff) = -1 (Operation not permitted)
[child54:21738] [52]
execve(name="/proc/1240/net/rpc/auth.rpcsec.context", argv=0x2e10190,
envp=0x2c6bd50)
==> trinity-child55.log <==
[child55:21741] [44] recvmsg(fd=381, msg=0x1, flags=0x20000000) = -1
(Socket operation on non-socket)
[child55:21741] [45] [32BIT] setgroups(gidsetsize=0xf9f9f9f9f9f9f9,
grouplist=0x0) = -1 (Operation not permitted)
[child55:21741] [46] execveat(fd=168, name="/proc/irq/4/smp_affinity",
argv=0x2e10190, envp=0x2e00900, flags=0x100)
==> trinity-child56.log <==
[child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0
[child56:21744] [8] flock(fd=168, cmd=0xffffffc191f30b0c) = -1
(Invalid argument)
[child56:21744] [9] add_key(_type=0x437a15,
_description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0,
ringid=0xfffffffffffffff8)
==> trinity-child57.log <==
[child57:21746] [12] [32BIT] mq_open(u_name=0x7f37ec86c000,
oflag=8192, mode=2000, u_attr=0x7f37ec86c004) = -1 (Bad address)
[child57:21746] [13] getpgid(pid=21226) = 0x52ea
[child57:21746] [14] execveat(fd=168, name="/proc/1297/task/1297",
argv=0x2bda970, envp=0x2ba92a0, flags=0x0)
==> trinity-child58.log <==
[child58:21749] [33] eventfd2(count=4, flags=0x0) = 441
[child58:21749] [34] mknodat(dfd=244,
filename="/sys/devices/pci0000:00/0000:00:03.0/net/eth0/queues",
mode=0x17104124, dev=0xffffffff0000567f) = -1 (Operation not
permitted)
[child58:21749] [35] clock_getres(which_clock=158, tp=0x4) = -1
(Invalid argument)

==> trinity-child59.log <==
[child59:21754] [21] setfsgid(gid=0x40000000000000) = 0xfffe
[child59:21754] [22] getpgrp() = 0
[child59:21754] [23] ppoll(ufds=0x2e0a5d0, nfds=0, tsp=0x2e0a590,
sigmask=0x0, sigsetsize=128)
==> trinity-child5.log <==
[child5:21604] [287] seccomp(op=0x1, flags=0x0, uargs=0x1) = 0
[child5:21604] [288] rt_sigpending(set=0x7f37ee466000,
sigsetsize=0x2323232323232323) = -1 (Invalid argument)
[child5:21604] [289] getuid() = 0xfffe

==> trinity-child60.log <==
[child60:21758] [0] getsid(pid=1) = 0
[child60:21758] [1] io_cancel(ctx_id=255, iocb=0x7f37ec46c000,
result=0x7f37ec46c000) = -1 (Invalid argument)
[child60:21758] [2] [32BIT] recvmmsg(fd=367, mmsg=0x1, vlen=4,
flags=0x0, timeout=0x7f37ec46c000)
==> trinity-child61.log <==
[child61:21760] [5] open(filename="/proc/1240/net/nfsfs",
flags=0x331702, mode=666) = 444
[child61:21760] [6] recvmmsg(fd=244, mmsg=0x7f37ec66c000, vlen=1,
flags=0x0, timeout=0x7f37ee466000) = -1 (Socket operation on
non-socket)
[child61:21760] [7] swapon(path=".//proc/3/task/3/net/ip_vs_conn/",
swap_flags=0x0)
==> trinity-child62.log <==
[child62:21428] [668] memfd_create(uname=0x7f37ee465000, flag=0x1) = 546
[child62:21428] [669]
newstat(filename="/sys/devices/virtual/net/sit0/statistics/rx_dropped",
statbuf=0x8) = -1 (Bad address)
[child62:21428] [670] [32BIT] sigpending(set=0x7f37ee467000) = -1 (Bad address)

==> trinity-child63.log <==
[child63:21431] [151] getgid() = 0xfffe
[child63:21431] [152] set_mempolicy(mode=0xffffffff, nmask=0x0,
maxnode=0xbbbbbbbbbbbbbbbb) = -1 (Invalid argument)
[child63:21762] [0] ioperm(from=0xb5b5b5b5b5b5b5b6, num=514, turn_on=135)
==> trinity-child64.log <==
[child64:21432] [754] utime(filename="/proc/552/net/can/rcvlist_eff",
times=0x4) = -1 (Bad address)
[child64:21432] [755] fcntl(fd=244, cmd=0x406, arg=244) = 540
[child64:21432] [756] mlock2(start=0x7f37ec06c000, len=0x9d000) = -1
(Cannot allocate memory)

==> trinity-child65.log <==
[child65:21435] [140] timer_create(which_clock=0x10000000,
timer_event_spec=0x1, create_timer_id=0x7f37ec46c000) = -1 (Invalid
argument)
[child65:21435] [141] getpeername(fd=152, usockaddr=0x2e0a1a0,
usockaddr_len=16) = -1 (Socket operation on non-socket)
[child65:21435] [142] ioperm(from=0x54ee6fb85625, num=1, turn_on=128)
= -1 (Invalid argument)

==> trinity-child66.log <==
[child66:21438] [310] msgctl(msqid=0x12121212121212, cmd=0x1,
buf=0x7f37ee465000) = -1 (Invalid argument)
[child66:21438] [311] set_robust_list(head=0x7f37ee658000, len=24) = 0
[child66:21438] [312] msgget(key=0x10000000,
msgflg=0x80000000000000d1) = -1 (Permission denied)

==> trinity-child67.log <==
[child67:21439] [184] pkey_alloc(flags=0, init_val=0x0) = -1 (Invalid argument)
[child67:21439] [185] getpgrp() = 0
[child67:21439] [186] [32BIT] getpgrp() = 0

==> trinity-child68.log <==
[child68:21443] [90] shmdt(shmaddr=0x7f37ee465000) = 0
[child68:21443] [91] remap_file_pages(start=0x7f37ec86c000, size=4096,
prot=0, pgoff=0, flags=0x10000) = 0
[child68:21443] [92] [32BIT] get_mempolicy(policy=0x7f37ee467000,
nmask=0x7f37ee467001, maxnode=147, addr=0x7f37ee466000, flags=0x0) =
-1 (Invalid argument)

==> trinity-child69.log <==
[child69:21444] [283] signalfd(ufd=139, user_mask=0x1, sizemask=8) =
-1 (Invalid argument)
[child69:21444] [284]
utimes(filename="/sys/devices/virtual/tty/tty50/power/runtime_status",
utimes=0x7f37ee465000) = -1 (Invalid argument)
[child69:21444] [285] fremovexattr(fd=139, name=0x7f37ee465000) = -1
(Permission denied)

==> trinity-child6.log <==
[child6:21605] [7] msync(start=0x7f37ee656000, len=0, flags=0x1) = 0
[child6:21605] [8] mq_timedreceive(mqdes=324,
u_msg_ptr=0x7f37ec06c000, msg_len=0x41da, u_msg_prio=0x7f37ee656000,
u_abs_timeout=0x7f37ee657000) = -1 (Bad file descriptor)
[child6:21605] [9] [32BIT] sethostname(name=0x7f37ec76c000,
len=0x85b19) = -1 (Operation not permitted)

==> trinity-child70.log <==
[child70:21447] [654] inotify_init1(flags=0x80000) = 547
[child70:21447] [655] setns(fd=244, nstype=0x0) = -1 (Invalid argument)
[child70:21447] [656] getdents(fd=367, dirent=0x4, count=0)
==> trinity-child71.log <==
[child71:21449] [86] fremovexattr(fd=344, name=0x1) = -1 (Bad address)
[child71:21449] [87] fadvise64(fd=344, offset=0, len=1, advice=0x2) = 0
[child71:21449] [88] [32BIT] fstat64(fd=285, statbuf=0x7f37ee467000) =
-1 (Bad address)

==> trinity-child72.log <==
[child72:21138] [88] [32BIT] pread64(fd=210, buf=0x7f37ec06c000,
count=1361, pos=0) = -1 (Invalid argument)
[child72:21138] [89] clock_settime(which_clock=512, tp=0x7f37ec06c000)
= -1 (Invalid argument)
[child72:21138] [90] mbind(start=0x7f37ec86c000, len=0xf5000,
mode=0x2, nmask=0x1, maxnode=0x8000, flags=0x6) = -1 (Bad address)

==> trinity-child73.log <==
[child73:21141] [706] setrlimit(resource=0x7f37ec26c000,
rlim=0xfffffffd) = -1 (Bad address)
[child73:21141] [707] pwrite64(fd=351, buf=0x2e091a0, count=1,
pos=0xfffd) = -1 (Bad file descriptor)
[child73:21141] [708] time(tloc=0x7f37ee65a000) = 0x584474dd

==> trinity-child74.log <==
[child74:21456] [429] setfsgid(gid=-5) = 0xfffe
[child74:21456] [430] sethostname(name=0x7f37ec26c000, len=8) = -1
(Operation not permitted)
[child74:21456] [431] open(filename="/sys/block/loop2/mq/0/cpu_list",
flags=0x333401, mode=666) = -1 (Not a directory)

==> trinity-child75.log <==
[child75:21458] [578] accept(fd=316, upeer_sockaddr=0x2e240e0,
upeer_addrlen=3729) = -1 (Socket operation on non-socket)
[child75:21458] [579] setxattr(pathname="/sys/block/loop5",
name=0x7f37ec06c000, value=0x7f37ec86c000, size=8, flags=0x1) = -1
(Permission denied)
[child75:21458] [580] sched_setscheduler(pid=0, policy=0x3, param=0x0)
= -1 (Invalid argument)

==> trinity-child76.log <==
[child76:21461] [546] gettimeofday(tv=0x7f37ee465000,
tz=0x7f37ee465004) = -1 (Bad address)
[child76:21461] [547] [32BIT] rt_sigaction(sig=32, act=0x7f37ec46c000,
oact=0x0, sigsetsize=128) = -1 (Invalid argument)
[child76:21461] [548] clock_nanosleep(which_clock=0x2, flags=0x1,
rqtp=0x7f37ee656000, rmtp=0x7f37ec06c000)
==> trinity-child77.log <==
[child77:21465] [73]
swapoff(path="/proc/865/task/865/net/rpc/auth.rpcsec.context") = -1
(Operation not permitted)
[child77:21465] [74] sched_getscheduler(pid=21253) = 0
[child77:21465] [75] [32BIT]
rt_sigtimedwait(uthese=0xba501c1fc77f1f57, uinfo=0x0,
uts=0x7f37ee466000, sigsetsize=0x7f37ee466008) = -1 (Invalid argument)

==> trinity-child78.log <==
[child78:21469] [216] mknod(filename="/proc/1023/net/ip_vs_conn",
mode=1007, dev=0x1b9008e90080) = -1 (No such file or directory)
[child78:21469] [217] geteuid() = 0xfffe
[child78:21469] [218] setsockopt(fd=169, level=0xa49a302c26ea,
optname=51, optval=0x2e1afd0, optlen=0x5bf09058529076) = -1 (Bad file
descriptor)

==> trinity-child79.log <==
[child79:21475] [696] setresuid(ruid=-2, euid=43,
suid=0x7272727272727272) = -1 (Operation not permitted)
[child79:21475] [697] msync(start=0x7f37ee659000, len=0, flags=0x6) = 0
[child79:21475] [698] io_submit(ctx_id=118, nr=8, iocbpp=0x8) = -1
(Invalid argument)

==> trinity-child7.log <==
[child7:21609] [33] prctl(option=22, arg2=2, arg3=0, arg4=0, arg5=0) =
-1 (Bad address)
[child7:21609] [34] userfaultfd(flags=0x800) = 511
[child7:21609] [35] getuid() = 0xfffe

==> trinity-child80.log <==
[child80:21478] [101] settimeofday(tv=0x7f37ec26c000,
tz=0x7f37ec46c000) = -1 (Operation not permitted)
[child80:21478] [102] getdents(fd=330, dirent=0x0, count=56) = -1 (Not
a directory)
[child80:21478] [103] epoll_ctl(epfd=330, op=0x2, fd=330,
event=0x2e0a390) = -1 (Invalid argument)

==> trinity-child81.log <==
[child81:21482] [33] [32BIT] move_pages(pid=0, nr_pages=304,
pages=0x2e19fc0, nodes=0x2e1afd0, status=0x2e1b4a0, flags=0x0) = -1
(Bad address)
[child81:21482] [34] getpriority(which=0x1, who=21008) = 1
[child81:21482] [35] sethostname(name=0x7f37ee659000, len=0x6aeaf512)
= -1 (Operation not permitted)

==> trinity-child82.log <==
[child82:21163] [963] keyctl(cmd=0x2, arg2=0xff00000, arg3=0xc928d000,
arg4=0x400000, arg5=1) = -1 (Invalid argument)
[child82:21163] [964] [32BIT] uselib(library=0x7f37ec46c000) = -1 (Bad address)
[child82:21163] [965] mq_getsetattr(mqdes=226, u_mqstat=0x1,
u_omqstat=0x2) = -1 (Bad address)

==> trinity-child83.log <==
[child83:21486] [531] getrlimit(resource=0xb, rlim=0x7f37ee465000) =
-1 (Bad address)
[child83:21486] [532] gettid() = 0x53ee
[child83:21486] [533] renameat(olddfd=244,
oldname="/proc/1252/net/atalk/socket", newdfd=267,
newname=".//proc/721/net/irda/ircomm") = -1 (Not a directory)

==> trinity-child84.log <==
[child84:21488] [238] mkdir(pathname="/proc/1161/task/1161/mounts",
mode=222) = -1 (File exists)
[child84:21488] [239] capget(header=0x67002320,
dataptr=0xfffaffb7effdfffe) = -1 (Bad address)
[child84:21488] [240]
rmdir(pathname="/sys/devices/virtual/misc/ocfs2_control/power/runtime_active_time")
= -1 (Permission denied)

==> trinity-child85.log <==
[child85:21489] [151] settimeofday(tv=0x7f37ec76c000,
tz=0x7f37ec76c008) = -1 (Invalid argument)
[child85:21489] [152] fstatat64(dfd=365,
filename=".//proc/551/net/ip_vs_conn", statbuf=0x7f37ec26c000,
flag=-2) = -1 (Invalid argument)
[child85:21489] [153] read(fd=365, buf=0x7f37ee466000, count=4096) = 0

==> trinity-child86.log <==
[child86:21496] [39] sched_getaffinity(pid=21349, len=217,
user_mask_ptr=0x4) = -1 (Invalid argument)
[child86:21496] [40] semtimedop(semid=149, tsops=0x7f37ec86c000,
nsops=-1, timeout=0x7f37ec86c000) = -1 (Argument list too long)
[child86:21496] [41] get_robust_list(pid=0, head_ptr=0x7f37ee465000,
len_ptr=0x7f37ec66c000) = 0

==> trinity-child87.log <==
[child87:21500] [517] select(n=0, inp=0x2e12fa0, outp=0x2e13030,
exp=0x2e130c0, tvp=0x2e09740) = 0
[child87:21500] [518] readlink(path="/proc/14/comm", buf=0x1,
bufsiz=223) = -1 (Invalid argument)
[child87:21500] [519] setgid(gid=0) = -1 (Operation not permitted)

==> trinity-child88.log <==
[child88:21502] [523] [32BIT] socketcall(call=1, args=0x2e1b390) = -1
(Address family not supported by protocol)
[child88:21502] [524]
newlstat(filename="/proc/722/task/722/net/rpc/nfs",
statbuf=0x7f37ec76c000) = 0
[child88:21502] [525] mq_unlink(u_name=0x7f37ec06c000)
==> trinity-child89.log <==
[child89:21507] [132] [32BIT] getresuid(ruid=0x7f37ec06c000,
euid=0x7f37ec06c001, suid=0x7f37ec06c005) = -1 (Bad address)
[child89:21507] [133] uname(name=0x7f37ec26c000) = 0
[child89:21507] [134] shutdown(fd=213, how=0x0) = -1 (Bad file descriptor)

==> trinity-child8.log <==
[child8:21610] [211] mlock2(start=0x7f37ec26c000, len=0x95000) = -1
(Cannot allocate memory)
[child8:21610] [212]
removexattr(pathname="/sys/module/zswap/parameters/compressor",
name=0x7f37ec46c000) = -1 (Permission denied)
[child8:21610] [213] socket(family=25, type=2048, protocol=5) = -1
(Address family not supported by protocol)

==> trinity-child90.log <==
[child90:21508] [36] [32BIT]
llistxattr(pathname="/proc/1045/net/protocols", list=0x0, size=-6) = 0
[child90:21508] [37] pipe(fildes=0x7f37ee467000) = 0
[child90:21508] [38] ftruncate(fd=183, length=0xffffffffca1f2000) = -1
(Invalid argument)

==> trinity-child91.log <==
[child91:21514] [28] unlink(pathname="/proc/924/net/netstat") = -1
(Permission denied)
[child91:21514] [29] pivot_root(new_root=0x7f37ee465000,
put_old=0x7f37ec06c000) = -1 (Operation not permitted)
[child91:21514] [30] sched_setscheduler(pid=21410, policy=0x3,
param=0x4) = -1 (Bad address)

==> trinity-child92.log <==
[child92:21517] [444] [32BIT]
chroot(filename="/proc/980/task/980/net/dev_snmp6/sit0") = -1 (Not a
directory)
[child92:21517] [445] syslog(type=0x1, buf=0x7f37ec26c000,
len=0x48000) = -1 (Operation not permitted)
[child92:21517] [446] getrandom(buf=0x7f37ec76c000, count=0x2a000,
flags=0x1) = -1 (Invalid argument)

==> trinity-child93.log <==
[child93:21518] [477] waitid(which=0xffd9, upid=1,
infop=0x7f37ec06c000, options=0x100000, ru=0x7f37ec06c000) = -1
(Invalid argument)
[child93:21518] [478] fchmod(fd=168, mode=0) = -1 (Operation not permitted)
[child93:21518] [479] clock_nanosleep(which_clock=0x2, flags=0x1,
rqtp=0x7f37ec76c000, rmtp=0x7f37ee466000)
==> trinity-child94.log <==
[child94:20875] [1515] access(filename="/proc/567/net/atm/pvc",
mode=5000) = -1 (Invalid argument)
[child94:20875] [1516] seccomp(op=0x1, flags=0x1, uargs=0x4) = -1
(Invalid argument)
[child94:20875] [1517] [32BIT] getuid() = 0xfffe

==> trinity-child95.log <==
[child95:21520] [94] userfaultfd(flags=0x0) = 516
[child95:21520] [95] eventfd(count=153) = 517
[child95:21520] [96] shmat(shmid=0x4000000, shmaddr=0x7f37ee466000,
shmflg=0x3000000000400000) = -1 (Invalid argument)

==> trinity-child96.log <==
[child96:21522] [225] pwritev2(fd=324, vec=0x2e19fc0, vlen=70,
pos_l=0x2727272727, pos_h=255, flags=0x2) = -1 (Illegal seek)
[child96:21522] [226] setfsuid(uid=-8) = 0xfffe
[child96:21522] [227] getdents(fd=324, dirent=0x7f37ec06c000,
count=4096) = -1 (Not a directory)

==> trinity-child97.log <==
[child97:21526] [119] [32BIT]
statfs64(pathname="/proc/545/task/545/net/can/rcvlist_all",
sz=0x8500000000000088) = -1 (Invalid argument)
[child97:21526] [120] pipe2(fildes=0x7f37ee465000, flags=0x0) = 0
[child97:21526] [121]
utimes(filename="/sys/devices/virtual/tty/ircomm4", utimes=0x0) = -1
(Permission denied)

==> trinity-child98.log <==
[child98:21529] [163] setresgid(rgid=20, egid=256,
sgid=0xffffffffffff5252) = -1 (Operation not permitted)
[child98:21529] [164]
llistxattr(pathname="/proc/986/net/dev_snmp6/ip_vti0", list=0x0,
size=4) = 0
[child98:21529] [165] getcpu(cpup=0x7f37ee659000,
nodep=0x7f37ee659008, unused=0x7f37ee659009) = 0

==> trinity-child99.log <==
[child99:21532] [223] sigaltstack(uss=0x7f37ec86c000,
uoss=0x7f37ec86c008, regs=0x7f37ee465000) = -1 (Invalid argument)
[child99:21532] [224] vmsplice(fd=320, iov=0x2e1afd0, nr_segs=1016,
flags=0xb) = -1 (Bad file descriptor)
[child99:21532] [225] [32BIT] setfsgid16(gid=0x2209024001252024) = 0xfffe

==> trinity-child9.log <==
[child9:21616] [294] getuid() = 0xfffe
[child9:21616] [295] fstatfs(fd=244, buf=0x8) = -1 (Bad address)
[child9:21616] [296] mlock2(start=0x7f37ec76c000, len=0xe000) = 0


Vegard

2016-12-05 17:10:21

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 12:10, Vegard Nossum <[email protected]> wrote:
> On 5 December 2016 at 00:04, Vegard Nossum <[email protected]> wrote:
>> FWIW I hit this as well:
>>
>> BUG: unable to handle kernel paging request at ffffffff81ff08b7
>> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
>> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
> [...]
>
>> I think you can rule out btrfs in any case, probably block layer as
>> well, since it looks like this comes from shmem.
>
> I should rather say that the VM runs on a 9p root filesystem and it
> doesn't use/mount any block devices or disk-based filesystems.
>
> I have all the trinity logs for the crash if that's useful. I tried a
> couple of runs with just the (at the time) in-progress syscalls but it
> didn't turn up anything interesting. Otherwise it seems like a lot of
> data to go through by hand.

I've hit this another 7 times in the past ~3 hours.

Three times the address being dereferenced has pointed to
iov_iter_init+0xaf (even across a kernel rebuild), three times it has
pointed to put_prev_entity+0x55, once to 0x800000008, and twice to
0x292. The fact that it would hit even one of those more than once
across runs is pretty suspicious to me, although the ones that point
to iov_iter_init and put_prev_entity point to "random" instructions in
the sense that they are neither entry points nor return addresses.

shmem_fault() was always on the stack, but it came from different
syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(),
clock_settime(), mount(), etc.

I also got this warning which is related:

------------[ cut here ]------------
WARNING: CPU: 9 PID: 25045 at lib/list_debug.c:59 __list_del_entry+0x14f/0x1d0
list_del corruption. prev->next should be ffff88014bdc79e8, but was
ffff88014bfbfc60
Kernel panic - not syncing: panic_on_warn set ...

CPU: 9 PID: 25045 Comm: trinity-c22 Not tainted 4.9.0-rc7+ #219
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffff88014bdc7700 ffffffff81fb0861 ffffffff83e74b60 ffff88014bdc77d8
ffffffff84006c00 ffffffff847103e0 ffff88014bdc77c8 ffffffff81515244
0000000041b58ab3 ffffffff844e21c2 ffffffff81515061 ffffffff00000054
Call Trace:
[<ffffffff81fb0861>] dump_stack+0x83/0xb2
[<ffffffff81515244>] panic+0x1e3/0x3ad
[<ffffffff81515061>] ? percpu_up_read_preempt_enable.constprop.45+0xcb/0xcb
[<ffffffff82016f7f>] ? __list_del_entry+0x14f/0x1d0
[<ffffffff812708bf>] __warn+0x1bf/0x1e0
[<ffffffff8135f2d2>] ? __lock_acquire.isra.32+0xc2/0x1a30
[<ffffffff8127098c>] warn_slowpath_fmt+0xac/0xd0
[<ffffffff812708e0>] ? __warn+0x1e0/0x1e0
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff82016f7f>] __list_del_entry+0x14f/0x1d0
[<ffffffff813530c0>] ? finish_wait+0xb0/0x180
[<ffffffff813530cb>] finish_wait+0xbb/0x180
[<ffffffff81576227>] shmem_fault+0x4c7/0x6b0
[<ffffffff81574743>] ? shmem_getpage_gfp+0x673/0x1c90
[<ffffffff81575d60>] ? shmem_getpage_gfp+0x1c90/0x1c90
[<ffffffff81352150>] ? wake_atomic_t_function+0x210/0x210
[<ffffffff815ad316>] __do_fault+0x206/0x410
[<ffffffff815ad110>] ? do_page_mkwrite+0x320/0x320
[<ffffffff815b80ac>] ? handle_mm_fault+0x1cc/0x2a60
[<ffffffff815b8fd7>] handle_mm_fault+0x10f7/0x2a60
[<ffffffff815b8012>] ? handle_mm_fault+0x132/0x2a60
[<ffffffff81310a7f>] ? thread_group_cputime+0x49f/0x6e0
[<ffffffff815b7ee0>] ? __pmd_alloc+0x370/0x370
[<ffffffff81310a9c>] ? thread_group_cputime+0x4bc/0x6e0
[<ffffffff81310d2d>] ? thread_group_cputime_adjusted+0x6d/0xe0
[<ffffffff81237170>] ? __do_page_fault+0x220/0x9f0
[<ffffffff815cba10>] ? find_vma+0x30/0x150
[<ffffffff812373a2>] __do_page_fault+0x452/0x9f0
[<ffffffff81237bf5>] trace_do_page_fault+0x1e5/0x3a0
[<ffffffff8122a007>] do_async_page_fault+0x27/0xa0
[<ffffffff83c97618>] async_page_fault+0x28/0x30
[<ffffffff81fdec7c>] ? copy_user_generic_string+0x2c/0x40
[<ffffffff812b0303>] ? SyS_times+0x93/0x110
[<ffffffff812b0270>] ? do_sys_times+0x2b0/0x2b0
[<ffffffff812b0270>] ? do_sys_times+0x2b0/0x2b0
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c96534>] entry_SYSCALL64_slow_path+0x25/0x25
------------[ cut here ]------------

The warning shows that it made it past the list_empty_careful() check
in finish_wait() but then bugs out on the &wait->task_list
dereference.

Anything stick out?


Vegard

2016-12-05 17:21:46

by Dave Jones

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Dec 05, 2016 at 06:09:29PM +0100, Vegard Nossum wrote:
> On 5 December 2016 at 12:10, Vegard Nossum <[email protected]> wrote:
> > On 5 December 2016 at 00:04, Vegard Nossum <[email protected]> wrote:
> >> FWIW I hit this as well:
> >>
> >> BUG: unable to handle kernel paging request at ffffffff81ff08b7
> >> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
> >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G B 4.9.0-rc7+ #217
> > [...]
> >
> >> I think you can rule out btrfs in any case, probably block layer as
> >> well, since it looks like this comes from shmem.
> >
> > I should rather say that the VM runs on a 9p root filesystem and it
> > doesn't use/mount any block devices or disk-based filesystems.
> >
> > I have all the trinity logs for the crash if that's useful. I tried a
> > couple of runs with just the (at the time) in-progress syscalls but it
> > didn't turn up anything interesting. Otherwise it seems like a lot of
> > data to go through by hand.
>
> I've hit this another 7 times in the past ~3 hours.
>
> Three times the address being dereferenced has pointed to
> iov_iter_init+0xaf (even across a kernel rebuild), three times it has
> pointed to put_prev_entity+0x55, once to 0x800000008, and twice to
> 0x292. The fact that it would hit even one of those more than once
> across runs is pretty suspicious to me, although the ones that point
> to iov_iter_init and put_prev_entity point to "random" instructions in
> the sense that they are neither entry points nor return addresses.
>
> shmem_fault() was always on the stack, but it came from different
> syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(),
> clock_settime(), mount(), etc.
> ------------[ cut here ]------------
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the &wait->task_list
> dereference.

I just pushed out the ftrace changes I made to Trinity that might help
you gather more clues.

Right now it's hardcoded to dump a trace to /boot/trace.txt when it
detects the kernel has become tainted.

Before a trinity run, I run this as root..

#!/bin/sh

cd /sys/kernel/debug/tracing/

echo 10000 > buffer_size_kb
echo function >> current_tracer

for i in $(cat /home/davej/blacklist-symbols)
do
echo $i >> set_ftrace_notrace
done

echo 1 >> tracing_on


blacklist-symbols is the more noisy stuff that pollutes traces.
Right now I use these: https://paste.fedoraproject.org/499794/14809582/
You may need to add some more.

(I'll get around to making all this scripting go away, and have trinity
just set this stuff up itself eventually)

Oh, and if you're not running as root, you might need a diff like below
so that trinity can stop the trace when it detects tainting.

Dave

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8696ce6bf2f6..2d6c97e871e0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7217,7 +7217,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("trace_clock", 0644, d_tracer, tr,
&trace_clock_fops);

- trace_create_file("tracing_on", 0644, d_tracer,
+ trace_create_file("tracing_on", 0666, d_tracer,
tr, &rb_simple_fops);

create_trace_options_dir(tr);

2016-12-05 17:55:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <[email protected]> wrote:
>
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the &wait->task_list
> dereference.
>
> Anything stick out?

I hate that shmem waitqueue garbage. It's really subtle.

I think the problem is that "wake_up_all()" in shmem_fallocate()
doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
"everything on the list".

I think that what happens is that the waiters somehow move from
TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
wake_up_all() will ignore them, leave them on the list, and now that
list on stack is no longer empty at the end.

And the way *THAT* can happen is that the task is on some *other*
waitqueue as well, and that other waitqueue wakes it up. That's not
impossible, you can certainly have people on wait-queues that still
take faults.

Or somebody just uses a directed wake_up_process() or something.
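
Roughly, the shape of the code in question (condensed for illustration,
not quoted verbatim from mm/shmem.c, using the 4.9-era wait_queue_t API):

/* fallocate side: owns the wait queue head on its own stack */
static long fallocate_side(struct inode *inode)
{
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq);

	/* publish &waitq to the fault path (e.g. via inode->i_private),
	 * then do the actual fallocate work ... */

	spin_lock(&inode->i_lock);
	inode->i_private = NULL;
	wake_up_all(&waitq);		/* wakes TASK_NORMAL waiters ...   */
	spin_unlock(&inode->i_lock);
	return 0;			/* ... but an entry whose task was
					 * already running stays queued, and
					 * the head on this stack is gone   */
}

/* fault side: queues an autoremove entry on that on-stack head */
static int fault_side(wait_queue_head_t *waitq)
{
	DEFINE_WAIT(wait);		/* uses autoremove_wake_function */

	prepare_to_wait(waitq, &wait, TASK_UNINTERRUPTIBLE);
	/* if some other wakeup makes this task runnable first, the
	 * autoremove wake function returns 0 here and leaves the
	 * entry on the list */
	schedule();
	finish_wait(waitq, &wait);
	return 0;
}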

Since you apparently can recreate this fairly easily, how about trying
this stupid patch?

NOTE! This is entirely untested. I may have screwed this up entirely.
You get the idea, though - just remove the wait queue head from the
list - the list entries stay around, but nothing points to the stack
entry (that we're going to free) any more.

And add the warning to see if this actually ever triggers (and because
I'd like to see the callchain when it does, to see if it's another
waitqueue somewhere or what..)

Linus


Attachments:
patch.diff (469.00 B)

2016-12-05 18:11:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum <[email protected]> wrote:
> On 23 November 2016 at 20:58, Dave Jones <[email protected]> wrote:
>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>>
>> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
>> > trace from just before this happened. Does this shed any light ?
>> >
>> > https://codemonkey.org.uk/junk/trace.txt
>>
>> crap, I just noticed the timestamps in the trace come from quite a bit
>> later. I'll tweak the code to do the taint checking/ftrace stop after
>> every syscall, that should narrow the window some more.
>
> FWIW I hit this as well:
>
> BUG: unable to handle kernel paging request at ffffffff81ff08b7

We really ought to improve this message. If nothing else, it should
say whether it was a read, a write, or an instruction fetch.

> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
> PGD 461e067 PUD 461f063
> PMD 1e001e1

Too lazy to manually decode this right now, but I don't think it matters.

> Oops: 0003 [#1] PREEMPT SMP KASAN

Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many
CPUs does your kernel think you have?

> RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>]
> __lock_acquire.isra.32+0xda/0x1a30
> RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082
> RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000

RAX points to kernel text.

> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00

2b: 3e ff 80 98 01 00 00 incl %ds:*0x198(%rax) <--
trapping instruction

That's very strange. I think this is:

atomic_inc((atomic_t *)&class->ops);

but my kernel contains:

3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax)

So your kernel has been smp-alternatived. That 3e comes from
alternatives_smp_unlock. If you're running on SMP with UP
alternatives, things will break.
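
Conceptually (this is not the actual arch/x86/kernel/alternative.c code,
just an illustration of the effect), the UP patching pass does something
like this at every recorded LOCK-prefix site:

static void smp_unlock_site(u8 *insn)
{
	if (*insn == 0xf0)	/* LOCK prefix ...                         */
		*insn = 0x3e;	/* ... becomes a DS segment override, one
				 * byte like LOCK and harmless on UP since
				 * DS is the default segment here anyway   */
}

so seeing 3e where the vmlinux has f0 means the alternatives code
decided at boot that only one CPU was present.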

What's your kernel command line? Can we have your entire kernel log from boot?

Adding Borislav, since he's the guru for this code.

2016-12-05 18:25:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski <[email protected]> wrote:
>
> So your kernel has been smp-alternatived. That 3e comes from
> alternatives_smp_unlock. If you're running on SMP with UP
> alternatives, things will break.

I'm assuming he's just running in a VM with a single CPU.

The problem that I pointed out with assuming wake_up_all() actually
removes all wait queue entries does not depend on SMP. The race is
much more fundamental and long-lived.

Linus

2016-12-05 18:26:13

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 19:11, Andy Lutomirski <[email protected]> wrote:
> On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum <[email protected]> wrote:
>> On 23 November 2016 at 20:58, Dave Jones <[email protected]> wrote:
>>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>>>
>>> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4
>>> > trace from just before this happened. Does this shed any light ?
>>> >
>>> > https://codemonkey.org.uk/junk/trace.txt
>>>
>>> crap, I just noticed the timestamps in the trace come from quite a bit
>>> later. I'll tweak the code to do the taint checking/ftrace stop after
>>> every syscall, that should narrow the window some more.
>>
>> FWIW I hit this as well:
>>
>> BUG: unable to handle kernel paging request at ffffffff81ff08b7
>
> We really ought to improve this message. If nothing else, it should
> say whether it was a read, a write, or an instruction fetch.
>
>> IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
>> PGD 461e067 PUD 461f063
>> PMD 1e001e1
>
> Too lazy to manually decode this right now, but I don't think it matters.
>
>> Oops: 0003 [#1] PREEMPT SMP KASAN
>
> Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many
> CPUs does your kernel think you have?

My first crash was running on a 1-CPU guest (not intentionally, but
because of a badly configured qemu). I'm honestly surprised it
triggered at all with 1 CPU, but I guess it shows that it's not a true
concurrency issue at least!

>
>> RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>]
>> __lock_acquire.isra.32+0xda/0x1a30
>> RSP: 0018:ffff8801bab8f730 EFLAGS: 00010082
>> RAX: ffffffff81ff071f RBX: 0000000000000000 RCX: 0000000000000000
>
> RAX points to kernel text.

Yes, it's somewhere in the middle of iov_iter_init() -- other crashes
also had put_prev_entity(), a null pointer, and some garbage values I
couldn't identify.

>
>> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
>> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
>> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
>
> 2b: 3e ff 80 98 01 00 00 incl %ds:*0x198(%rax) <--
> trapping instruction
>
> That's very strange. I think this is:
>
> atomic_inc((atomic_t *)&class->ops);
>
> but my kernel contains:
>
> 3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax)
>
> So your kernel has been smp-alternatived. That 3e comes from
> alternatives_smp_unlock. If you're running on SMP with UP
> alternatives, things will break.

Yes, indeed. It was running on 1 CPU by mistake and still triggered the bug.

The crashes started really pouring in once I got my qemu fixed. Just
to reassure you, here's another crash which shows it's using the
correct instruction on an actual multicore guest:

BUG: unable to handle kernel paging request at 00000003030001de
IP: [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
PGD 183fd2067 PUD 0

Oops: 0002 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 23 PID: 9584 Comm: trinity-c104 Not tainted 4.9.0-rc7+ #219
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff880189f68000 task.stack: ffff88017fe50000
RIP: 0010:[<ffffffff8135f2ea>] [<ffffffff8135f2ea>]
__lock_acquire.isra.32+0xda/0x1a30
RSP: 0018:ffff88017fe575e0 EFLAGS: 00010002
RAX: 0000000303000046 RBX: 0000000000000000 RCX: 0000000000000000
[...]
Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <f0>
ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
RIP [<ffffffff8135f2ea>] __lock_acquire.isra.32+0xda/0x1a30
RSP <ffff88017fe575e0>
CR2: 00000003030001de
---[ end trace 2846425104eb6141 ]---
Kernel panic - not syncing: Fatal exception
------------[ cut here ]------------

2b:* f0 ff 80 98 01 00 00 lock incl 0x198(%rax) <--
trapping instruction

> What's your kernel command line? Can we have your entire kernel log from boot?

Just in case you still want this, I've attached the boot log for the
"true SMP" guest above.


Vegard


Attachments:
5.txt.gz (17.35 kB)

2016-12-05 19:12:20

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 18:55, Linus Torvalds
<[email protected]> wrote:
> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <[email protected]> wrote:
>>
>> The warning shows that it made it past the list_empty_careful() check
>> in finish_wait() but then bugs out on the &wait->task_list
>> dereference.
>>
>> Anything stick out?
>
> I hate that shmem waitqueue garbage. It's really subtle.
>
> I think the problem is that "wake_up_all()" in shmem_fallocate()
> doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
> which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
> "everything on the list".
>
> I think that what happens is that the waiters somehow move from
> TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
> wake_up_all() will ignore them, leave them on the list, and now that
> list on stack is no longer empty at the end.
>
> And the way *THAT* can happen is that the task is on some *other*
> waitqueue as well, and that other waiqueue wakes it up. That's not
> impossible, you can certainly have people on wait-queues that still
> take faults.
>
> Or somebody just uses a directed wake_up_process() or something.
>
> Since you apparently can recreate this fairly easily, how about trying
> this stupid patch?
>
> NOTE! This is entirely untested. I may have screwed this up entirely.
> You get the idea, though - just remove the wait queue head from the
> list - the list entries stay around, but nothing points to the stack
> entry (that we're going to free) any more.
>
> And add the warning to see if this actually ever triggers (and because
> I'd like to see the callchain when it does, to see if it's another
> waitqueue somewhere or what..)

------------[ cut here ]------------
WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
Kernel panic - not syncing: panic_on_warn set ...

CPU: 22 PID: 14012 Comm: trinity-c73 Not tainted 4.9.0-rc7+ #220
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffff8801e32af970 ffffffff81fb08c1 ffffffff83e74b60 ffff8801e32afa48
ffffffff83ed7600 ffffffff847103e0 ffff8801e32afa38 ffffffff81515244
0000000041b58ab3 ffffffff844e21da ffffffff81515061 ffffffff8151591e
Call Trace:
[<ffffffff81fb08c1>] dump_stack+0x83/0xb2
[<ffffffff81515244>] panic+0x1e3/0x3ad
[<ffffffff812708bf>] __warn+0x1bf/0x1e0
[<ffffffff81270aac>] warn_slowpath_null+0x2c/0x40
[<ffffffff8157aef7>] shmem_fallocate+0x9a7/0xac0
[<ffffffff8167c6c0>] vfs_fallocate+0x350/0x620
[<ffffffff815ee5c2>] SyS_madvise+0x432/0x1290
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c965b4>] entry_SYSCALL64_slow_path+0x25/0x25
------------[ cut here ]------------

Attached a full log.


Vegard


Attachments:
0.txt.gz (18.48 kB)

2016-12-05 20:10:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum <[email protected]> wrote:
>
> ------------[ cut here ]------------
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0

Ok, good. So that's confirmed as the cause of this problem.

And the call chain that I wanted is obviously completely
uninteresting, because it's the call chain on the other side (the page
fault side) that would show the nested wake queue behavior. I was just
being stupid about it.

I wonder if we have any other places where we just blithely assume
that "wake_up_all()" will actually empty the whole wait queue. It's
_usually_ true, but as noted, nested waiting does happen.

Anyway, can you try this patch instead? It should actually cause the
wake_up_all() to always remove all entries, and thus the WARN_ON()
should no longer happen (and I removed the "list_del()" hackery).
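
The rough idea (the actual change is in the attached patch, this is just
a sketch of it): a wake function that always unlinks its wait entry,
whether or not this particular wake-up is the one that made the task
runnable:

static int synchronous_wake_function(wait_queue_t *wait, unsigned int mode,
				     int sync, void *key)
{
	int ret = default_wake_function(wait, mode, sync, key);

	list_del_init(&wait->task_list);	/* always unlink */
	return ret;
}

with the fault-side waiter set up via DEFINE_WAIT_FUNC(wait,
synchronous_wake_function) instead of plain DEFINE_WAIT().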

Linus


Attachments:
patch.diff (1.41 kB)

2016-12-05 20:11:02

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 20:11, Vegard Nossum <[email protected]> wrote:
> On 5 December 2016 at 18:55, Linus Torvalds
> <[email protected]> wrote:
>> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum <[email protected]> wrote:
>> Since you apparently can recreate this fairly easily, how about trying
>> this stupid patch?
>>
>> NOTE! This is entirely untested. I may have screwed this up entirely.
>> You get the idea, though - just remove the wait queue head from the
>> list - the list entries stay around, but nothing points to the stack
>> entry (that we're going to free) any more.
>>
>> And add the warning to see if this actually ever triggers (and because
>> I'd like to see the callchain when it does, to see if it's another
>> waitqueue somewhere or what..)
>
> ------------[ cut here ]------------
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
> Kernel panic - not syncing: panic_on_warn set ...

So I noticed the panic_on_warn just after sending the email and I've
been waiting for it to trigger again.

The warning has triggered twice more without panic_on_warn set and I
haven't seen any crash yet.


Vegard

2016-12-05 20:35:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

Adding the scheduler people to the participants list, and re-attaching
the patch, because while this patch is internal to the VM code, the
issue itself is not.

There might well be other cases where somebody goes "wake_up_all()"
will wake everybody up, so I can put the wait queue head on the stack,
and then after I've woken people up I can return".

Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
does make sure that everybody is woken up, but the usual autoremove
wake function only removes the wakeup entry if the process was woken
up by that particular wakeup. If something else had woken it up, the
entry remains on the list, and the waker in this case returned with
the wait head still containing entries.

Which is deadly when the wait queue head is on the stack.

So I'm wondering if we should make that "synchronous_wake_function()"
available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
that uses it.
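
Modelled on the existing DEFINE_WAIT_FUNC(), such a helper would be
about one macro (sketch only, nothing like it exists in the tree yet):

#define DEFINE_WAIT_SYNC(name)						\
	wait_queue_t name = {						\
		.private	= current,				\
		.func		= synchronous_wake_function,		\
		.task_list	= LIST_HEAD_INIT((name).task_list),	\
	}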

Of course, I'm really hoping that this shmem.c use is the _only_ such
case. But I doubt it.

Comments?

Note for Ingo and Peter: this patch has not been tested at all. But
Vegard did test an earlier patch of mine that just verified that yes,
the issue really was that wait queue entries remained on the wait
queue head just as we were about to return and free it.

Linus


On Mon, Dec 5, 2016 at 12:10 PM, Linus Torvalds
<[email protected]> wrote:
>
> Anyway, can you try this patch instead? It should actually cause the
> wake_up_all() to always remove all entries, and thus the WARN_ON()
> should no longer happen (and I removed the "list_del()" hackery).
>
> Linus


Attachments:
patch.diff (1.41 kB)

2016-12-05 21:33:24

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 21:35, Linus Torvalds
<[email protected]> wrote:
> Note for Ingo and Peter: this patch has not been tested at all. But
> Vegard did test an earlier patch of mine that just verified that yes,
> the issue really was that wait queue entries remained on the wait
> queue head just as we were about to return and free it.

The second patch has been running for 1h+ without any problems of any
kind. I should typically have seen 2 crashes by now. I'll let it run
overnight to be sure.


Vegard

2016-12-06 08:18:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote:
> Adding the scheduler people to the participants list, and re-attaching
> the patch, because while this patch is internal to the VM code, the
> issue itself is not.
>
> There might well be other cases where somebody goes "wake_up_all()"
> will wake everybody up, so I can put the wait queue head on the stack,
> and then after I've woken people up I can return".
>
> Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
> does make sure that everybody is woken up, but the usual autoremove
> wake function only removes the wakeup entry if the process was woken
> up by that particular wakeup. If something else had woken it up, the
> entry remains on the list, and the waker in this case returned with
> the wait head still containing entries.
>
> Which is deadly when the wait queue head is on the stack.

Yes, very much so.

> So I'm wondering if we should make that "synchronous_wake_function()"
> available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
> that uses it.

We could also do some debug code that tracks the ONSTACK ness and warns
on autoremove. Something like the below, equally untested.

>
> Of course, I'm really hoping that this shmem.c use is the _only_ such
> case. But I doubt it.

$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
28

---


diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2408e8d5c05c..199faaa89847 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
struct __wait_queue_head {
spinlock_t lock;
struct list_head task_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+ int onstack;
+#endif
};
typedef struct __wait_queue_head wait_queue_head_t;

@@ -56,9 +59,18 @@ struct task_struct;
#define DECLARE_WAITQUEUE(name, tsk) \
wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)

-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
+#else
+#define ___WAIT_QUEUE_ONSTACK(onstack)
+#endif
+
+#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack) { \
.lock = __SPIN_LOCK_UNLOCKED(name.lock), \
- .task_list = { &(name).task_list, &(name).task_list } }
+ .task_list = { &(name).task_list, &(name).task_list }, \
+ ___WAIT_QUEUE_ONSTACK(onstack) }
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0)

#define DECLARE_WAIT_QUEUE_HEAD(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct

#ifdef CONFIG_LOCKDEP
# define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
- ({ init_waitqueue_head(&name); name; })
+ ({ init_waitqueue_head(&name); (name).onstack = 1; name; })
# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
#else
-# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
+# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
+ wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1)
#endif

static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9453efe9b25a..746d00117d08 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
}
EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */

+static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+ WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
+#endif
+}
+
/*
* Note: we use "set_current_state()" _after_ the wait-queue add,
* because we need a memory barrier there on SMP, so that any
@@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
set_current_state(state);
+ prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue_tail(q, wait);
set_current_state(state);
+ prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
}
set_current_state(state);
}
+ prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);

return ret;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9bb7d825ba14..af2ef22a5b2b 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1235,6 +1235,14 @@ config DEBUG_PI_LIST

If unsure, say N.

+config DEBUG_WAITQUEUE
+ bool "Debug waitqueue"
+ depends on DEBUG_KERNEL
+ help
+ Enable this to do sanity checking on waitqueue usage
+
+ If unsure, say N.
+
config DEBUG_SG
bool "Debug SG table operations"
depends on DEBUG_KERNEL

2016-12-06 08:36:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: bio linked list corruption.


* Peter Zijlstra <[email protected]> wrote:

> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

This debug facility looks sensible. A couple of minor suggestions:

> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -39,6 +39,9 @@ struct wait_bit_queue {
> struct __wait_queue_head {
> spinlock_t lock;
> struct list_head task_list;
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> + int onstack;
> +#endif

The structure will pack better in the debug-enabled case if 'onstack' is next to
'lock', as spinlock_t is 4 bytes on many architectures.
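
I.e. something like this for the debug case (assuming a 4-byte
spinlock_t, 'onstack' then fills the padding before 'task_list' instead
of adding another 8 bytes on 64-bit):

struct __wait_queue_head {
	spinlock_t		lock;
#ifdef CONFIG_DEBUG_WAITQUEUE
	int			onstack;
#endif
	struct list_head	task_list;
};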

> -#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
> +#else
> +#define ___WAIT_QUEUE_ONSTACK(onstack)
> +#endif

Please help readers by showing the internal structure of the definition:

#ifdef CONFIG_DEBUG_WAITQUEUE
# define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
#else
# define ___WAIT_QUEUE_ONSTACK(onstack)
#endif


> +static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> + WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
> +#endif
> +}

I'd name this debug_waitqueue_check() or such - as the 'prepare' is a bit
misleading (we don't prepare debugging, we do the debug check here).

> +config DEBUG_WAITQUEUE
> + bool "Debug waitqueue"
> + depends on DEBUG_KERNEL

I'd name it DEBUG_SCHED_WAITQUEUE=y and I'd also make it depend on
CONFIG_DEBUG_SCHED.

LGTM otherwise!

Thanks,

Ingo

2016-12-06 08:42:13

by Vegard Nossum

[permalink] [raw]
Subject: Re: bio linked list corruption.

On 5 December 2016 at 22:33, Vegard Nossum <[email protected]> wrote:
> On 5 December 2016 at 21:35, Linus Torvalds
> <[email protected]> wrote:
>> Note for Ingo and Peter: this patch has not been tested at all. But
>> Vegard did test an earlier patch of mine that just verified that yes,
>> the issue really was that wait queue entries remained on the wait
>> queue head just as we were about to return and free it.
>
> The second patch has been running for 1h+ without any problems of any
> kind. I should typically have seen 2 crashes by now. I'll let it run
> overnight to be sure.

Alright, so nearly 12 hours later I don't see either the new warning
or the original crash at all, so feel free to add:

Tested-by: Vegard Nossum <[email protected]>.

That said, my 8 VMs had all panicked in some way due to OOMs (which is
new since v4.8), although some got page allocation stalls for >20s and
died because "khugepaged blocked for more than 120 seconds", others
got "Out of memory and no killable processes".


Vegard

2016-12-06 16:53:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: bio linked list corruption.

On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> Of course, I'm really hoping that this shmem.c use is the _only_ such
>> case. But I doubt it.
>
> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

Hmm. Most of them seem to be ok, because they use "wait_event()",
which will always remove itself from the wait-queue. And they do it
from the place that allocates the wait-queue.

IOW, the mm/shmem.c case really was fairly special, because it just
did "prepare_to_wait()", and then did a finish_wait() - and not in the
thread that allocated it on the stack.

So it's really that "some _other_ thread allocated the waitqueue on
the stack, and now we're doing a wait on it" that is bad.

So the normal pattern seems to be:

- allocate wq on the stack
- pass it on to a waker
- wait for it

and that's ok, because as part of "wait for it" we will also be
cleaning things up.

The reason mm/shmem.c was buggy was that it did

- allocate wq on the stack
- pass it on to somebody else to wait for
- wake it up

and *that* is buggy, because it's the waiter, not the waker, that
normally cleans things up.
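
The benign shape, in other words (made-up names, purely illustrative),
is the one where the stack that owns the head is also the stack that
waits on it:

struct foo {				/* hypothetical */
	wait_queue_head_t *done_wq;
	bool done;
};

/* hypothetical waker: does the work, sets foo->done, wakes foo->done_wq */
static void foo_kick(struct foo *foo);

static void foo_do_and_wait(struct foo *foo)
{
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);	/* allocated here ...      */

	foo->done_wq = &wq;			/* ... handed to the waker */
	foo_kick(foo);

	wait_event(wq, foo->done);		/* ... and waited on here;
						 * wait_event() always takes
						 * its own entry back off the
						 * list before returning    */
	foo->done_wq = NULL;
}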

Is there some good way to find this kind of pattern automatically, I wonder....

Linus

2016-12-07 16:58:01

by Dave Jones

[permalink] [raw]
Subject: Re: btrfs_destroy_inode warn (outstanding extents)

On Sat, Dec 03, 2016 at 11:48:33AM -0500, Dave Jones wrote:

> The interesting process here seems to be kworker/u8:17, and the trace
> captures some of what that was doing before that bad page was hit.

I'm travelling next week, so I'm trying to braindump the stuff I've
found so far and summarise so I can pick it back up later if no-one else
figures it out first.

I've hit the bad page map spew with enough regularity that I've now got a handful of
good traces.

http://codemonkey.org.uk/junk/btrfs/bad-page-state1.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state2.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state3.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state4.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state5.txt
http://codemonkey.org.uk/junk/btrfs/bad-page-state6.txt

It smells to me like a race between truncate and the writeback
workqueue. The variety of traces here seem to show both sides
of the race, sometimes it's kworker, sometimes a trinity child process.

bad-page-state3.txt onwards have some bonus trace_printk's from
btrfs_setsize as I was curious what sizes we were passing down to
truncate. The only patterns I see are going from very large to very
small sizes. Perhaps that causes truncate to generate so much
writeback that it makes the race apparent?



Other stuff I keep hitting:

Start transaction spew:
http://codemonkey.org.uk/junk/btrfs/start_transaction.txt
That's the WARN_ON(h->use_count > 2);
I hit this with enough regularity that I had to comment it out.
It's not clear to me whether this is related at all.

Lockdep spew:
http://codemonkey.org.uk/junk/btrfs/register_lock_class1.txt
http://codemonkey.org.uk/junk/btrfs/register_lock_class2.txt
This stuff has been around for a while (4.6ish iirc)

Sometimes the fs got into a screwed up state that needed btrfscking.
http://codemonkey.org.uk/junk/btrfs/replay-log-fail.txt

Dave

2016-12-09 21:19:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: btrfs_destroy_inode warn (outstanding extents)

On Thu, Dec 01, 2016 at 10:32:09AM -0500, Dave Jones wrote:
>
> (function-graph screws up the RIP for some reason, 'return_to_handler'
> should actually be btrfs_destroy_inode)

That's because function_graph hijacks the return address and replaces it with
return_to_handler. The back trace has code to show what that handler is
supposed to be. But I should see why the WARNING shows it instead, and fix
that too.
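
(Rough idea of the mechanism, with made-up names rather than the real
ftrace internals: on entry to a traced function the graph tracer saves
the caller's return address on a per-task shadow stack and overwrites
the real stack slot with its trampoline, which is why an unwinder that
doesn't consult that shadow stack prints return_to_handler.)

extern void return_to_handler(void);		/* ftrace trampoline */

static void graph_hijack_return(unsigned long *ret_slot,
				unsigned long *shadow, int *depth)
{
	shadow[(*depth)++] = *ret_slot;			/* remember real caller */
	*ret_slot = (unsigned long)return_to_handler;	/* patch stack slot     */
}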

Thanks,

-- Steve