LinuxLists.cc - fs: GPF in locked_inode_to_wb_and_lock

2016-04-18 09:44:37

Subject: fs: GPF in locked_inode_to_wb_and_lock_list

Hello,

I've hit the following GPF while running syzkaller fuzzer on commit
806fdcce017dc98c4dbf8ed001750a0d7d2bb0af (Apr 14):

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-11:0)
task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
locked_inode_to_wb_and_lock_list+0xa2/0x750
RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
Call Trace:
[< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
[<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
[<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
[<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
[< inline >] wb_do_writeback fs/fs-writeback.c:1844
[<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
[<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
[<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
[<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
[<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
fs/fs-writeback.c:281
RSP <ffff88006cdaf7d0>
---[ end trace 986a4d314dcb2694 ]---

The crash happened here:

if (wb != &wb->bdi->wb)
ffffffff818884cb: 48 89 d8 mov %rbx,%rax
ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax
ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1)
ffffffff818884d7: 0f 85 17 06 00 00 jne
ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
ffffffff818884e0: 48 83 c0 50 add $0x50,%rax
ffffffff818884e4: 48 39 c3 cmp %rax,%rbx
ffffffff818884e7: 0f 84 c3 00 00 00 je
ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>

Which means that bdi is NULL (if I get indirections right).

The block device is flush-11:0 (/dev/sr0).

I was able to reproduce it once by re-running programs from the crash log:
https://gist.githubusercontent.com/dvyukov/7c9e6358b661e2d50a774a70b5ee348a/raw/5cd94e32bbcf2a788b5efd0768cfd883c5921070/gistfile1.txt
But it took half an hour, and then another hour long run did not
trigger it. So it is probably triggered by some rare race.

Thank you

2016-04-20 21:15:01

by Tejun Heo

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

Hello, Dmitry.

On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:

> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
...
> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
> locked_inode_to_wb_and_lock_list+0xa2/0x750
> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Stack:
> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
> Call Trace:
> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
> [< inline >] wb_do_writeback fs/fs-writeback.c:1844
> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
> fs/fs-writeback.c:281
> RSP <ffff88006cdaf7d0>

Man, that's a beautiful trace w/ decoding of inline functions. When
did we start doing that? Is there a specific config option for this?

> ---[ end trace 986a4d314dcb2694 ]---
> The crash happened here:
>
> if (wb != &wb->bdi->wb)
> ffffffff818884cb: 48 89 d8 mov %rbx,%rax
> ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax
> ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1)

So, it's the above instruction.

> ffffffff818884d7: 0f 85 17 06 00 00 jne
> ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
> ffffffff818884e0: 48 83 c0 50 add $0x50,%rax
> ffffffff818884e4: 48 39 c3 cmp %rax,%rbx
> ffffffff818884e7: 0f 84 c3 00 00 00 je
> ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>
>
> Which means that bdi is NULL (if I get indirections right).

So, the wb != &wb->bdi->wb comparison would be the cmp at
0xffffffff818884e4 and given that it just compares the address of
&bdi->wb, bdi being NULL wouldn't trigger the fault.

cmpb $0x0,(%rax,%r13,1)
-> *(u8 *)(%rax + %r13) == 0
-> *(u8 *)((%rbx >> 3) + %r13) == 0

Where can that be from? I can't find anything matching even in the
surrounding functions.

Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a
proper canonical address and in general suspcious. Ooh, it's
KASAN_SHADOW_OFFSET. It looks like something is making KASAN trigger
a fault. Can we please bring in someone who's more familiar with
KASAN?

Thanks.

--
tejun

2016-04-21 08:25:37

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <[email protected]> wrote:
> Hello, Dmitry.
>
> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> ...
>> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> Stack:
>> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>> Call Trace:
>> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>> [< inline >] wb_do_writeback fs/fs-writeback.c:1844
>> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
>> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>> fs/fs-writeback.c:281
>> RSP <ffff88006cdaf7d0>
>
> Man, that's a beautiful trace w/ decoding of inline functions. When
> did we start doing that? Is there a specific config option for this?

Thanks! :)
I use this script for symbolization:
https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
It invokes addr2line to provide file:line info, adds inline frames,
strips ? frames (are they ever useful?) and strips timestamps.
You just need to pipe a report through it. It assumes that vmlinux is
in the current dir, but you can override it with --linux flag.

It can also insert actual source lines into a report (amount of
context is controlled with flags), for example:

Call Trace:
[< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9
8 {
9 current->pagefault_disabled++;
10 }
[< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
2715 hend);
2716 ret = khugepaged_scan_pmd(mm, vma,
2717
khugepaged_scan.address,
[< inline >] khugepaged_do_scan mm/huge_memory.c:2796
2795 pass_through_head < 2)
2796 progress +=
khugepaged_scan_mm_slot(pages - progress,
2797 &hpage);
[<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
2830 while (!kthread_should_stop()) {
2831 khugepaged_do_scan();
2832 khugepaged_wait_work();
[<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
1302 add_wait_queue(k->waitq, &wait);
1303 __set_current_state(TASK_INTERRUPTIBLE);
1304 }
[<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
404 SWAPGS
405 jmp restore_regs_and_iret
406 END(ret_from_fork)

2016-04-21 08:35:47

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <[email protected]> wrote:
> Hello, Dmitry.
>
> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> ...
>> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> Stack:
>> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>> Call Trace:
>> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>> [< inline >] wb_do_writeback fs/fs-writeback.c:1844
>> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
>> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>> fs/fs-writeback.c:281
>> RSP <ffff88006cdaf7d0>
>
> Man, that's a beautiful trace w/ decoding of inline functions. When
> did we start doing that? Is there a specific config option for this?
>
>> ---[ end trace 986a4d314dcb2694 ]---
>> The crash happened here:
>>
>> if (wb != &wb->bdi->wb)
>> ffffffff818884cb: 48 89 d8 mov %rbx,%rax
>> ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax
>> ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1)
>
> So, it's the above instruction.
>
>> ffffffff818884d7: 0f 85 17 06 00 00 jne
>> ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
>> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
>> ffffffff818884e0: 48 83 c0 50 add $0x50,%rax
>> ffffffff818884e4: 48 39 c3 cmp %rax,%rbx
>> ffffffff818884e7: 0f 84 c3 00 00 00 je
>> ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>
>>
>> Which means that bdi is NULL (if I get indirections right).
>
> So, the wb != &wb->bdi->wb comparison would be the cmp at
> 0xffffffff818884e4 and given that it just compares the address of
> &bdi->wb, bdi being NULL wouldn't trigger the fault.
>
> cmpb $0x0,(%rax,%r13,1)
> -> *(u8 *)(%rax + %r13) == 0
> -> *(u8 *)((%rbx >> 3) + %r13) == 0
>
> Where can that be from? I can't find anything matching even in the
> surrounding functions.
>
> Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a
> proper canonical address and in general suspcious. Ooh, it's
> KASAN_SHADOW_OFFSET. It looks like something is making KASAN trigger
> a fault. Can we please bring in someone who's more familiar with
> KASAN?

I am here.
For every memory access to ADDR, KASAN makes a byte load from
KASAN_SHADOW_OFFSET+ADDR>>8 first.
For accesses to kernel memory, the byte is addressable and contains
addressability state of the ADDR.
It also has a positive side effect of catching any accesses to user
memory at the place of occurrence as the address computations produces
a non-valid address.
But the negative side effect that instead of a usual NULL-deref GPF,
you now get what we have here. %r13 contains KASAN_SHADOW_OFFSET, and
%rax contains the address that normal code is going to dereference few
instructions later. So without KASAN the code would trigger NULL deref
on (%rbx is a copy of %rax):

ffffffff818884dd: 48 8b 03 mov (%rbx),%rax

So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
that is NULL?)

Sorry for this mess. It is a known issue in KASAN, but we don't know
how to fix it without slowing down execution and sacrificing other
properties.

2016-04-21 09:10:36

by Andrey Ryabinin

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

2016-04-21 11:25 GMT+03:00 Dmitry Vyukov <[email protected]>:
> On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <[email protected]> wrote:
>> Hello, Dmitry.
>>
>> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>>
>>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> ...
>>> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
>>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>>> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>>> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>>> Stack:
>>> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>>> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>>> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>>> Call Trace:
>>> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>>> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>>> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>>> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>>> [< inline >] wb_do_writeback fs/fs-writeback.c:1844
>>> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>>> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>>> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>>> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>>> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
>>> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>>> fs/fs-writeback.c:281
>>> RSP <ffff88006cdaf7d0>
>>
>> Man, that's a beautiful trace w/ decoding of inline functions. When
>> did we start doing that? Is there a specific config option for this?
>
> Thanks! :)
> I use this script for symbolization:
> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py

BTW, it would be nice to have that script in kernel tree, instead of
scripts/decode_stacktrace.sh which
is far less advanced.

> It invokes addr2line to provide file:line info, adds inline frames,
> strips ? frames (are they ever useful?) and strips timestamps.
> You just need to pipe a report through it. It assumes that vmlinux is
> in the current dir, but you can override it with --linux flag.
>
> It can also insert actual source lines into a report (amount of
> context is controlled with flags), for example:
>
>
> Call Trace:
> [< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9
> 8 {
> 9 current->pagefault_disabled++;
> 10 }
> [< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
> 2715 hend);
> 2716 ret = khugepaged_scan_pmd(mm, vma,
> 2717
> khugepaged_scan.address,
> [< inline >] khugepaged_do_scan mm/huge_memory.c:2796
> 2795 pass_through_head < 2)
> 2796 progress +=
> khugepaged_scan_mm_slot(pages - progress,
> 2797 &hpage);
> [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
> 2830 while (!kthread_should_stop()) {
> 2831 khugepaged_do_scan();
> 2832 khugepaged_wait_work();
> [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
> 1302 add_wait_queue(k->waitq, &wait);
> 1303 __set_current_state(TASK_INTERRUPTIBLE);
> 1304 }
> [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
> 404 SWAPGS
> 405 jmp restore_regs_and_iret
> 406 END(ret_from_fork)

2016-04-21 09:29:42

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

On Thu, Apr 21, 2016 at 11:10 AM, Andrey Ryabinin
<[email protected]> wrote:
> 2016-04-21 11:25 GMT+03:00 Dmitry Vyukov <[email protected]>:
>> On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <[email protected]> wrote:
>>> Hello, Dmitry.
>>>
>>> On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:
>>>
>>>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>>> ...
>>>> RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
>>>> locked_inode_to_wb_and_lock_list+0xa2/0x750
>>>> RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
>>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>>>> RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>>>> RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>>>> R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>>>> R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>>>> FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>>>> DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>>>> Stack:
>>>> ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>>>> ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>>>> ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>>>> Call Trace:
>>>> [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>>>> [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>>>> [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>>>> [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>>>> [< inline >] wb_do_writeback fs/fs-writeback.c:1844
>>>> [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>>>> [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>>>> [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>>>> [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>>>> [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>>>> Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>>>> 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>>>> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>>>> RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
>>>> RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>>>> fs/fs-writeback.c:281
>>>> RSP <ffff88006cdaf7d0>
>>>
>>> Man, that's a beautiful trace w/ decoding of inline functions. When
>>> did we start doing that? Is there a specific config option for this?
>>
>> Thanks! :)
>> I use this script for symbolization:
>> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
>
> BTW, it would be nice to have that script in kernel tree, instead of
> scripts/decode_stacktrace.sh which
> is far less advanced.

I am all for it!
We have 2 minor, unresolved bugs in the script:
https://github.com/google/sanitizers/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+kasan_symbolize
It would be good to resolve them before submitting it to kernel tree.
But I am not particularly experienced in Python. +Andrey Konovalov
will be able to do it in autumn. If anybody is willing to take it
earlier, you are welcome (please drop a note to
https://github.com/google/sanitizers/issues/674 to coordinate
efforts).

>> It invokes addr2line to provide file:line info, adds inline frames,
>> strips ? frames (are they ever useful?) and strips timestamps.
>> You just need to pipe a report through it. It assumes that vmlinux is
>> in the current dir, but you can override it with --linux flag.
>>
>> It can also insert actual source lines into a report (amount of
>> context is controlled with flags), for example:
>>
>>
>> Call Trace:
>> [< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9
>> 8 {
>> 9 current->pagefault_disabled++;
>> 10 }
>> [< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
>> 2715 hend);
>> 2716 ret = khugepaged_scan_pmd(mm, vma,
>> 2717
>> khugepaged_scan.address,
>> [< inline >] khugepaged_do_scan mm/huge_memory.c:2796
>> 2795 pass_through_head < 2)
>> 2796 progress +=
>> khugepaged_scan_mm_slot(pages - progress,
>> 2797 &hpage);
>> [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
>> 2830 while (!kthread_should_stop()) {
>> 2831 khugepaged_do_scan();
>> 2832 khugepaged_wait_work();
>> [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>> 1302 add_wait_queue(k->waitq, &wait);
>> 1303 __set_current_state(TASK_INTERRUPTIBLE);
>> 1304 }
>> [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
>> 404 SWAPGS
>> 405 jmp restore_regs_and_iret
>> 406 END(ret_from_fork)

2016-04-21 09:45:25

by Andrey Ryabinin

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <[email protected]>:
>
> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
>
> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
> that is NULL?)

Yes it's NULL wb, because there is only one load:
mov (%rbx),%rax => rax = wb->bdi
add $0x50,%rax => rax = &bdi->wb

2016-04-21 10:01:01

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
<[email protected]> wrote:
> 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <[email protected]>:
>>
>> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
>>
>> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
>> that is NULL?)
>
> Yes it's NULL wb, because there is only one load:
> mov (%rbx),%rax => rax = wb->bdi
> add $0x50,%rax => rax = &bdi->wb

I bet that wb becomes NULL on the second iteration of the loop. The
loop loops in case of a race with another thread, so it would also
explain why it is difficult to reproduce.

Tejun, does it make any sense to you?

2016-04-21 16:14:06

by Tejun Heo

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

Hello,

On Thu, Apr 21, 2016 at 10:25:12AM +0200, Dmitry Vyukov wrote:
> I use this script for symbolization:
> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
> It invokes addr2line to provide file:line info, adds inline frames,
> strips ? frames (are they ever useful?) and strips timestamps.

Heh, sometimes they are. They obviously are useful when backlinks are
broken for some reason. Through the remnants of the old call stacks,
they also can sometimes tell the history of the task - where it was
before it got to the current backtrace. It isn't reliable at all but
I can recall several occasions where they were useful in root causing
issues. It *might* make a sense to have an option to not strip them.

And, yeah, I'm all for it being in the kernel tree.

Thanks.

--
tejun

2016-04-21 17:06:45

by Tejun Heo

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

Hello,

(cc'ing Ilya, Jan and Jens)

On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote:
> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
> <[email protected]> wrote:
> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <[email protected]>:
> >>
> >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
> >>
> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
> >> that is NULL?)
> >
> > Yes it's NULL wb, because there is only one load:
> > mov (%rbx),%rax => rax = wb->bdi
> > add $0x50,%rax => rax = &bdi->wb
>
>
> I bet that wb becomes NULL on the second iteration of the loop. The
> loop loops in case of a race with another thread, so it would also
> explain why it is difficult to reproduce.
>
> Tejun, does it make any sense to you?

Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block:
detach bdev inode from its wb in __blkdev_put()") which allows inode
to wb association to be broken while other operations including
writeback are in progress. I thought it should be okay as the inode
must be clean at that point but that obviously doesn't mean that there
can be no writeback operations in flight.

I hope we could eventually move away from the current model where we
try to swap out an underlying data structure while upper layers may
still be referring to it in the future but for now we can make sure
the writeback operation is finished before detaching wb.

Dmitry, I understand that the bug is difficult to reproduce but can
you please give the following patch a try?

Thanks.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02..209ea33 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
kill_bdev(bdev);

bdev_write_inode(bdev);
- /*
- * Detaching bdev inode from its wb in __destroy_inode()
- * is too late: the queue which embeds its bdi (along with
- * root wb) can be gone as soon as we put_disk() below.
- */
- inode_detach_wb(bdev->bd_inode);
+ inode_detach_blkdev_wb(bdev);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5..ec1f530 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
}

/**
+ * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
+ * @bdev: block_device of interest
+ *
+ * @bdev is being put for the last time. Detaching bdev inode in
+ * __destroy_inode() is too late: the queue which embeds its bdi (along
+ * with root wb) can be gone as soon as the containing disk is put.
+ *
+ * This function dissociates @bdev->bd_inode from its wb. The inode must
+ * be clean and no further operations should be started on it.
+ */
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+ if (bdev->bd_inode->i_wb) {
+ flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
+ inode_detach_wb(bdev->bd_inode);
+ }
+}
+
+/**
* wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
* @wbc: writeback_control of interest
* @inode: target inode
@@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
{
}

+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+}
+
static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
struct inode *inode)
__releases(&inode->i_lock)

2016-04-22 18:55:51

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: fs: GPF in locked_inode_to_wb_and_lock_list

On Thu, Apr 21, 2016 at 7:06 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> (cc'ing Ilya, Jan and Jens)
>
> On Thu, Apr 21, 2016 at 12:00:38PM +0200, Dmitry Vyukov wrote:
>> On Thu, Apr 21, 2016 at 11:45 AM, Andrey Ryabinin
>> <[email protected]> wrote:
>> > 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <[email protected]>:
>> >>
>> >> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
>> >>
>> >> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
>> >> that is NULL?)
>> >
>> > Yes it's NULL wb, because there is only one load:
>> > mov (%rbx),%rax => rax = wb->bdi
>> > add $0x50,%rax => rax = &bdi->wb
>>
>>
>> I bet that wb becomes NULL on the second iteration of the loop. The
>> loop loops in case of a race with another thread, so it would also
>> explain why it is difficult to reproduce.
>>
>> Tejun, does it make any sense to you?
>
> Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block:
> detach bdev inode from its wb in __blkdev_put()") which allows inode
> to wb association to be broken while other operations including
> writeback are in progress. I thought it should be okay as the inode
> must be clean at that point but that obviously doesn't mean that there
> can be no writeback operations in flight.
>
> I hope we could eventually move away from the current model where we
> try to swap out an underlying data structure while upper layers may
> still be referring to it in the future but for now we can make sure
> the writeback operation is finished before detaching wb.
>
> Dmitry, I understand that the bug is difficult to reproduce but can
> you please give the following patch a try?

I've merged it into my tree and will restart the fuzzer and leave it
running for the weekend.
Though, yeah, it is difficult to reproduce...

> Thanks.
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 20a2c02..209ea33 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
> kill_bdev(bdev);
>
> bdev_write_inode(bdev);
> - /*
> - * Detaching bdev inode from its wb in __destroy_inode()
> - * is too late: the queue which embeds its bdi (along with
> - * root wb) can be gone as soon as we put_disk() below.
> - */
> - inode_detach_wb(bdev->bd_inode);
> + inode_detach_blkdev_wb(bdev);
> }
> if (bdev->bd_contains == bdev) {
> if (disk->fops->release)
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index d0b5ca5..ec1f530 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
> }
>
> /**
> + * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
> + * @bdev: block_device of interest
> + *
> + * @bdev is being put for the last time. Detaching bdev inode in
> + * __destroy_inode() is too late: the queue which embeds its bdi (along
> + * with root wb) can be gone as soon as the containing disk is put.
> + *
> + * This function dissociates @bdev->bd_inode from its wb. The inode must
> + * be clean and no further operations should be started on it.
> + */
> +static inline void inode_detach_blkdev_wb(struct block_device *bdev)
> +{
> + if (bdev->bd_inode->i_wb) {
> + flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
> + inode_detach_wb(bdev->bd_inode);
> + }
> +}
> +
> +/**
> * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
> * @wbc: writeback_control of interest
> * @inode: target inode
> @@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
> {
> }
>
> +static inline void inode_detach_blkdev_wb(struct block_device *bdev)
> +{
> +}
> +
> static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
> struct inode *inode)
> __releases(&inode->i_lock)