2022-02-18 01:29:08

by John Hubbard

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On 2/16/22 08:31, Lee Jones wrote:
> Good afternoon,
...
> I managed to seemingly bisect the issue down to commit:
>
> 60263d5889e6d ("iomap: fall back to buffered writes for invalidation failures")
>
> Although it appears to be the belief of the Filesystem community that
> this is likely not the cause of the issue and should therefore not be
> reverted.

They are correct in this matter, imho. :)

...
> Darrick seems to suggest that:
>
> "The BUG report came from page_buffers failing to find any buffer heads
> attached to the page."

Yes. And looking at the pair of backtraces below, this looks very much
like another aspect of the "get_user_pages problem" [1], originally
described in Jan Kara's 2018 email [2].

I'm getting close to posting an RFC for the direct IO conversion to
FOLL_PIN, but even after that, various parts of the kernel (reclaim,
filesystems/block layer) still need to be changed so as to use
page_maybe_dma_pinned() to help avoid this problem. There's a bit
more than that, actually.


[1] https://lwn.net/Articles/753027/

[2] https://www.spinics.net/lists/linux-mm/msg142700.html


thanks,
--
John Hubbard
NVIDIA

>
> If the reproducer, also massively stripped down from the original
> report, would be of any use to you, it can be found further down at
> [2].
>
> I don't how true this is, but it is my current belief that user-space
> should not be able to force the kernel to BUG. This seems to be a
> temporary DoS issue. So although not a critically serious security
> problem involved memory leakage or data corruption, it could
> potentially cause a nuisance if not rectified.
>
> Any well meaning help with this would be gratefully received.
>
> Kind regards,
> Lee
>
> [0] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsyzkaller.appspot.com%2Fbug%3Fextid%3D41c966bf0729a530bd8d&data=04%7C01%7Cjhubbard%40nvidia.com%7C107e857de4b940fbe7e708d9f169d4a8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637806259852035011%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=5oD2%2BB9iOOSqSh3xwLLzR1vFxqyMtYNivJVQmepj2ww%3D&reserved=0
>
> [1]
> [ 15.200920] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 15.215877] File: /syzkaller.IsS3Yc/0/bus PID: 1497 Comm: repro
> [ 16.718970] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 16.734250] File: /syzkaller.IsS3Yc/5/bus PID: 1512 Comm: repro
> [ 17.013871] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 17.028193] File: /syzkaller.IsS3Yc/6/bus PID: 1515 Comm: repro
> [ 17.320498] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 17.336115] File: /syzkaller.IsS3Yc/7/bus PID: 1518 Comm: repro
> [ 17.617921] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 17.633063] File: /syzkaller.IsS3Yc/8/bus PID: 1521 Comm: repro
> [ 18.527260] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 18.544236] File: /syzkaller.IsS3Yc/11/bus PID: 1530 Comm: repro
> [ 18.810347] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 18.824721] File: /syzkaller.IsS3Yc/12/bus PID: 1533 Comm: repro
> [ 19.099315] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 19.114151] File: /syzkaller.IsS3Yc/13/bus PID: 1536 Comm: repro
> [ 19.403882] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 19.418467] File: /syzkaller.IsS3Yc/14/bus PID: 1539 Comm: repro
> [ 19.703934] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
> [ 19.718400] File: /syzkaller.IsS3Yc/15/bus PID: 1542 Comm: repro
> [ 26.533129] ------------[ cut here ]------------
> [ 26.540473] WARNING: CPU: 1 PID: 1612 at fs/ext4/inode.c:3576 ext4_set_page_dirty+0xaf/0xc0
> [ 26.553171] Modules linked in:
> [ 26.557354] CPU: 1 PID: 1612 Comm: repro Not tainted 5.16.0+ #169
> [ 26.565238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [ 26.576182] RIP: 0010:ext4_set_page_dirty+0xaf/0xc0
> [ 26.583077] Code: 4c 89 ff e8 e3 86 e7 ff 49 f7 07 00 20 00 00 74 19 4c 89 ff 5b 41 5e 41 5f e9 8d 05 f0 ff 48 83 c0 ff 48 89 c3 e9 76 ff ff ff <0f> 0b eb e3 48 83 c0 ff 48 89 c3 eb 9e 0f 0b eb b8 55 48 89 e5 41
> [ 26.607402] RSP: 0018:ffff88810f4ffa10 EFLAGS: 00010246
> [ 26.614646] RAX: ffffea00043bc687 RBX: ffffea00043bc680 RCX: ffffffff9913f86d
> [ 26.625115] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043bc680
> [ 26.635137] RBP: 0000000000000400 R08: dffffc0000000000 R09: fffff940008778d1
> [ 26.644923] R10: fffff940008778d1 R11: 0000000000000000 R12: ffff88810e14c000
> [ 26.654807] R13: ffffea00043bc680 R14: ffffea00043bc688 R15: ffffea00043bc680
> [ 26.664812] FS: 00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
> [ 26.676238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 26.684212] CR2: 000000000049b3a8 CR3: 000000010f7a6005 CR4: 0000000000370ee0
> [ 26.693896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 26.703778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 26.714238] Call Trace:
> [ 26.717987] <TASK>
> [ 26.721105] folio_mark_dirty+0x72/0xa0
> [ 26.726455] set_page_dirty_lock+0x4a/0x70
> [ 26.732426] unpin_user_pages_dirty_lock+0x101/0x1d0
> [ 26.739369] process_vm_rw_single_vec+0x2f4/0x3c0
> [ 26.745707] ? process_vm_rw+0x4d0/0x4d0
> [ 26.751454] ? mm_access+0xe1/0x120
> [ 26.756495] process_vm_rw+0x2fd/0x4d0
> [ 26.762431] ? __ia32_sys_process_vm_writev+0x80/0x80
> [ 26.769780] ? preempt_count_sub+0xf/0xc0
> [ 26.775021] ? folio_add_lru+0xea/0x110
> [ 26.780260] ? preempt_count_sub+0xf/0xc0
> [ 26.786062] ? _raw_spin_unlock+0x2e/0x50
> [ 26.791676] ? __handle_mm_fault+0x14a7/0x1970
> [ 26.797550] ? handle_mm_fault+0x1d0/0x1d0
> [ 26.802981] ? up_read+0x6f/0x180
> [ 26.807430] ? down_read_trylock+0x13f/0x190
> [ 26.813252] ? down_write_trylock+0x130/0x130
> [ 26.818935] ? handle_mm_fault+0x160/0x1d0
> [ 26.824454] ? do_kern_addr_fault+0x130/0x130
> [ 26.830695] __x64_sys_process_vm_writev+0x71/0x80
> [ 26.837270] do_syscall_64+0x43/0x90
> [ 26.842134] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 26.848994] RIP: 0033:0x44c849
> [ 26.853311] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [ 26.877911] RSP: 002b:00007f27c16d6168 EFLAGS: 00000216 ORIG_RAX: 0000000000000137
> [ 26.887953] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000044c849
> [ 26.897571] RDX: 0000000000000001 RSI: 0000000020c22000 RDI: 0000000000000076
> [ 26.907068] RBP: 00007f27c16d61a0 R08: 0000000000000001 R09: 0000000000000000
> [ 26.916433] R10: 0000000020c22fa0 R11: 0000000000000216 R12: 00007ffd3062431e
> [ 26.927113] R13: 00007ffd3062431f R14: 0000000000000000 R15: 00007f27c16d6640
> [ 26.936755] </TASK>
> [ 26.939785] ---[ end trace 42b5bb79157828eb ]---
> [ 27.160243] ------------[ cut here ]------------
> [ 27.166572] kernel BUG at fs/ext4/inode.c:2620!
> [ 27.173362] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
> [ 27.180459] CPU: 1 PID: 1616 Comm: repro Tainted: G W 5.16.0+ #169
> [ 27.190304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [ 27.201112] RIP: 0010:mpage_prepare_extent_to_map+0x573/0x580
> [ 27.208692] Code: 08 14 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 3b 84 24 40 01 00 00 75 15 89 d8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b e8 04 39 15 01 0f 1f 40 00 55 48 89 e5 41 57 41 56 41
> [ 27.232612] RSP: 0018:ffff88810f7e6b60 EFLAGS: 00010246
> [ 27.239348] RAX: ffffea00043b4fc7 RBX: 0000000000000067 RCX: ffffffff9913ea61
> [ 27.248720] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043b4fc0
> [ 27.257899] RBP: ffff88810f7e6cf0 R08: dffffc0000000000 R09: fffff940008769f9
> [ 27.266846] R10: fffff940008769f9 R11: 0000000000000000 R12: 0000000000000000
> [ 27.276050] R13: ffff88810f7e6be0 R14: ffffea00043b4fc0 R15: ffff88810f7e6f58
> [ 27.285119] FS: 00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
> [ 27.295466] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 27.302935] CR2: 0000000020002002 CR3: 000000010cb70006 CR4: 0000000000370ee0
> [ 27.312837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 27.322697] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 27.332451] Call Trace:
> [ 27.336060] <TASK>
> [ 27.339164] ? ext4_iomap_swap_activate+0x10/0x10
> [ 27.346171] ? preempt_count_sub+0xf/0xc0
> [ 27.352456] ? page_writeback_cpu_online+0x1f0/0x1f0
> [ 27.359598] ? ext4_init_io_end+0x18/0x90
> [ 27.365427] ? kmem_cache_alloc+0xf2/0x200
> [ 27.371638] ext4_writepages+0x823/0x1c50
> [ 27.377047] ? kernel_text_address+0xa8/0xc0
> [ 27.382587] ? unwind_get_return_address+0x25/0x40
> [ 27.388649] ? __rcu_read_unlock+0x8d/0x320
> [ 27.394134] ? __rcu_read_lock+0x20/0x20
> [ 27.399094] ? preempt_count_sub+0xf/0xc0
> [ 27.404220] ? ext4_readpage+0x110/0x110
> [ 27.409494] ? stack_trace_save+0x120/0x120
> [ 27.414838] ? __is_insn_slot_addr+0x58/0x60
> [ 27.420186] ? kernel_text_address+0xa8/0xc0
> [ 27.425531] ? __kernel_text_address+0x9/0x40
> [ 27.431355] ? unwind_get_return_address+0x25/0x40
> [ 27.437415] ? stack_trace_save+0xdb/0x120
> [ 27.442722] ? stack_trace_snprint+0xc0/0xc0
> [ 27.448310] do_writepages+0x20b/0x3a0
> [ 27.453245] ? __kasan_slab_alloc+0x43/0xb0
> [ 27.458532] ? filter_irq_stacks+0x3d/0x80
> [ 27.463792] ? __writepage+0xb0/0xb0
> [ 27.468366] ? __iomap_dio_rw+0x1c2/0xec0
> [ 27.473579] ? iomap_dio_rw+0x5/0x30
> [ 27.478152] ? ext4_file_write_iter+0x8a8/0xde0
> [ 27.484091] ? do_iter_readv_writev+0x2ce/0x360
> [ 27.490002] ? do_iter_write+0x109/0x370
> [ 27.495400] ? iter_file_splice_write+0x4b6/0x770
> [ 27.501658] ? direct_splice_actor+0x7b/0x90
> [ 27.507225] ? splice_direct_to_actor+0x309/0x570
> [ 27.513487] ? do_splice_direct+0x172/0x230
> [ 27.519352] ? do_sendfile+0x567/0x960
> [ 27.524605] ? __x64_sys_sendfile64+0x104/0x150
> [ 27.531035] ? do_syscall_64+0x43/0x90
> [ 27.536259] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 27.543606] ? do_syscall_64+0x43/0x90
> [ 27.548942] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 27.556281] ? _raw_spin_lock+0x120/0x120
> [ 27.561984] ? __ext4_handle_dirty_metadata+0x22d/0x510
> [ 27.569186] filemap_write_and_wait_range+0x200/0x230
> [ 27.576185] ? filemap_range_needs_writeback+0x400/0x400
> [ 27.583632] ? ext4_mark_iloc_dirty+0x66c/0x6b0
> [ 27.589986] ? kmem_cache_alloc_trace+0xe7/0x230
> [ 27.596373] ? __iomap_dio_rw+0x1c2/0xec0
> [ 27.601967] __iomap_dio_rw+0x525/0xec0
> [ 27.609616] ? jbd2_journal_stop+0x481/0x5b0
> [ 27.615606] ? iomap_dio_complete+0x2a0/0x2a0
> [ 27.622054] ? generic_update_time+0xde/0x130
> [ 27.628225] ? __mnt_drop_write_file+0xd/0x60
> [ 27.634306] ? file_update_time+0x1cd/0x210
> [ 27.640161] ? kernel_text_address+0xa8/0xc0
> [ 27.646114] ? file_remove_privs+0x2b0/0x2b0
> [ 27.652278] iomap_dio_rw+0x5/0x30
> [ 27.657149] ext4_file_write_iter+0x8a8/0xde0
> [ 27.663323] ? ext4_file_read_iter+0x1e0/0x1e0
> [ 27.669800] ? ____kasan_kmalloc+0xd1/0xf0
> [ 27.675676] ? direct_splice_actor+0x7b/0x90
> [ 27.681778] ? splice_direct_to_actor+0x309/0x570
> [ 27.688357] ? do_splice_direct+0x172/0x230
> [ 27.694256] ? do_sendfile+0x567/0x960
> [ 27.699605] ? __x64_sys_sendfile64+0x104/0x150
> [ 27.706129] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 27.713462] do_iter_readv_writev+0x2ce/0x360
> [ 27.719894] ? generic_file_rw_checks+0xd0/0xd0
> [ 27.726446] ? memcpy+0x3c/0x60
> [ 27.731234] ? security_file_permission+0x47/0x270
> [ 27.738026] do_iter_write+0x109/0x370
> [ 27.743417] iter_file_splice_write+0x4b6/0x770
> [ 27.749824] ? splice_from_pipe+0x170/0x170
> [ 27.755630] ? generic_file_splice_read+0x2d0/0x380
> [ 27.765041] ? splice_shrink_spd+0x40/0x40
> [ 27.770668] ? is_mmconf_reserved+0x240/0x240
> [ 27.776487] ? kcalloc+0x1b/0x20
> [ 27.780737] ? splice_from_pipe+0x170/0x170
> [ 27.786278] direct_splice_actor+0x7b/0x90
> [ 27.791885] splice_direct_to_actor+0x309/0x570
> [ 27.797680] ? do_splice_direct+0x230/0x230
> [ 27.803067] ? pipe_to_sendpage+0x1b0/0x1b0
> [ 27.808395] ? security_file_permission+0x47/0x270
> [ 27.814761] do_splice_direct+0x172/0x230
> [ 27.819842] ? splice_direct_to_actor+0x570/0x570
> [ 27.825936] ? security_file_permission+0x47/0x270
> [ 27.832354] do_sendfile+0x567/0x960
> [ 27.837426] ? do_pwritev+0x3d0/0x3d0
> [ 27.842258] ? __se_sys_futex+0x1b1/0x2c0
> [ 27.847619] ? restore_fpregs_from_fpstate+0xc4/0x190
> [ 27.854277] __x64_sys_sendfile64+0x104/0x150
> [ 27.859919] ? __ia32_sys_sendfile+0x170/0x170
> [ 27.865622] ? switch_fpu_return+0x97/0x120
> [ 27.871204] do_syscall_64+0x43/0x90
> [ 27.875661] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 27.882673] RIP: 0033:0x44c849
> [ 27.887077] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [ 27.911882] RSP: 002b:00007f27c16d6178 EFLAGS: 00000203 ORIG_RAX: 0000000000000028
> [ 27.921306] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000044c849
> [ 27.930587] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000003
> [ 27.939877] RBP: 00007f27c16d61a0 R08: 0000000000000000 R09: 0000000000000000
> [ 27.949975] R10: 0000000080000005 R11: 0000000000000203 R12: 00007ffd3062431e
> [ 27.959298] R13: 00007ffd3062431f R14: 0000000000000000 R15: 00007f27c16d6640
> [ 27.968616] </TASK>
> [ 27.971668] Modules linked in:
> [ 27.975922] ---[ end trace 42b5bb79157828ec ]---
> [ 27.982710] RIP: 0010:mpage_prepare_extent_to_map+0x573/0x580
> [ 27.990558] Code: 08 14 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 3b 84 24 40 01 00 00 75 15 89 d8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b e8 04 39 15 01 0f 1f 40 00 55 48 89 e5 41 57 41 56 41
> [ 28.015009] RSP: 0018:ffff88810f7e6b60 EFLAGS: 00010246
> [ 28.021763] RAX: ffffea00043b4fc7 RBX: 0000000000000067 RCX: ffffffff9913ea61
> [ 28.031325] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043b4fc0
> [ 28.040622] RBP: ffff88810f7e6cf0 R08: dffffc0000000000 R09: fffff940008769f9
> [ 28.050140] R10: fffff940008769f9 R11: 0000000000000000 R12: 0000000000000000
> [ 28.059313] R13: ffff88810f7e6be0 R14: ffffea00043b4fc0 R15: ffff88810f7e6f58
> [ 28.068502] FS: 00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
> [ 28.079454] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 28.087279] CR2: 0000000020002002 CR3: 000000010cb70006 CR4: 0000000000370ee0
> [ 28.096763] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 28.105974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
> [2]
> // https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsyzkaller.appspot.com%2Fbug%3Fid%3D906354c4596539d9561ee6cb6d8c54cda38fc3c2&amp;data=04%7C01%7Cjhubbard%40nvidia.com%7C107e857de4b940fbe7e708d9f169d4a8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637806259852035011%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=WhaMEtC2yWJva1OO%2Fp0BIYRwADAlClWQPee7Trf9DpQ%3D&amp;reserved=0
> // autogenerated by syzkaller (https://github.com/google/syzkaller)
>
> #define _GNU_SOURCE
>
> #include <arpa/inet.h>
> #include <dirent.h>
> #include <endian.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <net/if.h>
> #include <net/if_arp.h>
> #include <netinet/in.h>
> #include <pthread.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdarg.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <sys/mount.h>
> #include <sys/prctl.h>
> #include <sys/resource.h>
> #include <sys/socket.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/time.h>
> #include <sys/types.h>
> #include <sys/uio.h>
> #include <sys/wait.h>
> #include <time.h>
> #include <unistd.h>
>
> #include <linux/capability.h>
> #include <linux/futex.h>
> #include <linux/genetlink.h>
> #include <linux/if_addr.h>
> #include <linux/if_ether.h>
> #include <linux/if_link.h>
> #include <linux/if_tun.h>
> #include <linux/in6.h>
> #include <linux/ip.h>
> #include <linux/neighbour.h>
> #include <linux/net.h>
> #include <linux/netlink.h>
> #include <linux/rtnetlink.h>
> #include <linux/tcp.h>
> #include <linux/veth.h>
>
> static void sleep_ms(uint64_t ms)
> {
> usleep(ms * 1000);
> }
>
> static uint64_t current_time_ms(void)
> {
> struct timespec ts;
> if (clock_gettime(CLOCK_MONOTONIC, &ts))
> exit(1);
> return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
> }
>
> static void use_temporary_dir(void)
> {
> char tmpdir_template[] = "./syzkaller.XXXXXX";
> char* tmpdir = mkdtemp(tmpdir_template);
> if (!tmpdir)
> exit(1);
> if (chmod(tmpdir, 0777))
> exit(1);
> if (chdir(tmpdir))
> exit(1);
> }
>
> static void thread_start(void* (*fn)(void*), void* arg)
> {
> pthread_t th;
> pthread_attr_t attr;
> pthread_attr_init(&attr);
> pthread_attr_setstacksize(&attr, 128 << 10);
> int i = 0;
> for (; i < 100; i++) {
> if (pthread_create(&th, &attr, fn, arg) == 0) {
> pthread_attr_destroy(&attr);
> return;
> }
> if (errno == EAGAIN) {
> usleep(50);
> continue;
> }
> break;
> }
> exit(1);
> }
>
> typedef struct {
> int state;
> } event_t;
>
> static void event_init(event_t* ev)
> {
> ev->state = 0;
> }
>
> static void event_reset(event_t* ev)
> {
> ev->state = 0;
> }
>
> static void event_set(event_t* ev)
> {
> if (ev->state)
> exit(1);
> __atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
> syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1000000);
> }
>
> static void event_wait(event_t* ev)
> {
> while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
> syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
> }
>
> static int event_isset(event_t* ev)
> {
> return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
> }
>
> static int event_timedwait(event_t* ev, uint64_t timeout)
> {
> uint64_t start = current_time_ms();
> uint64_t now = start;
> for (;;) {
> uint64_t remain = timeout - (now - start);
> struct timespec ts;
> ts.tv_sec = remain / 1000;
> ts.tv_nsec = (remain % 1000) * 1000 * 1000;
> syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
> if (__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
> return 1;
> now = current_time_ms();
> if (now - start > timeout)
> return 0;
> }
> }
>
> #define IFLA_IPVLAN_FLAGS 2
> #define IPVLAN_MODE_L3S 2
> #undef IPVLAN_F_VEPA
> #define IPVLAN_F_VEPA 2
>
> #define TUN_IFACE "syz_tun"
> #define LOCAL_MAC 0xaaaaaaaaaaaa
> #define REMOTE_MAC 0xaaaaaaaaaabb
> #define LOCAL_IPV4 "172.20.20.170"
> #define REMOTE_IPV4 "172.20.20.187"
> #define LOCAL_IPV6 "fe80::aa"
> #define REMOTE_IPV6 "fe80::bb"
>
> #define IFF_NAPI 0x0010
>
> #define DEVLINK_FAMILY_NAME "devlink"
>
> #define DEVLINK_CMD_PORT_GET 5
> #define DEVLINK_ATTR_BUS_NAME 1
> #define DEVLINK_ATTR_DEV_NAME 2
> #define DEVLINK_ATTR_NETDEV_NAME 7
>
> #define DEV_IPV4 "172.20.20.%d"
> #define DEV_IPV6 "fe80::%02x"
> #define DEV_MAC 0x00aaaaaaaaaa
>
> #define WG_GENL_NAME "wireguard"
> enum wg_cmd {
> WG_CMD_GET_DEVICE,
> WG_CMD_SET_DEVICE,
> };
> enum wgdevice_attribute {
> WGDEVICE_A_UNSPEC,
> WGDEVICE_A_IFINDEX,
> WGDEVICE_A_IFNAME,
> WGDEVICE_A_PRIVATE_KEY,
> WGDEVICE_A_PUBLIC_KEY,
> WGDEVICE_A_FLAGS,
> WGDEVICE_A_LISTEN_PORT,
> WGDEVICE_A_FWMARK,
> WGDEVICE_A_PEERS,
> };
> enum wgpeer_attribute {
> WGPEER_A_UNSPEC,
> WGPEER_A_PUBLIC_KEY,
> WGPEER_A_PRESHARED_KEY,
> WGPEER_A_FLAGS,
> WGPEER_A_ENDPOINT,
> WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL,
> WGPEER_A_LAST_HANDSHAKE_TIME,
> WGPEER_A_RX_BYTES,
> WGPEER_A_TX_BYTES,
> WGPEER_A_ALLOWEDIPS,
> WGPEER_A_PROTOCOL_VERSION,
> };
> enum wgallowedip_attribute {
> WGALLOWEDIP_A_UNSPEC,
> WGALLOWEDIP_A_FAMILY,
> WGALLOWEDIP_A_IPADDR,
> WGALLOWEDIP_A_CIDR_MASK,
> };
>
> #define MAX_FDS 30
>
> #define XT_TABLE_SIZE 1536
> #define XT_MAX_ENTRIES 10
>
> struct xt_counters {
> uint64_t pcnt, bcnt;
> };
>
> struct ipt_getinfo {
> char name[32];
> unsigned int valid_hooks;
> unsigned int hook_entry[5];
> unsigned int underflow[5];
> unsigned int num_entries;
> unsigned int size;
> };
>
> struct ipt_get_entries {
> char name[32];
> unsigned int size;
> uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
>
> struct ipt_replace {
> char name[32];
> unsigned int valid_hooks;
> unsigned int num_entries;
> unsigned int size;
> unsigned int hook_entry[5];
> unsigned int underflow[5];
> unsigned int num_counters;
> struct xt_counters* counters;
> uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
>
> struct ipt_table_desc {
> const char* name;
> struct ipt_getinfo info;
> struct ipt_replace replace;
> };
>
> #define IPT_BASE_CTL 64
> #define IPT_SO_SET_REPLACE (IPT_BASE_CTL)
> #define IPT_SO_GET_INFO (IPT_BASE_CTL)
> #define IPT_SO_GET_ENTRIES (IPT_BASE_CTL + 1)
>
> struct arpt_getinfo {
> char name[32];
> unsigned int valid_hooks;
> unsigned int hook_entry[3];
> unsigned int underflow[3];
> unsigned int num_entries;
> unsigned int size;
> };
>
> struct arpt_get_entries {
> char name[32];
> unsigned int size;
> uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
>
> struct arpt_replace {
> char name[32];
> unsigned int valid_hooks;
> unsigned int num_entries;
> unsigned int size;
> unsigned int hook_entry[3];
> unsigned int underflow[3];
> unsigned int num_counters;
> struct xt_counters* counters;
> uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
>
> #define ARPT_BASE_CTL 96
> #define ARPT_SO_SET_REPLACE (ARPT_BASE_CTL)
> #define ARPT_SO_GET_INFO (ARPT_BASE_CTL)
> #define ARPT_SO_GET_ENTRIES (ARPT_BASE_CTL + 1)
>
> static void loop();
>
> static int wait_for_loop(int pid)
> {
> if (pid < 0)
> exit(1);
> int status = 0;
> while (waitpid(-1, &status, __WALL) != pid) {
> }
> return WEXITSTATUS(status);
> }
>
> static int do_sandbox_none(void)
> {
> if (unshare(CLONE_NEWPID)) {
> }
> int pid = fork();
> if (pid != 0)
> return wait_for_loop(pid);
>
> if (unshare(CLONE_NEWNET)) {
> }
> loop();
> exit(1);
> }
>
> #define FS_IOC_SETFLAGS _IOW('f', 2, long)
> static void remove_dir(const char* dir)
> {
> int iter = 0;
> DIR* dp = 0;
> retry:
> while (umount2(dir, MNT_DETACH | UMOUNT_NOFOLLOW) == 0) {
> }
> dp = opendir(dir);
> if (dp == NULL) {
> if (errno == EMFILE) {
> exit(1);
> }
> exit(1);
> }
> struct dirent* ep = 0;
> while ((ep = readdir(dp))) {
> if (strcmp(ep->d_name, ".") == 0 || strcmp(ep->d_name, "..") == 0)
> continue;
> char filename[FILENAME_MAX];
> snprintf(filename, sizeof(filename), "%s/%s", dir, ep->d_name);
> while (umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW) == 0) {
> }
> struct stat st;
> if (lstat(filename, &st))
> exit(1);
> if (S_ISDIR(st.st_mode)) {
> remove_dir(filename);
> continue;
> }
> int i;
> for (i = 0;; i++) {
> if (unlink(filename) == 0)
> break;
> if (errno == EPERM) {
> int fd = open(filename, O_RDONLY);
> if (fd != -1) {
> long flags = 0;
> if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0) {
> }
> close(fd);
> continue;
> }
> }
> if (errno == EROFS) {
> break;
> }
> if (errno != EBUSY || i > 100)
> exit(1);
> if (umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW))
> exit(1);
> }
> }
> closedir(dp);
> for (int i = 0;; i++) {
> if (rmdir(dir) == 0)
> break;
> if (i < 100) {
> if (errno == EPERM) {
> int fd = open(dir, O_RDONLY);
> if (fd != -1) {
> long flags = 0;
> if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0) {
> }
> close(fd);
> continue;
> }
> }
> if (errno == EROFS) {
> break;
> }
> if (errno == EBUSY) {
> if (umount2(dir, MNT_DETACH | UMOUNT_NOFOLLOW))
> exit(1);
> continue;
> }
> if (errno == ENOTEMPTY) {
> if (iter < 100) {
> iter++;
> goto retry;
> }
> }
> }
> exit(1);
> }
> }
>
> static void kill_and_wait(int pid, int* status)
> {
> kill(-pid, SIGKILL);
> kill(pid, SIGKILL);
> for (int i = 0; i < 100; i++) {
> if (waitpid(-1, status, WNOHANG | __WALL) == pid)
> return;
> usleep(1000);
> }
> DIR* dir = opendir("/sys/fs/fuse/connections");
> if (dir) {
> for (;;) {
> struct dirent* ent = readdir(dir);
> if (!ent)
> break;
> if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
> continue;
> char abort[300];
> snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
> ent->d_name);
> int fd = open(abort, O_WRONLY);
> if (fd == -1) {
> continue;
> }
> if (write(fd, abort, 1) < 0) {
> }
> close(fd);
> }
> closedir(dir);
> } else {
> }
> while (waitpid(-1, status, __WALL) != pid) {
> }
> }
>
> static void close_fds()
> {
> for (int fd = 3; fd < MAX_FDS; fd++)
> close(fd);
> }
>
> struct thread_t {
> int created, call;
> event_t ready, done;
> };
>
> static struct thread_t threads[16];
> static void execute_call(int call);
> static int running;
>
> static void* thr(void* arg)
> {
> struct thread_t* th = (struct thread_t*)arg;
> for (;;) {
> event_wait(&th->ready);
> event_reset(&th->ready);
> execute_call(th->call);
> __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
> event_set(&th->done);
> }
> return 0;
> }
>
> static void execute_one(void)
> {
> int i, call, thread;
> for (call = 0; call < 10; call++) {
> for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0]));
> thread++) {
> struct thread_t* th = &threads[thread];
> if (!th->created) {
> th->created = 1;
> event_init(&th->ready);
> event_init(&th->done);
> event_set(&th->done);
> thread_start(thr, th);
> }
> if (!event_isset(&th->done))
> continue;
> event_reset(&th->done);
> th->call = call;
> __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
> event_set(&th->ready);
> event_timedwait(&th->done, 50);
> break;
> }
> }
> for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
> sleep_ms(1);
> close_fds();
> }
>
> static void execute_one(void);
>
> #define WAIT_FLAGS __WALL
>
> static void loop(void)
> {
> int iter = 0;
> for (;; iter++) {
> char cwdbuf[32];
> sprintf(cwdbuf, "./%d", iter);
> if (mkdir(cwdbuf, 0777))
> exit(1);
> int pid = fork();
> if (pid < 0)
> exit(1);
> if (pid == 0) {
> if (chdir(cwdbuf))
> exit(1);
> execute_one();
> exit(0);
> }
> int status = 0;
> uint64_t start = current_time_ms();
> for (;;) {
> if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid) {
> break;
> }
> sleep_ms(1);
> if (current_time_ms() - start < 5000)
> continue;
> kill_and_wait(pid, &status);
> break;
> }
> remove_dir(cwdbuf);
> }
> }
>
> uint64_t r[5] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff,
> 0xffffffffffffffff, 0x0};
>
> void execute_call(int call)
> {
> intptr_t res = 0;
> switch (call) {
> case 0:
> memcpy((void*)0x20000080, "./bus\000", 6);
> res = syscall(__NR_open, 0x20000080ul, 0x14d842ul, 0ul);
> if (res != -1)
> r[0] = res;
> break;
> case 1:
> memcpy((void*)0x20000000, "/proc/self/exe\000", 15);
> res = syscall(__NR_openat, 0xffffff9c, 0x20000000ul, 0ul, 0ul);
> if (res != -1)
> r[1] = res;
> break;
> case 2:
> memcpy((void*)0x20002000, "./bus\000", 6);
> res = syscall(__NR_open, 0x20002000ul, 0x143042ul, 0ul);
> if (res != -1)
> r[2] = res;
> break;
> case 3:
> syscall(__NR_ftruncate, r[2], 0x2008002ul);
> break;
> case 4:
> memcpy((void*)0x20000400, "./bus\000", 6);
> res = syscall(__NR_open, 0x20000400ul, 0x14103eul, 0ul);
> if (res != -1)
> r[3] = res;
> break;
> case 5:
> syscall(__NR_mmap, 0x20000000ul, 0x600000ul, 0x7ffffeul, 0x11ul, r[3], 0ul);
> break;
> case 6:
> syscall(__NR_sendfile, r[0], r[1], 0ul, 0x80000005ul);
> break;
> case 7:
> res = syscall(__NR_gettid);
> if (res != -1)
> r[4] = res;
> break;
> case 8:
> *(uint64_t*)0x20c22000 = 0x2034afa4;
> *(uint64_t*)0x20c22008 = 0x1f80;
> *(uint64_t*)0x20c22fa0 = 0x20000080;
> *(uint64_t*)0x20c22fa8 = 0x2034afa5;
> syscall(__NR_process_vm_writev, r[4], 0x20c22000ul, 1ul, 0x20c22fa0ul, 1ul,
> 0ul);
> break;
> case 9:
> syscall(__NR_sendfile, r[0], r[1], 0ul, 0x80000005ul);
> break;
> }
> }
> int main(void)
> {
> syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
> syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
> syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
>
> use_temporary_dir();
> do_sandbox_none();
>
> return 0;
> }
>


2022-02-18 04:14:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
> Yes. And looking at the pair of backtraces below, this looks very much
> like another aspect of the "get_user_pages problem" [1], originally
> described in Jan Kara's 2018 email [2].

Hmm... I just posted my analysis, which tracks with yours; but I had
forgotten about Jan's 2018 e-mail on the matter.

> I'm getting close to posting an RFC for the direct IO conversion to
> FOLL_PIN, but even after that, various parts of the kernel (reclaim,
> filesystems/block layer) still need to be changed so as to use
> page_maybe_dma_pinned() to help avoid this problem. There's a bit
> more than that, actually.

The challenge is that fixing this "the right away" is probably not
something we can backport into an LTS kernel, whether it's 5.15 or
5.10... or 4.19.

The only thing which can probably survive getting backported is
something like this. It won't make the right thing happen if someone
is trying to RDMA or call process_vm_writev() into a file-backed
memory region --- but I'm not sure I care. Certainly if the goal is
to make Android kernels, I'm pretty sure they are't either using RDMA,
and I suspect they are probably masking out the process_vm_writev(2)
system call (at least, for Android O and newer). So if the goal is to
just to close some Syzbot bugs, what do folks think about this?

- Ted

commit 7711b1fda6f7f04274fa1cba6f092410262b0296
Author: Theodore Ts'o <[email protected]>
Date: Thu Feb 17 22:54:03 2022 -0500

ext4: work around bugs in mm/gup.c that can cause ext4 to BUG()

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance (or faulting in the file data if the page
is not yet in the page cache). This was noted by Jan Kara in 2018[1]
and more recently has resulted in bug reports by Syzbot in various
Android kernels[2].

Fixing this for real is non-trivial, and will never be backportable
into stable kernels. So this is a simple workaround that stops the
kernel from BUG()'ing. The changed pages will not be properly written
back, but given that the current gup code is missing the "read" in
"read-modify-write", the dirty page in the page cache might contain
corrupted data anyway.

[1] https://www.spinics.net/lists/linux-mm/msg142700.html
[2] https://lore.kernel.org/r/[email protected]

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..3b2f336a90d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;

+ /* Should never happen but for buggy gup code */
+ if (!page_has_buffers(page)) {
+ ext4_warning_inode(inode,
+ "page %lu does not have buffers attached", page->index);
+ ClearPageDirty(page);
+ unlock_page(page);
+ return 0;
+ }
+
page_bufs = page_buffers(page);
/*
* We cannot do block allocation or other extent handling in this
@@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));

+ /* Should never happen but for buggy gup code */
+ if (!page_has_buffers(page)) {
+ ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+ ClearPageDirty(page);
+ unlock_page(page);
+ continue;
+ }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;

2022-02-18 06:36:16

by John Hubbard

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On 2/17/22 20:08, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
>> Yes. And looking at the pair of backtraces below, this looks very much
>> like another aspect of the "get_user_pages problem" [1], originally
>> described in Jan Kara's 2018 email [2].
>
> Hmm... I just posted my analysis, which tracks with yours; but I had
> forgotten about Jan's 2018 e-mail on the matter.
>
>> I'm getting close to posting an RFC for the direct IO conversion to
>> FOLL_PIN, but even after that, various parts of the kernel (reclaim,
>> filesystems/block layer) still need to be changed so as to use
>> page_maybe_dma_pinned() to help avoid this problem. There's a bit
>> more than that, actually.
>
> The challenge is that fixing this "the right away" is probably not
> something we can backport into an LTS kernel, whether it's 5.15 or
> 5.10... or 4.19.
>
> The only thing which can probably survive getting backported is
> something like this. It won't make the right thing happen if someone
> is trying to RDMA or call process_vm_writev() into a file-backed
> memory region --- but I'm not sure I care. Certainly if the goal is
> to make Android kernels, I'm pretty sure they are't either using RDMA,
> and I suspect they are probably masking out the process_vm_writev(2)
> system call (at least, for Android O and newer). So if the goal is to
> just to close some Syzbot bugs, what do folks think about this?
>
> - Ted

Hi Ted!

This seems reasonable...-ish. Although this could turn into a pretty
grand game of whack-a-mole, one filesystem at a time. :)


>
> commit 7711b1fda6f7f04274fa1cba6f092410262b0296
> Author: Theodore Ts'o <[email protected]>
> Date: Thu Feb 17 22:54:03 2022 -0500
>
> ext4: work around bugs in mm/gup.c that can cause ext4 to BUG()
>
> [un]pin_user_pages_remote is dirtying pages without properly warning
> the file system in advance (or faulting in the file data if the page

Just a small thing I'll say once, to get it out of my system. No action
required here, I just want it understood:

Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
pin_user_pages_remote()"), you would have written that like this:

"process_vm_writev() is dirtying pages without properly warning the file
system in advance..."

Because, there were many callers that were doing this:

get_user_pages*()
...use the pages...

for_each_page() {
set_page_dirty*()
put_page()
}

anyway, moving on smartly...

> is not yet in the page cache). This was noted by Jan Kara in 2018[1]
> and more recently has resulted in bug reports by Syzbot in various
> Android kernels[2].
>
> Fixing this for real is non-trivial, and will never be backportable
> into stable kernels. So this is a simple workaround that stops the
> kernel from BUG()'ing. The changed pages will not be properly written
> back, but given that the current gup code is missing the "read" in
> "read-modify-write", the dirty page in the page cache might contain
> corrupted data anyway.
>
> [1] https://www.spinics.net/lists/linux-mm/msg142700.html

(Sorry my earlier response mangled this link. I've manually fixed it
here, and am working with our IT to get the root cause addressed.)

> [2] https://lore.kernel.org/r/[email protected]
>
> Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
> Reported-by: Lee Jones <[email protected]>
> Signed-off-by: Theodore Ts'o <[email protected]>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 01c9e4f743ba..3b2f336a90d1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
> else
> len = PAGE_SIZE;
>
> + /* Should never happen but for buggy gup code */
> + if (!page_has_buffers(page)) {
> + ext4_warning_inode(inode,
> + "page %lu does not have buffers attached", page->index);

I see that ext4_warning_inode() has rate limiting, but it doesn't look
like it's really intended for a per-page rate. It looks like it is
per-superblock (yes?), so I suspect one instance of this problem, with
many pages involved, could hit the limit.

Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
either, but it might be better here. OTOH, I have minimal experience
with ext4_warning_inode() and maybe it really is just fine with per-page
failure rates.

thanks,
--
John Hubbard
NVIDIA

> + ClearPageDirty(page);
> + unlock_page(page);
> + return 0;
> + }
> +
> page_bufs = page_buffers(page);
> /*
> * We cannot do block allocation or other extent handling in this
> @@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> wait_on_page_writeback(page);
> BUG_ON(PageWriteback(page));
>
> + /* Should never happen but for buggy gup code */
> + if (!page_has_buffers(page)) {
> + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
> + ClearPageDirty(page);
> + unlock_page(page);
> + continue;
> + }
> +
> if (mpd->map.m_len == 0)
> mpd->first_page = page->index;
> mpd->next_page = page->index + 1;

2022-02-18 08:25:10

by Greg KH

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On Thu, Feb 17, 2022 at 11:08:38PM -0500, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
> > Yes. And looking at the pair of backtraces below, this looks very much
> > like another aspect of the "get_user_pages problem" [1], originally
> > described in Jan Kara's 2018 email [2].
>
> Hmm... I just posted my analysis, which tracks with yours; but I had
> forgotten about Jan's 2018 e-mail on the matter.
>
> > I'm getting close to posting an RFC for the direct IO conversion to
> > FOLL_PIN, but even after that, various parts of the kernel (reclaim,
> > filesystems/block layer) still need to be changed so as to use
> > page_maybe_dma_pinned() to help avoid this problem. There's a bit
> > more than that, actually.
>
> The challenge is that fixing this "the right away" is probably not
> something we can backport into an LTS kernel, whether it's 5.15 or
> 5.10... or 4.19.

Don't worry about stable backports to start with. Do it the "right way"
first and then we can consider if it needs to be backported or not.

thanks,

greg k-h

2022-02-24 00:10:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote:
> > The challenge is that fixing this "the right away" is probably not
> > something we can backport into an LTS kernel, whether it's 5.15 or
> > 5.10... or 4.19.
>
> Don't worry about stable backports to start with. Do it the "right way"
> first and then we can consider if it needs to be backported or not.

Fair enough; on the other hand, we could also view this as making ext4
more robust against buggy code in other subsystems, and while other
file systems may be losing user data if they are actually trying to do
remote memory access to file-backed memory, apparently other file
systems aren't noticing and so they're not crashing. Issuing a
warning and then not crashing is arguably a better way for ext4 to
react, especially if there are other parts of the kernel that are
randomly calling set_page_dirty() on file-backed memory without
properly first informing the file system in a context where it can
block and potentially do I/O to do things like allocate blocks.

- Ted

2022-02-24 02:13:14

by Dave Chinner

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On Wed, Feb 23, 2022 at 06:35:54PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote:
> > > The challenge is that fixing this "the right away" is probably not
> > > something we can backport into an LTS kernel, whether it's 5.15 or
> > > 5.10... or 4.19.
> >
> > Don't worry about stable backports to start with. Do it the "right way"
> > first and then we can consider if it needs to be backported or not.
>
> Fair enough; on the other hand, we could also view this as making ext4
> more robust against buggy code in other subsystems, and while other
> file systems may be losing user data if they are actually trying to do
> remote memory access to file-backed memory, apparently other file
> systems aren't noticing and so they're not crashing.

Oh, we've noticed them, no question about that. We've got bug
reports going back years for systems being crashed, triggering BUGs
and/or corrupting data on both XFS and ext4 filesystems due to users
trying to run RDMA applications with file backed pages.

Most of the people doing this now know that we won't support such
applications until the RDMA stack/hardware can trigger on-demand
write page faults the same way CPUs do when they first write to a
clean page. They don't have this, so mostly these people don't
bother reporting these class of problems to us anymore. The
gup/RDMA infrastructure to make this all work is slowly moving
forwards, but it's not here yet.

> Issuing a
> warning and then not crashing is arguably a better way for ext4 to
> react, especially if there are other parts of the kernel that are
> randomly calling set_page_dirty() on file-backed memory without
> properly first informing the file system in a context where it can
> block and potentially do I/O to do things like allocate blocks.

I'm not sure that replacing the BUG() with a warning is good enough
- it's still indicative of an application doing something dangerous
that could result in silent data corruption and/or other problems.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-02-24 04:42:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote:
> > Fair enough; on the other hand, we could also view this as making ext4
> > more robust against buggy code in other subsystems, and while other
> > file systems may be losing user data if they are actually trying to do
> > remote memory access to file-backed memory, apparently other file
> > systems aren't noticing and so they're not crashing.
>
> Oh, we've noticed them, no question about that. We've got bug
> reports going back years for systems being crashed, triggering BUGs
> and/or corrupting data on both XFS and ext4 filesystems due to users
> trying to run RDMA applications with file backed pages.

Is this issue causing XFS to crash? I didn't know that.

I tried the Syzbot reproducer with XFS mounted, and it didn't trigger
any crashes. I'm sure data was getting corrupted, but I figured I
should bring ext4 to the XFS level of "at least we're not reliably
killing the kernel".

On ext4, an unprivileged process can use process_vm_writev(2) to crash
the system. I don't know how quickly we can get a fix into mm/gup.c,
but if some other kernel path tries calling set_page_dirty() on a
file-backed page without first asking permission from the file system,
it seems to be nice if the file system doesn't BUG() --- as near as I
can tell, xfs isn't crashing in this case, but ext4 is.

- Ted