2022-02-17 14:51:25

by Byungchul Park

Subject: Report 2 in ext4 and journal based on v5.17-rc1

[ 9.008161] ===================================================
[ 9.008163] DEPT: Circular dependency has been detected.
[ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
[ 9.008166] ---------------------------------------------------
[ 9.008167] summary
[ 9.008167] ---------------------------------------------------
[ 9.008168] *** DEADLOCK ***
[ 9.008168]
[ 9.008168] context A
[ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
[ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008173]
[ 9.008173] context B
[ 9.008174] [S] down_write(mapping.invalidate_lock:0)
[ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008176] [E] up_write(mapping.invalidate_lock:0)
[ 9.008177]
[ 9.008178] context C
[ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
[ 9.008180] [W] down_write(mapping.invalidate_lock:0)
[ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
[ 9.008181]
[ 9.008182] [S]: start of the event context
[ 9.008183] [W]: the wait blocked
[ 9.008183] [E]: the event not reachable
[ 9.008184] ---------------------------------------------------
[ 9.008184] context A's detail
[ 9.008185] ---------------------------------------------------
[ 9.008186] context A
[ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
[ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008189]
[ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008191] (N/A)
[ 9.008191]
[ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
[ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
[ 9.008197] stacktrace:
[ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
[ 9.008203] kthread (kernel/kthread.c:377)
[ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008209]
[ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
[ 9.008212] stacktrace:
[ 9.008213] dept_event (kernel/dependency/dept.c:2337)
[ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
[ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
[ 9.008223] kthread (kernel/kthread.c:377)
[ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008226] ---------------------------------------------------
[ 9.008226] context B's detail
[ 9.008227] ---------------------------------------------------
[ 9.008228] context B
[ 9.008228] [S] down_write(mapping.invalidate_lock:0)
[ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008230] [E] up_write(mapping.invalidate_lock:0)
[ 9.008231]
[ 9.008232] [S] down_write(mapping.invalidate_lock:0):
[ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008237] stacktrace:
[ 9.008237] down_write (kernel/locking/rwsem.c:1514)
[ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008241] generic_perform_write (mm/filemap.c:3784)
[ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008250] vfs_write (fs/read_write.c:590)
[ 9.008251] ksys_write (fs/read_write.c:644)
[ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008258]
[ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
[ 9.008261] stacktrace:
[ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
[ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
[ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
[ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
[ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
[ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
[ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
[ 9.008276] generic_perform_write (mm/filemap.c:3784)
[ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008283] vfs_write (fs/read_write.c:590)
[ 9.008284] ksys_write (fs/read_write.c:644)
[ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008287]
[ 9.008288] [E] up_write(mapping.invalidate_lock:0):
[ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
[ 9.008291] ---------------------------------------------------
[ 9.008291] context C's detail
[ 9.008292] ---------------------------------------------------
[ 9.008292] context C
[ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
[ 9.008294] [W] down_write(mapping.invalidate_lock:0)
[ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
[ 9.008296]
[ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
[ 9.008298] (N/A)
[ 9.008298]
[ 9.008299] [W] down_write(mapping.invalidate_lock:0):
[ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008302] stacktrace:
[ 9.008302] down_write (kernel/locking/rwsem.c:1514)
[ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008305] generic_perform_write (mm/filemap.c:3784)
[ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008312] vfs_write (fs/read_write.c:590)
[ 9.008314] ksys_write (fs/read_write.c:644)
[ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008318]
[ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
[ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
[ 9.008321] stacktrace:
[ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
[ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
[ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
[ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
[ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
[ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
[ 9.008334] generic_perform_write (mm/filemap.c:3784)
[ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008341] vfs_write (fs/read_write.c:590)
[ 9.008342] ksys_write (fs/read_write.c:644)
[ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008347] ---------------------------------------------------
[ 9.008348] information that might be helpful
[ 9.008348] ---------------------------------------------------
[ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
[ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 9.008353] Call Trace:
[ 9.008354] <TASK>
[ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
[ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
[ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
[ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
[ 9.008364] bfs (kernel/dependency/dept.c:860)
[ 9.008366] add_dep (kernel/dependency/dept.c:1423)
[ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
[ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
[ 9.008372] dept_event (kernel/dependency/dept.c:2337)
[ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
[ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
[ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
[ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
[ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
[ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
[ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
[ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
[ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
[ 9.008402] kthread (kernel/kthread.c:377)
[ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
[ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008410] </TASK>


2022-02-21 22:08:35

by Jan Kara

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1


So I was trying to understand what this report is about for some time but
honestly I have failed...

On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> [ 9.008161] ===================================================
> [ 9.008163] DEPT: Circular dependency has been detected.
> [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> [ 9.008166] ---------------------------------------------------
> [ 9.008167] summary
> [ 9.008167] ---------------------------------------------------
> [ 9.008168] *** DEADLOCK ***
> [ 9.008168]
> [ 9.008168] context A
> [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008173]
> [ 9.008173] context B
> [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> [ 9.008177]
> [ 9.008178] context C
> [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008181]
> [ 9.008182] [S]: start of the event context
> [ 9.008183] [W]: the wait blocked
> [ 9.008183] [E]: the event not reachable

So what situation is your tool complaining about here? Can you perhaps show
it here in a more common visualization like:

TASK1                   TASK2
                        does foo, grabs Z
does X, grabs lock Y
blocks on Z
                        blocks on Y

or something like that? Because I was not able to decipher this from the
report even after trying for some time...

Honza



> [ 9.008184] ---------------------------------------------------
> [ 9.008184] context A's detail
> [ 9.008185] ---------------------------------------------------
> [ 9.008186] context A
> [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008189]
> [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
> [ 9.008191] (N/A)
> [ 9.008191]
> [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
> [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
> [ 9.008197] stacktrace:
> [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
> [ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
> [ 9.008203] kthread (kernel/kthread.c:377)
> [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008209]
> [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
> [ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
> [ 9.008212] stacktrace:
> [ 9.008213] dept_event (kernel/dependency/dept.c:2337)
> [ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
> [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> [ 9.008223] kthread (kernel/kthread.c:377)
> [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008226] ---------------------------------------------------
> [ 9.008226] context B's detail
> [ 9.008227] ---------------------------------------------------
> [ 9.008228] context B
> [ 9.008228] [S] down_write(mapping.invalidate_lock:0)
> [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008230] [E] up_write(mapping.invalidate_lock:0)
> [ 9.008231]
> [ 9.008232] [S] down_write(mapping.invalidate_lock:0):
> [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008237] stacktrace:
> [ 9.008237] down_write (kernel/locking/rwsem.c:1514)
> [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008241] generic_perform_write (mm/filemap.c:3784)
> [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008250] vfs_write (fs/read_write.c:590)
> [ 9.008251] ksys_write (fs/read_write.c:644)
> [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> [ 9.008258]
> [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
> [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
> [ 9.008261] stacktrace:
> [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
> [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
> [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
> [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
> [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
> [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
> [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
> [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
> [ 9.008276] generic_perform_write (mm/filemap.c:3784)
> [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008283] vfs_write (fs/read_write.c:590)
> [ 9.008284] ksys_write (fs/read_write.c:644)
> [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008287]
> [ 9.008288] [E] up_write(mapping.invalidate_lock:0):
> [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
> [ 9.008291] ---------------------------------------------------
> [ 9.008291] context C's detail
> [ 9.008292] ---------------------------------------------------
> [ 9.008292] context C
> [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008294] [W] down_write(mapping.invalidate_lock:0)
> [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008296]
> [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
> [ 9.008298] (N/A)
> [ 9.008298]
> [ 9.008299] [W] down_write(mapping.invalidate_lock:0):
> [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008302] stacktrace:
> [ 9.008302] down_write (kernel/locking/rwsem.c:1514)
> [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008305] generic_perform_write (mm/filemap.c:3784)
> [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008312] vfs_write (fs/read_write.c:590)
> [ 9.008314] ksys_write (fs/read_write.c:644)
> [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> [ 9.008318]
> [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
> [ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
> [ 9.008321] stacktrace:
> [ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
> [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
> [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
> [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
> [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
> [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
> [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
> [ 9.008334] generic_perform_write (mm/filemap.c:3784)
> [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008341] vfs_write (fs/read_write.c:590)
> [ 9.008342] ksys_write (fs/read_write.c:644)
> [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> [ 9.008347] ---------------------------------------------------
> [ 9.008348] information that might be helpful
> [ 9.008348] ---------------------------------------------------
> [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
> [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [ 9.008353] Call Trace:
> [ 9.008354] <TASK>
> [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
> [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
> [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
> [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
> [ 9.008364] bfs (kernel/dependency/dept.c:860)
> [ 9.008366] add_dep (kernel/dependency/dept.c:1423)
> [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
> [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
> [ 9.008372] dept_event (kernel/dependency/dept.c:2337)
> [ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
> [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
> [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
> [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
> [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
> [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
> [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
> [ 9.008402] kthread (kernel/kthread.c:377)
> [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
> [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008410] </TASK>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-23 07:09:17

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote:
> On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> > [ 9.008161] ===================================================
> > [ 9.008163] DEPT: Circular dependency has been detected.
> > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> > [ 9.008166] ---------------------------------------------------
> > [ 9.008167] summary
> > [ 9.008167] ---------------------------------------------------
> > [ 9.008168] *** DEADLOCK ***
> > [ 9.008168]
> > [ 9.008168] context A
> > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008173]
> > [ 9.008173] context B
> > [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> > [ 9.008177]
> > [ 9.008178] context C
> > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008181]
> > [ 9.008182] [S]: start of the event context
> > [ 9.008183] [W]: the wait blocked
> > [ 9.008183] [E]: the event not reachable
>
> So what situation is your tool complaining about here? Can you perhaps show
> it here in a more common visualization like:

Sure.

> TASK1                   TASK2
>                         does foo, grabs Z
> does X, grabs lock Y
> blocks on Z
>                         blocks on Y
>
> or something like that? Because I was not able to decipher this from the
> report even after trying for some time...

KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)

wait A
--- stuck
                        wait B
                        --- stuck
                                                wait C
                                                --- stuck

wake up B               wake up C               wake up A

where:
   A is a wait_queue, j_wait_commit
   B is a wait_queue, j_wait_transaction_locked
   C is a rwsem, mapping.invalidate_lock

The above is the simplest form. And it's worth noting that Dept focuses
on waits and events themselves rather than grabbing and releasing things
like locks. The following is the more descriptive form of it.

KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)

wait @j_wait_commit
                        ext4_truncate_failed_write()
                        down_write(mapping.invalidate_lock)

                        ext4_truncate()
                        ...
                        wait @j_wait_transaction_locked

                                                ext4_truncate_failed_write()
                                                down_write(mapping.invalidate_lock)

                                                ext4_should_retry_alloc()
                                                ...
                                                __jbd2_log_start_commit()
                                                wake_up(j_wait_commit)
jbd2_journal_commit_transaction()
wake_up(j_wait_transaction_locked)
                        up_write(mapping.invalidate_lock)

I hope this would help you understand the report.
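
If it helps further, here is a self-contained userspace analogue of the
same dependency shape. It's an illustration I wrote with pthreads - not
ext4/jbd2 code - and the names are only borrowed from the report:

/*
 * Illustration only: three threads reproducing the A/B/C cycle above,
 * with two condition variables standing in for the two wait_queues and
 * one rwlock standing in for mapping.invalidate_lock. If task1_like()
 * takes the rwlock first, all three threads block on one another.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wq_commit = PTHREAD_COND_INITIALIZER;       /* A */
static pthread_cond_t wq_txn_locked = PTHREAD_COND_INITIALIZER;   /* B */
static pthread_rwlock_t inval_lock = PTHREAD_RWLOCK_INITIALIZER;  /* C */
static bool commit_requested, txn_unlocked;

static void *kjournald2_like(void *arg)         /* context A */
{
        pthread_mutex_lock(&m);
        while (!commit_requested)               /* wait A */
                pthread_cond_wait(&wq_commit, &m);
        txn_unlocked = true;
        pthread_cond_broadcast(&wq_txn_locked); /* wake up B */
        pthread_mutex_unlock(&m);
        return NULL;
}

static void *task1_like(void *arg)              /* context B */
{
        pthread_rwlock_wrlock(&inval_lock);     /* grab C */
        pthread_mutex_lock(&m);
        while (!txn_unlocked)                   /* wait B */
                pthread_cond_wait(&wq_txn_locked, &m);
        pthread_mutex_unlock(&m);
        pthread_rwlock_unlock(&inval_lock);     /* release C */
        return NULL;
}

static void *task2_like(void *arg)              /* context C */
{
        pthread_rwlock_wrlock(&inval_lock);     /* wait C */
        pthread_mutex_lock(&m);
        commit_requested = true;
        pthread_cond_broadcast(&wq_commit);     /* wake up A */
        pthread_mutex_unlock(&m);
        pthread_rwlock_unlock(&inval_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b, c;

        pthread_create(&b, NULL, task1_like, NULL); /* B takes C first */
        pthread_create(&a, NULL, kjournald2_like, NULL);
        pthread_create(&c, NULL, task2_like, NULL);
        pthread_join(a, NULL); /* never returns: A -> C -> B -> A */
        return 0;
}

In the kernel, of course, there are other wakeup sources that this
analogue deliberately leaves out; that's exactly what the next
paragraph is about.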

Yeah... This is what Dept complained about. And as Ted said, the
kthread would be woken up by another wakeup source, so it's not a true
deadlock. However, these three threads, and any other tasks waiting for
any of the events A, B, C, would be stuck for a while until that wakeup
eventually wakes up the kthread, kjournald2.

I guess that's not what ext4 is meant to do. Honestly, though, ext4 and
the journal system look so complicated that I'm not convinced...

Thanks,
Byungchul

>
> Honza
>
>
>
> > [ 9.008184] ---------------------------------------------------
> > [ 9.008184] context A's detail
> > [ 9.008185] ---------------------------------------------------
> > [ 9.008186] context A
> > [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008189]
> > [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
> > [ 9.008191] (N/A)
> > [ 9.008191]
> > [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
> > [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
> > [ 9.008197] stacktrace:
> > [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> > [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
> > [ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
> > [ 9.008203] kthread (kernel/kthread.c:377)
> > [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008209]
> > [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
> > [ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
> > [ 9.008212] stacktrace:
> > [ 9.008213] dept_event (kernel/dependency/dept.c:2337)
> > [ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
> > [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> > [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> > [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> > [ 9.008223] kthread (kernel/kthread.c:377)
> > [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008226] ---------------------------------------------------
> > [ 9.008226] context B's detail
> > [ 9.008227] ---------------------------------------------------
> > [ 9.008228] context B
> > [ 9.008228] [S] down_write(mapping.invalidate_lock:0)
> > [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008230] [E] up_write(mapping.invalidate_lock:0)
> > [ 9.008231]
> > [ 9.008232] [S] down_write(mapping.invalidate_lock:0):
> > [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008237] stacktrace:
> > [ 9.008237] down_write (kernel/locking/rwsem.c:1514)
> > [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008241] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008250] vfs_write (fs/read_write.c:590)
> > [ 9.008251] ksys_write (fs/read_write.c:644)
> > [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> > [ 9.008258]
> > [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
> > [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
> > [ 9.008261] stacktrace:
> > [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> > [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
> > [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
> > [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
> > [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
> > [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
> > [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
> > [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
> > [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
> > [ 9.008276] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008283] vfs_write (fs/read_write.c:590)
> > [ 9.008284] ksys_write (fs/read_write.c:644)
> > [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008287]
> > [ 9.008288] [E] up_write(mapping.invalidate_lock:0):
> > [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
> > [ 9.008291] ---------------------------------------------------
> > [ 9.008291] context C's detail
> > [ 9.008292] ---------------------------------------------------
> > [ 9.008292] context C
> > [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008294] [W] down_write(mapping.invalidate_lock:0)
> > [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008296]
> > [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
> > [ 9.008298] (N/A)
> > [ 9.008298]
> > [ 9.008299] [W] down_write(mapping.invalidate_lock:0):
> > [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008302] stacktrace:
> > [ 9.008302] down_write (kernel/locking/rwsem.c:1514)
> > [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008305] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008312] vfs_write (fs/read_write.c:590)
> > [ 9.008314] ksys_write (fs/read_write.c:644)
> > [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> > [ 9.008318]
> > [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
> > [ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
> > [ 9.008321] stacktrace:
> > [ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
> > [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> > [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
> > [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
> > [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
> > [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
> > [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
> > [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
> > [ 9.008334] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008341] vfs_write (fs/read_write.c:590)
> > [ 9.008342] ksys_write (fs/read_write.c:644)
> > [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> > [ 9.008347] ---------------------------------------------------
> > [ 9.008348] information that might be helpful
> > [ 9.008348] ---------------------------------------------------
> > [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
> > [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > [ 9.008353] Call Trace:
> > [ 9.008354] <TASK>
> > [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
> > [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
> > [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
> > [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
> > [ 9.008364] bfs (kernel/dependency/dept.c:860)
> > [ 9.008366] add_dep (kernel/dependency/dept.c:1423)
> > [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
> > [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
> > [ 9.008372] dept_event (kernel/dependency/dept.c:2337)
> > [ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
> > [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> > [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> > [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
> > [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
> > [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> > [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
> > [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
> > [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> > [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
> > [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
> > [ 9.008402] kthread (kernel/kthread.c:377)
> > [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
> > [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008410] </TASK>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-02-24 02:09:17

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> On Wed 23-02-22 09:35:34, Byungchul Park wrote:
> > On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote:
> > > On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> > > > [ 9.008161] ===================================================
> > > > [ 9.008163] DEPT: Circular dependency has been detected.
> > > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> > > > [ 9.008166] ---------------------------------------------------
> > > > [ 9.008167] summary
> > > > [ 9.008167] ---------------------------------------------------
> > > > [ 9.008168] *** DEADLOCK ***
> > > > [ 9.008168]
> > > > [ 9.008168] context A
> > > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > > > [ 9.008173]
> > > > [ 9.008173] context B
> > > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> > > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> > > > [ 9.008177]
> > > > [ 9.008178] context C
> > > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> > > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > > > [ 9.008181]
> > > > [ 9.008182] [S]: start of the event context
> > > > [ 9.008183] [W]: the wait blocked
> > > > [ 9.008183] [E]: the event not reachable
> > >
> > > So what situation is your tool complaining about here? Can you perhaps show
> > > it here in a more common visualization like:
> >
> > Sure.
> >
> > > TASK1                   TASK2
> > >                         does foo, grabs Z
> > > does X, grabs lock Y
> > > blocks on Z
> > >                         blocks on Y
> > >
> > > or something like that? Because I was not able to decipher this from the
> > > report even after trying for some time...
> >
> > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> >
> > wait A
> > --- stuck
> >                         wait B
> >                         --- stuck
> >                                                 wait C
> >                                                 --- stuck
> >
> > wake up B               wake up C               wake up A
> >
> > where:
> >    A is a wait_queue, j_wait_commit
> >    B is a wait_queue, j_wait_transaction_locked
> >    C is a rwsem, mapping.invalidate_lock
>
> I see. But a situation like this is not necessarily a guarantee of a
> deadlock, is it? I mean there can be a task D that will eventually call, say,
> 'wake up B' and unblock everything and this is how things were designed to
> work? Multiple sources of wakeups are quite common I'd say... What does

Yes. At the very beginning, when I designed Dept, I was thinking for
quite a long time about whether to support multiple wakeup sources or
not. Supporting them would be a better option to avoid non-critical
reports. However, I thought we'd better fix it anyway - not urgent,
though - if there's any single circular dependency. That's why I decided
not to support it for now and wanted to gather the kernel folks'
opinions. The question is which policy we should go with.

> Dept do to prevent false reports in cases like this?
>
> > The above is the simplest form. And it's worth noting that Dept focuses
> > on waits and events themselves rather than grabbing and releasing things
> > like locks. The following is the more descriptive form of it.
> >
> > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> >
> > wait @j_wait_commit
> >                         ext4_truncate_failed_write()
> >                         down_write(mapping.invalidate_lock)
> >
> >                         ext4_truncate()
> >                         ...
> >                         wait @j_wait_transaction_locked
> >
> >                                                 ext4_truncate_failed_write()
> >                                                 down_write(mapping.invalidate_lock)
> >
> >                                                 ext4_should_retry_alloc()
> >                                                 ...
> >                                                 __jbd2_log_start_commit()
> >                                                 wake_up(j_wait_commit)
> > jbd2_journal_commit_transaction()
> > wake_up(j_wait_transaction_locked)
> >                         up_write(mapping.invalidate_lock)
> >
> > I hope this would help you understand the report.
>
> I see, thanks for explanation! So the above scenario is impossible because

My pleasure.

> for anyone to block on @j_wait_transaction_locked the transaction must be
> committing, which is done only by kjournald2 kthread and so that thread
> cannot be waiting at @j_wait_commit. Essentially blocking on
> @j_wait_transaction_locked means @j_wait_commit wakeup was already done.

kjournald2 repeatedly does the wait and the wake_up, so the above
scenario looks possible to me even based on what you explained. Maybe I
should understand the journal internals better for further discussion.
Your explanation is very helpful. Thank you, really.

Thanks,
Byungchul

> I guess this shows there can be non-trivial dependencies between wait
> queues which are difficult to track in an automated way and without such
> tracking we are going to see false positives...
>
> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-02-24 10:28:33

by Jan Kara

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu 24-02-22 10:11:02, Byungchul Park wrote:
> On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > >
> > > wait A
> > > --- stuck
> > >                         wait B
> > >                         --- stuck
> > >                                                 wait C
> > >                                                 --- stuck
> > >
> > > wake up B               wake up C               wake up A
> > >
> > > where:
> > >    A is a wait_queue, j_wait_commit
> > >    B is a wait_queue, j_wait_transaction_locked
> > >    C is a rwsem, mapping.invalidate_lock
> >
> > I see. But a situation like this is not necessarily a guarantee of a
> > deadlock, is it? I mean there can be a task D that will eventually call, say,
> > 'wake up B' and unblock everything and this is how things were designed to
> > work? Multiple sources of wakeups are quite common I'd say... What does
>
> > Yes. At the very beginning, when I designed Dept, I was thinking for
> > quite a long time about whether to support multiple wakeup sources or
> > not. Supporting them would be a better option to avoid non-critical
> > reports. However, I thought we'd better fix it anyway - not urgent,
> > though - if there's any single circular dependency. That's why I decided
> > not to support it for now and wanted to gather the kernel folks'
> > opinions. The question is which policy we should go with.

I see. So supporting only a single wakeup source is fine for locks I guess.
But for general wait queues or other synchronization mechanisms, I'm afraid
it will lead to quite some false positive reports. Just my 2c.

> > Dept do to prevent false reports in cases like this?
> >
> > > The above is the simplest form. And it's worth noting that Dept focuses
> > > on waits and events themselves rather than grabbing and releasing things
> > > like locks. The following is the more descriptive form of it.
> > >
> > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > >
> > > wait @j_wait_commit
> > >                         ext4_truncate_failed_write()
> > >                         down_write(mapping.invalidate_lock)
> > >
> > >                         ext4_truncate()
> > >                         ...
> > >                         wait @j_wait_transaction_locked
> > >
> > >                                                 ext4_truncate_failed_write()
> > >                                                 down_write(mapping.invalidate_lock)
> > >
> > >                                                 ext4_should_retry_alloc()
> > >                                                 ...
> > >                                                 __jbd2_log_start_commit()
> > >                                                 wake_up(j_wait_commit)
> > > jbd2_journal_commit_transaction()
> > > wake_up(j_wait_transaction_locked)
> > >                         up_write(mapping.invalidate_lock)
> > >
> > > I hope this would help you understand the report.
> >
> > I see, thanks for explanation! So the above scenario is impossible because
>
> My pleasure.
>
> > for anyone to block on @j_wait_transaction_locked the transaction must be
> > committing, which is done only by kjournald2 kthread and so that thread
> > cannot be waiting at @j_wait_commit. Essentially blocking on
> > @j_wait_transaction_locked means @j_wait_commit wakeup was already done.
>
> > kjournald2 repeatedly does the wait and the wake_up, so the above
> > scenario looks possible to me even based on what you explained. Maybe I
> > should understand the journal internals better for further discussion.
> > Your explanation is very helpful. Thank you, really.

OK, let me provide you with more details for better understanding :) In
jbd2 we have an object called a 'transaction'. This object can go
through many states, but what matters for our case is that a transaction
is moved into and out of the T_LOCKED state only while the
jbd2_journal_commit_transaction() function is executing, and waiting on
the j_wait_transaction_locked waitqueue is exactly waiting for a
transaction to get out of the T_LOCKED state. The function
jbd2_journal_commit_transaction() is executed only by kjournald. Hence
anyone can see a transaction in the T_LOCKED state only if kjournald is
running inside jbd2_journal_commit_transaction(), and thus kjournald
cannot be sleeping on j_wait_commit at the same time. Does this explain
things?
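
For completeness, the waiting side relies on the usual pattern that
avoids lost wakeups. Roughly - this is only my simplified sketch of
wait_transaction_locked() in fs/jbd2/transaction.c; the real code also
kicks the commit thread and handles journal abort:

        /* Entered with j_state_lock held for reading. */
        DEFINE_WAIT(wait);

        prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
                        TASK_UNINTERRUPTIBLE);
        /*
         * We are queued on the waitqueue before dropping the lock, so
         * a wake_up() from jbd2_journal_commit_transaction() cannot be
         * lost between the state check and the sleep.
         */
        read_unlock(&journal->j_state_lock);
        schedule();
        finish_wait(&journal->j_wait_transaction_locked, &wait);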

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-28 10:54:51

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote:
> On Thu 24-02-22 10:11:02, Byungchul Park wrote:
> > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> > > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > > >
> > > > wait A
> > > > --- stuck
> > > >                         wait B
> > > >                         --- stuck
> > > >                                                 wait C
> > > >                                                 --- stuck
> > > >
> > > > wake up B               wake up C               wake up A
> > > >
> > > > where:
> > > >    A is a wait_queue, j_wait_commit
> > > >    B is a wait_queue, j_wait_transaction_locked
> > > >    C is a rwsem, mapping.invalidate_lock
> > >
> > > I see. But a situation like this is not necessarily a guarantee of a
> > > deadlock, is it? I mean there can be a task D that will eventually call, say,
> > > 'wake up B' and unblock everything and this is how things were designed to
> > > work? Multiple sources of wakeups are quite common I'd say... What does
> >
> > Yes. At the very beginning, when I designed Dept, I was thinking for
> > quite a long time about whether to support multiple wakeup sources or
> > not. Supporting them would be a better option to avoid non-critical
> > reports. However, I thought we'd better fix it anyway - not urgent,
> > though - if there's any single circular dependency. That's why I decided
> > not to support it for now and wanted to gather the kernel folks'
> > opinions. The question is which policy we should go with.
>
> I see. So supporting only a single wakeup source is fine for locks I guess.
> But for general wait queues or other synchronization mechanisms, I'm afraid
> it will lead to quite some false positive reports. Just my 2c.

Thank you for your feedback.

I realized we've been using "false positive" differently. There exist
three types of code in terms of dependency and deadlock. It's worth
noting that, in Dept, dependencies are built between waits and events.

---

case 1. Code with an actual circular dependency, but not deadlock.

A circular dependency can be broken by a rescue wakeup source, e.g. a
timeout. It's not a deadlock if it's okay that the contexts
participating in the circular dependency, and others waiting for the
events in the circle, are stuck until it gets broken. Otherwise, say,
if that stall is not intended, then it's problematic anyway.

1-1. What if we judge this code is problematic?
1-2. What if we judge this code is good?

case 2. Code with an actual circular dependency, and deadlock.

There's no other wakeup source than those within the circular
dependency. Literally deadlock. It's problematic and critical.

2-1. What if we judge this code is problematic?
2-2. What if we judge this code is good?

case 3. Code with no actual circular dependency, and not deadlock.

Must be good.

3-1. What if we judge this code is problematic?
3-2. What if we judge this code is good?

---

I call only 3-1 a "false positive" circular dependency, while you call
both 1-1 and 3-1 "false positive" deadlocks.

I've been wondering whether the kernel folks, esp. Linus, consider code
with any circular dependency problematic, even if it won't lead to a
deadlock, say, case 1. Even though I designed Dept based on what I
believe is right, of course, I'm willing to change the design according
to the majority opinion.

However, if I were the owner of the kernel, I would never allow case 1,
for better stability, even though the code happens to work okay for now.

Thanks,
Byungchul

> > > Dept do to prevent false reports in cases like this?
> > >
> > > > The above is the simplest form. And it's worth noting that Dept focuses
> > > > on waits and events themselves rather than grabbing and releasing things
> > > > like locks. The following is the more descriptive form of it.
> > > >
> > > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > > >
> > > > wait @j_wait_commit
> > > >                         ext4_truncate_failed_write()
> > > >                         down_write(mapping.invalidate_lock)
> > > >
> > > >                         ext4_truncate()
> > > >                         ...
> > > >                         wait @j_wait_transaction_locked
> > > >
> > > >                                                 ext4_truncate_failed_write()
> > > >                                                 down_write(mapping.invalidate_lock)
> > > >
> > > >                                                 ext4_should_retry_alloc()
> > > >                                                 ...
> > > >                                                 __jbd2_log_start_commit()
> > > >                                                 wake_up(j_wait_commit)
> > > > jbd2_journal_commit_transaction()
> > > > wake_up(j_wait_transaction_locked)
> > > >                         up_write(mapping.invalidate_lock)
> > > >
> > > > I hope this would help you understand the report.
> > >
> > > I see, thanks for explanation! So the above scenario is impossible because
> >
> > My pleasure.
> >
> > > for anyone to block on @j_wait_transaction_locked the transaction must be
> > > committing, which is done only by kjournald2 kthread and so that thread
> > > cannot be waiting at @j_wait_commit. Essentially blocking on
> > > @j_wait_transaction_locked means @j_wait_commit wakeup was already done.
> >
> > kjournald2 repeatedly does the wait and the wake_up, so the above
> > scenario looks possible to me even based on what you explained. Maybe I
> > should understand the journal internals better for further discussion.
> > Your explanation is very helpful. Thank you, really.
>
> OK, let me provide you with more details for better understanding :) In
> jbd2 we have an object called a 'transaction'. This object can go
> through many states, but what matters for our case is that a transaction
> is moved into and out of the T_LOCKED state only while the
> jbd2_journal_commit_transaction() function is executing, and waiting on
> the j_wait_transaction_locked waitqueue is exactly waiting for a
> transaction to get out of the T_LOCKED state. The function
> jbd2_journal_commit_transaction() is executed only by kjournald. Hence
> anyone can see a transaction in the T_LOCKED state only if kjournald is
> running inside jbd2_journal_commit_transaction(), and thus kjournald
> cannot be sleeping on j_wait_commit at the same time. Does this explain
> things?
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-03-03 01:37:15

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Mon, Feb 28, 2022 at 04:25:04PM -0500, Theodore Ts'o wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > > A circular dependency can be broken by a rescue wakeup source, e.g. a
> > > timeout. It's not a deadlock if it's okay that the contexts
> > > participating in the circular dependency, and others waiting for the
> > > events in the circle, are stuck until it gets broken. Otherwise, say,
> > > if that stall is not intended, then it's problematic anyway.
> > >
> > > 1-1. What if we judge this code is problematic?
> > > 1-2. What if we judge this code is good?
> > >
> > > I've been wondering whether the kernel folks, esp. Linus, consider code
> > > with any circular dependency problematic, even if it won't lead to a
> > > deadlock, say, case 1. Even though I designed Dept based on what I
> > > believe is right, of course, I'm willing to change the design according
> > > to the majority opinion.
> > >
> > > However, if I were the owner of the kernel, I would never allow case 1,
> > > for better stability, even though the code happens to work okay for now.
>
> Note, I used the example of the timeout as the most obvious way of
> explaining that a deadlock is not possible. There is also the much
> more complex explanation which Jan was trying to give, which is what
> leads to the circular dependency. It can happen that when trying to
> start a handle, if either (a) there is not enough space in the journal
> for new handles, or (b) the current transaction is so large that if we
> don't close the transaction and start a new one, we will end up
> running out of space in the future, and so in that case,
> start_this_handle() will block starting any more handles, and then
> wake up the commit thread. The commit thread then waits for the
> currently running threads to complete, before it allows new handles to
> start, and then it will complete the commit. In the case of (a) we
> then need to do a journal checkpoint, which is more work to release
> space in the journal, and only then, can we allow new handles to start.

Thank you for the full explanation of how journal things work.

> The bottom line is (a) it works, (b) there aren't significant delays,
> and for DEPT to complain that this is somehow wrong and we need to
> completely rearchitect perfectly working code because it doesn't
> conform to DEPT's idea of what is "correct" is not acceptable.

Thanks to you and Jan Kara, I realized it's not a real dependency in
the consumer and producer scenario - but again *ONLY IF* there is a
rescue wakeup source. Dept should track the rescue wakeup source instead
in that case.

I won't ask you to rearchitect the working code. The code looks sane.

Thanks a lot.

Thanks,
Byungchul

> > We have a queue of work to do Q protected by lock L. Consumer process has
> > code like:
> >
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_queued);
> >         if (no work) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 unlock L
> >                 do work
> >                 wake_up(work_done)
> >         }
> > }
> >
> > AFAIU Dept will create dependency here that 'wakeup work_done' is after
> > 'wait for work_queued'. Producer has code like:
> >
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_done)
> >         if (too much work queued) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 queue work
> >                 unlock L
> >                 wake_up(work_queued)
> >         }
> > }
> >
> > And Dept will create dependency here that 'wakeup work_queued' is after
> > 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> > despite the code being perfectly valid and safe.
>
> Cheers,
>
> - Ted

2022-03-03 05:37:27

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

Ted wrote:
> On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote:
> >
> > Unfortunately, it's neither perfect nor safe without another wakeup
> > source - a rescue wakeup source.
> >
> > consumer                        producer
> >
> >                                 lock L
> >                                 (too much work queued == true)
> >                                 unlock L
> >                                 --- preempted
> > lock L
> > unlock L
> > do work
> > lock L
> > unlock L
> > do work
> > ...
> > (no work == true)
> > sleep
> >                                 --- scheduled in
> >                                 sleep
>
> That's not how things work in ext4. It's **way** more complicated

You seem to have gotten it wrong. This example is the one Jan Kara gave
me. I just tried to explain things based on Jan Kara's example, so I
left all the statements that Jan Kara wrote as they were. Plus, the
example was very helpful. Thanks, Jan Kara.

> than that. We have multiple wait channels, one wake up the consumer
> (read: the commit thread), and one which wakes up any processes
> waiting for commit thread to have made forward progress. We also have
> two spin-lock protected sequence numbers, one which indicates the
> currently committed transaction #, and one indicating the transaction #
> that needs to be committed.
>
> On the commit thread, it will sleep on j_wait_commit, and when it is
> woken up, it will check to see if there is work to be done
> (j_commit_sequence != j_commit_request), and if so, do the work, and
> then wake up processes waiting on the wait_queue j_wait_done_commit.
> (Again, all of this uses the pattern, "prepare to wait", then check to
> see if we should sleep, if we do need to sleep, unlock j_state_lock,
> then sleep. So this prevents any races leading to lost wakeups.)
>
> On the start_this_handle() thread, if the current transaction is too
> full, we set j_commit_request to its transaction id to indicate that
> we want the current transaction to be committed, and then we wake up
> the j_wait_commit wait queue and then we enter a loop where do a
> prepare_to_wait in j_wait_done_commit, check to see if
> j_commit_sequence == the transaction id that we want to be completed,
> and if it's not done yet, we unlock the j_state_lock spinlock, and go
> to sleep. Again, because of the prepare_to_wait, there is no chance
> of a lost wakeup.

The above explanation gives me a clear view of how the journal
synchronization works. I appreciate it.
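
Just to check my understanding, I'd summarize the commit-thread side you
describe as roughly the following. This is only my simplified sketch,
not the actual kjournald2 in fs/jbd2/journal.c, which also handles the
commit timer, checkpointing and unmount:

        while (1) {
                DEFINE_WAIT(wait);
                int should_sleep;

                write_lock(&journal->j_state_lock);
                if (journal->j_commit_sequence != journal->j_commit_request) {
                        write_unlock(&journal->j_state_lock);
                        /* Do the work, then wake up anyone waiting for it. */
                        jbd2_journal_commit_transaction(journal);
                        wake_up(&journal->j_wait_done_commit);
                        continue;
                }
                /* Queue ourselves before re-checking, so no wakeup is lost. */
                prepare_to_wait(&journal->j_wait_commit, &wait,
                                TASK_INTERRUPTIBLE);
                should_sleep = journal->j_commit_sequence ==
                               journal->j_commit_request;
                write_unlock(&journal->j_state_lock);
                if (should_sleep)
                        schedule();
                finish_wait(&journal->j_wait_commit, &wait);
        }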

> So there really is no "consumer" and "producer" here. If you really
> insist on using this model, which really doesn't apply, for one

Dept does not assume a "consumer and producer" model at all; Dept works
with general waits and events. *That model is just one of them.*

> thread, it's the consumer with respect to one wait queue, and the
> producer with respect to the *other* wait queue. For the other
> thread, the consumer and producer roles are reversed.
>
> And of course, this is a highly simplified model, since we also have a
> wait queue used by the commit thread to wait for the number of active
> handles on a particular transaction to go to zero, and
> stop_this_handle() will wake up commit thread via this wait queue when
> the last active handle on a particular transaction is retired. (And
> yes, that parameter is also protected by a different spin lock which
> is per-transaction).

This one also gives me a clear view. Thanks a lot.

> So it seems to me that a fundamental flaw in DEPT's model is assuming
> that the only waiting paradigm that can be used is consumer/producer,

No, Dept does not.

> and that's simply not true. The fact that you use the term "lock" is
> also going to lead to a misleading line of reasoning, because properly

"lock/unlock L" comes from the Jan Kara's example. It has almost nothing
to do with the explanation. I just left "lock/unlock L" as a statement
that comes from the Jan Kara's example.

> speaking, they aren't really locks. We are simply using wait channels

I totally agree with you. *They aren't really locks; it's just waits
and wakeups.* That's exactly why I decided to develop Dept. Unlike
Lockdep, Dept is not interested in locks but focuses on waits and wakeup
sources themselves. I think you've gotten Dept quite wrong. Please ask
me more if there's anything you doubt about Dept.

Thanks,
Byungchul

2022-03-03 10:55:38

by Jan Kara

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > > A circular dependency can be broken by a rescue wakeup source e.g.
> > > timeout. It's not a deadlock if it's okay that the contexts
> > > participating in the circular dependency, and others waiting for the
> > > events in the circle, are stuck until it gets broken. Otherwise,
> > > say, if that stall is not intended, then it's problematic anyway.
> > >
> > > 1-1. What if we judge this code is problematic?
> > > 1-2. What if we judge this code is good?
> > >
> > > case 2. Code with an actual circular dependency, and deadlock.
> > >
> > > There's no other wakeup source than those within the circular
> > > dependency. Literally deadlock. It's problematic and critical.
> > >
> > > 2-1. What if we judge this code is problematic?
> > > 2-2. What if we judge this code is good?
> > >
> > > case 3. Code with no actual circular dependency, and not deadlock.
> > >
> > > Must be good.
> > >
> > > 3-1. What if we judge this code is problematic?
> > > 3-2. What if we judge this code is good?
> > >
> > > ---
> > >
> > > I call only 3-1 a "false positive" circular dependency, while you
> > > call both 1-1 and 3-1 "false positive" deadlocks.
> > >
> > > I've been wondering whether the kernel folks, esp. Linus, consider
> > > code with any circular dependency problematic, even if it won't lead
> > > to a deadlock, say, case 1. Even though I designed Dept based on what
> > > I believe is right, of course, I'm willing to change the design
> > > according to the majority opinion.
> > >
> > > However, I would never allow case 1 if I were the owner of the kernel
> > > for better stability, even though the code works anyway okay for now.
> >
> > So yes, I call a report for the situation "There is a circular dependency but
> > deadlock is not possible." a false positive. And that is because in my
> > opinion your definition of circular dependency includes schemes that are
> > useful and used in the kernel.
> >
> > Your example in case 1 is kind of borderline (I personally would consider
> > that bug as well) but there are other more valid schemes with multiple
> > wakeup sources like:
> >
> > We have a queue of work to do Q protected by lock L. Consumer process has
> > code like:
> >
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_queued);
> >         if (no work) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 unlock L
> >                 do work
> >                 wake_up(work_done)
> >         }
> > }
> >
> > AFAIU Dept will create a dependency here that 'wakeup work_done' is after
> > 'wait for work_queued'. Producer has code like:
>
> First of all, thank you for this good example.
>
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_done)
> >         if (too much work queued) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 queue work
> >                 unlock L
> >                 wake_up(work_queued)
> >         }
> > }
> >
> > And Dept will create a dependency here that 'wakeup work_queued' is after
> > 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> > despite the code being perfectly valid and safe.
>
> Unfortunately, it's neither perfect nor safe without another wakeup
> source - a rescue wakeup source.
>
> consumer                          producer
>
>                                   lock L
>                                   (too much work queued == true)
>                                   unlock L
>                                   --- preempted
> lock L
> unlock L
> do work
> lock L
> unlock L
> do work
> ...
> (no work == true)
> sleep
>                                   --- scheduled in
>                                   sleep
>
> This code leads to a deadlock without another wakeup source; that is, it's not safe.

So the scenario you describe above is indeed possible. But the trick is
that the wakeup from 'consumer' as it is doing work will remove
'producer' from the wait queue and change the 'producer' process state
to 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()),
the scheduler will just treat this as another preemption point and the
'producer' will immediately or soon continue to run. So indeed we can
think of this as "another wakeup source", but the source is in the CPU
scheduler itself. This is the standard way waitqueues are used in the
kernel...
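
In code form, the window in question is the gap between
prepare_to_wait() and schedule() on the producer side. A minimal sketch
(reusing the hypothetical names from the example above, not real kernel
code):

static void producer_sketch(void)
{
        DEFINE_WAIT(wait);

        spin_lock(&L);
        prepare_to_wait(&work_done, &wait, TASK_UNINTERRUPTIBLE);
        if (too_much_work_queued) {
                spin_unlock(&L);
                /*
                 * Preemption can happen here. If the consumer meanwhile
                 * does the work and calls wake_up(&work_done), that
                 * wake_up() removes this task from the wait queue and
                 * sets it back to TASK_RUNNING. The schedule() below
                 * then acts as a mere preemption point and returns
                 * promptly - the "rescue wakeup source" is the CPU
                 * scheduler itself.
                 */
                schedule();
        } else {
                queue_work_item();      /* hypothetical helper */
                spin_unlock(&L);
                wake_up(&work_queued);
        }
        finish_wait(&work_done, &wait);
}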

> Lastly, just for your information, I need to explain how Dept works a
> little more so that you don't misunderstand it.
>
> Assuming the consumer and producer are guaranteed not to lead to a
> deadlock, like the following, Dept won't report it as a problem:
>
> consumer                          producer
>
>                                   sleep
> wakeup work_done
>                                   queue work
> sleep
>                                   wakeup work_queued
> do work
>                                   sleep
> wakeup work_done
>                                   queue work
> sleep
>                                   wakeup work_queued
> do work
>                                   sleep
> ...                               ...
>
> Dept does not consider all waits preceding an event, but only waits
> that might lead to a deadlock. In this case, Dept works with each
> region independently.
>
> consumer                          producer
>
>                                   sleep <- initiates region 1
> --- region 1 starts
> ...                               ...
> --- region 1 ends
> wakeup work_done
> ...                               ...
>                                   queue work
> ...                               ...
> sleep <- initiates region 2
> --- region 2 starts
> ...                               ...
> --- region 2 ends
>                                   wakeup work_queued
> ...                               ...
> do work
> ...                               ...
>                                   sleep <- initiates region 3
> --- region 3 starts
> ...                               ...
> --- region 3 ends
> wakeup work_done
> ...                               ...
>                                   queue work
> ...                               ...
> sleep <- initiates region 4
> --- region 4 starts
> ...                               ...
> --- region 4 ends
>                                   wakeup work_queued
> ...                               ...
> do work
> ...                               ...
>
> That is, Dept does not build dependencies across different regions. So
> you don't have to worry about unreasonable false positives that much.
>
> Thoughts?

Thanks for the explanation! And what exactly defines the 'regions'? When
some process goes to sleep on some waitqueue, this defines the start of
a region at the place where all the other processes are at that moment,
and a wakeup of the waitqueue is the end of the region?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-03-04 07:16:15

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu, Mar 03, 2022 at 09:36:25AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 03, 2022 at 02:23:33PM +0900, Byungchul Park wrote:
> > I totally agree with you. *They aren't really locks; they're just
> > waits and wakeups.* That's exactly why I decided to develop Dept.
> > Unlike Lockdep, Dept is not interested in locks but focuses on waits
> > and wakeup sources themselves. I think you misunderstand Dept a lot.
> > Please ask me if there are things about Dept you doubt.
>
> So the question is this --- do you now understand why, even though
> there is a circular dependency, nothing gets stalled in the
> interactions between the two wait channels?

??? I'm afraid I don't get you.

All contexts waiting for any of the events in the circular dependency
chain will definitely be stuck if there is a circular dependency, as I
explained. So we need another wakeup source to break the circle. In the
ext4 code, you might have a wakeup source for breaking the circle.

What I agreed with is:

The case where 1) the circular dependency is inevitable, 2) there is
another wakeup source for breaking the circle, and 3) the duration of
the sleep is short enough, should be acceptable.

Sounds good?

Thanks,
Byungchul

2022-03-04 16:11:12

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu, Mar 03, 2022 at 10:54:56AM +0100, Jan Kara wrote:
> On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> > Unfortunately, it's neither perfect nor safe without another wakeup
> > source - a rescue wakeup source.
> >
> > consumer                          producer
> >
> >                                   lock L
> >                                   (too much work queued == true)
> >                                   unlock L
> >                                   --- preempted
> > lock L
> > unlock L
> > do work
> > lock L
> > unlock L
> > do work
> > ...
> > (no work == true)
> > sleep
> >                                   --- scheduled in
> >                                   sleep
> >
> > This code leads to a deadlock without another wakeup source; that is, it's not safe.
>
> So the scenario you describe above is indeed possible. But the trick is
> that the wakeup from 'consumer' as it is doing work will remove
> 'producer' from the wait queue and change the 'producer' process state
> to 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()),
> the scheduler will just treat this as another preemption point and the
> 'producer' will immediately or soon continue to run. So indeed we can
> think of this as "another wakeup source", but the source is in the CPU
> scheduler itself. This is the standard way waitqueues are used in the
> kernel...

Nice! Thanks for the explanation. I will take it into account if needed.

> > Lastly, just for your information, I need to explain how Dept works a
> > little more so that you don't misunderstand it.
> >
> > Assuming the consumer and producer are guaranteed not to lead to a
> > deadlock, like the following, Dept won't report it as a problem:
> >
> > consumer                          producer
> >
> >                                   sleep
> > wakeup work_done
> >                                   queue work
> > sleep
> >                                   wakeup work_queued
> > do work
> >                                   sleep
> > wakeup work_done
> >                                   queue work
> > sleep
> >                                   wakeup work_queued
> > do work
> >                                   sleep
> > ...                               ...
> >
> > Dept does not consider all waits preceding an event, but only waits
> > that might lead to a deadlock. In this case, Dept works with each
> > region independently.
> >
> > consumer                          producer
> >
> >                                   sleep <- initiates region 1
> > --- region 1 starts
> > ...                               ...
> > --- region 1 ends
> > wakeup work_done
> > ...                               ...
> >                                   queue work
> > ...                               ...
> > sleep <- initiates region 2
> > --- region 2 starts
> > ...                               ...
> > --- region 2 ends
> >                                   wakeup work_queued
> > ...                               ...
> > do work
> > ...                               ...
> >                                   sleep <- initiates region 3
> > --- region 3 starts
> > ...                               ...
> > --- region 3 ends
> > wakeup work_done
> > ...                               ...
> >                                   queue work
> > ...                               ...
> > sleep <- initiates region 4
> > --- region 4 starts
> > ...                               ...
> > --- region 4 ends
> >                                   wakeup work_queued
> > ...                               ...
> > do work
> > ...                               ...
> >
> > That is, Dept does not build dependencies across different regions. So
> > you don't have to worry about unreasonable false positives that much.
> >
> > Thoughts?
>
> Thanks for the explanation! And what exactly defines the 'regions'? When
> some process goes to sleep on some waitqueue, this defines the start of
> a region at the place where all the other processes are at that moment,
> and a wakeup of the waitqueue is the end of the region?

Yes. Let me explain it more for better understanding.
(I copied this from the discussion I had with Matthew.)


ideal view
-----------
context X                          context Y

request event E                    ...
write REQUESTEVENT                 when (notice REQUESTEVENT written)
...                                    notice the request from X [S]

--- ideally region 1 starts here
wait for the event                 ...
sleep                              if (can see REQUESTEVENT written)
                                       it's on the way to the event
                                   ...
...
--- ideally region 1 ends here

                                   finally the event [E]

Dept basically works with the above view with regard to waits and
events. But it's very hard to identify the ideal [S] point in practice,
so Dept instead identifies the [S] point by checking WAITSTART with
memory barriers, as follows, which makes Dept work conservatively.


Dept's view
------------
context X                          context Y

request event E                    ...
write REQUESTEVENT                 when (notice REQUESTEVENT written)
...                                    notice the request from X

--- region 2 Dept gives up starts
wait for the event                 ...
write barrier
write WAITSTART                    read barrier
sleep                              when (notice WAITSTART written)
                                       ensure the request has come [S]

--- region 2 Dept gives up ends
--- region 3 starts here
...
                                   if (can see WAITSTART written)
                                       it's on the way to the event
                                   ...
...
--- region 3 ends here

                                   finally the event [E]

In short, Dept works with region 3.
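
Expressed with kernel barrier primitives, the handshake above would look
something like this (a hypothetical sketch; the flag names are invented
for illustration and are not Dept's actual implementation):

        /* context X (the waiter) */
        WRITE_ONCE(request_event, 1);   /* request event E */
        smp_wmb();                      /* order the request before the flag */
        WRITE_ONCE(wait_start, 1);      /* publish WAITSTART */
        schedule();                     /* sleep */

        /* context Y (the event side) */
        if (READ_ONCE(wait_start)) {    /* notice WAITSTART written */
                smp_rmb();              /* pairs with smp_wmb() in X */
                /*
                 * X's request is guaranteed to be visible here, so this
                 * point can conservatively serve as [S]: from here on,
                 * Y is considered on the way to triggering event E.
                 */
        }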

Thanks,
Byungchul

2022-03-05 16:21:53

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Fri, Mar 04, 2022 at 10:26:23PM -0500, Theodore Ts'o wrote:
> On Fri, Mar 04, 2022 at 09:42:37AM +0900, Byungchul Park wrote:
> >
> > All contexts waiting for any of the events in the circular dependency
> > chain will definitely be stuck if there is a circular dependency, as I
> > explained. So we need another wakeup source to break the circle. In
> > the ext4 code, you might have a wakeup source for breaking the circle.
> >
> > What I agreed with is:
> >
> > The case where 1) the circular dependency is inevitable, 2) there is
> > another wakeup source for breaking the circle, and 3) the duration of
> > the sleep is short enough, should be acceptable.
> >
> > Sounds good?
>
> These dependencies are part of every single ext4 metadata update,
> and if there were any unnecessary sleeps, this would be a major
> performance gap, and this is a very well studied part of ext4.
>
> There are some places where we sleep, sure. In some cases
> start_this_handle() needs to wait for a commit to complete, and the
> commit thread might need to sleep for I/O to complete. But the moment
> the thing that we're waiting for is complete, we wake up all of the
> processes on the wait queue. But in the case where we wait for I/O
> completion, that wakeup is coming from the device driver, when it
> receives the I/O completion interrupt from the hard drive. Is
> that considered an "external source"? Maybe DEPT doesn't recognize
> that this is certain to happen just as day follows the night? (Well,
> maybe the I/O completion interrupt might not happen if the disk drive
> bursts into flames --- but then, you've got bigger problems. :-)

Almost everything you've been blaming on Dept is total nonsense. Based
on what you're saying, I'm convinced that you don't understand even 1%
of how Dept works. You don't even try to understand it before blaming
it.

You don't have to understand or support it. But I can't respond to you
if you keep saying silly things that way.

> In any case, if DEPT is going to report these circular dependencies as
> "bugs that MUST be fixed", it's going to be pure noise and I will
> ignore all DEPT reports, and will push back on having Lockdep replaced

Dept is going to be improved so that what you are concerned about won't
be reported.

> by DEPT --- because Lockdep gives us actionable reports, and if DEPT

Right. Dept should give actionable reports, too.

> can't tell the difference between a valid programming pattern and a
> bug, then it's worse than useless.

Needless to say.

2022-03-05 16:26:46

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Fri, Mar 04, 2022 at 10:40:35PM -0500, Theodore Ts'o wrote:
> On Fri, Mar 04, 2022 at 12:20:02PM +0900, Byungchul Park wrote:
> >
> > I found a point where the two wait channels don't lead to a deadlock
> > in some cases, thanks to Jan Kara. I will fix it so that Dept won't
> > complain about it.
>
> I sent my last (admittedly cranky) message before you sent this. I'm
> glad you finally understood Jan's explanation. I was trying to tell

Not "finally" - I've understood him every time he tried to tell me something.

> you the same thing, but apparently I failed to communicate in a

I don't think so. Your point and Jan's point are different. Everything
he has said makes sense. But yours does not.

> sufficiently clear manner. In any case, what Jan described is a
> fundamental part of how wait queues work, and I'm kind of amazed that
> you were able to implement DEPT without understanding it. (But maybe

Of course, it was possible because all that Dept has to know for its
basic operation is waits and events. Subtle things like what Jan told me
help make Dept better.

> that is why some of the DEPT reports were completely incomprehensible

It's because you are blindly blaming it without understanding how Dept
works at all. I will fix what must be fixed. Don't worry.

> to me; I couldn't interpret why in the world DEPT was saying there was
> a problem.)

I can tell you if you really want to understand why. But I can't if you
are like this.

> In any case, the thing I would ask is a little humility. We regularly
> use lockdep, and we run a huge number of stress tests, throughout each
> development cycle.

Sure.

> So if DEPT is issuing lots of reports about apparently circular
> dependencies, please try to be open to the thought that the fault is

No one claimed that Dept has no faults. I think you worry too much.

> in DEPT, and don't try to argue with maintainers that their code MUST
> be buggy --- but since you don't understand our code, and DEPT must be

No one argued that their code must be buggy, either. So I don't think
you have to worry about something that has never happened.

> theoretically perfect, that it is up to the Maintainers to prove to
> you that their code is correct.
>
> I am going to gently suggest that it is at least as likely, if not
> more likely, that the failure is in DEPT or your understanding of what

No doubt. I already think so. But that doesn't mean I have to keep quiet
instead of discussing how to improve Dept. I will keep improving Dept in
a reasonable way.

> how kernel wait channels and locking works. After all, why would it
> be that we haven't found these problems via our other QA practices?

Let's talk more once you understand at least 10% of how Dept works.
Otherwise, I don't think we can talk in a productive way.

2022-03-10 01:51:41

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Sun, Mar 06, 2022 at 09:19:10AM -0500, Theodore Ts'o wrote:
> On Sun, Mar 06, 2022 at 07:51:42PM +0900, Byungchul Park wrote:
> > >
> > > Users of DEPT must not have to understand how DEPT works in order to
> >
> > Users must not have to understand how Dept works, for sure, and
> > haters must not criticize things based on wrong guesses.
>
> For the record, I don't hate DEPT. I *fear* that DEPT will result in
> my getting spammed with a huge number of false positives once automated
> testing systems like Syzkaller, the zero-day test robot, etc., get hold
> of it after it gets merged and start generating hundreds of automated
> reports.

Agreed. Dept should not be part of an *automated testing system* until
it finally matches Lockdep in terms of false positives. However, it's
impossible to achieve that while keeping it out of the tree.

Besides automated testing systems, Dept works great in the middle of
developing something with complicated synchronization. Developers don't
have to worry anymore about real reports - ones that should be reported
- being suppressed by a false positive.

I will explicitly describe EXPERIMENTAL and "Dept might false-alarm" in
Kconfig until it's considered a tool with few false alarms.

> > Sure, it should be done manually. I should do it on my own when that
> > kind of issue arises.
>
> The question here is how often will it need to be done, and how easy

I guess it will rarely happen. I'd like to see, too.

> will it be to "do it manually"? Suppose we mark all of the DEPT false

Very easy. Equal to or easier than what we do for Lockdep. But the
objects of interest would be wait/event objects rather than locks.

> positives before it gets merged? How easy will it be to suppress
> future false positives as the kernel evolves?

Same as - or even better than - what we do for Lockdep.

And we'd better consider those activities as code documentation. Not
only making things work, but also organizing the code and documenting
it in code, is very meaningful.

> Perhaps one method is to have a way to take a particular wait queue,
> or call to schedule(), or at the level of an entire kernel source
> file, and opt it out from DEPT analysis? That way, if DEPT gets
> merged, and a maintainer starts getting spammed by bogus (or

Once Dept gets stable - and hopefully, now that Dept works very
conservatively, there might not be as many false positives as you're
worrying about. The situation is under control.

> That way we don't necessarily need to have a debate over how close to
> zero percent false positives is necessary before DEPT can get merged.

Nonsense. I would agree with you if the same bar had been applied when
Lockdep was merged. Still, I'll try to achieve almost zero false
positives; again, it's impossible to do that out of tree.

> And we avoid needing to force maintainers to prove that a DEPT report

So... it'd be okay if Dept goes in, just not as part of an automated
testing system. Right?

> is a false positive, which is from my experience hard to do, since
> they get accused of being DEPT haters and not understanding DEPT.

Honestly, the problem is not that they don't understand domains other
than the ones they are familiar with; it's another issue. I won't
mention it.

And it sounds like you'd do nothing unless something turns out to be
100% problematic. That's definitely the *easiest* way to maintain
something, because it's the same as not checking its correctness at all.

Even if it's hard to do, repeatedly checking whether the code is really
safe is what surely should be done. Again, I understand it would be
freaking hard. But that doesn't mean we should avoid it.

Here, there seem to be two points you'd like to make:

1. A fundamental question: does Dept track waits and events correctly?
2. Even if so, can Dept consider all the subtle things in the kernel?

For 1, I'm willing to respond to whatever it is. And not only me - we
can make it work perfectly if the concept and direction are *right*.
For 2, I need to ask questions and try my best to fix such issues if
they exist.

Again: Dept won't be part of an *automated testing system* until it
finally matches Lockdep in terms of false positives. Hopefully you are
okay with that.

---
Byungchul