Hi,
Recently my Arch Linux system (on kernel 6.3.4-zen1-1-zen) has
switched to the NTFS3 driver for NTFS disks mounted through udisks2
(as opposed to fstab). Since then, I've encountered three progressive
system hangs after mounting my Windows partition, which I've traced to
a deadlock in NTFS3. Using the systemd debug console on TTY9, I was
able to save the dmesg logs to my btrfs rootfs (systemd-journald
itself had hung on cgroup_mutex while trying to write these logs to
the journal). A cleaned-up stack trace of kswapd0 (with inlined stack
frames reconstructed by looking up the call offsets in Ghidra) is as
follows:
[10943.243496] task:kswapd0 state:D stack:0 pid:113 ppid:2 flags:0x00004000
[10943.243524] __schedule+0xc18/0x1620
[10943.243564] schedule+0x5e/0xd0
[↓ __wait_on_freeing_inode]
[10943.243573] find_inode+0x144/0x1a0
[↓ ilookup5_nowait]
[10943.243616] ilookup5+0x79/0x110
[10943.243669] iget5_locked+0x2a/0xf0
[10943.243678] ntfs_iget5+0x4e/0x1620 [ntfs3 779ebf19c5973006d2574d6bf4f57b3f94982500]
[↓ ni_update_parent] // get parent directory inodes
[10943.243725] ni_write_inode+0x9eb/0x1140 [ntfs3 779ebf19c5973006d2574d6bf4f57b3f94982500]
[↓ _ni_write_inode(inode, inode_needs_sync(inode))]
[10943.243750] ntfs_evict_inode+0x56/0x60 [ntfs3 779ebf19c5973006d2574d6bf4f57b3f94982500]
[↓ (op: super_operations*)->evict_inode(inode)]
[10943.243773] evict+0xd1/0x2a0
[↓ dispose_list]
[10943.243781] prune_icache_sb+0x92/0xd0
[10943.243791] super_cache_scan+0x16a/0x1f0
[10943.243801] do_shrink_slab+0x13c/0x2e0
[↓ shrink_slab_memcg]
[10943.243811] shrink_slab+0x1f5/0x2a0
[10943.243821] shrink_one+0x154/0x1c0
[↓ lru_gen_shrink_node]
[10943.243830] shrink_node+0x902/0xc10
[↓ kswapd_shrink_node]
[↓ balance_pgdat]
[10943.243840] kswapd+0x74c/0xfe0
There were a total of four tasks hung at the same time; the other
three were:
- kworker/u64 -> writeback_sb_inodes (blocked in __wait_on_freeing_inode)
- kworker/5 -> free_ipc (waiting to down_write(&shrinker_rwsem), which
  kswapd0 holds)
- Firefox's IPC Launch thread trying to create a mount namespace (also
  waiting to down_write(&shrinker_rwsem) held by kswapd0)
I've attached my full dmesg logs, which include the 4 initial hung
tasks and a zoo of subsequently hung tasks waiting on shrinker_rwsem
or in do_exit() (I'm not sure which inlined function they're blocked
in). However, I suspect that this deadlock actually occurs within a
single kernel thread.
I found a previous ntfs3 bug report at
https://lore.kernel.org/ntfs3/[email protected]/,
describing a hang with an identical stack trace (aside from minor
differences likely explained by inlining). The other hung tasks in
that bug report are either unrelated to mine (and probably a side
effect), or are trying to down_write(&shrinker_rwsem) (which I also
think is a side effect, not what keeps an inode stuck in
I_FREEING|I_WILL_FREE). So I think either the deadlock is
single-threaded, or the other responsible thread is not itself hung
and therefore doesn't show up in dmesg.
Unfortunately that bug report didn't include an explanation of what
went wrong, and it never got a reply or a fix.
----
I have a working theory of how this deadlock occurred.
https://lore.kernel.org/lkml/[email protected]/T/
reported a similar single-threaded deadlock in the ext4 filesystem, on
malformed filesystems where an inode and its EA (extended attribute)
inode share the same inode number. When do_mount -> ...evict calls
op->evict_inode(inode) = ext4_evict_inode(), that function ends up
calling iget_locked on the very inode currently being evicted.
iget_locked blocks in find_inode_fast -> __wait_on_freeing_inode,
which waits for the inode to leave I_FREEING|I_WILL_FREE or signal
I_NEW, which can never happen until ext4_evict_inode returns to
evict(), so the thread is waiting on itself.
In the case of my deadlock (and the ukr.net bug report), I suspect
that when kswapd -> ...evict calls op->evict_inode(inode) =
ntfs_evict_inode, ni_update_parent eventually calls (ntfs_iget5 ->
iget5_locked) on the inode currently being evicted. iget5_locked
blocks in find_inode -> __wait_on_freeing_inode, which waits for the
inode to leave I_FREEING|I_WILL_FREE or signal I_NEW, which again can
never happen until ntfs_evict_inode returns to evict(), so the thread
is waiting on itself.
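To make the shape of that cycle concrete, here is a minimal userspace
sketch of the pattern (plain C + pthreads, not kernel code; every
struct and function name below is made up for illustration, and only
the overall "evict re-looks-up its own key and waits on itself" shape
is taken from the analysis above). The wait has a timeout so the demo
terminates instead of hanging:

/* toy_self_deadlock.c: userspace analogue of the suspected
 * single-threaded deadlock; an object being evicted is looked up
 * again by its own eviction path, which then waits for a flag that
 * only its own caller can clear.  Build: cc -pthread toy_self_deadlock.c
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define I_FREEING 0x1

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;

struct obj {
        unsigned long ino;      /* "inode number" */
        unsigned int  state;
};

static struct obj cache[1] = { { .ino = 5, .state = 0 } };  /* one-entry "icache" */

/* analogue of find_inode()/__wait_on_freeing_inode(): if the matching
 * entry is being freed, wait (with a timeout so the demo ends) for
 * that to finish */
static struct obj *lookup(unsigned long ino)
{
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 3;

        pthread_mutex_lock(&lock);
        while (cache[0].ino == ino && (cache[0].state & I_FREEING)) {
                if (pthread_cond_timedwait(&wake, &lock, &deadline)) {
                        printf("lookup(%lu): still I_FREEING after 3s; "
                               "nothing will ever clear it (the deadlock)\n", ino);
                        pthread_mutex_unlock(&lock);
                        return NULL;
                }
        }
        pthread_mutex_unlock(&lock);
        return &cache[0];
}

/* analogue of ntfs_evict_inode -> ni_update_parent -> ntfs_iget5:
 * while evicting, look up an inode number equal to our own */
static void evict_inode(struct obj *o)
{
        lookup(o->ino);
}

static void evict(struct obj *o)
{
        pthread_mutex_lock(&lock);
        o->state |= I_FREEING;        /* flag is set before the fs hook runs */
        pthread_mutex_unlock(&lock);

        evict_inode(o);               /* stuck inside lookup() until the timeout */

        /* only reachable after evict_inode() returns: clear the flag
         * and wake waiters, exactly the step lookup() is waiting for */
        pthread_mutex_lock(&lock);
        o->state &= ~I_FREEING;
        pthread_cond_broadcast(&wake);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        evict(&cache[0]);
        return 0;
}

Here lookup() gives up after 3 seconds; in the kernel there is no such
timeout, which would match the permanently hung kswapd0 above.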
I'm thinking my deadlock occurs when a (malformed?) NTFS inode has the
same inode number as one of its ancestors, gets modified, and is then
evicted due to memory pressure (usually caused by web browsers, though
I can trigger kswapd reliably by creating a 40000x20000 image in GIMP).
I've tried to reproduce this hang using a self-built linux-mainline
package (6.4-rc4) without DKMS modules. Unfortunately I was unable to
mount the NTFS partition, because the previous crash (on linux-zen) or
something else had marked it as dirty. To clear the dirty bit, I
booted into Windows and ran chkdsk (at that point I didn't think the
bug was caused by a corrupted filesystem, but rather by a race
condition). Since then I can no longer reproduce the NTFS deadlock,
even on linux-zen.
I also could not reproduce this deadlock by going to
"(windows)/Users/user/AppData/Local/Application Data/Application
Data/..." (where Application Data is an NTFS link pointing to Local)
and touching files there.
----
Prior to this, I had an alternative theory involving a race condition
between an evictor thread (another thread calling evict?) and a reader
thread (calling find_inode). In this case, first the reader thread
(here, kswapd0 in find_inode) would reach a parent inode in
hlist_for_each_entry, and optionally call test(inode, data). Then the
evictor thread (in evict) would acquire i_lock, evict the parent inode
(op->evict_inode(inode) = ntfs_evict_inode), remove it from the hash
table (remove_inode_hash(inode)), signal all waiters
(wake_up_bit(&inode->i_state, __I_NEW)), and release i_lock (and
eventually return from the function as normal). At this point, the
reader thread would acquire i_lock, see that (inode->i_state &
(I_FREEING|I_WILL_FREE)) is set, and call __wait_on_freeing_inode,
which waits for I_NEW to be signaled.
If the evictor's previous wake_up_bit does not signal the reader's
*subsequent* __wait_on_freeing_inode, this could be another way that a
reader thread can deadlock in find_inode (or find_inode_fast) with no
other thread hung in inode-related code. Is this race condition
possible in practice, and capable of producing hung kernel threads?
I'm not experienced in kernel development or concurrency, so I don't
know for sure.
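For what it's worth, here is a userspace sketch of the generic
"one-shot wakeup fires before the waiter is queued" hazard I'm asking
about (pthreads, made-up names; this is not a claim about what the
kernel's wait_on_bit machinery actually does, only the pattern I have
in mind). The waiter deliberately sleeps on the event itself instead
of re-checking a condition, so the earlier wakeup is lost:

/* toy_lost_wakeup.c: the waker signals once before the waiter starts
 * waiting, and the waiter never sees it.
 * Build: cc -pthread toy_lost_wakeup.c
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;

/* "evictor": does its one-shot wakeup immediately and is then done
 * forever (standing in for evict() calling wake_up_bit once) */
static void *evictor(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        pthread_cond_broadcast(&wake);
        pthread_mutex_unlock(&lock);
        return NULL;
}

/* "reader": starts waiting only after the one-shot wakeup already
 * fired, and waits on the event itself without re-checking any
 * condition under the lock; the broadcast is lost and it would sleep
 * forever (a 3-second timeout stands in for "hung task") */
static void *reader(void *arg)
{
        struct timespec deadline;

        (void)arg;
        sleep(1);                        /* lose the race on purpose */

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 3;

        pthread_mutex_lock(&lock);
        if (pthread_cond_timedwait(&wake, &lock, &deadline))
                printf("reader: missed the one-shot wakeup, would hang forever\n");
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t e, r;

        pthread_create(&e, NULL, evictor, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(e, NULL);
        pthread_join(r, NULL);
        return 0;
}

(If the reader re-checked the actual condition under the lock before
sleeping, the lost broadcast would be harmless, which is why I'm not
sure the real code can hit this.)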
Interestingly, there seems to be an attempt to optimize the inode
locking paths at
https://lore.kernel.org/lkml/[email protected]/,
and that patch (and its predecessors) would change the code paths
involved in my deadlock drastically. However, the patch hasn't been
picked up or merged into kernel 6.4 yet.
--nyanpasu64