When a system with a large amount of memory has several million
negative dentries in a single directory, a soft lockup can occur while
adding an inotify watch:
watchdog: BUG: soft lockup - CPU#20 stuck for 9s! [inotifywait:9528]
CPU: 20 PID: 9528 Comm: inotifywait Kdump: loaded Not tainted 5.16.0-rc4.20211208.el8uek.rc1.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
RIP: 0010:__fsnotify_update_child_dentry_flags+0xad/0x120
Call Trace:
<TASK>
fsnotify_add_mark_locked+0x113/0x160
inotify_new_watch+0x130/0x190
inotify_update_watch+0x11a/0x140
__x64_sys_inotify_add_watch+0xef/0x140
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
This patch series is a modified version of the following:
https://lore.kernel.org/linux-fsdevel/1611235185-1685-1-git-send-email-gautham.ananthakrishna@oracle.com/
The strategy employed by this series is to move negative dentries to the
end of the d_subdirs list, and mark them with a flag as "tail negative".
Then, readers of the d_subdirs list, which are only interested in
positive dentries, can stop reading once they reach the first tail
negative dentry. By applying this patch, I'm able to avoid the above
softlockup caused by 200 million negative dentries on my test system.
Inotify watches are set up nearly instantly.
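The idea can be illustrated with a small userspace sketch (a plain singly
linked list with illustrative names such as `entry`, `sweep_negative`, and
`count_positive`; this is not the kernel's dentry/d_subdirs API): once a
negative entry is swept to the tail and flagged, a walker interested only in
positive entries stops at the first flagged one instead of scanning the rest.

```c
#include <stddef.h>

struct entry {
	const char *name;
	int positive;       /* 1 if this "dentry" has an inode */
	int tail_negative;  /* 1: this and every following sibling is negative */
	struct entry *next;
};

/* Unlink a negative entry and append it at the tail, marking it so
 * walkers know the remainder of the list is negative too. */
void sweep_negative(struct entry **head, struct entry *e)
{
	struct entry **pp = head;

	while (*pp && *pp != e)
		pp = &(*pp)->next;
	if (!*pp)
		return;
	*pp = e->next;			/* unlink */
	while (*pp)
		pp = &(*pp)->next;	/* find the tail */
	e->next = NULL;
	e->tail_negative = 1;
	*pp = e;
}

/* A walker that wants only positive entries: it stops at the first
 * tail-negative entry rather than scanning every negative sibling. */
int count_positive(struct entry *head, int *visited)
{
	struct entry *e;
	int n = 0;

	*visited = 0;
	for (e = head; e; e = e->next) {
		(*visited)++;
		if (e->tail_negative)
			break;
		if (e->positive)
			n++;
	}
	return n;
}
```

With one positive entry followed by two swept negatives, the walker visits
only two entries (the positive one plus the first flagged one), no matter how
many negatives follow.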
Previously, Al raised the following concerns:
1. Possible memory corruption due to the use of lock_parent() in
sweep_negative(); see patch 01 for the fix.
2. The previous patch didn't catch all the ways a negative dentry could
become positive (d_add, d_instantiate_new); see patch 01.
3. The previous series contained a new negative dentry limit, which
capped the negative dentry count at around 3 per hash bucket. I've
dropped that patch from the series.
Patches 2-4 are unmodified from the previous posting.
Konstantin Khlebnikov (3):
fsnotify: stop walking child dentries if remaining tail is negative
dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
dcache: stop walking siblings if remaining dentries all negative
Stephen Brennan (1):
dcache: sweep cached negative dentries to the end of list of siblings
fs/dcache.c | 101 +++++++++++++++++++++++++++++++++++++++--
fs/libfs.c | 3 ++
fs/notify/fsnotify.c | 6 ++-
include/linux/dcache.h | 6 +++
4 files changed, 110 insertions(+), 6 deletions(-)
--
2.30.2
For disk filesystems, the result of every negative lookup is cached,
and the contents of directories are usually cached too. Production of
negative dentries isn't limited by disk speed: it's really easy to
generate millions of them if the system has enough memory. Negative
dentries are linked into the siblings list along with normal positive
dentries. Some operations walk the dcache tree but look only at
positive dentries; the most important of these is fsnotify/inotify.
This patch sweeps negative dentries to the end of the siblings list at
final dput() and marks them with a flag which tells walkers that all
following dentries are negative too. This is done carefully to avoid
corruption in case the dentry is killed while we try to lock its
parent. The reverse operation (recycle) is required before
instantiating a tail negative dentry, or before calling d_add() with a
non-NULL inode.
Co-authored-by: Konstantin Khlebnikov <[email protected]>
Co-authored-by: Gautham Ananthakrishna <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---
Previously, Al was concerned that sweep_negative() calls lock_parent()
while holding no reference, so there was no guarantee that the dentry
would stay allocated. I've addressed this by adding an RCU read-side
critical section to sweep_negative(), ensuring that all writes to the
dentry occur within that critical section, so the dentry cannot yet
have been freed. By checking d_count once the lock is regained, we can
be sure that the dentry has not been killed.
This is not necessary for recycle_negative(), since it is called by
d_add() and d_instantiate(), both of whose callers should hold a
reference.
The other major concern with this patch was that not all the ways a
negative dentry can be converted to positive were caught by
recycle_negative(). I've moved the call into __d_instantiate() and
added a call in __d_add().
fs/dcache.c | 85 +++++++++++++++++++++++++++++++++++++++---
include/linux/dcache.h | 6 +++
2 files changed, 86 insertions(+), 5 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index cf871a81f4fd..685b91866e59 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -635,6 +635,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
return __lock_parent(dentry);
}
+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip all of them at first sight.
+ * Must be called at dput of negative dentry with d_lock held.
+ * Releases d_lock.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+ struct dentry *parent;
+
+ rcu_read_lock();
+ parent = lock_parent(dentry);
+ if (!parent) {
+ rcu_read_unlock();
+ return;
+ }
+
+ /*
+ * If we did not hold a reference to dentry (as in the case of dput),
+ * and dentry->d_lock was dropped in lock_parent(), then we could now be
+ * holding onto a dead dentry. Be careful to check d_count and unlock
+ * before dropping RCU lock, otherwise we could corrupt freed memory.
+ */
+ if (!d_count(dentry) && d_is_negative(dentry) &&
+ !d_is_tail_negative(dentry)) {
+ dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+ list_move_tail(&dentry->d_child, &parent->d_subdirs);
+ }
+
+ spin_unlock(&parent->d_lock);
+ spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ * Must hold dentry->d_lock, we may drop and re-grab it via lock_parent().
+ * Must hold a reference or be otherwise sure the dentry cannot be killed.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+ struct dentry *parent;
+
+ parent = lock_parent(dentry);
+ dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+ if (parent) {
+ list_move(&dentry->d_child, &parent->d_subdirs);
+ spin_unlock(&parent->d_lock);
+ }
+}
+
static inline bool retain_dentry(struct dentry *dentry)
{
WARN_ON(d_in_lookup(dentry));
@@ -740,7 +792,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
static inline bool fast_dput(struct dentry *dentry)
{
int ret;
- unsigned int d_flags;
+ unsigned int d_flags, required;
/*
* If we have a d_op->d_delete() operation, we sould not
@@ -788,6 +840,8 @@ static inline bool fast_dput(struct dentry *dentry)
* a 'delete' op, and it's referenced and already on
* the LRU list.
*
+ * Cached negative dentry must be swept to the tail.
+ *
* NOTE! Since we aren't locked, these values are
* not "stable". However, it is sufficient that at
* some point after we dropped the reference the
@@ -805,11 +859,16 @@ static inline bool fast_dput(struct dentry *dentry)
*/
smp_rmb();
d_flags = READ_ONCE(dentry->d_flags);
+
+ required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+ (d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
- DCACHE_DISCONNECTED | DCACHE_DONTCACHE;
+ DCACHE_DISCONNECTED | DCACHE_DONTCACHE |
+ DCACHE_TAIL_NEGATIVE;
/* Nothing to do? Dropping the reference was all we needed? */
- if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+ if (d_flags == required && !d_unhashed(dentry))
return true;
/*
@@ -881,7 +940,10 @@ void dput(struct dentry *dentry)
rcu_read_unlock();
if (likely(retain_dentry(dentry))) {
- spin_unlock(&dentry->d_lock);
+ if (d_is_negative(dentry) && !d_is_tail_negative(dentry))
+ sweep_negative(dentry); /* drops d_lock */
+ else
+ spin_unlock(&dentry->d_lock);
return;
}
@@ -1973,6 +2035,8 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
WARN_ON(d_in_lookup(dentry));
spin_lock(&dentry->d_lock);
+ if (d_is_tail_negative(dentry))
+ recycle_negative(dentry);
/*
* Decrement negative dentry count if it was in the LRU list.
*/
@@ -2697,6 +2761,13 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
struct inode *dir = NULL;
unsigned n;
spin_lock(&dentry->d_lock);
+ /*
+ * Tail negative dentries must become positive before associating an
+ * inode. recycle_negative() is safe to use because we hold a reference
+ * to dentry.
+ */
+ if (inode && d_is_tail_negative(dentry))
+ recycle_negative(dentry);
if (unlikely(d_in_lookup(dentry))) {
dir = dentry->d_parent->d_inode;
n = start_dir_add(dir);
@@ -2713,7 +2784,11 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
__d_rehash(dentry);
if (dir)
end_dir_add(dir, n);
- spin_unlock(&dentry->d_lock);
+
+ if (!inode && !d_is_tail_negative(dentry))
+ sweep_negative(dentry); /* drops d_lock */
+ else
+ spin_unlock(&dentry->d_lock);
if (inode)
spin_unlock(&inode->i_lock);
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 9e23d33bb6f1..b877c9ca212f 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,7 @@ struct dentry_operations {
#define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
#define DCACHE_DENTRY_CURSOR 0x20000000
#define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE 0x80000000 /* All following siblings are negative */
extern seqlock_t rename_lock;
@@ -499,6 +500,11 @@ static inline int simple_positive(const struct dentry *dentry)
return d_really_is_positive(dentry) && !d_unhashed(dentry);
}
+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+ return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
extern void d_set_fallthru(struct dentry *dentry);
static inline bool d_is_fallthru(const struct dentry *dentry)
--
2.30.2
From: Konstantin Khlebnikov <[email protected]>
When fsnotify starts or stops listening for events from an inode's
children, it has to update dentry->d_flags of all positive child
dentries. Scanning may take a long time if the directory has a lot of
negative child dentries. Use the new tail negative flag to detect when
the remainder of the children are negative, and skip them. This speeds
up fsnotify/inotify watch creation, and in some extreme cases can avoid
a soft lockup, for example, with 200 million negative dentries in a
single directory:
watchdog: BUG: soft lockup - CPU#20 stuck for 9s! [inotifywait:9528]
CPU: 20 PID: 9528 Comm: inotifywait Kdump: loaded Not tainted 5.16.0-rc4.20211208.el8uek.rc1.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
RIP: 0010:__fsnotify_update_child_dentry_flags+0xad/0x120
Call Trace:
<TASK>
fsnotify_add_mark_locked+0x113/0x160
inotify_new_watch+0x130/0x190
inotify_update_watch+0x11a/0x140
__x64_sys_inotify_add_watch+0xef/0x140
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Co-authored-by: Konstantin Khlebnikov <[email protected]>
Co-authored-by: Gautham Ananthakrishna <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---
fs/notify/fsnotify.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 4034ca566f95..b754cd9efbad 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -127,8 +127,12 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
* original inode) */
spin_lock(&alias->d_lock);
list_for_each_entry(child, &alias->d_subdirs, d_child) {
- if (!child->d_inode)
+ if (!child->d_inode) {
+ /* all remaining children are negative */
+ if (d_is_tail_negative(child))
+ break;
continue;
+ }
spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (watched)
--
2.30.2
From: Konstantin Khlebnikov <[email protected]>
This lets walkers skip the remaining siblings upon seeing
d_is_tail_negative().
Co-authored-by: Konstantin Khlebnikov <[email protected]>
Co-authored-by: Gautham Ananthakrishna <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---
fs/dcache.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 685b91866e59..9083436f5dcb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1366,12 +1366,14 @@ EXPORT_SYMBOL(shrink_dcache_sb);
* @D_WALK_QUIT: quit walk
* @D_WALK_NORETRY: quit when retry is needed
* @D_WALK_SKIP: skip this dentry and its children
+ * @D_WALK_SKIP_SIBLINGS: skip siblings and their children
*/
enum d_walk_ret {
D_WALK_CONTINUE,
D_WALK_QUIT,
D_WALK_NORETRY,
D_WALK_SKIP,
+ D_WALK_SKIP_SIBLINGS,
};
/**
@@ -1402,6 +1404,7 @@ static void d_walk(struct dentry *parent, void *data,
break;
case D_WALK_QUIT:
case D_WALK_SKIP:
+ case D_WALK_SKIP_SIBLINGS:
goto out_unlock;
case D_WALK_NORETRY:
retry = false;
@@ -1433,6 +1436,9 @@ static void d_walk(struct dentry *parent, void *data,
case D_WALK_SKIP:
spin_unlock(&dentry->d_lock);
continue;
+ case D_WALK_SKIP_SIBLINGS:
+ spin_unlock(&dentry->d_lock);
+ goto skip_siblings;
}
if (!list_empty(&dentry->d_subdirs)) {
@@ -1444,6 +1450,7 @@ static void d_walk(struct dentry *parent, void *data,
}
spin_unlock(&dentry->d_lock);
}
+skip_siblings:
/*
* All done at this level ... ascend and resume the search.
*/
--
2.30.2
From: Konstantin Khlebnikov <[email protected]>
Most walkers are interested only in positive dentries.
The changes to the simple_* libfs helpers are mostly cosmetic: a
filesystem shouldn't cache negative dentries unless it uses a d_delete
op other than always_delete_dentry().
Co-authored-by: Konstantin Khlebnikov <[email protected]>
Co-authored-by: Gautham Ananthakrishna <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---
fs/dcache.c | 9 +++++++++
fs/libfs.c | 3 +++
2 files changed, 12 insertions(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 9083436f5dcb..85f33563936b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1505,6 +1505,8 @@ static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
struct check_mount *info = data;
struct path path = { .mnt = info->mnt, .dentry = dentry };
+ if (d_is_tail_negative(dentry))
+ return D_WALK_SKIP_SIBLINGS;
if (likely(!d_mountpoint(dentry)))
return D_WALK_CONTINUE;
if (__path_is_mountpoint(&path)) {
@@ -1751,6 +1753,10 @@ void shrink_dcache_for_umount(struct super_block *sb)
static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
{
struct dentry **victim = _data;
+
+ if (d_is_tail_negative(dentry))
+ return D_WALK_SKIP_SIBLINGS;
+
if (d_mountpoint(dentry)) {
__dget_dlock(dentry);
*victim = dentry;
@@ -3231,6 +3237,9 @@ static enum d_walk_ret d_genocide_kill(void *data, struct dentry *dentry)
{
struct dentry *root = data;
if (dentry != root) {
+ if (d_is_tail_negative(dentry))
+ return D_WALK_SKIP_SIBLINGS;
+
if (d_unhashed(dentry) || !dentry->d_inode)
return D_WALK_SKIP;
diff --git a/fs/libfs.c b/fs/libfs.c
index ba7438ab9371..13cb44cf158e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -411,6 +411,9 @@ int simple_empty(struct dentry *dentry)
spin_lock(&dentry->d_lock);
list_for_each_entry(child, &dentry->d_subdirs, d_child) {
+ if (d_is_tail_negative(child))
+ break;
+
spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (simple_positive(child)) {
spin_unlock(&child->d_lock);
--
2.30.2
Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: 5384bd38a814f1eda1565d3657a8960b477f2a32 ("[PATCH 1/4] dcache: sweep cached negative dentries to the end of list of siblings")
url: https://github.com/0day-ci/linux/commits/Stephen-Brennan/Fix-softlockup-when-adding-inotify-watch/20211214-085541
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 136057256686de39cc3a07c2e39ef6bc43003ff6
patch link: https://lore.kernel.org/linux-fsdevel/[email protected]
in testcase: fsmark
version: fsmark-x86_64-698ee57-1_20211207
with following parameters:
iterations: 1x
nr_threads: 1t
disk: 1BRD_32G
fs: xfs
fs2: nfsv4
filesize: 4K
test_size: 4G
sync_method: fsyncBeforeClose
nr_files_per_directory: 1fpd
cpufreq_governor: performance
ucode: 0xb000280
test-description: fsmark is a file system benchmark to test synchronous write workloads, for example, mail server workloads.
test-url: https://sourceforge.net/projects/fsmark/
on test machine: 96 threads 2 sockets Ice Lake with 256G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>
[ 50.611587][ T4039] ------------[ cut here ]------------
[ 50.617465][ T4039] WARNING: CPU: 1 PID: 4039 at fs/nfsd/nfsctl.c:1263 nfsdfs_remove_files+0xd7/0x100 [nfsd]
[ 50.627806][ T4039] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfsd auth_rpcgss xfs dm_mod brd binfmt_misc dax_pmem_compat nd_pmem device_dax nd_btt dax_pmem_core intel_rapl_msr intel_rapl_common btrfs x86_pkg_temp_thermal blake2b_generic intel_powerclamp xor ipmi_ssif raid6_pq zstd_compress libcrc32c sd_mod coretemp sg kvm_intel kvm ast drm_vram_helper drm_ttm_helper irqbypass crct10dif_pclmul crc32_pclmul ttm crc32c_intel ghash_clmulni_intel drm_kms_helper syscopyarea rapl nvme sysfillrect sysimgblt ahci intel_cstate fb_sys_fops libahci nvme_core ioatdma intel_uncore drm t10_pi acpi_ipmi joydev libata dca wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_pad acpi_power_meter ip_tables
[ 50.693283][ T4039] CPU: 1 PID: 4039 Comm: nfsd Not tainted 5.16.0-rc2-00001-g5384bd38a814 #1
[ 50.702314][ T4039] RIP: 0010:nfsdfs_remove_files+0xd7/0x100 [nfsd]
[ 50.709102][ T4039] Code: 83 90 00 00 00 48 8d 93 90 00 00 00 48 89 dd 48 2d 90 00 00 00 49 39 d4 74 1b 48 89 c3 48 8b 45 30 48 85 c0 0f 85 6e ff ff ff <0f> 0b eb d0 ff d2 0f 1f 00 eb 98 5b 5d 41 5c 41 5d 41 5e 41 5f c3
[ 50.729570][ T4039] RSP: 0018:ffa0000021eb3ce8 EFLAGS: 00010246
[ 50.736017][ T4039] RAX: 0000000000000000 RBX: ff1100012a542f00 RCX: ff1100012a542f80
[ 50.744374][ T4039] RDX: ff1100012a543dd0 RSI: ff1100012a542f90 RDI: ff1100012a542dd8
[ 50.752717][ T4039] RBP: ff1100012a543d40 R08: ff1100012a542820 R09: ff1100012a542820
[ 50.761072][ T4039] R10: 0000000000000067 R11: ff11001fff66a404 R12: ff1100012a542820
[ 50.769416][ T4039] R13: ff1100012a542780 R14: 00000000ffffffff R15: 0000000000000000
[ 50.777758][ T4039] FS: 0000000000000000(0000) GS:ff11001fff640000(0000) knlGS:0000000000000000
[ 50.787070][ T4039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.794039][ T4039] CR2: 00007f2f9ff8f060 CR3: 000000407ec0a002 CR4: 0000000000771ee0
[ 50.802395][ T4039] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.810740][ T4039] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.819092][ T4039] PKRU: 55555554
[ 50.823020][ T4039] Call Trace:
[ 50.826673][ T4039] <TASK>
[ 50.829982][ T4039] nfsd_client_rmdir+0x32/0x100 [nfsd]
[ 50.835809][ T4039] free_client+0x9d/0x140 [nfsd]
[ 50.841119][ T4039] __destroy_client+0x1e8/0x240 [nfsd]
[ 50.846934][ T4039] nfsd4_setclientid_confirm+0x21d/0x480 [nfsd]
[ 50.853564][ T4039] nfsd4_proc_compound+0x3ae/0x680 [nfsd]
[ 50.859667][ T4039] nfsd_dispatch+0x146/0x280 [nfsd]
[ 50.865223][ T4039] svc_process_common+0x45a/0x5c0
[ 50.870581][ T4039] ? nfsd_svc+0x340/0x340 [nfsd]
[ 50.875859][ T4039] ? nfsd_shutdown_threads+0xc0/0xc0 [nfsd]
[ 50.882093][ T4039] svc_process+0xb7/0x100
[ 50.886742][ T4039] nfsd+0xf1/0x180 [nfsd]
[ 50.891400][ T4039] kthread+0x157/0x180
[ 50.895792][ T4039] ? set_kthread_struct+0x40/0x40
[ 50.901138][ T4039] ret_from_fork+0x1f/0x30
[ 50.905874][ T4039] </TASK>
[ 50.909218][ T4039] ---[ end trace cb473be7fca24e0f ]---
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang