2020-05-08 12:25:00

by Konstantin Khlebnikov

Subject: [PATCH RFC 0/8] dcache: increase poison resistance

For most filesystems, the result of every negative lookup is cached, and the
content of directories is usually cached too. Production of negative dentries
isn't limited by disk speed, so it's really easy to generate millions of them
if the system has enough memory.

Getting this memory back isn't that easy, because the slab allocator frees
pages only when all related objects are gone, while the dcache shrinker works
in LRU order.

A typical scenario is an idle system where some process periodically creates
temporary files and removes them. After some time, memory will be filled
with negative dentries for these random file names.

Simple lookup of random names also generates negative dentries very fast.
A constant flow of such negative dentries drains all other inactive caches.

Negative dentries are linked into the siblings list along with normal positive
dentries. Some operations walk the dcache tree but look only for positive
dentries: the most important is fsnotify/inotify. Hordes of negative dentries
slow these operations down significantly.

Dentry lookup time is usually unaffected because the hash table grows along
with memory size, unless somebody deliberately crafts hash collisions.

This patch set solves all of these problems:

Move negative dentries to the end of the siblings list so that walkers can
skip them at first sight (patches 3-6).

Keep in the dcache at most three unreferenced negative dentries in a row in
each hash bucket (patches 7-8).

---

Konstantin Khlebnikov (8):
dcache: show count of hash buckets in sysctl fs.dentry-state
selftests: add stress testing tool for dcache
dcache: sweep cached negative dentries to the end of list of siblings
fsnotify: stop walking child dentries if remaining tail is negative
dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
dcache: stop walking siblings if remaining dentries all negative
dcache: push releasing dentry lock into sweep_negative
dcache: prevent flooding with negative dentries


fs/dcache.c | 144 +++++++++++-
fs/libfs.c | 10 +-
fs/notify/fsnotify.c | 6 +-
include/linux/dcache.h | 6 +
tools/testing/selftests/filesystems/Makefile | 1 +
.../selftests/filesystems/dcache_stress.c | 210 ++++++++++++++++++
6 files changed, 370 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c

--
Signature


2020-05-08 12:25:08

by Konstantin Khlebnikov

Subject: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state

The count of buckets is required for estimating the average length of hash
chains. The size of the hash table depends on memory size and is printed only
once at boot.

Let's expose nr_buckets as the sixth number in sysctl fs.dentry-state.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
Documentation/admin-guide/sysctl/fs.rst | 12 ++++++------
fs/dcache.c | 12 ++++++++++--
include/linux/dcache.h | 2 +-
3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..b74df4714ddd 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -66,12 +66,12 @@ dentry-state
From linux/include/linux/dcache.h::

struct dentry_stat_t dentry_stat {
- int nr_dentry;
- int nr_unused;
- int age_limit; /* age in seconds */
- int want_pages; /* pages requested by system */
- int nr_negative; /* # of unused negative dentries */
- int dummy; /* Reserved for future use */
+ long nr_dentry;
+ long nr_unused;
+ long age_limit; /* age in seconds */
+ long want_pages; /* pages requested by system */
+ long nr_negative; /* # of unused negative dentries */
+ long nr_buckets; /* count of dcache hash buckets */
};

Dentries are dynamically allocated and deallocated.
diff --git a/fs/dcache.c b/fs/dcache.c
index b280e07e162b..386f97eaf2ff 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3139,6 +3139,14 @@ static int __init set_dhash_entries(char *str)
}
__setup("dhash_entries=", set_dhash_entries);

+static void __init dcache_init_hash(void)
+{
+ dentry_stat.nr_buckets = 1l << d_hash_shift;
+
+ /* shift to use higher bits of 32 bit hash value */
+ d_hash_shift = 32 - d_hash_shift;
+}
+
static void __init dcache_init_early(void)
{
/* If hashes are distributed across NUMA nodes, defer
@@ -3157,7 +3165,7 @@ static void __init dcache_init_early(void)
NULL,
0,
0);
- d_hash_shift = 32 - d_hash_shift;
+ dcache_init_hash();
}

static void __init dcache_init(void)
@@ -3185,7 +3193,7 @@ static void __init dcache_init(void)
NULL,
0,
0);
- d_hash_shift = 32 - d_hash_shift;
+ dcache_init_hash();
}

/* SLAB cache for __getname() consumers */
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c1488cc84fd9..082b55068e4d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -65,7 +65,7 @@ struct dentry_stat_t {
long age_limit; /* age in seconds */
long want_pages; /* pages requested by system */
long nr_negative; /* # of unused negative dentries */
- long dummy; /* Reserved for future use */
+ long nr_buckets; /* count of dcache hash buckets */
};
extern struct dentry_stat_t dentry_stat;


2020-05-08 12:25:19

by Konstantin Khlebnikov

Subject: [PATCH RFC 2/8] selftests: add stress testing tool for dcache

This tool fills the dcache with negative dentries. Between iterations it
prints statistics and measures the time of an inotify operation, which might
degrade.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
tools/testing/selftests/filesystems/Makefile | 1
.../testing/selftests/filesystems/dcache_stress.c | 210 ++++++++++++++++++++
2 files changed, 211 insertions(+)
create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c

diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 129880fb42d3..6b5e08617d11 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -3,5 +3,6 @@
CFLAGS += -I../../../../usr/include/
TEST_GEN_PROGS := devpts_pts
TEST_GEN_PROGS_EXTENDED := dnotify_test
+TEST_GEN_FILES += dcache_stress

include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/dcache_stress.c b/tools/testing/selftests/filesystems/dcache_stress.c
new file mode 100644
index 000000000000..770e8876629e
--- /dev/null
+++ b/tools/testing/selftests/filesystems/dcache_stress.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/inotify.h>
+#include <sys/stat.h>
+#include <time.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <string.h>
+#include <err.h>
+
+double now(void)
+{
+ struct timespec ts;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts);
+ return ts.tv_sec + ts.tv_nsec * 1e-9;
+}
+
+struct dentry_stat {
+ long nr_dentry;
+ long nr_unused;
+ long age_limit; /* age in seconds */
+ long want_pages; /* pages requested by system */
+ long nr_negative; /* # of unused negative dentries */
+ long nr_buckets; /* count of dcache hash buckets */
+};
+
+void show_dentry_state(void)
+{
+ struct dentry_stat stat;
+ ssize_t len;
+ FILE *f;
+
+ f = fopen("/proc/sys/fs/dentry-state", "r");
+ if (!f)
+ err(2, "open fs.dentry-state");
+
+ if (fscanf(f, "%ld %ld %ld %ld %ld %ld",
+ &stat.nr_dentry,
+ &stat.nr_unused,
+ &stat.age_limit,
+ &stat.want_pages,
+ &stat.nr_negative,
+ &stat.nr_buckets) != 6)
+ err(2, "read fs.dentry-state");
+ fclose(f);
+
+ if (!stat.nr_buckets)
+ stat.nr_buckets = 1 << 20; // for 8Gb ram
+
+ printf("nr_dentry = %ld\t%.1fM\n", stat.nr_dentry, stat.nr_dentry / 1e6);
+ printf("nr_buckets = %ld\t%.1f avg\n", stat.nr_buckets, (double)stat.nr_dentry / stat.nr_buckets);
+ printf("nr_unused = %ld\t%.1f%%\n", stat.nr_unused, stat.nr_unused * 100. / stat.nr_dentry);
+ printf("nr_negative = %ld\t%.1f%%\n\n", stat.nr_negative, stat.nr_negative * 100. / stat.nr_dentry);
+}
+
+void test_inotify(const char *path)
+{
+ double tm;
+ int fd;
+
+ fd = inotify_init1(0);
+
+ tm = now();
+ inotify_add_watch(fd, path, IN_OPEN);
+ tm = now() - tm;
+
+ printf("inotify time: %f seconds\n\n", tm);
+
+ close(fd);
+}
+
+int main(int argc, char **argv)
+{
+ char dir_name[] = "dcache_stress.XXXXXX";
+ char name[4096];
+ char *suffix = name;
+ int nr_iterations = 10;
+ int nr_names = 1 << 20;
+ int iteration, index;
+ int other_dir = -1;
+ int mknod_unlink = 0;
+ int mkdir_chdir = 0;
+ int second_access = 0;
+ long long total_names = 0;
+ double tm;
+ int opt;
+
+ while ((opt = getopt(argc, argv, "i:n:p:o:usdh")) != -1) {
+ switch (opt) {
+ case 'i':
+ nr_iterations = atoi(optarg);
+ break;
+ case 'n':
+ nr_names = atoi(optarg);
+ break;
+ case 'p':
+ strcpy(suffix, optarg);
+ suffix += strlen(suffix);
+ break;
+ case 'o':
+ other_dir = open(optarg, O_RDONLY | O_DIRECTORY);
+ if (other_dir < 0)
+ err(2, "open %s", optarg);
+ break;
+ case 'u':
+ mknod_unlink = 1;
+ break;
+ case 'd':
+ mkdir_chdir = 1;
+ break;
+ case 's':
+ second_access = 1;
+ break;
+ case '?':
+ case 'h':
+ printf("usage: %s [-i <iterations>] [-n <names>] [-p <prefix>] [-o <dir>] [-u] [-s]\n"
+ " -i test iterations, default %d\n"
+ " -n names at each iteration, default %d\n"
+ " -p prefix for names\n"
+ " -o interleave with other dir\n"
+ " -s touch twice\n"
+ " -u mknod-unlink sequence\n"
+ " -d mkdir-chdir sequence (leaves garbage)\n",
+ argv[0], nr_iterations, nr_names);
+ return 1;
+ }
+ }
+
+
+ if (!mkdtemp(dir_name))
+ err(2, "mkdtemp");
+
+ if (chdir(dir_name))
+ err(2, "chdir");
+
+ show_dentry_state();
+
+ if (!mkdir_chdir)
+ test_inotify(".");
+
+ printf("working in temporary directory %s\n\n", dir_name);
+
+ for (iteration = 1; iteration <= nr_iterations; iteration++) {
+
+ printf("start iteration %d, %d names\n", iteration, nr_names);
+
+ tm = now();
+
+ sprintf(suffix, "%08x", iteration);
+
+ for (index = 0; index < nr_names; index++) {
+ sprintf(suffix + 8, "%08x", index);
+
+ if (mknod_unlink) {
+ if (mknod(name, S_IFREG, 0))
+ err(2, "mknod %s", name);
+ if (unlink(name))
+ err(2, "unlink %s", name);
+ } else if (mkdir_chdir) {
+ if (mkdir(name, 0775))
+ err(2, "mkdir %s", name);
+ if (chdir(name))
+ err(2, "chdir %s", name);
+ } else
+ access(name, 0);
+
+ if (second_access)
+ access(name, 0);
+
+ if (other_dir >= 0) {
+ faccessat(other_dir, name, 0, 0);
+ if (second_access)
+ faccessat(other_dir, name, 0, 0);
+ }
+ }
+
+ total_names += nr_names;
+
+ tm = now() - tm;
+ printf("iteration %d complete in %f seconds, total names %lld\n\n", iteration, tm, total_names);
+
+ show_dentry_state();
+
+ if (!mkdir_chdir)
+ test_inotify(".");
+ }
+
+ if (chdir(".."))
+ err(2, "chdir");
+
+ if (mkdir_chdir) {
+ printf("leave temporary directory %s\n", dir_name);
+ return 0;
+ }
+
+ printf("removing temporary directory %s\n", dir_name);
+ tm = now();
+ if (rmdir(dir_name))
+ err(2, "rmdir");
+ tm = now() - tm;
+ printf("remove complete in %f seconds\n\n", tm);
+
+ show_dentry_state();
+
+ return 0;
+}

2020-05-08 12:25:30

by Konstantin Khlebnikov

Subject: [PATCH RFC 6/8] dcache: stop walking siblings if remaining dentries all negative

Most walkers are interested only in positive dentries.

Changes in the simple_* libfs helpers are mostly cosmetic: they shouldn't
cache negative dentries unless a d_delete other than always_delete_dentry()
is used.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/dcache.c | 10 ++++++++++
fs/libfs.c | 10 +++++++++-
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 44c6832d21d6..0fd2e02e507b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1442,6 +1442,9 @@ static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry)
struct check_mount *info = data;
struct path path = { .mnt = info->mnt, .dentry = dentry };

+ if (d_is_tail_negative(dentry))
+ return D_WALK_SKIP_SIBLINGS;
+
if (likely(!d_mountpoint(dentry)))
return D_WALK_CONTINUE;
if (__path_is_mountpoint(&path)) {
@@ -1688,6 +1691,10 @@ void shrink_dcache_for_umount(struct super_block *sb)
static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
{
struct dentry **victim = _data;
+
+ if (d_is_tail_negative(dentry))
+ return D_WALK_SKIP_SIBLINGS;
+
if (d_mountpoint(dentry)) {
__dget_dlock(dentry);
*victim = dentry;
@@ -3159,6 +3166,9 @@ static enum d_walk_ret d_genocide_kill(void *data, struct dentry *dentry)
{
struct dentry *root = data;
if (dentry != root) {
+ if (d_is_tail_negative(dentry))
+ return D_WALK_SKIP_SIBLINGS;
+
if (d_unhashed(dentry) || !dentry->d_inode)
return D_WALK_SKIP;

diff --git a/fs/libfs.c b/fs/libfs.c
index 3759fbacf522..de944c241cf0 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -106,6 +106,10 @@ static struct dentry *scan_positives(struct dentry *cursor,
spin_lock(&dentry->d_lock);
while ((p = p->next) != &dentry->d_subdirs) {
struct dentry *d = list_entry(p, struct dentry, d_child);
+
+ if (d_is_tail_negative(d))
+ break;
+
// we must at least skip cursors, to avoid livelocks
if (d->d_flags & DCACHE_DENTRY_CURSOR)
continue;
@@ -255,7 +259,8 @@ static struct dentry *find_next_child(struct dentry *parent, struct dentry *prev
spin_unlock(&d->d_lock);
if (likely(child))
break;
- }
+ } else if (d_is_tail_negative(d))
+ break;
}
spin_unlock(&parent->d_lock);
dput(prev);
@@ -408,6 +413,9 @@ int simple_empty(struct dentry *dentry)

spin_lock(&dentry->d_lock);
list_for_each_entry(child, &dentry->d_subdirs, d_child) {
+ if (d_is_tail_negative(child))
+ break;
+
spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (simple_positive(child)) {
spin_unlock(&child->d_lock);

2020-05-08 12:25:30

by Konstantin Khlebnikov

Subject: [PATCH RFC 5/8] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()

This lets walkers skip the remaining siblings upon seeing d_is_tail_negative().

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/dcache.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 743255773cc7..44c6832d21d6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1303,12 +1303,14 @@ EXPORT_SYMBOL(shrink_dcache_sb);
* @D_WALK_QUIT: quit walk
* @D_WALK_NORETRY: quit when retry is needed
* @D_WALK_SKIP: skip this dentry and its children
+ * @D_WALK_SKIP_SIBLINGS: skip siblings and their children
*/
enum d_walk_ret {
D_WALK_CONTINUE,
D_WALK_QUIT,
D_WALK_NORETRY,
D_WALK_SKIP,
+ D_WALK_SKIP_SIBLINGS,
};

/**
@@ -1339,6 +1341,7 @@ static void d_walk(struct dentry *parent, void *data,
break;
case D_WALK_QUIT:
case D_WALK_SKIP:
+ case D_WALK_SKIP_SIBLINGS:
goto out_unlock;
case D_WALK_NORETRY:
retry = false;
@@ -1370,6 +1373,9 @@ static void d_walk(struct dentry *parent, void *data,
case D_WALK_SKIP:
spin_unlock(&dentry->d_lock);
continue;
+ case D_WALK_SKIP_SIBLINGS:
+ spin_unlock(&dentry->d_lock);
+ goto skip_siblings;
}

if (!list_empty(&dentry->d_subdirs)) {
@@ -1381,6 +1387,7 @@ static void d_walk(struct dentry *parent, void *data,
}
spin_unlock(&dentry->d_lock);
}
+skip_siblings:
/*
* All done at this level ... ascend and resume the search.
*/

2020-05-08 12:25:58

by Konstantin Khlebnikov

Subject: [PATCH RFC 7/8] dcache: push releasing dentry lock into sweep_negative

This is preparation for the next patch.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/dcache.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0fd2e02e507b..60158065891e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -636,15 +636,17 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
* Move cached negative dentry to the tail of parent->d_subdirs.
* This lets walkers skip them all together at first sight.
* Must be called at dput of negative dentry.
+ * dentry->d_lock must be held, returns with it unlocked.
*/
static void sweep_negative(struct dentry *dentry)
+ __releases(dentry->d_lock)
{
struct dentry *parent;

if (!d_is_tail_negative(dentry)) {
parent = lock_parent(dentry);
if (!parent)
- return;
+ goto out;

if (!d_count(dentry) && d_is_negative(dentry) &&
!d_is_tail_negative(dentry)) {
@@ -654,6 +656,8 @@ static void sweep_negative(struct dentry *dentry)

spin_unlock(&parent->d_lock);
}
+out:
+ spin_unlock(&dentry->d_lock);
}

/*
@@ -747,7 +751,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
spin_unlock(&parent->d_lock);
if (d_is_negative(dentry))
sweep_negative(dentry);
- spin_unlock(&dentry->d_lock);
+ else
+ spin_unlock(&dentry->d_lock);
return NULL;
}

@@ -905,7 +910,8 @@ void dput(struct dentry *dentry)
if (likely(retain_dentry(dentry))) {
if (d_is_negative(dentry))
sweep_negative(dentry);
- spin_unlock(&dentry->d_lock);
+ else
+ spin_unlock(&dentry->d_lock);
return;
}


2020-05-08 12:26:29

by Konstantin Khlebnikov

Subject: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries

Without memory pressure, the count of negative dentries isn't bounded.
They could consume all memory and drain all other inactive caches.

A typical scenario is an idle system where some process periodically creates
temporary files and removes them. After some time, memory will be filled
with negative dentries for these random file names. Reclaiming them takes
some time because slab frees pages only when all related objects are gone.
Dentry lookup time is usually unaffected because the hash table grows along
with memory size, unless somebody deliberately crafts hash collisions.
Simple lookup of random names also generates negative dentries very fast.

This patch implements a heuristic which detects such scenarios and prevents
unbounded growth of completely unneeded negative dentries. It keeps up to
three of the latest negative dentries in each bucket unless they were
referenced.

At the first dput of a negative dentry, when it is swept to the tail of its
siblings, we also clear its reference flag and look at the next dentries in
the chain. Then we kill the third in a series of negative, unused and
unreferenced dentries.

This way each hash bucket will preserve three negative dentries to let them
get referenced and survive. Adding a positive or used dentry into a hash
chain also protects a few recent negative dentries. As a result, the total
size of the dcache is asymptotically limited by the count of buckets and of
positive or used dentries.

Before this patch, the 'dcache_stress' tool could fill the entire memory with
dentries:

nr_dentry = 104913261 104.9M
nr_buckets = 8388608 12.5 avg
nr_unused = 104898729 100.0%
nr_negative = 104883218 100.0%

After this patch, the count of dentries saturates at around 3 per bucket:

nr_dentry = 24619259 24.6M
nr_buckets = 8388608 2.9 avg
nr_unused = 24605226 99.9%
nr_negative = 24600351 99.9%

This heuristic isn't bulletproof and solves only the most practical case.
It's easy to deceive: just touch the same random name twice.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/dcache.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 60158065891e..9f3d331b4978 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
return __lock_parent(dentry);
}

+/*
+ * Called at first dput of each negative dentry.
+ * Prevents filling cache with never reused negative dentries.
+ *
+ * This clears reference and then looks at following dentries in hash chain.
+ * If they are negative, unused and unreferenced then keep two and kill third.
+ */
+static void trim_negative(struct dentry *dentry)
+ __releases(dentry->d_lock)
+{
+ struct dentry *victim, *parent;
+ struct hlist_bl_node *next;
+ int keep = 2;
+
+ rcu_read_lock();
+
+ dentry->d_flags &= ~DCACHE_REFERENCED;
+ spin_unlock(&dentry->d_lock);
+
+ next = rcu_dereference_raw(dentry->d_hash.next);
+ while (1) {
+ victim = hlist_bl_entry(next, struct dentry, d_hash);
+
+ if (!next || d_count(victim) || !d_is_negative(victim) ||
+ (victim->d_flags & DCACHE_REFERENCED)) {
+ rcu_read_unlock();
+ return;
+ }
+
+ if (!keep--)
+ break;
+
+ next = rcu_dereference_raw(next->next);
+ }
+
+ spin_lock(&victim->d_lock);
+ parent = lock_parent(victim);
+
+ rcu_read_unlock();
+
+ if (d_count(victim) || !d_is_negative(victim) ||
+ (victim->d_flags & DCACHE_REFERENCED)) {
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ spin_unlock(&victim->d_lock);
+ return;
+ }
+
+ __dentry_kill(victim);
+ dput(parent);
+}
+
/*
* Move cached negative dentry to the tail of parent->d_subdirs.
* This lets walkers skip them all together at first sight.
@@ -655,6 +707,8 @@ static void sweep_negative(struct dentry *dentry)
}

spin_unlock(&parent->d_lock);
+
+ return trim_negative(dentry);
}
out:
spin_unlock(&dentry->d_lock);

2020-05-08 12:27:24

by Konstantin Khlebnikov

Subject: [PATCH RFC 4/8] fsnotify: stop walking child dentries if remaining tail is negative

When notification starts/stops listening to events from an inode's children,
it has to update dentry->d_flags of all positive child dentries. Scanning
may take a long time if the directory has a lot of negative child dentries.

This is the main beneficiary of sweeping cached negative dentries to the end.

Before patch:

nr_dentry = 24172597 24.2M
nr_buckets = 8388608 2.9 avg
nr_unused = 24158110 99.9%
nr_negative = 24142810 99.9%

inotify time: 0.507182 seconds

After patch:

nr_dentry = 24562747 24.6M
nr_buckets = 8388608 2.9 avg
nr_unused = 24548714 99.9%
nr_negative = 24543867 99.9%

inotify time: 0.000010 seconds

Negative dentries no longer slow down inotify op at parent directory.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/notify/fsnotify.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 72d332ce8e12..072974302950 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -127,8 +127,12 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
* original inode) */
spin_lock(&alias->d_lock);
list_for_each_entry(child, &alias->d_subdirs, d_child) {
- if (!child->d_inode)
+ if (!child->d_inode) {
+ /* all remaining children are negative */
+ if (d_is_tail_negative(child))
+ break;
continue;
+ }

spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (watched)

2020-05-08 12:29:09

by Konstantin Khlebnikov

Subject: [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings

For disk filesystems, the result of every negative lookup is cached, and the
content of directories is usually cached too. Production of negative dentries
isn't limited by disk speed, so it's really easy to generate millions of them
if the system has enough memory. Negative dentries are linked into the
siblings list along with normal positive dentries. Some operations walk the
dcache tree but look only for positive dentries: the most important is
fsnotify/inotify.

This patch moves negative dentries to the end of the list at the final dput()
and marks them with a flag which tells that all following dentries are
negative too. The reverse operation is required before instantiating a
negative dentry.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/dcache.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/dcache.h | 6 +++++
2 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 386f97eaf2ff..743255773cc7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,48 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
return __lock_parent(dentry);
}

+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip them all together at first sight.
+ * Must be called at dput of negative dentry.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+ struct dentry *parent;
+
+ if (!d_is_tail_negative(dentry)) {
+ parent = lock_parent(dentry);
+ if (!parent)
+ return;
+
+ if (!d_count(dentry) && d_is_negative(dentry) &&
+ !d_is_tail_negative(dentry)) {
+ dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+ list_move_tail(&dentry->d_child, &parent->d_subdirs);
+ }
+
+ spin_unlock(&parent->d_lock);
+ }
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+ struct dentry *parent;
+
+ spin_lock(&dentry->d_lock);
+ parent = lock_parent(dentry);
+ if (parent) {
+ list_move(&dentry->d_child, &parent->d_subdirs);
+ spin_unlock(&parent->d_lock);
+ }
+ dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+ spin_unlock(&dentry->d_lock);
+}
+
static inline bool retain_dentry(struct dentry *dentry)
{
WARN_ON(d_in_lookup(dentry));
@@ -703,6 +745,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
spin_unlock(&inode->i_lock);
if (parent)
spin_unlock(&parent->d_lock);
+ if (d_is_negative(dentry))
+ sweep_negative(dentry);
spin_unlock(&dentry->d_lock);
return NULL;
}
@@ -718,7 +762,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
static inline bool fast_dput(struct dentry *dentry)
{
int ret;
- unsigned int d_flags;
+ unsigned int d_flags, required;

/*
* If we have a d_op->d_delete() operation, we sould not
@@ -766,6 +810,8 @@ static inline bool fast_dput(struct dentry *dentry)
* a 'delete' op, and it's referenced and already on
* the LRU list.
*
+ * Cached negative dentry must be swept to the tail.
+ *
* NOTE! Since we aren't locked, these values are
* not "stable". However, it is sufficient that at
* some point after we dropped the reference the
@@ -777,10 +823,15 @@ static inline bool fast_dput(struct dentry *dentry)
*/
smp_rmb();
d_flags = READ_ONCE(dentry->d_flags);
- d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
+
+ required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+ (d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
+ d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
+ DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;

/* Nothing to do? Dropping the reference was all we needed? */
- if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+ if (d_flags == required && !d_unhashed(dentry))
return true;

/*
@@ -852,6 +903,8 @@ void dput(struct dentry *dentry)
rcu_read_unlock();

if (likely(retain_dentry(dentry))) {
+ if (d_is_negative(dentry))
+ sweep_negative(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1951,6 +2004,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
{
BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
if (inode) {
+ if (d_is_tail_negative(entry))
+ recycle_negative(entry);
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
@@ -1970,6 +2025,8 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
BUG_ON(!inode);
lockdep_annotate_inode_mutex_key(inode);
+ if (d_is_tail_negative(entry))
+ recycle_negative(entry);
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 082b55068e4d..1127a394b931 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -217,6 +217,7 @@ struct dentry_operations {
#define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
#define DCACHE_DENTRY_CURSOR 0x20000000
#define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE 0x80000000 /* All following siblings are negative */

extern seqlock_t rename_lock;

@@ -493,6 +494,11 @@ static inline int simple_positive(const struct dentry *dentry)
return d_really_is_positive(dentry) && !d_unhashed(dentry);
}

+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+ return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
extern void d_set_fallthru(struct dentry *dentry);

static inline bool d_is_fallthru(const struct dentry *dentry)

2020-05-08 14:51:32

by Waiman Long

Subject: Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state

On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> Count of buckets is required for estimating average length of hash chains.
> Size of hash table depends on memory size and printed once at boot.
>
> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state

The hash bucket count is a constant determined at boot time. Is there a
need to use up one dentry_stat entry for that? Besides, one can get it by
looking up the kernel dmesg log, e.g.:

[    0.055212] Dentry cache hash table entries: 8388608 (order: 14,
67108864 bytes)

Cheers,
Longman

2020-05-08 14:58:59

by Matthew Wilcox

Subject: Re: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries

On Fri, May 08, 2020 at 03:23:33PM +0300, Konstantin Khlebnikov wrote:
> This patch implements heuristic which detects such scenarios and prevents
> unbounded growth of completely unneeded negative dentries. It keeps up to
> three latest negative dentry in each bucket unless they were referenced.
>
> At first dput of negative dentry when it swept to the tail of siblings
> we'll also clear it's reference flag and look at next dentries in chain.
> Then kill third in series of negative, unused and unreferenced denries.
>
> This way each hash bucket will preserve three negative dentry to let them
> get reference and survive. Adding positive or used dentry into hash chain
> also protects few recent negative dentries. In result total size of dcache
> asymptotically limited by count of buckets and positive or used dentries.
>
> This heuristic isn't bulletproof and solves only most practical case.
> It's easy to deceive: just touch same random name twice.

I'm not sure if that's "easy to deceive" ... My concern with limiting
negative dentries is something like a kernel compilation where there
are many include paths (11 for mm/mmap.c, 9 in general) and there will be
a lot of places where <linux/fs.h> does not exist:

-isystem /usr/lib/gcc/x86_64-linux-gnu/9/include
-I../arch/x86/include
-I./arch/x86/include/generated
-I../include
-I./include
-I../arch/x86/include/uapi
-I./arch/x86/include/generated/uapi
-I../include/uapi
-I./include/generated/uapi
-I ../mm
-I ./mm

So it'd be good to know that kernel compilation times are unaffected by
this patch.

2020-05-08 16:21:15

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state

On 08/05/2020 17.49, Waiman Long wrote:
> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>> Count of buckets is required for estimating average length of hash chains.
>> Size of hash table depends on memory size and printed once at boot.
>>
>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>
> The hash bucket count is a constant determined at boot time. Is there a need to use up one dentry_stat entry for that? Besides one can get
> it by looking up the kernel dmesg log like:
>
> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)

Grepping logs since boot time is the worst API ever.

dentry-state shows the count of dentries in various states.
It's very convenient to show the count of buckets next to it,
because this number defines the overall scale.

>
> Cheers,
> Longman
>

2020-05-08 16:31:35

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries

On 08/05/2020 17.56, Matthew Wilcox wrote:
> On Fri, May 08, 2020 at 03:23:33PM +0300, Konstantin Khlebnikov wrote:
>> This patch implements heuristic which detects such scenarios and prevents
>> unbounded growth of completely unneeded negative dentries. It keeps up to
>> three latest negative dentry in each bucket unless they were referenced.
>>
>> At first dput of negative dentry when it swept to the tail of siblings
>> we'll also clear it's reference flag and look at next dentries in chain.
>> Then kill third in series of negative, unused and unreferenced denries.
>>
>> This way each hash bucket will preserve three negative dentry to let them
>> get reference and survive. Adding positive or used dentry into hash chain
>> also protects few recent negative dentries. In result total size of dcache
>> asymptotically limited by count of buckets and positive or used dentries.
>>
>> This heuristic isn't bulletproof and solves only most practical case.
>> It's easy to deceive: just touch same random name twice.
>
> I'm not sure if that's "easy to deceive" ... My concern with limiting
> negative dentries is something like a kernel compilation where there
> are many (11 for mm/mmap.c, 9 in general) and there will be a lot of
> places where <linux/fs.h> does not exist
>
> -isystem /usr/lib/gcc/x86_64-linux-gnu/9/include
> -I../arch/x86/include
> -I./arch/x86/include/generated
> -I../include
> -I./include
> -I../arch/x86/include/uapi
> -I./arch/x86/include/generated/uapi
> -I../include/uapi
> -I./include/generated/uapi
> -I ../mm
> -I ./mm
>
> So it'd be good to know that kernel compilation times are unaffected by
> this patch.
>

It's very unlikely that this patch set changes anything for compilation,
or for any other scenario with a sane amount and rate of new names appearing.

This trims only dentries which have never been accessed twice.
Keeping 3 dentries per bucket gives high chances that all of them get the
reference bit and stay in cache until the shrinker buries them.

To get a false positive from this heuristic, multiple newly created
negative dentries must hit one bucket in a short period of time,
i.e. at least three hash collisions are required.

2020-05-08 19:07:50

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state

On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
> On 08/05/2020 17.49, Waiman Long wrote:
>> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>>> Count of buckets is required for estimating average length of hash
>>> chains.
>>> Size of hash table depends on memory size and printed once at boot.
>>>
>>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>>
>> The hash bucket count is a constant determined at boot time. Is there
>> a need to use up one dentry_stat entry for that? Besides one can get
>> it by looking up the kernel dmesg log like:
>>
>> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14,
>> 67108864 bytes)
>
> Grepping logs since boot time is a worst API ever.
>
> dentry-state shows count of dentries in various states.
> It's very convenient to show count of buckets next to it,
> because this number defines overall scale.

I am not against using the last free entry for that. My only concern is
that when we want to expose another internal dcache data point via
dentry-state, we will have to add one more number to the array, which can
cause all sorts of compatibility problems. So do we want to use the last
free slot for a constant that can be retrieved from somewhere else?

Cheers,
Longman

2020-05-08 19:41:18

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH RFC 3/8] dcache: sweep cached negative dentries to the end of list of siblings

On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> For disk filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory. Negative dentries are linked into siblings list
> along with normal positive dentries. Some operations walks dcache tree but
> looks only for positive dentries: most important is fsnotify/inotify.
>
> This patch moves negative dentries to the end of list at final dput() and
> marks with flag which tells that all following dentries are negative too.
> Reverse operation is required before instantiating negative dentry.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> fs/dcache.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++--
> include/linux/dcache.h | 6 +++++
> 2 files changed, 66 insertions(+), 3 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 386f97eaf2ff..743255773cc7 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -632,6 +632,48 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
> return __lock_parent(dentry);
> }
>
> +/*
> + * Move cached negative dentry to the tail of parent->d_subdirs.
> + * This lets walkers skip them all together at first sight.
> + * Must be called at dput of negative dentry.
> + */
> +static void sweep_negative(struct dentry *dentry)
> +{
> + struct dentry *parent;
> +
> + if (!d_is_tail_negative(dentry)) {
> + parent = lock_parent(dentry);
> + if (!parent)
> + return;
> +
> + if (!d_count(dentry) && d_is_negative(dentry) &&
> + !d_is_tail_negative(dentry)) {
> + dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> + list_move_tail(&dentry->d_child, &parent->d_subdirs);
> + }
> +
> + spin_unlock(&parent->d_lock);
> + }
> +}
> +
> +/*
> + * Undo sweep_negative() and move to the head of parent->d_subdirs.
> + * Must be called before converting negative dentry into positive.
> + */
> +static void recycle_negative(struct dentry *dentry)
> +{
> + struct dentry *parent;
> +
> + spin_lock(&dentry->d_lock);
> + parent = lock_parent(dentry);
> + if (parent) {
> + list_move(&dentry->d_child, &parent->d_subdirs);
> + spin_unlock(&parent->d_lock);
> + }
> + dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
> + spin_unlock(&dentry->d_lock);
> +}
> +

The names sweep_negative and recycle_negative are not very descriptive of
what the functions do (especially the latter). I am not good at
naming, but some kind of naming scheme that clearly shows one is the
opposite of the other would be better.


> static inline bool retain_dentry(struct dentry *dentry)
> {
> WARN_ON(d_in_lookup(dentry));
> @@ -703,6 +745,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
> spin_unlock(&inode->i_lock);
> if (parent)
> spin_unlock(&parent->d_lock);
> + if (d_is_negative(dentry))
> + sweep_negative(dentry);
We can potentially save an unneeded lock/unlock pair by moving this up
before "if (parent)" and passing in a flag to indicate whether the parent
lock is being held.
> spin_unlock(&dentry->d_lock);
> return NULL;
> }
> @@ -718,7 +762,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
> static inline bool fast_dput(struct dentry *dentry)
> {
> int ret;
> - unsigned int d_flags;
> + unsigned int d_flags, required;
>
> /*
> * If we have a d_op->d_delete() operation, we sould not
> @@ -766,6 +810,8 @@ static inline bool fast_dput(struct dentry *dentry)
> * a 'delete' op, and it's referenced and already on
> * the LRU list.
> *
> + * Cached negative dentry must be swept to the tail.
> + *
> * NOTE! Since we aren't locked, these values are
> * not "stable". However, it is sufficient that at
> * some point after we dropped the reference the
> @@ -777,10 +823,15 @@ static inline bool fast_dput(struct dentry *dentry)
> */
> smp_rmb();
> d_flags = READ_ONCE(dentry->d_flags);
> - d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
> +
> + required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
> + (d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
> +
> + d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
> + DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;
>
> /* Nothing to do? Dropping the reference was all we needed? */
> - if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
> + if (d_flags == required && !d_unhashed(dentry))
> return true;
>
> /*
> @@ -852,6 +903,8 @@ void dput(struct dentry *dentry)
> rcu_read_unlock();
>
> if (likely(retain_dentry(dentry))) {
> + if (d_is_negative(dentry))
> + sweep_negative(dentry);
> spin_unlock(&dentry->d_lock);
> return;
> }
> @@ -1951,6 +2004,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
> {
> BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> if (inode) {
> + if (d_is_tail_negative(entry))
> + recycle_negative(entry);

Would it be better to call recycle_negative() inside __d_instantiate() under
the d_lock? That reduces the number of lock/unlock pairs needed and
eliminates the need to change d_instantiate_new().

Cheers,
Longman

2020-05-08 19:42:06

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state



On 08/05/2020 22.05, Waiman Long wrote:
> On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
>> On 08/05/2020 17.49, Waiman Long wrote:
>>> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>>>> Count of buckets is required for estimating average length of hash chains.
>>>> Size of hash table depends on memory size and printed once at boot.
>>>>
>>>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>>>
>>> The hash bucket count is a constant determined at boot time. Is there a need to use up one dentry_stat entry for that? Besides one can
>>> get it by looking up the kernel dmesg log like:
>>>
>>> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
>>
>> Grepping logs since boot time is a worst API ever.
>>
>> dentry-state shows count of dentries in various states.
>> It's very convenient to show count of buckets next to it,
>> because this number defines overall scale.
>
> I am not against using the last free entry for that. My only concern is when we want to expose another internal dcache data point via
> dentry-state, we will have to add one more number to the array which can cause all sort of compatibility problem. So do we want to use the
> last free slot for a constant that can be retrieved from somewhere else?

I see no problem in adding more numbers to this sysctl,
especially to one so rarely used.
This interface is designed for that.

Also, the fields 'age_limit' and 'want_pages' have been unused since kernel 2.2.0.

>
> Cheers,
> Longman
>

2020-05-08 20:02:24

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state

On 5/8/20 3:38 PM, Konstantin Khlebnikov wrote:
>
>
> On 08/05/2020 22.05, Waiman Long wrote:
>> On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
>>> On 08/05/2020 17.49, Waiman Long wrote:
>>>> On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
>>>>> Count of buckets is required for estimating average length of hash
>>>>> chains.
>>>>> Size of hash table depends on memory size and printed once at boot.
>>>>>
>>>>> Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
>>>>
>>>> The hash bucket count is a constant determined at boot time. Is
>>>> there a need to use up one dentry_stat entry for that? Besides one
>>>> can get it by looking up the kernel dmesg log like:
>>>>
>>>> [    0.055212] Dentry cache hash table entries: 8388608 (order: 14,
>>>> 67108864 bytes)
>>>
>>> Grepping logs since boot time is a worst API ever.
>>>
>>> dentry-state shows count of dentries in various states.
>>> It's very convenient to show count of buckets next to it,
>>> because this number defines overall scale.
>>
>> I am not against using the last free entry for that. My only concern
>> is when we want to expose another internal dcache data point via
>> dentry-state, we will have to add one more number to the array which
>> can cause all sort of compatibility problem. So do we want to use the
>> last free slot for a constant that can be retrieved from somewhere else?
>
> I see no problem in adding more numbers into sysctl.
> Especially into such rarely used.
> This interface is designed for that.
>
> Also fields 'age_limit' and 'want_pages' are unused since kernel 2.2.0

Well, I got rebuked the last time I wanted to reuse one of the
age_limit/want_pages entries for a negative dentry count because of the
potential of breaking some really old applications or tools. Changing
dentry-state to output one more number can potentially break
compatibility too. That is why I am questioning whether it is a good idea
to use up the last free slot.

Cheers,
Longman


2020-05-08 20:07:45

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC 1/8] dcache: show count of hash buckets in sysctl fs.dentry-state

On Fri, May 08, 2020 at 04:00:08PM -0400, Waiman Long wrote:
> On 5/8/20 3:38 PM, Konstantin Khlebnikov wrote:
> > On 08/05/2020 22.05, Waiman Long wrote:
> > > On 5/8/20 12:16 PM, Konstantin Khlebnikov wrote:
> > > > On 08/05/2020 17.49, Waiman Long wrote:
> > > > > On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> > > > > > Count of buckets is required for estimating average
> > > > > > length of hash chains.
> > > > > > Size of hash table depends on memory size and printed once at boot.
> > > > > >
> > > > > > Let's expose nr_buckets as sixth number in sysctl fs.dentry-state
> > > > >
> > > > > The hash bucket count is a constant determined at boot time.
> > > > > Is there a need to use up one dentry_stat entry for that?
> > > > > Besides one can get it by looking up the kernel dmesg log
> > > > > like:
> > > > >
> > > > > [    0.055212] Dentry cache hash table entries: 8388608
> > > > > (order: 14, 67108864 bytes)
> > > >
> > > > Grepping logs since boot time is a worst API ever.
> > > >
> > > > dentry-state shows count of dentries in various states.
> > > > It's very convenient to show count of buckets next to it,
> > > > because this number defines overall scale.
> > >
> > > I am not against using the last free entry for that. My only concern
> > > is when we want to expose another internal dcache data point via
> > > dentry-state, we will have to add one more number to the array which
> > > can cause all sort of compatibility problem. So do we want to use
> > > the last free slot for a constant that can be retrieved from
> > > somewhere else?
> >
> > I see no problem in adding more numbers into sysctl.
> > Especially into such rarely used.
> > This interface is designed for that.
> >
> > Also fields 'age_limit' and 'want_pages' are unused since kernel 2.2.0
>
> Well, I got rebuke the last time I want to reuse one of age_limit/want_pages
> entry for negative dentry count because of the potential of breaking some
> really old applications or tools. Changing dentry-state to output one more
> number can potentially break compatibility too. That is why I am questioning
> if it is a good idea to use up the last free slot.

I'd rather see the nr_buckets exposed some other way, leaving this file
for dynamic state.

2020-05-08 21:12:15

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH RFC 8/8] dcache: prevent flooding with negative dentries

On 5/8/20 8:23 AM, Konstantin Khlebnikov wrote:
> Without memory pressure count of negative dentries isn't bounded.
> They could consume all memory and drain all other inactive caches.
>
> Typical scenario is an idle system where some process periodically creates
> temporary files and removes them. After some time, memory will be filled
> with negative dentries for these random file names. Reclaiming them took
> some time because slab frees pages only when all related objects are gone.
> Time of dentry lookup is usually unaffected because hash table grows along
> with size of memory. Unless somebody especially crafts hash collisions.
> Simple lookup of random names also generates negative dentries very fast.
>
> This patch implements heuristic which detects such scenarios and prevents
> unbounded growth of completely unneeded negative dentries. It keeps up to
> three latest negative dentry in each bucket unless they were referenced.
>
> At first dput of negative dentry when it swept to the tail of siblings
> we'll also clear it's reference flag and look at next dentries in chain.
> Then kill third in series of negative, unused and unreferenced denries.
>
> This way each hash bucket will preserve three negative dentry to let them
> get reference and survive. Adding positive or used dentry into hash chain
> also protects few recent negative dentries. In result total size of dcache
> asymptotically limited by count of buckets and positive or used dentries.
>
> Before patch: tool 'dcache_stress' could fill entire memory with dentries.
>
> nr_dentry = 104913261 104.9M
> nr_buckets = 8388608 12.5 avg
> nr_unused = 104898729 100.0%
> nr_negative = 104883218 100.0%
>
> After this patch count of dentries saturates at around 3 per bucket:
>
> nr_dentry = 24619259 24.6M
> nr_buckets = 8388608 2.9 avg
> nr_unused = 24605226 99.9%
> nr_negative = 24600351 99.9%
>
> This heuristic isn't bulletproof and solves only most practical case.
> It's easy to deceive: just touch same random name twice.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> fs/dcache.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 60158065891e..9f3d331b4978 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -632,6 +632,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
> return __lock_parent(dentry);
> }
>
> +/*
> + * Called at first dput of each negative dentry.
> + * Prevents filling cache with never reused negative dentries.
> + *
> + * This clears reference and then looks at following dentries in hash chain.
> + * If they are negative, unused and unreferenced then keep two and kill third.
> + */
> +static void trim_negative(struct dentry *dentry)
> + __releases(dentry->d_lock)
> +{
> + struct dentry *victim, *parent;
> + struct hlist_bl_node *next;
> + int keep = 2;
> +
> + rcu_read_lock();
> +
> + dentry->d_flags &= ~DCACHE_REFERENCED;
> + spin_unlock(&dentry->d_lock);
> +
> + next = rcu_dereference_raw(dentry->d_hash.next);
> + while (1) {
> + victim = hlist_bl_entry(next, struct dentry, d_hash);
> +
> + if (!next || d_count(victim) || !d_is_negative(victim) ||
> + (victim->d_flags & DCACHE_REFERENCED)) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + if (!keep--)
> + break;
> +
> + next = rcu_dereference_raw(next->next);
> + }
> +
> + spin_lock(&victim->d_lock);
> + parent = lock_parent(victim);
> +
> + rcu_read_unlock();
> +
> + if (d_count(victim) || !d_is_negative(victim) ||
> + (victim->d_flags & DCACHE_REFERENCED)) {
> + if (parent)
> + spin_unlock(&parent->d_lock);
> + spin_unlock(&victim->d_lock);
> + return;
> + }
> +
> + __dentry_kill(victim);
> + dput(parent);
> +}

Since you are picking a victim from the hash list, I think it is better
to kill it only if it is already on the LRU. Otherwise, it could be in
the process of being instantiated or in the middle of some operation.

Besides, I feel a bit uneasy about picking a random negative dentry to
kill like that.

Cheers,
Longman

2020-05-13 01:55:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 2/8] selftests: add stress testing tool for dcache

On Fri, May 08, 2020 at 03:23:17PM +0300, Konstantin Khlebnikov wrote:
> This tool fills dcache with negative dentries. Between iterations it prints
> statistics and measures time of inotify operation which might degrade.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> tools/testing/selftests/filesystems/Makefile | 1
> .../testing/selftests/filesystems/dcache_stress.c | 210 ++++++++++++++++++++

This sort of thing should go into fstests along with test scripts
that use it to exercise the dentry cache. We already have tools like
this in fstests (dirstress, metaperf, etc) for exercising name-based
operations like this, so it would fit right in.

That way it would get run by just about every filesystem developer
and distro QE department automatically and extremely frequently...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-12-09 23:55:51

by Junxiao Bi

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] dcache: increase poison resistance

Hi Konstantin,

We tested this patch set recently and found that it limits negative
dentries to a small part of total memory, and we don't see any performance
regression with it. Do you have any plan to integrate it into mainline? It
will help a lot with the memory fragmentation issue caused by the dentry
slab; there were a lot of customer cases where sys% was very high because
most CPUs were doing memory compaction, and the dentry slab was taking too
much memory with nearly all of its dentries negative.

The following are test results from two types of servers: one with 256G of
memory and 24 CPUs, and another with 3T of memory and 384 CPUs. The test
case uses many processes to generate negative dentries in parallel. Below
are the results after 72 hours; the negative dentry count stays stable
around these numbers even when running longer. Without the patch set,
negative dentries took 197G on the 256G system in less than half an hour,
and 2.4T on the 3T system in one day.

        neg-dentry-number    neg-dentry-mem-usage

256G    55259084             10.6G

3T      202306756            38.8G

For the perf test, we ran the following, and no regression was found:

- create 1M negative dentries and then touch them to convert them to
positive dentries

- create 10K/100K/1M files

- remove 10K/100K/1M files

- kernel compile

To verify the fsnotify fix, we used inotifywait to watch file
create/open in a directory containing a lot of negative dentries.
Without the patch set, the system ran into a soft lockup; with it, no
soft lockup.

We also tried to defeat the limitation by making different processes
generate negative dentries with the same naming scheme, so that each
negative dentry is accessed a couple of times around the same time,
DCACHE_REFERENCED gets set on it, and it can't be trimmed easily.
We did see negative dentries slowly take all the memory on one of
our systems with 120G of memory; on the two systems above, memory
usage increased but stayed a small part of total memory. This looks
OK, since the common negative dentry use case is creating some temp
files and then removing them; it is rare to access the same negative
dentry around the same time.

Thanks,

Junxiao.


On 5/8/20 5:23 AM, Konstantin Khlebnikov wrote:
> For most filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory.
>
> Getting this memory back ins't that easy because slab frees pages only when
> all related objects are gone. While dcache shrinker works in LRU order.
>
> Typical scenario is an idle system where some process periodically creates
> temporary files and removes them. After some time, memory will be filled
> with negative dentries for these random file names.
>
> Simple lookup of random names also generates negative dentries very fast.
> Constant flow of such negative denries drains all other inactive caches.
>
> Negative dentries are linked into siblings list along with normal positive
> dentries. Some operations walks dcache tree but looks only for positive
> dentries: most important is fsnotify/inotify. Hordes of negative dentries
> slow down these operations significantly.
>
> Time of dentry lookup is usually unaffected because hash table grows along
> with size of memory. Unless somebody especially crafts hash collisions.
>
> This patch set solves all of these problems:
>
> Move negative denries to the end of sliblings list, thus walkers could
> skip them at first sight (patches 3-6).
>
> Keep in dcache at most three unreferenced negative denties in row in each
> hash bucket (patches 7-8).
>
> ---
>
> Konstantin Khlebnikov (8):
> dcache: show count of hash buckets in sysctl fs.dentry-state
> selftests: add stress testing tool for dcache
> dcache: sweep cached negative dentries to the end of list of siblings
> fsnotify: stop walking child dentries if remaining tail is negative
> dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
> dcache: stop walking siblings if remaining dentries all negative
> dcache: push releasing dentry lock into sweep_negative
> dcache: prevent flooding with negative dentries
>
>
> fs/dcache.c | 144 +++++++++++-
> fs/libfs.c | 10 +-
> fs/notify/fsnotify.c | 6 +-
> include/linux/dcache.h | 6 +
> tools/testing/selftests/filesystems/Makefile | 1 +
> .../selftests/filesystems/dcache_stress.c | 210 ++++++++++++++++++
> 6 files changed, 370 insertions(+), 7 deletions(-)
> create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c
>
> --
> Signature

2020-12-13 19:09:40

by Junxiao Bi

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] dcache: increase poison resistance

On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:

> On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi <[email protected]> wrote:
>
> Hi Konstantin,
>
> We tested this patch set recently and found it limiting negative
> dentry
> to a small part of total memory. And also we don't see any
> performance
> regression on it. Do you have any plan to integrate it into
> mainline? It
> will help a lot on memory fragmentation issue causing by dentry slab,
> there were a lot of customer cases where sys% was very high since
> most
> cpu were doing memory compaction, dentry slab was taking too much
> memory
> and nearly all dentry there were negative.
>
>
> Right now I don't have any plans for this. I suspect such problems will
> appear much more often since machines are getting bigger.
> So, somebody will take care of it.
We already had a lot of customer cases. It makes no sense to leave so
many negative dentries in the system; they cause memory fragmentation
and bring little benefit.
>
> First part which collects negative dentries at the end list of
> siblings could be
> done in a more obvious way by splitting the list in two.
> But this touches much more code.
That would add a new field to dentry?
>
> Last patch isn't very rigid but does non-trivial changes.
> Probably it's better to call some garbage collector thingy periodically.
> Lru list needs pressure to age and reorder entries properly.

Swap the negative dentry to the head of the hash list when it gets accessed?
Extra ones could easily be trimmed during the swap; is using GC meant to
reduce the perf impact?

Thanks,

Junxiao.

>
> Gc could be off by default or thresholds set very high (50% of ram for
> example).
> Final setup could be left up to owners of large systems, which needs
> fine tuning.

2020-12-15 00:29:27

by Junxiao Bi

[permalink] [raw]
Subject: Re: [PATCH RFC 0/8] dcache: increase poison resistance

On 12/13/20 11:43 PM, Konstantin Khlebnikov wrote:

>
>
> On Sun, Dec 13, 2020 at 9:52 PM Junxiao Bi <[email protected]> wrote:
>
> On 12/11/20 11:32 PM, Konstantin Khlebnikov wrote:
>
> > On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi <[email protected]> wrote:
> >
> >     Hi Konstantin,
> >
> >     We tested this patch set recently and found it limiting negative
> >     dentry
> >     to a small part of total memory. And also we don't see any
> >     performance
> >     regression on it. Do you have any plan to integrate it into
> >     mainline? It
> >     will help a lot on memory fragmentation issue causing by
> dentry slab,
> >     there were a lot of customer cases where sys% was very high
> since
> >     most
> >     cpu were doing memory compaction, dentry slab was taking too
> much
> >     memory
> >     and nearly all dentry there were negative.
> >
> >
> > Right now I don't have any plans for this. I suspect such
> problems will
> > appear much more often since machines are getting bigger.
> > So, somebody will take care of it.
> We already had a lot of customer cases. It made no sense to leave so
> many negative dentry in the system, it caused memory fragmentation
> and
> not much benefit.
>
>
> Dcache could grow so big only if the system lacks of memory pressure.
>
> Simplest solution is a cronjob which provides such pressure by
> creating a sparse file on a disk-based fs and then reading it.
> This should wash away all inactive caches with no IO and zero chance
> of oom.
Sound good, will try.
>
> >
> > First part which collects negative dentries at the end list of
> > siblings could be
> > done in a more obvious way by splitting the list in two.
> > But this touches much more code.
> That would add new field to dentry?
>
>
> Yep. Decision is up to maintainers.
>
> >
> > Last patch isn't very rigid but does non-trivial changes.
> > Probably it's better to call some garbage collector thingy
> periodically.
> > Lru list needs pressure to age and reorder entries properly.
>
> Swap the negative dentry to the head of hash list when it get
> accessed?
> Extra ones can be easily trimmed when swapping, using GC is to reduce
> perf impact?
>
>
> Reclaimer/shrinker scans denties in LRU lists, it's an another list.

Ah, you mean GC to reclaim from the LRU list. I am not sure it could keep
up with the speed at which negative dentries are generated.

Thanks,

Junxiao.

> My patch used order in hash lists is a very unusual way. Don't be
> confused.
>
> There are four lists
> parent - siblings
> hashtable - hashchain
> LRU
> inode - alias
>
>
> Thanks,
>
> Junxioao.
>
> >
> > Gc could be off by default or thresholds set very high (50% of
> ram for
> > example).
> > Final setup could be left up to owners of large systems, which
> needs
> > fine tuning.
>