2020-02-26 16:18:33

by Waiman Long

[permalink] [raw]
Subject: [PATCH 00/11] fs/dcache: Limit # of negative dentries

As there is no limit for negative dentries, it is possible that a sizeable
portion of system memory can be tied up in dentry cache slabs. Dentry slabs
are generally recalimable if the dentries are in the LRUs. Still having
too much memory used up by dentries can be problematic:

1) When a filesystem with too many negative dentries is being unmounted,
the process of draining the dentries associated with the filesystem
can take some time. To users, the system may seem to hang for
a while. The long wait may also cause unexpected timeout error or
other warnings. This can happen when a long running container with
many negative dentries is being destroyed, for instance.

2) Tying up too much memory in unused negative dentries means there
are less memory available for other use. Even though the kernel is
able to reclaim unused dentries when running out of free memory,
it will still introduce additional latency to the application
reducing its performance.

There are two different approaches to limit negative dentries.

1) Global reclaim
Based on the total number of negative dentries as tracked by the
nr_dentry_negative percpu count, a function can be activated to
scan the various LRU lists to trim out excess negative dentries.

2) Local reclaim
By tracking the number of negative dentries under each directory,
we can start the reclaim process if the number exceeds a certain
limit.

The problem with global reclaim is that there are just too many LRU lists
present that may need to be scanned for each filesystem. Especially
problematic is the fact that each memory cgroup can have its own LRU
lists. As memory cgroup can come and go at any time, scanning its LRUs
can be tricky.

Local reclaim does not have this problem. So it is used as the basis
for negative dentry reclaim for this patchset. Accurately tracking the
number of negative dentries in each directory can be costly in term of
performance hit. As a result, this patchset estimates the number of
negative dentries present in a directory by looking at a newly added
children count and an opportunistically stored positive dentry count.

A new sysctl parameter "dentry-dir-max" is introduced which accepts a
value of 0 (default) for no limit or a positive integer 256 and up. Small
dentry-dir-max numbers are forbidden to avoid excessive dentry count
checking which can impact system performance.

The actual negative dentry reclaim is delegated to the system workqueue
to avoid adding excessive latency to normal filesystem operation.

Waiman Long (11):
fs/dcache: Fix incorrect accounting of negative dentries
fs/dcache: Simplify __dentry_kill()
fs/dcache: Add a counter to track number of children
fs/dcache: Add sysctl parameter dentry-dir-max
fs/dcache: Reclaim excessive negative dentries in directories
fs/dcache: directory opportunistically stores # of positive dentries
fs/dcache: Add static key negative_reclaim_enable
fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn()
fs/dcache: Don't allow small values for dentry-dir-max
fs/dcache: Kill off dentry as last resort
fs/dcache: Track # of negative dentries reclaimed & killed

Documentation/admin-guide/sysctl/fs.rst | 18 +
fs/dcache.c | 428 +++++++++++++++++++++++-
include/linux/dcache.h | 18 +-
kernel/sysctl.c | 11 +
4 files changed, 457 insertions(+), 18 deletions(-)

--
2.18.1


2020-02-26 16:18:46

by Waiman Long

[permalink] [raw]
Subject: [PATCH 03/11] fs/dcache: Add a counter to track number of children

Add a new field d_nchildren to struct dentry to track the number of
children in a directory.

Theoretically, we could use reference count (d_lockref.count) as a
proxy for the number of children. Normally the reference count should
be quite close to the number of children. However, when the directory
dentry is heavily contended, the reference count can differ from the
number of children by quite a bit.

The d_nchildren field is updated only when d_lock has already been held,
so the performance cost of this tracking should be negligible.

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 16 ++++++++++++----
include/linux/dcache.h | 7 ++++---
2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a977f9e05840..0ee5aa2c31cf 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -526,6 +526,8 @@ static inline void dentry_unlist(struct dentry *dentry, struct dentry *parent)
if (unlikely(list_empty(&dentry->d_child)))
return;
__list_del_entry(&dentry->d_child);
+ parent->d_nchildren--;
+
/*
* Cursors can move around the list of children. While we'd been
* a normal list member, it didn't matter - ->d_child.next would've
@@ -1738,6 +1740,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
dentry->d_sb = sb;
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
+ dentry->d_nchildren = 0;
INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
@@ -1782,6 +1785,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
__dget_dlock(parent);
dentry->d_parent = parent;
list_add(&dentry->d_child, &parent->d_subdirs);
+ parent->d_nchildren++;
spin_unlock(&parent->d_lock);

return dentry;
@@ -2762,10 +2766,10 @@ static void swap_names(struct dentry *dentry, struct dentry *target)
* Both are internal.
*/
unsigned int i;
- BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
- for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
- swap(((long *) &dentry->d_iname)[i],
- ((long *) &target->d_iname)[i]);
+ BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(int)));
+ for (i = 0; i < DNAME_INLINE_LEN / sizeof(int); i++) {
+ swap(((int *) &dentry->d_iname)[i],
+ ((int *) &target->d_iname)[i]);
}
}
}
@@ -2855,6 +2859,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
dentry->d_parent->d_lockref.count++;
if (dentry != old_parent) /* wasn't IS_ROOT */
WARN_ON(!--old_parent->d_lockref.count);
+
+ /* Adjust d_nchildren */
+ dentry->d_parent->d_nchildren++;
+ old_parent->d_nchildren--;
} else {
target->d_parent = old_parent;
swap_names(dentry, target);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2762ca2508f9..e9e66eb50d1a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -75,12 +75,12 @@ extern struct dentry_stat_t dentry_stat;
* large memory footprint increase).
*/
#ifdef CONFIG_64BIT
-# define DNAME_INLINE_LEN 32 /* 192 bytes */
+# define DNAME_INLINE_LEN 28 /* 192 bytes */
#else
# ifdef CONFIG_SMP
-# define DNAME_INLINE_LEN 36 /* 128 bytes */
+# define DNAME_INLINE_LEN 32 /* 128 bytes */
# else
-# define DNAME_INLINE_LEN 40 /* 128 bytes */
+# define DNAME_INLINE_LEN 36 /* 128 bytes */
# endif
#endif

@@ -96,6 +96,7 @@ struct dentry {
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
+ unsigned int d_nchildren; /* # of children (directory only) */

/* Ref lookup also touches following */
struct lockref d_lockref; /* per-dentry lock and refcount */
--
2.18.1

2020-02-26 16:19:18

by Waiman Long

[permalink] [raw]
Subject: [PATCH 06/11] fs/dcache: directory opportunistically stores # of positive dentries

For directories that contain large number of files (e.g. in thousands),
the number of positive dentries that will never be reclaimed by
the negative dentry reclaiming process may approach or even exceed
"dentry-dir-max". That will unnecessary cause the triggering of the
reclaim process even if there aren't that many negative dentries that
can be reclaimed.

This can impact system performance. One possible way to solve this
problem is to somehow store the number of positive or negative dentries
in the directory dentry itself. The negative dentry count can change
frequently, whereas the positive dentry count is relatively more stable,

Keeping an accurate count of positive or negative dentries can be
costly. Instead, an estimate of the positive dentry count is computed
in the scan loop of the negative dentry reclaim work function.

That computed value is then stored in the trailing end of the d_iname[]
array if there is enough space for it. This value is then used to
estimate the number of negative dentries in the directory to be compare
against the "dentry-dir-max" value.

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 81 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 01c6d7277244..c5364c2ed5d8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1294,6 +1294,60 @@ struct reclaim_dentry
struct dentry *parent_dir;
};

+/*
+ * If there is enough space at the end of d_iname[] of a directory dentry
+ * to hold an integer. The space will be used to hold an estimate of the
+ * number of positive dentries in the directory. That number will be
+ * subtracted from d_nchildren to see if the limit has been exceeded and
+ * the number of excess negative dentries to be reclaimed.
+ */
+struct d_iname_count {
+ unsigned char d_dummy[DNAME_INLINE_LEN - sizeof(int)];
+ unsigned int d_npositive;
+};
+
+static inline bool dentry_has_npositive(struct dentry *dentry)
+{
+ int len = dentry->d_name.len;
+
+ BUILD_BUG_ON(sizeof(struct d_iname_count) != sizeof(dentry->d_iname));
+
+ return (len >= DNAME_INLINE_LEN) ||
+ (len < DNAME_INLINE_LEN - sizeof(int));
+}
+
+static inline unsigned int read_dentry_npositive(struct dentry *dentry)
+{
+ struct d_iname_count *p = (struct d_iname_count *)dentry->d_iname;
+
+ return p->d_npositive;
+}
+
+static inline void set_dentry_npositive(struct dentry *dentry,
+ unsigned int npositive)
+{
+ struct d_iname_count *p = (struct d_iname_count *)dentry->d_iname;
+
+ p->d_npositive = npositive;
+}
+
+/*
+ * Get an estimated negative dentry count
+ */
+static inline unsigned int read_dentry_nnegative(struct dentry *dentry)
+{
+ return dentry->d_nchildren - (dentry_has_npositive(dentry)
+ ? read_dentry_npositive(dentry) : 0);
+}
+
+/*
+ * Initialize d_iname[] to have null bytes at the end of the array.
+ */
+static inline void init_dentry_iname(struct dentry *dentry)
+{
+ set_dentry_npositive(dentry, 0);
+}
+
/*
* Reclaim excess negative dentries in a directory
*/
@@ -1302,11 +1356,11 @@ static void reclaim_negative_dentry(struct dentry *parent,
{
struct dentry *child;
int limit = READ_ONCE(dcache_dentry_dir_max_sysctl);
- int cnt;
+ int cnt, npositive;

lockdep_assert_held(&parent->d_lock);

- cnt = parent->d_nchildren;
+ cnt = read_dentry_nnegative(parent);

/*
* Compute # of negative dentries to be reclaimed
@@ -1316,6 +1370,7 @@ static void reclaim_negative_dentry(struct dentry *parent,
return;
cnt -= limit;
cnt += (limit >> 3);
+ npositive = 0;

/*
* The d_subdirs is treated like a kind of LRU where
@@ -1327,8 +1382,10 @@ static void reclaim_negative_dentry(struct dentry *parent,
/*
* This check is racy and so it may not be accurate.
*/
- if (!d_is_negative(child))
+ if (!d_is_negative(child)) {
+ npositive++;
continue;
+ }

if (!spin_trylock(&child->d_lock))
continue;
@@ -1368,7 +1425,17 @@ static void reclaim_negative_dentry(struct dentry *parent,
list_cut_before(&list, &parent->d_subdirs,
&child->d_child);
list_splice_tail(&list, &parent->d_subdirs);
+
+ /*
+ * Update positive dentry count estimate
+ * Don't allow npositive to decay by more than 1/2.
+ */
+ if (dentry_has_npositive(parent) &&
+ (read_dentry_npositive(parent) > 2 * npositive))
+ npositive = read_dentry_npositive(parent) / 2;
}
+ if (dentry_has_npositive(parent))
+ set_dentry_npositive(parent, npositive);
}

/*
@@ -1430,16 +1497,14 @@ static void negative_reclaim_check(struct dentry *parent)
*/
if (!can_reclaim_dentry(parent->d_flags) ||
(parent->d_flags & DCACHE_RECLAIMING) ||
- (READ_ONCE(parent->d_nchildren) <= limit))
+ (read_dentry_nnegative(parent) <= limit))
goto rcu_unlock_out;

spin_lock(&parent->d_lock);
if (!can_reclaim_dentry(parent->d_flags) ||
(parent->d_flags & DCACHE_RECLAIMING) ||
- (READ_ONCE(parent->d_nchildren) <= limit))
- goto unlock_out;
-
- if (!d_is_dir(parent))
+ (read_dentry_nnegative(parent) <= limit) ||
+ !d_is_dir(parent))
goto unlock_out;

dentry_node = kzalloc(sizeof(*dentry_node), GFP_NOWAIT);
@@ -1921,7 +1986,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
* will still always have a NUL at the end, even if we might
* be overwriting an internal NUL character
*/
- dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
+ init_dentry_iname(dentry);
if (unlikely(!name)) {
name = &slash_name;
dname = dentry->d_iname;
@@ -2991,11 +3056,18 @@ static void swap_names(struct dentry *dentry, struct dentry *target)
}
}
swap(dentry->d_name.hash_len, target->d_name.hash_len);
+
+ if (dentry_has_npositive(dentry))
+ set_dentry_npositive(dentry, 0);
+ if (dentry_has_npositive(target))
+ set_dentry_npositive(target, 0);
}

static void copy_name(struct dentry *dentry, struct dentry *target)
{
struct external_name *old_name = NULL;
+
+ init_dentry_iname(dentry);
if (unlikely(dname_external(dentry)))
old_name = external_name(dentry);
if (unlikely(dname_external(target))) {
--
2.18.1

2020-02-26 16:19:35

by Waiman Long

[permalink] [raw]
Subject: [PATCH 08/11] fs/dcache: Limit dentry reclaim count in negative_reclaim_workfn()

To limit the d_lock hold time of directory dentry in the negative dentry
reclaim process, a quota (currently 64k) is added to limit the amount of
work that the work function can do and hence its execution time. This is
done to minimize impact on other processes that depend on that d_lock or
other work functions in the same work queue from excessive delay.

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 149c0a6c1a6e..0bd5d6974f75 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1374,10 +1374,25 @@ static inline void init_dentry_iname(struct dentry *dentry)
set_dentry_npositive(dentry, 0);
}

+/*
+ * In the pathological case where a large number of negative dentries are
+ * generated in a short time in a given directory, there is a possibility
+ * that negative dentries reclaiming process will have many dentries to
+ * be dispose of. Thus the d_lock lock can be hold for too long impacting
+ * other running processes that need it.
+ *
+ * There is also the consideration that a long runtime will impact other
+ * work functions that need to be run in the same work queue. As a result,
+ * we have to limit the number of dentries that can be reclaimed in each
+ * invocation of the work function.
+ */
+#define MAX_DENTRY_RECLAIM (1 << 16)
+
/*
* Reclaim excess negative dentries in a directory
+ * Return: true if the work function needs to be rescheduled, false otherwise
*/
-static void reclaim_negative_dentry(struct dentry *parent,
+static void reclaim_negative_dentry(struct dentry *parent, int *quota,
struct list_head *dispose)
{
struct dentry *child;
@@ -1394,9 +1409,16 @@ static void reclaim_negative_dentry(struct dentry *parent,
*/
if (cnt <= limit)
return;
+
+ npositive = 0;
cnt -= limit;
cnt += (limit >> 3);
- npositive = 0;
+ if (cnt >= *quota) {
+ cnt = *quota;
+ *quota = 0;
+ } else {
+ *quota -= cnt;
+ }

/*
* The d_subdirs is treated like a kind of LRU where
@@ -1462,6 +1484,8 @@ static void reclaim_negative_dentry(struct dentry *parent,
}
if (dentry_has_npositive(parent))
set_dentry_npositive(parent, npositive);
+
+ *quota += cnt;
}

/*
@@ -1472,6 +1496,7 @@ static void negative_reclaim_workfn(struct work_struct *work)
struct llist_node *nodes, *next;
struct dentry *parent;
struct reclaim_dentry *dentry_node;
+ int quota = MAX_DENTRY_RECLAIM;

/*
* Collect excess negative dentries in dispose list & shrink them.
@@ -1486,10 +1511,10 @@ static void negative_reclaim_workfn(struct work_struct *work)
parent = dentry_node->parent_dir;
spin_lock(&parent->d_lock);

- if (d_is_dir(parent) &&
+ if (d_is_dir(parent) && quota &&
can_reclaim_dentry(parent->d_flags) &&
(parent->d_flags & DCACHE_RECLAIMING))
- reclaim_negative_dentry(parent, &dispose);
+ reclaim_negative_dentry(parent, &quota, &dispose);

if (!list_empty(&dispose)) {
spin_unlock(&parent->d_lock);
--
2.18.1

2020-02-26 16:19:40

by Waiman Long

[permalink] [raw]
Subject: [PATCH 09/11] fs/dcache: Don't allow small values for dentry-dir-max

A small value for "dentry-dir-max", e.g. < 10, will cause excessive
dentry count checking leading to noticeable performance degradation. In
order to make this sysctl parameter more foolproof, we are not going
to allow any positive integer value less than 256.

Signed-off-by: Waiman Long <[email protected]>
---
Documentation/admin-guide/sysctl/fs.rst | 10 +++++-----
fs/dcache.c | 24 +++++++++++++++++++-----
2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 7274a7b34ee4..e09d851f9d42 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -71,11 +71,11 @@ in the directory. No restriction is placed on the number of positive
dentries as it is naturally limited by the number of files in the
directory.

-The default value is 0 which means there is no limit. Any non-negative
-value is allowed. However, internal tracking is done on all dentry
-types. So the value given, if non-zero, should be larger than the
-number of files in a typical large directory in order to reduce the
-tracking overhead.
+The default value is 0 which means there is no limit. Any positive
+integer value not less than 256 is also allowed. However, internal
+tracking is done on all dentry types. So the value given, if non-zero,
+should be larger than the number of files in a typical large directory
+in order to reduce the tracking overhead.


dentry-state
diff --git a/fs/dcache.c b/fs/dcache.c
index 0bd5d6974f75..f470763e7fb8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -129,10 +129,14 @@ static DEFINE_PER_CPU(long, nr_dentry_negative);
*
* This is sysctl parameter "dentry-dir-max" which specifies a limit on
* the maximum number of negative dentries that are allowed under any
- * given directory.
+ * given directory. The allowable value of "dentry-dir-max" is either
+ * 0, which means no limit, or 256 and up. A low value of "dentry-dir-max"
+ * will cause excessive dentry count checking affecting system performance.
*/
-int dcache_dentry_dir_max_sysctl __read_mostly;
+int dcache_dentry_dir_max_sysctl;
EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
+static int negative_dentry_dir_max __read_mostly;
+#define DENTRY_DIR_MAX_MIN 0x100

static LLIST_HEAD(negative_reclaim_list);
static DEFINE_STATIC_KEY_FALSE(negative_reclaim_enable);
@@ -206,6 +210,16 @@ int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
if (!write || ret || (dcache_dentry_dir_max_sysctl == old))
return ret;

+ /*
+ * A non-zero value must be >= DENTRY_DIR_MAX_MIN.
+ */
+ if (dcache_dentry_dir_max_sysctl &&
+ (dcache_dentry_dir_max_sysctl < DENTRY_DIR_MAX_MIN)) {
+ dcache_dentry_dir_max_sysctl = old;
+ return -EINVAL;
+ }
+
+ negative_dentry_dir_max = dcache_dentry_dir_max_sysctl;
if (!old && dcache_dentry_dir_max_sysctl)
static_branch_enable(&negative_reclaim_enable);
else if (old && !dcache_dentry_dir_max_sysctl)
@@ -1396,7 +1410,7 @@ static void reclaim_negative_dentry(struct dentry *parent, int *quota,
struct list_head *dispose)
{
struct dentry *child;
- int limit = READ_ONCE(dcache_dentry_dir_max_sysctl);
+ int limit = READ_ONCE(negative_dentry_dir_max);
int cnt, npositive;

lockdep_assert_held(&parent->d_lock);
@@ -1405,7 +1419,7 @@ static void reclaim_negative_dentry(struct dentry *parent, int *quota,

/*
* Compute # of negative dentries to be reclaimed
- * An extra 1/8 of dcache_dentry_dir_max_sysctl is added.
+ * An extra 1/8 of negative_dentry_dir_max is added.
*/
if (cnt <= limit)
return;
@@ -1537,7 +1551,7 @@ static void negative_reclaim_workfn(struct work_struct *work)
static void negative_reclaim_check(struct dentry *parent)
__releases(rcu)
{
- int limit = dcache_dentry_dir_max_sysctl;
+ int limit = negative_dentry_dir_max;
struct reclaim_dentry *dentry_node;

if (!limit)
--
2.18.1

2020-02-26 16:20:02

by Waiman Long

[permalink] [raw]
Subject: [PATCH 10/11] fs/dcache: Kill off dentry as last resort

In the unlikely case that an out-of-control application is generating
negative dentries faster than what the negative dentry reclaim process
can get rid of, we will have to kill the negative dentry directly as
the last resort.

The current threshold for killing negative dentry is the maximum of 4
times dentry-dir-max and 10,000 within a directory.

On a 32-vcpu VM, a 30-thread parallel negative dentry generation problem
was run. Without this patch, the negative dentry reclaim process was
overwhelmed by the negative dentry generator and the number of negative
dentries kept growing. With this patch applied with a "dentry-dir-max"
of 10,000. The number of negative dentries never went beyond 40,000.

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f470763e7fb8..fe48e00349c9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -140,7 +140,7 @@ static int negative_dentry_dir_max __read_mostly;

static LLIST_HEAD(negative_reclaim_list);
static DEFINE_STATIC_KEY_FALSE(negative_reclaim_enable);
-static void negative_reclaim_check(struct dentry *parent);
+static void negative_reclaim_check(struct dentry *parent, struct dentry *child);
static void negative_reclaim_workfn(struct work_struct *work);
static DECLARE_WORK(negative_reclaim_work, negative_reclaim_workfn);

@@ -927,7 +927,7 @@ void dput(struct dentry *dentry)
*/
if (static_branch_unlikely(&negative_reclaim_enable) &&
neg_parent)
- negative_reclaim_check(neg_parent);
+ negative_reclaim_check(neg_parent, dentry);
return;
}

@@ -1548,10 +1548,12 @@ static void negative_reclaim_workfn(struct work_struct *work)
* Check the parent to see if it has too many negative dentries and queue
* it up for the negative dentry reclaim work function to handle it.
*/
-static void negative_reclaim_check(struct dentry *parent)
+static void negative_reclaim_check(struct dentry *parent, struct dentry *child)
__releases(rcu)
{
int limit = negative_dentry_dir_max;
+ int kill_threshold = max(4 * limit, 10000);
+ int ncnt = read_dentry_nnegative(parent);
struct reclaim_dentry *dentry_node;

if (!limit)
@@ -1560,16 +1562,16 @@ static void negative_reclaim_check(struct dentry *parent)
/*
* These checks are racy before spin_lock().
*/
- if (!can_reclaim_dentry(parent->d_flags) ||
- (parent->d_flags & DCACHE_RECLAIMING) ||
- (read_dentry_nnegative(parent) <= limit))
+ if ((!can_reclaim_dentry(parent->d_flags) ||
+ (parent->d_flags & DCACHE_RECLAIMING) || (ncnt <= limit)) &&
+ (ncnt < kill_threshold))
goto rcu_unlock_out;

spin_lock(&parent->d_lock);
+ ncnt = read_dentry_nnegative(parent);
if (!can_reclaim_dentry(parent->d_flags) ||
(parent->d_flags & DCACHE_RECLAIMING) ||
- (read_dentry_nnegative(parent) <= limit) ||
- !d_is_dir(parent))
+ (ncnt <= limit) || !d_is_dir(parent))
goto unlock_out;

dentry_node = kzalloc(sizeof(*dentry_node), GFP_NOWAIT);
@@ -1592,6 +1594,25 @@ static void negative_reclaim_check(struct dentry *parent)
return;

unlock_out:
+ /*
+ * In the unlikely case that an out-of-control application is
+ * generating negative dentries faster than what the negative
+ * dentry reclaim process can get rid of, we will have to kill
+ * the negative dentry directly as the last resort.
+ *
+ * N.B. __dentry_kill() releases both the parent and child's d_lock.
+ */
+ if (unlikely(ncnt >= kill_threshold)) {
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+ if (can_reclaim_dentry(child->d_flags) &&
+ !child->d_lockref.count && (child->d_parent == parent)) {
+ rcu_read_unlock();
+ __dentry_kill(child);
+ dput(parent);
+ return;
+ }
+ spin_unlock(&child->d_lock);
+ }
spin_unlock(&parent->d_lock);
rcu_unlock_out:
rcu_read_unlock();
--
2.18.1

2020-02-26 16:20:08

by Waiman Long

[permalink] [raw]
Subject: [PATCH 11/11] fs/dcache: Track # of negative dentries reclaimed & killed

The negative dentry reclaim process gave no visible indication that
it was being activated. In order to allow system administrator to
see if it is being activated as expected, two new debugfs variables
"negative_dentry_reclaimed" and "negative_dentry_killed" are now added
to report the total number of negative dentries that have been reclaimed
and killed. These debugfs variables are only added after the negative
dentry reclaim mechanism is activated for the first time.

In reality, the actual number may be slightly less than the reported
number as not all the negative dentries passed to shrink_dentry_list()
and __dentry_kill() can be successfully reclaimed.

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index fe48e00349c9..471b51316506 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -33,6 +33,7 @@
#include <linux/rculist_bl.h>
#include <linux/list_lru.h>
#include <linux/jump_label.h>
+#include <linux/debugfs.h>
#include "internal.h"
#include "mount.h"

@@ -136,6 +137,8 @@ static DEFINE_PER_CPU(long, nr_dentry_negative);
int dcache_dentry_dir_max_sysctl;
EXPORT_SYMBOL_GPL(dcache_dentry_dir_max_sysctl);
static int negative_dentry_dir_max __read_mostly;
+static unsigned long negative_dentry_reclaim_count;
+static atomic_t negative_dentry_kill_count;
#define DENTRY_DIR_MAX_MIN 0x100

static LLIST_HEAD(negative_reclaim_list);
@@ -204,6 +207,7 @@ int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
{
int old = dcache_dentry_dir_max_sysctl;
int ret;
+ static bool debugfs_file_created;

ret = proc_dointvec_minmax(ctl, write, buffer, lenp, ppos);

@@ -219,6 +223,14 @@ int proc_dcache_dentry_dir_max(struct ctl_table *ctl, int write,
return -EINVAL;
}

+ if (!debugfs_file_created) {
+ debugfs_create_ulong("negative_dentry_reclaimed", 0400, NULL,
+ &negative_dentry_reclaim_count);
+ debugfs_create_u32("negative_dentry_killed", 0400, NULL,
+ (u32 *)&negative_dentry_kill_count.counter);
+ debugfs_file_created = true;
+ }
+
negative_dentry_dir_max = dcache_dentry_dir_max_sysctl;
if (!old && dcache_dentry_dir_max_sysctl)
static_branch_enable(&negative_reclaim_enable);
@@ -1542,6 +1554,8 @@ static void negative_reclaim_workfn(struct work_struct *work)
kfree(dentry_node);
cond_resched();
}
+ if (quota < MAX_DENTRY_RECLAIM)
+ negative_dentry_reclaim_count += MAX_DENTRY_RECLAIM - quota;
}

/*
@@ -1609,6 +1623,7 @@ static void negative_reclaim_check(struct dentry *parent, struct dentry *child)
rcu_read_unlock();
__dentry_kill(child);
dput(parent);
+ atomic_inc(&negative_dentry_kill_count);
return;
}
spin_unlock(&child->d_lock);
--
2.18.1

2020-02-26 16:31:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
> value of 0 (default) for no limit or a positive integer 256 and up. Small
> dentry-dir-max numbers are forbidden to avoid excessive dentry count
> checking which can impact system performance.

This is always the wrong approach. A sysctl is just a way of blaming
the sysadmin for us not being very good at programming.

I agree that we need a way to limit the number of negative dentries.
But that limit needs to be dynamic and depend on how the system is being
used, not on how some overworked sysadmin has configured it.

So we need an initial estimate for the number of negative dentries that
we need for good performance. Maybe it's 1000. It doesn't really matter;
it's going to change dynamically.

Then we need a metric to let us know whether it needs to be increased.
Perhaps that's "number of new negative dentries created in the last
second". And we need to decide how much to increase it; maybe it's by
50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
how high the recent rate of negative dentry creation has been.

We also need a metric to let us know whether it needs to be decreased.
I'm reluctant to say that memory pressure should be that metric because
very large systems can let the number of dentries grow in an unbounded
way. Perhaps that metric is "number of hits in the negative dentry
cache in the last ten seconds". Again, we'll need to decide how much
to shrink the target number by.

If the number of negative dentries is at or above the target, then
creating a new negative dentry means evicting an existing negative dentry.
If the number of negative dentries is lower than the target, then we
can just create a new one.

Of course, memory pressure (and shrinking the target number) should
cause negative dentries to be evicted from the old end of the LRU list.
But memory pressure shouldn't cause us to change the target number;
the target number is what we think we need to keep the system running
smoothly.

2020-02-26 19:21:00

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On 2/26/20 11:29 AM, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
>> value of 0 (default) for no limit or a positive integer 256 and up. Small
>> dentry-dir-max numbers are forbidden to avoid excessive dentry count
>> checking which can impact system performance.
> This is always the wrong approach. A sysctl is just a way of blaming
> the sysadmin for us not being very good at programming.
>
> I agree that we need a way to limit the number of negative dentries.
> But that limit needs to be dynamic and depend on how the system is being
> used, not on how some overworked sysadmin has configured it.
>
> So we need an initial estimate for the number of negative dentries that
> we need for good performance. Maybe it's 1000. It doesn't really matter;
> it's going to change dynamically.
>
> Then we need a metric to let us know whether it needs to be increased.
> Perhaps that's "number of new negative dentries created in the last
> second". And we need to decide how much to increase it; maybe it's by
> 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
> how high the recent rate of negative dentry creation has been.
>
> We also need a metric to let us know whether it needs to be decreased.
> I'm reluctant to say that memory pressure should be that metric because
> very large systems can let the number of dentries grow in an unbounded
> way. Perhaps that metric is "number of hits in the negative dentry
> cache in the last ten seconds". Again, we'll need to decide how much
> to shrink the target number by.
>
> If the number of negative dentries is at or above the target, then
> creating a new negative dentry means evicting an existing negative dentry.
> If the number of negative dentries is lower than the target, then we
> can just create a new one.
>
> Of course, memory pressure (and shrinking the target number) should
> cause negative dentries to be evicted from the old end of the LRU list.
> But memory pressure shouldn't cause us to change the target number;
> the target number is what we think we need to keep the system running
> smoothly.
>
Thanks for the quick response.

I agree that auto-tuning so that the system administrator don't have to
worry about it will be the best approach if it is implemented in the
right way. However, it is hard to do it right.

How about letting users specify a cap on the amount of total system
memory allowed for negative dentries like one of my previous patchs.
Internally, there is a predefined minimum and maximum for
dentry-dir-max. We sample the total negative dentry counts periodically
and adjust the dentry-dir-max accordingly.

Specifying a percentage of total system memory is more intuitive than
just specifying a hard number for dentry-dir-max. Still some user input
is required.

What do you think?

Cheers,
Longman

2020-02-26 21:29:09

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 02:19:59PM -0500, Waiman Long wrote:
> On 2/26/20 11:29 AM, Matthew Wilcox wrote:
> > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
> >> value of 0 (default) for no limit or a positive integer 256 and up. Small
> >> dentry-dir-max numbers are forbidden to avoid excessive dentry count
> >> checking which can impact system performance.
> > This is always the wrong approach. A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> >
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> >
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance. Maybe it's 1000. It doesn't really matter;
> > it's going to change dynamically.
> >
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second". And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
> >
> > We also need a metric to let us know whether it needs to be decreased.
> > I'm reluctant to say that memory pressure should be that metric because
> > very large systems can let the number of dentries grow in an unbounded
> > way. Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds". Again, we'll need to decide how much
> > to shrink the target number by.
> >
> > If the number of negative dentries is at or above the target, then
> > creating a new negative dentry means evicting an existing negative dentry.
> > If the number of negative dentries is lower than the target, then we
> > can just create a new one.
> >
> > Of course, memory pressure (and shrinking the target number) should
> > cause negative dentries to be evicted from the old end of the LRU list.
> > But memory pressure shouldn't cause us to change the target number;
> > the target number is what we think we need to keep the system running
> > smoothly.
> >
> Thanks for the quick response.
>
> I agree that auto-tuning so that the system administrator don't have to
> worry about it will be the best approach if it is implemented in the
> right way. However, it is hard to do it right.
>
> How about letting users specify a cap on the amount of total system
> memory allowed for negative dentries like one of my previous patchs.
> Internally, there is a predefined minimum and maximum for
> dentry-dir-max. We sample the total negative dentry counts periodically
> and adjust the dentry-dir-max accordingly.
>
> Specifying a percentage of total system memory is more intuitive than
> just specifying a hard number for dentry-dir-max. Still some user input
> is required.

If you want to base the whole thing on a per-directory target number,
or a percentage of the system memory target (rather than my suggestion
of a total # of negative dentries), that seems reasonable. What's not
reasonable is expecting the sysadmin to be able to either predict the
workload, or react to a changing workload in sufficient time. The system
has to be self-tuning.

Just look how long stale information stays around about how to tune your
Linux system. Here's an article from 2018 suggesting using the 'intr'
option for NFS mounts:
https://kb.netapp.com/app/answers/answer_view/a_id/1004893/~/hard-mount-vs-soft-mount-
I made that a no-op in 2007. Any tunable you add to Linux immediately
becomes a cargo-cult solution to any problem people are having.

2020-02-26 21:29:50

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
>> value of 0 (default) for no limit or a positive integer 256 and up. Small
>> dentry-dir-max numbers are forbidden to avoid excessive dentry count
>> checking which can impact system performance.
>
> This is always the wrong approach. A sysctl is just a way of blaming
> the sysadmin for us not being very good at programming.
>
> I agree that we need a way to limit the number of negative dentries.
> But that limit needs to be dynamic and depend on how the system is being
> used, not on how some overworked sysadmin has configured it.
>
> So we need an initial estimate for the number of negative dentries that
> we need for good performance. Maybe it's 1000. It doesn't really matter;
> it's going to change dynamically.
>
> Then we need a metric to let us know whether it needs to be increased.
> Perhaps that's "number of new negative dentries created in the last
> second". And we need to decide how much to increase it; maybe it's by
> 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
> how high the recent rate of negative dentry creation has been.
>
> We also need a metric to let us know whether it needs to be decreased.
> I'm reluctant to say that memory pressure should be that metric because
> very large systems can let the number of dentries grow in an unbounded
> way. Perhaps that metric is "number of hits in the negative dentry
> cache in the last ten seconds". Again, we'll need to decide how much
> to shrink the target number by.

OK, so now instead of a single tunable parameter we need three, because
these numbers are totally made up and nobody knows the right values. :-)
Defaulting the limit to "disabled/no limit" also has the problem that
99.99% of users won't even know this tunable exists, let alone how to
set it correctly, so they will continue to see these problems, and the
code may as well not exist (i.e. pure overhead), while Waiman has a
better idea today of what would be reasonable defaults.

I definitely agree that a single fixed value will be wrong for every
system except the original developer's. Making the maximum default to
some reasonable fraction of the system size, rather than a fixed value,
is probably best to start. Something like this as a starting point:

/* Allow a reasonable minimum number of negative entries,
* but proportionately more if the directory/dcache is large.
*/
dir_negative_max = max(num_dir_entries / 16, 1024);
total_negative_max = max(totalram_pages / 32, total_dentries / 8);

(Waiman should decide actual values based on where the problem was hit
previously), and include tunables to change the limits for testing.

Ideally there would also be a dir ioctl that allows fetching the current
positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64,
/usr/share/man/man*) to see what these values are. Otherwise there is
no way to determine whether the limits used are any good or not.

Dynamic limits are hard to get right, and incorrect state machines can lead
to wild swings in behaviour due to unexpected feedback. It isn't clear to
me that adjusting the limit based on the current rate of negative dentry
creation even makes sense. If there are a lot of negative entries being
created, that is when you'd want to _stop_ allowing more to be added.

We don't have any limit today, so imposing some large-but-still-reasonable
upper limit on negative entries will catch the runaway negative dcache case
that was the original need of this functionality without adding a lot of
complexity that we may not need at all.

> If the number of negative dentries is at or above the target, then
> creating a new negative dentry means evicting an existing negative dentry.
> If the number of negative dentries is lower than the target, then we
> can just create a new one.
>
> Of course, memory pressure (and shrinking the target number) should
> cause negative dentries to be evicted from the old end of the LRU list.
> But memory pressure shouldn't cause us to change the target number;
> the target number is what we think we need to keep the system running
> smoothly.


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-02-26 21:46:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 02:28:50PM -0700, Andreas Dilger wrote:
> On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <[email protected]> wrote:
> > This is always the wrong approach. A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> >
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> >
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance. Maybe it's 1000. It doesn't really matter;
> > it's going to change dynamically.
> >
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second". And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
> >
> > We also need a metric to let us know whether it needs to be decreased.
> > I'm reluctant to say that memory pressure should be that metric because
> > very large systems can let the number of dentries grow in an unbounded
> > way. Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds". Again, we'll need to decide how much
> > to shrink the target number by.
>
> OK, so now instead of a single tunable parameter we need three, because
> these numbers are totally made up and nobody knows the right values. :-)
> Defaulting the limit to "disabled/no limit" also has the problem that
> 99.99% of users won't even know this tunable exists, let alone how to
> set it correctly, so they will continue to see these problems, and the
> code may as well not exist (i.e. pure overhead), while Waiman has a
> better idea today of what would be reasonable defaults.

I never said "no limit". I just said to start at some fairly random
value and not worry about where you start because it'll correct to where
this system needs it to be. As long as it converges like loadavg does,
it'll be fine. It needs a fairly large "don't change the target" area,
and it needs to react quickly to real changes in a system's workload.

> I definitely agree that a single fixed value will be wrong for every
> system except the original developer's. Making the maximum default to
> some reasonable fraction of the system size, rather than a fixed value,
> is probably best to start. Something like this as a starting point:
>
> /* Allow a reasonable minimum number of negative entries,
> * but proportionately more if the directory/dcache is large.
> */
> dir_negative_max = max(num_dir_entries / 16, 1024);
> total_negative_max = max(totalram_pages / 32, total_dentries / 8);

Those kinds of things are garbage on large machines. With a terabyte
of RAM, you can end up with tens of millions of dentries clogging up
the system. There _is_ an upper limit on the useful number of dentries
to keep around.

> (Waiman should decide actual values based on where the problem was hit
> previously), and include tunables to change the limits for testing.
>
> Ideally there would also be a dir ioctl that allows fetching the current
> positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64,
> /usr/share/man/man*) to see what these values are. Otherwise there is
> no way to determine whether the limits used are any good or not.

It definitely needs to be instrumented for testing, but no, not ioctls.
tracepoints, perhaps.

> Dynamic limits are hard to get right, and incorrect state machines can lead
> to wild swings in behaviour due to unexpected feedback. It isn't clear to
> me that adjusting the limit based on the current rate of negative dentry
> creation even makes sense. If there are a lot of negative entries being
> created, that is when you'd want to _stop_ allowing more to be added.

That doesn't make sense. What you really want to know is "If my dcache
had twice as many entries in it, would that significantly reduce the
thrash of new entries being created". In the page cache, we end up
with a double LRU where once-used entries fall off the list quickly
but twice-or-more used entries get to stay around for a bit longer.
Maybe we could do something like that; keep a victim cache for recently
evicted dentries, and if we get a large hit rate in the victim cache,
expand the size of the primary cache.


2020-02-27 08:07:26

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 01:45:07PM -0800, Matthew Wilcox wrote:
> had twice as many entries in it, would that significantly reduce the
> thrash of new entries being created". In the page cache, we end up
> with a double LRU where once-used entries fall off the list quickly
> but twice-or-more used entries get to stay around for a bit longer.
> Maybe we could do something like that; keep a victim cache for recently
> evicted dentries, and if we get a large hit rate in the victim cache,
> expand the size of the primary cache.

You know, I've been saying exactly the same thing about the inode
LRU in response to people trying to hack behaviour out of the
shrinker that is triggered by the working set getting trashed by
excessive creation of single use inodes (i.e. large scale directory
traversal).

IOWs, both have the same problem with working set retention in the
face of excessive growth pressure.

So, you know, perhaps two caches with the same problem, that use the
same LRU implementation, could solve the same problem by enhancing
the generic LRU code they use to an active/inactive style clocking
LRU like the page LRUs?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-02-27 08:31:47

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> As there is no limit for negative dentries, it is possible that a sizeable
> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
> are generally recalimable if the dentries are in the LRUs. Still having
> too much memory used up by dentries can be problematic:

I don't get it.

Why isn't the solution simply "constrain the application generating
unbound numbers of dentries to a memcg"?

Then when the memcg runs out of memory, it will start reclaiming the
dentries that were allocated inside the memcg that are using all
it's resources, thereby preventing unbound growth of the dentry
cache.

I mean, this sort of resource control is exactly what memcgs are
supposed to be used for and are already used for. I don't see why we
need all this complexity for global dentry resource management when
memcgs should already provide an effective means of managing and
placing bounds on the amount of memory any specific application can
use...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-02-27 09:57:32

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, 2020-02-26 at 14:28 -0700, Andreas Dilger wrote:
> On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <[email protected]>
> wrote:
> > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> > > A new sysctl parameter "dentry-dir-max" is introduced which
> > > accepts a
> > > value of 0 (default) for no limit or a positive integer 256 and
> > > up. Small
> > > dentry-dir-max numbers are forbidden to avoid excessive dentry
> > > count
> > > checking which can impact system performance.
> >
> > This is always the wrong approach. A sysctl is just a way of
> > blaming
> > the sysadmin for us not being very good at programming.
> >
> > I agree that we need a way to limit the number of negative
> > dentries.
> > But that limit needs to be dynamic and depend on how the system is
> > being
> > used, not on how some overworked sysadmin has configured it.
> >
> > So we need an initial estimate for the number of negative dentries
> > that
> > we need for good performance. Maybe it's 1000. It doesn't really
> > matter;
> > it's going to change dynamically.
> >
> > Then we need a metric to let us know whether it needs to be
> > increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second". And we need to decide how much to increase it; maybe it's
> > by
> > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending
> > on
> > how high the recent rate of negative dentry creation has been.
> >
> > We also need a metric to let us know whether it needs to be
> > decreased.
> > I'm reluctant to say that memory pressure should be that metric
> > because
> > very large systems can let the number of dentries grow in an
> > unbounded
> > way. Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds". Again, we'll need to decide how
> > much
> > to shrink the target number by.
>
> OK, so now instead of a single tunable parameter we need three,
> because
> these numbers are totally made up and nobody knows the right values.
> :-)
> Defaulting the limit to "disabled/no limit" also has the problem that
> 99.99% of users won't even know this tunable exists, let alone how to
> set it correctly, so they will continue to see these problems, and
> the
> code may as well not exist (i.e. pure overhead), while Waiman has a
> better idea today of what would be reasonable defaults.

Why have these at all.

Not all file systems even produce negative hashed dentries.

The most beneficial use of them is to improve performance of rapid
fire lookups for non-existent names. Longer lived negative hashed
dentries don't give much benefit at all unless they suddenly have
lots of hits and that would cost a single allocation on the first
lookup if the dentry ttl expired and the dentry discarded.

A ttl (say jiffies) set at appropriate times could be a better
choice all round, no sysctl values at all.

Ian

2020-02-27 19:05:35

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On 2/26/20 8:29 AM, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
>> value of 0 (default) for no limit or a positive integer 256 and up. Small
>> dentry-dir-max numbers are forbidden to avoid excessive dentry count
>> checking which can impact system performance.
>
> This is always the wrong approach. A sysctl is just a way of blaming
> the sysadmin for us not being very good at programming.
>
> I agree that we need a way to limit the number of negative dentries.
> But that limit needs to be dynamic and depend on how the system is being
> used, not on how some overworked sysadmin has configured it.
>
> So we need an initial estimate for the number of negative dentries that
> we need for good performance. Maybe it's 1000. It doesn't really matter;
> it's going to change dynamically.
>
> Then we need a metric to let us know whether it needs to be increased.
> Perhaps that's "number of new negative dentries created in the last
> second". And we need to decide how much to increase it; maybe it's by
> 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
> how high the recent rate of negative dentry creation has been.

There are pitfalls to this approach as well. Consider what libnss
does every time it starts up (via curl in this case)

# cat /proc/sys/fs/dentry-state
3154271 3131421 45 0 2863333 0
# for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done
# cat /proc/sys/fs/dentry-state
3170738 3147844 45 0 2879882 0

voila, 16k more negative dcache entries, thanks to:

https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390

i.e. each time it inits, it will intentionally create up to 10,000 negative
dentries which will never be looked up again. I /think/ the original intent
of this work was to limit such rogue applications, so scaling with use probably
isn't the way to go.

-Eric

2020-02-27 22:40:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Thu, Feb 27, 2020 at 11:04:40AM -0800, Eric Sandeen wrote:
> On 2/26/20 8:29 AM, Matthew Wilcox wrote:
> > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a
> >> value of 0 (default) for no limit or a positive integer 256 and up. Small
> >> dentry-dir-max numbers are forbidden to avoid excessive dentry count
> >> checking which can impact system performance.
> >
> > This is always the wrong approach. A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> >
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> >
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance. Maybe it's 1000. It doesn't really matter;
> > it's going to change dynamically.
> >
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second". And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
>
> There are pitfalls to this approach as well. Consider what libnss
> does every time it starts up (via curl in this case)
>
> # cat /proc/sys/fs/dentry-state
> 3154271 3131421 45 0 2863333 0
> # for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done
> # cat /proc/sys/fs/dentry-state
> 3170738 3147844 45 0 2879882 0
>
> voila, 16k more negative dcache entries, thanks to:
>
> https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390
>
> i.e. each time it inits, it will intentionally create up to 10,000 negative
> dentries which will never be looked up again.

Sandboxing via memcg restricted cgroups means users and/or
applications cannot create unbound numbers of negative dentries, and
that largely solves this problem.

For a system daemons whose environment is controlled by a
systemd unit file, this should be pretty trivial to do, and memcg
directed memory reclaim will control negative dentry buildup.

For short-lived applications, teardown of the cgroup will free
all the negative dentries it created - they don't hang around
forever.

For long lived applications, negative dentries are bound by the
application memcg limits, and buildup will only affect the
applications own performance, not that of the whole system.

IOWs, I'd expect this sort of resource control problem to be solved
at the user, application and/or distro level, not with a huge kernel
hammer.

> I /think/ the original intent of this work was to limit such rogue
> applications, so scaling with use probably isn't the way to go.

The original intent was to prevent problems on old kernels that
supported terabytes of memory but could not use cgroup/memcg
infrastructure to isolate and contain negative dentry growth.
That was a much simpler, targeted negative dentry limiting solution,
not the ... craziness that can be found in this patchset.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-02-28 03:34:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> Not all file systems even produce negative hashed dentries.
>
> The most beneficial use of them is to improve performance of rapid
> fire lookups for non-existent names. Longer lived negative hashed
> dentries don't give much benefit at all unless they suddenly have
> lots of hits and that would cost a single allocation on the first
> lookup if the dentry ttl expired and the dentry discarded.
>
> A ttl (say jiffies) set at appropriate times could be a better
> choice all round, no sysctl values at all.

The canonical argument in favour of negative dentries is to improve
application startup time as every application searches the library path
for the same libraries. Only they don't do that any more:

$ strace -e file cat /dev/null
execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3

So, are they still useful? Or should we, say, keep at most 100 around?

2020-02-28 04:17:16

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > Not all file systems even produce negative hashed dentries.
> >
> > The most beneficial use of them is to improve performance of rapid
> > fire lookups for non-existent names. Longer lived negative hashed
> > dentries don't give much benefit at all unless they suddenly have
> > lots of hits and that would cost a single allocation on the first
> > lookup if the dentry ttl expired and the dentry discarded.
> >
> > A ttl (say jiffies) set at appropriate times could be a better
> > choice all round, no sysctl values at all.
>
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library
> path
> for the same libraries. Only they don't do that any more:
>
> $ strace -e file cat /dev/null
> execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars
> */) = 0
> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or
> directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
> O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
> O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
>
> So, are they still useful? Or should we, say, keep at most 100
> around?

Who knows how old apps will be on distros., ;)

But I don't think it matters.

The VFS will (should) work fine without a minimum negative hashed
dentry count (and hashed since unhashed negative dentries are
summarily executed on final dput()) and a ttl should keep frequently
used ones around long enough to satisfy this sort of thing should it
be needed.

Even the ttl value should be resilient to a large range of values,
just not so much very small ones.

Ian

2020-02-28 04:22:53

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > Not all file systems even produce negative hashed dentries.
> >
> > The most beneficial use of them is to improve performance of rapid
> > fire lookups for non-existent names. Longer lived negative hashed
> > dentries don't give much benefit at all unless they suddenly have
> > lots of hits and that would cost a single allocation on the first
> > lookup if the dentry ttl expired and the dentry discarded.
> >
> > A ttl (say jiffies) set at appropriate times could be a better
> > choice all round, no sysctl values at all.
>
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library path
> for the same libraries. Only they don't do that any more:

Tell that to scripts that keep looking through $PATH for
binaries each time they are run. Tell that to cc(1) looking through
include path, etc.

Ian, autofs is deeply pathological in that respect; that's OK,
since it has very unusual needs, but please don't use it as a model
for anything else - its needs *are* unusual.

2020-02-28 04:38:29

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Fri, 2020-02-28 at 12:16 +0800, Ian Kent wrote:
> On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote:
> > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > > Not all file systems even produce negative hashed dentries.
> > >
> > > The most beneficial use of them is to improve performance of
> > > rapid
> > > fire lookups for non-existent names. Longer lived negative hashed
> > > dentries don't give much benefit at all unless they suddenly have
> > > lots of hits and that would cost a single allocation on the first
> > > lookup if the dentry ttl expired and the dentry discarded.
> > >
> > > A ttl (say jiffies) set at appropriate times could be a better
> > > choice all round, no sysctl values at all.
> >
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library
> > path
> > for the same libraries. Only they don't do that any more:
> >
> > $ strace -e file cat /dev/null
> > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars
> > */) = 0
> > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file
> > or
> > directory)
> > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
> > O_RDONLY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
> > O_RDONLY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
> >
> > So, are they still useful? Or should we, say, keep at most 100
> > around?
>
> Who knows how old apps will be on distros., ;)

Or what admins put in the PATH, I've seen oddness in that
a lot.

>
> But I don't think it matters.

And I don't think I made my answer to the question clear.

I don't think setting a minimum matters but there are other
sources of a possibly significant number of lookups on
paths that don't exist. I've seen evidence recently
(although I suspect unfounded) that systemd can generate
lots of these lookups at times.

And let's not forget that file systems are the primary
source of these and not all create them on lookups.
I may be mistaken, but I think ext4 does not while xfs
definitely does.

The more important metric I think is calculating a sensible
maximum to be pruned to prevent getting bogged down as there
could be times when a lot of these are present. After all this
is meant to be an iterative pruning measure.

Ian

2020-02-28 04:53:18

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Fri, 2020-02-28 at 04:22 +0000, Al Viro wrote:
> On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote:
> > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
> > > Not all file systems even produce negative hashed dentries.
> > >
> > > The most beneficial use of them is to improve performance of
> > > rapid
> > > fire lookups for non-existent names. Longer lived negative hashed
> > > dentries don't give much benefit at all unless they suddenly have
> > > lots of hits and that would cost a single allocation on the first
> > > lookup if the dentry ttl expired and the dentry discarded.
> > >
> > > A ttl (say jiffies) set at appropriate times could be a better
> > > choice all round, no sysctl values at all.
> >
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library
> > path
> > for the same libraries. Only they don't do that any more:
>
> Tell that to scripts that keep looking through $PATH for
> binaries each time they are run. Tell that to cc(1) looking through
> include path, etc.
>
> Ian, autofs is deeply pathological in that respect; that's OK,
> since it has very unusual needs, but please don't use it as a model
> for anything else - its needs *are* unusual.

Ok, but my thoughts aren't based on autofs behaviours.

But it sounds like you don't believe this is a sensible suggestion
and you would know best so ...

Ian

2020-02-28 04:53:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Fri, Feb 28, 2020 at 12:36:09PM +0800, Ian Kent wrote:

> And let's not forget that file systems are the primary
> source of these and not all create them on lookups.
> I may be mistaken, but I think ext4 does not while xfs
> definitely does.

Both ext4 and xfs bloody well *DO* create hashed negative
dentries on lookups. There is a pathological case when
they are trying to be case-insensitive (and in that situation
we are SOL - if somebody insists upon mounting with
-o make-it-suck, that's what they bloody well get).

Casefondling idiocy aside, negative lookups are hashed.
On all normal filesystems. Look for d_splice_alias()
getting passed NULL inode - that's where ->lookup()
instances normally create those.

2020-02-28 15:33:22

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On 2/27/20 10:34 PM, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote:
>> Not all file systems even produce negative hashed dentries.
>>
>> The most beneficial use of them is to improve performance of rapid
>> fire lookups for non-existent names. Longer lived negative hashed
>> dentries don't give much benefit at all unless they suddenly have
>> lots of hits and that would cost a single allocation on the first
>> lookup if the dentry ttl expired and the dentry discarded.
>>
>> A ttl (say jiffies) set at appropriate times could be a better
>> choice all round, no sysctl values at all.
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library path
> for the same libraries. Only they don't do that any more:
>
> $ strace -e file cat /dev/null
> execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0
> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
>
> So, are they still useful? Or should we, say, keep at most 100 around?
>
It is the shell that does the path search, not the command itself.

Cheers,
Longman

2020-02-28 15:40:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Fri, Feb 28, 2020 at 10:32:02AM -0500, Waiman Long wrote:
> On 2/27/20 10:34 PM, Matthew Wilcox wrote:
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library path
^^^^^^^
> > for the same libraries. Only they don't do that any more:
^^^^^^^^^
>
> It is the shell that does the path search, not the command itself.

2020-02-28 15:49:02

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On 2/27/20 3:30 AM, Dave Chinner wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> As there is no limit for negative dentries, it is possible that a sizeable
>> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
>> are generally recalimable if the dentries are in the LRUs. Still having
>> too much memory used up by dentries can be problematic:
> I don't get it.
>
> Why isn't the solution simply "constrain the application generating
> unbound numbers of dentries to a memcg"?
>
> Then when the memcg runs out of memory, it will start reclaiming the
> dentries that were allocated inside the memcg that are using all
> it's resources, thereby preventing unbound growth of the dentry
> cache.
>
> I mean, this sort of resource control is exactly what memcgs are
> supposed to be used for and are already used for. I don't see why we
> need all this complexity for global dentry resource management when
> memcgs should already provide an effective means of managing and
> placing bounds on the amount of memory any specific application can
> use...

Using memcg is one way to limit the damage. The argument that excessive
negative dentries can push out existing memory objects that can be more
useful if left alone still applies. Daemons that run in the root memcg
has no limitation on how much memory that they can use.

There can also be memcgs with high memory limits and long running
applications. memcg is certainly a useful tool in this regards, but it
doesn't solve all the problem.

Cheers,
Longman


2020-02-28 19:33:52

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote:
>
> The canonical argument in favour of negative dentries is to improve
> application startup time as every application searches the library path
> for the same libraries.

The other canonical example is C compilers that need to search for
header files along the include search path:

% strace -o /tmp/st -f gcc -o /tmp/hello /tmp/hello.c -I.. -I../..
% grep open /tmp/st | grep stdio.h | grep ENOENT | wc -l
6

- Ted

2020-03-15 07:16:55

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
> As there is no limit for negative dentries, it is possible that a sizeable
> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
> are generally recalimable if the dentries are in the LRUs. Still having
> too much memory used up by dentries can be problematic:
>
> 1) When a filesystem with too many negative dentries is being unmounted,
> the process of draining the dentries associated with the filesystem
> can take some time. To users, the system may seem to hang for
> a while. The long wait may also cause unexpected timeout error or
> other warnings. This can happen when a long running container with
> many negative dentries is being destroyed, for instance.
>
> 2) Tying up too much memory in unused negative dentries means there
> are less memory available for other use. Even though the kernel is
> able to reclaim unused dentries when running out of free memory,
> it will still introduce additional latency to the application
> reducing its performance.

There's a third problem, which is that having a lot of negative dentries
can clog the hash chains. I tried to quantify this, and found a weird result:

root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m5.402s
user 0m4.361s
sys 0m1.230s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m5.572s
user 0m4.337s
sys 0m1.407s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m5.607s
user 0m4.522s
sys 0m1.342s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m5.599s
user 0m4.472s
sys 0m1.369s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m5.574s
user 0m4.498s
sys 0m1.300s

Pretty consistent system time, between about 1.3 and 1.4 seconds.

root@bobo-kvm:~# grep dentry /proc/slabinfo
dentry 20394 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m5.515s
user 0m4.353s
sys 0m1.359s

At this point, we have 20k dentries allocated.

Now, pollute the dcache with names that don't exist:

root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null
root@bobo-kvm:~# grep dentry /proc/slabinfo
dentry 20605 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0

Huh. We've kept the number of dentries pretty constant. Still, maybe the
bad dentries have pushed out the good ones.

root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m6.644s
user 0m4.921s
sys 0m1.946s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m6.676s
user 0m5.004s
sys 0m1.909s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m6.662s
user 0m4.980s
sys 0m1.916s
root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
real 0m6.714s
user 0m4.973s
sys 0m1.986s

Well, we certainly made it suck. Up to a pretty consistent 1.9-2.0 seconds
of kernel time, or 50% worse. We've also made user time worse, somehow.

Anyhow, I should write a proper C program to measure this. But I thought
I'd share this raw data with you now to demonstrate that dcache pollution
is a real problem today, even on a machine with 2GB.

2020-03-21 10:19:45

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On 15/03/2020 06.46, Matthew Wilcox wrote:
> On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote:
>> As there is no limit for negative dentries, it is possible that a sizeable
>> portion of system memory can be tied up in dentry cache slabs. Dentry slabs
>> are generally recalimable if the dentries are in the LRUs. Still having
>> too much memory used up by dentries can be problematic:
>>
>> 1) When a filesystem with too many negative dentries is being unmounted,
>> the process of draining the dentries associated with the filesystem
>> can take some time. To users, the system may seem to hang for
>> a while. The long wait may also cause unexpected timeout error or
>> other warnings. This can happen when a long running container with
>> many negative dentries is being destroyed, for instance.
>>
>> 2) Tying up too much memory in unused negative dentries means there
>> are less memory available for other use. Even though the kernel is
>> able to reclaim unused dentries when running out of free memory,
>> it will still introduce additional latency to the application
>> reducing its performance.
>
> There's a third problem, which is that having a lot of negative dentries
> can clog the hash chains. I tried to quantify this, and found a weird result:

Yep. I've seen this in the wild. Server hard too much unused memory.

https://lore.kernel.org/lkml/[email protected]/T/#u

---quote---

I've seen problem on large server where horde of negative dentries
slowed down all lookups significantly:

watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [atop:968884] at __d_lookup_rcu+0x6f/0x190

slabtop:

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
85118166 85116916 0% 0.19K 2026623 42 16212984K dentry
16577106 16371723 0% 0.10K 425054 39 1700216K buffer_head
935850 934379 0% 1.05K 31195 30 998240K ext4_inode_cache
663740 654967 0% 0.57K 23705 28 379280K radix_tree_node
399987 380055 0% 0.65K 8163 49 261216K proc_inode_cache
226380 168813 0% 0.19K 5390 42 43120K cred_jar
70345 65721 0% 0.58K 1279 55 40928K inode_cache
105927 43314 0% 0.31K 2077 51 33232K filp
630972 601503 0% 0.04K 6186 102 24744K ext4_extent_status
5848 4269 0% 3.56K 731 8 23392K task_struct
16224 11531 0% 1.00K 507 32 16224K kmalloc-1024
6752 5833 0% 2.00K 422 16 13504K kmalloc-2048
199680 158086 0% 0.06K 3120 64 12480K anon_vma_chain
156128 154751 0% 0.07K 2788 56 11152K Acpi-Operand

Total RAM is 256 GB

These dentries came from temporary files created and deleted by postgres.
But this could be easily reproduced by lookup of non-existent files.

Of course, memory pressure easily washes them away.

Similar problem happened before around proc sysctl entries:
https://lkml.org/lkml/2017/2/10/47

This one does not concentrate in one bucket and needs much more memory.

Looks like dcache needs some kind of background shrinker started
when dcache size or fraction of negative dentries exceeds some threshold.

---end---

> > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m5.402s
> user 0m4.361s
> sys 0m1.230s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m5.572s
> user 0m4.337s
> sys 0m1.407s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m5.607s
> user 0m4.522s
> sys 0m1.342s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m5.599s
> user 0m4.472s
> sys 0m1.369s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m5.574s
> user 0m4.498s
> sys 0m1.300s
>
> Pretty consistent system time, between about 1.3 and 1.4 seconds.
>
> root@bobo-kvm:~# grep dentry /proc/slabinfo
> dentry 20394 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m5.515s
> user 0m4.353s
> sys 0m1.359s
>
> At this point, we have 20k dentries allocated.
>
> Now, pollute the dcache with names that don't exist:
>
> root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null
> root@bobo-kvm:~# grep dentry /proc/slabinfo
> dentry 20605 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0
>
> Huh. We've kept the number of dentries pretty constant. Still, maybe the
> bad dentries have pushed out the good ones.
>
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m6.644s
> user 0m4.921s
> sys 0m1.946s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m6.676s
> user 0m5.004s
> sys 0m1.909s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m6.662s
> user 0m4.980s
> sys 0m1.916s
> root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done
> real 0m6.714s
> user 0m4.973s
> sys 0m1.986s
>
> Well, we certainly made it suck. Up to a pretty consistent 1.9-2.0 seconds
> of kernel time, or 50% worse. We've also made user time worse, somehow.
>
> Anyhow, I should write a proper C program to measure this. But I thought
> I'd share this raw data with you now to demonstrate that dcache pollution
> is a real problem today, even on a machine with 2GB.
>