2023-05-31 10:05:49

by Qi Zheng

Subject: [PATCH 0/8] make unregistration of super_block shrinker faster

From: Qi Zheng <[email protected]>

Hi all,

This patch series aims to make unregistration of the super_block shrinker
faster.

1. Background
=============

The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
details can be seen from the link[1] below.

[1]. https://lore.kernel.org/lkml/[email protected]/

We can just use the following command to reproduce the result:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) before commit f95bdb700bc6b:

stress-ng: info: [11023] dispatching hogs: 9 ramfs
stress-ng: info: [11023] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [11023] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [11023] ramfs 774966 60.00 10.18 169.45 12915.89 4314.26
stress-ng: info: [11023] for a 60.00s run time:
stress-ng: info: [11023] 1920.11s available CPU time
stress-ng: info: [11023] 10.18s user time ( 0.53%)
stress-ng: info: [11023] 169.44s system time ( 8.82%)
stress-ng: info: [11023] 179.62s total time ( 9.35%)
stress-ng: info: [11023] load average: 8.99 2.69 0.93
stress-ng: info: [11023] successful run completed in 60.00s (1 min, 0.00 secs)

2) after commit f95bdb700bc6b:

stress-ng: info: [37676] dispatching hogs: 9 ramfs
stress-ng: info: [37676] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [37676] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [37676] ramfs 168673 60.00 1.61 39.66 2811.08 4087.47
stress-ng: info: [37676] for a 60.10s run time:
stress-ng: info: [37676] 1923.36s available CPU time
stress-ng: info: [37676] 1.60s user time ( 0.08%)
stress-ng: info: [37676] 39.66s system time ( 2.06%)
stress-ng: info: [37676] 41.26s total time ( 2.15%)
stress-ng: info: [37676] load average: 7.69 3.63 2.36
stress-ng: info: [37676] successful run completed in 60.10s (1 min, 0.10 secs)

The root cause is that SRCU must be careful not to check too frequently for
SRCU read-side critical section exits, because such checks are heavyweight.
Paul E. McKenney gave a detailed explanation:

```
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.
```

Therefore, even if no one is currently inside an SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That is why
unregister_shrinker() has become slower.
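
For reference, this is roughly the pattern that commit f95bdb700bc6
introduced (a simplified sketch; locking details and error handling are
elided):

```c
/* Reader side, shrink_slab(): walk the shrinker list under SRCU
 * instead of shrinker_rwsem, so readers never block on a writer.
 */
srcu_idx = srcu_read_lock(&shrinker_srcu);
list_for_each_entry_srcu(shrinker, &shrinker_list, list,
			 srcu_read_lock_held(&shrinker_srcu)) {
	freed += do_shrink_slab(&sc, shrinker, priority);
}
srcu_read_unlock(&shrinker_srcu, srcu_idx);

/* Writer side, unregister_shrinker(): must wait out all SRCU readers
 * before shrinker->nr_deferred can be freed.
 */
list_del_rcu(&shrinker->list);
synchronize_srcu(&shrinker_srcu);	/* <-- the expensive wait */
kfree(shrinker->nr_deferred);
```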

2. Idea
=======

2.1 Use synchronize_srcu_expedited()?
--------------------------------------

The synchronize_srcu_expedited() makes SRCU much more aggressive. If we use it
to replace synchronize_srcu() in unregister_shrinker(), the ops/s returns to
previous levels.
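
The change itself would be a one-liner (sketch):

```c
	/* in unregister_shrinker(), trade wait time for CPU time: */
	synchronize_srcu_expedited(&shrinker_srcu);	/* was synchronize_srcu() */
```

With this change applied, stress-ng reports: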

stress-ng: info: [13159] dispatching hogs: 9 ramfs
stress-ng: info: [13159] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [13159] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [13159] ramfs 710062 60.00 9.63 157.26 11834.18 4254.75
stress-ng: info: [13159] for a 60.00s run time:
stress-ng: info: [13159] 1920.14s available CPU time
stress-ng: info: [13159] 9.62s user time ( 0.50%)
stress-ng: info: [13159] 157.26s system time ( 8.19%)
stress-ng: info: [13159] 166.88s total time ( 8.69%)
stress-ng: info: [13159] load average: 9.49 4.02 1.65
stress-ng: info: [13159] successful run completed in 60.00s (1 min, 0.00 secs)

But because SRCU (Sleepable RCU) is used here, readers are allowed to sleep
inside the read-side critical section, so synchronize_srcu_expedited() may
consume a lot of CPU while it waits. This is not a good choice.

2.2 Move synchronize_srcu() to the asynchronous delayed work
------------------------------------------------------------

Kirill Tkhai proposed a better idea[2] in 2018: move synchronize_srcu() into
asynchronous delayed work, so that it does not affect the user-visible
unregistration speed.

[2]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
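
A minimal sketch of the idea (illustrative only; the actual patches below
embed the finalize step in the super_block's destroy_super_work()):

```c
/* Fast part: runs synchronously on the user-visible path. */
void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
{
	mutex_lock(&shrinker_mutex);
	list_del_rcu(&shrinker->list);	/* no new lookups can find it */
	mutex_unlock(&shrinker_mutex);
}

/* Slow part: runs later from a workqueue, so nobody has to wait for it. */
void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
{
	synchronize_srcu(&shrinker_srcu);	/* wait out all SRCU readers */
	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;
}
```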

After applying his patches ([PATCH RFC 04/10]~[PATCH RFC 10/10], with a few
conflicts resolved), the ops/s is of course back to the previous levels:

stress-ng: info: [11506] setting to a 60 second run per stressor
stress-ng: info: [11506] dispatching hogs: 9 ramfs
stress-ng: info: [11506] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [11506] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [11506] ramfs 829462 60.00 10.81 174.25 13824.14 4482.08
stress-ng: info: [11506] for a 60.00s run time:
stress-ng: info: [11506] 1920.12s available CPU time
stress-ng: info: [11506] 10.81s user time ( 0.56%)
stress-ng: info: [11506] 174.25s system time ( 9.07%)
stress-ng: info: [11506] 185.06s total time ( 9.64%)
stress-ng: info: [11506] load average: 8.96 2.60 0.89
stress-ng: info: [11506] successful run completed in 60.00s (1 min, 0.00 secs)

In order to continue to advance this patch set, I have rebased these patches
onto next-20230525. Any comments and suggestions are welcome.

Note: this patch series covers only the super_block shrinker; any other place
where unregistration is time-critical may be converted following the same
approach.

Thanks,
Qi

Kirill Tkhai (7):
mm: vmscan: split unregister_shrinker()
fs: move list_lru_destroy() to destroy_super_work()
fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
fs: introduce struct super_operations::destroy_super() callback
xfs: introduce xfs_fs_destroy_super()
shmem: implement shmem_destroy_super()
fs: use unregister_shrinker_delayed_{initiate, finalize} for
super_block shrinker

Qi Zheng (1):
mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()

fs/super.c | 32 ++++++++++++++++++--------------
fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
include/linux/fs.h | 6 ++++++
include/linux/shrinker.h | 2 ++
mm/shmem.c | 8 ++++++++
mm/vmscan.c | 26 ++++++++++++++++++++------
6 files changed, 76 insertions(+), 23 deletions(-)

--
2.30.2



2023-05-31 10:13:25

by Qi Zheng

Subject: [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()

From: Qi Zheng <[email protected]>

debugfs_remove_recursive() waits for debugfs_file_put() to return,
so there is no need to call it after synchronize_srcu() just to wait
for the RCU read-side critical sections to exit.

Just move it before synchronize_srcu(); this also makes it convenient
to move the heavy synchronize_srcu() into the delayed work later.

Signed-off-by: Qi Zheng <[email protected]>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeca83e28c9b..a773e97e152e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -818,11 +818,11 @@ void unregister_shrinker(struct shrinker *shrinker)
debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
mutex_unlock(&shrinker_mutex);

+ shrinker_debugfs_remove(debugfs_entry, debugfs_id);
+
atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);

- shrinker_debugfs_remove(debugfs_entry, debugfs_id);
-
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
--
2.30.2


2023-05-31 10:13:53

by Qi Zheng

Subject: [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work()

From: Kirill Tkhai <[email protected]>

The patch makes s_dentry_lru and s_inode_lru be destroyed
later, from the workqueue. This is preparation for splitting
unregister_shrinker(super_block::s_shrink) into two stages,
and for calling the finalize stage from destroy_super_work().

Note that generic filesystem shrinker unregistration is safe
to split in two stages right after this patch, since
super_cache_count() and super_cache_scan() deal with
s_dentry_lru and s_inode_lru only.

But there are two exceptions: XFS and SHMEM, which define
.nr_cached_objects() and .free_cached_objects() callbacks.
These two do not allow us to do the splitting right after
this patch: they touch fs-specific data, which is destroyed
earlier than destroy_super_work(). So we can't yet call
unregister_shrinker_delayed_finalize() from destroy_super_work()
because of them, and the next patches prepare to make this
possible.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
fs/super.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 8d8d68799b34..2ce4c72720f3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -159,6 +159,11 @@ static void destroy_super_work(struct work_struct *work)
destroy_work);
int i;

+ WARN_ON(list_lru_count(&s->s_dentry_lru));
+ WARN_ON(list_lru_count(&s->s_inode_lru));
+ list_lru_destroy(&s->s_dentry_lru);
+ list_lru_destroy(&s->s_inode_lru);
+
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_free_rwsem(&s->s_writers.rw_sem[i]);
kfree(s);
@@ -177,8 +182,6 @@ static void destroy_unused_super(struct super_block *s)
if (!s)
return;
up_write(&s->s_umount);
- list_lru_destroy(&s->s_dentry_lru);
- list_lru_destroy(&s->s_inode_lru);
security_sb_free(s);
put_user_ns(s->s_user_ns);
kfree(s->s_subtype);
@@ -287,8 +290,6 @@ static void __put_super(struct super_block *s)
{
if (!--s->s_count) {
list_del_init(&s->s_list);
- WARN_ON(s->s_dentry_lru.node);
- WARN_ON(s->s_inode_lru.node);
WARN_ON(!list_empty(&s->s_mounts));
security_sb_free(s);
put_user_ns(s->s_user_ns);
@@ -330,14 +331,6 @@ void deactivate_locked_super(struct super_block *s)
unregister_shrinker(&s->s_shrink);
fs->kill_sb(s);

- /*
- * Since list_lru_destroy() may sleep, we cannot call it from
- * put_super(), where we hold the sb_lock. Therefore we destroy
- * the lru lists right now.
- */
- list_lru_destroy(&s->s_dentry_lru);
- list_lru_destroy(&s->s_inode_lru);
-
put_filesystem(fs);
put_super(s);
} else {
--
2.30.2


2023-05-31 10:14:45

by Qi Zheng

Subject: [PATCH 2/8] mm: vmscan: split unregister_shrinker()

From: Kirill Tkhai <[email protected]>

This patch and the next ones in this series aim to make the
time cost of synchronize_srcu() invisible to the user.
The patch splits unregister_shrinker() into two functions:

unregister_shrinker_delayed_initiate()
unregister_shrinker_delayed_finalize()

and shrinker users may arrange for the second of them to be
called asynchronously (e.g., from a workqueue). The next patches
make the superblock shrinker follow this way, so the user-visible
umount() time won't contain delays from synchronize_srcu().
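
For example, a user can arrange the finalize part like this (an
illustrative sketch only; struct foo_fs and its work item are made-up
names, and the superblock conversion later in this series instead calls
the finalize from destroy_super_work()):

```c
/* shrink_unreg_work is assumed to be set up with INIT_WORK() at init time */
static void foo_shrinker_unreg_workfn(struct work_struct *work)
{
	struct foo_fs *foo = container_of(work, struct foo_fs,
					  shrink_unreg_work);

	unregister_shrinker_delayed_finalize(&foo->shrinker);
	kfree(foo);	/* data used by the shrinker callbacks dies only here */
}

/* on the user-visible teardown path: */
unregister_shrinker_delayed_initiate(&foo->shrinker);
schedule_work(&foo->shrink_unreg_work);
```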

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
include/linux/shrinker.h | 2 ++
mm/vmscan.c | 22 ++++++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 224293b2dd06..e9d5a19d83fe 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -102,6 +102,8 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker,
const char *fmt, ...);
extern void unregister_shrinker(struct shrinker *shrinker);
+extern void unregister_shrinker_delayed_initiate(struct shrinker *shrinker);
+extern void unregister_shrinker_delayed_finalize(struct shrinker *shrinker);
extern void free_prealloced_shrinker(struct shrinker *shrinker);
extern void synchronize_shrinkers(void);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a773e97e152e..baf8d2327d70 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -799,10 +799,7 @@ int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
#endif
EXPORT_SYMBOL(register_shrinker);

-/*
- * Remove one
- */
-void unregister_shrinker(struct shrinker *shrinker)
+void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
{
struct dentry *debugfs_entry;
int debugfs_id;
@@ -819,6 +816,13 @@ void unregister_shrinker(struct shrinker *shrinker)
mutex_unlock(&shrinker_mutex);

shrinker_debugfs_remove(debugfs_entry, debugfs_id);
+}
+EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
+
+void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
+{
+ if (!shrinker->nr_deferred)
+ return;

atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);
@@ -826,6 +830,16 @@ void unregister_shrinker(struct shrinker *shrinker)
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
+EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
+
+/*
+ * Remove one
+ */
+void unregister_shrinker(struct shrinker *shrinker)
+{
+ unregister_shrinker_delayed_initiate(shrinker);
+ unregister_shrinker_delayed_finalize(shrinker);
+}
EXPORT_SYMBOL(unregister_shrinker);

/**
--
2.30.2


2023-05-31 10:18:10

by Qi Zheng

Subject: [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()

From: Kirill Tkhai <[email protected]>

This patch prepares the superblock shrinker for delayed unregistering.
It makes super_cache_scan() avoid shrinking inactive superblocks,
using SB_ACTIVE as the indicator. If a superblock is not active,
super_cache_scan() just exits with SHRINK_STOP as the result.

Note that SB_ACTIVE is cleared in generic_shutdown_super(), and this
is done under the write lock of s_umount. super_cache_scan() takes
the read lock of s_umount, so it cannot miss the cleared flag.

The SB_BORN check is added to super_cache_scan() just for uniformity
with super_cache_count(), while super_cache_count() receives the
SB_ACTIVE check just for uniformity with super_cache_scan().

After this patch, super_cache_scan() ignores unregistering
superblocks, so this function is fine with splitting
unregister_shrinker(). The next patches prepare super_cache_count()
to follow the same way.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
fs/super.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 2ce4c72720f3..2ce54561e82e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -79,6 +79,11 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
if (!trylock_super(sb))
return SHRINK_STOP;

+ if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE)) {
+ freed = SHRINK_STOP;
+ goto unlock;
+ }
+
if (sb->s_op->nr_cached_objects)
fs_objects = sb->s_op->nr_cached_objects(sb, sc);

@@ -110,6 +115,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
freed += sb->s_op->free_cached_objects(sb, sc);
}

+unlock:
up_read(&sb->s_umount);
return freed;
}
@@ -136,7 +142,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
* avoid this situation, so do the same here. The memory barrier is
* matched with the one in mount_fs() as we don't hold locks here.
*/
- if (!(sb->s_flags & SB_BORN))
+ if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE))
return 0;
smp_rmb();

--
2.30.2


2023-05-31 10:19:33

by Qi Zheng

Subject: [PATCH 8/8] fs: use unregister_shrinker_delayed_{initiate, finalize} for super_block shrinker

From: Kirill Tkhai <[email protected]>

Previous patches arranged for all the data touched by
super_cache_count() - s_dentry_lru, s_inode_lru and
super_block::s_fs_info - to be destroyed from destroy_super_work().

super_cache_scan() can't be called after SB_ACTIVE is cleared
in generic_shutdown_super().

So it is safe to move the heavy unregister_shrinker_delayed_finalize()
part to delayed work, i.e. it is safe for a parallel do_shrink_slab()
to be executed between unregister_shrinker_delayed_initiate() and
destroy_super_work()->unregister_shrinker_delayed_finalize().

This makes the heavy synchronize_srcu() no longer affect the
user-visible unregistration speed (since it is now executed from
a workqueue).

Any other place where unregistration is time-critical may be
converted following the same approach.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
fs/super.c | 4 +++-
include/linux/fs.h | 5 +++++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 4e9d08224f86..c61efb74fa7f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -165,6 +165,8 @@ static void destroy_super_work(struct work_struct *work)
destroy_work);
int i;

+ unregister_shrinker_delayed_finalize(&s->s_shrink);
+
WARN_ON(list_lru_count(&s->s_dentry_lru));
WARN_ON(list_lru_count(&s->s_inode_lru));
list_lru_destroy(&s->s_dentry_lru);
@@ -337,7 +339,7 @@ void deactivate_locked_super(struct super_block *s)
{
struct file_system_type *fs = s->s_type;
if (atomic_dec_and_test(&s->s_active)) {
- unregister_shrinker(&s->s_shrink);
+ unregister_shrinker_delayed_initiate(&s->s_shrink);
fs->kill_sb(s);

put_filesystem(fs);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 30b46d0facfc..869dd8de91a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1929,6 +1929,11 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
struct dquot **(*get_dquots)(struct inode *);
#endif
+ /*
+ * The shrinker may still call these two functions while the super_block
+ * is being destroyed, until unregister_shrinker_delayed_finalize() has
+ * completed in destroy_super_work(), so they must handle that.
+ */
long (*nr_cached_objects)(struct super_block *,
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
--
2.30.2


2023-05-31 10:21:37

by Qi Zheng

Subject: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

From: Kirill Tkhai <[email protected]>

xfs_fs_nr_cached_objects() touches sb->s_fs_info,
and this patch makes it be destroyed later.

After this patch, xfs_fs_nr_cached_objects() is safe
with respect to splitting unregister_shrinker():
mp->m_perag_tree is stable until destroy_super_work(),
while iteration over it is already RCU-protected by
internal XFS logic.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 7e706255f165..694616524c76 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -743,11 +743,18 @@ xfs_fs_drop_inode(
}

static void
-xfs_mount_free(
+xfs_free_names(
struct xfs_mount *mp)
{
kfree(mp->m_rtname);
kfree(mp->m_logname);
+}
+
+static void
+xfs_mount_free(
+ struct xfs_mount *mp)
+{
+ xfs_free_names(mp);
kmem_free(mp);
}

@@ -1136,8 +1143,19 @@ xfs_fs_put_super(
xfs_destroy_mount_workqueues(mp);
xfs_close_devices(mp);

- sb->s_fs_info = NULL;
- xfs_mount_free(mp);
+ xfs_free_names(mp);
+}
+
+static void
+xfs_fs_destroy_super(
+ struct super_block *sb)
+{
+ if (sb->s_fs_info) {
+ struct xfs_mount *mp = XFS_M(sb);
+
+ kmem_free(mp);
+ sb->s_fs_info = NULL;
+ }
}

static long
@@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
.dirty_inode = xfs_fs_dirty_inode,
.drop_inode = xfs_fs_drop_inode,
.put_super = xfs_fs_put_super,
+ .destroy_super = xfs_fs_destroy_super,
.sync_fs = xfs_fs_sync_fs,
.freeze_fs = xfs_fs_freeze,
.unfreeze_fs = xfs_fs_unfreeze,
--
2.30.2


2023-05-31 10:21:52

by Qi Zheng

Subject: [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback

From: Kirill Tkhai <[email protected]>

The patch introduces a new callback, which will be called
asynchronously from delayed work.

This will allow the next patches to make ::nr_cached_objects()
safe to call on a superblock that is being destroyed, and to
split unregister_shrinker() into two primitives.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
fs/super.c | 3 +++
include/linux/fs.h | 1 +
2 files changed, 4 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 2ce54561e82e..4e9d08224f86 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -170,6 +170,9 @@ static void destroy_super_work(struct work_struct *work)
list_lru_destroy(&s->s_dentry_lru);
list_lru_destroy(&s->s_inode_lru);

+ if (s->s_op->destroy_super)
+ s->s_op->destroy_super(s);
+
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_free_rwsem(&s->s_writers.rw_sem[i]);
kfree(s);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b54ac1d331b..30b46d0facfc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1910,6 +1910,7 @@ struct super_operations {
int (*drop_inode) (struct inode *);
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
+ void (*destroy_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_super) (struct super_block *);
int (*freeze_fs) (struct super_block *);
--
2.30.2


2023-05-31 10:23:03

by Qi Zheng

Subject: [PATCH 7/8] shmem: implement shmem_destroy_super()

From: Kirill Tkhai <[email protected]>

Similar to xfs_fs_destroy_super(), implement the method
for shmem.

shmem_unused_huge_count() only touches sb->s_fs_info.
With s_fs_info freed later, it becomes safe to split
unregister_shrinker() (which is done in the next patch).

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
---
mm/shmem.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 71e6c9855770..a4c3fdf23838 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3930,6 +3930,12 @@ static void shmem_put_super(struct super_block *sb)
free_percpu(sbinfo->ino_batch);
percpu_counter_destroy(&sbinfo->used_blocks);
mpol_put(sbinfo->mpol);
+}
+
+static void shmem_destroy_super(struct super_block *sb)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+
kfree(sbinfo);
sb->s_fs_info = NULL;
}
@@ -4018,6 +4024,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)

failed:
shmem_put_super(sb);
+ shmem_destroy_super(sb);
return -ENOMEM;
}

@@ -4182,6 +4189,7 @@ static const struct super_operations shmem_ops = {
.evict_inode = shmem_evict_inode,
.drop_inode = generic_delete_inode,
.put_super = shmem_put_super,
+ .destroy_super = shmem_destroy_super,
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
.nr_cached_objects = shmem_unused_huge_count,
.free_cached_objects = shmem_unused_huge_scan,
--
2.30.2


2023-05-31 10:55:14

by Christian Brauner

Subject: Re: [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()

On Wed, May 31, 2023 at 09:57:35AM +0000, Qi Zheng wrote:
> From: Qi Zheng <[email protected]>
>
> debugfs_remove_recursive() waits for debugfs_file_put() to return,
> so there is no need to call it after synchronize_srcu() just to wait
> for the RCU read-side critical sections to exit.
>
> Just move it before synchronize_srcu(); this also makes it convenient
> to move the heavy synchronize_srcu() into the delayed work later.
>
> Signed-off-by: Qi Zheng <[email protected]>
> ---

Afaict, should be a patch independent of this series.

2023-05-31 11:27:04

by Christian Brauner

Subject: Re: [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback

On Wed, May 31, 2023 at 09:57:39AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <[email protected]>
>
> The patch introduces a new callback, which will be called
> asynchronously from delayed work.
>
> This will allow the next patches to make ::nr_cached_objects()
> safe to call on a superblock that is being destroyed, and to
> split unregister_shrinker() into two primitives.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Qi Zheng <[email protected]>
> ---
> fs/super.c | 3 +++
> include/linux/fs.h | 1 +
> 2 files changed, 4 insertions(+)

Misses updates to
Documentation/filesystems/locking.rst
Documentation/filesystems/vfs.rst

2023-05-31 13:16:47

by Qi Zheng

Subject: Re: [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback



On 2023/5/31 19:19, Christian Brauner wrote:
> On Wed, May 31, 2023 at 09:57:39AM +0000, Qi Zheng wrote:
>> From: Kirill Tkhai <[email protected]>
>>
>> The patch introduces a new callback, which will be called
>> asynchronously from delayed work.
>>
>> This will allow the next patches to make ::nr_cached_objects()
>> safe to call on a superblock that is being destroyed, and to
>> split unregister_shrinker() into two primitives.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> Signed-off-by: Qi Zheng <[email protected]>
>> ---
>> fs/super.c | 3 +++
>> include/linux/fs.h | 1 +
>> 2 files changed, 4 insertions(+)
>
> Misses updates to
> Documentation/filesystems/locking.rst
> Documentation/filesystems/vfs.rst

Will do.

Thanks,
Qi

2023-05-31 13:19:07

by Qi Zheng

Subject: Re: [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()



On 2023/5/31 18:49, Christian Brauner wrote:
> On Wed, May 31, 2023 at 09:57:35AM +0000, Qi Zheng wrote:
>> From: Qi Zheng <[email protected]>
>>
>> debugfs_remove_recursive() waits for debugfs_file_put() to return,
>> so there is no need to call it after synchronize_srcu() just to wait
>> for the RCU read-side critical sections to exit.
>>
>> Just move it before synchronize_srcu(); this also makes it convenient
>> to move the heavy synchronize_srcu() into the delayed work later.
>>
>> Signed-off-by: Qi Zheng <[email protected]>
>> ---
>
> Afaict, should be a patch independent of this series.

OK, will resend as an independent patch.

Thanks,
Qi

2023-05-31 18:52:01

by Andrew Morton

Subject: Re: [PATCH 0/8] make unregistration of super_block shrinker faster

On Wed, 31 May 2023 09:57:34 +0000 Qi Zheng <[email protected]> wrote:

> From: Qi Zheng <[email protected]>
>
> Hi all,
>
> This patch series aims to make unregistration of the super_block shrinker
> faster.
>
> 1. Background
> =============
>
> The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
> on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
> details can be seen from the link[1] below.
>
> [1]. https://lore.kernel.org/lkml/[email protected]/
>
> We can just use the following command to reproduce the result:
>
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>
> 1) before commit f95bdb700bc6b:
>
> stress-ng: info: [11023] dispatching hogs: 9 ramfs
> stress-ng: info: [11023] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
> stress-ng: info: [11023] (secs) (secs) (secs) (real time) (usr+sys time)
> stress-ng: info: [11023] ramfs 774966 60.00 10.18 169.45 12915.89 4314.26
> stress-ng: info: [11023] for a 60.00s run time:
> stress-ng: info: [11023] 1920.11s available CPU time
> stress-ng: info: [11023] 10.18s user time ( 0.53%)
> stress-ng: info: [11023] 169.44s system time ( 8.82%)
> stress-ng: info: [11023] 179.62s total time ( 9.35%)
> stress-ng: info: [11023] load average: 8.99 2.69 0.93
> stress-ng: info: [11023] successful run completed in 60.00s (1 min, 0.00 secs)
>
> 2) after commit f95bdb700bc6b:
>
> stress-ng: info: [37676] dispatching hogs: 9 ramfs
> stress-ng: info: [37676] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
> stress-ng: info: [37676] (secs) (secs) (secs) (real time) (usr+sys time)
> stress-ng: info: [37676] ramfs 168673 60.00 1.61 39.66 2811.08 4087.47
> stress-ng: info: [37676] for a 60.10s run time:
> stress-ng: info: [37676] 1923.36s available CPU time
> stress-ng: info: [37676] 1.60s user time ( 0.08%)
> stress-ng: info: [37676] 39.66s system time ( 2.06%)
> stress-ng: info: [37676] 41.26s total time ( 2.15%)
> stress-ng: info: [37676] load average: 7.69 3.63 2.36
> stress-ng: info: [37676] successful run completed in 60.10s (1 min, 0.10 secs)

Is this comparison reversed? It appears to demonstrate that
f95bdb700bc6b made the operation faster.


2023-05-31 23:12:07

by Dave Chinner

Subject: Re: [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work()

On Wed, May 31, 2023 at 09:57:37AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <[email protected]>
>
> The patch makes s_dentry_lru and s_inode_lru be destroyed
> later, from the workqueue. This is preparation for splitting
> unregister_shrinker(super_block::s_shrink) into two stages,
> and for calling the finalize stage from destroy_super_work().
>
> Note that generic filesystem shrinker unregistration is safe
> to split in two stages right after this patch, since
> super_cache_count() and super_cache_scan() deal with
> s_dentry_lru and s_inode_lru only.
>
> But there are two exceptions: XFS and SHMEM, which define
> .nr_cached_objects() and .free_cached_objects() callbacks.
> These two do not allow us to do the splitting right after
> this patch: they touch fs-specific data, which is destroyed
> earlier than destroy_super_work(). So we can't yet call
> unregister_shrinker_delayed_finalize() from destroy_super_work()
> because of them, and the next patches prepare to make this
> possible.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Qi Zheng <[email protected]>
> ---
> fs/super.c | 17 +++++------------
> 1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 8d8d68799b34..2ce4c72720f3 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -159,6 +159,11 @@ static void destroy_super_work(struct work_struct *work)
> destroy_work);
> int i;
>
> + WARN_ON(list_lru_count(&s->s_dentry_lru));
> + WARN_ON(list_lru_count(&s->s_inode_lru));
> + list_lru_destroy(&s->s_dentry_lru);
> + list_lru_destroy(&s->s_inode_lru);
> +
> for (i = 0; i < SB_FREEZE_LEVELS; i++)
> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> kfree(s);
> @@ -177,8 +182,6 @@ static void destroy_unused_super(struct super_block *s)
> if (!s)
> return;
> up_write(&s->s_umount);
> - list_lru_destroy(&s->s_dentry_lru);
> - list_lru_destroy(&s->s_inode_lru);
> security_sb_free(s);
> put_user_ns(s->s_user_ns);
> kfree(s->s_subtype);
> @@ -287,8 +290,6 @@ static void __put_super(struct super_block *s)
> {
> if (!--s->s_count) {
> list_del_init(&s->s_list);
> - WARN_ON(s->s_dentry_lru.node);
> - WARN_ON(s->s_inode_lru.node);

Why are you removing the warnings from here? Regardless of where
we tear down the lru lists, they *must* be empty here otherwise we
have a memory leak. Hence I don't think these warnings should be
moved at all.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-05-31 23:15:19

by Dave Chinner

Subject: Re: [PATCH 2/8] mm: vmscan: split unregister_shrinker()

On Wed, May 31, 2023 at 09:57:36AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <[email protected]>
>
> This patch and the next ones in this series aim to make the
> time cost of synchronize_srcu() invisible to the user.
> The patch splits unregister_shrinker() into two functions:
>
> unregister_shrinker_delayed_initiate()
> unregister_shrinker_delayed_finalize()
>
> and shrinker users may arrange for the second of them to be
> called asynchronously (e.g., from a workqueue). The next patches
> make the superblock shrinker follow this way, so the user-visible
> umount() time won't contain delays from synchronize_srcu().
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Qi Zheng <[email protected]>
> ---
> include/linux/shrinker.h | 2 ++
> mm/vmscan.c | 22 ++++++++++++++++++----
> 2 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 224293b2dd06..e9d5a19d83fe 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -102,6 +102,8 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
> extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker,
> const char *fmt, ...);
> extern void unregister_shrinker(struct shrinker *shrinker);
> +extern void unregister_shrinker_delayed_initiate(struct shrinker *shrinker);
> +extern void unregister_shrinker_delayed_finalize(struct shrinker *shrinker);
> extern void free_prealloced_shrinker(struct shrinker *shrinker);
> extern void synchronize_shrinkers(void);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a773e97e152e..baf8d2327d70 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -799,10 +799,7 @@ int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
> #endif
> EXPORT_SYMBOL(register_shrinker);
>
> -/*
> - * Remove one
> - */
> -void unregister_shrinker(struct shrinker *shrinker)
> +void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
> {
> struct dentry *debugfs_entry;
> int debugfs_id;
> @@ -819,6 +816,13 @@ void unregister_shrinker(struct shrinker *shrinker)
> mutex_unlock(&shrinker_mutex);
>
> shrinker_debugfs_remove(debugfs_entry, debugfs_id);
> +}
> +EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
> +
> +void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
> +{
> + if (!shrinker->nr_deferred)
> + return;

This is new logic and isn't explained anywhere: why do we want to
avoid RCU cleanup if (shrinker->nr_deferred == 0)? Regardless,
whatever this is avoiding, it needs a comment to explain it.

-Dave.
--
Dave Chinner
[email protected]

2023-05-31 23:25:57

by Dave Chinner

Subject: Re: [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()

On Wed, May 31, 2023 at 09:57:38AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <[email protected]>
>
> This patch prepares the superblock shrinker for delayed unregistering.
> It makes super_cache_scan() avoid shrinking inactive superblocks,
> using SB_ACTIVE as the indicator. If a superblock is not active,
> super_cache_scan() just exits with SHRINK_STOP as the result.
>
> Note that SB_ACTIVE is cleared in generic_shutdown_super(), and this
> is done under the write lock of s_umount. super_cache_scan() takes
> the read lock of s_umount, so it cannot miss the cleared flag.
>
> The SB_BORN check is added to super_cache_scan() just for uniformity
> with super_cache_count(), while super_cache_count() receives the
> SB_ACTIVE check just for uniformity with super_cache_scan().
>
> After this patch, super_cache_scan() ignores unregistering
> superblocks, so this function is fine with splitting
> unregister_shrinker(). The next patches prepare super_cache_count()
> to follow the same way.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Qi Zheng <[email protected]>
> ---
> fs/super.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 2ce4c72720f3..2ce54561e82e 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -79,6 +79,11 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
> if (!trylock_super(sb))
> return SHRINK_STOP;
>
> + if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE)) {
> + freed = SHRINK_STOP;
> + goto unlock;
> + }

This should not be here - the check to determine if the shrinker
should run is done in the ->count method. If we removed the SB_ACTIVE
flag between ->count and ->scan, then the superblock should be
locked and the trylock_super() call above should fail....

Indeed, the unregister_shrinker() call in deactivate_locked_super()
is done with the sb->s_umount held exclusively, and this happens
before we clear SB_ACTIVE in the ->kill_sb() -> kill_block_super()
-> generic_shutdown_super() path after the shrinker is unregistered.

Hence we can't get to this check without SB_ACTIVE being set - the
trylock will fail and then the unregister_shrinker() call will do
its thing to ensure the shrinker is never called again.

If the change to the shrinker code allows the shrinker to still be
actively running when we call ->kill_sb(), then that needs to be
fixed. The superblock shrinker must be stopped completely and never
run again before we call ->kill_sb().
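
That is, the ordering currently relied upon looks roughly like this
(simplified sketch, not the exact code):

```c
void deactivate_locked_super(struct super_block *s)
{
	/*
	 * Caller holds s->s_umount exclusively, so a concurrent
	 * super_cache_scan() fails trylock_super() and bails out.
	 */
	unregister_shrinker(&s->s_shrink);
	fs->kill_sb(s);		/* SB_ACTIVE cleared in here; the shrinker
				 * is already guaranteed to never run again */
	/* ... */
}
```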

> if (sb->s_op->nr_cached_objects)
> fs_objects = sb->s_op->nr_cached_objects(sb, sc);
>
> @@ -110,6 +115,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
> freed += sb->s_op->free_cached_objects(sb, sc);
> }
>
> +unlock:
> up_read(&sb->s_umount);
> return freed;
> }
> @@ -136,7 +142,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
> * avoid this situation, so do the same here. The memory barrier is
> * matched with the one in mount_fs() as we don't hold locks here.
> */
> - if (!(sb->s_flags & SB_BORN))
> + if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE))
> return 0;

This is fine because it's an unlocked check, but I don't think it's
actually necessary given the above. Indeed, if you are adding this,
you need to expand the comment above on why SB_ACTIVE needs
checking, and why the memory barrier doesn't actually apply to that
part of the check....

-Dave.
--
Dave Chinner
[email protected]

2023-06-01 08:54:12

by Qi Zheng

Subject: Re: [PATCH 0/8] make unregistration of super_block shrinker faster



On 2023/6/1 02:40, Andrew Morton wrote:
> On Wed, 31 May 2023 09:57:34 +0000 Qi Zheng <[email protected]> wrote:
>
>> From: Qi Zheng <[email protected]>
>>
>> Hi all,
>>
>> This patch series aims to make unregistration of the super_block shrinker
>> faster.
>>
>> 1. Background
>> =============
>>
>> The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
>> on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
>> details can be seen from the link[1] below.
>>
>> [1]. https://lore.kernel.org/lkml/[email protected]/
>>
>> We can just use the following command to reproduce the result:
>>
>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>
>> 1) before commit f95bdb700bc6b:
>>
>> stress-ng: info: [11023] dispatching hogs: 9 ramfs
>> stress-ng: info: [11023] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>> stress-ng: info: [11023] (secs) (secs) (secs) (real time) (usr+sys time)
>> stress-ng: info: [11023] ramfs 774966 60.00 10.18 169.45 12915.89 4314.26
>> stress-ng: info: [11023] for a 60.00s run time:
>> stress-ng: info: [11023] 1920.11s available CPU time
>> stress-ng: info: [11023] 10.18s user time ( 0.53%)
>> stress-ng: info: [11023] 169.44s system time ( 8.82%)
>> stress-ng: info: [11023] 179.62s total time ( 9.35%)
>> stress-ng: info: [11023] load average: 8.99 2.69 0.93
>> stress-ng: info: [11023] successful run completed in 60.00s (1 min, 0.00 secs)
>>
>> 2) after commit f95bdb700bc6b:
>>
>> stress-ng: info: [37676] dispatching hogs: 9 ramfs
>> stress-ng: info: [37676] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
>> stress-ng: info: [37676] (secs) (secs) (secs) (real time) (usr+sys time)
>> stress-ng: info: [37676] ramfs 168673 60.00 1.61 39.66 2811.08 4087.47
>> stress-ng: info: [37676] for a 60.10s run time:
>> stress-ng: info: [37676] 1923.36s available CPU time
>> stress-ng: info: [37676] 1.60s user time ( 0.08%)
>> stress-ng: info: [37676] 39.66s system time ( 2.06%)
>> stress-ng: info: [37676] 41.26s total time ( 2.15%)
>> stress-ng: info: [37676] load average: 7.69 3.63 2.36
>> stress-ng: info: [37676] successful run completed in 60.10s (1 min, 0.10 secs)
>
> Is this comparison reversed? It appears to demonstrate that
> f95bdb700bc6b made the operation faster.

Maybe not. IIUC, for bogo ops/s (real time), bigger is better.

Thanks,
Qi

>


2023-06-01 09:15:35

by Qi Zheng

Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

Hi Dave,

On 2023/6/1 07:48, Dave Chinner wrote:
> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>> From: Kirill Tkhai <[email protected]>
>>
>> xfs_fs_nr_cached_objects() touches sb->s_fs_info,
>> and this patch makes it be destroyed later.
>>
>> After this patch, xfs_fs_nr_cached_objects() is safe
>> with respect to splitting unregister_shrinker():
>> mp->m_perag_tree is stable until destroy_super_work(),
>> while iteration over it is already RCU-protected by
>> internal XFS logic.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> Signed-off-by: Qi Zheng <[email protected]>
>> ---
>> fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 7e706255f165..694616524c76 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
>> }
>>
>> static void
>> -xfs_mount_free(
>> +xfs_free_names(
>> struct xfs_mount *mp)
>> {
>> kfree(mp->m_rtname);
>> kfree(mp->m_logname);
>> +}
>> +
>> +static void
>> +xfs_mount_free(
>> + struct xfs_mount *mp)
>> +{
>> + xfs_free_names(mp);
>> kmem_free(mp);
>> }
>>
>> @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
>> xfs_destroy_mount_workqueues(mp);
>> xfs_close_devices(mp);
>>
>> - sb->s_fs_info = NULL;
>> - xfs_mount_free(mp);
>> + xfs_free_names(mp);
>> +}
>> +
>> +static void
>> +xfs_fs_destroy_super(
>> + struct super_block *sb)
>> +{
>> + if (sb->s_fs_info) {
>> + struct xfs_mount *mp = XFS_M(sb);
>> +
>> + kmem_free(mp);
>> + sb->s_fs_info = NULL;
>> + }
>> }
>>
>> static long
>> @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
>> .dirty_inode = xfs_fs_dirty_inode,
>> .drop_inode = xfs_fs_drop_inode,
>> .put_super = xfs_fs_put_super,
>> + .destroy_super = xfs_fs_destroy_super,
>> .sync_fs = xfs_fs_sync_fs,
>> .freeze_fs = xfs_fs_freeze,
>> .unfreeze_fs = xfs_fs_unfreeze,
>
> I don't really like this ->destroy_super() callback, especially as
> it's completely undocumented as to why it exists. This is purely a
> work-around for handling extended filesystem superblock shrinker
> functionality, yet there's nothing that tells the reader this.
>
> It also seems to imply that the superblock shrinker can continue to
> run after the existing unregister_shrinker() call before ->kill_sb()
> is called. This violates the assumption made in filesystems that the
> superblock shrinkers have been stopped and will never run again
> before ->kill_sb() is called. Hence ->kill_sb() implementations
> assume there is nothing else accessing filesystem owned structures
> and it can tear down internal structures safely.
>
> Realistically, the days of XFS using this superblock shrinker
> extension are numbered. We've got a lot of the infrastructure we
> need in place to get rid of the background inode reclaim
> infrastructure that requires this shrinker extension, and it's on my
> list of things that need to be addressed in the near future.
>
> In fact, now that I look at it, I think the shmem usage of this
> superblock shrinker interface is broken - it returns SHRINK_STOP to
> ->free_cached_objects(), but the only valid return value is the
> number of objects freed (i.e. 0 is nothing freed). These special
> superblock extension interfaces do not work like a normal
> shrinker....
>
> Hence I think the shmem usage should be replaced with a separate
> internal shmem shrinker that is managed by the filesystem itself
> (similar to how XFS has multiple internal shrinkers).
>
> At this point, then the only user of this interface is (again) XFS.
> Given this, adding new VFS methods for a single filesystem
> for functionality that is planned to be removed is probably not the
> best approach to solving the problem.

Thanks for such a detailed analysis. Kirill Tkhai just proposed a
new method[1], I cc'd you on the email.

[1].
https://lore.kernel.org/lkml/[email protected]/

Thanks,
Qi

>
> Cheers,
>
> Dave.


2023-06-01 10:13:08

by Christian Brauner

Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> Hi Dave,
>
> On 2023/6/1 07:48, Dave Chinner wrote:
> > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > From: Kirill Tkhai <[email protected]>
> > >
> > > xfs_fs_nr_cached_objects() touches sb->s_fs_info,
> > > and this patch makes it be destroyed later.
> > >
> > > After this patch, xfs_fs_nr_cached_objects() is safe
> > > with respect to splitting unregister_shrinker():
> > > mp->m_perag_tree is stable until destroy_super_work(),
> > > while iteration over it is already RCU-protected by
> > > internal XFS logic.
> > >
> > > Signed-off-by: Kirill Tkhai <[email protected]>
> > > Signed-off-by: Qi Zheng <[email protected]>
> > > ---
> > > fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
> > > 1 file changed, 22 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 7e706255f165..694616524c76 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
> > > }
> > > static void
> > > -xfs_mount_free(
> > > +xfs_free_names(
> > > struct xfs_mount *mp)
> > > {
> > > kfree(mp->m_rtname);
> > > kfree(mp->m_logname);
> > > +}
> > > +
> > > +static void
> > > +xfs_mount_free(
> > > + struct xfs_mount *mp)
> > > +{
> > > + xfs_free_names(mp);
> > > kmem_free(mp);
> > > }
> > > @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
> > > xfs_destroy_mount_workqueues(mp);
> > > xfs_close_devices(mp);
> > > - sb->s_fs_info = NULL;
> > > - xfs_mount_free(mp);
> > > + xfs_free_names(mp);
> > > +}
> > > +
> > > +static void
> > > +xfs_fs_destroy_super(
> > > + struct super_block *sb)
> > > +{
> > > + if (sb->s_fs_info) {
> > > + struct xfs_mount *mp = XFS_M(sb);
> > > +
> > > + kmem_free(mp);
> > > + sb->s_fs_info = NULL;
> > > + }
> > > }
> > > static long
> > > @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
> > > .dirty_inode = xfs_fs_dirty_inode,
> > > .drop_inode = xfs_fs_drop_inode,
> > > .put_super = xfs_fs_put_super,
> > > + .destroy_super = xfs_fs_destroy_super,
> > > .sync_fs = xfs_fs_sync_fs,
> > > .freeze_fs = xfs_fs_freeze,
> > > .unfreeze_fs = xfs_fs_unfreeze,
> >
> > I don't really like this ->destroy_super() callback, especially as
> > it's completely undocumented as to why it exists. This is purely a
> > work-around for handling extended filesystem superblock shrinker
> > functionality, yet there's nothing that tells the reader this.
> >
> > It also seems to imply that the superblock shrinker can continue to
> > run after the existing unregister_shrinker() call before ->kill_sb()
> > is called. This violates the assumption made in filesystems that the
> > superblock shrinkers have been stopped and will never run again
> > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > assume there is nothing else accessing filesystem owned structures
> > and it can tear down internal structures safely.
> >
> > Realistically, the days of XFS using this superblock shrinker
> > extension are numbered. We've got a lot of the infrastructure we
> > need in place to get rid of the background inode reclaim
> > infrastructure that requires this shrinker extension, and it's on my
> > list of things that need to be addressed in the near future.
> >
> > In fact, now that I look at it, I think the shmem usage of this
> > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > ->free_cached_objects(), but the only valid return value is the
> > number of objects freed (i.e. 0 is nothing freed). These special
> > superblock extension interfaces do not work like a normal
> > shrinker....
> >
> > Hence I think the shmem usage should be replaced with a separate
> > internal shmem shrinker that is managed by the filesystem itself
> > (similar to how XFS has multiple internal shrinkers).
> >
> > At this point, then the only user of this interface is (again) XFS.
> > Given this, adding new VFS methods for a single filesystem
> > for functionality that is planned to be removed is probably not the
> > best approach to solving the problem.
>
> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> new method[1], I cc'd you on the email.
>
> [1].
> https://lore.kernel.org/lkml/[email protected]/

As long as we agree that we're not adding a new super operation, that
sounds like a better way forward.

2023-06-01 23:12:20

by Dave Chinner

Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> Hi Dave,
> On 2023/6/1 07:48, Dave Chinner wrote:
> > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > From: Kirill Tkhai <[email protected]>
> > I don't really like this ->destroy_super() callback, especially as
> > it's completely undocumented as to why it exists. This is purely a
> > work-around for handling extended filesystem superblock shrinker
> > functionality, yet there's nothing that tells the reader this.
> >
> > It also seems to imply that the superblock shrinker can continue to
> > run after the existing unregister_shrinker() call before ->kill_sb()
> > is called. This violates the assumption made in filesystems that the
> > superblock shrinkers have been stopped and will never run again
> > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > assume there is nothing else accessing filesystem owned structures
> > and it can tear down internal structures safely.
> >
> > Realistically, the days of XFS using this superblock shrinker
> > extension are numbered. We've got a lot of the infrastructure we
> > need in place to get rid of the background inode reclaim
> > infrastructure that requires this shrinker extension, and it's on my
> > list of things that need to be addressed in the near future.
> >
> > In fact, now that I look at it, I think the shmem usage of this
> > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > ->free_cached_objects(), but the only valid return value is the
> > number of objects freed (i.e. 0 is nothing freed). These special
> > superblock extension interfaces do not work like a normal
> > shrinker....
> >
> > Hence I think the shmem usage should be replaced with a separate
> > internal shmem shrinker that is managed by the filesystem itself
> > (similar to how XFS has multiple internal shrinkers).
> >
> > At this point, then the only user of this interface is (again) XFS.
> > Given this, adding new VFS methods for a single filesystem
> > for functionality that is planned to be removed is probably not the
> > best approach to solving the problem.
>
> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> new method[1], I cc'd you on the email.

I've just read through that thread, and I've looked at the original
patch that caused the regression.

I'm a bit annoyed right now. Nobody cc'd me on the original patches
nor were any of the subsystems that use shrinkers cc'd on the
patches that changed shrinker behaviour. I only find out about this
because someone tries to fix something they broke by *breaking more
stuff* and not even realising how broken what they are proposing is.

The previous code was not broken and it provided specific guarantees
to subsystems via unregister_shrinker(). From the above discussion,
it appears that the original authors of these changes either did not
know about or did not understand them, so that casts doubt in my
mind about the attempted solution and all the proposed fixes for it.

I don't have the time right now to unravel this mess and fully
understand the original problem, changes or the band-aids that are
being thrown around. We are also getting quite late in the cycle to
be doing major surgery to critical infrastructure, especially as it
gives so little time to review and regression-test whatever new solution
is proposed.

Given this appears to be a change introduced in 6.4-rc1, I think the
right thing to do is to revert the change rather than make things
worse by trying to shove some "quick fix" into the kernel to address
it.

Andrew, could you please sort out a series to revert this shrinker
infrastructure change and all the dependent hacks that have been
added to try to fix it so far?

-Dave.
--
Dave Chinner
[email protected]

2023-06-02 03:53:57

by Qi Zheng

Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

Hi Dave,

On 2023/6/2 07:06, Dave Chinner wrote:
> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>> Hi Dave,
>> On 2023/6/1 07:48, Dave Chinner wrote:
>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>> From: Kirill Tkhai <[email protected]>
>>> I don't really like this ->destroy_super() callback, especially as
>>> it's completely undocumented as to why it exists. This is purely a
>>> work-around for handling extended filesystem superblock shrinker
>>> functionality, yet there's nothing that tells the reader this.
>>>
>>> It also seems to imply that the superblock shrinker can continue to
>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>> is called. This violates the assumption made in filesystems that the
>>> superblock shrinkers have been stopped and will never run again
>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>> assume there is nothing else accessing filesystem owned structures
>>> and it can tear down internal structures safely.
>>>
>>> Realistically, the days of XFS using this superblock shrinker
>>> extension are numbered. We've got a lot of the infrastructure we
>>> need in place to get rid of the background inode reclaim
>>> infrastructure that requires this shrinker extension, and it's on my
>>> list of things that need to be addressed in the near future.
>>>
>>> In fact, now that I look at it, I think the shmem usage of this
>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>> ->free_cached_objects(), but the only valid return value is the
>>> number of objects freed (i.e. 0 is nothing freed). These special
>>> superblock extension interfaces do not work like a normal
>>> shrinker....
>>>
>>> Hence I think the shmem usage should be replaced with a separate
>>> internal shmem shrinker that is managed by the filesystem itself
>>> (similar to how XFS has multiple internal shrinkers).
>>>
>>> At this point, then the only user of this interface is (again) XFS.
>>> Given this, adding new VFS methods for a single filesystem
>>> for functionality that is planned to be removed is probably not the
>>> best approach to solving the problem.
>>
>> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
>> new method[1], I cc'd you on the email.
>
> I've just read through that thread, and I've looked at the original
> patch that caused the regression.
>
> I'm a bit annoyed right now. Nobody cc'd me on the original patches
> nor were any of the subsystems that use shrinkers cc'd on the
> patches that changed shrinker behaviour. I only find out about this

Sorry about that, my mistake. I followed the results of
scripts/get_maintainer.pl before.

> because someone tries to fix something they broke by *breaking more
> stuff* and not even realising how broken what they are proposing is.

Yes, this slows down umount. But the benefit is that
slab shrink becomes lockless, the mount operation and slab shrink no
longer affect each other, and the IPC no longer drops significantly,
etc.

And I used bpftrace to measure the time consumption of
unregister_shrinker():

```
And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
8260 CPU @ 2.40GHz) and the results are as follows:

1) use synchronize_srcu():

@ns[umount]:
[8K, 16K) 83 |@@@@@@@ |
[16K, 32K) 578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 78 |@@@@@@@ |
[64K, 128K) 6 | |
[128K, 256K) 7 | |
[256K, 512K) 29 |@@ |
[512K, 1M) 51 |@@@@ |
[1M, 2M) 90 |@@@@@@@@ |
[2M, 4M) 70 |@@@@@@ |
[4M, 8M) 8 | |

2) use synchronize_srcu_expedited():

@ns[umount]:
[8K, 16K) 31 |@@ |
[16K, 32K) 803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 158 |@@@@@@@@@@ |
[64K, 128K) 4 | |
[128K, 256K) 2 | |
[256K, 512K) 2 | |
```

With synchronize_srcu(), most of the time consumption is between 16us
and 32us, the worst case between 4ms and 8ms. Is this totally
unacceptable?

This performance regression report comes from a stress test. Will the
umount action be executed so frequently under real workloads?

If it really is unacceptable, then after applying the newly proposed
method, umount will be as fast as before (or even faster).

Thanks,
Qi

>
> The previous code was not broken and it provided specific guarantees
> to subsystems via unregister_shrinker(). From the above discussion,
> it appears that the original authors of these changes either did not
> know about or did not understand them, so that casts doubt in my
> mind about the attempted solution and all the proposed fixes for it.
>
> I don't have the time right now to unravel this mess and fully
> understand the original problem, changes or the band-aids that are
> being thrown around. We are also getting quite late in the cycle to
> be doing major surgery to critical infrastructure, especially as it
> gives so little time to review and regression-test whatever new solution
> is proposed.
>
> Given this appears to be a change introduced in 6.4-rc1, I think the
> right thing to do is to revert the change rather than make things
> worse by trying to shove some "quick fix" into the kernel to address
> it.
>
> Andrew, could you please sort out a series to revert this shrinker
> infrastructure change and all the dependent hacks that have been
> added to try to fix it so far?
>
> -Dave.

--
Thanks,
Qi

2023-06-02 15:35:04

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> Hi Dave,
>
> On 2023/6/2 07:06, Dave Chinner wrote:
> > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > Hi Dave,
> > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > From: Kirill Tkhai <[email protected]>
> > > > I don't really like this ->destroy_super() callback, especially as
> > > > it's completely undocumented as to why it exists. This is purely a
> > > > work-around for handling extended filesystem superblock shrinker
> > > > functionality, yet there's nothing that tells the reader this.
> > > >
> > > > It also seems to imply that the superblock shrinker can continue to
> > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > is called. This violates the assumption made in filesystems that the
> > > > superblock shrinkers have been stopped and will never run again
> > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > assume there is nothing else accessing filesystem-owned structures
> > > > and that they can tear down internal structures safely.
> > > >
> > > > Realistically, the days of XFS using this superblock shrinker
> > > > extension are numbered. We've got a lot of the infrastructure we
> > > > need in place to get rid of the background inode reclaim
> > > > infrastructure that requires this shrinker extension, and it's on my
> > > > list of things that need to be addressed in the near future.
> > > >
> > > > In fact, now that I look at it, I think the shmem usage of this
> > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > ->free_cached_objects(), but the only valid return value is the
> > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > superblock extension interfaces do not work like a normal
> > > > shrinker....
> > > >
> > > > Hence I think the shmem usage should be replaced with a separate
> > > > internal shmem shrinker that is managed by the filesystem itself
> > > > (similar to how XFS has multiple internal shrinkers).
> > > >
> > > > At this point, then the only user of this interface is (again) XFS.
> > > > Given this, adding new VFS methods for a single filesystem
> > > > for functionality that is planned to be removed is probably not the
> > > > best approach to solving the problem.
> > >
> > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > new method[1]; I cc'd you on the email.
> >
> > I've just read through that thread, and I've looked at the original
> > patch that caused the regression.
> >
> > I'm a bit annoyed right now. Nobody cc'd me on the original patches
> > nor were any of the subsystems that use shrinkers cc'd on the
> > patches that changed shrinker behaviour. I only find out about this
>
> Sorry about that, my mistake. I just followed the output of
> scripts/get_maintainer.pl when sending the patches.

Sometimes I wonder if people who contribute a lot to a subsystem should
be more aggressive about listing themselves explicitly in MAINTAINERS
but then I look at the ~600 emails that came in while I was on vacation
for 6 days over a long weekend and ... shut up. :P

> > because someone tries to fix something they broke by *breaking more
> > stuff* and not even realising how broken what they are proposing is.
>
> Yes, this slows down umount. But the benefit is that slab shrink
> becomes lockless, the mount operation and slab shrink no longer
> affect each other, and the IPC no longer drops significantly, etc.

The lockless shrink seems like a good thing to have, but ... is it
really true that the superblock shrinker can still be running after
->kill_sb? /That/ is surprising to me.

--D

> And I used bpftrace to measure the time consumption of
> unregister_shrinker():
>
> ```
> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> 8260 CPU @ 2.40GHz) and the results are as follows:
>
> 1) use synchronize_srcu():
>
> @ns[umount]:
> [8K, 16K) 83 |@@@@@@@ |
> [16K, 32K) 578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K) 78 |@@@@@@@ |
> [64K, 128K) 6 | |
> [128K, 256K) 7 | |
> [256K, 512K) 29 |@@ |
> [512K, 1M) 51 |@@@@ |
> [1M, 2M) 90 |@@@@@@@@ |
> [2M, 4M) 70 |@@@@@@ |
> [4M, 8M) 8 | |
>
> 2) use synchronize_srcu_expedited():
>
> @ns[umount]:
> [8K, 16K) 31 |@@ |
> [16K, 32K) 803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K) 158 |@@@@@@@@@@ |
> [64K, 128K) 4 | |
> [128K, 256K) 2 | |
> [256K, 512K) 2 | |
> ```
>
> With synchronize_srcu(), most of the time consumption is between 16us
> and 32us, with the worst case between 4ms and 8ms. Is this totally
> unacceptable?
>
> This performance regression report comes from a stress test. Will
> umount really be executed this frequently under real workloads?
>
> If it is really unacceptable, then after applying the newly proposed
> method, umount will be as fast as before (or even faster).
>
> Thanks,
> Qi
>
> >
> > The previous code was not broken and it provided specific guarantees
> > to subsystems via unregister_shrinker(). From the above discussion,
> > it appears that the original authors of these changes either did not
> > know about or did not understand them, so that casts doubt in my
> > mind about the attempted solution and all the proposed fixes for it.
> >
> > I don't have the time right now to unravel this mess and fully
> > understand the original problem, the changes, or the band-aids that are
> > being thrown around. We are also getting quite late in the cycle to
> > be doing major surgery to critical infrastructure, especially as it
> > gives so little time to review and regression-test whatever new solution
> > is proposed.
> >
> > Given this appears to be a change introduced in 6.4-rc1, I think the
> > right thing to do is to revert the change rather than make things
> > worse by trying to shove some "quick fix" into the kernel to address
> > it.
> >
> > Andrew, could you please sort out a series to revert this shrinker
> > infrastructure change and all the dependent hacks that have been
> > added to try to fix it so far?
> >
> > -Dave.
>
> --
> Thanks,
> Qi

2023-06-05 11:52:10

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()

On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> > Hi Dave,
> >
> > On 2023/6/2 07:06, Dave Chinner wrote:
> > > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > > Hi Dave,
> > > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > > From: Kirill Tkhai <[email protected]>
> > > > > I don't really like this ->destroy_super() callback, especially as
> > > > > it's completely undocumented as to why it exists. This is purely a
> > > > > work-around for handling extended filesystem superblock shrinker
> > > > > functionality, yet there's nothing that tells the reader this.
> > > > >
> > > > > It also seems to imply that the superblock shrinker can continue to
> > > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > > is called. This violates the assumption made in filesystems that the
> > > > > superblock shrinkers have been stopped and will never run again
> > > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > > assume there is nothing else accessing filesystem-owned structures
> > > > > and that they can tear down internal structures safely.
> > > > >
> > > > > Realistically, the days of XFS using this superblock shrinker
> > > > > extension are numbered. We've got a lot of the infrastructure we
> > > > > need in place to get rid of the background inode reclaim
> > > > > infrastructure that requires this shrinker extension, and it's on my
> > > > > list of things that need to be addressed in the near future.
> > > > >
> > > > > In fact, now that I look at it, I think the shmem usage of this
> > > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > > ->free_cached_objects(), but the only valid return value is the
> > > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > > superblock extension interfaces do not work like a normal
> > > > > shrinker....
> > > > >
> > > > > Hence I think the shmem usage should be replaced with a separate
> > > > > internal shmem shrinker that is managed by the filesystem itself
> > > > > (similar to how XFS has multiple internal shrinkers).
> > > > >
> > > > > At this point, then the only user of this interface is (again) XFS.
> > > > > Given this, adding new VFS methods for a single filesystem
> > > > > for functionality that is planned to be removed is probably not the
> > > > > best approach to solving the problem.
> > > >
> > > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > > new method[1]; I cc'd you on the email.
> > >
> > > I've just read through that thread, and I've looked at the original
> > > patch that caused the regression.
> > >
> > > I'm a bit annoyed right now. Nobody cc'd me on the original patches
> > > nor were any of the subsystems that use shrinkers cc'd on the
> > > patches that changed shrinker behaviour. I only find out about this
> >
> > Sorry about that, my mistake. I just followed the output of
> > scripts/get_maintainer.pl when sending the patches.
>
> Sometimes I wonder if people who contribute a lot to a subsystem should
> be more aggressive about listing themselves explicitly in MAINTAINERS
> but then I look at the ~600 emails that came in while I was on vacation
> for 6 days over a long weekend and ... shut up. :P
>
> > > because someone tries to fix something they broke by *breaking more
> > > stuff* and not even realising how broken what they are proposing is.
> >
> > Yes, this slows down umount. But the benefit is that slab shrink
> > becomes lockless, the mount operation and slab shrink no longer
> > affect each other, and the IPC no longer drops significantly, etc.
>
> The lockless shrink seems like a good thing to have, but ... is it
> really true that the superblock shrinker can still be running after
> ->kill_sb? /That/ is surprising to me.

So what's the plan here? If this causes issues for filesystems that rely
on specific guarantees that are broken by the patch, then it needs
either a clean fix or a revert.

>
> --D
>
> > And I used bpftrace to measure the time consumption of
> > unregister_shrinker():
> >
> > ```
> > And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> > 8260 CPU @ 2.40GHz) and the results are as follows:
> >
> > 1) use synchronize_srcu():
> >
> > @ns[umount]:
> > [8K, 16K) 83 |@@@@@@@ |
> > [16K, 32K) 578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K) 78 |@@@@@@@ |
> > [64K, 128K) 6 | |
> > [128K, 256K) 7 | |
> > [256K, 512K) 29 |@@ |
> > [512K, 1M) 51 |@@@@ |
> > [1M, 2M) 90 |@@@@@@@@ |
> > [2M, 4M) 70 |@@@@@@ |
> > [4M, 8M) 8 | |
> >
> > 2) use synchronize_srcu_expedited():
> >
> > @ns[umount]:
> > [8K, 16K) 31 |@@ |
> > [16K, 32K) 803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K) 158 |@@@@@@@@@@ |
> > [64K, 128K) 4 | |
> > [128K, 256K) 2 | |
> > [256K, 512K) 2 | |
> > ```
> >
> > With synchronize_srcu(), most of the time consumption is between 16us
> > and 32us, with the worst case between 4ms and 8ms. Is this totally
> > unacceptable?
> >
> > This performance regression report comes from a stress test. Will
> > umount really be executed this frequently under real workloads?
> >
> > If it is really unacceptable, then after applying the newly proposed
> > method, umount will be as fast as before (or even faster).
> >
> > Thanks,
> > Qi
> >
> > >
> > > The previous code was not broken and it provided specific guarantees
> > > to subsystems via unregister_shrinker(). From the above discussion,
> > > it appears that the original authors of these changes either did not
> > > know about or did not understand them, so that casts doubt in my
> > > mind about the attempted solution and all the proposed fixes for it.
> > >
> > > I don't have the time right now to unravel this mess and fully
> > > understand the original problem, the changes, or the band-aids that are
> > > being thrown around. We are also getting quite late in the cycle to
> > > be doing major surgery to critical infrastructure, especially as it
> > > gives so little time to review and regression-test whatever new solution
> > > is proposed.
> > >
> > > Given this appears to be a change introduced in 6.4-rc1, I think the
> > > right thing to do is to revert the change rather than make things
> > > worse by trying to shove some "quick fix" into the kernel to address
> > > it.
> > >
> > > Andrew, could you please sort out a series to revert this shrinker
> > > infrastructure change and all the dependent hacks that have been
> > > added to try to fix it so far?
> > >
> > > -Dave.
> >
> > --
> > Thanks,
> > Qi

2023-06-05 12:22:16

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()



On 2023/6/5 19:50, Christian Brauner wrote:
> On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
>>> Hi Dave,
>>>
>>> On 2023/6/2 07:06, Dave Chinner wrote:
>>>> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>>>>> Hi Dave,
>>>>> On 2023/6/1 07:48, Dave Chinner wrote:
>>>>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>>>>> From: Kirill Tkhai <[email protected]>
>>>>>> I don't really like this ->destroy_super() callback, especially as
>>>>>> it's completely undocumented as to why it exists. This is purely a
>>>>>> work-around for handling extended filesystem superblock shrinker
>>>>>> functionality, yet there's nothing that tells the reader this.
>>>>>>
>>>>>> It also seems to imply that the superblock shrinker can continue to
>>>>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>>>>> is called. This violates the assumption made in filesystems that the
>>>>>> superblock shrinkers have been stopped and will never run again
>>>>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>>>>> assume there is nothing else accessing filesystem-owned structures
>>>>>> and that they can tear down internal structures safely.
>>>>>>
>>>>>> Realistically, the days of XFS using this superblock shrinker
>>>>>> extension are numbered. We've got a lot of the infrastructure we
>>>>>> need in place to get rid of the background inode reclaim
>>>>>> infrastructure that requires this shrinker extension, and it's on my
>>>>>> list of things that need to be addressed in the near future.
>>>>>>
>>>>>> In fact, now that I look at it, I think the shmem usage of this
>>>>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>>>>> ->free_cached_objects(), but the only valid return value is the
>>>>>> number of objects freed (i.e. 0 is nothing freed). These special
>>>>>> superblock extension interfaces do not work like a normal
>>>>>> shrinker....
>>>>>>
>>>>>> Hence I think the shmem usage should be replaced with a separate
>>>>>> internal shmem shrinker that is managed by the filesystem itself
>>>>>> (similar to how XFS has multiple internal shrinkers).
>>>>>>
>>>>>> At this point, then the only user of this interface is (again) XFS.
>>>>>> Given this, adding new VFS methods for a single filesystem
>>>>>> for functionality that is planned to be removed is probably not the
>>>>>> best approach to solving the problem.
>>>>>
>>>>> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
>>>>> new method[1]; I cc'd you on the email.
>>>>
>>>> I've just read through that thread, and I've looked at the original
>>>> patch that caused the regression.
>>>>
>>>> I'm a bit annoyed right now. Nobody cc'd me on the original patches
>>>> nor were any of the subsystems that use shrinkers cc'd on the
>>>> patches that changed shrinker behaviour. I only find out about this
>>>
>>> Sorry about that, my mistake. I just followed the output of
>>> scripts/get_maintainer.pl when sending the patches.
>>
>> Sometimes I wonder if people who contribute a lot to a subsystem should
>> be more aggressive about listing themselves explicitly in MAINTAINERS
>> but then I look at the ~600 emails that came in while I was on vacation
>> for 6 days over a long weekend and ... shut up. :P
>>
>>>> because someone tries to fix something they broke by *breaking more
>>>> stuff* and not even realising how broken what they are proposing is.
>>>
>>> Yes, this slows down umount. But the benefit is that slab shrink
>>> becomes lockless, the mount operation and slab shrink no longer
>>> affect each other, and the IPC no longer drops significantly, etc.
>>
>> The lockless shrink seems like a good thing to have, but ... is it
>> really true that the superblock shrinker can still be running after
>> ->kill_sb? /That/ is surprising to me.
>
> So what's the plan here? If this causes issues for filesystems that rely
> on specific guarantees that are broken by the patch, then it needs
> either a clean fix or a revert.

If the reduction in umount speed is really unacceptable, I think we can
try the patch[1] from Kirill Tkhai. At least the granularity of the
shrinker rwsem lock is reduced, and the filesystem code does not need
to be modified.

And IIUC, after clearing SHRINKER_REGISTERED under the write lock of
shrinker->rwsem, we can guarantee that the shrinker won't be running.
So synchronize_srcu() no longer needs to be called in
unregister_shrinker(), and we don't need to split unregister_shrinker()
either.

[1].
https://lore.kernel.org/lkml/[email protected]/
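
Something like the following, if I understand Kirill's patch correctly
(a sketch with list handling omitted, not the exact code):

```
void unregister_shrinker(struct shrinker *shrinker)
{
	/*
	 * do_shrink_slab() runs with shrinker->rwsem held for read
	 * and bails out if SHRINKER_REGISTERED is clear, so once we
	 * drop the write lock below the shrinker can never run again.
	 */
	down_write(&shrinker->rwsem);
	shrinker->flags &= ~SHRINKER_REGISTERED;
	up_write(&shrinker->rwsem);

	/* now safe to free per-shrinker state without synchronize_srcu() */
	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;
}
```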

>
>>
>> --D
>>
>>> And I used bpftrace to measure the time consumption of
>>> unregister_shrinker():
>>>
>>> ```
>>> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
>>> 8260 CPU @ 2.40GHz) and the results are as follows:
>>>
>>> 1) use synchronize_srcu():
>>>
>>> @ns[umount]:
>>> [8K, 16K) 83 |@@@@@@@ |
>>> [16K, 32K) 578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K) 78 |@@@@@@@ |
>>> [64K, 128K) 6 | |
>>> [128K, 256K) 7 | |
>>> [256K, 512K) 29 |@@ |
>>> [512K, 1M) 51 |@@@@ |
>>> [1M, 2M) 90 |@@@@@@@@ |
>>> [2M, 4M) 70 |@@@@@@ |
>>> [4M, 8M) 8 | |
>>>
>>> 2) use synchronize_srcu_expedited():
>>>
>>> @ns[umount]:
>>> [8K, 16K) 31 |@@ |
>>> [16K, 32K) 803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K) 158 |@@@@@@@@@@ |
>>> [64K, 128K) 4 | |
>>> [128K, 256K) 2 | |
>>> [256K, 512K) 2 | |
>>> ```
>>>
>>> With synchronize_srcu(), most of the time consumption is between 16us
>>> and 32us, with the worst case between 4ms and 8ms. Is this totally
>>> unacceptable?
>>>
>>> This performance regression report comes from a stress test. Will
>>> umount really be executed this frequently under real workloads?
>>>
>>> If it is really unacceptable, then after applying the newly proposed
>>> method, umount will be as fast as before (or even faster).
>>>
>>> Thanks,
>>> Qi
>>>
>>>>
>>>> The previous code was not broken and it provided specific guarantees
>>>> to subsystems via unregister_shrinker(). From the above discussion,
>>>> it appears that the original authors of these changes either did not
>>>> know about or did not understand them, so that casts doubt in my
>>>> mind about the attempted solution and all the proposed fixes for it.
>>>>
>>>> I don't have the time right now to unravel this mess and fully
>>>> understand the original problem, the changes, or the band-aids that are
>>>> being thrown around. We are also getting quite late in the cycle to
>>>> be doing major surgery to critical infrastructure, especially as it
>>>> gives so little time to review and regression-test whatever new solution
>>>> is proposed.
>>>>
>>>> Given this appears to be a change introduced in 6.4-rc1, I think the
>>>> right thing to do is to revert the change rather than make things
>>>> worse by trying to shove some "quick fix" into the kernel to address
>>>> it.
>>>>
>>>> Andrew, could you please sort out a series to revert this shrinker
>>>> infrastructure change and all the dependent hacks that have been
>>>> added to try to fix it so far?
>>>>
>>>> -Dave.
>>>
>>> --
>>> Thanks,
>>> Qi