2018-08-07 16:42:47

by Kirill Tkhai

Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()

After bitmaps of not-empty memcg shrinkers were implemented
(see "[PATCH v9 00/17] Improve shrink_slab() scalability..."
series, which is already in mm tree), all the evil in perf
trace has moved from shrink_slab() to down_read_trylock().
As reported by Shakeel Butt:

> I created 255 memcgs, 255 ext4 mounts and made each memcg create a
> file containing few KiBs on corresponding mount. Then in a separate
> memcg of 200 MiB limit ran a fork-bomb.
>
> I ran the "perf record -ag -- sleep 60" and below are the results:
> + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
> + 30.72% fb.sh [kernel.kallsyms] [k] up_read
> + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
> + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
> + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
> + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
> + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
> + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
> + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
> + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
> + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

The patchset continues to improve shrink_slab() scalability and makes
it completely lockless. This is done in several steps:

1) Use SRCU to synchronize shrink_slab() and unregister_shrinker().
Nothing exciting here, just srcu_read_lock() in shrink_slab()
and shrink_slab_memcg(), and synchronize_srcu() in unregister_shrinker().
See [2/10] for details.

2) The above requires SRCU to be unconditionally enabled, which
[1/10] does. Note that if we can't always enable SRCU, we may
use a percpu_rw_semaphore instead. See the comment to [2/10]
for details.

3) Convert shrinker_rwsem to a mutex. Just a cleanup.

4) Further patches make it possible to speed up unregister_shrinker()
by splitting it in two stages. The first stage unlinks the shrinker
from shrinker_list and shrinker_idr, while the second finalizes
the thing by calling synchronize_srcu() and freeing memory.
Patch [4/10] actually splits unregister_shrinker(), while
[10/10] makes the superblock shrinker use the new helpers
(unregister_shrinker_delayed_{initiate,finalize}()).

5) Patches [5-9/10] are fs preparations, which make it possible
to split superblock unregistration in two stages. They successively
make super_cache_count() and super_cache_scan() safe to be called
on a shrinker that is being unregistered:

[cpu0]                                          [cpu1]
unregister_shrinker_delayed_initiate(shrinker)
                 ...                            shrink_slab(shrinker) (OK!)
unregister_shrinker_delayed_finalize(shrinker)

After all of this, shrink_slab() becomes lockless, while unregister_shrinker()
remains fast, at least for the superblock shrinker (other shrinkers can
also be made to unregister in the same delayed manner).
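
To make the shape of the result concrete, here is a minimal sketch of
the pattern the series converges to (simplified from the patches below;
SHRINK_EMPTY handling and the memcg path are omitted):

	DEFINE_STATIC_SRCU(srcu);

	static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
					 struct mem_cgroup *memcg,
					 int priority)
	{
		struct shrinker *shrinker;
		unsigned long freed = 0;
		int srcu_id;

		/* Readers take no rwsem, only an SRCU read section. */
		srcu_id = srcu_read_lock(&srcu);
		list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
			struct shrink_control sc = {
				.gfp_mask = gfp_mask,
				.nid = nid,
				.memcg = memcg,
			};

			freed += do_shrink_slab(&sc, shrinker, priority);
		}
		srcu_read_unlock(&srcu, srcu_id);
		return freed;
	}

	void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
	{
		/* Fast, user-visible stage: just unlink the shrinker. */
		mutex_lock(&shrinker_mutex);
		list_del_rcu(&shrinker->list);
		mutex_unlock(&shrinker_mutex);
	}

	void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
	{
		/* Slow stage, may run from a workqueue: wait for all
		 * SRCU readers, then free the shrinker's memory. */
		synchronize_srcu(&srcu);
		kfree(shrinker->nr_deferred);
		shrinker->nr_deferred = NULL;
	}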

(This requires "mm: Use special value SHRINKER_REGISTERING instead list_empty() check"
from https://lkml.org/lkml/2018/8/6/276, which is on its way to the -mm tree,
according to the -mm notification message from Andrew.)

---

Kirill Tkhai (10):
rcu: Make CONFIG_SRCU unconditionally enabled
mm: Make shrink_slab() lockless
mm: Convert shrinker_rwsem to mutex
mm: Split unregister_shrinker()
fs: Move list_lru_destroy() to destroy_super_work()
fs: Shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
fs: Introduce struct super_operations::destroy_super() callback.
xfs: Introduce xfs_fs_destroy_super()
shmem: Implement shmem_destroy_super()
fs: Use unregister_shrinker_delayed_{initiate,finalize} for super_block shrinker


drivers/base/core.c | 42 ----------
fs/super.c | 32 ++++----
fs/xfs/xfs_super.c | 14 +++
include/linux/device.h | 2
include/linux/fs.h | 6 +
include/linux/rcutiny.h | 4 -
include/linux/shrinker.h | 2
include/linux/srcu.h | 5 -
kernel/notifier.c | 3 -
kernel/rcu/Kconfig | 12 ---
kernel/rcu/tree.h | 5 -
kernel/rcu/update.c | 4 -
mm/shmem.c | 8 ++
mm/vmscan.c | 82 ++++++++++----------
.../selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 5 -
15 files changed, 89 insertions(+), 137 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2018-08-07 15:39:37

by Kirill Tkhai

Subject: [PATCH RFC 02/10] mm: Make shrink_slab() lockless

The patch makes the shrinker list and shrinker_idr SRCU-safe
for readers. This requires synchronize_srcu() at the finalize
stage of unregistration, which waits till all parallel
shrink_slab() calls are finished.

Note that the patch removes the rwsem_is_contended() checks from
the code; this does not result in delays during registration,
since there is no waiting at all. The unregistration case may be
optimized by splitting unregister_shrinker() in two stages, and
this is done in the next patches.

Also, keep in mind that in case SRCU is not allowed to be made
unconditional (which the previous patch does), it is possible to
use a percpu_rw_semaphore instead of it. percpu_down_read() will
be used in shrink_slab_memcg() and in shrink_slab(), and the
consecutive calls

percpu_down_write(percpu_rwsem);
percpu_up_write(percpu_rwsem);

will be used instead of synchronize_srcu().
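
For illustration, that alternative would look roughly like this (a
sketch only, not part of this patch):

	static DEFINE_STATIC_PERCPU_RWSEM(shrinker_percpu_rwsem);

	/* Reader side, in shrink_slab() and shrink_slab_memcg(): */
	percpu_down_read(&shrinker_percpu_rwsem);
	list_for_each_entry(shrinker, &shrinker_list, list)
		; /* ... do_shrink_slab() as now ... */
	percpu_up_read(&shrinker_percpu_rwsem);

	/* Writer side, in unregister_shrinker(), instead of
	 * synchronize_srcu(): an empty write section just waits
	 * for all current readers to leave. */
	percpu_down_write(&shrinker_percpu_rwsem);
	percpu_up_write(&shrinker_percpu_rwsem);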

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/vmscan.c | 42 +++++++++++++-----------------------------
1 file changed, 13 insertions(+), 29 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index da135e1acd94..9dda903a1406 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -168,6 +168,7 @@ unsigned long vm_total_pages;

static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_STATIC_SRCU(srcu);

#ifdef CONFIG_MEMCG_KMEM

@@ -192,7 +193,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
int id, ret = -ENOMEM;

down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -406,7 +406,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
idr_replace(&shrinker_idr, shrinker, shrinker->id);
up_write(&shrinker_rwsem);
}
@@ -432,8 +432,11 @@ void unregister_shrinker(struct shrinker *shrinker)
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
up_write(&shrinker_rwsem);
+
+ synchronize_srcu(&srcu);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -567,14 +570,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct memcg_shrinker_map *map;
unsigned long freed = 0;
- int ret, i;
+ int ret, i, srcu_id;

if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
+ srcu_id = srcu_read_lock(&srcu);
map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
true);
if (unlikely(!map))
@@ -621,14 +622,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
memcg_set_shrinker_bit(memcg, nid, i);
}
freed += ret;
-
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
unlock:
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&srcu, srcu_id);
return freed;
}
#else /* CONFIG_MEMCG_KMEM */
@@ -665,15 +661,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
struct shrinker *shrinker;
unsigned long freed = 0;
- int ret;
+ int srcu_id, ret;

if (!mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
-
- list_for_each_entry(shrinker, &shrinker_list, list) {
+ srcu_id = srcu_read_lock(&srcu);
+ list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -684,19 +678,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the regsitration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
+ srcu_read_unlock(&srcu, srcu_id);

- up_read(&shrinker_rwsem);
-out:
cond_resched();
return freed;
}


2018-08-07 15:41:23

by Kirill Tkhai

Subject: [PATCH RFC 09/10] shmem: Implement shmem_destroy_super()

Similar to xfs_fs_destroy_super(), implement the method
for shmem.

shmem_unused_huge_count() only touches sb->s_fs_info.
With its freeing deferred in this way, it becomes safe
for the unregister_shrinker() splitting (which is made
in the next patch).

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/shmem.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4829798869b6..35c65afefbc8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3427,6 +3427,12 @@ static void shmem_put_super(struct super_block *sb)

percpu_counter_destroy(&sbinfo->used_blocks);
mpol_put(sbinfo->mpol);
+}
+
+static void shmem_destroy_super(struct super_block *sb)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+
kfree(sbinfo);
sb->s_fs_info = NULL;
}
@@ -3504,6 +3510,7 @@ int shmem_fill_super(struct super_block *sb, void *data, size_t data_size,

failed:
shmem_put_super(sb);
+ shmem_destroy_super(sb);
return err;
}

@@ -3630,6 +3637,7 @@ static const struct super_operations shmem_ops = {
.evict_inode = shmem_evict_inode,
.drop_inode = generic_delete_inode,
.put_super = shmem_put_super,
+ .destroy_super = shmem_destroy_super,
#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
.nr_cached_objects = shmem_unused_huge_count,
.free_cached_objects = shmem_unused_huge_scan,


2018-08-07 15:42:28

by Kirill Tkhai

Subject: [PATCH RFC 10/10] fs: Use unregister_shrinker_delayed_{initiate, finalize} for super_block shrinker

Previous patches made all the data touched from
super_cache_count() (s_dentry_lru, s_inode_lru and
super_block::s_fs_info) be destroyed from destroy_super_work().

super_cache_scan() can't be called after SB_ACTIVE is cleared
in generic_shutdown_super().

So, it is safe to move the heavy unregister_shrinker_delayed_finalize()
part to delayed work, i.e. it's safe for a parallel do_shrink_slab()
to be executed between unregister_shrinker_delayed_initiate() and
destroy_super_work()->unregister_shrinker_delayed_finalize().

This keeps the heavy synchronize_srcu() from affecting the user-visible
unregistration speed (since now it's executed from a workqueue).

Any other places that are time-critical for unregistration may be
converted in the same manner.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 4 +++-
include/linux/fs.h | 5 +++++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index c60f092538c7..33e829741ec0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -165,6 +165,8 @@ static void destroy_super_work(struct work_struct *work)
destroy_work);
int i;

+ unregister_shrinker_delayed_finalize(&s->s_shrink);
+
WARN_ON(list_lru_count(&s->s_dentry_lru));
WARN_ON(list_lru_count(&s->s_inode_lru));
list_lru_destroy(&s->s_dentry_lru);
@@ -334,7 +336,7 @@ void deactivate_locked_super(struct super_block *s)
struct file_system_type *fs = s->s_type;
if (atomic_dec_and_test(&s->s_active)) {
cleancache_invalidate_fs(s);
- unregister_shrinker(&s->s_shrink);
+ unregister_shrinker_delayed_initiate(&s->s_shrink);
fs->kill_sb(s);

put_filesystem(fs);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33dfaed0a01a..8a1cd3097eef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1902,6 +1902,11 @@ struct super_operations {
struct dquot **(*get_dquots)(struct inode *);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+ /*
+ * The shrinker may call these two functions on a super_block being
+ * destroyed, until unregister_shrinker_delayed_finalize() has completed
+ * in destroy_super_work(), and they must take care of that.
+ */
long (*nr_cached_objects)(struct super_block *,
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,


2018-08-07 16:42:49

by Kirill Tkhai

Subject: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

This patch kills all CONFIG_SRCU defines and
the code under !CONFIG_SRCU.

Signed-off-by: Kirill Tkhai <[email protected]>
---
drivers/base/core.c | 42 --------------------
include/linux/device.h | 2 -
include/linux/rcutiny.h | 4 --
include/linux/srcu.h | 5 --
kernel/notifier.c | 3 -
kernel/rcu/Kconfig | 12 +-----
kernel/rcu/tree.h | 5 --
kernel/rcu/update.c | 4 --
.../selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 5 --
9 files changed, 3 insertions(+), 79 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 04bbcd779e11..8483da53c88f 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -44,7 +44,6 @@ early_param("sysfs.deprecated", sysfs_deprecated_setup);

/* Device links support. */

-#ifdef CONFIG_SRCU
static DEFINE_MUTEX(device_links_lock);
DEFINE_STATIC_SRCU(device_links_srcu);

@@ -67,30 +66,6 @@ void device_links_read_unlock(int idx)
{
srcu_read_unlock(&device_links_srcu, idx);
}
-#else /* !CONFIG_SRCU */
-static DECLARE_RWSEM(device_links_lock);
-
-static inline void device_links_write_lock(void)
-{
- down_write(&device_links_lock);
-}
-
-static inline void device_links_write_unlock(void)
-{
- up_write(&device_links_lock);
-}
-
-int device_links_read_lock(void)
-{
- down_read(&device_links_lock);
- return 0;
-}
-
-void device_links_read_unlock(int not_used)
-{
- up_read(&device_links_lock);
-}
-#endif /* !CONFIG_SRCU */

/**
* device_is_dependent - Check if one device depends on another one
@@ -317,7 +292,6 @@ static void device_link_free(struct device_link *link)
kfree(link);
}

-#ifdef CONFIG_SRCU
static void __device_link_free_srcu(struct rcu_head *rhead)
{
device_link_free(container_of(rhead, struct device_link, rcu_head));
@@ -337,22 +311,6 @@ static void __device_link_del(struct kref *kref)
list_del_rcu(&link->c_node);
call_srcu(&device_links_srcu, &link->rcu_head, __device_link_free_srcu);
}
-#else /* !CONFIG_SRCU */
-static void __device_link_del(struct kref *kref)
-{
- struct device_link *link = container_of(kref, struct device_link, kref);
-
- dev_info(link->consumer, "Dropping the link to %s\n",
- dev_name(link->supplier));
-
- if (link->flags & DL_FLAG_PM_RUNTIME)
- pm_runtime_drop_link(link->consumer);
-
- list_del(&link->s_node);
- list_del(&link->c_node);
- device_link_free(link);
-}
-#endif /* !CONFIG_SRCU */

/**
* device_link_del - Delete a link between two devices.
diff --git a/include/linux/device.h b/include/linux/device.h
index 8f882549edee..524dc17d67be 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -827,9 +827,7 @@ struct device_link {
u32 flags;
bool rpm_active;
struct kref kref;
-#ifdef CONFIG_SRCU
struct rcu_head rcu_head;
-#endif
};

/**
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 8d9a0ea8f0b5..63e2b6f2e94a 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -115,11 +115,7 @@ static inline void rcu_irq_exit_irqson(void) { }
static inline void rcu_irq_enter_irqson(void) { }
static inline void rcu_irq_exit(void) { }
static inline void exit_rcu(void) { }
-#ifdef CONFIG_SRCU
void rcu_scheduler_starting(void);
-#else /* #ifndef CONFIG_SRCU */
-static inline void rcu_scheduler_starting(void) { }
-#endif /* #else #ifndef CONFIG_SRCU */
static inline void rcu_end_inkernel_boot(void) { }
static inline bool rcu_is_watching(void) { return true; }

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 3e72a291c401..27238223a78e 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -60,11 +60,8 @@ int init_srcu_struct(struct srcu_struct *sp);
#include <linux/srcutiny.h>
#elif defined(CONFIG_TREE_SRCU)
#include <linux/srcutree.h>
-#elif defined(CONFIG_SRCU)
-#error "Unknown SRCU implementation specified to kernel configuration"
#else
-/* Dummy definition for things like notifiers. Actual use gets link error. */
-struct srcu_struct { };
+#error "Unknown SRCU implementation specified to kernel configuration"
#endif

void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
diff --git a/kernel/notifier.c b/kernel/notifier.c
index 6196af8a8223..6e4b55e74736 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -402,7 +402,6 @@ int raw_notifier_call_chain(struct raw_notifier_head *nh,
}
EXPORT_SYMBOL_GPL(raw_notifier_call_chain);

-#ifdef CONFIG_SRCU
/*
* SRCU notifier chain routines. Registration and unregistration
* use a mutex, and call_chain is synchronized by SRCU (no locks).
@@ -529,8 +528,6 @@ void srcu_init_notifier_head(struct srcu_notifier_head *nh)
}
EXPORT_SYMBOL_GPL(srcu_init_notifier_head);

-#endif /* CONFIG_SRCU */
-
static ATOMIC_NOTIFIER_HEAD(die_chain);

int notrace notify_die(enum die_val val, const char *str,
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 9210379c0353..f52e55e33f0a 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -49,28 +49,20 @@ config RCU_EXPERT

Say N if you are unsure.

-config SRCU
- bool
- help
- This option selects the sleepable version of RCU. This version
- permits arbitrary sleeping or blocking within RCU read-side critical
- sections.
-
config TINY_SRCU
bool
- default y if SRCU && TINY_RCU
+ default y if TINY_RCU
help
This option selects the single-CPU non-preemptible version of SRCU.

config TREE_SRCU
bool
- default y if SRCU && !TINY_RCU
+ default y if !TINY_RCU
help
This option selects the full-fledged version of SRCU.

config TASKS_RCU
def_bool PREEMPT
- select SRCU
help
This option enables a task-based RCU implementation that uses
only voluntary context switch (not preemption!), idle, and
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..b7f76400a45e 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -489,12 +489,7 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
static void rcu_dynticks_task_enter(void);
static void rcu_dynticks_task_exit(void);

-#ifdef CONFIG_SRCU
void srcu_online_cpu(unsigned int cpu);
void srcu_offline_cpu(unsigned int cpu);
-#else /* #ifdef CONFIG_SRCU */
-void srcu_online_cpu(unsigned int cpu) { }
-void srcu_offline_cpu(unsigned int cpu) { }
-#endif /* #else #ifdef CONFIG_SRCU */

#endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 39cb23d22109..90de81c98524 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -210,8 +210,6 @@ void rcu_test_sync_prims(void)
synchronize_sched_expedited();
}

-#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_SRCU)
-
/*
* Switch to run-time mode once RCU has fully initialized.
*/
@@ -224,8 +222,6 @@ static int __init rcu_set_runtime_mode(void)
}
core_initcall(rcu_set_runtime_mode);

-#endif /* #if !defined(CONFIG_TINY_RCU) || defined(CONFIG_SRCU) */
-
#ifdef CONFIG_DEBUG_LOCK_ALLOC
static struct lock_class_key rcu_lock_key;
struct lockdep_map rcu_lock_map =
diff --git a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
index af6fca03602f..b4f015c3244a 100644
--- a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
+++ b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
@@ -73,9 +73,4 @@ CONFIG_TASKS_RCU

These are controlled by CONFIG_PREEMPT and/or CONFIG_SMP.

-CONFIG_SRCU
-
- Selected by CONFIG_RCU_TORTURE_TEST, so cannot disable.
-
-
boot parameters ignored: TBD


2018-08-07 16:42:56

by Kirill Tkhai

Subject: [PATCH RFC 03/10] mm: Convert shrinker_rwsem to mutex

There are no readers anymore, so the rwsem is not needed.

Signed-off-by: Kirill Tkhai <[email protected]>
---
mm/vmscan.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9dda903a1406..2dc274a385b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -167,7 +167,7 @@ int vm_swappiness = 60;
unsigned long vm_total_pages;

static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_MUTEX(shrinker_mutex);
DEFINE_STATIC_SRCU(srcu);

#ifdef CONFIG_MEMCG_KMEM
@@ -192,7 +192,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
int id, ret = -ENOMEM;

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -208,7 +208,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
shrinker->id = id;
ret = 0;
unlock:
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}

@@ -218,9 +218,9 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)

BUG_ON(id < 0);

- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
idr_remove(&shrinker_idr, id);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}
#else /* CONFIG_MEMCG_KMEM */
static int prealloc_memcg_shrinker(struct shrinker *shrinker)
@@ -405,10 +405,10 @@ void free_prealloced_shrinker(struct shrinker *shrinker)

void register_shrinker_prepared(struct shrinker *shrinker)
{
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
idr_replace(&shrinker_idr, shrinker, shrinker->id);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}

int register_shrinker(struct shrinker *shrinker)
@@ -431,9 +431,9 @@ void unregister_shrinker(struct shrinker *shrinker)
return;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);

synchronize_srcu(&srcu);



2018-08-07 16:43:02

by Kirill Tkhai

Subject: [PATCH RFC 06/10] fs: Shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()

This patch prepares the superblock shrinker for delayed unregistering.
It makes super_cache_scan() avoid shrinking of inactive superblocks,
with SB_ACTIVE used as the indicator. In case a superblock is not
active, super_cache_scan() just exits with SHRINK_STOP as the result.

Note that SB_ACTIVE is cleared in generic_shutdown_super(), and this
is done under the s_umount mutex. super_cache_scan() also takes
that mutex, so it can't miss the cleared flag.

The SB_BORN check is added to super_cache_scan() just for uniformity
with super_cache_count(), while super_cache_count() receives the
SB_ACTIVE check just for uniformity with super_cache_scan().

After this patch super_cache_scan() ignores superblocks that are being
unregistered, so this function is OK with splitting unregister_shrinker().
The next patches prepare super_cache_count() to follow the same way.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 457834278e37..9222cfc196bf 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -79,6 +79,11 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
if (!trylock_super(sb))
return SHRINK_STOP;

+ if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE)) {
+ freed = SHRINK_STOP;
+ goto unlock;
+ }
+
if (sb->s_op->nr_cached_objects)
fs_objects = sb->s_op->nr_cached_objects(sb, sc);

@@ -110,6 +115,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
freed += sb->s_op->free_cached_objects(sb, sc);
}

+unlock:
up_read(&sb->s_umount);
return freed;
}
@@ -136,7 +142,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
* avoid this situation, so do the same here. The memory barrier is
* matched with the one in mount_fs() as we don't hold locks here.
*/
- if (!(sb->s_flags & SB_BORN))
+ if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE))
return 0;
smp_rmb();



2018-08-07 16:43:15

by Kirill Tkhai

Subject: [PATCH RFC 05/10] fs: Move list_lru_destroy() to destroy_super_work()

The patch makes s_dentry_lru and s_inode_lru be destroyed
later, from the workqueue. This is preparation for splitting
unregister_shrinker(super_block::s_shrink) in two stages,
and for calling the finalize stage from destroy_super_work().

Note that generic filesystem shrinker unregistration
is safe to split in two stages right after this
patch, since super_cache_count() and super_cache_scan()
deal with s_dentry_lru and s_inode_lru only.

But there are two exceptions: XFS and shmem, which
define .nr_cached_objects() and .free_cached_objects()
callbacks. These two do not allow us to do the splitting
right after this patch: they touch fs-specific data,
which is destroyed earlier than destroy_super_work().
So we can't call unregister_shrinker_delayed_finalize()
from destroy_super_work() because of them, and the next
patches make preparations to remove this obstacle.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 426161360af3..457834278e37 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -159,6 +159,11 @@ static void destroy_super_work(struct work_struct *work)
destroy_work);
int i;

+ WARN_ON(list_lru_count(&s->s_dentry_lru));
+ WARN_ON(list_lru_count(&s->s_inode_lru));
+ list_lru_destroy(&s->s_dentry_lru);
+ list_lru_destroy(&s->s_inode_lru);
+
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_free_rwsem(&s->s_writers.rw_sem[i]);
kfree(s);
@@ -177,8 +182,6 @@ static void destroy_unused_super(struct super_block *s)
if (!s)
return;
up_write(&s->s_umount);
- list_lru_destroy(&s->s_dentry_lru);
- list_lru_destroy(&s->s_inode_lru);
security_sb_free(s);
put_user_ns(s->s_user_ns);
kfree(s->s_subtype);
@@ -283,8 +286,6 @@ static void __put_super(struct super_block *s)
{
if (!--s->s_count) {
list_del_init(&s->s_list);
- WARN_ON(s->s_dentry_lru.node);
- WARN_ON(s->s_inode_lru.node);
WARN_ON(!list_empty(&s->s_mounts));
security_sb_free(s);
put_user_ns(s->s_user_ns);
@@ -327,14 +328,6 @@ void deactivate_locked_super(struct super_block *s)
unregister_shrinker(&s->s_shrink);
fs->kill_sb(s);

- /*
- * Since list_lru_destroy() may sleep, we cannot call it from
- * put_super(), where we hold the sb_lock. Therefore we destroy
- * the lru lists right now.
- */
- list_lru_destroy(&s->s_dentry_lru);
- list_lru_destroy(&s->s_inode_lru);
-
put_filesystem(fs);
put_super(s);
} else {


2018-08-07 18:29:56

by Kirill Tkhai

Subject: [PATCH RFC 04/10] mm: Split unregister_shrinker()

This and the next patches in this series aim to make the
time cost of synchronize_srcu() invisible to the user.
The patch splits unregister_shrinker() in two functions:

unregister_shrinker_delayed_initiate()
unregister_shrinker_delayed_finalize()

and shrinker users may have the second of them called
asynchronously (e.g., from a workqueue). The next patches make
the superblock shrinker follow this way, so user-visible
umount() time won't contain delays from synchronize_srcu().
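
As a sketch, the intended usage pattern looks like this (hypothetical
my_fs names, purely for illustration; the real conversion of the
superblock shrinker is done in [10/10]):

	static void my_destroy_work_fn(struct work_struct *work)
	{
		struct my_fs *fs = container_of(work, struct my_fs,
						destroy_work);

		/* Slow part: synchronize_srcu() runs here, off the
		 * user-visible path. */
		unregister_shrinker_delayed_finalize(&fs->shrinker);
		kfree(fs);
	}

	void my_fs_teardown(struct my_fs *fs)
	{
		/* Fast part: just unlinks the shrinker. */
		unregister_shrinker_delayed_initiate(&fs->shrinker);
		schedule_work(&fs->destroy_work);
	}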

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/shrinker.h | 2 ++
mm/vmscan.c | 22 ++++++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..92062d1239c2 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -86,5 +86,7 @@ extern int prealloc_shrinker(struct shrinker *shrinker);
extern void register_shrinker_prepared(struct shrinker *shrinker);
extern int register_shrinker(struct shrinker *shrinker);
extern void unregister_shrinker(struct shrinker *shrinker);
+extern void unregister_shrinker_delayed_initiate(struct shrinker *shrinker);
+extern void unregister_shrinker_delayed_finalize(struct shrinker *shrinker);
extern void free_prealloced_shrinker(struct shrinker *shrinker);
#endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2dc274a385b9..fba4996dfe25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -422,10 +422,7 @@ int register_shrinker(struct shrinker *shrinker)
}
EXPORT_SYMBOL(register_shrinker);

-/*
- * Remove one
- */
-void unregister_shrinker(struct shrinker *shrinker)
+void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
{
if (!shrinker->nr_deferred)
return;
@@ -434,12 +431,29 @@ void unregister_shrinker(struct shrinker *shrinker)
mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
mutex_unlock(&shrinker_mutex);
+}
+EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
+
+void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
+{
+ if (!shrinker->nr_deferred)
+ return;

synchronize_srcu(&srcu);

kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
+EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
+
+/*
+ * Remove one
+ */
+void unregister_shrinker(struct shrinker *shrinker)
+{
+ unregister_shrinker_delayed_initiate(shrinker);
+ unregister_shrinker_delayed_finalize(shrinker);
+}
EXPORT_SYMBOL(unregister_shrinker);

#define SHRINK_BATCH 128


2018-08-07 18:30:09

by Kirill Tkhai

Subject: [PATCH RFC 08/10] xfs: Introduce xfs_fs_destroy_super()

xfs_fs_nr_cached_objects() touches sb->s_fs_info,
and this patch makes it be destroyed later.

After this patch xfs_fs_nr_cached_objects() is safe
with respect to splitting unregister_shrinker(): mp->m_perag_tree
is stable till destroy_super_work(), while iteration
over it is already RCU-protected by XFS internals.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/xfs/xfs_super.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9e2ce4cd98e1..c1e00dd06893 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1774,11 +1774,20 @@ xfs_fs_put_super(
xfs_destroy_mount_workqueues(mp);
xfs_close_devices(mp);

- sb->s_fs_info = NULL;
xfs_free_fsname(mp);
- kfree(mp);
}

+STATIC void
+xfs_fs_destroy_super(
+ struct super_block *sb)
+{
+ if (sb->s_fs_info) {
+ kfree(sb->s_fs_info);
+ sb->s_fs_info = NULL;
+ }
+}
+
+
STATIC struct dentry *
xfs_fs_mount(
struct file_system_type *fs_type,
@@ -1816,6 +1825,7 @@ static const struct super_operations xfs_super_operations = {
.dirty_inode = xfs_fs_dirty_inode,
.drop_inode = xfs_fs_drop_inode,
.put_super = xfs_fs_put_super,
+ .destroy_super = xfs_fs_destroy_super,
.sync_fs = xfs_fs_sync_fs,
.freeze_fs = xfs_fs_freeze,
.unfreeze_fs = xfs_fs_unfreeze,


2018-08-07 18:30:45

by Kirill Tkhai

Subject: [PATCH RFC 07/10] fs: Introduce struct super_operations::destroy_super() callback.

The patch introduces a new callback, which will be called
asynchronously from delayed work.

This will allow making ::nr_cached_objects() safe
to be called on a superblock being destroyed (in the next
patches), and splitting unregister_shrinker() into two primitives.
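
As a sketch, a filesystem keeping its private data in sb->s_fs_info
could wire the new callback up like this (hypothetical foo_fs names;
the real XFS and shmem conversions follow in the next patches):

	static void foo_destroy_super(struct super_block *sb)
	{
		/* Runs asynchronously from destroy_super_work(),
		 * after the lru lists have been destroyed. */
		kfree(sb->s_fs_info);
		sb->s_fs_info = NULL;
	}

	static const struct super_operations foo_super_ops = {
		/* ... other ops ... */
		.destroy_super	= foo_destroy_super,
	};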

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/super.c | 3 +++
include/linux/fs.h | 1 +
2 files changed, 4 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 9222cfc196bf..c60f092538c7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -170,6 +170,9 @@ static void destroy_super_work(struct work_struct *work)
list_lru_destroy(&s->s_dentry_lru);
list_lru_destroy(&s->s_inode_lru);

+ if (s->s_op->destroy_super)
+ s->s_op->destroy_super(s);
+
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_free_rwsem(&s->s_writers.rw_sem[i]);
kfree(s);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 842fde0f0981..33dfaed0a01a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1880,6 +1880,7 @@ struct super_operations {
int (*drop_inode) (struct inode *);
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
+ void (*destroy_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_super) (struct super_block *);
int (*freeze_fs) (struct super_block *);


2018-08-08 00:56:32

by Steven Rostedt

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Tue, 07 Aug 2018 18:37:36 +0300
Kirill Tkhai <[email protected]> wrote:

> This patch kills all CONFIG_SRCU defines and
> the code under !CONFIG_SRCU.

Can you add the rationale for removing the SRCU config in the change log
please.

Thanks!

-- Steve

>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> drivers/base/core.c | 42 --------------------
> include/linux/device.h | 2 -
> include/linux/rcutiny.h | 4 --
> include/linux/srcu.h | 5 --
> kernel/notifier.c | 3 -
> kernel/rcu/Kconfig | 12 +-----
> kernel/rcu/tree.h | 5 --
> kernel/rcu/update.c | 4 --
> .../selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 5 --
> 9 files changed, 3 insertions(+), 79 deletions(-)
>
>

2018-08-08 01:07:10

by Stephen Rothwell

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

Hi Kirill,

On Tue, 07 Aug 2018 18:37:36 +0300 Kirill Tkhai <[email protected]> wrote:
>
> This patch kills all CONFIG_SRCU defines and
> the code under !CONFIG_SRCU.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> drivers/base/core.c | 42 --------------------
> include/linux/device.h | 2 -
> include/linux/rcutiny.h | 4 --
> include/linux/srcu.h | 5 --
> kernel/notifier.c | 3 -
> kernel/rcu/Kconfig | 12 +-----
> kernel/rcu/tree.h | 5 --
> kernel/rcu/update.c | 4 --
> .../selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 5 --
> 9 files changed, 3 insertions(+), 79 deletions(-)

You left quite a few "select SRCU" statements scattered across Kconfig
files:

$ git grep -l 'select SRCU' '*Kconfig*'
arch/arm/kvm/Kconfig
arch/arm64/kvm/Kconfig
arch/mips/kvm/Kconfig
arch/powerpc/kvm/Kconfig
arch/s390/kvm/Kconfig
arch/x86/Kconfig
arch/x86/kvm/Kconfig
block/Kconfig
drivers/clk/Kconfig
drivers/cpufreq/Kconfig
drivers/dax/Kconfig
drivers/devfreq/Kconfig
drivers/hwtracing/stm/Kconfig
drivers/md/Kconfig
drivers/net/Kconfig
drivers/opp/Kconfig
fs/btrfs/Kconfig
fs/notify/Kconfig
fs/quota/Kconfig
init/Kconfig
kernel/rcu/Kconfig
kernel/rcu/Kconfig.debug
mm/Kconfig
security/tomoyo/Kconfig

--
Cheers,
Stephen Rothwell



2018-08-08 01:09:35

by Stephen Rothwell

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

Hi Kirill,

On Tue, 07 Aug 2018 18:37:36 +0300 Kirill Tkhai <[email protected]> wrote:
>
> This patch kills all CONFIG_SRCU defines and
> the code under !CONFIG_SRCU.
>
> Signed-off-by: Kirill Tkhai <[email protected]>

So what sort of overheads (in terms of code size and performance) are
we adding by having SRCU enabled where it used not to be?

--
Cheers,
Stephen Rothwell



2018-08-08 01:13:32

by Stephen Rothwell

Subject: Re: [PATCH RFC 00/10] Introduce lockless shrink_slab()

Hi Kirill,

On Tue, 07 Aug 2018 18:37:19 +0300 Kirill Tkhai <[email protected]> wrote:
>
> After bitmaps of not-empty memcg shrinkers were implemented
> (see "[PATCH v9 00/17] Improve shrink_slab() scalability..."
> series, which is already in mm tree), all the evil in perf
> trace has moved from shrink_slab() to down_read_trylock().
> As reported by Shakeel Butt:
>
> > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
> > file containing few KiBs on corresponding mount. Then in a separate
> > memcg of 200 MiB limit ran a fork-bomb.
> >
> > I ran the "perf record -ag -- sleep 60" and below are the results:
> > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
> > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
> > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
> > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
> > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
> > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
> > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
> > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
> > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
> > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg
>
> The patchset continues to improve shrink_slab() scalability and makes
> it lockless completely. Here are several steps for that:

So do you have any numbers for after theses changes?

--
Cheers,
Stephen Rothwell



2018-08-08 05:41:07

by Shakeel Butt

Subject: Re: [PATCH RFC 00/10] Introduce lockless shrink_slab()

On Tue, Aug 7, 2018 at 6:12 PM Stephen Rothwell <[email protected]> wrote:
>
> Hi Kirill,
>
> On Tue, 07 Aug 2018 18:37:19 +0300 Kirill Tkhai <[email protected]> wrote:
> >
> > After bitmaps of not-empty memcg shrinkers were implemented
> > (see "[PATCH v9 00/17] Improve shrink_slab() scalability..."
> > series, which is already in mm tree), all the evil in perf
> > trace has moved from shrink_slab() to down_read_trylock().
> > As reported by Shakeel Butt:
> >
> > > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
> > > file containing few KiBs on corresponding mount. Then in a separate
> > > memcg of 200 MiB limit ran a fork-bomb.
> > >
> > > I ran the "perf record -ag -- sleep 60" and below are the results:
> > > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
> > > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
> > > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
> > > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
> > > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
> > > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
> > > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
> > > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
> > > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
> > > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg
> >
> > The patchset continues to improve shrink_slab() scalability and makes
> > it lockless completely. Here are several steps for that:
>
> So do you have any numbers for after theses changes?
>

I will do the same experiment as before with these patches sometime
this or next week.

BTW Kirill, thanks for pushing this.

regards,
Shakeel

2018-08-08 07:21:48

by Michal Hocko

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> This patch kills all CONFIG_SRCU defines and
> the code under !CONFIG_SRCU.

The last time somebody tried to do this there was a pushback due to
kernel tinyfication. So this should really give some numbers about the
code size increase. Also why can't we make this depend on MMU. Is
anybody else than the reclaim asking for unconditional SRCU usage?

Btw. I totally agree with Steven. This is a very poor changelog. It is
trivial to see what the patch does but it is far from clear why it is
doing that and why we cannot go other ways.
--
Michal Hocko
SUSE Labs

2018-08-08 09:47:57

by Kirill Tkhai

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On 08.08.2018 04:05, Stephen Rothwell wrote:
> Hi Kirill,
>
> On Tue, 07 Aug 2018 18:37:36 +0300 Kirill Tkhai <[email protected]> wrote:
>>
>> This patch kills all CONFIG_SRCU defines and
>> the code under !CONFIG_SRCU.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> drivers/base/core.c | 42 --------------------
>> include/linux/device.h | 2 -
>> include/linux/rcutiny.h | 4 --
>> include/linux/srcu.h | 5 --
>> kernel/notifier.c | 3 -
>> kernel/rcu/Kconfig | 12 +-----
>> kernel/rcu/tree.h | 5 --
>> kernel/rcu/update.c | 4 --
>> .../selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 5 --
>> 9 files changed, 3 insertions(+), 79 deletions(-)
>
> You left quite a few "select SRCU" statements scattered across Kconfig
> files:
>
> $ git grep -l 'select SRCU' '*Kconfig*'
> arch/arm/kvm/Kconfig
> arch/arm64/kvm/Kconfig
> arch/mips/kvm/Kconfig
> arch/powerpc/kvm/Kconfig
> arch/s390/kvm/Kconfig
> arch/x86/Kconfig
> arch/x86/kvm/Kconfig
> block/Kconfig
> drivers/clk/Kconfig
> drivers/cpufreq/Kconfig
> drivers/dax/Kconfig
> drivers/devfreq/Kconfig
> drivers/hwtracing/stm/Kconfig
> drivers/md/Kconfig
> drivers/net/Kconfig
> drivers/opp/Kconfig
> fs/btrfs/Kconfig
> fs/notify/Kconfig
> fs/quota/Kconfig
> init/Kconfig
> kernel/rcu/Kconfig
> kernel/rcu/Kconfig.debug
> mm/Kconfig
> security/tomoyo/Kconfig

Yeah, thanks, Stephen.

2018-08-08 10:02:02

by Kirill Tkhai

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On 08.08.2018 04:08, Stephen Rothwell wrote:
> Hi Kirill,
>
> On Tue, 07 Aug 2018 18:37:36 +0300 Kirill Tkhai <[email protected]> wrote:
>>
>> This patch kills all CONFIG_SRCU defines and
>> the code under !CONFIG_SRCU.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>
> So what sort of overheads (in terms of code size and performance) are
> we adding by having SRCU enabled where it used not to be?

SRCU is unconditionally enabled for x86, so I had to use another arch (sparc64)
to check the size difference. The config I used to compile is attached; SRCU
was enabled via:

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 2d58c26bff9a..6e9116e356d4 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -15,6 +15,7 @@ config SPARC
select ARCH_MIGHT_HAVE_PC_PARPORT if SPARC64 && PCI
select ARCH_MIGHT_HAVE_PC_SERIO
select OF
+ select SRCU
select OF_PROMTREE
select HAVE_IDE
select HAVE_OPROFILE

$ size image.srcu.disabled
text data bss dec hex filename
5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled

$ size image.srcu.enabled
text data bss dec hex filename
5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled

The difference is: 15158625 - 15116156 = 42469 bytes, i.e. ~41 KiB.

I have no ideas about performance overhead measurements. If you have ideas
where overheads may occur, please say. At first sight, there should not be
a problem, since SRCU is enabled on x86 by default.


Attachments:
.config (66.80 kB)

2018-08-08 10:19:06

by Kirill Tkhai

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On 08.08.2018 10:20, Michal Hocko wrote:
> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
>> This patch kills all CONFIG_SRCU defines and
>> the code under !CONFIG_SRCU.
>
> The last time somebody tried to do this there was a pushback due to
> kernel tinyfication. So this should really give some numbers about the
> code size increase. Also why can't we make this depend on MMU. Is
> anybody else than the reclaim asking for unconditional SRCU usage?

I don't know of one. The size numbers (sparc64) are:

$ size image.srcu.disabled
text data bss dec hex filename
5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
$ size image.srcu.enabled
text data bss dec hex filename
5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
The difference is: 15158625 - 15116156 = 42469 bytes, i.e. ~41 KiB.

Please see the measurement details in my answer to Stephen.

> Btw. I totaly agree with Steven. This is a very poor changelog. It is
> trivial to see what the patch does but it is far from clear why it is
> doing that and why we cannot go other ways.
We possibly can go another way, and there is a comment to [2/10] about this.
A percpu rwsem may be used instead; the only way it is worse is that it will
make shrink_slab() wait for unregistering shrinkers, while the SRCU-based
implementation does not require this. This may not be a big problem.
But if SRCU is a real problem for embedded people, I really don't want them
to hate me in the future because of this, so please CC someone if you know :)

Kirill

2018-08-08 10:19:41

by Kirill Tkhai

Subject: Re: [PATCH RFC 00/10] Introduce lockless shrink_slab()

On 08.08.2018 08:39, Shakeel Butt wrote:
> On Tue, Aug 7, 2018 at 6:12 PM Stephen Rothwell <[email protected]> wrote:
>>
>> Hi Kirill,
>>
>> On Tue, 07 Aug 2018 18:37:19 +0300 Kirill Tkhai <[email protected]> wrote:
>>>
>>> After bitmaps of not-empty memcg shrinkers were implemented
>>> (see "[PATCH v9 00/17] Improve shrink_slab() scalability..."
>>> series, which is already in mm tree), all the evil in perf
>>> trace has moved from shrink_slab() to down_read_trylock().
>>> As reported by Shakeel Butt:
>>>
>>> > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
>>> > file containing few KiBs on corresponding mount. Then in a separate
>>> > memcg of 200 MiB limit ran a fork-bomb.
>>> >
>>> > I ran the "perf record -ag -- sleep 60" and below are the results:
>>> > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
>>> > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
>>> > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
>>> > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
>>> > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
>>> > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
>>> > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
>>> > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
>>> > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
>>> > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
>>> > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg
>>>
>>> The patchset continues to improve shrink_slab() scalability and makes
>>> it lockless completely. Here are several steps for that:
>>
>> So do you have any numbers for after theses changes?
>>
>
> I will do the same experiment as before with these patches sometime
> this or next week.

Thanks, Shakeel!

> BTW Kirill, thanks for pushing this.
>
> regards,
> Shakeel
>

2018-08-08 10:28:56

by Michal Hocko

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

[CC Josh - the whole series is
http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain]

On Wed 08-08-18 13:17:44, Kirill Tkhai wrote:
> On 08.08.2018 10:20, Michal Hocko wrote:
> > On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> >> This patch kills all CONFIG_SRCU defines and
> >> the code under !CONFIG_SRCU.
> >
> > The last time somebody tried to do this there was a pushback due to
> > kernel tinyfication. So this should really give some numbers about the
> > code size increase. Also why can't we make this depend on MMU. Is
> > anybody else than the reclaim asking for unconditional SRCU usage?
>
> I don't know one. The size numbers (sparc64) are:
>
> $ size image.srcu.disabled
> text data bss dec hex filename
> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
> $ size image.srcu.enabled
> text data bss dec hex filename
> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
> The difference is: 15158625-15116156 = 42469 ~41Kb
>
> Please, see the measurement details to my answer to Stephen.
>
> > Btw. I totaly agree with Steven. This is a very poor changelog. It is
> > trivial to see what the patch does but it is far from clear why it is
> > doing that and why we cannot go other ways.
> We possibly can go another way, and there is comment to [2/10] about this.
> Percpu rwsem may be used instead, the only thing, it is worse, is it will
> make shrink_slab() wait unregistering shrinkers, while srcu-based
> implementation does not require this.

Well, if unregistering doesn't do anything subtle - e.g. an allocation
or take locks which depend on allocation - and we can guarantee that
then blocking shrink_slab shouldn't be a big deal. It is subtle though.
Maybe subtle enough to make unconditional SRCU worth it. This all should
be in the changelog though.

> This may be not a big problem.
> But, if SRCU is real problem for embedded people, I really don't want they
> hate me in the future because of this, so please CC someone if you know :)

I guess Josh was trying to pursue kernel tinification.

--
Michal Hocko
SUSE Labs

2018-08-08 11:06:21

by Stephen Rothwell

Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

Hi Kirill,

On Wed, 8 Aug 2018 12:59:40 +0300 Kirill Tkhai <[email protected]> wrote:
>
> On 08.08.2018 04:08, Stephen Rothwell wrote:
> >
> > So what sort of overheads (in terms of code size and performance) are
> > we adding by having SRCU enabled where it used not to be?
>
> SRCU is unconditionally enabled for x86, so I had to use another arch (sparc64)
> to check the size difference. The config, I used to compile, is attached, SRCU
> was enabled via:
>
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index 2d58c26bff9a..6e9116e356d4 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -15,6 +15,7 @@ config SPARC
> select ARCH_MIGHT_HAVE_PC_PARPORT if SPARC64 && PCI
> select ARCH_MIGHT_HAVE_PC_SERIO
> select OF
> + select SRCU
> select OF_PROMTREE
> select HAVE_IDE
> select HAVE_OPROFILE
>
> $ size image.srcu.disabled
> text data bss dec hex filename
> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
>
> $ size image.srcu.enabled
> text data bss dec hex filename
> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
>
> The difference is: 15158625-15116156 = 42469 ~41Kb

Thanks for that.

> I have no ideas about performance overhead measurements. If you have ideas
> where overheads may occur, please say. At first sight, there should not be
> a problem, since SRCU is enabled on x86 by default.

I have no idea, just asking questions that might be relevant for
platforms where SRCU is normally disabled.

--
Cheers,
Stephen Rothwell



2018-08-08 11:53:21

by Kirill Tkhai

Subject: Re: [PATCH RFC 02/10] mm: Make shrink_slab() lockless

This also requires call_srcu() in memcg_expand_one_shrinker_map()
and srcu_dereference() in shrink_slab_memcg(). The updated patch
is below:

[PATCH] mm: Make shrink_slab() lockless

From: Kirill Tkhai <[email protected]>

The patch makes the shrinker list and shrinker_idr SRCU-safe
for readers. This requires synchronize_srcu() at the finalize
stage of unregistration, which waits till all parallel
shrink_slab() calls are finished.

Note that the patch removes the rwsem_is_contended() checks from
the code; this does not result in delays during registration,
since there is no waiting at all. The unregistration case may be
optimized by splitting unregister_shrinker() in two stages, and
this is done in the next patches.

Also, keep in mind that in case SRCU is not allowed to be made
unconditional (which the previous patch does), it is possible to
use a percpu_rw_semaphore instead of it. percpu_down_read() will
be used in shrink_slab_memcg() and in shrink_slab(), and the
consecutive calls

percpu_down_write(percpu_rwsem);
percpu_up_write(percpu_rwsem);

will be used instead of synchronize_srcu().

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/shrinker.h | 2 ++
mm/memcontrol.c | 2 +-
mm/vmscan.c | 46 +++++++++++++++-------------------------------
3 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..94b44662f430 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -82,6 +82,8 @@ struct shrinker {
#define SHRINKER_NUMA_AWARE (1 << 0)
#define SHRINKER_MEMCG_AWARE (1 << 1)

+extern struct srcu_struct shrinker_srcu;
+
extern int prealloc_shrinker(struct shrinker *shrinker);
extern void register_shrinker_prepared(struct shrinker *shrinker);
extern int register_shrinker(struct shrinker *shrinker);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e3c1315b1de..a30bc2cc6380 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -347,7 +347,7 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
memset((void *)new->map + old_size, 0, size - old_size);

rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
- call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+ call_srcu(&shrinker_srcu, &old->rcu, memcg_free_shrinker_map_rcu);
}

return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index da135e1acd94..acb087f3ac35 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -168,6 +168,7 @@ unsigned long vm_total_pages;

static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);

#ifdef CONFIG_MEMCG_KMEM

@@ -192,7 +193,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
int id, ret = -ENOMEM;

down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -406,7 +406,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
idr_replace(&shrinker_idr, shrinker, shrinker->id);
up_write(&shrinker_rwsem);
}
@@ -432,8 +432,11 @@ void unregister_shrinker(struct shrinker *shrinker)
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
up_write(&shrinker_rwsem);
+
+ synchronize_srcu(&shrinker_srcu);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -567,16 +570,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct memcg_shrinker_map *map;
unsigned long freed = 0;
- int ret, i;
+ int ret, i, srcu_id;

if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
- map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
- true);
+ srcu_id = srcu_read_lock(&shrinker_srcu);
+ map = srcu_dereference(memcg->nodeinfo[nid]->shrinker_map,
+ &shrinker_srcu);
if (unlikely(!map))
goto unlock;

@@ -621,14 +622,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
memcg_set_shrinker_bit(memcg, nid, i);
}
freed += ret;
-
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
unlock:
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_id);
return freed;
}
#else /* CONFIG_MEMCG_KMEM */
@@ -665,15 +661,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
struct shrinker *shrinker;
unsigned long freed = 0;
- int ret;
+ int srcu_id, ret;

if (!mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
-
- list_for_each_entry(shrinker, &shrinker_list, list) {
+ srcu_id = srcu_read_lock(&shrinker_srcu);
+ list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -684,19 +678,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the regsitration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
+ srcu_read_unlock(&shrinker_srcu, srcu_id);

- up_read(&shrinker_rwsem);
-out:
cond_resched();
return freed;
}

2018-08-08 12:37:48

by Tetsuo Handa

Subject: Re: [PATCH RFC 02/10] mm: Make shrink_slab() lockless

On 2018/08/08 20:51, Kirill Tkhai wrote:
> @@ -192,7 +193,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> int id, ret = -ENOMEM;
>
> down_write(&shrinker_rwsem);
> - /* This may call shrinker, so it must use down_read_trylock() */
> id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> if (id < 0)
> goto unlock;

I don't know why perf reports down_read_trylock(&shrinker_rwsem). But the
above code is already bad. A GFP_KERNEL allocation can involve shrinkers,
and the OOM killer would be invoked because the shrinkers are non-functional
due to this down_write(&shrinker_rwsem). Please avoid blocking memory
allocation with shrinker_rwsem held.

2018-08-08 12:52:41

by Kirill Tkhai

Subject: Re: [PATCH RFC 02/10] mm: Make shrink_slab() lockless

On 08.08.2018 15:36, Tetsuo Handa wrote:
> On 2018/08/08 20:51, Kirill Tkhai wrote:
>> @@ -192,7 +193,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>> int id, ret = -ENOMEM;
>>
>> down_write(&shrinker_rwsem);
>> - /* This may call shrinker, so it must use down_read_trylock() */
>> id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
>> if (id < 0)
>> goto unlock;
>
> I don't know why perf reports down_read_trylock(&shrinker_rwsem).

This happens in the case of many cgroups and mounts on a node. It
often happens on big machines with containers.

> But above code is already bad. GFP_KERNEL allocation involves shrinkers and
> the OOM killer would be invoked because shrinkers are defunctional due to
> this down_write(&shrinker_rwsem). Please avoid blocking memory allocation
> with shrinker_rwsem held.

The first versions of the patchset used a non-blocking allocation,
but that went away in the process of the review (CC Vladimir).

Page LRU scanning still works even when shrink_slab() is not
available, while the additional locking would make the code more
complicated than these difficulties are worth.

Kirill

2018-08-08 13:23:01

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

[v2: added two more places that need srcu_dereference(). All ->shrinker_map
dereferences must be under SRCU; this v2 adds the ones missed previously]

The patch makes the shrinker list and shrinker_idr SRCU-safe
for readers. This requires synchronize_srcu() at the finalize
stage of unregistering, which waits till all parallel
shrink_slab() calls are finished.

Note that the patch removes the rwsem_is_contended() checks
from the code; this does not result in delays during
registration, since there is no waiting at all. The
unregistration case may be optimized by splitting
unregister_shrinker() in two stages, which is done in the
next patches.

Also, keep in mind that in case SRCU cannot be made
unconditional (which is done in the previous patch), it
is possible to use a percpu_rw_semaphore instead of it.
percpu_down_read() will be used in shrink_slab_memcg()
and in shrink_slab(), and the consecutive calls

percpu_down_write(percpu_rwsem);
percpu_up_write(percpu_rwsem);

will be used instead of synchronize_srcu().
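
For reference, here is the SRCU read/update pattern the patch builds on,
reduced to a minimal self-contained sketch. The names my_srcu, my_list and
struct item are illustrative only, not from the patch, and list mutation
itself still needs the writers' lock, exactly as shrinker_rwsem serializes
it in the diff below:

	#include <linux/srcu.h>
	#include <linux/rculist.h>
	#include <linux/slab.h>

	struct item {
		struct list_head list;
		int payload;
	};

	static LIST_HEAD(my_list);
	DEFINE_SRCU(my_srcu);

	/* Reader: unlike plain RCU, an SRCU read-side section may sleep. */
	static int walk_items(void)
	{
		struct item *it;
		int idx, sum = 0;

		idx = srcu_read_lock(&my_srcu);
		list_for_each_entry_rcu(it, &my_list, list)
			sum += it->payload;	/* blocking work would be legal here */
		srcu_read_unlock(&my_srcu, idx);

		return sum;
	}

	/* Updater: unlink first, then wait for all pre-existing readers. */
	static void remove_item(struct item *it)
	{
		list_del_rcu(&it->list);	/* new readers can't find 'it' */
		synchronize_srcu(&my_srcu);	/* old readers are done with 'it' */
		kfree(it);
	}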

Signed-off-by: Kirill Tkhai <[email protected]>
---
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..94b44662f430 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -82,6 +82,8 @@ struct shrinker {
#define SHRINKER_NUMA_AWARE (1 << 0)
#define SHRINKER_MEMCG_AWARE (1 << 1)

+extern struct srcu_struct shrinker_srcu;
+
extern int prealloc_shrinker(struct shrinker *shrinker);
extern void register_shrinker_prepared(struct shrinker *shrinker);
extern int register_shrinker(struct shrinker *shrinker);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e3c1315b1de..ed40eb4b8300 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -332,8 +332,9 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
lockdep_assert_held(&memcg_shrinker_map_mutex);

for_each_node(nid) {
- old = rcu_dereference_protected(
- mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+ old = srcu_dereference_check(
+ mem_cgroup_nodeinfo(memcg, nid)->shrinker_map,
+ &shrinker_srcu, true);
/* Not yet online memcg */
if (!old)
return 0;
@@ -347,7 +348,7 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
memset((void *)new->map + old_size, 0, size - old_size);

rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
- call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+ call_srcu(&shrinker_srcu, &old->rcu, memcg_free_shrinker_map_rcu);
}

return 0;
@@ -364,7 +365,8 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)

for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
- map = rcu_dereference_protected(pn->shrinker_map, true);
+ map = srcu_dereference_check(pn->shrinker_map,
+ &shrinker_srcu, true);
if (map)
kvfree(map);
rcu_assign_pointer(pn->shrinker_map, NULL);
@@ -427,13 +429,15 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
struct memcg_shrinker_map *map;
+ int srcu_id;

- rcu_read_lock();
- map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+ srcu_id = srcu_read_lock(&shrinker_srcu);
+ map = srcu_dereference(memcg->nodeinfo[nid]->shrinker_map,
+ &shrinker_srcu);
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
set_bit(shrinker_id, map->map);
- rcu_read_unlock();
+ srcu_read_unlock(&shrinker_srcu, srcu_id);
}
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index da135e1acd94..acb087f3ac35 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -168,6 +168,7 @@ unsigned long vm_total_pages;

static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);

#ifdef CONFIG_MEMCG_KMEM

@@ -192,7 +193,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
int id, ret = -ENOMEM;

down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -406,7 +406,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
idr_replace(&shrinker_idr, shrinker, shrinker->id);
up_write(&shrinker_rwsem);
}
@@ -432,8 +432,11 @@ void unregister_shrinker(struct shrinker *shrinker)
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
up_write(&shrinker_rwsem);
+
+ synchronize_srcu(&shrinker_srcu);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -567,16 +570,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct memcg_shrinker_map *map;
unsigned long freed = 0;
- int ret, i;
+ int ret, i, srcu_id;

if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
- map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
- true);
+ srcu_id = srcu_read_lock(&shrinker_srcu);
+ map = srcu_dereference(memcg->nodeinfo[nid]->shrinker_map,
+ &shrinker_srcu);
if (unlikely(!map))
goto unlock;

@@ -621,14 +622,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
memcg_set_shrinker_bit(memcg, nid, i);
}
freed += ret;
-
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
unlock:
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_id);
return freed;
}
#else /* CONFIG_MEMCG_KMEM */
@@ -665,15 +661,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
struct shrinker *shrinker;
unsigned long freed = 0;
- int ret;
+ int srcu_id, ret;

if (!mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
-
- list_for_each_entry(shrinker, &shrinker_list, list) {
+ srcu_id = srcu_read_lock(&shrinker_srcu);
+ list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -684,19 +678,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the regsitration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
+ srcu_read_unlock(&shrinker_srcu, srcu_id);

- up_read(&shrinker_rwsem);
-out:
cond_resched();
return freed;
}


2018-08-08 16:19:12

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Wed, Aug 08, 2018 at 01:17:44PM +0300, Kirill Tkhai wrote:
> On 08.08.2018 10:20, Michal Hocko wrote:
> > On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> >> This patch kills all CONFIG_SRCU defines and
> >> the code under !CONFIG_SRCU.
> >
> > The last time somebody tried to do this there was a pushback due to
> > kernel tinyfication. So this should really give some numbers about the
> > code size increase. Also why can't we make this depend on MMU. Is
> > anybody else than the reclaim asking for unconditional SRCU usage?
>
> I don't know one. The size numbers (sparc64) are:
>
> $ size image.srcu.disabled
> text data bss dec hex filename
> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
> $ size image.srcu.enabled
> text data bss dec hex filename
> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
> The difference is: 15158625-15116156 = 42469 ~41Kb

41k is a *substantial* size increase. However, can you compare
tinyconfig with and without this patch? That may have a smaller change.

2018-08-08 16:24:15

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On 08.08.2018 19:13, Josh Triplett wrote:
> On Wed, Aug 08, 2018 at 01:17:44PM +0300, Kirill Tkhai wrote:
>> On 08.08.2018 10:20, Michal Hocko wrote:
>>> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
>>>> This patch kills all CONFIG_SRCU defines and
>>>> the code under !CONFIG_SRCU.
>>>
>>> The last time somebody tried to do this there was a pushback due to
>>> kernel tinyfication. So this should really give some numbers about the
>>> code size increase. Also why can't we make this depend on MMU. Is
>>> anybody else than the reclaim asking for unconditional SRCU usage?
>>
>> I don't know one. The size numbers (sparc64) are:
>>
>> $ size image.srcu.disabled
>> text data bss dec hex filename
>> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
>> $ size image.srcu.enabled
>> text data bss dec hex filename
>> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
>> The difference is: 15158625-15116156 = 42469 ~41Kb
>
> 41k is a *substantial* size increase. However, can you compare
> tinyconfig with and without this patch? That may have a smaller change.

$ size image.srcu.disabled
text data bss dec hex filename
1105900 195456 63232 1364588 14d26c image.srcu.disabled

$ size image.srcu.enabled
text data bss dec hex filename
1106960 195528 63232 1365720 14d6d8 image.srcu.enabled

1365720-1364588 = 1132 ~ 1Kb

2018-08-08 16:32:07

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On 08.08.2018 19:23, Kirill Tkhai wrote:
> On 08.08.2018 19:13, Josh Triplett wrote:
>> On Wed, Aug 08, 2018 at 01:17:44PM +0300, Kirill Tkhai wrote:
>>> On 08.08.2018 10:20, Michal Hocko wrote:
>>>> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
>>>>> This patch kills all CONFIG_SRCU defines and
>>>>> the code under !CONFIG_SRCU.
>>>>
>>>> The last time somebody tried to do this there was a pushback due to
>>>> kernel tinyfication. So this should really give some numbers about the
>>>> code size increase. Also why can't we make this depend on MMU. Is
>>>> anybody else than the reclaim asking for unconditional SRCU usage?
>>>
>>> I don't know one. The size numbers (sparc64) are:
>>>
>>> $ size image.srcu.disabled
>>> text data bss dec hex filename
>>> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
>>> $ size image.srcu.enabled
>>> text data bss dec hex filename
>>> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
>>> The difference is: 15158625-15116156 = 42469 ~41Kb
>>
>> 41k is a *substantial* size increase. However, can you compare
>> tinyconfig with and without this patch? That may have a smaller change.
>
> $ size image.srcu.disabled
> text data bss dec hex filename
> 1105900 195456 63232 1364588 14d26c image.srcu.disabled
>
> $ size image.srcu.enabled
> text data bss dec hex filename
> 1106960 195528 63232 1365720 14d6d8 image.srcu.enabled
>
> 1365720-1364588 = 1132 ~ 1Kb

1Kb is not a huge size. It looks like a small price for having a single
generic code path (currently some places have CONFIG_SRCU and !CONFIG_SRCU
variants, e.g. drivers/base/core.c). What do you think?

2018-08-08 18:04:20

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Wed, Aug 08, 2018 at 07:30:13PM +0300, Kirill Tkhai wrote:
> On 08.08.2018 19:23, Kirill Tkhai wrote:
> > On 08.08.2018 19:13, Josh Triplett wrote:
> >> On Wed, Aug 08, 2018 at 01:17:44PM +0300, Kirill Tkhai wrote:
> >>> On 08.08.2018 10:20, Michal Hocko wrote:
> >>>> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> >>>>> This patch kills all CONFIG_SRCU defines and
> >>>>> the code under !CONFIG_SRCU.
> >>>>
> >>>> The last time somebody tried to do this there was a pushback due to
> >>>> kernel tinyfication. So this should really give some numbers about the
> >>>> code size increase. Also why can't we make this depend on MMU. Is
> >>>> anybody else than the reclaim asking for unconditional SRCU usage?
> >>>
> >>> I don't know one. The size numbers (sparc64) are:
> >>>
> >>> $ size image.srcu.disabled
> >>> text data bss dec hex filename
> >>> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
> >>> $ size image.srcu.enabled
> >>> text data bss dec hex filename
> >>> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
> >>> The difference is: 15158625-15116156 = 42469 ~41Kb
> >>
> >> 41k is a *substantial* size increase. However, can you compare
> >> tinyconfig with and without this patch? That may have a smaller change.
> >
> > $ size image.srcu.disabled
> > text data bss dec hex filename
> > 1105900 195456 63232 1364588 14d26c image.srcu.disabled
> >
> > $ size image.srcu.enabled
> > text data bss dec hex filename
> > 1106960 195528 63232 1365720 14d6d8 image.srcu.enabled
> >
> > 1365720-1364588 = 1132 ~ 1Kb
>
> 1Kb is not huge size. It looks as not a big price for writing generic code
> for only case (now some places have CONFIG_SRCU and !CONFIG_SRCU variants,
> e.g. drivers/base/core.c). What do you think?

That's a little more reasonable than 41k, likely because of
CONFIG_TINY_SRCU. That's still not ideal, though. And as far as I can
tell, the *only* two pieces of core code that use SRCU are
drivers/base/core.c and kernel/notifier.c, and the latter is exclusively
code to use notifiers with SRCU, not notifiers wanting to use SRCU
themselves. So, as far as I can tell, this would really just save a
couple of small #ifdef sections in drivers/base/core.c, and I think
those #ifdef sections could be simplified even further. That doesn't
seem worth it at all.

2018-08-08 21:32:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Wed, Aug 08, 2018 at 12:27:34PM +0200, Michal Hocko wrote:
> [CC Josh - the whole series is
> http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain]
>
> On Wed 08-08-18 13:17:44, Kirill Tkhai wrote:
> > On 08.08.2018 10:20, Michal Hocko wrote:
> > > On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> > >> This patch kills all CONFIG_SRCU defines and
> > >> the code under !CONFIG_SRCU.
> > >
> > > The last time somebody tried to do this there was a pushback due to
> > > kernel tinyfication. So this should really give some numbers about the
> > > code size increase. Also why can't we make this depend on MMU. Is
> > > anybody else than the reclaim asking for unconditional SRCU usage?
> >
> > I don't know one. The size numbers (sparc64) are:
> >
> > $ size image.srcu.disabled
> > text data bss dec hex filename
> > 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
> > $ size image.srcu.enabled
> > text data bss dec hex filename
> > 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
> > The difference is: 15158625-15116156 = 42469 ~41Kb
> >
> > Please, see the measurement details to my answer to Stephen.
> >
> > > Btw. I totaly agree with Steven. This is a very poor changelog. It is
> > > trivial to see what the patch does but it is far from clear why it is
> > > doing that and why we cannot go other ways.
> > We possibly can go another way, and there is comment to [2/10] about this.
> > Percpu rwsem may be used instead, the only thing, it is worse, is it will
> > make shrink_slab() wait unregistering shrinkers, while srcu-based
> > implementation does not require this.
>
> Well, if unregisterring doesn't do anything subtle - e.g. an allocation
> or take locks which depend on allocation - and we can guarantee that
> then blocking shrink_slab shouldn't be a big deal.

unregister_shrinker() already blocks shrink_slab - taking a rwsem in
write mode blocks all readers - so using a per-cpu rwsem doesn't
introduce anything new or unexpected. I'd like to see numbers of the
different methods before anything else.

IMO, the big deal is that the split unregister mechanism seems to
imply superblock shrinkers can be called during sb teardown or
/after/ the filesystem has been torn down in memory (i.e. after
->put_super() is called). That's a change of behaviour, but it's
left to the filesystem to detect and handle that condition. That's
exceedingly subtle and looks like a recipe for disaster to me. I
note that XFS hasn't been updated to detect and avoid this landmine.

And, FWIW, filesystems with multiple shrinkers (e.g. XFS has 3 per
mount) will take the SRCU penalty multiple times during unmount, and
potentially be exposed to multiple different "use during/after
teardown" race conditions.

> It is subtle though.
> Maybe subtle enough to make unconditional SRCU worth it. This all should
> be in the changelog though.

IMO, we've had enough recent bugs to deal with from shrinkers being
called before the filesystem is set up and from trying to handle
allocation errors during setup. Do we really want to make shrinker
shutdown just as prone to mismanagement and subtle, hard to hit
bugs? I don't think we do - unmount is simply not a critical
performance path.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-08-08 23:03:46

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Wed, Aug 8, 2018 at 11:02 AM Josh Triplett <[email protected]> wrote:
>
> On Wed, Aug 08, 2018 at 07:30:13PM +0300, Kirill Tkhai wrote:
> > On 08.08.2018 19:23, Kirill Tkhai wrote:
> > > On 08.08.2018 19:13, Josh Triplett wrote:
> > >> On Wed, Aug 08, 2018 at 01:17:44PM +0300, Kirill Tkhai wrote:
> > >>> On 08.08.2018 10:20, Michal Hocko wrote:
> > >>>> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> > >>>>> This patch kills all CONFIG_SRCU defines and
> > >>>>> the code under !CONFIG_SRCU.
> > >>>>
> > >>>> The last time somebody tried to do this there was a pushback due to
> > >>>> kernel tinyfication. So this should really give some numbers about the
> > >>>> code size increase. Also why can't we make this depend on MMU. Is
> > >>>> anybody else than the reclaim asking for unconditional SRCU usage?
> > >>>
> > >>> I don't know one. The size numbers (sparc64) are:
> > >>>
> > >>> $ size image.srcu.disabled
> > >>> text data bss dec hex filename
> > >>> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
> > >>> $ size image.srcu.enabled
> > >>> text data bss dec hex filename
> > >>> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
> > >>> The difference is: 15158625-15116156 = 42469 ~41Kb
> > >>
> > >> 41k is a *substantial* size increase. However, can you compare
> > >> tinyconfig with and without this patch? That may have a smaller change.
> > >
> > > $ size image.srcu.disabled
> > > text data bss dec hex filename
> > > 1105900 195456 63232 1364588 14d26c image.srcu.disabled
> > >
> > > $ size image.srcu.enabled
> > > text data bss dec hex filename
> > > 1106960 195528 63232 1365720 14d6d8 image.srcu.enabled
> > >
> > > 1365720-1364588 = 1132 ~ 1Kb
> >
> > 1Kb is not huge size. It looks as not a big price for writing generic code
> > for only case (now some places have CONFIG_SRCU and !CONFIG_SRCU variants,
> > e.g. drivers/base/core.c). What do you think?
>
> That's a little more reasonable than 41k, likely because of
> CONFIG_TINY_SRCU. That's still not ideal, though. And as far as I can
> tell, the *only* two pieces of core code that use SRCU are
> drivers/base/core.c and kernel/notifier.c, and the latter is exclusively
> code to use notifiers with SRCU, not notifiers wanting to use SRCU
> themselves. So, as far as I can tell, this would really just save a
> couple of small #ifdef sections in drivers/base/core.c, and I think
> those #ifdef sections could be simplified even further. That doesn't
> seem worth it at all.

Hi Josh, the motivation behind enabling SRCU is not to simplify the
code in drivers/base/core.c but rather not to introduce similar ifdefs
in mm/vmscan.c for shrinker traversals.

Shakeel

2018-08-08 23:11:24

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Wed, Aug 08, 2018 at 04:02:29PM -0700, Shakeel Butt wrote:
> On Wed, Aug 8, 2018 at 11:02 AM Josh Triplett <[email protected]> wrote:
> >
> > On Wed, Aug 08, 2018 at 07:30:13PM +0300, Kirill Tkhai wrote:
> > > On 08.08.2018 19:23, Kirill Tkhai wrote:
> > > > On 08.08.2018 19:13, Josh Triplett wrote:
> > > >> On Wed, Aug 08, 2018 at 01:17:44PM +0300, Kirill Tkhai wrote:
> > > >>> On 08.08.2018 10:20, Michal Hocko wrote:
> > > >>>> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
> > > >>>>> This patch kills all CONFIG_SRCU defines and
> > > >>>>> the code under !CONFIG_SRCU.
> > > >>>>
> > > >>>> The last time somebody tried to do this there was a pushback due to
> > > >>>> kernel tinyfication. So this should really give some numbers about the
> > > >>>> code size increase. Also why can't we make this depend on MMU. Is
> > > >>>> anybody else than the reclaim asking for unconditional SRCU usage?
> > > >>>
> > > >>> I don't know one. The size numbers (sparc64) are:
> > > >>>
> > > >>> $ size image.srcu.disabled
> > > >>> text data bss dec hex filename
> > > >>> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
> > > >>> $ size image.srcu.enabled
> > > >>> text data bss dec hex filename
> > > >>> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
> > > >>> The difference is: 15158625-15116156 = 42469 ~41Kb
> > > >>
> > > >> 41k is a *substantial* size increase. However, can you compare
> > > >> tinyconfig with and without this patch? That may have a smaller change.
> > > >
> > > > $ size image.srcu.disabled
> > > > text data bss dec hex filename
> > > > 1105900 195456 63232 1364588 14d26c image.srcu.disabled
> > > >
> > > > $ size image.srcu.enabled
> > > > text data bss dec hex filename
> > > > 1106960 195528 63232 1365720 14d6d8 image.srcu.enabled
> > > >
> > > > 1365720-1364588 = 1132 ~ 1Kb
> > >
> > > 1Kb is not huge size. It looks as not a big price for writing generic code
> > > for only case (now some places have CONFIG_SRCU and !CONFIG_SRCU variants,
> > > e.g. drivers/base/core.c). What do you think?
> >
> > That's a little more reasonable than 41k, likely because of
> > CONFIG_TINY_SRCU. That's still not ideal, though. And as far as I can
> > tell, the *only* two pieces of core code that use SRCU are
> > drivers/base/core.c and kernel/notifier.c, and the latter is exclusively
> > code to use notifiers with SRCU, not notifiers wanting to use SRCU
> > themselves. So, as far as I can tell, this would really just save a
> > couple of small #ifdef sections in drivers/base/core.c, and I think
> > those #ifdef sections could be simplified even further. That doesn't
> > seem worth it at all.
>
> Hi Josh, the motivation behind enabling SRCU is not to simplify the
> code in drivers/base/core.c but rather not to introduce similar ifdefs
> in mm/vmscan.c for shrinker traversals.

Leaving aside the comment someone made about sticking with rwsem for
this, I honestly hope that someday the shrinker is optional too. :)

2018-08-09 00:11:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Thu, Aug 09, 2018 at 07:31:25AM +1000, Dave Chinner wrote:
> IMO, we've had enough recent bugs to deal with from shrinkers being
> called before the filesystem is set up and from trying to handle
> allocation errors during setup. Do we really want to make shrinker
> shutdown just as prone to mismanagement and subtle, hard to hit
> bugs? I don't think we do - unmount is simply not a critical
> performance path.

It's never been performance critical for me, but I'm not so sure that
there aren't container workloads which unmount filesystems multiple
times per second.

2018-08-09 07:15:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

On Wed 08-08-18 16:20:54, Kirill Tkhai wrote:
> [Added two more places needed srcu_dereference(). All ->shrinker_map
> dereferences must be under SRCU, and this v2 adds missed in previous]
>
> The patch makes shrinker list and shrinker_idr SRCU-safe
> for readers. This requires synchronize_srcu() on finalize
> stage unregistering stage, which waits till all parallel
> shrink_slab() are finished
>
> Note, that patch removes rwsem_is_contended() checks from
> the code, and this does not result in delays during
> registration, since there is no waiting at all. Unregistration
> case may be optimized by splitting unregister_shrinker()
> in tho stages, and this is made in next patches.
>
> Also, keep in mind, that in case of SRCU is not allowed
> to make unconditional (which is done in previous patch),
> it is possible to use percpu_rw_semaphore instead of it.
> percpu_down_read() will be used in shrink_slab_memcg()
> and in shrink_slab(), and consecutive calls
>
> percpu_down_write(percpu_rwsem);
> percpu_up_write(percpu_rwsem);
>
> will be used instead of synchronize_srcu().

An obvious question. Why didn't you go that way? What are pros/cons of
both approaches?
--
Michal Hocko
SUSE Labs

2018-08-09 07:46:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On Wed, Aug 08, 2018 at 05:07:08PM -0700, Matthew Wilcox wrote:
> On Thu, Aug 09, 2018 at 07:31:25AM +1000, Dave Chinner wrote:
> > IMO, we've had enough recent bugs to deal with from shrinkers being
> > called before the filesystem is set up and from trying to handle
> > allocation errors during setup. Do we really want to make shrinker
> > shutdown just as prone to mismanagement and subtle, hard to hit
> > bugs? I don't think we do - unmount is simply not a critical
> > performance path.
>
> It's never been performance critical for me, but I'm not so sure that
> there aren't container workloads which unmount filesystems multiple
> times per second.

What? Why would they do that? Who cares about tear-down speeds? Start
up speeds I can kind of understand...

2018-08-09 09:23:17

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

On 09.08.2018 10:14, Michal Hocko wrote:
> On Wed 08-08-18 16:20:54, Kirill Tkhai wrote:
>> [Added two more places needed srcu_dereference(). All ->shrinker_map
>> dereferences must be under SRCU, and this v2 adds missed in previous]
>>
>> The patch makes shrinker list and shrinker_idr SRCU-safe
>> for readers. This requires synchronize_srcu() on finalize
>> stage unregistering stage, which waits till all parallel
>> shrink_slab() are finished
>>
>> Note, that patch removes rwsem_is_contended() checks from
>> the code, and this does not result in delays during
>> registration, since there is no waiting at all. Unregistration
>> case may be optimized by splitting unregister_shrinker()
>> in tho stages, and this is made in next patches.
>>
>> Also, keep in mind, that in case of SRCU is not allowed
>> to make unconditional (which is done in previous patch),
>> it is possible to use percpu_rw_semaphore instead of it.
>> percpu_down_read() will be used in shrink_slab_memcg()
>> and in shrink_slab(), and consecutive calls
>>
>> percpu_down_write(percpu_rwsem);
>> percpu_up_write(percpu_rwsem);
>>
>> will be used instead of synchronize_srcu().
>
> An obvious question. Why didn't you go that way? What are pros/cons of
> both approaches?

1) After percpu_rw_semaphore is introduced, shrink_slab() will be unable
to take percpu_down_read_trylock() successfully for longer periods of time
than it is now:

[cpu0]                                 [cpu1]
{un,}register_shrinker();              shrink_slab()
  percpu_down_write();                   percpu_down_read_trylock() -> fail
  synchronize_rcu();  -> in some periods very slow on big SMP ...
                                       shrink_slab()
                                         percpu_down_read_trylock() -> fail

Also, register_shrinker() and unregister_shrinker() will become slower for the same reason.
Unlike unregister_shrinker(), register_shrinker() can't be made asynchronous/delayed, so
plain mount() performance will get worse.

Possibly, both of these can be solved by using a percpu_rw_semaphore and an rw_semaphore
together: shrink_slab() may fall back to the rw_semaphore in case the percpu_rw_semaphore
can't be taken:

shrink_slab()
{
	bool percpu = true;

	if (!percpu_down_read_trylock()) {
		if (!down_read_trylock())
			return 0;
		percpu = false;
	}

	shrinker = idr_find();
	...

	if (percpu)
		percpu_up_read();
	else
		up_read();
}

register_shrinker()
{
	down_write();
	idr_alloc();
	up_write();
}

unregister_shrinker()
{
	percpu_down_write();
	down_write();
	idr_remove();
	up_write();
	percpu_up_write();
}

But: a) on a big machine this may degenerate into always taking down_read_trylock(),
like we have now; b) I'm not sure unlocked idr_find() is safe in parallel with
idr_alloc(); maybe something else is needed around it (I just haven't investigated this).

All the above are cons. The pro is not having to enable SRCU.

2) SRCU. The pros: none of the above problems; we get a completely lockless and
scalable shrink_slab(). We also get the possibility to avoid unregistering delays,
like I did for the superblock shrinker. There will be full scalability.
The con is having to enable SRCU.

Kirill

2018-08-09 10:23:48

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH RFC 01/10] rcu: Make CONFIG_SRCU unconditionally enabled

On 09.08.2018 00:31, Dave Chinner wrote:
> On Wed, Aug 08, 2018 at 12:27:34PM +0200, Michal Hocko wrote:
>> [CC Josh - the whole series is
>> http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain]
>>
>> On Wed 08-08-18 13:17:44, Kirill Tkhai wrote:
>>> On 08.08.2018 10:20, Michal Hocko wrote:
>>>> On Tue 07-08-18 18:37:36, Kirill Tkhai wrote:
>>>>> This patch kills all CONFIG_SRCU defines and
>>>>> the code under !CONFIG_SRCU.
>>>>
>>>> The last time somebody tried to do this there was a pushback due to
>>>> kernel tinyfication. So this should really give some numbers about the
>>>> code size increase. Also why can't we make this depend on MMU. Is
>>>> anybody else than the reclaim asking for unconditional SRCU usage?
>>>
>>> I don't know one. The size numbers (sparc64) are:
>>>
>>> $ size image.srcu.disabled
>>> text data bss dec hex filename
>>> 5117546 8030506 1968104 15116156 e6a77c image.srcu.disabled
>>> $ size image.srcu.enabled
>>> text data bss dec hex filename
>>> 5126175 8064346 1968104 15158625 e74d61 image.srcu.enabled
>>> The difference is: 15158625-15116156 = 42469 ~41Kb
>>>
>>> Please, see the measurement details to my answer to Stephen.
>>>
>>>> Btw. I totaly agree with Steven. This is a very poor changelog. It is
>>>> trivial to see what the patch does but it is far from clear why it is
>>>> doing that and why we cannot go other ways.
>>> We possibly can go another way, and there is comment to [2/10] about this.
>>> Percpu rwsem may be used instead, the only thing, it is worse, is it will
>>> make shrink_slab() wait unregistering shrinkers, while srcu-based
>>> implementation does not require this.
>>
>> Well, if unregisterring doesn't do anything subtle - e.g. an allocation
>> or take locks which depend on allocation - and we can guarantee that
>> then blocking shrink_slab shouldn't be a big deal.
>
> unregister_shrinker() already blocks shrink_slab - taking a rwsem in
> write mode blocks all readers - so using a per-cpu rwsem doesn't
> introduce anything new or unexpected. I'd like to see numbers of the
> different methods before anything else.

The difference is that percpu_rw_semaphore makes readers wait till an RCU
grace period is finished. Sometimes this takes an unpredictable time on
big machines with many CPUs, which is not good.

> IMO, the big deal is that the split unregister mechanism seems to
> imply superblock shrinkers can be called during sb teardown or
> /after/ the filesystem has been torn down in memory (i.e. after
> ->put_super() is called). That's a change of behaviour, but it's
> left to the filesystem to detect and handle that condition. That's
> exceedingly subtle and looks like a recipe for disaster to me. I
> note that XFS hasn't been updated to detect and avoid this landmine.
>
> And, FWIW, filesystems with multiple shrinkers (e.g. XFS as 3 per
> mount) will take the SCRU penalty multiple times during unmount, and
> potentially be exposed to multiple different "use during/after
> teardown" race conditions.
>
>> It is subtle though.
>> Maybe subtle enough to make unconditional SRCU worth it. This all should
>> be in the changelog though.
>
> IMO, we've had enough recent bugs to deal with from shrinkers being
> called before the filesystem is set up and from trying to handle
> allocation errors during setup. Do we really want to make shrinker
> shutdown just as prone to mismanagement and subtle, hard to hit
> bugs? I don't think we do - unmount is simply not a critical
> performance path.

There are all sorts of situations; people use Linux however they want.
Imagine you want to reboot an NFS server, but first have to log in to the
clients and umount the filesystems over ssh, and time is critical. Something
like this. I believe there are many such examples; people need this.

2018-08-09 10:40:46

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

On 2018/08/09 18:21, Kirill Tkhai wrote:
> 2)SRCU. Pros are there are no the above problems; we will have completely unlocked and
> scalable shrink_slab(). We will also have a possibility to avoid unregistering delays,
> like I did for superblock shrinker. There will be full scalability.
> Cons is enabling SRCU.
>

How can unregistering delays be avoided? Since you traverse all shrinkers
using one shrinker_srcu, synchronize_srcu(&shrinker_srcu) will block
unregistering threads until the longest in-flight srcu_read_lock() user calls
srcu_read_unlock().

Unless you use a per-shrinker counter like below, I wonder whether
unregistering delays can be avoided...

https://marc.info/?l=linux-mm&m=151060909613004
https://marc.info/?l=linux-mm&m=151060909713005
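
The idea in those links is, roughly, to pin each shrinker with its own
reference count instead of taking a global lock. A hedged reconstruction
follows; struct shrinker_ref and the function names are stand-ins for state
that would live in struct shrinker, not the exact code from those patches:

	#include <linux/atomic.h>
	#include <linux/completion.h>

	struct shrinker_ref {
		atomic_t nr_active;		/* set to 1 at register time */
		struct completion done;		/* completed when the count hits 0 */
	};

	/* shrink_slab() side: pin the shrinker, or skip it if it is going away. */
	static bool shrinker_tryget(struct shrinker_ref *ref)
	{
		return atomic_inc_not_zero(&ref->nr_active);
	}

	static void shrinker_put(struct shrinker_ref *ref)
	{
		if (atomic_dec_and_test(&ref->nr_active))
			complete(&ref->done);
	}

	/* Unregister side: drop the initial reference, wait for in-flight users. */
	static void shrinker_unregister_wait(struct shrinker_ref *ref)
	{
		if (!atomic_dec_and_test(&ref->nr_active))
			wait_for_completion(&ref->done);
		/* No shrink_slab() caller can hold this shrinker past this point. */
	}

Each do_shrink_slab() call would then be bracketed by shrinker_tryget()/
shrinker_put(), so unregistration waits only for users of that one shrinker
rather than for a global grace period.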

2018-08-09 10:59:39

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

On 09.08.2018 13:37, Tetsuo Handa wrote:
> On 2018/08/09 18:21, Kirill Tkhai wrote:
>> 2)SRCU. Pros are there are no the above problems; we will have completely unlocked and
>> scalable shrink_slab(). We will also have a possibility to avoid unregistering delays,
>> like I did for superblock shrinker. There will be full scalability.
>> Cons is enabling SRCU.
>>
>
> How unregistering delays can be avoided? Since you traverse all shrinkers
> using one shrinker_srcu, synchronize_srcu(&shrinker_srcu) will block
> unregistering threads until longest inflight srcu_read_lock() user calls
> srcu_read_unlock().

Yes, but we can do synchronize_srcu() from an async work item, like I did in
the further patches. The only thing we need is to teach
shrinker::count_objects() and shrinker::scan_objects() to be safe to call on
a shrinker that is being unregistered. The next patches do this for the
superblock shrinker.
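
A hedged sketch of that shape, using the series'
unregister_shrinker_delayed_{initiate,finalize}() naming and reduced to the
shrinker_list/SRCU part (memcg/idr teardown omitted):

	/* Stage 1, called synchronously at unmount start: unlink the shrinker
	 * so that no new shrink_slab() walker can find it. Cheap: no wait. */
	void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
	{
		down_write(&shrinker_rwsem);
		list_del_rcu(&shrinker->list);
		up_write(&shrinker_rwsem);
	}

	/* Stage 2, called later from an async work item: wait out readers that
	 * raced with stage 1, then free the shrinker's memory. */
	void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
	{
		synchronize_srcu(&shrinker_srcu);
		kfree(shrinker->nr_deferred);
		shrinker->nr_deferred = NULL;
	}

Between the two stages a straggling reader may still enter
->count_objects()/->scan_objects(), which is exactly why those callbacks must
tolerate a half-torn-down owner.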

> Unless you use per shrinker counter like below, I wonder whether
> unregistering delays can be avoided...
>
> https://marc.info/?l=linux-mm&m=151060909613004
> https://marc.info/?l=linux-mm&m=151060909713005

I'm afraid those atomic_{inc,dec}(&shrinker->nr_active) operations may
regularly bounce CPU cache lines between CPUs on some workloads. Also,
synchronize_rcu() is a heavy delay as well.

2018-08-09 11:41:36

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH RFC v2 02/10] mm: Make shrink_slab() lockless

On 09.08.2018 10:14, Michal Hocko wrote:
> On Wed 08-08-18 16:20:54, Kirill Tkhai wrote:
>> [Added two more places needed srcu_dereference(). All ->shrinker_map
>> dereferences must be under SRCU, and this v2 adds missed in previous]
>>
>> The patch makes shrinker list and shrinker_idr SRCU-safe
>> for readers. This requires synchronize_srcu() on finalize
>> stage unregistering stage, which waits till all parallel
>> shrink_slab() are finished
>>
>> Note, that patch removes rwsem_is_contended() checks from
>> the code, and this does not result in delays during
>> registration, since there is no waiting at all. Unregistration
>> case may be optimized by splitting unregister_shrinker()
>> in tho stages, and this is made in next patches.
>>
>> Also, keep in mind, that in case of SRCU is not allowed
>> to make unconditional (which is done in previous patch),
>> it is possible to use percpu_rw_semaphore instead of it.
>> percpu_down_read() will be used in shrink_slab_memcg()
>> and in shrink_slab(), and consecutive calls
>>
>> percpu_down_write(percpu_rwsem);
>> percpu_up_write(percpu_rwsem);
>>
>> will be used instead of synchronize_srcu().
>
> An obvious question. Why didn't you go that way? What are pros/cons of
> both approaches?

A percpu_rw_semaphore-based variant looks something like this:

commit d581d4ad7ecf
Author: Kirill Tkhai <[email protected]>
Date: Thu Aug 9 14:21:12 2018 +0300

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0ff97e860759..fe8693775e33 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -168,6 +168,7 @@ unsigned long vm_total_pages;

static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_STATIC_PERCPU_RWSEM(shrinker_percpu_rwsem);

#ifdef CONFIG_MEMCG_KMEM

@@ -198,7 +199,10 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
goto unlock;

if (id >= shrinker_nr_max) {
- if (memcg_expand_shrinker_maps(id)) {
+ percpu_down_write(&shrinker_percpu_rwsem);
+ ret = memcg_expand_shrinker_maps(id);
+ percpu_up_write(&shrinker_percpu_rwsem);
+ if (ret) {
idr_remove(&shrinker_idr, id);
goto unlock;
}
@@ -406,7 +410,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
#ifdef CONFIG_MEMCG_KMEM
idr_replace(&shrinker_idr, shrinker, shrinker->id);
#endif
@@ -434,8 +438,14 @@ void unregister_shrinker(struct shrinker *shrinker)
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
up_write(&shrinker_rwsem);
+
+ synchronize_rcu();
+
+ percpu_down_write(&shrinker_percpu_rwsem);
+ percpu_up_write(&shrinker_percpu_rwsem);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -574,11 +584,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
return 0;

- if (!down_read_trylock(&shrinker_rwsem))
+ if (!percpu_down_read_trylock(&shrinker_percpu_rwsem))
return 0;

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
- true);
+ true /* shrinker_percpu_rwsem */);
if (unlikely(!map))
goto unlock;

@@ -590,7 +600,22 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
};
struct shrinker *shrinker;

+ /*
+ * See shutdown sequence in unregister_shrinker().
+ * RCU allows us to iterate the IDR locklessly (this
+ * is the way to synchronize with the IDR being changed
+ * by idr_alloc()).
+ *
+ * If we see a shrinker pointer under RCU, this means
+ * synchronize_rcu() in unregister_shrinker() has not
+ * finished yet. Then we unlock RCU, and synchronize_rcu()
+ * can complete, but unregister_shrinker() can't proceed
+ * before we unlock shrinker_percpu_rwsem.
+ */
+ rcu_read_lock();
shrinker = idr_find(&shrinker_idr, i);
+ rcu_read_unlock();
+
if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
if (!shrinker)
clear_bit(i, map->map);
@@ -624,13 +649,13 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
}
freed += ret;

- if (rwsem_is_contended(&shrinker_rwsem)) {
+ if (!rcu_sync_is_idle(&shrinker_percpu_rwsem.rss)) {
freed = freed ? : 1;
break;
}
}
unlock:
- up_read(&shrinker_rwsem);
+ percpu_up_read(&shrinker_percpu_rwsem);
return freed;
}
#else /* CONFIG_MEMCG_KMEM */
@@ -672,15 +697,17 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

- if (!down_read_trylock(&shrinker_rwsem))
+ if (!percpu_down_read_trylock(&shrinker_percpu_rwsem))
goto out;

- list_for_each_entry(shrinker, &shrinker_list, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
.memcg = memcg,
};
+ rcu_read_unlock();

ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
@@ -691,13 +718,16 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
* prevent the regsitration from being stalled for long periods
* by parallel ongoing shrinking.
*/
- if (rwsem_is_contended(&shrinker_rwsem)) {
+ if (!rcu_sync_is_idle(&shrinker_percpu_rwsem.rss)) {
freed = freed ? : 1;
break;
}
+
+ rcu_read_lock();
}
+ rcu_read_unlock();

- up_read(&shrinker_rwsem);
+ percpu_up_read(&shrinker_percpu_rwsem);
out:
cond_resched();
return freed;