2019-07-17 20:26:12

by Waiman Long

[permalink] [raw]
Subject: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
file to shrink the slab by flushing out all the per-cpu slabs and free
slabs in partial lists. This can be useful to squeeze out a bit more memory
under extreme condition as well as making the active object counts in
/proc/slabinfo more accurate.

This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
if memcg sysfs is turned on, it is too cumbersome and impractical to
manage all those per-memcg sysfs files in a real production system.

So there is no practical way to shrink memcg caches. Fix this by
enabling a proper write to the shrink sysfs file of the root cache
to scan all the available memcg caches and shrink them as well. For a
non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is
on), only that cache will be shrunk when written.

On a 2-socket 64-core 256-thread arm64 system with 64k page after
a parallel kernel build, the the amount of memory occupied by slabs
before shrinking slabs were:

# grep task_struct /proc/slabinfo
task_struct 53137 53192 4288 61 4 : tunables 0 0
0 : slabdata 872 872 0
# grep "^S[lRU]" /proc/meminfo
Slab: 3936832 kB
SReclaimable: 399104 kB
SUnreclaim: 3537728 kB

After shrinking slabs:

# grep "^S[lRU]" /proc/meminfo
Slab: 1356288 kB
SReclaimable: 263296 kB
SUnreclaim: 1092992 kB
# grep task_struct /proc/slabinfo
task_struct 2764 6832 4288 61 4 : tunables 0 0
0 : slabdata 112 112 0

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
---
Documentation/ABI/testing/sysfs-kernel-slab | 12 ++++---
mm/slab.h | 1 +
mm/slab_common.c | 37 +++++++++++++++++++++
mm/slub.c | 2 +-
4 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..94ffd47fc8d7 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -429,10 +429,14 @@ KernelVersion: 2.6.22
Contact: Pekka Enberg <[email protected]>,
Christoph Lameter <[email protected]>
Description:
- The shrink file is written when memory should be reclaimed from
- a cache. Empty partial slabs are freed and the partial list is
- sorted so the slabs with the fewest available objects are used
- first.
+ The shrink file is used to enable some unused slab cache
+ memory to be reclaimed from a cache. Empty per-cpu
+ or partial slabs are freed and the partial list is
+ sorted so the slabs with the fewest available objects
+ are used first. It only accepts a value of "1" on
+ write for shrinking the cache. Other input values are
+ considered invalid. If it is a root cache, all the
+ child memcg caches will also be shrunk, if available.

What: /sys/kernel/slab/cache/slab_size
Date: May 2007
diff --git a/mm/slab.h b/mm/slab.h
index 9057b8056b07..5bf615cb3f99 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
void __kmemcg_cache_deactivate(struct kmem_cache *s);
void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
void slab_kmem_cache_release(struct kmem_cache *);
+void kmem_cache_shrink_all(struct kmem_cache *s);

struct seq_file;
struct file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..6491c3a41805 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -981,6 +981,43 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
}
EXPORT_SYMBOL(kmem_cache_shrink);

+/**
+ * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * @s: The cache pointer
+ */
+void kmem_cache_shrink_all(struct kmem_cache *s)
+{
+ struct kmem_cache *c;
+
+ if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
+ kmem_cache_shrink(s);
+ return;
+ }
+
+ get_online_cpus();
+ get_online_mems();
+ kasan_cache_shrink(s);
+ __kmem_cache_shrink(s);
+
+ /*
+ * We have to take the slab_mutex to protect from the memcg list
+ * modification.
+ */
+ mutex_lock(&slab_mutex);
+ for_each_memcg_cache(c, s) {
+ /*
+ * Don't need to shrink deactivated memcg caches.
+ */
+ if (s->flags & SLAB_DEACTIVATED)
+ continue;
+ kasan_cache_shrink(c);
+ __kmem_cache_shrink(c);
+ }
+ mutex_unlock(&slab_mutex);
+ put_online_mems();
+ put_online_cpus();
+}
+
bool slab_is_available(void)
{
return slab_state >= UP;
diff --git a/mm/slub.c b/mm/slub.c
index e6c030e47364..9736eb10dcb8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5294,7 +5294,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
const char *buf, size_t length)
{
if (buf[0] == '1')
- kmem_cache_shrink(s);
+ kmem_cache_shrink_all(s);
else
return -EINVAL;
return length;
--
2.18.1


Subject: Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

On Wed, 17 Jul 2019, Waiman Long wrote:

> Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
> file to shrink the slab by flushing out all the per-cpu slabs and free
> slabs in partial lists. This can be useful to squeeze out a bit more memory
> under extreme condition as well as making the active object counts in
> /proc/slabinfo more accurate.

Acked-by: Christoph Lameter <[email protected]>

> # grep task_struct /proc/slabinfo
> task_struct 53137 53192 4288 61 4 : tunables 0 0
> 0 : slabdata 872 872 0
> # grep "^S[lRU]" /proc/meminfo
> Slab: 3936832 kB
> SReclaimable: 399104 kB
> SUnreclaim: 3537728 kB
>
> After shrinking slabs:
>
> # grep "^S[lRU]" /proc/meminfo
> Slab: 1356288 kB
> SReclaimable: 263296 kB
> SUnreclaim: 1092992 kB

Well another indicator that it may not be a good decision to replicate the
whole set of slabs for each memcg. Migrate the memcg ownership into the
objects may allow the use of the same slab cache. In particular together
with the slab migration patches this may be a viable way to reduce memory
consumption.

2019-07-18 17:06:00

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

On Thu, Jul 18, 2019 at 11:38:11AM +0000, Christopher Lameter wrote:
> On Wed, 17 Jul 2019, Waiman Long wrote:
>
> > Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
> > file to shrink the slab by flushing out all the per-cpu slabs and free
> > slabs in partial lists. This can be useful to squeeze out a bit more memory
> > under extreme condition as well as making the active object counts in
> > /proc/slabinfo more accurate.
>
> Acked-by: Christoph Lameter <[email protected]>
>
> > # grep task_struct /proc/slabinfo
> > task_struct 53137 53192 4288 61 4 : tunables 0 0
> > 0 : slabdata 872 872 0
> > # grep "^S[lRU]" /proc/meminfo
> > Slab: 3936832 kB
> > SReclaimable: 399104 kB
> > SUnreclaim: 3537728 kB
> >
> > After shrinking slabs:
> >
> > # grep "^S[lRU]" /proc/meminfo
> > Slab: 1356288 kB
> > SReclaimable: 263296 kB
> > SUnreclaim: 1092992 kB
>
> Well another indicator that it may not be a good decision to replicate the
> whole set of slabs for each memcg. Migrate the memcg ownership into the
> objects may allow the use of the same slab cache. In particular together
> with the slab migration patches this may be a viable way to reduce memory
> consumption.
>

Btw I'm working on an alternative solution. It's way too early to present
anything, but preliminary results are looking promising: slab memory usage
is decreased by 10-40% depending on the workload.

Thanks!

2019-07-19 06:21:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

On Wed 17-07-19 16:24:12, Waiman Long wrote:
> Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
> file to shrink the slab by flushing out all the per-cpu slabs and free
> slabs in partial lists. This can be useful to squeeze out a bit more memory
> under extreme condition as well as making the active object counts in
> /proc/slabinfo more accurate.
>
> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
> if memcg sysfs is turned on, it is too cumbersome and impractical to
> manage all those per-memcg sysfs files in a real production system.
>
> So there is no practical way to shrink memcg caches. Fix this by
> enabling a proper write to the shrink sysfs file of the root cache
> to scan all the available memcg caches and shrink them as well. For a
> non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is
> on), only that cache will be shrunk when written.

I would mention that memcg unawareness was an overlook more than
anything else. The interface is intended to shrink all pcp data of the
cache. The fact that we are using per-memcg internal caches is an
implementation detail.

> On a 2-socket 64-core 256-thread arm64 system with 64k page after
> a parallel kernel build, the the amount of memory occupied by slabs
> before shrinking slabs were:
>
> # grep task_struct /proc/slabinfo
> task_struct 53137 53192 4288 61 4 : tunables 0 0
> 0 : slabdata 872 872 0
> # grep "^S[lRU]" /proc/meminfo
> Slab: 3936832 kB
> SReclaimable: 399104 kB
> SUnreclaim: 3537728 kB
>
> After shrinking slabs:
>
> # grep "^S[lRU]" /proc/meminfo
> Slab: 1356288 kB
> SReclaimable: 263296 kB
> SUnreclaim: 1092992 kB
> # grep task_struct /proc/slabinfo
> task_struct 2764 6832 4288 61 4 : tunables 0 0
> 0 : slabdata 112 112 0

Now that you are touching the documentation I would just add a note that
shrinking might be expensive and block other slab operations so it
should be used with some care.

> Signed-off-by: Waiman Long <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>

The patch looks good to me. I do not feel qualified to give my ack but
it is definitely a change in the good direction.

Let's just be careful recommending people to use this as a workaround to
over caching and resulting tilted stats. That needs to be addressed
separately.

Thanks!

> ---
> Documentation/ABI/testing/sysfs-kernel-slab | 12 ++++---
> mm/slab.h | 1 +
> mm/slab_common.c | 37 +++++++++++++++++++++
> mm/slub.c | 2 +-
> 4 files changed, 47 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
> index 29601d93a1c2..94ffd47fc8d7 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-slab
> +++ b/Documentation/ABI/testing/sysfs-kernel-slab
> @@ -429,10 +429,14 @@ KernelVersion: 2.6.22
> Contact: Pekka Enberg <[email protected]>,
> Christoph Lameter <[email protected]>
> Description:
> - The shrink file is written when memory should be reclaimed from
> - a cache. Empty partial slabs are freed and the partial list is
> - sorted so the slabs with the fewest available objects are used
> - first.
> + The shrink file is used to enable some unused slab cache
> + memory to be reclaimed from a cache. Empty per-cpu
> + or partial slabs are freed and the partial list is
> + sorted so the slabs with the fewest available objects
> + are used first. It only accepts a value of "1" on
> + write for shrinking the cache. Other input values are
> + considered invalid. If it is a root cache, all the
> + child memcg caches will also be shrunk, if available.
>
> What: /sys/kernel/slab/cache/slab_size
> Date: May 2007
> diff --git a/mm/slab.h b/mm/slab.h
> index 9057b8056b07..5bf615cb3f99 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
> void __kmemcg_cache_deactivate(struct kmem_cache *s);
> void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
> void slab_kmem_cache_release(struct kmem_cache *);
> +void kmem_cache_shrink_all(struct kmem_cache *s);
>
> struct seq_file;
> struct file;
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 807490fe217a..6491c3a41805 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -981,6 +981,43 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
> }
> EXPORT_SYMBOL(kmem_cache_shrink);
>
> +/**
> + * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
> + * @s: The cache pointer
> + */
> +void kmem_cache_shrink_all(struct kmem_cache *s)
> +{
> + struct kmem_cache *c;
> +
> + if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
> + kmem_cache_shrink(s);
> + return;
> + }
> +
> + get_online_cpus();
> + get_online_mems();
> + kasan_cache_shrink(s);
> + __kmem_cache_shrink(s);
> +
> + /*
> + * We have to take the slab_mutex to protect from the memcg list
> + * modification.
> + */
> + mutex_lock(&slab_mutex);
> + for_each_memcg_cache(c, s) {
> + /*
> + * Don't need to shrink deactivated memcg caches.
> + */
> + if (s->flags & SLAB_DEACTIVATED)
> + continue;
> + kasan_cache_shrink(c);
> + __kmem_cache_shrink(c);
> + }
> + mutex_unlock(&slab_mutex);
> + put_online_mems();
> + put_online_cpus();
> +}
> +
> bool slab_is_available(void)
> {
> return slab_state >= UP;
> diff --git a/mm/slub.c b/mm/slub.c
> index e6c030e47364..9736eb10dcb8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5294,7 +5294,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
> const char *buf, size_t length)
> {
> if (buf[0] == '1')
> - kmem_cache_shrink(s);
> + kmem_cache_shrink_all(s);
> else
> return -EINVAL;
> return length;
> --
> 2.18.1

--
Michal Hocko
SUSE Labs

2019-07-19 14:11:28

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

On 7/19/19 2:20 AM, Michal Hocko wrote:
> On Wed 17-07-19 16:24:12, Waiman Long wrote:
>> Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
>> file to shrink the slab by flushing out all the per-cpu slabs and free
>> slabs in partial lists. This can be useful to squeeze out a bit more memory
>> under extreme condition as well as making the active object counts in
>> /proc/slabinfo more accurate.
>>
>> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
>> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
>> if memcg sysfs is turned on, it is too cumbersome and impractical to
>> manage all those per-memcg sysfs files in a real production system.
>>
>> So there is no practical way to shrink memcg caches. Fix this by
>> enabling a proper write to the shrink sysfs file of the root cache
>> to scan all the available memcg caches and shrink them as well. For a
>> non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is
>> on), only that cache will be shrunk when written.
> I would mention that memcg unawareness was an overlook more than
> anything else. The interface is intended to shrink all pcp data of the
> cache. The fact that we are using per-memcg internal caches is an
> implementation detail.
>
>> On a 2-socket 64-core 256-thread arm64 system with 64k page after
>> a parallel kernel build, the the amount of memory occupied by slabs
>> before shrinking slabs were:
>>
>> # grep task_struct /proc/slabinfo
>> task_struct 53137 53192 4288 61 4 : tunables 0 0
>> 0 : slabdata 872 872 0
>> # grep "^S[lRU]" /proc/meminfo
>> Slab: 3936832 kB
>> SReclaimable: 399104 kB
>> SUnreclaim: 3537728 kB
>>
>> After shrinking slabs:
>>
>> # grep "^S[lRU]" /proc/meminfo
>> Slab: 1356288 kB
>> SReclaimable: 263296 kB
>> SUnreclaim: 1092992 kB
>> # grep task_struct /proc/slabinfo
>> task_struct 2764 6832 4288 61 4 : tunables 0 0
>> 0 : slabdata 112 112 0
> Now that you are touching the documentation I would just add a note that
> shrinking might be expensive and block other slab operations so it
> should be used with some care.
>
Good point. I will update the patch to include such a note in the
documentation.

Thanks,
Longman