2022-12-20 18:49:14

by Roman Gushchin

Subject: [PATCH RFC 0/2] mm: kmem: optimize obj_cgroup pointer retrieval

This patchset improves the performance of get_obj_cgroup_from_current(), which
is used to get an objcg pointer on the kernel memory allocation fast path.

Results (1M 8-byte accounted allocations):

| version         | accounted (us) | delta | unaccounted (us) | delta  |
|-----------------+----------------+-------+------------------+--------|
| baseline (6.1+) |          81042 |       |            45269 |        |
| patch 1         |          78756 | -2.8% |            42731 |  -5.6% |
| patch 2         |          73650 | -9.1% |            30662 | -32.3% |

Unaccounted allocations were performed from a userspace task belonging to
the root cgroup, so the savings are particularly large: previously the
root_mem_cgroup pointer was obtained first, only to learn that its
corresponding objcg is NULL.
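
For context, here is a minimal sketch of the kind of microbenchmark behind
these numbers (described later in the thread only as "a small hack to just
do a bunch of allocations in a row"). The module name and structure below
are assumptions for illustration, not the actual test; the unaccounted case
would use plain GFP_KERNEL:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

#define NR_ALLOCS	1000000

static int __init objcg_bench_init(void)
{
	void **ptrs;
	ktime_t start;
	int i;

	ptrs = kvmalloc_array(NR_ALLOCS, sizeof(void *), GFP_KERNEL);
	if (!ptrs)
		return -ENOMEM;

	start = ktime_get();
	for (i = 0; i < NR_ALLOCS; i++)
		/* __GFP_ACCOUNT drives the get_obj_cgroup_from_current() path */
		ptrs[i] = kmalloc(8, GFP_KERNEL_ACCOUNT);
	pr_info("%d accounted allocations: %lld us\n",
		NR_ALLOCS, ktime_us_delta(ktime_get(), start));

	for (i = 0; i < NR_ALLOCS; i++)
		kfree(ptrs[i]);
	kvfree(ptrs);

	/* fail the load on purpose so the one-shot module doesn't stay resident */
	return -EAGAIN;
}
module_init(objcg_bench_init);
MODULE_LICENSE("GPL");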


Roman Gushchin (2):
mm: kmem: optimize get_obj_cgroup_from_current()
mm: kmem: add direct objcg pointer to task_struct

include/linux/sched.h | 4 ++
mm/memcontrol.c | 102 ++++++++++++++++++++++++++++++++----------
2 files changed, 83 insertions(+), 23 deletions(-)

--
2.39.0


2022-12-20 19:08:07

by Roman Gushchin

Subject: [PATCH RFC 2/2] mm: kmem: add direct objcg pointer to task_struct

To charge a freshly allocated kernel object to a memory cgroup, the
kernel needs to obtain an objcg pointer. Currently it does it
indirectly by obtaining the memcg pointer first and then calling to
__get_obj_cgroup_from_memcg().

Usually tasks spend their entire life belonging to the same object
cgroup. So it makes sense to save the objcg pointer on task_struct
directly, so it can be obtained faster. It requires some work on fork,
exit and cgroup migrate paths, but these paths are way colder.

The old indirect way is still used for remote memcg charging.

Signed-off-by: Roman Gushchin <[email protected]>
---
include/linux/sched.h | 4 +++
mm/memcontrol.c | 84 +++++++++++++++++++++++++++++++++++++------
2 files changed, 77 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..e17be609cbcb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1435,6 +1435,10 @@ struct task_struct {
struct mem_cgroup *active_memcg;
#endif

+#ifdef CONFIG_MEMCG_KMEM
+ struct obj_cgroup *objcg;
+#endif
+
#ifdef CONFIG_BLK_CGROUP
struct request_queue *throttle_queue;
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 82828c51d2ea..e0547b224f40 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3001,23 +3001,29 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
struct mem_cgroup *memcg;
- struct obj_cgroup *objcg;
+ struct obj_cgroup *objcg = NULL;

if (in_task()) {
memcg = current->active_memcg;
-
- /* Memcg to charge can't be determined. */
- if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD)))
- return NULL;
+ if (unlikely(memcg))
+ goto from_memcg;
+
+ if (current->objcg) {
+ rcu_read_lock();
+ do {
+ objcg = READ_ONCE(current->objcg);
+ } while (objcg && !obj_cgroup_tryget(objcg));
+ rcu_read_unlock();
+ }
} else {
memcg = this_cpu_read(int_active_memcg);
- if (likely(!memcg))
- return NULL;
+ if (unlikely(memcg))
+ goto from_memcg;
}
+ return objcg;

+from_memcg:
rcu_read_lock();
- if (!memcg)
- memcg = mem_cgroup_from_task(current);
objcg = __get_obj_cgroup_from_memcg(memcg);
rcu_read_unlock();
return objcg;
@@ -6303,6 +6309,28 @@ static void mem_cgroup_move_task(void)
mem_cgroup_clear_mc();
}
}
+
+#ifdef CONFIG_MEMCG_KMEM
+static void mem_cgroup_fork(struct task_struct *task)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(task);
+ if (!memcg || mem_cgroup_is_root(memcg))
+ task->objcg = NULL;
+ else
+ task->objcg = __get_obj_cgroup_from_memcg(memcg);
+ rcu_read_unlock();
+}
+
+static void mem_cgroup_exit(struct task_struct *task)
+{
+ if (task->objcg)
+ obj_cgroup_put(task->objcg);
+}
+#endif
+
#else /* !CONFIG_MMU */
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
@@ -6317,7 +6345,7 @@ static void mem_cgroup_move_task(void)
#endif

#ifdef CONFIG_LRU_GEN
-static void mem_cgroup_attach(struct cgroup_taskset *tset)
+static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
struct cgroup_subsys_state *css;
@@ -6335,10 +6363,38 @@ static void mem_cgroup_attach(struct cgroup_taskset *tset)
task_unlock(task);
}
#else
+static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) {}
+#endif /* CONFIG_LRU_GEN */
+
+#ifdef CONFIG_MEMCG_KMEM
+static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
+{
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+
+ cgroup_taskset_for_each(task, css, tset) {
+ struct mem_cgroup *memcg;
+
+ if (task->objcg)
+ obj_cgroup_put(task->objcg);
+
+ rcu_read_lock();
+ memcg = container_of(css, struct mem_cgroup, css);
+ task->objcg = __get_obj_cgroup_from_memcg(memcg);
+ rcu_read_unlock();
+ }
+}
+#else
+static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {}
+#endif /* CONFIG_MEMCG_KMEM */
+
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
+ mem_cgroup_lru_gen_attach(tset);
+ mem_cgroup_kmem_attach(tset);
}
-#endif /* CONFIG_LRU_GEN */
+#endif

static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
{
@@ -6816,9 +6872,15 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_reset = mem_cgroup_css_reset,
.css_rstat_flush = mem_cgroup_css_rstat_flush,
.can_attach = mem_cgroup_can_attach,
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)
.attach = mem_cgroup_attach,
+#endif
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
+#ifdef CONFIG_MEMCG_KMEM
+ .fork = mem_cgroup_fork,
+ .exit = mem_cgroup_exit,
+#endif
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
.early_init = 0,
--
2.39.0
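
An aside on the fast path above: condensed into a standalone helper (the
name is hypothetical; the body mirrors the hunk in
get_obj_cgroup_from_current()), the retry loop reads:

/* A sketch, as it might appear inside mm/memcontrol.c; not part of the patch. */
static inline struct obj_cgroup *current_objcg_tryget(void)
{
	struct obj_cgroup *objcg;

	rcu_read_lock();
	do {
		/*
		 * A concurrent cgroup migration (mem_cgroup_kmem_attach())
		 * can replace current->objcg at any moment, so it has to be
		 * re-read on every iteration.
		 */
		objcg = READ_ONCE(current->objcg);
		/*
		 * obj_cgroup_tryget() fails only if the refcount already
		 * dropped to zero, i.e. we raced with the objcg's release;
		 * retry to pick up the updated pointer.
		 */
	} while (objcg && !obj_cgroup_tryget(objcg));
	rcu_read_unlock();

	return objcg;
}

The RCU read-side section keeps the objcg from being freed while the tryget
is attempted (this relies on objcg frees being RCU-deferred), and a
successful tryget turns that temporary guarantee into a proper reference.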

2022-12-20 19:12:58

by Roman Gushchin

Subject: [PATCH RFC 1/2] mm: kmem: optimize get_obj_cgroup_from_current()

Manually inline memcg_kmem_bypass() and active_memcg() to speed up
get_obj_cgroup_from_current() by avoiding duplicate in_task() checks
and active_memcg() readings.

Also add a likely() macro to __get_obj_cgroup_from_memcg():
obj_cgroup_tryget() should succeed at almost all times except
a very unlikely race with the memcg deletion path.

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/memcontrol.c | 34 ++++++++++++++--------------------
1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bafd3cde4507..82828c51d2ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1047,19 +1047,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
}
EXPORT_SYMBOL(get_mem_cgroup_from_mm);

-static __always_inline bool memcg_kmem_bypass(void)
-{
- /* Allow remote memcg charging from any context. */
- if (unlikely(active_memcg()))
- return false;
-
- /* Memcg to charge can't be determined. */
- if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
- return true;
-
- return false;
-}
-
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -3004,7 +2991,7 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)

for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
objcg = rcu_dereference(memcg->objcg);
- if (objcg && obj_cgroup_tryget(objcg))
+ if (likely(objcg && obj_cgroup_tryget(objcg)))
break;
objcg = NULL;
}
@@ -3013,16 +3000,23 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)

__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
- struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;

- if (memcg_kmem_bypass())
- return NULL;
+ if (in_task()) {
+ memcg = current->active_memcg;
+
+ /* Memcg to charge can't be determined. */
+ if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD)))
+ return NULL;
+ } else {
+ memcg = this_cpu_read(int_active_memcg);
+ if (likely(!memcg))
+ return NULL;
+ }

rcu_read_lock();
- if (unlikely(active_memcg()))
- memcg = active_memcg();
- else
+ if (!memcg)
memcg = mem_cgroup_from_task(current);
objcg = __get_obj_cgroup_from_memcg(memcg);
rcu_read_unlock();
--
2.39.0
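
A reference point for the commit message above: the active_memcg() helper
being manually inlined looks like this in mm/memcontrol.c of this era
(quoted from memory, so treat it as approximate):

static __always_inline struct mem_cgroup *active_memcg(void)
{
	if (!in_task())
		return this_cpu_read(int_active_memcg);
	else
		return current->active_memcg;
}

Before the patch it was evaluated inside memcg_kmem_bypass() and then up to
twice more in get_obj_cgroup_from_current() (see the removed lines above),
repeating the in_task() branch each time; the rewrite performs the in_task()
check once and reads the matching field directly.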

2022-12-20 20:14:14

by Shakeel Butt

Subject: Re: [PATCH RFC 1/2] mm: kmem: optimize get_obj_cgroup_from_current()

On Tue, Dec 20, 2022 at 10:28 AM Roman Gushchin
<[email protected]> wrote:
>
> Manually inline memcg_kmem_bypass() and active_memcg() to speed up
> get_obj_cgroup_from_current() by avoiding duplicate in_task() checks
> and active_memcg() readings.
>
> Also add a likely() macro to __get_obj_cgroup_from_memcg():
> obj_cgroup_tryget() should succeed at almost all times except
> a very unlikely race with the memcg deletion path.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Can you please add your performance experiment setup and the results in
the commit description of this patch as well?

Acked-by: Shakeel Butt <[email protected]>

2022-12-20 21:37:53

by Roman Gushchin

Subject: Re: [PATCH RFC 1/2] mm: kmem: optimize get_obj_cgroup_from_current()

On Tue, Dec 20, 2022 at 11:55:34AM -0800, Shakeel Butt wrote:
> On Tue, Dec 20, 2022 at 10:28 AM Roman Gushchin
> <[email protected]> wrote:
> >
> > Manually inline memcg_kmem_bypass() and active_memcg() to speed up
> > get_obj_cgroup_from_current() by avoiding duplicate in_task() checks
> > and active_memcg() readings.
> >
> > Also add a likely() macro to __get_obj_cgroup_from_memcg():
> > obj_cgroup_tryget() should succeed at almost all times except
> > a very unlikely race with the memcg deletion path.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
>
> Can you please add your performance experiment setup and the results in
> the commit description of this patch as well?

Sure. I used a small hack to just do a bunch of allocations in a row and
measured the time. Will include it in the commit message.

Also will fix the #ifdef thing from the second patch, thanks for spotting
it.
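
For the record, the "#ifdef thing" in question: patch 2 guards
mem_cgroup_attach() with "#if defined(CONFIG_MEMCG_KMEM) ||
defined(CONFIG_MEMCG_KMEM)", testing the same symbol twice. As Shakeel
points out in his review of patch 2 below, the second test was meant to be
CONFIG_LRU_GEN, so the corrected guard would read:

#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_LRU_GEN)
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
	mem_cgroup_lru_gen_attach(tset);
	mem_cgroup_kmem_attach(tset);
}
#endif

The same fix applies to the guard around the .attach assignment in
memory_cgrp_subsys.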

>
> Acked-by: Shakeel Butt <[email protected]>

Thank you for taking a look!

2022-12-20 22:06:43

by Shakeel Butt

Subject: Re: [PATCH RFC 2/2] mm: kmem: add direct objcg pointer to task_struct

On Tue, Dec 20, 2022 at 10:27:45AM -0800, Roman Gushchin wrote:
> To charge a freshly allocated kernel object to a memory cgroup, the
> kernel needs to obtain an objcg pointer. Currently it does it
> indirectly by obtaining the memcg pointer first and then calling to
> __get_obj_cgroup_from_memcg().
>
> Usually tasks spend their entire life belonging to the same object
> cgroup. So it makes sense to save the objcg pointer on task_struct
> directly, so it can be obtained faster. It requires some work on fork,
> exit and cgroup migrate paths, but these paths are way colder.
>
> The old indirect way is still used for remote memcg charging.
>
> Signed-off-by: Roman Gushchin <[email protected]>

This looks good too. Few comments below:

[...]
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
> +{
> + struct task_struct *task;
> + struct cgroup_subsys_state *css;
> +
> + cgroup_taskset_for_each(task, css, tset) {
> + struct mem_cgroup *memcg;
> +
> + if (task->objcg)
> + obj_cgroup_put(task->objcg);
> +
> + rcu_read_lock();
> + memcg = container_of(css, struct mem_cgroup, css);
> + task->objcg = __get_obj_cgroup_from_memcg(memcg);
> + rcu_read_unlock();
> + }
> +}
> +#else
> +static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
> +#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)

I think you want CONFIG_LRU_GEN in the above check.

> static void mem_cgroup_attach(struct cgroup_taskset *tset)
> {
> + mem_cgroup_lru_gen_attach(tset);
> + mem_cgroup_kmem_attach(tset);
> }
> -#endif /* CONFIG_LRU_GEN */
> +#endif
>
> static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
> {
> @@ -6816,9 +6872,15 @@ struct cgroup_subsys memory_cgrp_subsys = {
> .css_reset = mem_cgroup_css_reset,
> .css_rstat_flush = mem_cgroup_css_rstat_flush,
> .can_attach = mem_cgroup_can_attach,
> +#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)

Same here.

> .attach = mem_cgroup_attach,
> +#endif
> .cancel_attach = mem_cgroup_cancel_attach,
> .post_attach = mem_cgroup_move_task,
> +#ifdef CONFIG_MEMCG_KMEM
> + .fork = mem_cgroup_fork,
> + .exit = mem_cgroup_exit,
> +#endif
> .dfl_cftypes = memory_files,
> .legacy_cftypes = mem_cgroup_legacy_files,
> .early_init = 0,
> --
> 2.39.0
>

2022-12-22 13:59:59

by Michal Koutný

Subject: Re: [PATCH RFC 2/2] mm: kmem: add direct objcg pointer to task_struct

On Tue, Dec 20, 2022 at 10:27:45AM -0800, Roman Gushchin <[email protected]> wrote:
> To charge a freshly allocated kernel object to a memory cgroup, the
> kernel needs to obtain an objcg pointer. Currently it does it
> indirectly by obtaining the memcg pointer first and then calling to
> __get_obj_cgroup_from_memcg().

Jinx [1].

You report an additional 7% improvement with this patch (focused on
allocations only). I didn't see impressive numbers (with a different
benchmark, see [1]), so it looked like a micro-optimization without much
benefit to me.

My 0.02€ to RFC,
Michal


[1] https://bugzilla.kernel.org/show_bug.cgi?id=216038#c5



2022-12-22 16:28:13

by Roman Gushchin

Subject: Re: [PATCH RFC 2/2] mm: kmem: add direct objcg pointer to task_struct

On Thu, Dec 22, 2022 at 02:50:44PM +0100, Michal Koutný wrote:
> On Tue, Dec 20, 2022 at 10:27:45AM -0800, Roman Gushchin <[email protected]> wrote:
> > To charge a freshly allocated kernel object to a memory cgroup, the
> > kernel needs to obtain an objcg pointer. Currently it does it
> > indirectly by obtaining the memcg pointer first and then calling to
> > __get_obj_cgroup_from_memcg().
>
> Jinx [1].
>
> You report additional 7% improvement with this patch (focused on
> allocations only). I didn't see impressive numbers (different benchmark
> in [1]), so it looked as a microoptimization without big benefit to me.

Hi Michal!

Thank you for taking a look.
Do you have any numbers to share?

In general, I agree that it's a micro-optimization, but:
1) some people periodically complain that accounted allocations are slow
compared to non-accounted ones, and slower than they were with page-based
accounting,
2) I don't see any particular hot spot or obviously non-optimal place on the
allocation path.

So if we want to make it faster, we have to micro-optimize it here and there;
there is no other way. It's basically a question of how many cache lines we
touch.

Btw, I'm working on a patch 3 for this series, which in early tests brings an
additional ~25% improvement in my benchmark; hopefully I will post it soon as
part of v1.

Thanks!

2023-01-02 17:12:29

by Michal Koutný

Subject: Re: [PATCH RFC 2/2] mm: kmem: add direct objcg pointer to task_struct

Hello.

On Thu, Dec 22, 2022 at 08:21:49AM -0800, Roman Gushchin <[email protected]> wrote:
> Do you have any numbers to share?

The numbers are in bko#216038; let me explain them here a bit.
I used the will-it-scale benchmark that repeatedly locks/unlocks a file
and runs in parallel.

The final numbers were:
  sample                     metric            δ  δ_cg
  no accounting implemented  32307750       0 %     -
  accounting in cg           2.49577e+07  -23 %   0 %
  accounting in cg + cache   2.51642e+07  -22 %  +1 %

Hence my result was only 1% improvement.

(But it was a very simple try, not delving into any of the CPU cache
statistics.)

Question: Were your measurements multi-threaded?

> 1) some people periodically complain that accounted allocations are slow
> compared to non-accounted ones, and slower than they were with page-based
> accounting,

My result above would likely not satisfy those complainers I know about.
But if your additional changes perform better, the additional code complexity
may be justified in the end.


> Btw, I'm working on a patch 3 for this series, which in early tests brings an
> additional ~25% improvement in my benchmark; hopefully I will post it soon as
> part of v1.

Please send it with more details about your benchmark to put the numbers
into context.


Michal

