2021-04-14 11:23:24

by Waiman Long

Subject: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

v3:
- Add missing "inline" qualifier to the alternate mod_obj_stock_state()
in patch 3.
- Remove redundant current_obj_stock() call in patch 5.

v2:
- Fix bug found by test robot in patch 5.
- Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
memory accounting overhead to each call of kmem_cache_alloc() and
kmem_cache_free().

Workloads that perform a lot of kmemcache allocations and de-allocations
may therefore experience a performance regression, as illustrated in
[1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
module init time was used for benchmarking. The test was run on a
Cascade Lake server with turbo boosting disabled to reduce run-to-run
variation.
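
The exact test module is not included here; roughly, it does something
like the following (illustrative sketch only, not the actual test code;
the "bench_obj" names are made up):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

struct bench_obj {
        char pad[64];                   /* 64-byte object */
};

static int __init bench_init(void)
{
        struct kmem_cache *cachep;
        ktime_t start;
        void *p;
        long i;

        cachep = KMEM_CACHE(bench_obj, SLAB_ACCOUNT);   /* accounted cache */
        if (!cachep)
                return -ENOMEM;

        start = ktime_get();
        for (i = 0; i < 100000000L; i++) {
                p = kmem_cache_alloc(cachep, GFP_KERNEL);
                kmem_cache_free(cachep, p);
        }
        pr_info("100M alloc/free cycles took %lld ms\n",
                ktime_ms_delta(ktime_get(), start));

        kmem_cache_destroy(cachep);
        return 0;
}
module_init(bench_init);
MODULE_LICENSE("GPL");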

With memory accounting disabled, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

Applied patches   Run time   Accounting overhead   Overhead %age
---------------   --------   -------------------   -------------
     None          10.800s         7.952s              100.0%
      1-2           9.140s         6.292s               79.1%
      1-3           7.641s         4.793s               60.3%
      1-5           6.801s         3.953s               49.7%

Note that this is the best-case scenario where most updates happen only
to the percpu stocks. Real workloads will likely have a certain amount
of updates to the memcg charges and vmstats, so the performance benefit
will be smaller.

It was found that a big part of the memory accounting overhead
was caused by the local_irq_save()/local_irq_restore() sequences in
updating the local stock charge bytes and the vmstat arrays, at least on
x86 systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible; in fact, it eliminates them in the common
case. Another part of this patchset caches the vmstat data updates in
the local stock as well, which also helps.
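
For reference, the kind of sequence being trimmed looks roughly like
this (simplified sketch of the percpu stock update pattern, not the
exact kernel code):

#include <linux/percpu.h>
#include <linux/irqflags.h>

static DEFINE_PER_CPU(unsigned int, cached_bytes);     /* stand-in for the stock */

static void update_stock(unsigned int nr_bytes)
{
        unsigned long flags;

        local_irq_save(flags);          /* full IRQ save/restore on every update */
        __this_cpu_add(cached_bytes, nr_bytes);
        local_irq_restore(flags);
}

Each kmem_cache_alloc()/kmem_cache_free() currently goes through two
such sections (charge bytes and vmstat update); the patches below merge
them and, in the common case, avoid them entirely.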

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
mm/memcg: Separate out object stock data into its own struct
mm/memcg: Optimize user context object stock access

include/linux/memcontrol.h | 14 ++-
mm/memcontrol.c | 199 ++++++++++++++++++++++++++++++++-----
mm/percpu.c | 9 +-
mm/slab.h | 32 +++---
4 files changed, 196 insertions(+), 58 deletions(-)

--
2.18.1


2021-04-14 11:23:26

by Waiman Long

Subject: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle, which
is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 31 +++++++++++++++++++++++++++----
mm/percpu.c | 9 ++-------
mm/slab.h | 6 +++---
4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);

int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+ struct pglist_data *pgdat, int idx);

extern struct static_key_false memcg_kmem_enabled_key;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
return false;
}

-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
struct memcg_stock_pcp *stock;
- unsigned long flags;
-
- local_irq_save(flags);

stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)

if (stock->nr_bytes > PAGE_SIZE)
drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+ unsigned long flags;

+ local_irq_save(flags);
+ __refill_obj_stock(objcg, nr_bytes);
local_irq_restore(flags);
}

@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
refill_obj_stock(objcg, size);
}

+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+ struct pglist_data *pgdat, int idx)
+{
+ unsigned long flags;
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec = NULL;
+
+ local_irq_save(flags);
+ __refill_obj_stock(objcg, size);
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ if (pgdat)
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+ rcu_read_unlock();
+ local_irq_restore(flags);
+}
+
#endif /* CONFIG_MEMCG_KMEM */

/*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;

- obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
- rcu_read_lock();
- mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- -(size * num_possible_cpus()));
- rcu_read_unlock();
-
+ obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+ NULL, MEMCG_PERCPU_B);
obj_cgroup_put(objcg);
}

diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
continue;

objcgs[off] = NULL;
- obj_cgroup_uncharge(objcg, obj_full_size(s));
- mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
- -obj_full_size(s));
+ obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+ page_pgdat(page),
+ cache_vmstat_idx(s));
obj_cgroup_put(objcg);
}
}
--
2.18.1

2021-04-14 11:23:58

by Waiman Long

Subject: [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct

The object stock data stored in struct memcg_stock_pcp are independent
of the other page-based data stored there. Separate them out into
their own struct to highlight that independence.

Signed-off-by: Waiman Long <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
}
EXPORT_SYMBOL(unlock_page_memcg);

-struct memcg_stock_pcp {
- struct mem_cgroup *cached; /* this never be root cgroup */
- unsigned int nr_pages;
-
+struct obj_stock {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
int vmstat_idx;
int vmstat_bytes;
+#else
+ int dummy[0];
#endif
+};
+
+struct memcg_stock_pcp {
+ struct mem_cgroup *cached; /* this never be root cgroup */
+ unsigned int nr_pages;
+ struct obj_stock obj;

struct work_struct work;
unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
static DEFINE_MUTEX(percpu_charge_mutex);

#ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg);

#else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
{
}
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
}
#endif

+static inline struct obj_stock *current_obj_stock(void)
+{
+ struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+ return &stock->obj;
+}
+
/**
* consume_stock: Try to consume stocked charge on this cpu.
* @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);

stock = this_cpu_ptr(&memcg_stock);
- drain_obj_stock(stock);
+ drain_obj_stock(&stock->obj);
drain_stock(stock);
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,

static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
- struct memcg_stock_pcp *stock;
+ struct obj_stock *stock;
unsigned long flags;
bool ret = false;

local_irq_save(flags);

- stock = this_cpu_ptr(&memcg_stock);
+ stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
return ret;
}

-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
{
struct obj_cgroup *old = stock->cached_objcg;

@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
{
struct mem_cgroup *memcg;

- if (stock->cached_objcg) {
- memcg = obj_cgroup_memcg(stock->cached_objcg);
+ if (stock->obj.cached_objcg) {
+ memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
return true;
}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,

static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
- struct memcg_stock_pcp *stock;
+ struct obj_stock *stock = current_obj_stock();

- stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
drain_obj_stock(stock);
obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
static void __mod_obj_stock_state(struct obj_cgroup *objcg,
struct pglist_data *pgdat, int idx, int nr)
{
- struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+ struct obj_stock *stock = current_obj_stock();

if (stock->cached_objcg != objcg) {
/* Output the current data as is */
--
2.18.1

2021-04-14 11:24:28

by Waiman Long

Subject: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

Before the new slab memory controller with per-object byte charging,
charging and vmstat data updates happened only when new slab pages were
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charges for a specific
obj_cgroup to reduce that overhead. To further reduce it, this patch
caches the vmstat data in the memcg_stock_pcp structure as well,
until it accumulates a page size worth of updates or until other cached
data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
calls to __mod_obj_stock_state() after initial boot led to an actual call
to mod_objcg_state(). When doing a parallel kernel build, the figure was
about 16% (21894614 out of 139780628). So caching the vmstat data reduces
the number of calls to mod_objcg_state() by more than 80%.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
mm/slab.h | 26 +++++++----------
2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {

#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+ struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+ int vmstat_idx;
+ int vmstat_bytes;
#endif

struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
css_put(&memcg->css);
}

+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
+{
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec = NULL;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ if (pgdat)
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+ rcu_read_unlock();
+}
+
static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
stock->nr_bytes = 0;
}

+ if (stock->vmstat_bytes) {
+ mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+ stock->vmstat_bytes);
+ stock->vmstat_bytes = 0;
+ stock->vmstat_idx = 0;
+ stock->cached_pgdat = NULL;
+ }
+
obj_cgroup_put(old);
stock->cached_objcg = NULL;
}
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
local_irq_restore(flags);
}

+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat, int idx, int nr)
+{
+ struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+ if (stock->cached_objcg != objcg) {
+ /* Output the current data as is */
+ } else if (!stock->vmstat_bytes) {
+ /* Save the current data */
+ stock->vmstat_bytes = nr;
+ stock->vmstat_idx = idx;
+ stock->cached_pgdat = pgdat;
+ nr = 0;
+ } else if ((stock->cached_pgdat != pgdat) ||
+ (stock->vmstat_idx != idx)) {
+ /* Output the cached data & save the current data */
+ swap(nr, stock->vmstat_bytes);
+ swap(idx, stock->vmstat_idx);
+ swap(pgdat, stock->cached_pgdat);
+ } else {
+ stock->vmstat_bytes += nr;
+ if (abs(nr) > PAGE_SIZE) {
+ nr = stock->vmstat_bytes;
+ stock->vmstat_bytes = 0;
+ } else {
+ nr = 0;
+ }
+ }
+ if (nr)
+ mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ int idx, int nr)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __mod_obj_stock_state(objcg, pgdat, idx, nr);
+ local_irq_restore(flags);
+}
+
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
{
struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
struct pglist_data *pgdat, int idx)
{
unsigned long flags;
- struct mem_cgroup *memcg;
- struct lruvec *lruvec = NULL;

local_irq_save(flags);
__refill_obj_stock(objcg, size);
-
- rcu_read_lock();
- memcg = obj_cgroup_memcg(objcg);
- if (pgdat)
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
- rcu_read_unlock();
+ __mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
local_irq_restore(flags);
}

diff --git a/mm/slab.h b/mm/slab.h
index 677cdc52e641..03bd9813422b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
#ifdef CONFIG_MEMCG_KMEM
int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
gfp_t gfp, bool new_page);
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ int idx, int nr);

static inline void memcg_free_page_obj_cgroups(struct page *page)
{
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
return true;
}

-static inline void mod_objcg_state(struct obj_cgroup *objcg,
- struct pglist_data *pgdat,
- enum node_stat_item idx, int nr)
-{
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- rcu_read_lock();
- memcg = obj_cgroup_memcg(objcg);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
- rcu_read_unlock();
-}
-
static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
struct obj_cgroup *objcg,
gfp_t flags, size_t size,
@@ -324,8 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
off = obj_to_index(s, page, p[i]);
obj_cgroup_get(objcg);
page_objcgs(page)[off] = objcg;
- mod_objcg_state(objcg, page_pgdat(page),
- cache_vmstat_idx(s), obj_full_size(s));
+ mod_obj_stock_state(objcg, page_pgdat(page),
+ cache_vmstat_idx(s),
+ obj_full_size(s));
} else {
obj_cgroup_uncharge(objcg, obj_full_size(s));
}
@@ -408,6 +397,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,
void **p, int objects)
{
}
+
+static inline void mod_obj_stock_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat, int idx, int nr)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */

static inline struct kmem_cache *virt_to_cache(const void *obj)
--
2.18.1

2021-04-15 03:29:32

by Masayoshi Mizuma

Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function call goes through a separate irq_save/irq_restore cycle. That
> is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
>
> Signed-off-by: Waiman Long <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>

Please feel free to add:

Tested-by: Masayoshi Mizuma <[email protected]>

Thanks!
Masa

2021-04-15 03:30:20

by Masayoshi Mizuma

Subject: Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
> v3:
> - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
> in patch 3.
> - Remove redundant current_obj_stock() call in patch 5.
>
> v2:
> - Fix bug found by test robot in patch 5.
> - Update cover letter and commit logs.
>
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
>
> For workloads that require a lot of kmemcache allocations and
> de-allocations, they may experience performance regression as illustrated
> in [1] and [2].
>
> A simple kernel module that performs repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module
> init time is used for benchmarking. The test was run on a CascadeLake
> server with turbo-boosting disable to reduce run-to-run variation.
>
> With memory accounting disable, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
>
> Applied patches Run time Accounting overhead Overhead %age
> --------------- -------- ------------------- -------------
> None 10.800s 7.952s 100.0%
> 1-2 9.140s 6.292s 79.1%
> 1-3 7.641s 4.793s 60.3%
> 1-5 6.801s 3.953s 49.7%
>
> Note that this is the best case scenario where most updates happen only
> to the percpu stocks. Real workloads will likely have a certain amount
> of updates to the memcg charges and vmstats. So the performance benefit
> will be less.
>
> It was found that a big part of the memory accounting overhead
> was caused by the local_irq_save()/local_irq_restore() sequences in
> updating local stock charge bytes and vmstat array, at least in x86
> systems. There are two such sequences in kmem_cache_alloc() and two
> in kmem_cache_free(). This patchset tries to reduce the use of such
> sequences as much as possible. In fact, it eliminates them in the common
> case. Another part of this patchset to cache the vmstat data update in
> the local stock as well which also helps.
>
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u

Hi Longman,

Thank you for your patches.
I reran the benchmark with your patches, and it seems that the reduction
is small... The total durations of the sendto() and recvfrom() system calls
during the benchmark are as follows.

- sendto
- v5.8 vanilla: 2576.056 msec (100%)
- v5.12-rc7 vanilla: 2988.911 msec (116%)
- v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)

- recvfrom
- v5.8 vanilla: 2113.156 msec (100%)
- v5.12-rc7 vanilla: 2305.810 msec (109%)
- v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)

kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times during
the benchmark. I ran a loop in a kernel module as follows. The duration
is actually reduced by your patches.

---
dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
for (i = 0; i < 1400000; i++) {
        p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
        kmem_cache_free(dummy_cache, p);
}
---

- v5.12-rc7 vanilla: 110 msec (100%)
- v5.12-rc7 with your patches (1-5): 85 msec (77%)

It seems that the reduction is small for the benchmark though...
Anyway, I can see your patches reduce the overhead.
Please feel free to add:

Tested-by: Masayoshi Mizuma <[email protected]>

Thanks!
Masa


2021-04-15 03:30:26

by Masayoshi Mizuma

Subject: Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
>
> Signed-off-by: Waiman Long <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>

Please feel free to add:

Tested-by: Masayoshi Mizuma <[email protected]>

Thanks!
Masa

2021-04-15 03:30:47

by Masayoshi Mizuma

Subject: Re: [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct

On Tue, Apr 13, 2021 at 09:20:26PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
>
> Signed-off-by: Waiman Long <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
>
Please feel free to add:

Tested-by: Masayoshi Mizuma <[email protected]>

Thanks!
Masa

2021-04-15 13:19:25

by Waiman Long

Subject: Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

On 4/14/21 11:26 PM, Masayoshi Mizuma wrote:
>
> Hi Longman,
>
> Thank you for your patches.
> I rerun the benchmark with your patches, it seems that the reduction
> is small... The total duration of sendto() and recvfrom() system call
> during the benchmark are as follows.
>
> - sendto
> - v5.8 vanilla: 2576.056 msec (100%)
> - v5.12-rc7 vanilla: 2988.911 msec (116%)
> - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)
>
> - recvfrom
> - v5.8 vanilla: 2113.156 msec (100%)
> - v5.12-rc7 vanilla: 2305.810 msec (109%)
> - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)
>
> kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times during
> the benchmark. I ran a loop in a kernel module as following. The duration
> is reduced by your patches actually.
>
> ---
> dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
> for (i = 0; i < 1400000; i++) {
> p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
> kmem_cache_free(dummy_cache, p);
> }
> ---
>
> - v5.12-rc7 vanilla: 110 msec (100%)
> - v5.12-rc7 with your patches (1-5): 85 msec (77%)
>
> It seems that the reduction is small for the benchmark though...
> Anyway, I can see your patches reduce the overhead.
> Please feel free to add:
>
> Tested-by: Masayoshi Mizuma <[email protected]>
>
> Thanks!
> Masa
>
Thanks for the testing.

I was focusing on your kernel module benchmark in testing my patch. I
will try out your pgbench benchmark to see if there can be other tuning
that can be done.

BTW, how many NUMA nodes does your test machine have? I did my testing with
a 2-socket system. The vmstat caching part may be less effective on
systems with more NUMA nodes. I will try to find a larger 4-socket
system for testing.

Cheers,
Longman

2021-04-15 15:50:05

by Masayoshi Mizuma

Subject: Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

On Thu, Apr 15, 2021 at 09:17:37AM -0400, Waiman Long wrote:
> I was focusing on your kernel module benchmark in testing my patch. I will
> try out your pgbench benchmark to see if there can be other tuning that can
> be done.

Thanks a lot!

> BTW, how many numa nodes does your test machine? I did my testing with a
> 2-socket system. The vmstat caching part may be less effective on systems
> with more numa nodes. I will try to find a larger 4-socket systems for
> testing.

The test machine has one node.

- Masa

2021-04-15 16:32:40

by Johannes Weiner

Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function call goes through a separate irq_save/irq_restore cycle. That
> is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
>
> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> refill_obj_stock(objcg, size);
> }
>
> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> + struct pglist_data *pgdat, int idx)

The optimization makes sense.

But please don't combine independent operations like this into a
single function. It makes for an unclear parameter list, it's a pain
in the behind to change the constituent operations later on, and it
has a habit of attracting more random bools over time. E.g. what if
the caller already has irqs disabled? What if it KNOWS that irqs are
enabled and it could use local_irq_disable() instead of save?

Just provide an __obj_cgroup_uncharge() that assumes irqs are
disabled, combine with the existing __mod_memcg_lruvec_state(), and
bubble the irq handling up to those callsites which know better.
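
Roughly this shape (sketch only, untested; the call-site helper name
below is made up):

/* irqs-off variant; the caller is responsible for the irq state */
static void __obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
{
        __refill_obj_stock(objcg, size);
}

/* e.g. from memcg_slab_free_hook() */
static void uncharge_slab_obj(struct obj_cgroup *objcg, struct kmem_cache *s,
                              struct page *page)
{
        struct mem_cgroup *memcg;
        struct lruvec *lruvec;
        unsigned long flags;

        local_irq_save(flags);
        __obj_cgroup_uncharge(objcg, obj_full_size(s));

        rcu_read_lock();
        memcg = obj_cgroup_memcg(objcg);
        lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
        __mod_memcg_lruvec_state(memcg, lruvec, cache_vmstat_idx(s),
                                 -obj_full_size(s));
        rcu_read_unlock();
        local_irq_restore(flags);
}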

2021-04-15 16:37:51

by Waiman Long

Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On 4/15/21 12:30 PM, Johannes Weiner wrote:
> On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
>> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
>> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
>> function call goes through a separate irq_save/irq_restore cycle. That
>> is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
>> that combines them with a single irq_save/irq_restore cycle.
>>
>> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>> refill_obj_stock(objcg, size);
>> }
>>
>> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>> + struct pglist_data *pgdat, int idx)
> The optimization makes sense.
>
> But please don't combine independent operations like this into a
> single function. It makes for an unclear parameter list, it's a pain
> in the behind to change the constituent operations later on, and it
> has a habit of attracting more random bools over time. E.g. what if
> the caller already has irqs disabled? What if it KNOWS that irqs are
> enabled and it could use local_irq_disable() instead of save?
>
> Just provide an __obj_cgroup_uncharge() that assumes irqs are
> disabled, combine with the existing __mod_memcg_lruvec_state(), and
> bubble the irq handling up to those callsites which know better.
>
That will also work. However, the reason I did it this way was because of
patch 5 in the series. I could put the get_obj_stock() and put_obj_stock()
code in slab.h and allow them to be used directly in various places,
but hiding them in one function is easier.

Anyway, I can change the patch if you think that is the right way.

Cheers,
Longman

2021-04-15 16:51:44

by Johannes Weiner

Subject: Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.

Right, but mod_objcg_state() is itself already percpu-cached. What's
the benefit of avoiding calls to it with another percpu cache?

2021-04-15 16:59:17

by Johannes Weiner

Subject: Re: [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct

On Tue, Apr 13, 2021 at 09:20:26PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
>
> Signed-off-by: Waiman Long <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> ---
> mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
> 1 file changed, 26 insertions(+), 15 deletions(-)

It's almost twice as much code, and IMO not any clearer. Plus it adds
some warts: int dummy[0], redundant naming in stock->obj.cached_objcg,
this_cpu_ptr() doesn't really need a wrapper, etc.

2021-04-15 17:09:54

by Waiman Long

Subject: Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

On 4/15/21 12:50 PM, Johannes Weiner wrote:
> On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per object byte charging,
>> charging and vmstat data update happen only when new slab pages are
>> allocated or freed. Now they are done with every kmem_cache_alloc()
>> and kmem_cache_free(). This causes additional overhead for workloads
>> that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache byte charge for a specific
>> obj_cgroup to reduce that overhead. To further reducing it, this patch
>> makes the vmstat data cached in the memcg_stock_pcp structure as well
>> until it accumulates a page size worth of update or when other cached
>> data change.
>>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this
>> patch applied, it was found that about 17% (946796 out of 5515184) of the
>> time when __mod_obj_stock_state() is called leads to an actual call to
>> mod_objcg_state() after initial boot. When doing parallel kernel build,
>> the figure was about 16% (21894614 out of 139780628). So caching the
>> vmstat data reduces the number of calls to mod_objcg_state() by more
>> than 80%.
> Right, but mod_objcg_state() is itself already percpu-cached. What's
> the benefit of avoiding calls to it with another percpu cache?
>
There are actually 2 sets of vmstat data that have to be updated. One is
associated with the memcg and the other one is for each lruvec within the
cgroup. By caching them in obj_stock, we replace 2 writes to two colder
cachelines with one write to a hot cacheline. If you look at patch 5, I
break obj_stock into two - one for task context and one for irq context.
Interrupt disabling is no longer needed in task context, but that is not
possible when writing to the actual vmstat data arrays.
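
For illustration, the patch 5 accessors are roughly of this shape
(simplified sketch; the actual member names in the patch may differ):

static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
{
        struct memcg_stock_pcp *stock;

        if (likely(in_task())) {
                *pflags = 0UL;
                preempt_disable();      /* no irq save/restore in task context */
                stock = this_cpu_ptr(&memcg_stock);
                return &stock->task_obj;
        }
        local_irq_save(*pflags);
        stock = this_cpu_ptr(&memcg_stock);
        return &stock->irq_obj;
}

static inline void put_obj_stock(unsigned long flags)
{
        if (likely(in_task()))
                preempt_enable();
        else
                local_irq_restore(flags);
}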

Cheers,
Longman

2021-04-15 17:12:51

by Matthew Wilcox

Subject: Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
> With memory accounting disable, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
>
> Applied patches Run time Accounting overhead Overhead %age
> --------------- -------- ------------------- -------------
> None 10.800s 7.952s 100.0%
> 1-2 9.140s 6.292s 79.1%
> 1-3 7.641s 4.793s 60.3%
> 1-5 6.801s 3.953s 49.7%

I think this is a misleading way to report the overhead. I would have said:

10.800s    7.952s    279.2%
 9.140s    6.292s    220.9%
 7.641s    4.793s    168.3%
 6.801s    3.953s    138.8%

2021-04-15 17:44:55

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

On 4/15/21 1:10 PM, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
>> With memory accounting disable, the run time was 2.848s. With memory
>> accounting enabled, the run times with the application of various
>> patches in the patchset were:
>>
>> Applied patches Run time Accounting overhead Overhead %age
>> --------------- -------- ------------------- -------------
>> None 10.800s 7.952s 100.0%
>> 1-2 9.140s 6.292s 79.1%
>> 1-3 7.641s 4.793s 60.3%
>> 1-5 6.801s 3.953s 49.7%
> I think this is a misleading way to report the overhead. I would have said:
>
> 10.800s 7.952s 279.2%
> 9.140s 6.292s 220.9%
> 7.641s 4.793s 168.3%
> 6.801s 3.953s 138.8%
>
What I want to emphasize is the reduction in the accounting overhead
part of the execution time. Your percentages use the accounting-disabled
run time as the denominator. I think both are valid; I will be more clear
about that in the next version of the patch.

Thanks,
Longman

2021-04-15 18:12:54

by Johannes Weiner

Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > function call goes through a separate irq_save/irq_restore cycle. That
> > > is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
> > > that combines them with a single irq_save/irq_restore cycle.
> > >
> > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> > > refill_obj_stock(objcg, size);
> > > }
> > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > + struct pglist_data *pgdat, int idx)
> > The optimization makes sense.
> >
> > But please don't combine independent operations like this into a
> > single function. It makes for an unclear parameter list, it's a pain
> > in the behind to change the constituent operations later on, and it
> > has a habit of attracting more random bools over time. E.g. what if
> > the caller already has irqs disabled? What if it KNOWS that irqs are
> > enabled and it could use local_irq_disable() instead of save?
> >
> > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > bubble the irq handling up to those callsites which know better.
> >
> That will also work. However, the reason I did that was because of patch 5
> in the series. I could put the get_obj_stock() and put_obj_stock() code in
> slab.h and allowed them to be used directly in various places, but hiding in
> one function is easier.

Yeah it's more obvious after getting to patch 5.

But with the irq disabling gone entirely, is there still an incentive
to combine the atomic section at all? Disabling preemption is pretty
cheap, so it wouldn't matter to just do it twice.

I.e. couldn't the final sequence in slab code simply be

objcg_uncharge()
mod_objcg_state()

again and each function disables preemption (and in the rare case
irqs) as it sees fit?

You lose the irqsoff batching in the cold path, but as you say, hit
rates are pretty good, and it doesn't seem worth complicating the code
for the cold path.

2021-04-15 18:15:33

by Johannes Weiner

Subject: Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

On Thu, Apr 15, 2021 at 01:08:29PM -0400, Waiman Long wrote:
> On 4/15/21 12:50 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> > > Before the new slab memory controller with per object byte charging,
> > > charging and vmstat data update happen only when new slab pages are
> > > allocated or freed. Now they are done with every kmem_cache_alloc()
> > > and kmem_cache_free(). This causes additional overhead for workloads
> > > that generate a lot of alloc and free calls.
> > >
> > > The memcg_stock_pcp is used to cache byte charge for a specific
> > > obj_cgroup to reduce that overhead. To further reducing it, this patch
> > > makes the vmstat data cached in the memcg_stock_pcp structure as well
> > > until it accumulates a page size worth of update or when other cached
> > > data change.
> > >
> > > On a 2-socket Cascade Lake server with instrumentation enabled and this
> > > patch applied, it was found that about 17% (946796 out of 5515184) of the
> > > time when __mod_obj_stock_state() is called leads to an actual call to
> > > mod_objcg_state() after initial boot. When doing parallel kernel build,
> > > the figure was about 16% (21894614 out of 139780628). So caching the
> > > vmstat data reduces the number of calls to mod_objcg_state() by more
> > > than 80%.
> > Right, but mod_objcg_state() is itself already percpu-cached. What's
> > the benefit of avoiding calls to it with another percpu cache?
> >
> There are actually 2 set of vmstat data that have to be updated. One is
> associated with the memcg and other one is for each lruvec within the
> cgroup. Caching it in obj_stock, we replace 2 writes to two colder
> cachelines with one write to a hot cacheline. If you look at patch 5, I
> break obj_stock into two - one for task context and one for irq context.
> Interrupt disable is no longer needed in task context, but that is not
> possible when writing to the actual vmstat data arrays.

Ah, thanks for the explanation. Both of these points are worth
mentioning in the changelog of this patch.

2021-04-15 18:48:23

by Waiman Long

Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On 4/15/21 2:10 PM, Johannes Weiner wrote:
> On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
>> On 4/15/21 12:30 PM, Johannes Weiner wrote:
>>> On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
>>>> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
>>>> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
>>>> function call goes through a separate irq_save/irq_restore cycle. That
>>>> is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
>>>> that combines them with a single irq_save/irq_restore cycle.
>>>>
>>>> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>>>> refill_obj_stock(objcg, size);
>>>> }
>>>> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>>>> + struct pglist_data *pgdat, int idx)
>>> The optimization makes sense.
>>>
>>> But please don't combine independent operations like this into a
>>> single function. It makes for an unclear parameter list, it's a pain
>>> in the behind to change the constituent operations later on, and it
>>> has a habit of attracting more random bools over time. E.g. what if
>>> the caller already has irqs disabled? What if it KNOWS that irqs are
>>> enabled and it could use local_irq_disable() instead of save?
>>>
>>> Just provide an __obj_cgroup_uncharge() that assumes irqs are
>>> disabled, combine with the existing __mod_memcg_lruvec_state(), and
>>> bubble the irq handling up to those callsites which know better.
>>>
>> That will also work. However, the reason I did that was because of patch 5
>> in the series. I could put the get_obj_stock() and put_obj_stock() code in
>> slab.h and allow them to be used directly in various places, but hiding them in
>> one function is easier.
> Yeah it's more obvious after getting to patch 5.
>
> But with the irq disabling gone entirely, is there still an incentive
> to combine the atomic section at all? Disabling preemption is pretty
> cheap, so it wouldn't matter to just do it twice.
>
> I.e. couldn't the final sequence in slab code simply be
>
> objcg_uncharge()
> mod_objcg_state()
>
> again and each function disables preemption (and in the rare case
> irqs) as it sees fit?
>
> You lose the irqsoff batching in the cold path, but as you say, hit
> rates are pretty good, and it doesn't seem worth complicating the code
> for the cold path.
>
That does make sense, though a little bit of performance may be lost. I
will try that out to see how it works out performance-wise.

Cheers,
Longman
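
For context, the combined helper being debated looks roughly like the
following simplified reconstruction (built from the quoted hunk and the
existing refill_obj_stock()/__mod_memcg_lruvec_state() helpers; it is
not the verbatim patch):

void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
                                   struct pglist_data *pgdat, int idx)
{
        struct mem_cgroup *memcg;
        struct lruvec *lruvec = NULL;
        unsigned long flags;

        /* A single irq_save/irq_restore section covers both operations */
        local_irq_save(flags);
        refill_obj_stock(objcg, size);          /* the uncharge part */

        rcu_read_lock();
        memcg = obj_cgroup_memcg(objcg);
        if (pgdat)
                lruvec = mem_cgroup_lruvec(memcg, pgdat);
        __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
        rcu_read_unlock();

        local_irq_restore(flags);
}

The alternative being suggested above is to drop the combined helper,
keep obj_cgroup_uncharge() and mod_objcg_state() as separate calls in
the slab hooks, and let each one disable preemption (or, in the rare
irq-context case, interrupts) on its own.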

2021-04-15 19:45:10

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
> On 4/15/21 2:10 PM, Johannes Weiner wrote:
> > On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> > > On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > > > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > > > function calls goes through a separate irq_save/irq_restore cycle. That
> > > > > is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
> > > > > that combines them with a single irq_save/irq_restore cycle.
> > > > >
> > > > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> > > > > refill_obj_stock(objcg, size);
> > > > > }
> > > > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > > > + struct pglist_data *pgdat, int idx)
> > > > The optimization makes sense.
> > > >
> > > > But please don't combine independent operations like this into a
> > > > single function. It makes for an unclear parameter list, it's a pain
> > > > in the behind to change the constituent operations later on, and it
> > > > has a habit of attracting more random bools over time. E.g. what if
> > > > the caller already has irqs disabled? What if it KNOWS that irqs are
> > > > enabled and it could use local_irq_disable() instead of save?
> > > >
> > > > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > > > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > > > bubble the irq handling up to those callsites which know better.
> > > >
> > > That will also work. However, the reason I did that was because of patch 5
> > > in the series. I could put the get_obj_stock() and put_obj_stock() code in
> > > slab.h and allow them to be used directly in various places, but hiding them in
> > > one function is easier.
> > Yeah it's more obvious after getting to patch 5.
> >
> > But with the irq disabling gone entirely, is there still an incentive
> > to combine the atomic section at all? Disabling preemption is pretty
> > cheap, so it wouldn't matter to just do it twice.
> >
> > I.e. couldn't the final sequence in slab code simply be
> >
> > objcg_uncharge()
> > mod_objcg_state()
> >
> > again and each function disables preemption (and in the rare case
> > irqs) as it sees fit?
> >
> > You lose the irqsoff batching in the cold path, but as you say, hit
> > rates are pretty good, and it doesn't seem worth complicating the code
> > for the cold path.
> >
> That does make sense, though a little bit of performance may be lost. I will
> try that out to see how it works out performance-wise.

Thanks.

Even if we still end up doing it, it's great to have that cost
isolated, so we know how much extra code complexity corresponds to how
much performance gain. It seems the task/irq split could otherwise be
a pretty localized change with no API implications.
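
The task/irq split referred to here (patch 5) can be pictured as two
obj_stock sub-structures inside the per-cpu stock, one used from task
context under preempt_disable() and one used from interrupt context
under local_irq_save(); the task_obj/irq_obj names are illustrative:

static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
{
        struct memcg_stock_pcp *stock;

        if (likely(in_task())) {
                /* Task context: disabling preemption is sufficient */
                preempt_disable();
                stock = this_cpu_ptr(&memcg_stock);
                return &stock->task_obj;
        }
        /* Interrupt context: fall back to the costlier irq_save */
        local_irq_save(*pflags);
        stock = this_cpu_ptr(&memcg_stock);
        return &stock->irq_obj;
}

static inline void put_obj_stock(unsigned long flags)
{
        if (likely(in_task()))
                preempt_enable();
        else
                local_irq_restore(flags);
}

With accessors like these, the task-context fast path never executes
local_irq_save()/local_irq_restore() at all, which is where most of the
irqsoff cost under discussion goes away.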

2021-04-15 20:26:04

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On 4/15/21 3:40 PM, Johannes Weiner wrote:
> On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
>> On 4/15/21 2:10 PM, Johannes Weiner wrote:
>>> On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
>>>> On 4/15/21 12:30 PM, Johannes Weiner wrote:
>>>>> On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
>>>>>> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
>>>>>> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
>>>>>> function calls goes through a separate irq_save/irq_restore cycle. That
>>>>>> is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
>>>>>> that combines them with a single irq_save/irq_restore cycle.
>>>>>>
>>>>>> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>>>>>> refill_obj_stock(objcg, size);
>>>>>> }
>>>>>> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>>>>>> + struct pglist_data *pgdat, int idx)
>>>>> The optimization makes sense.
>>>>>
>>>>> But please don't combine independent operations like this into a
>>>>> single function. It makes for an unclear parameter list, it's a pain
>>>>> in the behind to change the constituent operations later on, and it
>>>>> has a habit of attracting more random bools over time. E.g. what if
>>>>> the caller already has irqs disabled? What if it KNOWS that irqs are
>>>>> enabled and it could use local_irq_disable() instead of save?
>>>>>
>>>>> Just provide an __obj_cgroup_uncharge() that assumes irqs are
>>>>> disabled, combine with the existing __mod_memcg_lruvec_state(), and
>>>>> bubble the irq handling up to those callsites which know better.
>>>>>
>>>> That will also work. However, the reason I did that was because of patch 5
>>>> in the series. I could put the get_obj_stock() and put_obj_stock() code in
>>>> slab.h and allow them to be used directly in various places, but hiding them in
>>>> one function is easier.
>>> Yeah it's more obvious after getting to patch 5.
>>>
>>> But with the irq disabling gone entirely, is there still an incentive
>>> to combine the atomic section at all? Disabling preemption is pretty
>>> cheap, so it wouldn't matter to just do it twice.
>>>
>>> I.e. couldn't the final sequence in slab code simply be
>>>
>>> objcg_uncharge()
>>> mod_objcg_state()
>>>
>>> again and each function disables preemption (and in the rare case
>>> irqs) as it sees fit?
>>>
>>> You lose the irqsoff batching in the cold path, but as you say, hit
>>> rates are pretty good, and it doesn't seem worth complicating the code
>>> for the cold path.
>>>
>> That does make sense, though a little bit of performance may be lost. I will
>> try that out to see how it works out performance-wise.
> Thanks.
>
> Even if we still end up doing it, it's great to have that cost
> isolated, so we know how much extra code complexity corresponds to how
> much performance gain. It seems the task/irq split could otherwise be
> a pretty localized change with no API implications.
>
I still want to move the mod_objcg_state() function to memcontrol.c,
though, as I don't want to put any obj_stock stuff in mm/slab.h.

Cheers,
Longman

2021-04-15 20:38:56

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

On Thu, Apr 15, 2021 at 03:44:56PM -0400, Waiman Long wrote:
> On 4/15/21 3:40 PM, Johannes Weiner wrote:
> > On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
> > > On 4/15/21 2:10 PM, Johannes Weiner wrote:
> > > > On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> > > > > On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > > > > > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > > > > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > > > > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > > > > > function calls goes through a separate irq_save/irq_restore cycle. That
> > > > > > > is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
> > > > > > > that combines them with a single irq_save/irq_restore cycle.
> > > > > > >
> > > > > > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> > > > > > > refill_obj_stock(objcg, size);
> > > > > > > }
> > > > > > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > > > > > + struct pglist_data *pgdat, int idx)
> > > > > > The optimization makes sense.
> > > > > >
> > > > > > But please don't combine independent operations like this into a
> > > > > > single function. It makes for an unclear parameter list, it's a pain
> > > > > > in the behind to change the constituent operations later on, and it
> > > > > > has a habit of attracting more random bools over time. E.g. what if
> > > > > > the caller already has irqs disabled? What if it KNOWS that irqs are
> > > > > > enabled and it could use local_irq_disable() instead of save?
> > > > > >
> > > > > > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > > > > > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > > > > > bubble the irq handling up to those callsites which know better.
> > > > > >
> > > > > That will also work. However, the reason I did that was because of patch 5
> > > > > in the series. I could put the get_obj_stock() and put_obj_stock() code in
> > > > > slab.h and allow them to be used directly in various places, but hiding them in
> > > > > one function is easier.
> > > > Yeah it's more obvious after getting to patch 5.
> > > >
> > > > But with the irq disabling gone entirely, is there still an incentive
> > > > to combine the atomic section at all? Disabling preemption is pretty
> > > > cheap, so it wouldn't matter to just do it twice.
> > > >
> > > > I.e. couldn't the final sequence in slab code simply be
> > > >
> > > > objcg_uncharge()
> > > > mod_objcg_state()
> > > >
> > > > again and each function disables preemption (and in the rare case
> > > > irqs) as it sees fit?
> > > >
> > > > You lose the irqsoff batching in the cold path, but as you say, hit
> > > > rates are pretty good, and it doesn't seem worth complicating the code
> > > > for the cold path.
> > > >
> > > That does make sense, though a little bit of performance may be lost. I will
> > > try that out to see how it works out performance-wise.
> > Thanks.
> >
> > Even if we still end up doing it, it's great to have that cost
> > isolated, so we know how much extra code complexity corresponds to how
> > much performance gain. It seems the task/irq split could otherwise be
> > a pretty localized change with no API implications.
> >
> I still want to move the mod_objcg_state() function to memcontrol.c, though,
> as I don't want to put any obj_stock stuff in mm/slab.h.

No objection from me!

That's actually a nice cleanup, IMO. Not sure why it was separated
from the rest of the objcg interface implementation to begin with.
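
As a sketch of that cleanup, mm/slab.h would keep only a declaration of
mod_objcg_state() while the definition, and all of the obj_stock
handling it needs, would live in mm/memcontrol.c. The helper name
obj_stock_add_vmstat() below is hypothetical and stands in for the
caching logic sketched earlier in the thread:

/* mm/slab.h: declaration only, no obj_stock details leak out */
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
                     enum node_stat_item idx, int nr);

/* mm/memcontrol.c: the definition is free to use the per-cpu stock */
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
                     enum node_stat_item idx, int nr)
{
        unsigned long flags;
        struct obj_stock *stock = get_obj_stock(&flags);

        /*
         * Accumulate the delta in the per-cpu stock; it is drained to
         * the real memcg/lruvec vmstat counters only when a page's
         * worth of updates has built up or the cached keys change.
         */
        obj_stock_add_vmstat(stock, objcg, pgdat, idx, nr);

        put_obj_stock(flags);
}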