2023-01-25 07:36:43

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

Disclaimer:
a - The cover letter got bigger than expected, so I had to split it in
sections to better organize myself. I am not very comfortable with it.
b - Performance numbers below do not include patch 5/5 (Remove flags
from memcg_stock_pcp), which could further improve performance for
drain_all_stock(), but I only noticed that optimization at the
last minute.


0 - Motivation:
In the current codebase, when drain_all_stock() runs, it schedules a
drain_local_stock() on each cpu that has a percpu stock associated with a
descendant of the given root_memcg.

This happens even on 'isolated cpus', a feature commonly used by workloads that
are sensitive to interruptions and context switching, such as vRAN and Industrial
Control Systems.

Since this scheduling behavior is a problem for those workloads, the proposal is
to replace the current local_lock + schedule_work_on() solution with a per-cpu
spinlock.
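
For reference, this is the scheduling path in the current upstream
drain_all_stock() that the series replaces (simplified excerpt; the exact
code being removed is in the diff of patch 4/5):

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		...
		if (flush &&
		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
			if (cpu == curcpu)
				drain_local_stock(&stock->work);
			else
				schedule_work_on(cpu, &stock->work);
		}
	}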

1 - Possible performance losses:
With a low number of schedule_work_on() calls, local_locks are supposed to perform
better than spinlocks, given the cacheline is always accessed by a single CPU and
there is no contention. The impact on those areas is analyzed below:

1.1 - Cacheline usage
In the current implementation, drain_all_stock() remotely reads the percpu
memcg_stock cacheline of every online CPU, and remotely writes to all cpus that
pass the mem_cgroup_is_descendant() test. (stock->flags, stock->work)

With spinlocks, drain_all_stock() remotely reads the percpu memcg_stock
cacheline of every online CPU, and remotely writes to all cpus that pass the
mem_cgroup_is_descendant() test. (stock->stock_lock, on top of the above)
While the spinlock may require extra acquire/release writes on some archs, they
will all happen on an exclusive cacheline, so there is not much overhead.

In both cases, the next local cpu read will require a fetch from memory (as the
cacheline was invalidated by the remote write) and gaining cacheline exclusivity
before the local write.

So regarding cacheline usage, there should not be much impact.

1.2 - Contention
We can safely assume that drain_all_stock() will not run often. If that were not
the case, there would be a lot of scheduled tasks, killing cpu performance.

Since it does not run often, and it is the only function that accesses remote
percpu memcg_stock, contention should not be significant, and should cause
less impact than scheduling (remote) and running the scheduled work (local).


2 - Performance test results:
2.1 - Test System:
- Non-virtualized AMD EPYC 7601 32-Core Processor, 128 CPUs distributed
in 8 NUMA nodes: 0-7,64-71 on node0, 8-15,72-79 on node1, and so on.
- 256GB RAM, 2x32GB per NUMA node
- For cpu isolation, the kernel cmdline included:
isolcpus=8-15,72-79 nohz_full=8-15,72-79
- Memcg group created with:
[Slice]
MemoryAccounting=true
MemoryLimit=1G
MemoryMax=1G
MemorySwapMax=1M
- For pinning runs to a given CPU, I used 'taskset -c $cpunum command'
- For restarting the memcg, the following was used:
restart_memcg() {
	systemctl stop test.slice
	systemctl daemon-reload
	systemctl start test.slice
}
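
For reference, a minimal sketch of how the slice above can be installed
(the unit file path is an assumption):

	cat > /etc/systemd/system/test.slice <<-'EOF'
	[Slice]
	MemoryAccounting=true
	MemoryLimit=1G
	MemoryMax=1G
	MemorySwapMax=1M
	EOF
	systemctl daemon-reload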

2.2 - Impact on functions that use memcg_stock:
2.2.1 - Approach: Using perf or tracepoints gives a very coarse result, so
it was preferred to count the total cpu clocks between entering and
exiting the functions that use memcg_stock, on an isolated cpu.
Something like this was used on x86:

fun_name() {
	u64 clk = rdtsc_ordered();
	... <function does its job>
	clk = rdtsc_ordered() - clk;
	<percpu struct dbg>
	dbg->fun_name.cycles += clk;
	dbg->fun_name.count++;
}

The percpu statistics were then acquired via a char device after the
test finished, and an average function clock usage was calculated.
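
The <percpu struct dbg> above was essentially a set of percpu counters.
A minimal sketch of what such a structure could look like (the names and
exact set of counters here are assumptions, not the code actually used):

	/* one counter pair per instrumented function */
	struct dbg_entry {
		u64 cycles;	/* accumulated rdtsc_ordered() deltas */
		u64 count;	/* number of calls */
	};

	/* one instance per cpu, read back via a char device after the test */
	struct dbg_stats {
		struct dbg_entry consume_stock;
		struct dbg_entry refill_stock;
		struct dbg_entry mod_objcg_state;
		struct dbg_entry consume_obj_stock;
		struct dbg_entry refill_obj_stock;
		struct dbg_entry drain_all_stock;
	};
	static DEFINE_PER_CPU(struct dbg_stats, dbg_stats);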

For the stress test, "cat /proc/interrupts > /dev/null" was run in a loop
of 1000000 iterations inside the memcg, for each cpu tested:

	for cpu in $cpuset; do
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
			for k in {1..100000} ; do
				cat /proc/interrupts > /dev/null;
			done"
	done

For the drain_all_stock() test, it was necessary to restart the memcg
(or cause an OOM) to call the function.

2.2.2 - Results
For 1 isolated CPU, pinned on cpu 8, with no drain_all_stock() calls,
where STDDEV is the standard deviation of the averages over 6 runs,
and Call Count is the sum of calls over the 6 runs:

Patched             Average clk    STDDEV           Call Count
consume_stock:      63.75983475    0.1610502136       72167768
refill_stock:       67.45708322    0.09732816852      23401756
mod_objcg_state:    98.03841384    1.491628532       181292961
consume_obj_stock:  63.2543456     0.04624513799      94846454
refill_obj_stock:   78.56390025    0.3160306174       91732973

Upstream            Average clk    STDDEV           Call Count
consume_stock:      53.51201046    0.05200824438      71866536
refill_stock:       65.46991584    0.1178078417       23401752
mod_objcg_state:    84.95365055    1.371464414       181292707
consume_obj_stock:  60.03619438    0.05944582207      94846327
refill_obj_stock:   73.23757912    1.161933856        91732787

Patched - Upstream  Diff (cycles)  Diff %
consume_stock:      10.24782429    19.15051258
refill_stock:        1.987167378    3.035237411
mod_objcg_state:    13.08476328    15.40223781
consume_obj_stock:   3.218151217    5.360351785
refill_obj_stock:    5.326321123    7.272661368

So on average, the above patched functions are 2~13 clock cycles slower
than upstream.

On the other hand, drain_all_stock is faster on the patched version,
even considering it does all the draining instead of scheduling the work
to other CPUs:

drain_all_stock
cpus   Upstream       Patched        Diff (cycles)   Diff(%)
1      44331.10831    38978.03581    -5353.072507    -12.07520567
8      43992.96512    39026.76654    -4966.198572    -11.2886198
128    156274.6634    58053.87421    -98220.78915    -62.85138425

(8 cpus being in the same NUMA node)

2.3 - Contention numbers
2.3.1 - Approach
On top of the patched version, I replaced the spin_lock_irqsave() in the
functions that use memcg_stock with spin_lock_irqsave_cc(), which is
defined as:

#define spin_lock_irqsave_cc(l, flags)					\
	if (!spin_trylock_irqsave(l, flags)) {				\
		u64 clk = rdtsc_ordered();				\
		spin_lock_irqsave(l, flags);				\
		clk = rdtsc_ordered() - clk;				\
		pr_err("mydebug: cpu %d hit contention :"		\
		       " fun = %s, clk = %lld\n",			\
		       smp_processor_id(), __func__, clk);		\
	}

So in case of contention (try_lock failing) it would record an
approximate clk usage before getting the lock, and print this to dmesg.
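
For illustration, the substitution in e.g. consume_stock() is a one-line
change on top of patch 2/5 (a debug-only sketch, not part of the series):

	-	spin_lock_irqsave(&stock->stock_lock, flags);
	+	spin_lock_irqsave_cc(&stock->stock_lock, flags);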

For the stress test, "cat /proc/interrupts > /dev/null" was run in a loop
of 1000000 iterations for each of the 128 cpus inside the memcg
(limit set to 20G):

	for cpu in {1..128}; do
		restart_memcg
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
			for k in {1..100000} ; do
				cat /proc/interrupts > /dev/null;
			done"
	done

This loop was repeated for over 8 hours.

2.3.2- Results

Function # calls
consume_stock: 15078323802
refill_stock: 2495995683
mod_objcg_state: 39765559905
consume_obj_stock: 19882888224
refill_obj_stock: 21025241793
drain_all_stock: 592

Contentions hit: 0

(Other, more aggressive synthetic tests were also run, and even in those
cases contention was hit just a couple of times over some hours.)

2.4 - Syscall time measure
2.4.1- Approach
To measure the patchset's effect on syscall time, the following code was
used (copied/adapted from
https://lore.kernel.org/linux-mm/[email protected]/ ):

################
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high) unsigned long low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high) unsigned long long val
#define EAX_EDX_VAL(val, low, high) (val)
#define EAX_EDX_ARGS(val, low, high) "A" (val)
#define EAX_EDX_RET(val, low, high) "=A" (val)
#endif

static inline unsigned long long __rdtscll(void)
{
	DECLARE_ARGS(val, low, high);

	/* cpuid serializes instruction execution; it also clobbers ebx/ecx */
	asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) : : "ebx", "ecx");

	return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define NRSYSCALLS 30000000
#define NRSLEEPS 100000

#define page_mmap() mmap(NULL, 4096, PROT_READ|PROT_WRITE, \
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
#define page_munmap(x) munmap(x, 4096)

void main(int argc, char *argv[])
{
	unsigned long a, b, cycles;
	int i, syscall = 0;
	int *page;

	page = page_mmap();
	if (page == MAP_FAILED)
		perror("mmap");

	if (page_munmap(page))
		perror("munmap");

	if (argc != 2) {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	rdtscll(a);

	if (strncmp("idle", argv[1], 4) == 0)
		syscall = 0;
	else if (strncmp("syscall", argv[1], 7) == 0)
		syscall = 1;
	else {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	if (syscall == 1) {
		for (i = 0; i < NRSYSCALLS; i++) {
			page = page_mmap();
			if (page == MAP_FAILED)
				perror("mmap");

#ifdef MY_WRITE
			page[3] = i;
#endif

			if (page_munmap(page))
				perror("munmap");
		}
	} else {
		for (i = 0; i < NRSLEEPS; i++)
			usleep(10);
	}

	rdtscll(b);

	cycles = b - a;

	if (syscall == 1)
		printf("cycles / syscall: %lu\n", (b - a) / (NRSYSCALLS * 2));
	else
		printf("cycles / idle loop: %lu\n", (b - a) / NRSLEEPS);
}
################

Running with ./my_test syscall will make it print the average clock
cycle usage of the syscall pair (page_mmap() and page_munmap()).

It was compiled in two versions: with -DMY_WRITE and without it. The
difference is that the former writes to the allocated page, causing it
to fault.

Each version was run 200 times, pinned to an isolated cpu.
An average and standard deviation were then calculated from those results.
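
A sketch of how each batch of runs can be reproduced (compiler flags,
the write-version binary name and the cpu number are assumptions):

	gcc -O2 -o my_test my_test.c
	gcc -O2 -DMY_WRITE -o my_test_write my_test.c

	for i in $(seq 200); do
		taskset -c 8 ./my_test syscall
		taskset -c 8 ./my_test_write syscall
	done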

2.4.2- Results
Patched:   no_write        write
AVG        2991.195        5746.495
STDEV      27.77488427     40.55878512
STDEV %    0.9285547838    0.7058004073

Upstream:  no_write        write
AVG        3012.22         5749.605
STDEV      25.1186451      37.26206223
STDEV %    0.8338914522    0.6480803851

Pat - Up:  no_write        write
Diff       -21.025         -3.11
Diff %     -0.6979901866   -0.05409067232

Meaning the pair page_mmap() + page_munmap() + pagefault runs a tiny bit
faster on the patched version, compared to upstream.

3 - Discussion on results
In 2.2 we see that every function that uses memcg_stock on the local cpu
gets slower by a few cycles in the patched version, while the function
that accesses it remotely (drain_all_stock()) gets faster. The difference
is more accentuated as we raise the cpu count, and consequently start
dealing with memory sharing across NUMA nodes.

In 2.3 we see that contention is not a big issue, as expected in 1.2.
This probably happens because drain_all_stock() runs quite rarely in
normal operation, and the other functions are quite fast.

In 2.4 we can see that page_mmap() + page_munmap() + pagefault ran a
tiny bit faster. This is probably because drain_all_stock() no longer
schedules work on the running cpu, so the cpu does not get interrupted,
possibly making up for the increased time in the local functions.

4- Conclusion
Scheduling work on isolated cpus can be an issue for some workloads.
Addressing the issue by replacing the local_lock in memcg_stock with a
spinlock should not cause much performance impact.

Leonardo Bras (5):
mm/memcontrol: Align percpu memcg_stock to cache
mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes
mm/memcontrol: Perform all stock drain in current CPU
mm/memcontrol: Remove flags from memcg_stock_pcp

mm/memcontrol.c | 75 ++++++++++++++++++++-----------------------------
1 file changed, 31 insertions(+), 44 deletions(-)

--
2.39.1



2023-01-25 07:36:47

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v2 1/5] mm/memcontrol: Align percpu memcg_stock to cache

When a struct smaller than a cacheline has an instance that is not aligned
to the cacheline size, its data may end up spread across two cachelines.

For memcg_stock this can mean having to fetch and gain cache-exclusivity
on 2 cachelines instead of 1 when the cacheline bounces between local
cpu functions and drain_all_stock(), which does remote read/write.

It can also mean paying some false-sharing cost due to the cacheline
being shared between 2 unrelated structures.

Avoid this issue by making memcg_stock cacheline-aligned.

Signed-off-by: Leonardo Bras <[email protected]>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab6..f8e86b88b3c7a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2188,7 +2188,8 @@ struct memcg_stock_pcp {
unsigned long flags;
#define FLUSHING_CACHED_CHARGE 0
};
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
.stock_lock = INIT_LOCAL_LOCK(stock_lock),
};
static DEFINE_MUTEX(percpu_charge_mutex);
--
2.39.1


2023-01-25 07:36:56

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v2 2/5] mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t

In this context, since per-cpu variables are being used, changing from
local_lock to spinlock should not have much performance impact and
allows operations such as stock draining to happen on remote cpus.

Why performance would probably not get impacted:
1 - Since the lock is in the same cacheline as the information that is
written next, there is not much extra memory access cost for using the
lock.
2 - Since it's a percpu struct, it should be rare for another cpu to
share this cacheline, so cacheline invalidation should be rare, and
writing to the lock should be cheap since there is always a write to
the next struct members.
3 - Even the write in (2) could be pipelined and batched with the
following writes to the cacheline (such as the nr_pages member),
further decreasing the impact of this change.

Suggested-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Leonardo Bras <[email protected]>
---
mm/memcontrol.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f8e86b88b3c7a..1d5c108413c83 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2172,7 +2172,7 @@ void unlock_page_memcg(struct page *page)
}

struct memcg_stock_pcp {
- local_lock_t stock_lock;
+ spinlock_t stock_lock; /* Protects the percpu struct */
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;

@@ -2190,7 +2190,7 @@ struct memcg_stock_pcp {
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
- .stock_lock = INIT_LOCAL_LOCK(stock_lock),
+ .stock_lock = __SPIN_LOCK_UNLOCKED(stock_lock),
};
static DEFINE_MUTEX(percpu_charge_mutex);

@@ -2235,15 +2235,15 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;

- local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
stock = this_cpu_ptr(&memcg_stock);
+ spin_lock_irqsave(&stock->stock_lock, flags);
+
if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
stock->nr_pages -= nr_pages;
ret = true;
}

- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ spin_unlock_irqrestore(&stock->stock_lock, flags);

return ret;
}
@@ -2280,14 +2280,14 @@ static void drain_local_stock(struct work_struct *dummy)
* drain_stock races is that we always operate on local CPU stock
* here with IRQ disabled
*/
- local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
stock = this_cpu_ptr(&memcg_stock);
+ spin_lock_irqsave(&stock->stock_lock, flags);
+
old = drain_obj_stock(stock);
drain_stock(stock);
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ spin_unlock_irqrestore(&stock->stock_lock, flags);
if (old)
obj_cgroup_put(old);
}
@@ -2315,10 +2315,12 @@ static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
unsigned long flags;
+ struct memcg_stock_pcp *stock;

- local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ stock = this_cpu_ptr(&memcg_stock);
+ spin_lock_irqsave(&stock->stock_lock, flags);
__refill_stock(memcg, nr_pages);
- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ spin_unlock_irqrestore(&stock->stock_lock, flags);
}

/*
@@ -3165,8 +3167,8 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
unsigned long flags;
int *bytes;

- local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
+ spin_lock_irqsave(&stock->stock_lock, flags);

/*
* Save vmstat data in stock and skip vmstat array update unless
@@ -3218,7 +3220,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
if (nr)
mod_objcg_mlstate(objcg, pgdat, idx, nr);

- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ spin_unlock_irqrestore(&stock->stock_lock, flags);
if (old)
obj_cgroup_put(old);
}
@@ -3229,15 +3231,15 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
unsigned long flags;
bool ret = false;

- local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
stock = this_cpu_ptr(&memcg_stock);
+ spin_lock_irqsave(&stock->stock_lock, flags);
+
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
}

- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ spin_unlock_irqrestore(&stock->stock_lock, flags);

return ret;
}
@@ -3327,9 +3329,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
unsigned long flags;
unsigned int nr_pages = 0;

- local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
stock = this_cpu_ptr(&memcg_stock);
+ spin_lock_irqsave(&stock->stock_lock, flags);
+
if (stock->cached_objcg != objcg) { /* reset if necessary */
old = drain_obj_stock(stock);
obj_cgroup_get(objcg);
@@ -3345,7 +3347,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
stock->nr_bytes &= (PAGE_SIZE - 1);
}

- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ spin_unlock_irqrestore(&stock->stock_lock, flags);
if (old)
obj_cgroup_put(old);

--
2.39.1


2023-01-25 07:37:02

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v2 4/5] mm/memcontrol: Perform all stock drain in current CPU

When drain_all_stock() is called, some CPUs will be required to have their
per-CPU caches drained. This currently happens by scheduling a call to
drain_local_stock() to run on each affected CPU.

As a consequence, this may end up scheduling work on CPUs that are
isolated, and therefore should have as little interruption as possible.

In order to avoid this, drain every CPU's stock directly from the
current CPU. This should be fine as long as drain_all_stock() runs fast
enough that it does not often cause contention on consume_stock(),
refill_stock(), mod_objcg_state(), consume_obj_stock() or
refill_obj_stock().

Also, since drain_all_stock() will be able to run on a remote CPU, protect
memcg_hotplug_cpu_dead() with stock_lock.

Signed-off-by: Leonardo Bras <[email protected]>
---
mm/memcontrol.c | 29 +++++++++--------------------
1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 373fa78c4d881..5b7f7c2e0232f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2184,7 +2184,6 @@ struct memcg_stock_pcp {
int nr_slab_unreclaimable_b;
#endif

- struct work_struct work;
unsigned long flags;
#define FLUSHING_CACHED_CHARGE 0
};
@@ -2269,18 +2268,15 @@ static void drain_stock(struct memcg_stock_pcp *stock)
stock->cached = NULL;
}

-static void drain_local_stock(struct work_struct *dummy)
+static void drain_stock_from(struct memcg_stock_pcp *stock)
{
- struct memcg_stock_pcp *stock;
struct obj_cgroup *old = NULL;
unsigned long flags;

/*
- * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
- * drain_stock races is that we always operate on local CPU stock
- * here with IRQ disabled
+ * The protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
+ * drain_stock races is stock_lock, a percpu spinlock.
*/
- stock = this_cpu_ptr(&memcg_stock);
spin_lock_irqsave(&stock->stock_lock, flags);

old = drain_obj_stock(stock);
@@ -2329,7 +2325,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
*/
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
- int cpu, curcpu;
+ int cpu;

/* If someone's already draining, avoid adding running more workers. */
if (!mutex_trylock(&percpu_charge_mutex))
@@ -2341,7 +2337,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
* per-cpu data. CPU up doesn't touch memcg_stock at all.
*/
migrate_disable();
- curcpu = smp_processor_id();
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
struct mem_cgroup *memcg;
@@ -2357,12 +2352,8 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
rcu_read_unlock();

if (flush &&
- !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
- if (cpu == curcpu)
- drain_local_stock(&stock->work);
- else
- schedule_work_on(cpu, &stock->work);
- }
+ !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+ drain_stock_from(stock);
}
migrate_enable();
mutex_unlock(&percpu_charge_mutex);
@@ -2373,7 +2364,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
struct memcg_stock_pcp *stock;

stock = &per_cpu(memcg_stock, cpu);
+ spin_lock(&stock->stock_lock);
drain_stock(stock);
+ spin_unlock(&stock->stock_lock);

return 0;
}
@@ -7328,7 +7321,7 @@ __setup("cgroup.memory=", cgroup_memory);
*/
static int __init mem_cgroup_init(void)
{
- int cpu, node;
+ int node;

/*
* Currently s32 type (can refer to struct batched_lruvec_stat) is
@@ -7341,10 +7334,6 @@ static int __init mem_cgroup_init(void)
cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
memcg_hotplug_cpu_dead);

- for_each_possible_cpu(cpu)
- INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
- drain_local_stock);
-
for_each_node(node) {
struct mem_cgroup_tree_per_node *rtpn;

--
2.39.1


2023-01-25 07:37:05

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v2 3/5] mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes

On 64-bit architectures, the current layout of memcg_stock_pcp looks
like this:

struct memcg_stock_pcp {
	spinlock_t stock_lock;		/*  0  4 */
	/* 4 bytes hole */
	struct mem_cgroup * cached;	/*  8  8 */
	unsigned int nr_pages;		/* 16  4 */
	/* 4 bytes hole */
	[...]
};

This happens because pointers are 8 bytes long (64-bit) while ints are
4 bytes long.

Those 2 holes are avoided if we reorder nr_pages and cached, effectively
putting nr_pages in the first hole, and saving 8 bytes.
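
The resulting layout would then be (same notation; a sketch of the
expected layout, not pahole output from a patched build):

struct memcg_stock_pcp {
	spinlock_t stock_lock;		/*  0  4 */
	unsigned int nr_pages;		/*  4  4 */
	struct mem_cgroup * cached;	/*  8  8 */
	[...]
};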

Signed-off-by: Leonardo Bras <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d5c108413c83..373fa78c4d881 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2173,8 +2173,8 @@ void unlock_page_memcg(struct page *page)

struct memcg_stock_pcp {
spinlock_t stock_lock; /* Protects the percpu struct */
- struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
+ struct mem_cgroup *cached; /* this never be root cgroup */

#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
--
2.39.1


2023-01-25 07:37:19

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v2 5/5] mm/memcontrol: Remove flags from memcg_stock_pcp

The flags member of struct memcg_stock_pcp has only one used bit:
FLUSHING_CACHED_CHARGE

Both the struct member and the flag were created to avoid scheduling
multiple kworker instances running drain_local_stock() for a single cpu.

How this scenario could happen before:
- drain_all_stock() gets called, takes percpu_charge_mutex,
schedules a drain_local_stock() on cpu X, and releases
percpu_charge_mutex.
- Another thread calls drain_all_stock(), takes percpu_charge_mutex,
schedules a drain_local_stock() on cpu X, ...

Since the stock draining is now performed by the thread running
drain_all_stock(), and happens before releasing percpu_charge_mutex,
there is no chance of another drain happening between
test_and_set_bit() and clear_bit(), so the flag is now useless.

Remove the flags member of memcg_stock_pcp, its usages and the
FLUSHING_CACHED_CHARGE define.

Signed-off-by: Leonardo Bras <[email protected]>
---
mm/memcontrol.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5b7f7c2e0232f..60712f69595e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2183,9 +2183,6 @@ struct memcg_stock_pcp {
int nr_slab_reclaimable_b;
int nr_slab_unreclaimable_b;
#endif
-
- unsigned long flags;
-#define FLUSHING_CACHED_CHARGE 0
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
@@ -2281,7 +2278,6 @@ static void drain_stock_from(struct memcg_stock_pcp *stock)

old = drain_obj_stock(stock);
drain_stock(stock);
- clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

spin_unlock_irqrestore(&stock->stock_lock, flags);
if (old)
@@ -2351,8 +2347,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
flush = true;
rcu_read_unlock();

- if (flush &&
- !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+ if (flush)
drain_stock_from(stock);
}
migrate_enable();
--
2.39.1


2023-01-25 08:34:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> Disclaimer:
> a - The cover letter got bigger than expected, so I had to split it in
> sections to better organize myself. I am not very confortable with it.
> b - Performance numbers below did not include patch 5/5 (Remove flags
> from memcg_stock_pcp), which could further improve performance for
> drain_all_stock(), but I could only notice the optimization at the
> last minute.
>
>
> 0 - Motivation:
> On current codebase, when drain_all_stock() is ran, it will schedule a
> drain_local_stock() for each cpu that has a percpu stock associated with a
> descendant of a given root_memcg.
>
> This happens even on 'isolated cpus', a feature commonly used on workloads that
> are sensitive to interruption and context switching such as vRAN and Industrial
> Control Systems.
>
> Since this scheduling behavior is a problem to those workloads, the proposal is
> to replace the current local_lock + schedule_work_on() solution with a per-cpu
> spinlock.

If IIRC we have also discussed that isolated CPUs can simply opt out
from the pcp caching and therefore the problem would be avoided
altogether without changes to the locking scheme. I do not see anything
regarding that in this submission. Could you elaborate why you have
abandoned this option?
--
Michal Hocko
SUSE Labs

2023-01-25 11:07:48

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > Disclaimer:
> > a - The cover letter got bigger than expected, so I had to split it in
> > sections to better organize myself. I am not very confortable with it.
> > b - Performance numbers below did not include patch 5/5 (Remove flags
> > from memcg_stock_pcp), which could further improve performance for
> > drain_all_stock(), but I could only notice the optimization at the
> > last minute.
> >
> >
> > 0 - Motivation:
> > On current codebase, when drain_all_stock() is ran, it will schedule a
> > drain_local_stock() for each cpu that has a percpu stock associated with a
> > descendant of a given root_memcg.
> >
> > This happens even on 'isolated cpus', a feature commonly used on workloads that
> > are sensitive to interruption and context switching such as vRAN and Industrial
> > Control Systems.
> >
> > Since this scheduling behavior is a problem to those workloads, the proposal is
> > to replace the current local_lock + schedule_work_on() solution with a per-cpu
> > spinlock.
>
> If IIRC we have also discussed that isolated CPUs can simply opt out
> from the pcp caching and therefore the problem would be avoided
> altogether without changes to the locking scheme. I do not see anything
> regarding that in this submission. Could you elaborate why you have
> abandoned this option?

Hello Michal,

I understand pcp caching is a nice to have.
So while I kept the idea of disabling pcp caching in mind as an option, I first
tried to understand what kind of impacts we would be seeing when trying to
change the locking scheme.

After I raised the data in the cover letter, I found that the performance impact
appears not to be that big. So in order to try keeping the pcp cache on isolated
cpus active, I decided to focus effort on the locking scheme change.

I mean, my rationale is: if there is an inexpensive way of keeping the feature,
why should we abandon it?

Best regards,
Leo


2023-01-25 11:40:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed 25-01-23 08:06:46, Leonardo Brás wrote:
> On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > Disclaimer:
> > > a - The cover letter got bigger than expected, so I had to split it in
> > > sections to better organize myself. I am not very confortable with it.
> > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > from memcg_stock_pcp), which could further improve performance for
> > > drain_all_stock(), but I could only notice the optimization at the
> > > last minute.
> > >
> > >
> > > 0 - Motivation:
> > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > descendant of a given root_memcg.
> > >
> > > This happens even on 'isolated cpus', a feature commonly used on workloads that
> > > are sensitive to interruption and context switching such as vRAN and Industrial
> > > Control Systems.
> > >
> > > Since this scheduling behavior is a problem to those workloads, the proposal is
> > > to replace the current local_lock + schedule_work_on() solution with a per-cpu
> > > spinlock.
> >
> > If IIRC we have also discussed that isolated CPUs can simply opt out
> > from the pcp caching and therefore the problem would be avoided
> > altogether without changes to the locking scheme. I do not see anything
> > regarding that in this submission. Could you elaborate why you have
> > abandoned this option?
>
> Hello Michal,
>
> I understand pcp caching is a nice to have.
> So while I kept the idea of disabling pcp caching in mind as an option, I first
> tried to understand what kind of impacts we would be seeing when trying to
> change the locking scheme.
>
> After I raised the data in the cover letter, I found that the performance impact
> appears not be that big. So in order to try keeping the pcp cache on isolated
> cpus active, I decided to focus effort on the locking scheme change.
>
> I mean, my rationale is: if is there a non-expensive way of keeping the feature,
> why should we abandon it?

Because any locking of the pcp adds potential contention. You haven't been
able to trigger that contention in your testing, but that doesn't really
mean it is non-existent.

Besides that, an opt-out for isolated cpus should be a much smaller change
with a much more predictable behavior. The overall performance for
charging on isolated cpus will be slightly worse, but it hasn't been shown
that this is anything really noticeable. Most workloads which do care about
isolcpus tend to not enter the kernel much AFAIK, so this should have
relatively small impact.

All that being said I would prefer a simpler solution which seems to be
to simply opt out from pcp caching and if the performance for real
world workloads shows regressions then we can start thinking about a
more complex solution.
--
Michal Hocko
SUSE Labs

2023-01-25 18:23:25

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > Disclaimer:
> > > a - The cover letter got bigger than expected, so I had to split it in
> > > sections to better organize myself. I am not very confortable with it.
> > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > from memcg_stock_pcp), which could further improve performance for
> > > drain_all_stock(), but I could only notice the optimization at the
> > > last minute.
> > >
> > >
> > > 0 - Motivation:
> > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > descendant of a given root_memcg.
> > >
> > > This happens even on 'isolated cpus', a feature commonly used on workloads that
> > > are sensitive to interruption and context switching such as vRAN and Industrial
> > > Control Systems.
> > >
> > > Since this scheduling behavior is a problem to those workloads, the proposal is
> > > to replace the current local_lock + schedule_work_on() solution with a per-cpu
> > > spinlock.
> >
> > If IIRC we have also discussed that isolated CPUs can simply opt out
> > from the pcp caching and therefore the problem would be avoided
> > altogether without changes to the locking scheme. I do not see anything
> > regarding that in this submission. Could you elaborate why you have
> > abandoned this option?
>
> Hello Michal,
>
> I understand pcp caching is a nice to have.
> So while I kept the idea of disabling pcp caching in mind as an option, I first
> tried to understand what kind of impacts we would be seeing when trying to
> change the locking scheme.

Remote draining reduces interruptions whether CPU
is marked as isolated or not:

- Allows isolated CPUs to benefit from pcp caching.
- Removes the interruption to non-isolated CPUs. See for example

https://lkml.org/lkml/2022/6/13/2769

"Minchan Kim tested this independently and reported;

My workload is not NOHZ CPUs but run apps under heavy memory
pressure so they goes to direct reclaim and be stuck on
drain_all_pages until work on workqueue run.

unit: nanosecond
max(dur) avg(dur) count(dur)
166713013 487511.77786438033 1283

From traces, system encountered the drain_all_pages 1283 times and
worst case was 166ms and avg was 487us.

The other problem was alloc_contig_range in CMA. The PCP draining
takes several hundred millisecond sometimes though there is no
memory pressure or a few of pages to be migrated out but CPU were
fully booked.

Your patch perfectly removed those wasted time."


> After I raised the data in the cover letter, I found that the performance impact
> appears not be that big. So in order to try keeping the pcp cache on isolated
> cpus active, I decided to focus effort on the locking scheme change.
>
> I mean, my rationale is: if is there a non-expensive way of keeping the feature,
> why should we abandon it?
>
> Best regards,
> Leo
>


2023-01-25 23:15:13

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > Disclaimer:
> > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > sections to better organize myself. I am not very confortable with it.
> > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > from memcg_stock_pcp), which could further improve performance for
> > > > drain_all_stock(), but I could only notice the optimization at the
> > > > last minute.
> > > >
> > > >
> > > > 0 - Motivation:
> > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > descendant of a given root_memcg.

Do you know what caused those drain_all_stock() calls? I wonder if we should look
into why we have many of them and whether we really need them.

It's either some user action (e.g. reducing memory.max), or some memcg
entering pre-oom conditions. In the latter case a lot of drain calls can be
scheduled without a good reason (assuming the cgroup contains multiple tasks running
on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota
and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
on a per-cgroup basis.

Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
charges/memcg references (it might be not completely idle, but running some very special
workload which is not doing any kernel allocations, or a process belonging to the root memcg).
In all other cases the pcpu stock will either be drained naturally by an allocation from another
memcg, or an allocation from the same memcg will "restore" it, making draining useless.

We could also call into drain_all_pages() opportunistically, without waiting for the result.
On a busy system it's most likely useless; we might oom before the scheduled works get executed.

I admit I planned to do some work around this and even started, but then never had enough time
to finish it.

Overall I'm somewhat resistant to the idea of making the generic allocation & free paths slower
for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
doing this without a really strong reason.

Thanks!

2023-01-26 07:41:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > Disclaimer:
> > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > sections to better organize myself. I am not very confortable with it.
> > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > last minute.
> > > > >
> > > > >
> > > > > 0 - Motivation:
> > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > descendant of a given root_memcg.
>
> Do you know what caused those drain_all_stock() calls? I wonder if we should look
> into why we have many of them and whether we really need them?
>
> It's either some user's actions (e.g. reducing memory.max), either some memcg
> is entering pre-oom conditions. In the latter case a lot of drain calls can be
> scheduled without a good reason (assuming the cgroup contain multiple tasks running
> on multiple cpus).

I believe I've never got a specific answer to that. We
have discussed that in the previous version submission
([email protected] and specifically
[email protected]). Leonardo has mentioned a mix of RT and
isolcpus. I was wondering about using memcgs in RT workloads because
that just sounds weird but let's say this is the case indeed. Then an RT
task or whatever task that is running on an isolated cpu can have pcp
charges.

> Essentially each cpu will try to grab the remains of the memory quota
> and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> on per-cgroup basis.

I think it would be more than sufficient to disable pcp charging on an
isolated cpu. This is not a per memcg property. I can imagine that
different tasks running in the same memcg can run on a mix of CPUs (e.g.
only part of it on isolated CPUs). It is a recipe for all sorts of
priority inversions but well, memcg and RT is there already.

> Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
> charges/memcg references (it might be not completely idle, but running some very special
> workload which is not doing any kernel allocations or a process belonging to the root memcg).
> In all other cases pcpu stock will be either drained naturally by an allocation from another
> memcg or an allocation from the same memcg will "restore" it, making draining useless.
>
> We also can into drain_all_pages() opportunistically, without waiting for the result.
> On a busy system it's most likely useless, we might oom before scheduled works will be executed.

I think the primary objective is that no execution unintended by
userspace happens on isolated cpus.

> I admit I planned to do some work around and even started, but then never had enough time to
> finish it.
>
> Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> doing this without a really strong reason.

Are you OK with a simple opt out on isolated CPUs? That would make
charges slightly slower (atomic on the hierarchy counters vs. a single
pcp adjustment) but it would guarantee that the isolated workload is
predictable which is the primary objective AFAICS.
--
Michal Hocko
SUSE Labs

2023-01-26 07:46:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
[...]
> Remote draining reduces interruptions whether CPU
> is marked as isolated or not:
>
> - Allows isolated CPUs from benefiting of pcp caching.
> - Removes the interruption to non isolated CPUs. See for example
>
> https://lkml.org/lkml/2022/6/13/2769

This is talking about the page allocator per-cpu caches, right? In this
patch we are talking about memcg pcp caches. Are you sure the same
applies here?
--
Michal Hocko
SUSE Labs

2023-01-26 18:38:17

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed, Jan 25, 2023 at 03:14:48PM -0800, Roman Gushchin wrote:
> On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > Disclaimer:
> > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > sections to better organize myself. I am not very confortable with it.
> > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > last minute.
> > > > >
> > > > >
> > > > > 0 - Motivation:
> > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > descendant of a given root_memcg.
>
> Do you know what caused those drain_all_stock() calls? I wonder if we should look
> into why we have many of them and whether we really need them?
>
> It's either some user's actions (e.g. reducing memory.max), either some memcg
> is entering pre-oom conditions. In the latter case a lot of drain calls can be
> scheduled without a good reason (assuming the cgroup contain multiple tasks running
> on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota
> and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> on per-cgroup basis.
>
> Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
> charges/memcg references (it might be not completely idle, but running some very special
> workload which is not doing any kernel allocations or a process belonging to the root memcg).
> In all other cases pcpu stock will be either drained naturally by an allocation from another
> memcg or an allocation from the same memcg will "restore" it, making draining useless.
>
> We also can into drain_all_pages() opportunistically, without waiting for the result.
> On a busy system it's most likely useless, we might oom before scheduled works will be executed.
>
> I admit I planned to do some work around and even started, but then never had enough time to
> finish it.
>
> Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> doing this without a really strong reason.

The expectation would be that cache locking should not cause slowdown of
the allocation and free paths:

https://manualsbrain.com/en/manuals/1246877/?page=313

For the P6 and more recent processor families, if the area of memory being locked
during a LOCK operation is cached in the processor that is performing the LOCK oper-
ation as write-back memory and is completely contained in a cache line, the
processor may not assert the LOCK# signal on the bus. Instead, it will modify the
memory location internally and allow it’s cache coherency mechanism to insure that
the operation is carried out atomically. This operation is called “cache locking.” The
cache coherency mechanism automatically prevents two or more processors that ...



2023-01-26 18:38:20

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > Disclaimer:
> > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > last minute.
> > > > > >
> > > > > >
> > > > > > 0 - Motivation:
> > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > descendant of a given root_memcg.
> >
> > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > into why we have many of them and whether we really need them?
> >
> > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > on multiple cpus).
>
> I believe I've never got a specific answer to that. We
> have discussed that in the previous version submission
> ([email protected] and specifically
> [email protected]). Leonardo has mentioned a mix of RT and
> isolcpus. I was wondering about using memcgs in RT workloads because
> that just sounds weird but let's say this is the case indeed.

This could be the case. You can consider an "edge device" where it is
necessary to run a RT workload. It might also be useful to run
non realtime applications on the same system.

> Then an RT task or whatever task that is running on an isolated
> cpu can have pcp charges.

Usually the RT task (or more specifically the realtime-sensitive loop
of the application) runs entirely in userspace. But I suppose there
could be charges on application startup.

> > Essentially each cpu will try to grab the remains of the memory quota
> > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > on per-cgroup basis.
>
> I think it would be more than sufficient to disable pcp charging on an
> isolated cpu. This is not a per memcg property. I can imagine that
> different tasks running in the same memcg can run on a mix of CPUs (e.g.
> only part of it on isolated CPUs). It is a recipe for all sorts of
> priority inversions but well, memcg and RT is there already.

I suppose the more general the solution, the better.

> > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
> > charges/memcg references (it might be not completely idle, but running some very special
> > workload which is not doing any kernel allocations or a process belonging to the root memcg).
> > In all other cases pcpu stock will be either drained naturally by an allocation from another
> > memcg or an allocation from the same memcg will "restore" it, making draining useless.
> >
> > We also can into drain_all_pages() opportunistically, without waiting for the result.
> > On a busy system it's most likely useless, we might oom before scheduled works will be executed.
>
> I think the primary objective is that no userspace unintended execution
> happens on isolated cpus.

No interruptions to the userspace code (time sensitive code) running on
isolated CPUs: no IPIs, no task switches.

> > I admit I planned to do some work around and even started, but then never had enough time to
> > finish it.
> >
> > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > doing this without a really strong reason.
>
> Are you OK with a simple opt out on isolated CPUs? That would make
> charges slightly slower (atomic on the hierarchy counters vs. a single
> pcp adjustment) but it would guarantee that the isolated workload is
> predictable which is the primary objective AFAICS.

This would make isolated CPUs "second class citizens": it would be nice
to be able to execute non realtime apps on isolated CPUs as well
(think of different periods of time during a day, one where
more realtime apps are required, another where less
realtime apps are required).

Concrete example: think of a computer handling vRAN traffic near a
cell tower. The traffic (therefore amount of processing required
by realtime applications) might vary during the day.

User might want to run containers that depend on good memcg charging
performance on isolated CPUs.


2023-01-26 18:38:22

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> [...]
> > Remote draining reduces interruptions whether CPU
> > is marked as isolated or not:
> >
> > - Allows isolated CPUs from benefiting of pcp caching.
> > - Removes the interruption to non isolated CPUs. See for example
> >
> > https://lkml.org/lkml/2022/6/13/2769
>
> This is talking about page allocato per cpu caches, right? In this patch
> we are talking about memcg pcp caches. Are you sure the same applies
> here?

Both can stall the users of the drain operation.

"Minchan Kim tested this independently and reported;

My workload is not NOHZ CPUs but run apps under heavy memory
pressure so they goes to direct reclaim and be stuck on
drain_all_pages until work on workqueue run."

Therefore using a workqueue to drain the memcg pcps also depends on the
remote CPU executing that work item in time (which can stall the
callers below). No?

===

7 3141 mm/memory.c <<wp_page_copy>>
if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL))
8 4118 mm/memory.c <<do_anonymous_page>>
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
9 4577 mm/memory.c <<do_cow_fault>>
if (mem_cgroup_charge(page_folio(vmf->cow_page), vma->vm_mm,
10 621 mm/migrate_device.c <<migrate_vma_insert_page>>
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
11 710 mm/shmem.c <<shmem_add_to_page_cache>>
error = mem_cgroup_charge(folio, charge_mm, gfp);


2023-01-26 19:14:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > [...]
> > > Remote draining reduces interruptions whether CPU
> > > is marked as isolated or not:
> > >
> > > - Allows isolated CPUs from benefiting of pcp caching.
> > > - Removes the interruption to non isolated CPUs. See for example
> > >
> > > https://lkml.org/lkml/2022/6/13/2769
> >
> > This is talking about page allocato per cpu caches, right? In this patch
> > we are talking about memcg pcp caches. Are you sure the same applies
> > here?
>
> Both can stall the users of the drain operation.

Yes. But it is important to consider who those users are. We are
draining when
- we are charging and the limit is hit so that memory reclaim
has to be triggered.
- hard, high limits are set and require memory reclaim.
- force_empty - full memory reclaim for a memcg
- memcg offlining - cgroup removel - quite a heavy operation as
well.
all those could be really costly kernel operations, and they affect an
isolated cpu only if the same memcg is used by both isolated and non-isolated
cpus. In other words, those costly operations would have to be triggered
from non-isolated cpus, and those are expected to be stalled. It is
the side effect of the scheduled local cpu draining that affects
the isolated cpu as well.

Is that more clear?
--
Michal Hocko
SUSE Labs

2023-01-26 19:20:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > Disclaimer:
> > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > > last minute.
> > > > > > >
> > > > > > >
> > > > > > > 0 - Motivation:
> > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > descendant of a given root_memcg.
> > >
> > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > into why we have many of them and whether we really need them?
> > >
> > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > on multiple cpus).
> >
> > I believe I've never got a specific answer to that. We
> > have discussed that in the previous version submission
> > ([email protected] and specifically
> > [email protected]). Leonardo has mentioned a mix of RT and
> > isolcpus. I was wondering about using memcgs in RT workloads because
> > that just sounds weird but let's say this is the case indeed.
>
> This could be the case. You can consider an "edge device" where it is
> necessary to run a RT workload. It might also be useful to run
> non realtime applications on the same system.
>
> > Then an RT task or whatever task that is running on an isolated
> > cpu can have pcp charges.
>
> Usually the RT task (or more specifically the realtime sensitive loop
> of the application) runs entirely on userspace. But i suppose there
> could be charges on application startup.

What is the role of memcg then? If the memory limit is in place and the
workload doesn't fit in then it will get reclaimed during start up and
memory would need to be refaulted if not mlocked. If it is mlocked then
the limit cannot be enforced and the start up would likely fail as a
result of the memcg oom killer.

[...]
> > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > doing this without a really strong reason.
> >
> > Are you OK with a simple opt out on isolated CPUs? That would make
> > charges slightly slower (atomic on the hierarchy counters vs. a single
> > pcp adjustment) but it would guarantee that the isolated workload is
> > predictable which is the primary objective AFAICS.
>
> This would make isolated CPUs "second class citizens": it would be nice
> to be able to execute non realtime apps on isolated CPUs as well
> (think of different periods of time during a day, one where
> more realtime apps are required, another where less
> realtime apps are required).

An alternative requires making the current implementation, which is
lockless, use locks and would introduce potential lock contention. This
could be harmful to regular workloads. Not using pcp caching would make
the fast path use a few atomics rather than a local pcp update. That is not
a terrible cost to pay for special-cased workloads which use isolcpus.
Really, we are not talking about a massive cost to be paid. At least
nobody has shown that in any numbers.
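
Just to illustrate what such an opt out could look like - an untested
sketch only, where cpu_is_isolated() and memcg_use_stock() are made up
names for "this cpu is excluded from housekeeping" and "this cpu may use
the per-cpu stock":

/*
 * Untested sketch. The idea is that both consume_stock() and
 * refill_stock() bail out early on isolated cpus, so those cpus charge
 * directly against the atomic page counters and never accumulate a
 * per-cpu stock that would need draining.
 */
static inline bool memcg_use_stock(void)
{
	return !cpu_is_isolated(smp_processor_id());
}

consume_stock() returning false early simply means the caller falls back
to the existing page counter path, which is the "few atomics" cost
mentioned above.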

> Concrete example: think of a computer handling vRAN traffic near a
> cell tower. The traffic (therefore amount of processing required
> by realtime applications) might vary during the day.
>
> User might want to run containers that depend on good memcg charging
> performance on isolated CPUs.

What kind of role would memcg play here? Do you want to memory constrain
that workload?
--
Michal Hocko
SUSE Labs

2023-01-26 23:12:59

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > Disclaimer:
> > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > last minute.
> > > > > >
> > > > > >
> > > > > > 0 - Motivation:
> > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > descendant of a given root_memcg.
> >
> > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > into why we have many of them and whether we really need them?
> >
> > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > on multiple cpus).
>
> I believe I've never got a specific answer to that. We
> have discussed that in the previous version submission
> ([email protected] and specifically
> [email protected]). Leonardo has mentioned a mix of RT and
> isolcpus. I was wondering about using memcgs in RT workloads because
> that just sounds weird but let's say this is the case indeed. Then an RT
> task or whatever task that is running on an isolated cpu can have pcp
> charges.
>
> > Essentially each cpu will try to grab the remains of the memory quota
> > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > on per-cgroup basis.
>
> I think it would be more than sufficient to disable pcp charging on an
> isolated cpu.

It might have significant performance consequences.

I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
the accuracy of memory limits and slightly increase the memory footprint (all
those dying memcgs...), but the impact will be limited. Actually it is limited
by the number of cpus.

> This is not a per memcg property.

Sure, my point was that in pre-oom condition several cpus might try to consolidate
the remains of the memory quota, actually working against each other. Separate issue,
which might be a reason why there are many flush attempts in the case we discuss.

Thanks!

2023-01-27 04:39:07

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > Disclaimer:
> > > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > > > last minute.
> > > > > > > >
> > > > > > > >
> > > > > > > > 0 - Motivation:
> > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > descendant of a given root_memcg.
> > > >
> > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > > into why we have many of them and whether we really need them?
> > > >
> > > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > > on multiple cpus).
> > >
> > > I believe I've never got a specific answer to that. We
> > > have discussed that in the previous version submission
> > > ([email protected] and specifically
> > > [email protected]). Leonardo has mentioned a mix of RT and
> > > isolcpus. I was wondering about using memcgs in RT workloads because
> > > that just sounds weird but let's say this is the case indeed.
> >
> > This could be the case. You can consider an "edge device" where it is
> > necessary to run a RT workload. It might also be useful to run
> > non realtime applications on the same system.
> >
> > > Then an RT task or whatever task that is running on an isolated
> > > cpu can have pcp charges.
> >
> > Usually the RT task (or more specifically the realtime sensitive loop
> > of the application) runs entirely on userspace. But i suppose there
> > could be charges on application startup.
>
> What is the role of memcg then? If the memory limit is in place and the
> workload doesn't fit in then it will get reclaimed during start up and
> memory would need to be refaulted if not mlocked. If it is mlocked then
> the limit cannot be enforced and the start up would likely fail as a
> result of the memcg oom killer.

1) Application which is not time sensitive executes on an isolated CPU,
with memcg control enabled. Per-CPU stock is created.

2) App with memcg control enabled exits, per-CPU stock is not drained.

3) Latency sensitive application starts, the isolated CPU's per-CPU stock
has to be drained, and:

/*
 * Drains all per-CPU charge caches for given root_memcg resp. subtree
 * of the hierarchy under it.
 */
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
	int cpu, curcpu;

	/* If someone's already draining, avoid adding running more workers. */
	if (!mutex_trylock(&percpu_charge_mutex))
		return;
	/*
	 * Notify other cpus that system-wide "drain" is running
	 * We do not care about races with the cpu hotplug because cpu down
	 * as well as workers from this path always operate on the local
	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
	 */
	migrate_disable();
	curcpu = smp_processor_id();
	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *memcg;
		bool flush = false;

		rcu_read_lock();
		memcg = stock->cached;
		if (memcg && stock->nr_pages &&
		    mem_cgroup_is_descendant(memcg, root_memcg))
			flush = true;
		else if (obj_stock_flush_required(stock, root_memcg))
			flush = true;
		rcu_read_unlock();

		if (flush &&
		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
			if (cpu == curcpu)
				drain_local_stock(&stock->work);
			else
				schedule_work_on(cpu, &stock->work);
		}
	}
	migrate_enable();
	mutex_unlock(&percpu_charge_mutex);
}

> [...]
> > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > > doing this without a really strong reason.
> > >
> > > Are you OK with a simple opt out on isolated CPUs? That would make
> > > charges slightly slower (atomic on the hierarchy counters vs. a single
> > > pcp adjustment) but it would guarantee that the isolated workload is
> > > predictable which is the primary objective AFAICS.
> >
> > This would make isolated CPUs "second class citizens": it would be nice
> > to be able to execute non realtime apps on isolated CPUs as well
> > (think of different periods of time during a day, one where
> > more realtime apps are required, another where less
> > realtime apps are required).
>
> An alternative requires to make the current implementation that is
> lockless to use locks and introduce potential lock contention. This
> could be harmful to regular workloads. Not using pcp caching would make
> a fast path using few atomics rather than local pcp update. That is not
> a terrible cost to pay for special cased workloads which use isolcpus.
> Really we are not talking about a massive cost to be payed. At least
> nobody has shown that in any numbers.
>
> > Concrete example: think of a computer handling vRAN traffic near a
> > cell tower. The traffic (therefore amount of processing required
> > by realtime applications) might vary during the day.
> >
> > User might want to run containers that depend on good memcg charging
> > performance on isolated CPUs.
>
> What kind of role would memcg play here? Do you want to memory constrain
> that workload?

See example above.

> --
> Michal Hocko
> SUSE Labs
>
>


2023-01-27 05:41:21

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, 2023-01-26 at 15:19 -0300, Marcelo Tosatti wrote:
> On Wed, Jan 25, 2023 at 03:14:48PM -0800, Roman Gushchin wrote:
> > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > Disclaimer:
> > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > last minute.
> > > > > >
> > > > > >
> > > > > > 0 - Motivation:
> > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > descendant of a given root_memcg.
> >
> > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > into why we have many of them and whether we really need them?
> >
> > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota
> > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > on per-cgroup basis.
> >
> > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
> > charges/memcg references (it might be not completely idle, but running some very special
> > workload which is not doing any kernel allocations or a process belonging to the root memcg).
> > In all other cases pcpu stock will be either drained naturally by an allocation from another
> > memcg or an allocation from the same memcg will "restore" it, making draining useless.
> >
> > We also can into drain_all_pages() opportunistically, without waiting for the result.
> > On a busy system it's most likely useless, we might oom before scheduled works will be executed.
> >
> > I admit I planned to do some work around and even started, but then never had enough time to
> > finish it.
> >
> > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > doing this without a really strong reason.
>
> The expectation would be that cache locking should not cause slowdown of
> the allocation and free paths:
>
> https://manualsbrain.com/en/manuals/1246877/?page=313
>
> For the P6 and more recent processor families, if the area of memory being locked
> during a LOCK operation is cached in the processor that is performing the LOCK oper-
> ation as write-back memory and is completely contained in a cache line, the
> processor may not assert the LOCK# signal on the bus. Instead, it will modify the
> memory location internally and allow it’s cache coherency mechanism to insure that
> the operation is carried out atomically. This operation is called “cache locking.” The
> cache coherency mechanism automatically prevents two or more processors that ...
>
>

Just to keep the info easily available: the protected structure (struct
memcg_stock_pcp) fits in 48 bytes, which is less than the usual 64B cacheline.

struct memcg_stock_pcp {
	spinlock_t                 stock_lock;              /*     0     4 */
	unsigned int               nr_pages;                /*     4     4 */
	struct mem_cgroup *        cached;                  /*     8     8 */
	struct obj_cgroup *        cached_objcg;            /*    16     8 */
	struct pglist_data *       cached_pgdat;            /*    24     8 */
	unsigned int               nr_bytes;                /*    32     4 */
	int                        nr_slab_reclaimable_b;   /*    36     4 */
	int                        nr_slab_unreclaimable_b; /*    40     4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

(It got smaller after patches 3/5, 4/5 and 5/5, which remove holes, work_struct
and flags respectively.)

On top of that, patch 1/5 makes sure the percpu allocation is aligned to
cacheline size.
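
(For reference, the effect of 1/5 is roughly of the following shape - the
actual patch may spell it differently, this is just to visualize the
alignment:)

/* Make the per-cpu stock start on its own cacheline, so remote reads and
 * writes never share a line with unrelated per-cpu data. */
static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock);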


2023-01-27 06:56:42

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > [...]
> > > > Remote draining reduces interruptions whether CPU
> > > > is marked as isolated or not:
> > > >
> > > > - Allows isolated CPUs from benefiting of pcp caching.
> > > > - Removes the interruption to non isolated CPUs. See for example
> > > >
> > > > https://lkml.org/lkml/2022/6/13/2769
> > >
> > > This is talking about page allocato per cpu caches, right? In this patch
> > > we are talking about memcg pcp caches. Are you sure the same applies
> > > here?
> >
> > Both can stall the users of the drain operation.
>
> Yes. But it is important to consider who those users are. We are
> draining when
> - we are charging and the limit is hit so that memory reclaim
> has to be triggered.
> - hard, high limits are set and require memory reclaim.
> - force_empty - full memory reclaim for a memcg
> - memcg offlining - cgroup removel - quite a heavy operation as
> well.
> all those could be really costly kernel operations and they affect
> isolated cpu only if the same memcg is used by both isolated and non-isolated
> cpus. In other words those costly operations would have to be triggered
> from non-isolated cpus and those are to be expected to be stalled. It is
> the side effect of the local cpu draining that is scheduled that affects
> the isolated cpu as well.
>
> Is that more clear?

I think so, please help me check:

IIUC, we can approach this by dividing the problem into two working modes:
1 - Normal, meaning no drain_all_stock() running.
2 - Draining, grouping together pre-OOM and userspace 'config' changes:
changing, destroying or reconfiguring a memcg.

For (1), we will have (ideally) only the local cpu working on the percpu struct.
This mode will not have any kind of contention, because each CPU will hold its
own spinlock only.

For (2), we will have a lot of drain_all_stock() running. This will mean a lot
of schedule_work_on() running (on upstream), or possibly contention, i.e.
local cpus having to wait for a lock to get their cache, on the patch proposal.

Ok, assuming the above is correct:

# Some arguments point out that (1) becomes slower with this patch.

This is partially true: while test 2.2 showed that the running time of the
local cpu functions became slower by a few cycles, test 2.4 shows that the
userspace perception of it was that the syscalls and pagefaulting actually
became faster:

During some debugging tests before getting the performance numbers of test 2.4,
I noticed that the 'syscall + write' test would call all those functions that
became slower on test 2.2. Those functions were called multiple millions of
times during a single test, and still the patched version performed faster on
test 2.4 than the upstream version. Maybe the functions became slower, but
overall their usage in the usual context became faster.

Isn't that at least a small improvement?

# Regarding (2), I notice that we fear contention

While this seems to be the harder part of the discussion, I think we have enough
data to deal with it.

In which case would contention be a big problem here?
IIUC it would be when a lot of drain_all_stock() calls get running because the
memory limit is getting near. I mean, having the user create / modify a memcg
multiple times a second for a while is not something that is expected, IMHO.

Now, if I assumed correctly and the case where contention could be a problem is
on a memcg with high memory pressure, then we have the argument that Marcelo
Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
allocation brought better results than local_locks + schedule_work_on().

I mean, while contention would cause the cpu to wait for a while before getting
the lock for allocating a page from cache, something similar would happen with
schedule_work_on(), which would force the current task to wait while the
draining happens locally. 

What I am able to see is that, for each drain_all_stock(), for each cpu getting
drained we have the option to (a) (sometimes) wait for a lock to be freed, or
(b) wait for a whole context switch to happen.
And IIUC, (b) is much slower than (a) on average, and this is what causes the
improved performance seen in [P1].
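
(To make (a) concrete: what the patched drain_all_stock() ends up doing
per remote cpu is basically the following - a simplified sketch, not the
literal patch, and drain_remote_stock() is just a name I am making up
here:)

/*
 * Instead of schedule_work_on(cpu, &stock->work), the caller takes the
 * remote cpu's stock_lock and drains it directly.
 */
static void drain_remote_stock(struct memcg_stock_pcp *stock)
{
	unsigned long flags;

	spin_lock_irqsave(&stock->stock_lock, flags);
	drain_stock(stock);	/* return the cached charges to the page counters */
	spin_unlock_irqrestore(&stock->stock_lock, flags);
}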

(I mean, waiting while drain_local_stock() runs on the local CPU vs waiting for
it to run on the remote CPU may not be that different, since the cacheline is
already written to by the remote cpu on upstream.)

Also, according to test 2.2, for the patched version drain_local_stock() has
gotten faster (much faster for 128 cpus), even though it does all the draining
instead of just scheduling it on the other cpus.
I mean, adding that to the brief nature of local cpu functions, we may not hit
contention as much as expected.

##

Sorry for the long text.
I may be missing some point, please let me know if that's the case.

Thanks a lot for reviewing!
Leo

[P1]: https://lkml.org/lkml/2022/6/13/2769


2023-01-27 06:58:44

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu 26-01-23 21:32:18, Marcelo Tosatti wrote:
> On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> > On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > > Disclaimer:
> > > > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > > > > last minute.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 0 - Motivation:
> > > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > > descendant of a given root_memcg.
> > > > >
> > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > > > into why we have many of them and whether we really need them?
> > > > >
> > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > > > on multiple cpus).
> > > >
> > > > I believe I've never got a specific answer to that. We
> > > > have discussed that in the previous version submission
> > > > ([email protected] and specifically
> > > > [email protected]). Leonardo has mentioned a mix of RT and
> > > > isolcpus. I was wondering about using memcgs in RT workloads because
> > > > that just sounds weird but let's say this is the case indeed.
> > >
> > > This could be the case. You can consider an "edge device" where it is
> > > necessary to run a RT workload. It might also be useful to run
> > > non realtime applications on the same system.
> > >
> > > > Then an RT task or whatever task that is running on an isolated
> > > > cpu can have pcp charges.
> > >
> > > Usually the RT task (or more specifically the realtime sensitive loop
> > > of the application) runs entirely on userspace. But i suppose there
> > > could be charges on application startup.
> >
> > What is the role of memcg then? If the memory limit is in place and the
> > workload doesn't fit in then it will get reclaimed during start up and
> > memory would need to be refaulted if not mlocked. If it is mlocked then
> > the limit cannot be enforced and the start up would likely fail as a
> > result of the memcg oom killer.
>
> 1) Application which is not time sensitive executes on isolated CPU,
> with memcg control enabled. Per-CPU stock is created.
>
> 2) App with memcg control enabled exits, per-CPU stock is not drained.
>
> 3) Latency sensitive application starts, isolated per-CPU has stock to
> be drained, and:
>
> /*
> * Drains all per-CPU charge caches for given root_memcg resp. subtree
> * of the hierarchy under it.
> */
> static void drain_all_stock(struct mem_cgroup *root_memcg)

No, this is not really answering my question. See
[email protected] which already explains how the draining
would be triggered. It is not something that happens on just any operation.

I am really asking for specific workloads which are running multiple
processes on a mix of isolated and non-isolated cpus yet share a
memcg so that they can interfere. The consequences of the common memcg
are described above.
--
Michal Hocko
SUSE Labs

2023-01-27 07:11:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

[Cc Frederic]

On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
[...]
> > > Essentially each cpu will try to grab the remains of the memory quota
> > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > on per-cgroup basis.
> >
> > I think it would be more than sufficient to disable pcp charging on an
> > isolated cpu.
>
> It might have significant performance consequences.

Is it really significant?

> I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> the accuracy of memory limits and slightly increase the memory footprint (all
> those dying memcgs...), but the impact will be limited. Actually it is limited
> by the number of cpus.

Hmm, OK, I have misunderstood your proposal. Yes, the overall pcp charges
potentially left behind should be small and that shouldn't really be a
concern for memcg oom situations (unless the limit is very small, and
workloads on isolated cpus using small hard limits are way beyond my
imagination).

My first thought was that those charges could be left behind without any
upper bound, but in reality sooner or later something should be running
on those cpus and, if the memcg is gone, the pcp cache would get refilled
and the old charges dropped.

So yes, this is actually a better and even simpler solution. All we need
is something like this
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..13b84bbd70ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;
 
+		if (cpu_is_isolated(cpu))
+			continue;
+
 		rcu_read_lock();
 		memcg = stock->cached;
 		if (memcg && stock->nr_pages &&

There is no such cpu_is_isolated() AFAICS, so we would need help from the
NOHZ and cpuisol people to create one for us. Frederic, would such an
abstraction make any sense from your POV?
--
Michal Hocko
SUSE Labs

2023-01-27 07:15:18

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > Disclaimer:
> > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > > last minute.
> > > > > > >
> > > > > > >
> > > > > > > 0 - Motivation:
> > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > descendant of a given root_memcg.
> > >
> > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > into why we have many of them and whether we really need them?
> > >
> > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > on multiple cpus).
> >
> > I believe I've never got a specific answer to that. We
> > have discussed that in the previous version submission
> > ([email protected] and specifically
> > [email protected]). Leonardo has mentioned a mix of RT and
> > isolcpus. I was wondering about using memcgs in RT workloads because
> > that just sounds weird but let's say this is the case indeed. Then an RT
> > task or whatever task that is running on an isolated cpu can have pcp
> > charges.
> >
> > > Essentially each cpu will try to grab the remains of the memory quota
> > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > on per-cgroup basis.
> >
> > I think it would be more than sufficient to disable pcp charging on an
> > isolated cpu.
>
> It might have significant performance consequences.
>
> I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> the accuracy of memory limits and slightly increase the memory footprint (all
> those dying memcgs...), but the impact will be limited. Actually it is limited
> by the number of cpus.

I was discussing this same idea with Marcelo yesterday morning.

The questions we had on the topic were:
a - About how many pages will the pcp cache hold before draining them itself?
b - Would it also cache any kind of bigger page, or huge page, in this same aspect?

Please let me know if I got anything wrong, but IIUC from a previous debug session
(a)'s answer is 4 pages. Meaning that even on bigger-page archs such as powerpc,
with 64k pages, the max pcp cache 'wasted' on each processor would be 256k (very
small by today's standards).

Please let me know if you have any info on (b), or any correction on (a).

The thing is: having this drain_local_stock() waiver only for isolated cpus
would not bring the same benefits for non-isolated cpus under high memory pressure
that I understand this patchset is bringing.

OTOH not running drain_local_stock() at all on every cpu may introduce
performance gains (no remote CPU access), but can be a problem if I got the
'wasted pages' count in (a) wrong. I mean, drain_local_stock() was introduced
for a reason.

>
> > This is not a per memcg property.
>
> Sure, my point was that in pre-oom condition several cpus might try to consolidate
> the remains of the memory quota, actually working against each other. Separate issue,
> which might be a reason why there are many flush attempts in the case we discuss.
>
> Thanks!
>

Thank you for reviewing!

Leo


2023-01-27 07:20:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
[...]
> > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > the accuracy of memory limits and slightly increase the memory footprint (all
> > those dying memcgs...), but the impact will be limited. Actually it is limited
> > by the number of cpus.
>
> I was discussing this same idea with Marcelo yesterday morning.
>
> The questions had in the topic were:
> a - About how many pages the pcp cache will hold before draining them itself?

MEMCG_CHARGE_BATCH (64 currently). And one more clarification: the cache
doesn't really hold any pages. It is a mere counter of how many charges
have been accounted for the memcg page counter. So it is not really
consuming a proportional amount of resources. It just pins the
corresponding memcg. Have a look at consume_stock and refill_stock.
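
For reference, the fast path is essentially the following (a heavily
simplified view, with the irq handling and the objcg part left out, and
renamed to make clear it is not the real function):

/* Heavily simplified view of consume_stock(); see mm/memcontrol.c for
 * the real thing. The "cache" is just a number of pre-accounted charges. */
static bool consume_stock_simplified(struct mem_cgroup *memcg,
				     unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return false;

	if (stock->cached == memcg && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}
	return false;
}

refill_stock() is roughly the counterpart: it puts the unused part of a
batched charge into that counter, flushing the previous memcg's leftover
first if the cached memcg changes.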

> b - Would it cache any kind of bigger page, or huge page in this same aspect?

The above should answer this as well as those following up I hope. If
not let me know.
--
Michal Hocko
SUSE Labs

2023-01-27 07:23:18

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote:
> [Cc Frederic]
>
> On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> [...]
> > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > on per-cgroup basis.
> > >
> > > I think it would be more than sufficient to disable pcp charging on an
> > > isolated cpu.
> >
> > It might have significant performance consequences.
>
> Is it really significant?
>
> > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > the accuracy of memory limits and slightly increase the memory footprint (all
> > those dying memcgs...), but the impact will be limited. Actually it is limited
> > by the number of cpus.
>
> Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> potentially left behind should be small and that shouldn't really be a
> concern for memcg oom situations (unless the limit is very small and
> workloads on isolated cpus using small hard limits is way beyond my
> imagination).
>
> My first thought was that those charges could be left behind without any
> upper bound but in reality sooner or later something should be running
> on those cpus and if the memcg is gone the pcp cache would get refilled
> and old charges gone.
>
> So yes, this is actually a better and even simpler solution. All we need
> is something like this
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..13b84bbd70ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> struct mem_cgroup *memcg;
> bool flush = false;
>
> + if (cpu_is_isolated(cpu))
> + continue;
> +
> rcu_read_lock();
> memcg = stock->cached;
> if (memcg && stock->nr_pages &&
>
> There is no such cpu_is_isolated() AFAICS so we would need a help from
> NOHZ and cpuisol people to create one for us. Frederic, would such an
> abstraction make any sense from your POV?


IIUC, 'if (cpu_is_isolated())' would be instead:

if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
!housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)


2023-01-27 07:36:21

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> [...]
> > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > by the number of cpus.
> >
> > I was discussing this same idea with Marcelo yesterday morning.
> >
> > The questions had in the topic were:
> > a - About how many pages the pcp cache will hold before draining them itself? 
>
> MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> doesn't really hold any pages. It is a mere counter of how many charges
> have been accounted for the memcg page counter. So it is not really
> consuming proportional amount of resources. It just pins the
> corresponding memcg. Have a look at consume_stock and refill_stock

I see. Thanks for pointing that out!

So in the worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)
that are not getting used, and that may cause an 'earlier' OOM if this amount is
needed but can't be freed.

Following that worst case on a big powerpc machine: 256 CPUs, each
holding 64k * 64 pages => 1GB of memory minus 4MB (the one cpu actually using its
stock). It's starting to get too big, but still ok for a machine this size.

The thing is that it can present an odd behavior:
You have a cgroup created before, now empty, and try to run a given application,
and it hits OOM.
You then restart the cgroup, and run the same application without an issue.

Even though it looks like a good possibility, this can be perceived by the user as
instability.

>
> > b - Would it cache any kind of bigger page, or huge page in this same aspect?
>
> The above should answer this as well as those following up I hope. If
> not let me know.

IIUC we are talking about normal pages, is that it?

Best regards,
Leo


2023-01-27 08:13:17

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote:
> > [Cc Frederic]
> >
> > On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > [...]
> > > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > > on per-cgroup basis.
> > > >
> > > > I think it would be more than sufficient to disable pcp charging on an
> > > > isolated cpu.
> > >
> > > It might have significant performance consequences.
> >
> > Is it really significant?
> >
> > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > by the number of cpus.
> >
> > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > potentially left behind should be small and that shouldn't really be a
> > concern for memcg oom situations (unless the limit is very small and
> > workloads on isolated cpus using small hard limits is way beyond my
> > imagination).
> >
> > My first thought was that those charges could be left behind without any
> > upper bound but in reality sooner or later something should be running
> > on those cpus and if the memcg is gone the pcp cache would get refilled
> > and old charges gone.
> >
> > So yes, this is actually a better and even simpler solution. All we need
> > is something like this
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ab457f0394ab..13b84bbd70ba 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > struct mem_cgroup *memcg;
> > bool flush = false;
> >
> > + if (cpu_is_isolated(cpu))
> > + continue;
> > +
> > rcu_read_lock();
> > memcg = stock->cached;
> > if (memcg && stock->nr_pages &&
> >
> > There is no such cpu_is_isolated() AFAICS so we would need a help from
> > NOHZ and cpuisol people to create one for us. Frederic, would such an
> > abstraction make any sense from your POV?
>
>
> IIUC, 'if (cpu_is_isolated())' would be instead:
>
> if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
> !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)

oh, sorry 's/smp_processor_id()/cpu/' here:

if (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ))


Not sure why I added smp_processor_id() instead of cpu. I think I need some
sleep. :)


2023-01-27 09:24:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri 27-01-23 05:12:13, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote:
> > On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote:
> > > [Cc Frederic]
> > >
> > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > > > on per-cgroup basis.
> > > > >
> > > > > I think it would be more than sufficient to disable pcp charging on an
> > > > > isolated cpu.
> > > >
> > > > It might have significant performance consequences.
> > >
> > > Is it really significant?
> > >
> > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > by the number of cpus.
> > >
> > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > > potentially left behind should be small and that shouldn't really be a
> > > concern for memcg oom situations (unless the limit is very small and
> > > workloads on isolated cpus using small hard limits is way beyond my
> > > imagination).
> > >
> > > My first thought was that those charges could be left behind without any
> > > upper bound but in reality sooner or later something should be running
> > > on those cpus and if the memcg is gone the pcp cache would get refilled
> > > and old charges gone.
> > >
> > > So yes, this is actually a better and even simpler solution. All we need
> > > is something like this
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ab457f0394ab..13b84bbd70ba 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > > struct mem_cgroup *memcg;
> > > bool flush = false;
> > >
> > > + if (cpu_is_isolated(cpu))
> > > + continue;
> > > +
> > > rcu_read_lock();
> > > memcg = stock->cached;
> > > if (memcg && stock->nr_pages &&
> > >
> > > There is no such cpu_is_isolated() AFAICS so we would need a help from
> > > NOHZ and cpuisol people to create one for us. Frederic, would such an
> > > abstraction make any sense from your POV?
> >
> >
> > IIUC, 'if (cpu_is_isolated())' would be instead:
> >
> > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
> > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)
>
> oh, sorry 's/smp_processor_id()/cpu/' here:
>
> if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ))

Or maybe we can get a nice abstract API so that we do not have to
care about those low level details. I do not really know what they
mean, and hopefully I shouldn't need to know.
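
Something of this shape would be enough from the memcg side - just a
sketch wrapping the housekeeping helpers mentioned earlier in the thread;
the name and the exact set of HK types to check are exactly the low level
details I would like to defer:

/* Sketch only: treat a cpu as isolated when it is excluded from scheduler
 * domains or from unbound workqueues. cpuset based isolation is not
 * covered here. */
static inline bool cpu_is_isolated(int cpu)
{
	return !housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
	       !housekeeping_cpu(cpu, HK_TYPE_WQ);
}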

--
Michal Hocko
SUSE Labs

2023-01-27 09:29:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri 27-01-23 04:35:22, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> > On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> > [...]
> > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > by the number of cpus.
> > >
> > > I was discussing this same idea with Marcelo yesterday morning.
> > >
> > > The questions had in the topic were:
> > > a - About how many pages the pcp cache will hold before draining them itself?
> >
> > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> > doesn't really hold any pages. It is a mere counter of how many charges
> > have been accounted for the memcg page counter. So it is not really
> > consuming proportional amount of resources. It just pins the
> > corresponding memcg. Have a look at consume_stock and refill_stock
>
> I see. Thanks for pointing that out!
>
> So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)

s@numcpus@num_isolated_cpus@

> that are not getting used, and may cause an 'earlier' OOM if this amount is
> needed but can't be freed.

s@OOM@memcg OOM@

> In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each
> holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources).
> It's starting to get too big, but still ok for a machine this size.

It is more about the memcg limit rather than the size of the machine.
Again, let's focus on the actual usecase. What is the usual memcg setup with
those isolcpus?

> The thing is that it can present an odd behavior:
> You have a cgroup created before, now empty, and try to run given application,
> and hits OOM.

The application would either consume those cached charges or flush them
if it is running in a different memcg. Or what do you have in mind?

> You then restart the cgroup, run the same application without an issue.
>
> Even though it looks a good possibility, this can be perceived by user as
> instability.
>
> >
> > > b - Would it cache any kind of bigger page, or huge page in this same aspect?
> >
> > The above should answer this as well as those following up I hope. If
> > not let me know.
>
> IIUC we are talking normal pages, is that it?

We are talking about memcg charges and those have page granularity.

--
Michal Hocko
SUSE Labs

2023-01-27 13:04:08

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, Jan 27, 2023 at 05:12:13AM -0300, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote:
> > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > > potentially left behind should be small and that shouldn't really be a
> > > concern for memcg oom situations (unless the limit is very small and
> > > workloads on isolated cpus using small hard limits is way beyond my
> > > imagination).
> > >
> > > My first thought was that those charges could be left behind without any
> > > upper bound but in reality sooner or later something should be running
> > > on those cpus and if the memcg is gone the pcp cache would get refilled
> > > and old charges gone.
> > >
> > > So yes, this is actually a better and even simpler solution. All we need
> > > is something like this
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ab457f0394ab..13b84bbd70ba 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > > struct mem_cgroup *memcg;
> > > bool flush = false;
> > >
> > > + if (cpu_is_isolated(cpu))
> > > + continue;
> > > +
> > > rcu_read_lock();
> > > memcg = stock->cached;
> > > if (memcg && stock->nr_pages &&
> > >
> > > There is no such cpu_is_isolated() AFAICS so we would need a help from
> > > NOHZ and cpuisol people to create one for us. Frederic, would such an
> > > abstraction make any sense from your POV?
> >
> >
> > IIUC, 'if (cpu_is_isolated())' would be instead:
> >
> > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
> > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)
>
> oh, sorry 's/smp_processor_id()/cpu/' here:
>
> if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ))

Do you also need to handle cpuset.sched_load_balance=0 (aka. the cpuset v2 "isolated"
partition type)? It has the same effect as isolcpus=, but it can be changed at
runtime.

In that case on_null_domain() looks like what you need. You'd have to make that API
more generally available though, and rename it to something like
"bool cpu_has_null_domain(int cpu)".

But then you also need to handle concurrent cpuset changes. If you can tolerate
them being racy, then RCU alone is fine.

Thanks.

2023-01-27 14:00:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri 27-01-23 08:11:04, Michal Hocko wrote:
> [Cc Frederic]
>
> On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> [...]
> > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > on per-cgroup basis.
> > >
> > > I think it would be more than sufficient to disable pcp charging on an
> > > isolated cpu.
> >
> > It might have significant performance consequences.
>
> Is it really significant?
>
> > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > the accuracy of memory limits and slightly increase the memory footprint (all
> > those dying memcgs...), but the impact will be limited. Actually it is limited
> > by the number of cpus.
>
> Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> potentially left behind should be small and that shouldn't really be a
> concern for memcg oom situations (unless the limit is very small and
> workloads on isolated cpus using small hard limits is way beyond my
> imagination).
>
> My first thought was that those charges could be left behind without any
> upper bound but in reality sooner or later something should be running
> on those cpus and if the memcg is gone the pcp cache would get refilled
> and old charges gone.
>
> So yes, this is actually a better and even simpler solution. All we need
> is something like this
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..13b84bbd70ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> struct mem_cgroup *memcg;
> bool flush = false;
>
> + if (cpu_is_isolated(cpu))
> + continue;
> +
> rcu_read_lock();
> memcg = stock->cached;
> if (memcg && stock->nr_pages &&

Btw. this would be overly pessimistic. The following should make more
sense:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..55e440e54504 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-			else
+			else if (!cpu_is_isolated(cpu))
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
--
Michal Hocko
SUSE Labs

2023-01-27 18:19:21

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, Jan 27, 2023 at 02:58:19PM +0100, Michal Hocko wrote:
> On Fri 27-01-23 08:11:04, Michal Hocko wrote:
> > [Cc Frederic]
> >
> > On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > [...]
> > > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > > on per-cgroup basis.
> > > >
> > > > I think it would be more than sufficient to disable pcp charging on an
> > > > isolated cpu.
> > >
> > > It might have significant performance consequences.
> >
> > Is it really significant?
> >
> > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > by the number of cpus.
> >
> > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > potentially left behind should be small and that shouldn't really be a
> > concern for memcg oom situations (unless the limit is very small and
> > workloads on isolated cpus using small hard limits is way beyond my
> > imagination).
> >
> > My first thought was that those charges could be left behind without any
> > upper bound but in reality sooner or later something should be running
> > on those cpus and if the memcg is gone the pcp cache would get refilled
> > and old charges gone.
> >
> > So yes, this is actually a better and even simpler solution. All we need
> > is something like this
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ab457f0394ab..13b84bbd70ba 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > struct mem_cgroup *memcg;
> > bool flush = false;
> >
> > + if (cpu_is_isolated(cpu))
> > + continue;
> > +
> > rcu_read_lock();
> > memcg = stock->cached;
> > if (memcg && stock->nr_pages &&
>
> Btw. this would be over pessimistic. The following should make more
> sense:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..55e440e54504 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> if (cpu == curcpu)
> drain_local_stock(&stock->work);
> - else
> + else if (!cpu_is_isolated(cpu))
> schedule_work_on(cpu, &stock->work);
> }
> }


Yes, this is exactly what I was thinking of. It should solve the problem
for isolated cpus well enough without introducing an overhead for everybody else.

If you'll make a proper patch, please add my
Acked-by: Roman Gushchin <[email protected]>

I understand the concerns regarding spurious OOMs on a 256-core machine,
but I guess they are somewhat theoretical and also possible with the current code
(e.g. one ooming cgroup can effectively block draining for everybody else).

Thanks!

2023-01-27 19:30:47

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, 2023-01-27 at 10:29 +0100, Michal Hocko wrote:
> On Fri 27-01-23 04:35:22, Leonardo Brás wrote:
> > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> > > [...]
> > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > > by the number of cpus.
> > > >
> > > > I was discussing this same idea with Marcelo yesterday morning.
> > > >
> > > > The questions had in the topic were:
> > > > a - About how many pages the pcp cache will hold before draining them itself? 
> > >
> > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> > > doesn't really hold any pages. It is a mere counter of how many charges
> > > have been accounted for the memcg page counter. So it is not really
> > > consuming proportional amount of resources. It just pins the
> > > corresponding memcg. Have a look at consume_stock and refill_stock
> >
> > I see. Thanks for pointing that out!
> >
> > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)
>
> s@numcpus@num_isolated_cpus@

I was thinking of the worst case scenario, with (ncpus - 1) cpus being isolated.

>
> > that are not getting used, and may cause an 'earlier' OOM if this amount is
> > needed but can't be freed.
>
> s@OOM@memcg OOM@

> > In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each
> > holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources).
> > It's starting to get too big, but still ok for a machine this size.
>
> It is more about the memcg limit rather than the size of the machine.
> Again, let's focus on actual usacase. What is the usual memcg setup with
> those isolcpus

I understand it's about the limit, not the actually allocated memory. When I
mention the machine size, I mean what is expected to be acceptable to a user of
that machine.

>
> > The thing is that it can present an odd behavior:
> > You have a cgroup created before, now empty, and try to run given application,
> > and hits OOM.
>
> The application would either consume those cached charges or flush them
> if it is running in a different memcg. Or what do you have in mind?

1 - Create a memcg with a VM inside, with multiple vcpus pinned to isolated cpus.
2 - Run a multi-cpu task inside the VM; it allocates memory on every CPU and keeps
the pcp caches populated.
3 - Try to run a single-cpu task (pinned?) inside the VM, which uses almost all
the available memory.
4 - memcg OOM.

Does it make sense?


>
> > You then restart the cgroup, run the same application without an issue.
> >
> > Even though it looks like a good possibility, this can be perceived by the user
> > as instability.
> >
> > >
> > > > b - Would it cache any kind of bigger page, or huge page in this same aspect?
> > >
> > > The above should answer this as well as those following up I hope. If
> > > not let me know.
> >
> > IIUC we are talking normal pages, is that it?
>
> We are talking about memcg charges and those have page granularity.
>

Thanks for the info!

Also, thanks for the feedback!
Leo


2023-01-27 23:50:48

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, Jan 27, 2023 at 04:29:37PM -0300, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 10:29 +0100, Michal Hocko wrote:
> > On Fri 27-01-23 04:35:22, Leonardo Brás wrote:
> > > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> > > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> > > > [...]
> > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > > > by the number of cpus.
> > > > >
> > > > > I was discussing this same idea with Marcelo yesterday morning.
> > > > >
> > > > > The questions we had on the topic were:
> > > > > a - About how many pages will the pcp cache hold before draining itself?
> > > >
> > > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> > > > doesn't really hold any pages. It is a mere counter of how many charges
> > > > have been accounted for the memcg page counter. So it is not really
> > > > consuming proportional amount of resources. It just pins the
> > > > corresponding memcg. Have a look at consume_stock and refill_stock
> > >
> > > I see. Thanks for pointing that out!
> > >
> > > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)
> >
> > s@numcpus@num_isolated_cpus@
>
> I was thinking of the worst case scenario being (ncpus - 1) cpus isolated.
>
> >
> > > that are not getting used, and may cause an 'earlier' OOM if this amount is
> > > needed but can't be freed.
> >
> > s@OOM@memcg OOM@
>
> > > Following that worst case, suppose a big powerpc machine with 256 CPUs, each
> > > holding 64 pages of 64KB => about 1GB of memory, minus the 4MB of the one cpu
> > > actually using its stock. It's starting to get big, but still ok for a machine this size.
> >
> > It is more about the memcg limit rather than the size of the machine.
> > Again, let's focus on the actual use case. What is the usual memcg setup with
> > those isolcpus?
>
> I understand it's about the limit, not actually allocated memory. When I mention
> the machine size, I mean what is expected to be acceptable to a user of a machine
> that size.
>
> >
> > > The thing is that it can present an odd behavior:
> > > You have a cgroup created before, now empty, and try to run a given application,
> > > and it hits OOM.
> >
> > The application would either consume those cached charges or flush them
> > if it is running in a different memcg. Or what do you have in mind?
>
> 1 - Create a memcg with a VM inside, multiple vcpus pinned to isolated cpus.
> 2 - Run a multi-cpu task inside the VM, it allocates memory for every CPU and keeps
> the pcp caches
> 3 - Try to run a single-cpu task (pinned?) inside the VM, which uses almost all
> the available memory.
> 4 - memcg OOM.
>
> Does it make sense?

It can happen now as well, you just need a competing drain request.

Honestly, I feel the probability of this scenario being a real problem is fairly low.
I don't recall any complaints about spurious OOMs because of races in the draining code.
Usually, machines which are tight on memory rarely have so many idle cpus.

Thanks!

2023-01-31 12:15:21

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote:
> On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > > [...]
> > > > > Remote draining reduces interruptions whether CPU
> > > > > is marked as isolated or not:
> > > > >
> > > > > - Allows isolated CPUs to benefit from pcp caching.
> > > > > - Removes the interruption to non-isolated CPUs. See for example
> > > > >
> > > > > https://lkml.org/lkml/2022/6/13/2769
> > > >
> > > > This is talking about page allocator per-cpu caches, right? In this patch
> > > > we are talking about memcg pcp caches. Are you sure the same applies
> > > > here?
> > >
> > > Both can stall the users of the drain operation.
> >
> > Yes. But it is important to consider who those users are. We are
> > draining when
> > - we are charging and the limit is hit so that memory reclaim
> > has to be triggered.
> > - hard, high limits are set and require memory reclaim.
> > - force_empty - full memory reclaim for a memcg
> > - memcg offlining - cgroup removal - quite a heavy operation as
> > well.
> > all those could be really costly kernel operations and they affect
> > isolated cpu only if the same memcg is used by both isolated and non-isolated
> > cpus. In other words those costly operations would have to be triggered
> > from non-isolated cpus and those are to be expected to be stalled. It is
> > the side effect of the local cpu draining that is scheduled that affects
> > the isolated cpu as well.
> >
> > Is that more clear?
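
(For reference, the draining path being discussed is roughly the following; a
simplified paraphrase of the upstream drain_all_stock(), with the
percpu_charge_mutex, RCU and objcg details left out:)

static void drain_all_stock(struct mem_cgroup *root_memcg)
{
        int cpu, curcpu;

        /* Preemption is disabled around this loop in the real code. */
        curcpu = smp_processor_id();
        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                struct mem_cgroup *memcg = READ_ONCE(stock->cached);

                if (!memcg || !READ_ONCE(stock->nr_pages) ||
                    !mem_cgroup_is_descendant(memcg, root_memcg))
                        continue;

                if (test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
                        continue;

                if (cpu == curcpu)
                        drain_local_stock(&stock->work);    /* drains the local cache directly */
                else
                        schedule_work_on(cpu, &stock->work);    /* interrupts the remote cpu */
        }
}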
>
> I think so, please help me check:
>
> IIUC, we can approach this by dividing the problem in two working modes:
> 1 - Normal, meaning no drain_all_stock() running.
> 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> destroying, reconfiguring a memcg.
>
> For (1), we will have (ideally) only local cpu working on the percpu struct.
> This mode will not have any kind of contention, because each CPU will hold its
> own spinlock only.
>
> For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> local cpus having to wait for a lock to get their cache, on the patch proposal.
>
> Ok, given the above is correct:
>
> # Some arguments point that (1) becomes slower with this patch.
>
> This is partially true: while test 2.2 pointed that local cpu functions running
> time had become slower by a few cycles, test 2.4 points that the userspace
> perception of it was that the syscalls and pagefaulting actually became faster:
>
> During some debugging tests before getting the performance on test 2.4, I
> noticed that the 'syscall + write' test would call all those functions that
> became slower on test 2.2. Those functions were called multiple millions of
> times during a single test, and still the patched version performance test
> returned faster for test 2.4 than upstream version. Maybe the functions became
> slower, but overall the usage of them in the usual context became faster.
>
> Is not that a small improvement?
>
> # Regarding (2), I notice that we fear contention
>
> While this seems to be the harder part of the discussion, I think we have enough
> data to deal with it.
>
> In which case contention would be a big problem here?
> IIUC it would be when a lot of drain_all_stock() get running because the memory
> limit is getting near. I mean, having the user to create / modify a memcg
> multiple times a second for a while is not something that is expected, IMHO.

Considering that the use of spinlocks with remote draining is the more general solution,
what would be a test-case to demonstrate a contention problem?

> Now, if I assumed correctly and the case where contention could be a problem is
> on a memcg with high memory pressure, then we have the argument that Marcelo
> Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
> allocation brought better results than local_locks + schedule_work_on().
>
> I mean, while contention would cause the cpu to wait for a while before getting
> the lock for allocating a page from cache, something similar would happen with
> schedule_work_on(), which would force the current task to wait while the
> draining happens locally.
>
> What I am able to see is that, for each drain_all_stock(), for each cpu getting
> drained we have the option to (a) (sometimes) wait for a lock to be freed, or
> (b) wait for a whole context switch to happen.
> And IIUC, (b) is much slower than (a) on average, and this is what causes the
> improved performance seen in [P1].

Moreover, there is a delay for the remote CPU to execute a work item
(there could be high priority tasks, IRQ handling on the remote CPU,
which delays execution of the work item even further).

Also, the other point is that the spinlock already exists for
PREEMPT_RT (which means that any potential contention issue
with the spinlock today is limited to PREEMPT_RT users).

So it would be good to point out a specific problematic
testcase/scenario with using the spinlock in this particular case?

> (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for
> it to run on the remote CPU may not be that different, since the cacheline is
> already written to by the remote cpu on upstream)
>
> Also according to test 2.2, for the patched version, drain_local_stock() has
> gotten faster (much faster for 128 cpus), even though it does all the draining
> instead of just scheduling it on the other cpus.
> I mean, adding that to the brief nature of local cpu functions, we may not hit
> contention as much as expected.
>
> ##
>
> Sorry for the long text.
> I may be missing some point, please let me know if that's the case.
>
> Thanks a lot for reviewing!
> Leo
>
> [P1]: https://lkml.org/lkml/2022/6/13/2769
>
>


2023-02-01 04:37:10

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Tue, 2023-01-31 at 08:35 -0300, Marcelo Tosatti wrote:
> On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote:
> > On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> > > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > > > [...]
> > > > > > Remote draining reduces interruptions whether CPU
> > > > > > is marked as isolated or not:
> > > > > >
> > > > > > - Allows isolated CPUs to benefit from pcp caching.
> > > > > > - Removes the interruption to non-isolated CPUs. See for example
> > > > > >
> > > > > > https://lkml.org/lkml/2022/6/13/2769
> > > > >
> > > > > This is talking about page allocator per-cpu caches, right? In this patch
> > > > > we are talking about memcg pcp caches. Are you sure the same applies
> > > > > here?
> > > >
> > > > Both can stall the users of the drain operation.
> > >
> > > Yes. But it is important to consider who those users are. We are
> > > draining when
> > > - we are charging and the limit is hit so that memory reclaim
> > > has to be triggered.
> > > - hard, high limits are set and require memory reclaim.
> > > - force_empty - full memory reclaim for a memcg
> > > - memcg offlining - cgroup removal - quite a heavy operation as
> > > well.
> > > all those could be really costly kernel operations and they affect
> > > isolated cpu only if the same memcg is used by both isolated and non-isolated
> > > cpus. In other words those costly operations would have to be triggered
> > > from non-isolated cpus and those are to be expected to be stalled. It is
> > > the side effect of the local cpu draining that is scheduled that affects
> > > the isolated cpu as well.
> > >
> > > Is that more clear?
> >
> > I think so, please help me check:

Michal, Roman: Could you please review my argumentation below, so I can
understand what exactly is wrong ?

> >
> > IIUC, we can approach this by dividing the problem in two working modes:
> > 1 - Normal, meaning no drain_all_stock() running.
> > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> > destroying, reconfiguring a memcg.
> >
> > For (1), we will have (ideally) only local cpu working on the percpu struct.
> > This mode will not have any kind of contention, because each CPU will hold its
> > own spinlock only.
> >
> > For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> > of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> > local cpus having to wait for a lock to get their cache, on the patch proposal.
> >
> > Ok, given the above is correct:
> >
> > # Some arguments point that (1) becomes slower with this patch.
> >
> > This is partially true: while test 2.2 pointed that local cpu functions running
> > time had become slower by a few cycles, test 2.4 points that the userspace
> > perception of it was that the syscalls and pagefaulting actually became faster:
> >
> > During some debugging tests before getting the performance on test 2.4, I
> > noticed that the 'syscall + write' test would call all those functions that
> > became slower on test 2.2. Those functions were called multiple millions of
> > times during a single test, and still the patched version performance test
> > returned faster for test 2.4 than upstream version. Maybe the functions became
> > slower, but overall the usage of them in the usual context became faster.
> >
> > Is not that a small improvement?
> >
> > # Regarding (2), I notice that we fear contention
> >
> > While this seems to be the harder part of the discussion, I think we have enough
> > data to deal with it.
> >
> > In which case contention would be a big problem here? 
> > IIUC it would be when a lot of drain_all_stock() get running because the memory
> > limit is getting near. I mean, having the user to create / modify a memcg
> > multiple times a second for a while is not something that is expected, IMHO.
>
> Considering that the use of spinlocks with remote draining is the more general solution,
> what would be a test-case to demonstrate a contention problem?

IIUC we could try to reproduce a memory tight workload that keeps allocating /
freeing from different cpus (without hitting OOM).

Michal, Roman: Is that correct? You have any workload like that so we can test?

>
> > Now, if I assumed correctly and the case where contention could be a problem is
> > on a memcg with high memory pressure, then we have the argument that Marcelo
> > Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
> > allocation brought better results than local_locks + schedule_work_on().
> >
> > I mean, while contention would cause the cpu to wait for a while before getting
> > the lock for allocating a page from cache, something similar would happen with
> > schedule_work_on(), which would force the current task to wait while the
> > draining happens locally. 
> >
> > What I am able to see is that, for each drain_all_stock(), for each cpu getting
> > drained we have the option to (a) (sometimes) wait for a lock to be freed, or
> > (b) wait for a whole context switch to happen.
> > And IIUC, (b) is much slower than (a) on average, and this is what causes the
> > improved performance seen in [P1].
>
> Moreover, there is a delay for the remote CPU to execute a work item
> (there could be high priority tasks, IRQ handling on the remote CPU,
> which delays execution of the work item even further).
>
> Also, the other point is that the spinlock already exists for
> PREEMPT_RT (which means that any potential contention issue
> with the spinlock today is limited to PREEMPT_RT users).
>
> So it would be good to point out a specific problematic
> testcase/scenario with using the spinlock in this particular case?


Oh, that's a good point, but IIUC spin_lock() replaces local_lock() in memcg,
meaning it will always run on the same CPU, and there should only be
contention if the memcg local cpu functions get preempted by something that
calls a memcg local cpu function.

>
> > (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for
> > it to run on the remote CPU may not be that different, since the cacheline is
> > already written to by the remote cpu on upstream)
> >
> > Also according to test 2.2, for the patched version, drain_local_stock() has
> > gotten faster (much faster for 128 cpus), even though it does all the draining
> > instead of just scheduling it on the other cpus.
> > I mean, adding that to the brief nature of local cpu functions, we may not hit
> > contention as much as expected.
> >
> > ##
> >
> > Sorry for the long text.
> > I may be missing some point, please let me know if that's the case.
> >
> > Thanks a lot for reviewing!
> > Leo
> >
> > [P1]: https://lkml.org/lkml/2022/6/13/2769
> >
> >
>

Thanks!
Leo


2023-02-01 12:41:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
[...]
> So it would be good to point out a specific problematic
> testcase/scenario with using the spinlock in this particular case?

Please think about it some more. The sole purpose of the pcp charge
caching is to avoid atomics because the normal fast path (i.e. no limit
hit) is a page counter which is an atomic counter. If you go with a spin
lock for the pcp cache you are just losing some of the advantage of the
caching. Sure that would be a smaller atomics use than a deeper memcg
hierarchy but still.

Not to mention a potential contention which is hard to predict and it
will depend on the specific runtime very much. So not something that
would be easy to test for other than artificial testcases. Full cpu
pcp draining is not a very common operation and one could argue that
the problem will be limited but so far I haven't really heard any strong
arguments against the proposal to avoid scheduling the work on isolated
cpus which is a much simpler solution and achieves the same
effect.
--
Michal Hocko
SUSE Labs

2023-02-01 12:52:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Wed 01-02-23 01:36:07, Leonardo Brás wrote:
[...]
> Michal, Roman: Could you please review my argumentation below, so I can
> understand what exactly is wrong ?
>
> > >
> > > IIUC, we can approach this by dividing the problem in two working modes:
> > > 1 - Normal, meaning no drain_all_stock() running.
> > > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> > > destroying, reconfiguring a memcg.
> > >
> > > For (1), we will have (ideally) only local cpu working on the percpu struct.
> > > This mode will not have any kind of contention, because each CPU will hold its
> > > own spinlock only.

You are still hitting locked atomic operations which is not the case for
a simple pcp access. As already mentioned, the page counter (which would be
taken in case of no pcp cache) is also about atomics (albeit more of
them with a deeper cgroup hierarchy).

> > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> > > of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> > > local cpus having to wait for a lock to get their cache, on the patch proposal.

Yes.

> > > Ok, given the above is correct:
> > >
> > > # Some arguments point that (1) becomes slower with this patch.
> > >
> > > This is partially true: while test 2.2 pointed that local cpu functions running
> > > time had become slower by a few cycles, test 2.4 points that the userspace
> > > perception of it was that the syscalls and pagefaulting actually became faster:
> > >
> > > During some debugging tests before getting the performance on test 2.4, I
> > > noticed that the 'syscall + write' test would call all those functions that
> > > became slower on test 2.2. Those functions were called multiple millions of
> > > times during a single test, and still the patched version performance test
> > > returned faster for test 2.4 than upstream version. Maybe the functions became
> > > slower, but overall the usage of them in the usual context became faster.

It is hard to tell anything without proper analysis of those numbers.

> > > Is not that a small improvement?
> > >
> > > # Regarding (2), I notice that we fear contention
> > >
> > > While this seems to be the harder part of the discussion, I think we have enough
> > > data to deal with it.
> > >
> > > In which case contention would be a big problem here?
> > > IIUC it would be when a lot of drain_all_stock() get running because the memory
> > > limit is getting near. I mean, having the user to create / modify a memcg
> > > multiple times a second for a while is not something that is expected, IMHO.

We have already explained when that happens. You are right this is not
happening all that much in most workloads. But this is a user triggered
operation so you cannot really make many assumptions. It will really depend
on the workload.

> > Considering that the use of spinlocks with remote draining is the more general solution,
> > what would be a test-case to demonstrate a contention problem?
>
> IIUC we could try to reproduce a memory tight workload that keeps allocating /
> freeing from different cpus (without hitting OOM).
>
> Michal, Roman: Is that correct? You have any workload like that so we can test?

It would be easy to create an artificial workload that would hit this. All
you need is to trigger any of the code paths that have already been
mentioned elsewhere from the userspace.

Let me be clear here. Unless there are some real concerns about not
flushing remotely on isolated cpus then the spin lock approach is just
much less preferable. So I would much rather focus on the trivial patch
and investigate whether there are any new corner cases created by
that.
--
Michal Hocko
SUSE Labs

2023-02-01 18:31:54

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > Disclaimer:
> > > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > > > last minute.
> > > > > > > >
> > > > > > > >
> > > > > > > > 0 - Motivation:
> > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > descendant of a given root_memcg.
> > > >
> > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > > into why we have many of them and whether we really need them?
> > > >
> > > > It's either some user's actions (e.g. reducing memory.max), or some memcg
> > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > > scheduled without a good reason (assuming the cgroup contains multiple tasks running
> > > > on multiple cpus).
> > >
> > > I believe I've never got a specific answer to that. We
> > > have discussed that in the previous version submission
> > > ([email protected] and specifically
> > > [email protected]). Leonardo has mentioned a mix of RT and
> > > isolcpus. I was wondering about using memcgs in RT workloads because
> > > that just sounds weird but let's say this is the case indeed.
> >
> > This could be the case. You can consider an "edge device" where it is
> > necessary to run a RT workload. It might also be useful to run
> > non realtime applications on the same system.
> >
> > > Then an RT task or whatever task that is running on an isolated
> > > cpu can have pcp charges.
> >
> > Usually the RT task (or more specifically the realtime sensitive loop
> > of the application) runs entirely in userspace. But I suppose there
> > could be charges on application startup.
>
> What is the role of memcg then? If the memory limit is in place and the
> workload doesn't fit in then it will get reclaimed during start up and
> memory would need to be refaulted if not mlocked. If it is mlocked then
> the limit cannot be enforced and the start up would likely fail as a
> result of the memcg oom killer.
>
> [...]
> > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > > doing this without a really strong reason.
> > >
> > > Are you OK with a simple opt out on isolated CPUs? That would make
> > > charges slightly slower (atomic on the hierarchy counters vs. a single
> > > pcp adjustment) but it would guarantee that the isolated workload is
> > > predictable which is the primary objective AFAICS.
> >
> > This would make isolated CPUs "second class citizens": it would be nice
> > to be able to execute non realtime apps on isolated CPUs as well
> > (think of different periods of time during a day, one where
> > more realtime apps are required, another where less
> > realtime apps are required).
>
> An alternative requires to make the current implementation that is
> lockless to use locks and introduce potential lock contention. This
> could be harmful to regular workloads. Not using pcp caching would make
> the fast path use a few atomics rather than a local pcp update. That is not
> a terrible cost to pay for special cased workloads which use isolcpus.
> Really we are not talking about a massive cost to be paid. At least
> nobody has shown that in any numbers.

Can't agree more.
I also agree that the whole pcpu stock draining code can be enhanced,
but I believe we should go in a direction almost directly opposite
to what's being proposed here.

Can we please return to the original problem which the patchset aims to solve?
Is it the latency introduced by execution of draining works on isolated cpus?
Maybe schedule these work items with a delay and cancel them if the draining
occurred naturally during the delay?
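
(A hypothetical sketch of that idea; nothing like this exists in the posted
series. It assumes memcg_stock_pcp grows a delayed_work named 'dwork' and that
drain_all_stock() uses schedule_delayed_work_on() instead of schedule_work_on():)

static void drain_local_stock_delayed(struct work_struct *w)
{
        struct memcg_stock_pcp *stock =
                container_of(to_delayed_work(w), struct memcg_stock_pcp, dwork);

        /* Charges were consumed (or flushed) during the delay: nothing to do. */
        if (!READ_ONCE(stock->nr_pages)) {
                clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
                return;
        }

        drain_local_stock(&stock->dwork.work);
}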

Thanks!

2023-02-03 15:23:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I
have talked to Frederic off-list and he is working on an implementation.
I will be offline next whole week (will be back Feb 13th) so I am
sending this early but this patch cannot be merged without his one of
course.

I have tried to summarize the reasoning behind both approach should we
ever need to revisit this approach. For now I strongly believe a simpler
solution should be preferred.

Roman, I have added your ack as you indicated but if you disagree with
the reasoning please let me know.
---
From 6d99c4d7a238809ff2eb7c362b45d002ca212c70 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 3 Feb 2023 15:54:38 +0100
Subject: [PATCH] memcg: do not drain charge pcp caches on remote isolated cpus

Leonardo Bras has noticed that pcp charge cache draining might be
disruptive on workloads relying on 'isolated cpus', a feature commonly
used on workloads that are sensitive to interruption and context
switching such as vRAN and Industrial Control Systems.

There are essentially two ways how to approach the issue. We can either
allow the pcp cache to be drained on a different rather than a local cpu
or avoid remote flushing on isolated cpus.

The current pcp charge cache is really optimized for high performance
and it always relies on sticking with its cpu. That means it only requires
local_lock (preempt_disable on !RT) and draining is handed over to pcp
WQ to drain locally again.

The former solution (remote draining) would require adding additional
locking to prevent local charges from racing with the draining. This
adds an atomic operation to otherwise simple arithmetic fast path in the
try_charge path. Another concern is that the remote draining can cause a
lock contention for the isolated workloads and therefore interfere with
it indirectly via user space interfaces.

Another option is to avoid draining scheduling on isolated cpus
altogether. That means that those remote cpus would keep their charges
even after drain_all_stock returns. This is certainly not optimal either
but it shouldn't really cause any major problems. In the worst case
(many isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH,
i.e. 64 pages) the memory consumption of a memcg would be artificially
higher than can be immediately used from other cpus.

Theoretically a memcg OOM killer could be triggered prematurely.
Currently it is not really clear whether this is a practical problem
though. Tight memcg limit would be really counter productive to cpu
isolated workloads pretty much by definition because any memory
reclaimed induced by memcg limit could break user space timing
expectations as those usually expect execution in the userspace most of
the time.

Also charges could be left behind on memcg removal. Any future charge on
those isolated cpus will drain that pcp cache so this won't be a
permanent leak.

Considering cons and pros of both approaches this patch is implementing
the second option and simply does not schedule remote draining if the
target cpu is isolated. This solution is much simpler. It doesn't
add any new locking and it is more predictable from the user space
POV. Should the pre-mature memcg OOM become a real life problem, we can
revisit this decision.

Reported-by: Leonardo Bras <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..55e440e54504 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-			else
+			else if (!cpu_is_isolated(cpu))
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
--
2.30.2
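
(For completeness, a plausible shape for the missing helper, guessed from the
existing housekeeping API; Frederic's actual implementation may well differ:)

static inline bool cpu_is_isolated(int cpu)
{
        /*
         * Treat a cpu as isolated if it is excluded from the scheduler
         * domains (isolcpus=domain) or from the nohz_full housekeeping set.
         */
        return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
               !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
}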

--
Michal Hocko
SUSE Labs

2023-02-03 19:25:54

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri, Feb 03, 2023 at 04:21:09PM +0100, Michal Hocko wrote:
> OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I
> have talked to Frederic off-list and he is working on an implementation.
> I will be offline next whole week (will be back Feb 13th) so I am
> sending this early but this patch cannot be merged without his one of
> course.
>
> I have tried to summarize the reasoning behind both approach should we
> ever need to revisit this approach. For now I strongly believe a simpler
> solution should be preferred.
>
> Roman, I have added your ack as you indicated but if you disagree with
> the reasoning please let me know.

Great, thank you for preparing it up! Really nice summary.
My ack definitely applies.

If you want, feel free to add a
"Suggested-by: Roman Gushchin <[email protected]>"
tag to blame me later :)

Thank you!

2023-02-04 04:56:14

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

Michal, Roman,

I understand you have far more experience in this codebase than myself, so
please help me understand what am I missing in my argument for the spinlock
approach. 

I honestly want to improve, and your help is really appreciated.

On Wed, 2023-02-01 at 13:41 +0100, Michal Hocko wrote:
> On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
> [...]
> > So it would be good to point out a specific problematic
> > testcase/scenario with using the spinlock in this particular case?
>
> Please think about it some more. The sole purpose of the pcp charge
> caching is to avoid atomics because the normal fast path (i.e. no limit
> hit) is a page counter which is an atomic counter. If you go with a spin
> lock for the pcp cache you are just losing some of the advantage of the
> caching. Sure that would be a smaller atomics use than a deeper memcg
> hierarchy but still.

I could read the code that calls consume_stock(), and noticed that you are
probably referencing the atomic operations on memcg->memory->usage (and others)
used in page_counter_try_charge(). It is a single atomic variable per memcg that
is potentially accessed by every cpu. Is that correct?

I understand the cost of an atomic operation such as this is high because of the
inherent high cost of bouncing the variable's cacheline between those cpus.

The difference between the 'usage' atomic variable and the spinlock we are proposing
is the scope of the variable: the spinlock is defined in a per-cpu structure, and
most of the accesses will come from the local cpu.

According to Intel® 64 and IA-32 Architectures Software Developer’s Manual, at
Vol. 3A page 9-5:

###
9.1.4 Effects of a LOCK Operation on Internal Processor Caches
[...]
For the P6 and more recent processor families, if the area of memory being
locked during a LOCK operation is cached in the processor that is performing the
LOCK operation as write-back memory and is completely contained in a cache line,
the processor may not assert the LOCK# signal on the bus. Instead, it will
modify the memory location internally and allow it’s cache coherency mechanism
to ensure that the operation is carried out atomically. This operation is called
“cache locking.” The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from simultaneously
modifying data in that area.
###

So locking on a percpu spinlock which is cached in the current cpu (which happens most
of the time) is as cheap as modifying the local cacheline. Since the cachelines are
already modified in the local cpu functions, the write to memory is batched.

For reference, this is the pahole output for memcg_stock_pcp after my patch. The
spinlock fits in the same cacheline as the changed variables.

struct memcg_stock_pcp {
        spinlock_t                 stock_lock;              /*     0     4 */
        unsigned int               nr_pages;                /*     4     4 */
        struct mem_cgroup *        cached;                  /*     8     8 */
        struct obj_cgroup *        cached_objcg;            /*    16     8 */
        struct pglist_data *       cached_pgdat;            /*    24     8 */
        unsigned int               nr_bytes;                /*    32     4 */
        int                        nr_slab_reclaimable_b;   /*    36     4 */
        int                        nr_slab_unreclaimable_b; /*    40     4 */

        /* size: 48, cachelines: 1, members: 8 */
        /* padding: 4 */
        /* last cacheline: 48 bytes */
};

The only accesses that will come from a remote cpu, and thus cause the cacheline
to bounce and the lock to be more expensive, are the ones from
drain_all_stock(). OTOH, on upstream, those accesses already modify the remote
cacheline with an atomic variable (memcg_stock_pcp.flags), so we are only
dealing with potential contention costs.
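
To make that concrete, the proposed hot path would be roughly the following.
This is only a sketch of the idea, not the posted patches verbatim; the point is
that the only change for the local cpu is taking a spinlock whose cacheline it
normally already owns:

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
        struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
        unsigned long flags;
        bool ret = false;

        if (nr_pages > MEMCG_CHARGE_BATCH)
                return ret;

        /*
         * A migration between this_cpu_ptr() and the lock would be tolerable
         * here, since any stock may now be accessed under its stock_lock.
         */
        spin_lock_irqsave(&stock->stock_lock, flags);
        if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
                stock->nr_pages -= nr_pages;
                ret = true;
        }
        spin_unlock_irqrestore(&stock->stock_lock, flags);

        return ret;
}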

>
> Not to mention a potential contention which is hard to predict and it
> will depend on the specific runtime very much. So not something that
> would be easy to test for other than artificial testcases. Full cpu
> pcp draining is not a very common operation and one could argue that
> the problem will be limited 
>

Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
contention, for a "sometimes we hit spinlock contention".

For the spinlock proposal, on the local cpu side, the *worst case* contention
is:
1 - wait the spin_unlock() for a complete <percpu cache drain process>,
2 - wait a cache hit for local per-cpu cacheline 

What is current implemented (schedule_work_on() approach), for the local
cpu side there is *always* this contention:
1 - wait for a context switch,
2 - wait a cache hit from it's local per-cpu cacheline,
3 - wait a complete <percpu cache drain process>, 
4 - then for a new context switch to the current thread.

So moving from schedule_work_on() to spinlocks will save 2 context switches per
cpu every time drain_all_stock() is called.

On the remote cpu side, my tests point that doing the remote draining is faster
than scheduling a local draining, so it's also a gain.

Also, IIUC the possible contention in the spinlock approach happens only on
page-faulting and syscalls, versus the schedule_work_on() approach that can
interrupt user workload at any time. 

In fact, not interrupting the user workload in isolated cpus is just a bonus of
using spinlocks.

> but so far I haven't really heard any strong
> arguments against the proposal to avoid scheduling the work on isolated
> cpus which is a much simpler solution and achieves the same
> effect.

I understand the 'not scheduling work on isolated cpus' appeal: it's a much
simpler change, and will only affect cases with isolated cpus, so it is safer
than changing the whole locking structure. 

I am not against it. I already have a patch implementing it for testing, and I
could gladly send it if you see fit.

But if nothing else, it introduces another specific case, and now it has to
deal differently with the local cpu, remote cpus, and isolated cpus.

With spinlocks, there is a much simpler solution (rough sketch below):
- All cpus are dealt with the same way, which is faster in the local cpu (as 
in upstream implementation).
- We don't have to avoid draining on isolated cpus.
- We get rid of the "work_struct", and scheduling costs
- We get rid of "flag", as we don't need to take care of multi-scheduling local 
draining.
- When drain_all_stock() returns, all stock have already been drained, so 
retrying consume_stock() may have immediate results on pre-OOM cases.
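
The remote side would then be roughly the following (again just a sketch of the
idea, not the posted patches verbatim, with RCU and objcg details left out):

static void drain_all_stock(struct mem_cgroup *root_memcg)
{
        int cpu;

        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                unsigned long flags;

                spin_lock_irqsave(&stock->stock_lock, flags);
                if (stock->cached &&
                    mem_cgroup_is_descendant(stock->cached, root_memcg))
                        drain_stock(stock);     /* existing helper that returns the cached charges */
                spin_unlock_irqrestore(&stock->stock_lock, flags);
        }
        /* No work_struct, no flags, and every stock is empty on return. */
}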

On Wed, 2023-02-01 at 13:52 +0100, Michal Hocko wrote:
[...]
> Let me be clear here. Unless there are some real concerns about not
> flushing remotely on isolated cpus then the spin lock approach is just
> much less preferable. So I would much rather focus on the trivial patch
> and investigates whether there are no new corner cases to be created by
> that.

I understand 'not scheduling work on isolated cpus' approach should work with
low impact on current behavior, and I also understand that as Maintainers you
have to consider many more factors than a regular developer like me, and also
that the spinlock approach is a big change on how pcp memcg caches work.

On that topic, if there is any comparison test you find important running, or
any topic you think could be better discussed, please let me know.

Thank you for reviewing,
Leo



2023-02-05 19:49:31

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

Hi Leonardo!

> Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> contention, for a "sometimes we hit spinlock contention".
>
> For the spinlock proposal, on the local cpu side, the *worst case* contention
> is:
> 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> 2 - wait a cache hit for local per-cpu cacheline
>
> What is current implemented (schedule_work_on() approach), for the local
> cpu side there is *always* this contention:
> 1 - wait for a context switch,
> 2 - wait a cache hit from it's local per-cpu cacheline,
> 3 - wait a complete <percpu cache drain process>,
> 4 - then for a new context switch to the current thread.

I think both Michal and me are thinking of a more generic case in which the cpu
is not exclusively consumed by 1 special process, so that the draining work can
be executed during an idle time. In this case the work is basically free.

And the introduction of a spin_lock() on the hot path is what we are concerned
about. I agree, that on some hardware platforms it won't be that expensive, but
in general not having any spinlocks is so much better.

>
> So moving from schedule_work_on() to spinlocks will save 2 context switches per
> cpu every time drain_all_stock() is called.
>
> On the remote cpu side, my tests point that doing the remote draining is faster
> than scheduling a local draining, so it's also a gain.
>
> Also, IIUC the possible contention in the spinlock approach happens only on
> page-faulting and syscalls, versus the schedule_work_on() approach that can
> interrupt user workload at any time.
>
> In fact, not interrupting the user workload in isolated cpus is just a bonus of
> using spinlocks.

I believe it significantly depends on the preemption model: you're right regarding
fully preemptive kernels, but with voluntary/none preemption it's exactly the opposite:
the draining work will be executed at some point later (probably with 0 cost),
while the remote access from another cpu will potentially cause delays on the
spin lock as well as a need to refill the stock.

Overall I'd expect a noticeable performance regression from an introduction of
spin locks and remote draining. Maybe not on all platforms, but at least on some.
That's my main concern. And I don't think the problem we're aiming to solve here
justifies this potential regression.

Thanks!

2023-02-07 03:19:27

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> Hi Leonardo!

Hello Roman,
Thanks a lot for replying!

>
> > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> > contention, for a "sometimes we hit spinlock contention".
> >
> > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > is:
> > 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> > 2 - wait a cache hit for local per-cpu cacheline 
> >
> > What is current implemented (schedule_work_on() approach), for the local
> > cpu side there is *always* this contention:
> > 1 - wait for a context switch,
> > 2 - wait a cache hit from it's local per-cpu cacheline,
> > 3 - wait a complete <percpu cache drain process>, 
> > 4 - then for a new context switch to the current thread.
>
> I think both Michal and me are thinking of a more generic case in which the cpu
> is not exclusively consumed by 1 special process, so that the draining work can
> be executed during an idle time. In this case the work is basically free.

Oh, it makes sense.
But in such scenarios, wouldn't the same happen to spinlocks?

I mean, most of the contention with spinlocks only happens if the remote cpu is
trying to drain the cache while the local cpu happens to be draining/charging,
which is quite rare due to how fast the local cpu operations are.

Also, if the cpu has some idle time, using a little more on a possible spinlock
contention should not be a problem. Right?

>
> > And the introduction of a spin_lock() on the hot path is what we are concerned
> about. I agree, that on some hardware platforms it won't be that expensive, 
>

IIRC most hardware platforms with multicore supported by the kernel should have
the same behavior, since it's better to rely on cache coherence than locking the
memory bus.

For instance, the other popular architectures supported by Linux use the LR/SC
strategy for atomic operations (tested on ARM, POWER, RISCV) and IIRC the
LoadReserve slow part waits for the cacheline exclusivity, which is
already exclusive in this perCPU structure.


> but in general not having any spinlocks is so much better.

I agree that spinlocks may bring contention, which is not ideal in many cases.
In this case, though, it may not be a big issue, due to very rare remote access
in the structure, for the usual case (non-pre-OOMCG)

>
> >
> > So moving from schedule_work_on() to spinlocks will save 2 context switches per
> > cpu every time drain_all_stock() is called.
> >
> > On the remote cpu side, my tests point that doing the remote draining is faster
> > than scheduling a local draining, so it's also a gain.
> >
> > Also, IIUC the possible contention in the spinlock approach happens only on
> > page-faulting and syscalls, versus the schedule_work_on() approach that can
> > interrupt user workload at any time. 
> >
> > In fact, not interrupting the user workload in isolated cpus is just a bonus of
> > using spinlocks.
>
> I believe it significantly depends on the preemption model: you're right regarding
> > fully preemptive kernels, but with voluntary/none preemption it's exactly the opposite:
> the draining work will be executed at some point later (probably with 0 cost),

So, in case of voluntary/none preemption with some free cpu time.

> while the remote access from another cpu will potentially cause delays on the
> spin lock as well as a need to refill the stock.

But if there is some free CPU time, what is the issue of some (potential) delays
due to spinlock contention?

I am probably missing the whole picture, but when I think of performance
improvement, I think on doing more with the same cputime. If we can use free
cputime to do stuff later, it's only fair to also use it in case of contention,
right?

I know there are some cases that may need to be more predictable (mostly RT), but
when I think of memory allocation, I don't expect it to always take the same
time (as there are caches, pre-OOM, and so on)

Also, as previously discussed, in case of a busy cpu, the spinlock approach will
probably allow more work to be done.

>
> Overall I'd expect a noticeable performance regression from an introduction of
> spin locks and remote draining. Maybe not on all platforms, but at least on some.
> That's my main concern.
>

I see.
For the platform I have tested (x86) I noticed better overall performance on
spinlocks than upstream solution. For other popular platforms, I have briefly
read some documentation on locking/atomicity and I think we may keep the
performance gains.

But to be sure, I could retake the tests on other platforms, such as ARM, POWER,
RISCV, and so. Or even perform extra suggested tests.

With that info, would you feel less concerned about a possible change in memcg
pcp cache locking scheme?


> And I don't think the problem we're aiming to solve here
> justifies this potential regression.
>

Strictly speaking, the isolated cpu scheduling problem is already fixed by the
housekeeping patch (with some limitations).

At this point, I am trying to bring focus to a (possible) performance
improvement on the memcg pcp cache locking system.


> Thanks!
>

Thank you for helping me better understand your arguments and concerns.
I really appreciate it!

Best regards,
Leo


2023-02-08 19:24:25

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Tue, Feb 07, 2023 at 12:18:01AM -0300, Leonardo Brás wrote:
> On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> > Hi Leonardo!
>
> Hello Roman,
> Thanks a lot for replying!
>
> >
> > > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> > > contention, for a "sometimes we hit spinlock contention".
> > >
> > > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > > is:
> > > 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> > > 2 - wait a cache hit for local per-cpu cacheline
> > >
> > > What is current implemented (schedule_work_on() approach), for the local
> > > cpu side there is *always* this contention:
> > > 1 - wait for a context switch,
> > > 2 - wait a cache hit from it's local per-cpu cacheline,
> > > 3 - wait a complete <percpu cache drain process>,
> > > 4 - then for a new context switch to the current thread.
> >
> > I think both Michal and me are thinking of a more generic case in which the cpu
> > is not exclusively consumed by 1 special process, so that the draining work can
> > be executed during an idle time. In this case the work is basically free.
>
> Oh, it makes sense.
> But in such scenarios, wouldn't the same happen to spinlocks?
>
> I mean, most of the contention with spinlocks only happens if the remote cpu is
> trying to drain the cache while the local cpu happens to be draining/charging,
> which is quite rare due to how fast the local cpu operations are.
>
> Also, if the cpu has some idle time, using a little more on a possible spinlock
> contention should not be a problem. Right?
>
> >
> > > And the introduction of a spin_lock() on the hot path is what we are concerned
> > > about. I agree, that on some hardware platforms it won't be that expensive,
> >
>
> IIRC most hardware platforms with multicore supported by the kernel should have
> the same behavior, since it's better to rely on cache coherence than locking the
> memory bus.
>
> For instance, the other popular architectures supported by Linux use the LR/SC
> strategy for atomic operations (tested on ARM, POWER, RISCV) and IIRC the
> > LoadReserve slow part waits for the cacheline exclusivity, which is
> > already exclusive in this perCPU structure.
>
>
> > but in general not having any spinlocks is so much better.
>
> I agree that spinlocks may bring contention, which is not ideal in many cases.
> In this case, though, it may not be a big issue, due to very rare remote access
> in the structure, for the usual case (non-pre-OOMCG)

Hi Leonardo!

I made a very simple test: replaced pcp local_lock with a spinlock and ran
a very simple kernel memory accounting benchmark (attached below) on
my desktop pc (Intel Core i9-7940X).

Original (3 attempts):
81341 us
80973 us
81258 us

Patched (3 attempts):
99918 us
100220 us
100291 us

This is without any contention and without CONFIG_LOCKDEP.

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49f67176a1a2..bafd3cde4507 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6563,6 +6563,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_alloc_test(struct seq_file *m, void *v)
+{
+	unsigned long i, j;
+	void **ptrs;
+	ktime_t start, end;
+	s64 delta, min_delta = LLONG_MAX;
+
+	ptrs = kvmalloc(sizeof(void *) * 1024 * 1024, GFP_KERNEL);
+	if (!ptrs)
+		return -ENOMEM;
+
+	for (j = 0; j < 100; j++) {
+		start = ktime_get();
+		for (i = 0; i < 1024 * 1024; i++)
+			ptrs[i] = kmalloc(8, GFP_KERNEL_ACCOUNT);
+		end = ktime_get();
+
+		delta = ktime_us_delta(end, start);
+		if (delta < min_delta)
+			min_delta = delta;
+
+		for (i = 0; i < 1024 * 1024; i++)
+			kfree(ptrs[i]);
+	}
+
+	kvfree(ptrs);
+	seq_printf(m, "%lld us\n", min_delta);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
						     int item)
@@ -6758,6 +6789,10 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "test",
+		.seq_show = memory_alloc_test,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",

2023-02-13 13:36:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

On Fri 03-02-23 11:25:16, Roman Gushchin wrote:
> On Fri, Feb 03, 2023 at 04:21:09PM +0100, Michal Hocko wrote:
> > OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I
> > have talked to Frederic off-list and he is working on an implementation.
> > I will be offline next whole week (will be back Feb 13th) so I am
> > sending this early but this patch cannot be merged without his one of
> > course.
> >
> > I have tried to summarize the reasoning behind both approach should we
> > ever need to revisit this approach. For now I strongly believe a simpler
> > solution should be preferred.
> >
> > Roman, I have added your ack as you indicated but if you disagree with
> > the reasoning please let me know.
>
> Great, thank you for preparing it up! Really nice summary.
> My ack definitely applies.
>
> If you want, feel free to add a
> "Suggested-by: Roman Gushchin <[email protected]>"
> tag to blame me later :)

Sure, I have updated the changelog. I will repost after the patchset by
Frederic[1] is merged.

[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs