2024-06-04 07:20:55

by Max Kellermann

Subject: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

Hi kernel people,
I have a problem that I have been trying to debug for a few days, but
I got lost in the depths of the scheduler code; I'm stuck and I need
your help.
We have several servers which show a constant memory.pressure value of
30 to 100 (or more), even when the server is idle. I tracked this down
to psi_group_cpu.tasks[NR_MEMSTALL]==1 even though no such process
exists, but I can't figure out why the kernel thinks there is still
one task stuck in memstall. I tried to track down all the code paths
that lead to psi_group_change(), but found nothing conclusive, and
failed to reproduce it on a test machine with kernel patches injecting
delays (trying to cause data race bugs that may have caused this
problem).
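
For readers who don't have the kernel sources at hand, these are the structures I am talking about, abridged from include/linux/psi_types.h (6.8); tasks[NR_MEMSTALL] is the per-CPU counter that stays stuck at 1:

/* include/linux/psi_types.h (6.8), abridged */
enum psi_task_count {
	NR_IOWAIT,
	NR_MEMSTALL,
	NR_RUNNING,
	NR_MEMSTALL_RUNNING,
	NR_PSI_TASK_COUNTS = 4,
};

struct psi_group_cpu {
	/* Aggregator needs to know of concurrent changes */
	seqcount_t seq;

	/* States of the tasks belonging to this group */
	unsigned int tasks[NR_PSI_TASK_COUNTS];

	/* Aggregate pressure state derived from the tasks */
	u32 state_mask;

	/* Period time sampling buckets for each state of interest (ns) */
	u32 times[NR_PSI_STATES];

	/* Time of last task change in this group (rq_clock) */
	u64 state_start;

	/* ... */
};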

This happened on servers that were very busy and indeed were in
memstall often due to going over memory.high frequently. We have one
"main" cgroup with memory.high configured, and all the workload
processes live in sub-cgroups, of which we always have a few thousand.
When memory.events gets triggered, our process manager stops a bunch
of idle processes to free up memory, which then deletes the sub-cgroup
they belong to. In other words: sub-cgroups get created and deleted
very often, and they get deleted when there is indeed memory stall
happening. My theory was that there could be a data race bug that
forgets to decrement tasks[NR_MEMSTALL], maybe when a stalled child
cgroup gets deleted.
In our Grafana dashboards, I can trace the beginning of this bug back to
a point two weeks ago; in the system log, I can see that hundreds of
processes had to be terminated due to memory pressure at that time.

The affected servers run kernel 6.8.7 with a few custom patches, but
none of these patches affect the scheduler or cgroups; they're about
unrelated things like denying access to Ceph snapshots and adjusting
debugfs permissions. (I submitted most of those patches to LKML long
ago but nobody cared.)
Newer kernels don't seem to have fixes for my problem; the relevant
parts of the code are unchanged.

One of the servers is still running with this problem, and I can
access it with gdb on /proc/kcore. I'll keep it that way for some more
time, so if you have any idea what to look for, let me know.

This is the psi_group of the "main" cgroup:

$1 = {parent = 0xffff9de707287800, enabled = true, avgs_lock = {owner
= {counter = 0}, wait_lock = {raw_lock = {{val = {counter = 0},
{locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail =
0}}}}, osq = {tail = {
counter = 0}}, wait_list = {next = 0xffff9de70f772820, prev =
0xffff9de70f772820}}, pcpu = 0x3fb640033900, avg_total =
{6133960836647, 5923217690044, 615319825665255, 595479374843164,
19259777147170, 12847590051880},
avg_last_update = 1606208471280060, avg_next_update =
1606210394507082, avgs_work = {work = {data = {counter = 321}, entry =
{next = 0xffff9de70f772880, prev = 0xffff9de70f772880}, func =
0xffffffff880dcc00}, timer = {entry = {
next = 0x0 <fixed_percpu_data>, pprev = 0xffff9e05bef5bc48},
expires = 4455558105, function = 0xffffffff880a1ca0, flags =
522190853}, wq = 0xffff9de700051400, cpu = 64}, avg_triggers = {next =
0xffff9de70f7728d0,
prev = 0xffff9de70f7728d0}, avg_nr_triggers = {0, 0, 0, 0, 0, 0},
total = {{6133960836647, 5923217690044, 615328415599847,
595487964777756, 19281251983650, 12869064888360}, {6092994926,
5559819737, 105947464151, 100672353730,
8196529519, 7678536634}}, avg = {{0, 0, 0}, {0, 0, 0}, {203596,
203716, 198499}, {203596, 203716, 198288}, {0, 0, 60}, {0, 0, 0}},
rtpoll_task = 0x0 <fixed_percpu_data>, rtpoll_timer = {entry = {next =
0xdead000000000122,
pprev = 0x0 <fixed_percpu_data>}, expires = 4405010639, function
= 0xffffffff880dac50, flags = 67108868}, rtpoll_wait = {lock = {{rlock
= {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0
'\000'}, {
locked_pending = 0, tail = 0}}}}}}, head = {next =
0xffff9de70f772a20, prev = 0xffff9de70f772a20}}, rtpoll_wakeup =
{counter = 0}, rtpoll_scheduled = {counter = 0}, rtpoll_trigger_lock =
{owner = {counter = 0},
wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000',
pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}}, osq = {tail =
{counter = 0}}, wait_list = {next = 0xffff9de70f772a48, prev =
0xffff9de70f772a48}},
rtpoll_triggers = {next = 0xffff9de70f772a58, prev =
0xffff9de70f772a58}, rtpoll_nr_triggers = {0, 0, 0, 0, 0, 0},
rtpoll_states = 0, rtpoll_min_period = 18446744073709551615,
rtpoll_total = {6092994926, 5559819737, 105947464151,
100672353730, 8196529519, 7678536634}, rtpoll_next_update =
1100738436720135, rtpoll_until = 0}

This is a summary of all psi_group_pcpu for the 32 CPU cores (on the
way, I wrote a small gdb script to dump interesting details like these
but that went nowhere):

state_mask 0 = 0x0 tasks {0, 0, 0, 0}
state_mask 1 = 0x0 tasks {0, 0, 0, 0}
state_mask 2 = 0x0 tasks {0, 0, 0, 0}
state_mask 3 = 0x0 tasks {0, 0, 0, 0}
state_mask 4 = 0x0 tasks {0, 0, 0, 0}
state_mask 5 = 0x0 tasks {0, 0, 0, 0}
state_mask 6 = 0x0 tasks {0, 0, 0, 0}
state_mask 7 = 0x0 tasks {0, 0, 0, 0}
state_mask 8 = 0x0 tasks {0, 0, 0, 0}
state_mask 9 = 0x0 tasks {0, 0, 0, 0}
state_mask 10 = 0x0 tasks {0, 0, 0, 0}
state_mask 11 = 0x0 tasks {0, 0, 0, 0}
state_mask 12 = 0x0 tasks {0, 0, 0, 0}
state_mask 13 = 0x0 tasks {0, 0, 0, 0}
state_mask 14 = 0x0 tasks {0, 0, 0, 0}
state_mask 15 = 0x0 tasks {0, 0, 0, 0}
state_mask 16 = 0x0 tasks {0, 0, 0, 0}
state_mask 17 = 0x0 tasks {0, 0, 0, 0}
state_mask 18 = 0x0 tasks {0, 0, 0, 0}
state_mask 19 = 0x0 tasks {0, 0, 0, 0}
state_mask 20 = 0x0 tasks {0, 0, 0, 0}
state_mask 21 = 0x0 tasks {0, 0, 0, 0}
state_mask 22 = 0x0 tasks {0, 0, 0, 0}
state_mask 23 = 0x0 tasks {0, 0, 0, 0}
state_mask 24 = 0x0 tasks {0, 0, 0, 0}
state_mask 25 = 0x0 tasks {0, 0, 0, 0}
state_mask 26 = 0x0 tasks {0, 0, 0, 0}
state_mask 27 = 0x0 tasks {0, 0, 0, 0}
state_mask 28 = 0x0 tasks {0, 0, 0, 0}
state_mask 29 = 0x0 tasks {0, 0, 0, 0}
state_mask 30 = 0x4c tasks {0, 1, 0, 0}
state_mask 31 = 0x0 tasks {0, 0, 0, 0}

CPU core 30 is stuck with this bogus value. state_mask 0x4c =
PSI_MEM_SOME|PSI_MEM_FULL|PSI_NONIDLE.
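
For reference, the state_mask bits are (1 << psi_state), with enum
psi_states abridged from include/linux/psi_types.h (6.8):

enum psi_states {
	PSI_IO_SOME,	/* bit 0, 0x01 */
	PSI_IO_FULL,	/* bit 1, 0x02 */
	PSI_MEM_SOME,	/* bit 2, 0x04 */
	PSI_MEM_FULL,	/* bit 3, 0x08 */
	PSI_CPU_SOME,	/* bit 4, 0x10 */
	PSI_CPU_FULL,	/* bit 5, 0x20 */
	PSI_NONIDLE,	/* bit 6, 0x40 */
	NR_PSI_STATES,
};
/* 0x04 | 0x08 | 0x40 == 0x4c, and tasks[NR_MEMSTALL] == tasks[1] == 1 */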

The memory pressure at the time of this writing:

# cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/memory.pressure
some avg10=99.22 avg60=99.39 avg300=97.62 total=615423620626
full avg10=99.22 avg60=99.39 avg300=97.54 total=595583169804
# cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/_/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/cgroup.stat
nr_descendants 1
nr_dying_descendants 2224

There is currently no worker process; there is only one idle dummy
process in a single sub-cgroup called "_", which exists only to keep
the systemd scope populated. It should therefore be impossible for the
parent's memory.pressure to be nonzero when the only leaf cgroup
reports pressure=0.

(nr_dying_descendants is decremented extremely slowly; I deactivated
the server shortly before collecting these numbers, to make sure it's
really idle and there are really no processes left to cause this
pressure. I don't think nr_dying_descendants is relevant for this
problem; even after two days of full idle, the counter and the
pressure didn't go back to zero.)

Please help :-)

Max


2024-06-10 08:38:38

by Max Kellermann

Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Tue, Jun 4, 2024 at 9:16 AM Max Kellermann <[email protected]> wrote:
> One of the servers is still running with this problem, and I can
> access it with gdb on /proc/kcore. I'll keep it that way for some more
> time, so if you have any idea what to look for, let me know.

It's been a week since I reported this kernel bug, but nobody replied.
I will have to reboot this server soon, so this is the last chance to
extract information about the bug.

Max

2024-06-12 05:01:51

by Suren Baghdasaryan

Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Tue, Jun 4, 2024 at 12:16 AM Max Kellermann <[email protected]> wrote:
>
> Hi kernel people,
> I have a problem that I have been trying to debug for a few days, but
> I got lost in the depths of the scheduler code; I'm stuck and I need
> your help.
> We have several servers which show a constant memory.pressure value of
> 30 to 100 (or more), even when the server is idle. I tracked this down
> to psi_group_cpu.tasks[NR_MEMSTALL]==1 even though no such process
> exists, but I can't figure out why the kernel thinks there is still
> one task stuck in memstall. I tried to track down all the code paths
> that lead to psi_group_change(), but found nothing conclusive, and
> failed to reproduce it on a test machine with kernel patches injecting
> delays (trying to cause data race bugs that may have caused this
> problem).
>
> This happened on servers that were very busy and indeed were in
> memstall often due to going over memory.high frequently. We have one
> "main" cgroup with memory.high configured, and all the workload
> processes live in sub-cgroups, of which we always have a few thousand.
> When memory.events gets triggered, our process manager stops a bunch
> of idle processes to free up memory, which then deletes the sub-cgroup
> they belong to. In other words: sub-cgroups get created and deleted
> very often, and they get deleted when there is indeed memory stall
> happening. My theory was that there could be a data race bug that
> forgets to decrement tasks[NR_MEMSTALL], maybe when a stalled child
> cgroup gets deleted.

Hi Max,
I'm not an expert in the scheduler (I maintain mostly PSI triggers),
so my feedback might be utterly wrong.
I looked a bit into the relevant code and I think if your theory was
correct and psi_task_change() was called while task's cgroup is
destroyed then task_psi_group() would have returned an invalid pointer
and we would crash once that value is dereferenced.
Instead I think what might be happening is that the task is terminated
while it's in memstall. do_exit() calls do_task_dead() at the very end,
which sets current->__state to TASK_DEAD and calls the last __schedule()
for this task. __schedule() will call deactivate_task(rq, prev,
DEQUEUE_SLEEP), which sets prev->on_rq = 0 and calls
dequeue_task(..., DEQUEUE_SLEEP), leading to psi_dequeue(..., true).
Note that because that last parameter of psi_dequeue() is "true",
psi_task_change() will not be called at this time. Later on, __schedule()
calls psi_sched_switch(), which leads to psi_task_switch(), and note
that its last parameter will be "true" because prev->on_rq == 0. So we
end up calling psi_task_switch(..., true). Now, note this line:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L955.
It will clear TSK_MEMSTALL_RUNNING but not TSK_MEMSTALL. So, if the task
was in the TSK_MEMSTALL state, then that state won't be cleared, which
might be the problem you are facing.
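
For context, the surrounding code looks roughly like this (paraphrased
and abridged from kernel/sched/psi.c of that version range; note that
TSK_MEMSTALL is never added to "clear" on this path):

	if (prev->pid) {
		int clear = TSK_ONCPU, set = 0;
		bool wake_clock = true;

		if (sleep) {
			clear |= TSK_RUNNING;
			if (prev->in_memstall)
				clear |= TSK_MEMSTALL_RUNNING;
			if (prev->in_iowait)
				set |= TSK_IOWAIT;
			...
		}

		psi_flags_change(prev, clear, set);
		...
	}
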
I think you can check if this theory pans out by adding a WARN_ON() at
the end of psi_task_switch():

void psi_task_switch(struct task_struct *prev, struct task_struct *next,
		     bool sleep)
{
	...
	if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
		clear &= ~TSK_ONCPU;
		for (; group; group = group->parent)
			psi_group_change(group, cpu, clear, set, now,
					 wake_clock);
	}
+	WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
}

Again, I am by no means an expert in this area. Johannes or Peter
would be much better people to consult with.
Thanks,
Suren.


2024-06-12 06:49:30

by Max Kellermann

Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <[email protected]> wrote:
> Instead I think what might be happening is that the task is terminated
> while it's in memstall.

How is it possible to terminate a task that's in memstall?
This must be between psi_memstall_enter() and psi_memstall_leave(),
but I had already checked all the callers and found nothing
suspicious; no obvious way to escape the section without
psi_memstall_leave(). In my understanding, it's impossible to
terminate a task that's currently stuck in the kernel. First, it needs
to leave the kernel and go back to userspace, doesn't it?
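
For reference, the pattern I was auditing in every caller is simply
this (generic sketch, not any particular call site):

	unsigned long pflags;

	psi_memstall_enter(&pflags);
	/* ... block on reclaim or memory-bound I/O ... */
	psi_memstall_leave(&pflags);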

> I think if your theory was
> correct and psi_task_change() was called while task's cgroup is
> destroyed then task_psi_group() would have returned an invalid pointer
> and we would crash once that value is dereferenced.

I was thinking of something slightly different; something about the
cgroup being deleted or a task being terminated and the bookkeeping of
the PSI flags going wrong, maybe some data race. I found the whole
PSI code with per-task flags, per-cpu per-cgroup counters and flags
somewhat obscure (but somebody else's code is always obscure, of
course); I thought there was a lot of potential for mistakes with the
bookkeeping, but I found nothing specific.

Anyway, thanks for looking into this - I hope we can get a grip on
this issue, as it's preventing me from using PSI values for actual
process management; the servers that go into this state will always
appear overloaded and that would lead to killing all the workload
processes forever.

Max

2024-06-12 09:49:18

by Peter Zijlstra

Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Wed, Jun 12, 2024 at 08:49:02AM +0200, Max Kellermann wrote:
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <[email protected]> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Yeah, this. I can't find anything obvious either. The trickiest one is
read-ahead, though: I couldn't immediately track all the
readahead_expand() callers, and any such caller must then end up calling
read_pages() in order to land on the psi_memstall_leave(). This is
typically done through page_cache_ra_*().

The erofs one is also not entirely obvious, but irrelevant if you're not using
it... the below should make it a little more obvious, but what do I know.

(whitespace mangled)

--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1698,9 +1698,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			submit_bio(bio);
 		else
 			erofs_fscache_submit_bio(bio);
-		if (memstall)
-			psi_memstall_leave(&pflags);
 	}
+	if (memstall)
+		psi_memstall_leave(&pflags);

 	/*
 	 * although background is preferred, no one is pending for submission.


> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
>
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags going wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course);

It really is a bit tricky, performance and all that.

> I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
>
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.

Best case would be if you could somehow find a reproducer, but
I realize this might be tricky.

2024-06-12 10:21:05

by Max Kellermann

Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Wed, Jun 12, 2024 at 11:49 AM Peter Zijlstra <[email protected]> wrote:
> The erofs one is also not entirely obvious, but irrelevant if you're not using
> it... the below should make it a little more obvious, but what do I know.

We do use erofs a lot, and I read that very function the other day -
it is weird code with two loop levels plus continue and even goto, but
I thought it was okay. psi_memstall_enter() is only ever called if
bio != NULL, and the function takes care to call psi_memstall_leave()
when NULLing bio. Therefore I think your patch is not necessary (but
adds a tiny bit of overhead). What am I missing?

> Best case would be if you could somehow find a reproducer, but
> I realize this might be tricky.

Oh, I wish. I tried for several days, adding artificial delays
everywhere, in order to make some race more likely; I created and
deleted millions of cgroups and killed just as many processes under
(artificial) memory pressure, but nothing.

Max

2024-06-12 15:44:22

by Suren Baghdasaryan

Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

On Tue, Jun 11, 2024 at 11:49 PM Max Kellermann
<[email protected]> wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <[email protected]> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Doh! I assumed that this can happen, while in fact it should not unless
psi_memstall_enter()/psi_memstall_leave() are unbalanced. My bad.

Since the issue is hard to reproduce, maybe you could add debugging
code to store _RET_IP_ inside the task_struct at the end of
psi_memstall_enter() and clear it inside psi_memstall_leave(). Then, in
do_exit(), you check whether it is still set and generate a warning
reporting the recorded _RET_IP_. That should point us to the
psi_memstall_enter() that is missing its psi_memstall_leave().
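
Something along these lines (untested sketch; the task_struct field
name is made up for illustration):

/* include/linux/sched.h: new debug field in struct task_struct */
	unsigned long		last_memstall_ip;

/* kernel/sched/psi.c */
void psi_memstall_enter(unsigned long *flags)
{
	/* ... existing body (returns early if already in memstall) ... */
	current->last_memstall_ip = _RET_IP_;
}

void psi_memstall_leave(unsigned long *flags)
{
	/* ... existing body (returns early for nested sections) ... */
	current->last_memstall_ip = 0;
}

/* kernel/exit.c: early in do_exit() */
	WARN(current->last_memstall_ip,
	     "exiting with psi_memstall_enter() still active, entered from %pS\n",
	     (void *)current->last_memstall_ip);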
