From: Suren Baghdasaryan
Date: Tue, 11 Jun 2024 22:01:29 -0700
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
To: Max Kellermann
Cc: Johannes Weiner, Peter Zijlstra, linux-kernel@vger.kernel.org

On Tue, Jun 4, 2024 at 12:16 AM Max Kellermann wrote:
>
> Hi kernel people,
> I have a problem that I have been trying to debug for a few days, but
> I got lost in the depths of the scheduler code; I'm stuck and I need
> your help.
>
> We have several servers which show a constant memory.pressure value of
> 30 to 100 (or more), even when the server is idle. I tracked this down
> to psi_group_cpu.tasks[NR_MEMSTALL] == 1, even though no such process
> exists, but I can't figure out why the kernel thinks there is still
> one task stuck in memstall. I tried to track down all the code paths
> that lead to psi_group_change(), but found nothing conclusive, and I
> failed to reproduce it on a test machine with kernel patches that
> inject delays (trying to trigger the data race that may have caused
> this problem).
>
> This happened on servers that were very busy and indeed were in
> memstall often, due to going over memory.high frequently. We have one
> "main" cgroup with memory.high configured, and all the workload
> processes live in sub-cgroups, of which we always have a few thousand.
> When memory.events gets triggered, our process manager stops a bunch
> of idle processes to free up memory, which then deletes the
> sub-cgroups they belong to. In other words: sub-cgroups get created
> and deleted very often, and they get deleted when there is indeed
> memory stall happening.
> My theory was that there could be a data race bug that forgets to
> decrement tasks[NR_MEMSTALL], maybe when a stalled child cgroup gets
> deleted.

Hi Max,
I'm not an expert in the scheduler (I maintain mostly PSI triggers),
so my feedback might be utterly wrong. I looked a bit into the
relevant code, and I think that if your theory were correct and
psi_task_change() were called while the task's cgroup is being
destroyed, then task_psi_group() would return an invalid pointer and
we would crash once that value is dereferenced.

Instead, I think what might be happening is that the task is
terminated while it is in memstall. do_exit() calls do_task_dead() at
the very end, which sets current->__state to TASK_DEAD and calls the
last __schedule() for this task. __schedule() calls
deactivate_task(rq, prev, DEQUEUE_SLEEP), which sets prev->on_rq = 0
and calls dequeue_task(..., DEQUEUE_SLEEP), leading to
psi_dequeue(..., true). Note that because that last parameter of
psi_dequeue() is "true", psi_task_change() will not be called at this
time.

Later on, __schedule() calls psi_sched_switch(), which leads to
psi_task_switch(); the last parameter will again be "true" because
prev->on_rq == 0. So we end up calling psi_task_switch(..., true).
Now, note this line:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L955
It clears TSK_MEMSTALL_RUNNING but not TSK_MEMSTALL. So, if the task
was in the TSK_MEMSTALL state, that state won't be cleared, which
might be the problem you are facing.

I think you can check whether this theory pans out by adding a
WARN_ON() at the end of psi_task_switch():

void psi_task_switch(struct task_struct *prev, struct task_struct *next,
                     bool sleep)
{
...
        if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
                clear &= ~TSK_ONCPU;
                for (; group; group = group->parent)
                        psi_group_change(group, cpu, clear, set, now,
                                         wake_clock);
        }
+       WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
}

Again, I am by no means an expert in this area; Johannes or Peter
would be much better people to consult with.

Thanks,
Suren.
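P.S. For anyone reading along without the source handy, the hunk
around that line looks roughly like this (paraphrased from my reading
of the 6.8 sources, so treat it as a sketch rather than an exact
quote):

        /* in psi_task_switch(), while prev is being switched out */
        if (prev->pid) {
                int clear = TSK_ONCPU, set = 0;

                if (sleep) {
                        clear |= TSK_RUNNING;
                        if (prev->in_memstall)
                                /* TSK_MEMSTALL stays set across sleep */
                                clear |= TSK_MEMSTALL_RUNNING;
                        if (prev->in_iowait)
                                set |= TSK_IOWAIT;
                }
                ...
        }

If I read this correctly, leaving TSK_MEMSTALL set here is deliberate:
a task that sleeps while in memstall is exactly what the "full"
pressure state is supposed to capture, and the flag is normally
dropped later by psi_memstall_leave(). A task that dies while still in
memstall never gets there, which would leave the group counter stuck
as described above.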
> On our Grafana, I can easily track the beginning of this bug to a
> point two weeks ago; in the system log, I can see that hundreds of
> processes needed to be terminated due to memory pressure at that
> time.
>
> The affected servers run kernel 6.8.7 with a few custom patches, but
> none of these patches affect the scheduler or cgroups; they're about
> unrelated things like denying access to Ceph snapshots and adjusting
> debugfs permissions. (I submitted most of those patches to LKML long
> ago, but nobody cared.)
> Newer kernels don't seem to have fixes for my problem; the relevant
> parts of the code are unchanged.
>
> One of the servers is still running with this problem, and I can
> access it with gdb on /proc/kcore. I'll keep it that way for some
> more time, so if you have any idea what to look for, let me know.
>
> This is the psi_group of the "main" cgroup:
>
> $1 = {parent = 0xffff9de707287800, enabled = true, avgs_lock = {owner
> = {counter = 0}, wait_lock = {raw_lock = {{val = {counter = 0},
> {locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail =
> 0}}}}, osq = {tail = {counter = 0}}, wait_list = {next =
> 0xffff9de70f772820, prev = 0xffff9de70f772820}}, pcpu =
> 0x3fb640033900, avg_total = {6133960836647, 5923217690044,
> 615319825665255, 595479374843164, 19259777147170, 12847590051880},
> avg_last_update = 1606208471280060, avg_next_update =
> 1606210394507082, avgs_work = {work = {data = {counter = 321}, entry
> = {next = 0xffff9de70f772880, prev = 0xffff9de70f772880}, func =
> 0xffffffff880dcc00}, timer = {entry = {next = 0x0, pprev =
> 0xffff9e05bef5bc48}, expires = 4455558105, function =
> 0xffffffff880a1ca0, flags = 522190853}, wq = 0xffff9de700051400, cpu
> = 64}, avg_triggers = {next = 0xffff9de70f7728d0, prev =
> 0xffff9de70f7728d0}, avg_nr_triggers = {0, 0, 0, 0, 0, 0}, total =
> {{6133960836647, 5923217690044, 615328415599847, 595487964777756,
> 19281251983650, 12869064888360}, {6092994926, 5559819737,
> 105947464151, 100672353730, 8196529519, 7678536634}}, avg = {{0, 0,
> 0}, {0, 0, 0}, {203596, 203716, 198499}, {203596, 203716, 198288},
> {0, 0, 60}, {0, 0, 0}}, rtpoll_task = 0x0, rtpoll_timer = {entry =
> {next = 0xdead000000000122, pprev = 0x0}, expires = 4405010639,
> function = 0xffffffff880dac50, flags = 67108868}, rtpoll_wait = {lock
> = {{rlock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000',
> pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}}}}, head =
> {next = 0xffff9de70f772a20, prev = 0xffff9de70f772a20}},
> rtpoll_wakeup = {counter = 0}, rtpoll_scheduled = {counter = 0},
> rtpoll_trigger_lock = {owner = {counter = 0}, wait_lock = {raw_lock =
> {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'},
> {locked_pending = 0, tail = 0}}}}, osq = {tail = {counter = 0}},
> wait_list = {next = 0xffff9de70f772a48, prev = 0xffff9de70f772a48}},
> rtpoll_triggers = {next = 0xffff9de70f772a58, prev =
> 0xffff9de70f772a58}, rtpoll_nr_triggers = {0, 0, 0, 0, 0, 0},
> rtpoll_states = 0, rtpoll_min_period = 18446744073709551615,
> rtpoll_total = {6092994926, 5559819737, 105947464151, 100672353730,
> 8196529519, 7678536634}, rtpoll_next_update = 1100738436720135,
> rtpoll_until = 0}
>
> This is a summary of all psi_group_pcpu for the 32 CPU cores (on the
> way, I wrote a small gdb script to dump interesting details like
> these, but that went nowhere):
>
> state_mask 0 = 0x0 tasks {0, 0, 0, 0}
> state_mask 1 = 0x0 tasks {0, 0, 0, 0}
> state_mask 2 = 0x0 tasks {0, 0, 0, 0}
> state_mask 3 = 0x0 tasks {0, 0, 0, 0}
> state_mask 4 = 0x0 tasks {0, 0, 0, 0}
> state_mask 5 = 0x0 tasks {0, 0, 0, 0}
> state_mask 6 = 0x0 tasks {0, 0, 0, 0}
> state_mask 7 = 0x0 tasks {0, 0, 0, 0}
> state_mask 8 = 0x0 tasks {0, 0, 0, 0}
> state_mask 9 = 0x0 tasks {0, 0, 0, 0}
> state_mask 10 = 0x0 tasks {0, 0, 0, 0}
> state_mask 11 = 0x0 tasks {0, 0, 0, 0}
> state_mask 12 = 0x0 tasks {0, 0, 0, 0}
> state_mask 13 = 0x0 tasks {0, 0, 0, 0}
> state_mask 14 = 0x0 tasks {0, 0, 0, 0}
> state_mask 15 = 0x0 tasks {0, 0, 0, 0}
> state_mask 16 = 0x0 tasks {0, 0, 0, 0}
> state_mask 17 = 0x0 tasks {0, 0, 0, 0}
> state_mask 18 = 0x0 tasks {0, 0, 0, 0}
> state_mask 19 = 0x0 tasks {0, 0, 0, 0}
> state_mask 20 = 0x0 tasks {0, 0, 0, 0}
> state_mask 21 = 0x0 tasks {0, 0, 0, 0}
> state_mask 22 = 0x0 tasks {0, 0, 0, 0}
> state_mask 23 = 0x0 tasks {0, 0, 0, 0}
> state_mask 24 = 0x0 tasks {0, 0, 0, 0}
> state_mask 25 = 0x0 tasks {0, 0, 0, 0}
> state_mask 26 = 0x0 tasks {0, 0, 0, 0}
> state_mask 27 = 0x0 tasks {0, 0, 0, 0}
> state_mask 28 = 0x0 tasks {0, 0, 0, 0}
> state_mask 29 = 0x0 tasks {0, 0, 0, 0}
> state_mask 30 = 0x4c tasks {0, 1, 0, 0}
> state_mask 31 = 0x0 tasks {0, 0, 0, 0}
>
> CPU core 30 is stuck with this bogus value. state_mask 0x4c =
> PSI_MEM_SOME|PSI_MEM_FULL|PSI_NONIDLE.
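Side note, since you mentioned that your own gdb script went nowhere:
a per-CPU summary like the one above can be produced with a few lines
of gdb Python. This is a rough, untested sketch; it assumes a vmlinux
with debug info, the usual x86-64 __per_cpu_offset resolution of
percpu pointers, and dump_psi_group() is just a name I made up for
illustration:

import gdb

def dump_psi_group(group_addr, nr_cpus):
    """Print state_mask and tasks[] of each CPU's psi_group_cpu."""
    group = gdb.Value(group_addr).cast(
        gdb.lookup_type('struct psi_group').pointer())
    gpc_type = gdb.lookup_type('struct psi_group_cpu').pointer()
    pcpu = int(group['pcpu'])  # __percpu offset, e.g. 0x3fb640033900 above
    offsets = gdb.parse_and_eval('__per_cpu_offset')
    for cpu in range(nr_cpus):
        gpc = gdb.Value(pcpu + int(offsets[cpu])).cast(gpc_type)
        print('state_mask %d = %#x tasks %s'
              % (cpu, int(gpc['state_mask']), gpc['tasks']))

Called with the address of the psi_group as the first argument
("python dump_psi_group(<psi_group address>, 32)"), it should print
the same table as above.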
> The memory pressure at the time of this writing:
>
> # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/memory.pressure
> some avg10=99.22 avg60=99.39 avg300=97.62 total=615423620626
> full avg10=99.22 avg60=99.39 avg300=97.54 total=595583169804
> # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/_/memory.pressure
> some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> full avg10=0.00 avg60=0.00 avg300=0.00 total=0
> # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/cgroup.stat
> nr_descendants 1
> nr_dying_descendants 2224
>
> There is currently no worker process; there is only one idle dummy
> process in a single sub-cgroup called "_", only there to keep the
> systemd scope populated. It should therefore be impossible to have
> memory.pressure when the only leaf cgroup has pressure=0.
>
> (nr_dying_descendants is decremented extremely slowly; I deactivated
> the server shortly before collecting these numbers, to make sure it's
> really idle and there are really no processes left to cause this
> pressure. I don't think nr_dying_descendants is relevant for this
> problem; even after two days of full idle, the counter and the
> pressure didn't go back to zero.)
>
> Please help :-)
>
> Max
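One more pointer, in case the WARN_ON() fires: as far as I can tell,
TSK_MEMSTALL is only set and cleared by the
psi_memstall_enter()/psi_memstall_leave() pair, which callers wrap
around reclaim-like sections. The pattern, sketched from memory (the
exact call sites vary):

        unsigned long pflags;

        psi_memstall_enter(&pflags);
        /* ... direct reclaim, memory.high throttling, etc. ... */
        psi_memstall_leave(&pflags);

So if a dying task still has TSK_MEMSTALL set, the interesting
question becomes how it reached do_exit() while still inside such a
section; the warning's stack trace should help narrow that down.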