Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <20220219174940.2570901-1-surenb@google.com> <YhNTcM9XtqA1zUUi@dhcp22.suse.cz>
In-Reply-To: <YhNTcM9XtqA1zUUi@dhcp22.suse.cz>
From:   Tim Murray <timmurray@google.com>
Date:   Tue, 22 Feb 2022 11:47:01 -0800
Message-ID: <CAEe=Sxmow-jx60cDjFMY7qi7+KVc+BT++BTdwC5+G9E=1soMmQ@mail.gmail.com>
Subject: Re: [PATCH 1/1] mm: count time in drain_all_pages during direct
 reclaim as memory pressure
To:     Michal Hocko <mhocko@suse.com>
Cc:     Suren Baghdasaryan <surenb@google.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Peter Zijlstra <peterz@infradead.org>, guro@fb.com,
        Shakeel Butt <shakeelb@google.com>,
        Minchan Kim <minchan@kernel.org>,
        Linux-MM <linux-mm@kvack.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Android Kernel Team <kernel-team@android.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Mon, Feb 21, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> It would be cool to have some numbers here.

Are there any numbers beyond what Suren mentioned that would be
useful? As one example, in a trace of a camera workload that I opened
at random to check for drain_local_pages stalls, I saw the kworker
that ran drain_local_pages sit in the runnable state for 68 ms before
getting any CPU time. I could try to query our trace corpus to find more
examples, but they're not hard to find in individual traces already.

> If the draining is too slow and dependent on the current CPU/WQ
> contention then we should address that. The original intention was that
> having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> operation from the rest of WQ activity. Maybe we need to fine tune
> mm_percpu_wq. If that doesn't help then we should revise the WQ model
> and use something else. Memory reclaim shouldn't really get stuck behind
> other unrelated work.

In my experience, workqueues are easy to misuse and should be
approached with a lot of care. For many workloads, they work fine 99%+
of the time, but once you run into problems with scheduling delays for
that workqueue, the only option is to stop using workqueues. If you
have work that is system-initiated with minimal latency requirements
(e.g., some driver heartbeat every so often, devfreq governors, things
like that), workqueues are great. If you have userspace-initiated work
that should respect priority (e.g., GPU command buffer submission in the
critical path of UI) or latency-critical system-initiated work (e.g.,
display synchronization around panel refresh), workqueues are the
wrong choice because there is no RT capability. WQ_HIGHPRI has a minor
impact, but it won't solve the fundamental problem if the system is
under heavy enough load or if RT threads are involved. As Petr
mentioned, the best solution for those cases seems to be "convert the
workqueue to an RT kthread_worker." I've done that many times on many
different Android devices over the years for latency-critical work,
especially around GPU, display, and camera.

In the drain_local_pages case, I think it is triggered by userspace
work and should respect priority; I don't think a prio 50 RT task
should be blocked waiting on a prio 120 (or prio 100 if WQ_HIGHPRI)
kworker to be scheduled so it can run drain_local_pages. If that's a
reasonable claim, then I think moving drain_local_pages away from
workqueues is the best choice.
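
For reference, the priority numbers above are on the kernel's internal
scale, where a lower value is scheduled first. Below is a minimal
userspace C sketch of that arithmetic. It assumes the usual mapping
(nice n becomes 120 + n, RT priority p becomes 99 - p) and that
WQ_HIGHPRI workers run at nice -20. The helper names are made up for
illustration and are not kernel APIs.

```c
#include <assert.h>

/* Kernel-internal priority scale: a lower number is scheduled first. */
#define MAX_RT_PRIO   100   /* internal prios 0..99 are the RT range */
#define DEFAULT_PRIO  120   /* internal prio of a nice-0 task */

/* Hypothetical helper: internal prio of a normal (non-RT) task. */
static int prio_from_nice(int nice)
{
	return DEFAULT_PRIO + nice;
}

/* Hypothetical helper: internal prio of a SCHED_FIFO/SCHED_RR task
 * given its userspace-visible rt_priority (1..99). */
static int prio_from_rt(int rt_priority)
{
	return MAX_RT_PRIO - 1 - rt_priority;
}

/* Returns nonzero if a task at internal prio a outranks one at prio b. */
static int runs_before(int a, int b)
{
	return a < b;
}
```

With these helpers, prio_from_nice(0) is 120 (a normal kworker),
prio_from_nice(-20) is 100 (a WQ_HIGHPRI kworker), and prio_from_rt(49)
is 50, so the RT task outranks both kworkers yet still has to wait for
one of them to be scheduled.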