Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Sender: Tejun Heo <htejun@gmail.com>
Date:   Wed, 12 Apr 2023 14:25:10 -1000
From:   Tejun Heo <tj@kernel.org>
To:     Linus Torvalds <torvalds@linux-foundation.org>
Cc:     Eric Biggers <ebiggers@kernel.org>, fsverity@lists.linux.dev,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        Theodore Ts'o <tytso@mit.edu>,
        Nathan Huckleberry <nhuck@google.com>,
        Victor Hsieh <victorhsieh@google.com>,
        Lai Jiangshan <jiangshanlai@gmail.com>
Subject: Re: [GIT PULL] fsverity fixes for v6.3-rc4
Message-ID: <ZDdL5s_wprnlF7ad@slm.duckdns.org>
References: <20230320210724.GB1434@sol.localdomain>
 <CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com>
 <ZBlJJBR7dH4/kIWD@slm.duckdns.org>
 <CAHk-=wh0wxPx1zP1onSs88KB6zOQ0oHyOg_vGr5aK8QJ8fuxnw@mail.gmail.com>
 <ZBulmj3CcYTiCC8z@slm.duckdns.org>
 <CAHk-=wgT2TJO6+B=Pho1VOtND-qC_d1PM1FC-Snf+sRpLhR=hg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAHk-=wgT2TJO6+B=Pho1VOtND-qC_d1PM1FC-Snf+sRpLhR=hg@mail.gmail.com>
Precedence: bulk

Hello, Linus.

Okay, I'm now back online.

On Thu, Mar 23, 2023 at 11:04:25AM -0700, Linus Torvalds wrote:
> On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Thanks for the pointers. They all seem plausible symptoms of work items
> > getting bounced across slow cache boundaries. I'm off for a few weeks so
> > can't really dig in right now but will get to it afterwards.
> 
> So just as a gut feeling, I suspect that one solution would be to
> always *start* the work on the local CPU (where "local" might be the
> same, or at least a sibling).

Yeah, that seems like the sanest way to leverage the scheduler. The only
complication is around tracking which workers were on which CPUs and how
sticky the cpu association should be (e.g. we don't want to unnecessarily
jump workers across CPUs but we probably don't want to maintain strict
per-cpu worker pools either). I'll try to come up with a reasonable
trade-off which isn't too complicated.
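
As a rough illustration of the stickiness trade-off (a userspace sketch, not
workqueue code; the struct, the helper names and the "CPUs are grouped in
blocks of llc_size" topology are all made up for the example), one possible
preference order is same CPU first, then a cache sibling, then anything idle:

```c
#include <stddef.h>

#define NO_CPU (-1)	/* hypothetical marker: worker has not run anywhere yet */

struct idle_worker {
	int last_cpu;	/* CPU this worker last ran on, or NO_CPU */
};

/* Assumed topology: CPUs sharing a cache are numbered in blocks of llc_size. */
static int same_llc(int a, int b, int llc_size)
{
	return a / llc_size == b / llc_size;
}

/*
 * Pick an idle worker for a work item queued on @cpu. Preference order:
 * a worker that last ran on @cpu, then one that last ran on a cache
 * sibling, then the first idle worker. Returns an index into @ws, or -1
 * if there are no idle workers.
 */
static int pick_worker(const struct idle_worker *ws, size_t nr, int cpu,
		       int llc_size)
{
	int sibling = -1, any = -1;

	for (size_t i = 0; i < nr; i++) {
		if (ws[i].last_cpu == cpu)
			return (int)i;
		if (sibling < 0 && ws[i].last_cpu != NO_CPU &&
		    same_llc(ws[i].last_cpu, cpu, llc_size))
			sibling = (int)i;
		if (any < 0)
			any = (int)i;
	}
	return sibling >= 0 ? sibling : any;
}
```

This keeps no per-cpu pool at all, just a hint per worker, which sits at the
cheap end of the trade-off; how well it would hold up is an open question.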

> The only reason to migrate to another CPU would be if the work is
> CPU-intensive, and I do suspect that is commonly not really the case.
> 
> And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage,
> and should just be gotten rid of, because what could be considered
> "CPU intensive" in one situation might not be CPU intensive in
> another one, so trying to use some static knowledge about it is just
> pure guess-work.
> 
> The different situations might be purely contextual things ("heavy
> network traffic when NAPI polling kicks in"), but it might also be
> purely hardware-related (ie "this is heavy if we don't have CPU hw
> acceleration for crypto, but cheap if we do").
> 
> So I really don't think it should be some static decision, either
> through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on
> first available CPU".
> 
> Wouldn't it be much nicer if we just noticed it dynamically, and
> WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another
> CPU if it ends up being advantageous?
> 
> And we actually kind of have that dynamic flag already, in the form of
> the scheduler. It might even be explicit in the context of the
> workqueue (with "need_resched()" being true and the workqueue code
> itself might notice it and explicitly then try to spread it out), but
> with preemption it's more implicit and maybe it needs a bit of
> tweaking help.

Yeah, CPU_INTENSIVE was added as an easy (to implement) way out for cpu
hogging percpu work items. Given that percpu workers track the scheduling
events anyway whether from preemption or explicit schedule(), it should be
possible to remove it while maintaining most of the benefits of worker
concurrency management. Because the scheduler isn't aware of work item
boundaries, workqueue can't blindly use scheduling events but that's easy to
resolve with an extra timestamp.
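
To illustrate the timestamp idea (a userspace sketch under assumptions, not
actual workqueue code; the names and the 10ms threshold are hypothetical):
stamp the time when a work item starts, and on each scheduling event compare
against that stamp rather than against how long the worker itself has been
on CPU, so that a string of short work items isn't mistaken for one hog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: this long on a single work item counts as hogging. */
#define HOG_THRESH_US 10000ULL

struct worker {
	uint64_t item_start_us;	/* stamped when the current work item starts */
	bool running_item;	/* true between item_start() and item_end() */
	bool hogging;		/* current item already crossed the threshold */
};

/* Stamp the start time so later scheduling events can be tied to this item. */
static void item_start(struct worker *w, uint64_t now_us)
{
	w->item_start_us = now_us;
	w->running_item = true;
	w->hogging = false;
}

/* The item finished; the next one starts with a clean slate. */
static void item_end(struct worker *w)
{
	w->running_item = false;
	w->hogging = false;
}

/*
 * Called on a scheduling event (tick or preemption) while the worker is on
 * CPU. Returns true only the first time the current item crosses the
 * threshold, which is the point where it could stop counting toward
 * concurrency management.
 */
static bool sched_event(struct worker *w, uint64_t now_us)
{
	if (!w->running_item || w->hogging)
		return false;
	if (now_us - w->item_start_us < HOG_THRESH_US)
		return false;
	w->hogging = true;
	return true;
}
```

Two back-to-back items of 4ms and 8ms keep the worker on CPU for 12ms, but
neither item trips the threshold by itself, which is the case that blindly
using scheduling events would get wrong.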

I'll think more about whether it'd be a good idea to subject unbound workers
to concurrency management before the work gets spread out. The only
distinction between percpu and unbound would then be whether a work item can
be booted off its CPU when it runs for too long; up to that point, both would
be subject to the same concurrency control.

> So that's what I mean by "start the work as local CPU work" - use that
> as the baseline decision (since it's going to be the case that has
> cache locality), and actively try to avoid spreading things out unless
> we have an explicit reason to, and that reason we could just get from
> the scheduler.
> 
> The worker code already has that "wq_worker_sleeping()" callback from
> the scheduler, but that only triggers when a worker is going to sleep.
> I'm saying that the "scheduler decided to schedule out a worker" case
> might be used as an "Oh, this is CPU intensive, let's try to spread it
> out".
> 
> See what I'm trying to say?

Yeah, lemme look into it. It'd be great to simplify workqueue usage and
actually make it leverage what the scheduler knows about what should run
where.

Thanks.

-- 
tejun