2022-08-12 21:05:55

by Tejun Heo

Subject: Re: Selecting CPUs for queuing work on

On Fri, Aug 12, 2022 at 04:26:47PM -0400, Felix Kuehling wrote:
> Hi workqueue maintainers,
>
> In the KFD (amdgpu) driver we found a need to schedule bottom half interrupt
> handlers on CPU cores different from the one where the top-half interrupt
> handler runs to avoid the interrupt handler stalling the bottom half in
> extreme scenarios. See my latest patch that tries to use a different
> hyperthread on the same CPU core, or falls back to a different core in the
> same NUMA node if that fails:
> https://lore.kernel.org/all/[email protected]/
>
> Dave pointed out that the driver may not be the best place to implement such
> logic and suggested that we should have an abstraction, maybe in the
> workqueue code. Do you feel this is something that could or should be
> provided by the core workqueue code? Or maybe some other place?

I'm not necessarily against it. I guess it can be a flag on an unbound wq.
Do the interrupts move across different CPUs, though? I.e., why does this
need to be a dynamic decision?

Thanks.

--
tejun


2022-08-12 21:15:34

by Felix Kuehling

Subject: Re: Selecting CPUs for queuing work on

On 2022-08-12 16:30, Tejun Heo wrote:
> On Fri, Aug 12, 2022 at 04:26:47PM -0400, Felix Kuehling wrote:
>> Hi workqueue maintainers,
>>
>> In the KFD (amdgpu) driver we found a need to schedule bottom half interrupt
>> handlers on CPU cores different from the one where the top-half interrupt
>> handler runs to avoid the interrupt handler stalling the bottom half in
>> extreme scenarios. See my latest patch that tries to use a different
>> hyperthread on the same CPU core, or falls back to a different core in the
>> same NUMA node if that fails:
>> https://lore.kernel.org/all/[email protected]/
>>
>> Dave pointed out that the driver may not be the best place to implement such
>> logic and suggested that we should have an abstraction, maybe in the
>> workqueue code. Do you feel this is something that could or should be
>> provided by the core workqueue code? Or maybe some other place?
> I'm not necessarily against it. I guess it can be a flag on an unbound wq.
> Do the interrupts move across different CPUs, though? I.e., why does this
> need to be a dynamic decision?
In principle, I think IRQ routing to CPUs can change dynamically with
irqbalance.

If this were a flag, would there be a way to ensure that all work queued
to the same workqueue from the same CPU, or maybe all work associated
with a given work_struct, always goes to the same CPU? One of the
reasons for my latest patch was to get more predictable scheduling of
the work onto cores that are specifically reserved for interrupt
handling by the system admin. This minimizes CPU scheduling noise, which
can compound into real performance issues in large-scale distributed
applications.

What we need is kind of the opposite of WQ_UNBOUND. As I understand it,
WQ_UNBOUND can schedule anywhere to maximize concurrency. What we need
is to schedule to very specific, predictable CPUs. We only have one work
item per GPU that processes all the interrupts in order, so we don't
need the concurrency of WQ_UNBOUND.
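To make the discussion concrete, here is a minimal sketch of the selection policy described above: prefer an online sibling hyperthread of the interrupting CPU, fall back to another online core in the same NUMA node, and pin the bottom half there with queue_work_on(). The helper names (kfd_pick_bh_cpu, kfd_queue_bottom_half) are hypothetical, not the actual KFD patch; topology_sibling_cpumask(), cpumask_of_node(), and queue_work_on() are the real kernel APIs.

```c
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/workqueue.h>

/* Hypothetical sketch: pick a CPU for the bottom half, preferring a
 * sibling hyperthread of @irq_cpu, then any other online CPU in the
 * same NUMA node, falling back to @irq_cpu itself. */
static int kfd_pick_bh_cpu(int irq_cpu)
{
	int cpu;

	/* Another hyperthread sharing the same physical core. */
	for_each_cpu(cpu, topology_sibling_cpumask(irq_cpu))
		if (cpu != irq_cpu && cpu_online(cpu))
			return cpu;

	/* Fall back: any other online CPU in the same NUMA node. */
	for_each_cpu(cpu, cpumask_of_node(cpu_to_node(irq_cpu)))
		if (cpu != irq_cpu && cpu_online(cpu))
			return cpu;

	return irq_cpu; /* no alternative found */
}

static void kfd_queue_bottom_half(struct work_struct *work)
{
	/* Pin the work item to the chosen CPU rather than letting the
	 * workqueue machinery place it; this keeps placement predictable. */
	queue_work_on(kfd_pick_bh_cpu(smp_processor_id()), system_wq, work);
}
```

Since there is only one work item per GPU and it is always queued to a deterministically chosen CPU, ordering is preserved without needing WQ_UNBOUND's concurrency.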

Regards,
  Felix



2022-08-12 21:52:21

by Tejun Heo

Subject: Re: Selecting CPUs for queuing work on

Hello,

On Fri, Aug 12, 2022 at 04:54:04PM -0400, Felix Kuehling wrote:
> In principle, I think IRQ routing to CPUs can change dynamically with
> irqbalance.

I wonder whether this is something which should be exposed to userland,
rather than done dynamically in the kernel, and let irqbalance or
whatever deal with it. People use IRQ affinity to steer these handlers to
specific CPUs, and the usual expectation is that the bottom-half handling
is gonna take place on the same CPU, usually through softirq. It's kind
of awkward to have this secondary assignment happen implicitly.
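One possible shape for the userland-visible route suggested here: the driver only publishes a preferred CPU mask for its interrupt, and irqbalance (or an admin writing /proc/irq/<N>/smp_affinity) makes the actual placement decision. This is a hedged sketch; kfd_publish_irq_hint, kfd_irq, and bh_mask are hypothetical names, while irq_set_affinity_hint() is the real kernel API that exposes the mask as /proc/irq/<N>/affinity_hint.

```c
#include <linux/cpumask.h>
#include <linux/interrupt.h>

/* Hypothetical sketch: publish the driver's preferred bottom-half CPUs
 * so userland tooling can steer the IRQ (and thus the bottom half)
 * explicitly, instead of the kernel deciding implicitly. */
static int kfd_publish_irq_hint(unsigned int kfd_irq,
				const struct cpumask *bh_mask)
{
	/* Readable from userland as /proc/irq/<kfd_irq>/affinity_hint. */
	return irq_set_affinity_hint(kfd_irq, bh_mask);
}
```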

> What we need is kind of the opposite of WQ_UNBOUND. As I understand it,
> WQ_UNBOUND can schedule anywhere to maximize concurrency. What we need is to
> schedule to very specific, predictable CPUs. We only have one work item per
> GPU that processes all the interrupts in order, so we don't need the
> concurrency of WQ_UNBOUND.

Each WQ_UNBOUND workqueue has a cpumask associated with it, and the cpumask
can be changed dynamically, so it could be used for something like this, but
I'm not yet convinced that's the right thing to do.
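For illustration, the per-workqueue cpumask route could look like this. With WQ_SYSFS, an unbound workqueue's cpumask appears under /sys/devices/virtual/workqueue/<name>/cpumask and can be rewritten at runtime by the admin. The name "kfd_ih" and max_active = 1 (to keep the single per-GPU work item ordered) are assumptions for this sketch, not the actual driver code.

```c
#include <linux/workqueue.h>

/* Hypothetical sketch: an unbound workqueue whose CPU placement is
 * controlled from userland via
 * /sys/devices/virtual/workqueue/kfd_ih/cpumask, rather than chosen
 * by the driver at queue time. */
static struct workqueue_struct *kfd_create_ih_wq(void)
{
	return alloc_workqueue("kfd_ih", WQ_UNBOUND | WQ_SYSFS, 1);
}
```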

Thanks.

--
tejun