When NOHZ_FULL is enabled, such as in HPC scenarios, CPUs are divided
into housekeeping CPUs and non-housekeeping CPUs. The non-housekeeping
CPUs are NOHZ_FULL CPUs and are often monopolized by a userspace
process, such as an HPC application process. Any sort of interruption
of these CPUs is unwanted.

blk_mq_hctx_next_cpu() selects each CPU in 'hctx->cpumask' in turn to
schedule the work function blk_mq_run_work_fn(). When 'hctx->cpumask'
contains both housekeeping and non-housekeeping CPUs, a housekeeping
CPU that wants to issue an I/O request may schedule the worker on a
non-housekeeping CPU. This may degrade the performance of userspace
applications running on the non-housekeeping CPUs.

So schedule the worker thread on the current CPU when the current CPU
is a housekeeping CPU.
Signed-off-by: Xiongfeng Wang <[email protected]>
---
block/blk-mq.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1adfe4824ef5..ff9a4bf16858 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -24,6 +24,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/topology.h>
#include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
#include <linux/delay.h>
#include <linux/crash_dump.h>
#include <linux/prefetch.h>
@@ -2036,6 +2037,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
unsigned long msecs)
{
+ int work_cpu;
+
if (unlikely(blk_mq_hctx_stopped(hctx)))
return;
@@ -2050,7 +2053,17 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
put_cpu();
}
- kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
+ /*
+ * Avoid housekeeping CPUs scheduling a worker on a non-housekeeping
+ * CPU
+ */
+ if (tick_nohz_full_enabled() && housekeeping_cpu(smp_processor_id(),
+ HK_FLAG_WQ))
+ work_cpu = smp_processor_id();
+ else
+ work_cpu = blk_mq_hctx_next_cpu(hctx);
+
+ kblockd_mod_delayed_work_on(work_cpu, &hctx->run_work,
msecs_to_jiffies(msecs));
}
--
2.20.1
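For context, the housekeeping split the patch relies on comes from the kernel command line. The layout below is purely illustrative (the CPU ranges are an assumed example, not taken from the report); the resulting masks can be inspected through the standard sysfs files:

```shell
# Illustrative boot parameters: CPUs 2-7 are nohz_full (non-housekeeping),
# CPUs 0-1 remain housekeeping CPUs.
#   nohz_full=2-7 isolcpus=nohz,domain,managed_irq,2-7

# Inspect the active configuration on a running system:
cat /sys/devices/system/cpu/nohz_full   # CPUs running tickless
cat /sys/devices/system/cpu/isolated    # CPUs isolated from the scheduler
```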
Hello Xiongfeng,
On Tue, Feb 15, 2022 at 10:29:51AM +0800, Xiongfeng Wang wrote:
> Hi Ming,
>
> Sorry to disturb you. It's just that I think you may be interested in
> this patch. I found the following commit written by you:
> commit 11ea68f553e244851d15793a7fa33a97c46d8271
> genirq, sched/isolation: Isolate from handling managed interrupts
> It removed managed-interrupt disturbance from non-housekeeping CPUs as
> long as the non-housekeeping CPUs do not issue I/O. But the work
> function blk_mq_run_work_fn() may still run on the non-housekeeping
> CPUs. I would appreciate it a lot if you could give it a look.
Yeah, commit 11ea68f553e24 touches the irq subsystem to try not to
assign isolated CPUs as the effective affinity of managed irqs.

Here blk-mq just selects one CPU and calls mod_delayed_work_on() to
execute the run-queue handler on the specified CPU. There is a lot of
such bound-workqueue usage in the tree, so I guess this might be a
generic workqueue or scheduler problem rather than a blk-mq-specific
issue. I'm not sure it is a good idea to address it in the block layer.
thanks,
Ming
>
> Thanks,
> Xiongfeng
>
> On 2022/2/10 17:35, Xiongfeng Wang wrote:
> > [...]
--
Ming
Hi Ming,
Sorry to disturb you. It's just that I think you may be interested in
this patch. I found the following commit written by you:
commit 11ea68f553e244851d15793a7fa33a97c46d8271
genirq, sched/isolation: Isolate from handling managed interrupts
It removed managed-interrupt disturbance from non-housekeeping CPUs as
long as the non-housekeeping CPUs do not issue I/O. But the work
function blk_mq_run_work_fn() may still run on the non-housekeeping
CPUs. I would appreciate it a lot if you could give it a look.
Thanks,
Xiongfeng
On 2022/2/10 17:35, Xiongfeng Wang wrote:
> [...]
Hi Frederic,
Sorry to disturb you. It's just that I think you may be interested in
this patch. I noticed you are reviewing some other CPU-isolation
patches. I would appreciate it a lot if you could give it a look, or
just ignore it if you are not interested.
Thanks,
Xiongfeng
On 2022/2/10 17:35, Xiongfeng Wang wrote:
> [...]
Hi Ming,
Thanks for your reply !
On 2022/2/15 12:37, Ming Lei wrote:
> Hello Xiongfeng,
>
> On Tue, Feb 15, 2022 at 10:29:51AM +0800, Xiongfeng Wang wrote:
>> [...]
>
> Yeah, commit 11ea68f553e24 touches the irq subsystem to try not to
> assign isolated CPUs as the effective affinity of managed irqs.
>
> Here blk-mq just selects one CPU and calls mod_delayed_work_on() to
> execute the run-queue handler on the specified CPU. There is a lot of
> such bound-workqueue usage in the tree, so I guess this might be a
> generic workqueue or scheduler problem rather than a blk-mq-specific
> issue. I'm not sure it is a good idea to address it in the block layer.
Yes, I have also found some other work functions running on the
non-housekeeping CPUs. Some of them need to read per-CPU data, such as
drain_local_pages_wq(), but the workqueue subsystem doesn't know
whether a work function reads any per-CPU data, and therefore whether
its worker can be migrated to another CPU.

For workqueues marked WQ_UNBOUND, the following commit can move the
worker threads to the housekeeping CPUs:
commit 1bda3f8087fce9063da0b8aef87f17a3fe541aca
sched/isolation: Isolate workqueues when "nohz_full=" is set
But for workqueues without the WQ_UNBOUND flag, the workqueue subsystem
doesn't know whether the worker threads can be migrated to another CPU.
So I think maybe the subsystem that creates the workqueue should decide
whether its worker threads can be migrated.
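As a point of comparison with the WQ_UNBOUND case mentioned above, unbound workqueue workers can already be confined at runtime through the standard workqueue sysfs attribute (the path is the stock kernel knob; the mask value below is an illustrative assumption for an 8-CPU machine):

```shell
# Confine all WQ_UNBOUND workqueue workers to housekeeping CPUs 0-1
# (hex cpumask 0x3). Bound (per-CPU) workqueues, like the one discussed
# in this thread, are not affected by this mask.
echo 3 > /sys/devices/virtual/workqueue/cpumask
cat /sys/devices/virtual/workqueue/cpumask
```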
Thanks,
Xiongfeng