2023-01-13 22:05:02

by Tejun Heo

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Fri, Jan 13, 2023 at 01:07:02PM -0800, Nathan Huckleberry wrote:
> Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> important RT priority. This can reduce scheduler latency for IO
> post-processing when the CPU is under load without impacting other RT
> workloads. This has been shown to improve app startup time on Android
> [1].
>
> Scheduler latency affects several drivers as evidenced by [1], [2], [3],
> [4]. Some of these drivers have moved post-processing into IRQ context.
> However, this can cause latency spikes for real-time threads and
> jitter-related jank on Android. Using a workqueue with SCHED_FIFO improves
> scheduler latency without causing latency problems for RT threads.
>
> [1]:
> https://lore.kernel.org/linux-erofs/[email protected]/
> [2]:
> https://lore.kernel.org/linux-f2fs-devel/[email protected]/
> [3]:
> https://lore.kernel.org/dm-devel/[email protected]/
> [4]:
> https://lore.kernel.org/dm-crypt/[email protected]/
>
> This change has been tested on dm-verity with the following fio config:
>
> [global]
> time_based
> runtime=120
>
> [do-verify]
> ioengine=sync
> filename=/dev/testing
> rw=randread
> direct=1
>
> [burn_8x90%_qsort]
> ioengine=cpuio
> cpuload=90
> numjobs=8
> cpumode=qsort
>
> Before:
> clat (usec): min=13, max=23882, avg=29.56, stdev=113.29
> READ: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s),
> io=14.3GiB (15.3GB), run=120001-120001msec
>
> After:
> clat (usec): min=13, max=23137, avg=19.96, stdev=105.71
> READ: bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s),
> io=21.1GiB (22.7GB), run=120012-120012msec

Given that its use case mostly intersects with WQ_HIGHPRI, would it make
more sense to add a switch to alter its behavior instead? I don't really
like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
to each user.
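
As a rough sketch of what I mean (wq_highpri_use_fifo and
worker_set_highpri_sched() are made-up names, not existing workqueue
symbols), the switch could live in one place in workqueue.c instead of
in every caller:

  static bool wq_highpri_use_fifo; /* Kconfig/boot-param/sysfs controlled */

  static void worker_set_highpri_sched(struct task_struct *task)
  {
          if (wq_highpri_use_fifo) {
                  /* lowest RT priority, so other RT work still wins */
                  struct sched_param param = { .sched_priority = 1 };

                  sched_setscheduler_nocheck(task, SCHED_FIFO, &param);
          } else {
                  set_user_nice(task, HIGHPRI_NICE_LEVEL);
          }
  }

Callers would keep saying WQ_HIGHPRI; the system-wide setting decides
what that means.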

Thanks.

--
tejun


2023-01-14 22:10:42

by Nathan Huckleberry

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Fri, Jan 13, 2023 at 1:11 PM Tejun Heo <[email protected]> wrote:
>
> On Fri, Jan 13, 2023 at 01:07:02PM -0800, Nathan Huckleberry wrote:
> > Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> > important RT priority. This can reduce scheduler latency for IO
> > post-processing when the CPU is under load without impacting other RT
> > workloads. This has been shown to improve app startup time on Android
> > [1].
> >
> > Scheduler latency affects several drivers as evidenced by [1], [2], [3],
> > [4]. Some of these drivers have moved post-processing into IRQ context.
> > However, this can cause latency spikes for real-time threads and
> > jitter-related jank on Android. Using a workqueue with SCHED_FIFO improves
> > scheduler latency without causing latency problems for RT threads.
> >
> > [1]:
> > https://lore.kernel.org/linux-erofs/[email protected]/
> > [2]:
> > https://lore.kernel.org/linux-f2fs-devel/[email protected]/
> > [3]:
> > https://lore.kernel.org/dm-devel/[email protected]/
> > [4]:
> > https://lore.kernel.org/dm-crypt/[email protected]/
> >
> > This change has been tested on dm-verity with the following fio config:
> >
> > [global]
> > time_based
> > runtime=120
> >
> > [do-verify]
> > ioengine=sync
> > filename=/dev/testing
> > rw=randread
> > direct=1
> >
> > [burn_8x90%_qsort]
> > ioengine=cpuio
> > cpuload=90
> > numjobs=8
> > cpumode=qsort
> >
> > Before:
> > clat (usec): min=13, max=23882, avg=29.56, stdev=113.29
> > READ: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s),
> > io=14.3GiB (15.3GB), run=120001-120001msec
> >
> > After:
> > clat (usec): min=13, max=23137, avg=19.96, stdev=105.71
> > READ: bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s),
> > io=21.1GiB (22.7GB), run=120012-120012msec
>
> Given that its use case mostly intersects with WQ_HIGHPRI, would it make
> more sense to add a switch to alter its behavior instead? I don't really
> like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
> to each user.

This sounds fine to me. How do you feel about a config flag to change
the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
policy per workqueue?
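
Roughly like this (just a sketch; the attribute name and
wq_set_sched_fifo() are hypothetical, modeled on the existing WQ_SYSFS
attributes such as nice and max_active):

  static ssize_t sched_fifo_store(struct device *dev,
                                  struct device_attribute *attr,
                                  const char *buf, size_t count)
  {
          struct workqueue_struct *wq = dev_to_wq(dev);
          bool fifo;
          int ret;

          ret = kstrtobool(buf, &fifo);
          if (ret)
                  return ret;

          /* hypothetical helper that reschedules the wq's workers */
          ret = wq_set_sched_fifo(wq, fifo);
          return ret ?: count;
  }

plus a Kconfig option (say, CONFIG_WQ_HIGHPRI_FIFO) to pick the
boot-time default.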

Thanks,
Huck

>
> Thanks.

>
> --
> tejun

2023-01-18 18:03:13

by Tejun Heo

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Sat, Jan 14, 2023 at 01:00:00PM -0800, Nathan Huckleberry wrote:
> This sounds fine to me. How do you feel about a config flag to change
> the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
> policy per workqueue?

Yeah, sounds fine to me.

Thanks.

--
tejun

2023-01-18 18:57:13

by Sandeep Dhavale

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Wed, Jan 18, 2023 at 9:52 AM Tejun Heo <[email protected]> wrote:
>
> On Sat, Jan 14, 2023 at 01:00:00PM -0800, Nathan Huckleberry wrote:
> > This sounds fine to me. How do you feel about a config flag to change
> > the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
> > policy per workqueue?
>
> Yeah, sounds fine to me.
>
> Thanks.
>
Hi Tejun,
If, with the kernel config option, every WQ_HIGHPRI workqueue is elevated
to sched_fifo_low, wouldn't that kind of defeat the purpose? Having
another class for even more urgent work seems better to me.

Thanks,
Sandeep.
> --
> tejun

2023-01-18 22:47:02

by Nathan Huckleberry

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Wed, Jan 18, 2023 at 10:22 AM Sandeep Dhavale <[email protected]> wrote:
>
> On Wed, Jan 18, 2023 at 9:52 AM Tejun Heo <[email protected]> wrote:
> >
> > On Sat, Jan 14, 2023 at 01:00:00PM -0800, Nathan Huckleberry wrote:
> > > This sounds fine to me. How do you feel about a config flag to change
> > > the default WQ_HIGHPRI scheduler policy and a sysfs node to update the
> > > policy per workqueue?
> >
> > Yeah, sounds fine to me.
> >
> > Thanks.
> >
> Hi Tejun,
> If, with the kernel config option, every WQ_HIGHPRI workqueue is elevated
> to sched_fifo_low, wouldn't that kind of defeat the purpose? Having
> another class for even more urgent work seems better to me.

I agree; however, most of the users of WQ_HIGHPRI that are relevant to
Android would probably have SCHED_FIFO enabled anyway.

I can write the patches such that WQ_HIGHPRI selects one of two
internal WQ flags. If using SCHED_FIFO for all workqueues is
problematic, Android can consider using the internal WQ flags directly
and carry those one-line patches in the Android tree.
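
Something along these lines (illustrative only; the __WQ_* flags and
CONFIG_WQ_HIGHPRI_FIFO are hypothetical names):

  static unsigned int wq_resolve_highpri(unsigned int flags)
  {
          if (!(flags & WQ_HIGHPRI))
                  return flags;

          /* the user-facing flag picks one of two internal behaviors */
          if (IS_ENABLED(CONFIG_WQ_HIGHPRI_FIFO))
                  return flags | __WQ_SCHED_FIFO;

          return flags | __WQ_HIGHPRI_NICE;
  }

A driver that needs one behavior regardless of the config could then
pass the internal flag directly, which is the one-line patch Android
would carry.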

Thanks,
Huck

>
> Thanks,
> Sandeep.
> > --
> > tejun

2023-01-19 02:34:19

by Nathan Huckleberry

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Fri, Jan 13, 2023 at 1:11 PM Tejun Heo <[email protected]> wrote:
>
> On Fri, Jan 13, 2023 at 01:07:02PM -0800, Nathan Huckleberry wrote:
> > Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
> > important RT priority. This can reduce scheduler latency for IO
> > post-processing when the CPU is under load without impacting other RT
> > workloads. This has been shown to improve app startup time on Android
> > [1].
> >
> > Scheduler latency affects several drivers as evidenced by [1], [2], [3],
> > [4]. Some of these drivers have moved post-processing into IRQ context.
> > However, this can cause latency spikes for real-time threads and
> > jitter-related jank on Android. Using a workqueue with SCHED_FIFO improves
> > scheduler latency without causing latency problems for RT threads.
> >
> > [1]:
> > https://lore.kernel.org/linux-erofs/[email protected]/
> > [2]:
> > https://lore.kernel.org/linux-f2fs-devel/[email protected]/
> > [3]:
> > https://lore.kernel.org/dm-devel/[email protected]/
> > [4]:
> > https://lore.kernel.org/dm-crypt/[email protected]/
> >
> > This change has been tested on dm-verity with the following fio config:
> >
> > [global]
> > time_based
> > runtime=120
> >
> > [do-verify]
> > ioengine=sync
> > filename=/dev/testing
> > rw=randread
> > direct=1
> >
> > [burn_8x90%_qsort]
> > ioengine=cpuio
> > cpuload=90
> > numjobs=8
> > cpumode=qsort
> >
> > Before:
> > clat (usec): min=13, max=23882, avg=29.56, stdev=113.29
> > READ: bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s),
> > io=14.3GiB (15.3GB), run=120001-120001msec
> >
> > After:
> > clat (usec): min=13, max=23137, avg=19.96, stdev=105.71
> > READ: bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s),
> > io=21.1GiB (22.7GB), run=120012-120012msec
>

Hi Tejun,

> Given that its use case mostly intersects with WQ_HIGHPRI, would it make
> more sense to add a switch to alter its behavior instead? I don't really
> like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
> to each user.

Do you think something similar should be done for WQ_UNBOUND? In most
places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
boosts performance. However, I suspect that most of these benchmarks
were done on x86-64. I've found that WQ_UNBOUND significantly reduces
performance on arm64/Android.

From the documentation, using WQ_UNBOUND for performance doesn't seem
correct. It's only supposed to be used for long-running work. It might
make more sense to get rid of WQ_UNBOUND altogether and only move work
to unbound worker pools once it has stuck around for long enough.

Android will probably need to remove WQ_UNBOUND from all of these
performance-critical users.

If there are performance benefits to unbinding workqueues from CPUs on
x86-64, that should probably be a config flag, not something controlled
by every user.

Thanks,
Huck

>
> Thanks.
>
> --
> tejun

2023-01-19 02:45:34

by Tejun Heo

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

Hello,

On Wed, Jan 18, 2023 at 06:01:04PM -0800, Nathan Huckleberry wrote:
> Do you think something similar should be done for WQ_UNBOUND? In most
> places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
> boosts performance. However, I suspect that most of these benchmarks
> were done on x86-64. I've found that WQ_UNBOUND significantly reduces
> performance on arm64/Android.

One attribute of per-cpu workqueues is that they're concurrency-level
limited, i.e. if you have two per-cpu work items queued, the second one
might not run until the first one is done. Maybe people were trying to
avoid possible latency spikes from that?
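
A minimal sketch of that effect (demo module, not code from the patch):

  #include <linux/module.h>
  #include <linux/workqueue.h>
  #include <linux/delay.h>

  static void busy_fn(struct work_struct *work)
  {
          mdelay(10); /* CPU-bound: holds the pool's concurrency slot */
  }

  static DECLARE_WORK(first, busy_fn);
  static DECLARE_WORK(second, busy_fn);

  static int __init demo_init(void)
  {
          /* default flags: a per-cpu, concurrency-managed workqueue */
          struct workqueue_struct *wq = alloc_workqueue("demo", 0, 0);

          queue_work_on(0, wq, &first);
          queue_work_on(0, wq, &second); /* starts ~10ms later */
          return 0;
  }
  module_init(demo_init);
  MODULE_LICENSE("GPL");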

Even aside from that, UNBOUND tends to give more consistent latency
behavior, as you aren't necessarily bound to what's happening on that
particular CPU. I guess maybe that's also why, but I didn't really follow
how each user is picking and justifying these flags, so my insight is
pretty limited.

> From the documentation, using WQ_UNBOUND for performance doesn't seem
> correct. It's only supposed to be used for long-running work. It might
> make more sense to get rid of WQ_UNBOUND altogether and only move work
> to unbound worker pools once it has stuck around for long enough.

UNBOUND says: Don't pin this to one cpu or subject it to workqueue's
concurrency limit. Use workqueue as a generic thread pool.
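
i.e. something like the below gives you workers that can run anywhere,
in parallel up to max_active ("post_process" is just an example name):

  wq = alloc_workqueue("post_process", WQ_UNBOUND | WQ_HIGHPRI,
                       num_online_cpus());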

I don't know what you mean by performance, but HIGHPRI | UNBOUND will
definitely improve some aspects.

> Android will probably need to remove WQ_UNBOUND from all of these
> performance-critical users.
>
> If there are performance benefits to unbinding workqueues from CPUs on
> x86-64, that should probably be a config flag, not something controlled
> by every user.

It's unlikely that the instruction set is what's making the difference here,
right? It probably would help if we understood why it's worse on arm.

I don't think people have been all that deliberate with these flags, so it's
also likely that some of the usages can drop UNBOUND completely, but I really
think more data and analyses would help.

Thanks.

--
tejun

2023-01-27 19:25:26

by Nathan Huckleberry

Subject: Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

On Wed, Jan 18, 2023 at 6:29 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, Jan 18, 2023 at 06:01:04PM -0800, Nathan Huckleberry wrote:
> > Do you think something similar should be done for WQ_UNBOUND? In most
> > places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
> > boosts performance. However, I suspect that most of these benchmarks
> > were done on x86-64. I've found that WQ_UNBOUND significantly reduces
> > performance on arm64/Android.
>
> One attribute of per-cpu workqueues is that they're concurrency-level
> limited, i.e. if you have two per-cpu work items queued, the second one
> might not run until the first one is done. Maybe people were trying to
> avoid possible latency spikes from that?
>
> Even aside from that, UNBOUND tends to give more consistent latency
> behavior, as you aren't necessarily bound to what's happening on that
> particular CPU. I guess maybe that's also why, but I didn't really follow
> how each user is picking and justifying these flags, so my insight is
> pretty limited.
>
> > From the documentation, using WQ_UNBOUND for performance doesn't seem
> > correct. It's only supposed to be used for long-running work. It might
> > make more sense to get rid of WQ_UNBOUND altogether and only move work
> > to unbound worker pools once it has stuck around for long enough.
>
> UNBOUND says: Don't pin this to one cpu or subject it to workqueue's
> concurrency limit. Use workqueue as a generic thread pool.
>
> I don't know what you mean by performance, but HIGHPRI | UNBOUND will
> definitely improve some aspects.
>
> > Android will probably need to remove WQ_UNBOUND from all of these
> > performance-critical users.
> >
> > If there are performance benefits to unbinding workqueues from CPUs on
> > x86-64, that should probably be a config flag, not something controlled
> > by every user.
>
> It's unlikely that the instruction set is what's making the difference here,
> right? It probably would help if we understood why it's worse on arm.

I did some more digging. For dm-verity I think this is related to the
availability of SHA instructions. If SHA instructions are present,
WQ_UNBOUND is suboptimal because the work finishes very quickly.

That doesn't explain why EROFS is slower with WQ_UNBOUND though.

It might also be related to the heterogeneity of modern ARM
processors. Locality may be more important for ARM processors than for
x86-64.

See the table below:

| open-prebuilt-camera   | UNBOUND + HIGHPRI | HIGHPRI only   | SCHED_FIFO only |
| erofs wait time (us)   | 357805            | 174205 (-51%)  | 129861 (-63%)   |
| verity wait time (us)  | 11746             | 119 (-98%)     | 0 (-100%)       |

The bigger issue seems to be WQ_UNBOUND, so I'm abandoning these
patches for now.

Thanks,
Huck

>
> I don't think people have been all that deliberate with these flags, so it's
> also likely that some of the usages can drop UNBOUND completely, but I really
> think more data and analyses would help.
>
> Thanks.

>
> --
> tejun