2014-01-31 19:53:32

by Zoran Markovic

Subject: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

From: Shaibal Dutta <[email protected]>

For better use of CPU idle time, allow the scheduler to select the CPU
on which the SRCU grace period work would be scheduled. This improves
idle residency time and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Lai Jiangshan <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Dipankar Sarma <[email protected]>
Signed-off-by: Shaibal Dutta <[email protected]>
[[email protected]: Rebased to latest kernel version. Added commit
message. Fixed code alignment.]
---
kernel/rcu/srcu.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 3318d82..a1ebe6d 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
 	rcu_batch_queue(&sp->batch_queue, head);
 	if (!sp->running) {
 		sp->running = true;
-		schedule_delayed_work(&sp->work, 0);
+		queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
 	}
 	spin_unlock_irqrestore(&sp->queue_lock, flags);
 }
@@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
 	}
 
 	if (pending)
-		schedule_delayed_work(&sp->work, SRCU_INTERVAL);
+		queue_delayed_work(system_power_efficient_wq,
+				   &sp->work, SRCU_INTERVAL);
 }
 
 /*
--
1.7.9.5
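
For context on why this helps: when power-efficient workqueues are enabled,
a WQ_POWER_EFFICIENT workqueue such as system_power_efficient_wq is treated
as unbound, which is what lets the scheduler pick the CPU for its work items.
A rough sketch of the relevant check, paraphrased from kernel/workqueue.c of
this era (details may differ by kernel version):

	/* In __alloc_workqueue_key(): a WQ_POWER_EFFICIENT workqueue is
	 * silently promoted to WQ_UNBOUND when power-efficient workqueues
	 * are enabled, so its work items are no longer pinned to the
	 * submitting CPU. */
	if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
		flags |= WQ_UNBOUND;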


2014-01-31 20:10:38

by Zoran Markovic

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

Signed-off-by: Zoran Markovic <[email protected]>

On 31 January 2014 11:53, Zoran Markovic <[email protected]> wrote:
> From: Shaibal Dutta <[email protected]>
>
> For better use of CPU idle time, allow the scheduler to select the CPU
> on which the SRCU grace period work would be scheduled. This improves
> idle residency time and conserves power.
>
> This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
>
> Cc: Lai Jiangshan <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Dipankar Sarma <[email protected]>
> Signed-off-by: Shaibal Dutta <[email protected]>
> [[email protected]: Rebased to latest kernel version. Added commit
> message. Fixed code alignment.]
> ---
> kernel/rcu/srcu.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 3318d82..a1ebe6d 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> rcu_batch_queue(&sp->batch_queue, head);
> if (!sp->running) {
> sp->running = true;
> - schedule_delayed_work(&sp->work, 0);
> + queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
> }
> spin_unlock_irqrestore(&sp->queue_lock, flags);
> }
> @@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
> }
>
> if (pending)
> - schedule_delayed_work(&sp->work, SRCU_INTERVAL);
> + queue_delayed_work(system_power_efficient_wq,
> + &sp->work, SRCU_INTERVAL);
> }
>
> /*
> --
> 1.7.9.5
>

2014-02-10 10:06:16

by Lai Jiangshan

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

Acked-by: Lai Jiangshan <[email protected]>

On 02/01/2014 03:53 AM, Zoran Markovic wrote:
> From: Shaibal Dutta <[email protected]>
>
> For better use of CPU idle time, allow the scheduler to select the CPU
> on which the SRCU grace period work would be scheduled. This improves
> idle residency time and conserves power.
>
> This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
>
> Cc: Lai Jiangshan <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Dipankar Sarma <[email protected]>
> Signed-off-by: Shaibal Dutta <[email protected]>
> [[email protected]: Rebased to latest kernel version. Added commit
> message. Fixed code alignment.]
> ---
> kernel/rcu/srcu.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 3318d82..a1ebe6d 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> rcu_batch_queue(&sp->batch_queue, head);
> if (!sp->running) {
> sp->running = true;
> - schedule_delayed_work(&sp->work, 0);
> + queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
> }
> spin_unlock_irqrestore(&sp->queue_lock, flags);
> }
> @@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
> }
>
> if (pending)
> - schedule_delayed_work(&sp->work, SRCU_INTERVAL);
> + queue_delayed_work(system_power_efficient_wq,
> + &sp->work, SRCU_INTERVAL);
> }
>
> /*

2014-02-10 18:47:40

by Paul E. McKenney

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> Acked-by: Lai Jiangshan <[email protected]>

Thank you all, queued for 3.15.

We should also have some facility for moving the SRCU workqueues to
housekeeping/timekeeping kthreads in the NO_HZ_FULL case. Or does
this patch already have that effect?

Thanx, Paul

> On 02/01/2014 03:53 AM, Zoran Markovic wrote:
> > From: Shaibal Dutta <[email protected]>
> >
> > For better use of CPU idle time, allow the scheduler to select the CPU
> > on which the SRCU grace period work would be scheduled. This improves
> > idle residency time and conserves power.
> >
> > This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
> >
> > Cc: Lai Jiangshan <[email protected]>
> > Cc: "Paul E. McKenney" <[email protected]>
> > Cc: Dipankar Sarma <[email protected]>
> > Signed-off-by: Shaibal Dutta <[email protected]>
> > [[email protected]: Rebased to latest kernel version. Added commit
> > message. Fixed code alignment.]
> > ---
> > kernel/rcu/srcu.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> > index 3318d82..a1ebe6d 100644
> > --- a/kernel/rcu/srcu.c
> > +++ b/kernel/rcu/srcu.c
> > @@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> > rcu_batch_queue(&sp->batch_queue, head);
> > if (!sp->running) {
> > sp->running = true;
> > - schedule_delayed_work(&sp->work, 0);
> > + queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
> > }
> > spin_unlock_irqrestore(&sp->queue_lock, flags);
> > }
> > @@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
> > }
> >
> > if (pending)
> > - schedule_delayed_work(&sp->work, SRCU_INTERVAL);
> > + queue_delayed_work(system_power_efficient_wq,
> > + &sp->work, SRCU_INTERVAL);
> > }
> >
> > /*
>

2014-02-12 18:23:43

by Frederic Weisbecker

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > Acked-by: Lai Jiangshan <[email protected]>
>
> Thank you all, queued for 3.15.
>
> We should also have some facility for moving the SRCU workqueues to
> housekeeping/timekeeping kthreads in the NO_HZ_FULL case. Or does
> this patch already have that effect?

Kevin Hilman and I plan to try to bring a new Kconfig option that could let
us control the unbound workqueues affinity through sysfs.

The feature actually exists currently but is only enabled for workqueues that
have WQ_SYSFS. Writeback and raid5 are the only current users.

See for example: /sys/devices/virtual/workqueue/writeback/cpumask
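
As a concrete usage sketch, confining the writeback workqueue to, say, CPUs
0-3 amounts to writing a hex mask to that file (root required). A minimal,
hypothetical C helper might look like the following; the mask value is only
an example, and the file exists only for WQ_SYSFS workqueues:

#include <stdio.h>

/* Hypothetical helper: pin the "writeback" WQ_SYSFS workqueue to CPUs 0-3
 * by writing a hex cpumask to its sysfs file. Path and mask are examples. */
int main(void)
{
	const char *path = "/sys/devices/virtual/workqueue/writeback/cpumask";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "f\n");	/* hex mask: CPUs 0-3 */
	return fclose(f) != 0;
}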

2014-02-12 19:20:19

by Paul E. McKenney

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Wed, Feb 12, 2014 at 07:23:38PM +0100, Frederic Weisbecker wrote:
> On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > Acked-by: Lai Jiangshan <[email protected]>
> >
> > Thank you all, queued for 3.15.
> >
> > We should also have some facility for moving the SRCU workqueues to
> > housekeeping/timekeeping kthreads in the NO_HZ_FULL case. Or does
> > this patch already have that effect?
>
> Kevin Hilman and I plan to try to bring a new Kconfig option that could let
> us control the unbound workqueues affinity through sysfs.

Please CC me or feel free to update Documentation/kernel-per-CPU-kthreads.txt
as part of this upcoming series.

> The feature actually exists currently but is only enabled for workqueues that
> have WQ_SYSFS. Writeback and raid5 are the only current users.
>
> See for example: /sys/devices/virtual/workqueue/writeback/cpumask

Ah, news to me! I have queued the following patch, seem reasonable?

Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..09f28841ee3e 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,11 @@ Purpose: Execute workqueue requests
To reduce its OS jitter, do any of the following:
1. Run your workload at a real-time priority, which will allow
preempting the kworker daemons.
-2. Do any of the following needed to avoid jitter that your
+2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
+ to force the WQ_SYSFS workqueues to run on the specified set
+ of CPUs. The set of WQ_SYSFS workqueues can be displayed using
+ "ls sys/devices/virtual/workqueue".
+3. Do any of the following needed to avoid jitter that your
application cannot tolerate:
a. Build your kernel with CONFIG_SLUB=y rather than
CONFIG_SLAB=y, thus avoiding the slab allocator's periodic

2014-02-12 19:24:04

by Tejun Heo

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

Hello,

On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> + to force the WQ_SYSFS workqueues to run on the specified set
> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> + "ls sys/devices/virtual/workqueue".

One thing to be careful about is that once published, it becomes part
of userland visible interface. Maybe adding some words warning
against sprinkling WQ_SYSFS willy-nilly is a good idea?

Thanks.

--
tejun

2014-02-12 20:13:29

by Paul E. McKenney

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > + to force the WQ_SYSFS workqueues to run on the specified set
> > + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> > + "ls sys/devices/virtual/workqueue".
>
> One thing to be careful about is that once published, it becomes part
> of userland visible interface. Maybe adding some words warning
> against sprinkling WQ_SYSFS willy-nilly is a good idea?

Good point! How about the following?

Thanx, Paul

------------------------------------------------------------------------

Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity

This commit documents the ability to apply CPU affinity to WQ_SYSFS
workqueues, thus offloading them from the desired worker CPUs.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Tejun Heo <[email protected]>

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..214da3a47a68 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
To reduce its OS jitter, do any of the following:
1. Run your workload at a real-time priority, which will allow
preempting the kworker daemons.
-2. Do any of the following needed to avoid jitter that your
+2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
+ to force the WQ_SYSFS workqueues to run on the specified set
+ of CPUs. The set of WQ_SYSFS workqueues can be displayed using
+ "ls sys/devices/virtual/workqueue". That said, the workqueues
+ maintainer would like to caution people against indiscriminately
+ sprinkling WQ_SYSFS across all the workqueues. The reason for
+ caution is that it is easy to add WQ_SYSFS, but because sysfs
+ is part of the formal user/kernel API, it can be nearly impossible
+ to remove it, even if its addition was a mistake.
+3. Do any of the following needed to avoid jitter that your
application cannot tolerate:
a. Build your kernel with CONFIG_SLUB=y rather than
CONFIG_SLAB=y, thus avoiding the slab allocator's periodic

2014-02-12 20:13:47

by Tejun Heo

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
>
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Tejun Heo <[email protected]>

Reviewed-by: Tejun Heo <[email protected]>

Thanks!

--
tejun

2014-02-12 23:05:02

by Frederic Weisbecker

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
> On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > > +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > > + to force the WQ_SYSFS workqueues to run on the specified set
> > > + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> > > + "ls sys/devices/virtual/workqueue".
> >
> > One thing to be careful about is that once published, it becomes part
> > of userland visible interface. Maybe adding some words warning
> > against sprinkling WQ_SYSFS willy-nilly is a good idea?
>
> Good point! How about the following?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
>
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Tejun Heo <[email protected]>
>
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..214da3a47a68 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
> To reduce its OS jitter, do any of the following:
> 1. Run your workload at a real-time priority, which will allow
> preempting the kworker daemons.
> -2. Do any of the following needed to avoid jitter that your
> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> + to force the WQ_SYSFS workqueues to run on the specified set
> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> + "ls sys/devices/virtual/workqueue". That said, the workqueues
> + maintainer would like to caution people against indiscriminately
> + sprinkling WQ_SYSFS across all the workqueues. The reason for
> + caution is that it is easy to add WQ_SYSFS, but because sysfs
> + is part of the formal user/kernel API, it can be nearly impossible
> + to remove it, even if its addition was a mistake.
> +3. Do any of the following needed to avoid jitter that your

Acked-by: Frederic Weisbecker <[email protected]>

I just suggest we append a small explanation about what WQ_SYSFS is about.
Like:

+2.
+ The workqueues that want to be visible on the sysfs hierarchy
+ set the WQ_SYSFS flag.
+ For those who have this flag set, you can use the
+ /sys/devices/virtual/workqueue/*/cpumask sysfs files
+ to force the workqueues to run on the specified set
+ of CPUs. The set of WQ_SYSFS workqueues can be displayed using
+ "ls sys/devices/virtual/workqueue". That said, the workqueues
+ maintainer would like to caution people against indiscriminately
+ sprinkling WQ_SYSFS across all the workqueues. The reason for
+ caution is that it is easy to add WQ_SYSFS, but because sysfs
+ is part of the formal user/kernel API, it can be nearly impossible
+ to remove it, even if its addition was a mistake.
+3. Do any of the following needed to avoid jitter that your

2014-02-13 00:33:20

by Paul E. McKenney

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Thu, Feb 13, 2014 at 12:04:57AM +0100, Frederic Weisbecker wrote:
> On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > > > +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > > > + to force the WQ_SYSFS workqueues to run on the specified set
> > > > + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> > > > + "ls sys/devices/virtual/workqueue".
> > >
> > > One thing to be careful about is that once published, it becomes part
> > > of userland visible interface. Maybe adding some words warning
> > > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> >
> > Good point! How about the following?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> >
> > This commit documents the ability to apply CPU affinity to WQ_SYSFS
> > workqueues, thus offloading them from the desired worker CPUs.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> >
> > diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> > index 827104fb9364..214da3a47a68 100644
> > --- a/Documentation/kernel-per-CPU-kthreads.txt
> > +++ b/Documentation/kernel-per-CPU-kthreads.txt
> > @@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
> > To reduce its OS jitter, do any of the following:
> > 1. Run your workload at a real-time priority, which will allow
> > preempting the kworker daemons.
> > -2. Do any of the following needed to avoid jitter that your
> > +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > + to force the WQ_SYSFS workqueues to run on the specified set
> > + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> > + "ls sys/devices/virtual/workqueue". That said, the workqueues
> > + maintainer would like to caution people against indiscriminately
> > + sprinkling WQ_SYSFS across all the workqueues. The reason for
> > + caution is that it is easy to add WQ_SYSFS, but because sysfs
> > + is part of the formal user/kernel API, it can be nearly impossible
> > + to remove it, even if its addition was a mistake.
> > +3. Do any of the following needed to avoid jitter that your
>
> Acked-by: Frederic Weisbecker <[email protected]>
>
> I just suggest we append a small explanation about what WQ_SYSFS is about.
> Like:

Fair point! I wordsmithed it into the following. Seem reasonable?

Thanx, Paul

------------------------------------------------------------------------

Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity

This commit documents the ability to apply CPU affinity to WQ_SYSFS
workqueues, thus offloading them from the desired worker CPUs.

Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..f3cd299fcc41 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
To reduce its OS jitter, do any of the following:
1. Run your workload at a real-time priority, which will allow
preempting the kworker daemons.
-2. Do any of the following needed to avoid jitter that your
+2. A given workqueue can be made visible in the sysfs filesystem
+ by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
+ Such a workqueue can be confined to a given subset of the
+ CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
+ files. The set of WQ_SYSFS workqueues can be displayed using
+ "ls sys/devices/virtual/workqueue". That said, the workqueues
+ maintainer would like to caution people against indiscriminately
+ sprinkling WQ_SYSFS across all the workqueues. The reason for
+ caution is that it is easy to add WQ_SYSFS, but because sysfs is
+ part of the formal user/kernel API, it can be nearly impossible
+ to remove it, even if its addition was a mistake.
+3. Do any of the following needed to avoid jitter that your
application cannot tolerate:
a. Build your kernel with CONFIG_SLUB=y rather than
CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
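
To make the documentation text above concrete: opting a workqueue into this
interface is a one-flag change at creation time. A hypothetical driver
workqueue (the "foo_wq" name is made up) would do something like:

	/* Hypothetical example: an unbound workqueue that opts into the
	 * sysfs interface, so /sys/devices/virtual/workqueue/foo_wq/cpumask
	 * becomes available for affinity tuning. */
	struct workqueue_struct *wq;

	wq = alloc_workqueue("foo_wq", WQ_UNBOUND | WQ_SYSFS, 0);
	if (!wq)
		return -ENOMEM;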

2014-02-13 01:28:51

by Lai Jiangshan

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On 02/13/2014 08:33 AM, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 12:04:57AM +0100, Frederic Weisbecker wrote:
>> On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
>>> On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
>>>>> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
>>>>> + to force the WQ_SYSFS workqueues to run on the specified set
>>>>> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
>>>>> + "ls sys/devices/virtual/workqueue".
>>>>
>>>> One thing to be careful about is that once published, it becomes part
>>>> of userland visible interface. Maybe adding some words warning
>>>> against sprinkling WQ_SYSFS willy-nilly is a good idea?
>>>
>>> Good point! How about the following?
>>>
>>> Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
>>>
>>> This commit documents the ability to apply CPU affinity to WQ_SYSFS
>>> workqueues, thus offloading them from the desired worker CPUs.
>>>
>>> Signed-off-by: Paul E. McKenney <[email protected]>
>>> Cc: Frederic Weisbecker <[email protected]>
>>> Cc: Tejun Heo <[email protected]>
>>>
>>> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
>>> index 827104fb9364..214da3a47a68 100644
>>> --- a/Documentation/kernel-per-CPU-kthreads.txt
>>> +++ b/Documentation/kernel-per-CPU-kthreads.txt
>>> @@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
>>> To reduce its OS jitter, do any of the following:
>>> 1. Run your workload at a real-time priority, which will allow
>>> preempting the kworker daemons.
>>> -2. Do any of the following needed to avoid jitter that your
>>> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
>>> + to force the WQ_SYSFS workqueues to run on the specified set
>>> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
>>> + "ls sys/devices/virtual/workqueue". That said, the workqueues
>>> + maintainer would like to caution people against indiscriminately
>>> + sprinkling WQ_SYSFS across all the workqueues. The reason for
>>> + caution is that it is easy to add WQ_SYSFS, but because sysfs
>>> + is part of the formal user/kernel API, it can be nearly impossible
>>> + to remove it, even if its addition was a mistake.
>>> +3. Do any of the following needed to avoid jitter that your
>>
>> Acked-by: Frederic Weisbecker <[email protected]>
>>
>> I just suggest we append a small explanation about what WQ_SYSFS is about.
>> Like:
>
> Fair point! I wordsmithed it into the following. Seem reasonable?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
>
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
> Acked-by: Frederic Weisbecker <[email protected]>
>
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..f3cd299fcc41 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
> To reduce its OS jitter, do any of the following:
> 1. Run your workload at a real-time priority, which will allow
> preempting the kworker daemons.
> -2. Do any of the following needed to avoid jitter that your
> +2. A given workqueue can be made visible in the sysfs filesystem
> + by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
> + Such a workqueue can be confined to a given subset of the
> + CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
> + files. The set of WQ_SYSFS workqueues can be displayed using
> + "ls sys/devices/virtual/workqueue". That said, the workqueues
> + maintainer would like to caution people against indiscriminately
> + sprinkling WQ_SYSFS across all the workqueues. The reason for
> + caution is that it is easy to add WQ_SYSFS, but because sysfs is
> + part of the formal user/kernel API, it can be nearly impossible
> + to remove it, even if its addition was a mistake.
> +3. Do any of the following needed to avoid jitter that your
> application cannot tolerate:
> a. Build your kernel with CONFIG_SLUB=y rather than
> CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
>
>


Reviewed-by: Lai Jiangshan <[email protected]>

2014-02-13 14:05:31

by Frederic Weisbecker

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Wed, Feb 12, 2014 at 04:33:11PM -0800, Paul E. McKenney wrote:
>
> Fair point! I wordsmithed it into the following. Seem reasonable?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
>
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
> Acked-by: Frederic Weisbecker <[email protected]>
>
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..f3cd299fcc41 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
> To reduce its OS jitter, do any of the following:
> 1. Run your workload at a real-time priority, which will allow
> preempting the kworker daemons.
> -2. Do any of the following needed to avoid jitter that your
> +2. A given workqueue can be made visible in the sysfs filesystem
> + by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
> + Such a workqueue can be confined to a given subset of the
> + CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
> + files. The set of WQ_SYSFS workqueues can be displayed using
> + "ls sys/devices/virtual/workqueue". That said, the workqueues
> + maintainer would like to caution people against indiscriminately
> + sprinkling WQ_SYSFS across all the workqueues. The reason for
> + caution is that it is easy to add WQ_SYSFS, but because sysfs is
> + part of the formal user/kernel API, it can be nearly impossible
> + to remove it, even if its addition was a mistake.
> +3. Do any of the following needed to avoid jitter that your
> application cannot tolerate:
> a. Build your kernel with CONFIG_SLUB=y rather than
> CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
>

Perfect!!

2014-02-14 23:24:39

by Kevin Hilman

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

Tejun Heo <[email protected]> writes:

> Hello,
>
> On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
>> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
>> + to force the WQ_SYSFS workqueues to run on the specified set
>> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
>> + "ls sys/devices/virtual/workqueue".
>
> One thing to be careful about is that once published, it becomes part
> of userland visible interface. Maybe adding some words warning
> against sprinkling WQ_SYSFS willy-nilly is a good idea?

In the NO_HZ_FULL case, it seems to me we'd always want all unbound
workqueues to have their affinity set to the housekeeping CPUs.

Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
the affinity can be controlled? I guess the main reason would be that
all of these workqueue names would become permanent ABI.

At least for NO_HZ_FULL, maybe this should be automatic. The cpumask of
unbound workqueues should default to !tick_nohz_full_mask? Any WQ_SYSFS
workqueues could still be overridden from userspace, but at least the
default would be sane, and help keep full dyntics CPUs isolated.

Example patch below, only boot tested on 4-CPU ARM system with
CONFIG_NO_HZ_FULL_ALL=y and verified that 'cat
/sys/devices/virtual/workqueue/writeback/cpumask' looked sane. If this
looks OK, I can maybe clean it up a bit and make it runtime check
instead of a compile time check.

Kevin



From 902a2b58d61a51415457ea6768d687cdb7532eff Mon Sep 17 00:00:00 2001
From: Kevin Hilman <[email protected]>
Date: Fri, 14 Feb 2014 15:10:58 -0800
Subject: [PATCH] workqueue: for NO_HZ_FULL, set default cpumask to
!tick_nohz_full_mask

To help in keeping NO_HZ_FULL CPUs isolated, keep unbound workqueues
from running on full dynticks CPUs. To do this, set the default
workqueue cpumask to be the set of "housekeeping" CPUs instead of all
possible CPUs.

This is just the starting/default cpumask, and can be overridden in all
the normal ways (NUMA settings, apply_workqueue_attrs and via sysfs for
workqueues with the WQ_SYSFS attribute.)

Cc: Tejun Heo <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Kevin Hilman <[email protected]>
---
kernel/workqueue.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 987293d03ebc..9a9d9b0eaf6d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -48,6 +48,7 @@
 #include <linux/nodemask.h>
 #include <linux/moduleparam.h>
 #include <linux/uaccess.h>
+#include <linux/tick.h>
 
 #include "workqueue_internal.h"
 
@@ -3436,7 +3437,11 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
 	if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
 		goto fail;
 
+#ifdef CONFIG_NO_HZ_FULL
+	cpumask_complement(attrs->cpumask, tick_nohz_full_mask);
+#else
 	cpumask_copy(attrs->cpumask, cpu_possible_mask);
+#endif
 	return attrs;
 fail:
 	free_workqueue_attrs(attrs);
--
1.8.3
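
Kevin mentions above that this could become a runtime check rather than a
compile-time one. A minimal sketch of that variant, assuming the existing
tick_nohz_full_enabled() helper and ignoring init-ordering questions, might
be:

	/* Sketch only: choose the default cpumask at runtime instead of
	 * via #ifdef, masking out the full-dynticks CPUs only when
	 * NO_HZ_FULL is actually in effect. */
	if (tick_nohz_full_enabled())
		cpumask_complement(attrs->cpumask, tick_nohz_full_mask);
	else
		cpumask_copy(attrs->cpumask, cpu_possible_mask);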

2014-02-15 07:37:19

by Mike Galbraith

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Fri, 2014-02-14 at 15:24 -0800, Kevin Hilman wrote:
> Tejun Heo <[email protected]> writes:
>
> > Hello,
> >
> > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> >> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> >> + to force the WQ_SYSFS workqueues to run on the specified set
> >> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> >> + "ls sys/devices/virtual/workqueue".
> >
> > One thing to be careful about is that once published, it becomes part
> > of userland visible interface. Maybe adding some words warning
> > against sprinkling WQ_SYSFS willy-nilly is a good idea?
>
> In the NO_HZ_FULL case, it seems to me we'd always want all unbound
> workqueues to have their affinity set to the housekeeping CPUs.
>
> Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
> the affinity can be controlled? I guess the main reason would be that
> all of these workqueue names would become permanent ABI.
>
> At least for NO_HZ_FULL, maybe this should be automatic. The cpumask of
> unbound workqueues should default to !tick_nohz_full_mask? Any WQ_SYSFS
> workqueues could still be overridden from userspace, but at least the
> default would be sane, and help keep full dyntics CPUs isolated.

What I'm thinking is that it should be automatic, but not necessarily
based upon the nohz full mask, rather maybe based upon whether sched
domains exist, or perhaps a generic exclusive cpuset property, though
some really don't want anything to do with cpusets.

Why? Because there are jitter intolerant loads where nohz full isn't all
that useful, because you'll be constantly restarting and stopping the
tick, and eating the increased accounting overhead to no gain because
there are frequently multiple realtime tasks running. For these loads
(I have a user with such a fairly hefty 80 core rt load), dynamically
turning the tick _on_ is currently a better choice than nohz_full.
Point being, control of where unbound workqueues are allowed to run
isn't only desirable for single task HPC loads, other loads exist.

For my particular fairly critical 80 core load, workqueues aren't a real
big hairy deal, because its jitter tolerance isn't _all_ that tight (30
us max is easy enough to meet with room to spare). The load can slice
through workers well enough to meet requirements, but it would certainly
be a win to be able to keep them at bay. (gonna measure it, less jitter
is better even if it's only a little bit better.. eventually somebody
will demand what's currently impossible to deliver)

-Mike

2014-02-16 16:41:15

by Paul E. McKenney

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Sat, Feb 15, 2014 at 08:36:44AM +0100, Mike Galbraith wrote:
> On Fri, 2014-02-14 at 15:24 -0800, Kevin Hilman wrote:
> > Tejun Heo <[email protected]> writes:
> >
> > > Hello,
> > >
> > > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > >> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > >> + to force the WQ_SYSFS workqueues to run on the specified set
> > >> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> > >> + "ls sys/devices/virtual/workqueue".
> > >
> > > One thing to be careful about is that once published, it becomes part
> > > of userland visible interface. Maybe adding some words warning
> > > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> >
> > In the NO_HZ_FULL case, it seems to me we'd always want all unbound
> > workqueues to have their affinity set to the housekeeping CPUs.
> >
> > Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
> > the affinity can be controlled? I guess the main reason would be that
> > all of these workqueue names would become permanent ABI.
> >
> > At least for NO_HZ_FULL, maybe this should be automatic. The cpumask of
> > unbound workqueues should default to !tick_nohz_full_mask? Any WQ_SYSFS
> > workqueues could still be overridden from userspace, but at least the
> > default would be sane, and help keep full dyntics CPUs isolated.
>
> What I'm thinking is that it should be automatic, but not necessarily
> based upon the nohz full mask, rather maybe based upon whether sched
> domains exist, or perhaps a generic exclusive cpuset property, though
> some really don't want anything to do with cpusets.
>
> Why? Because there are jitter intolerant loads where nohz full isn't all
> that useful, because you'll be constantly restarting and stopping the
> tick, and eating the increased accounting overhead to no gain because
> there are frequently multiple realtime tasks running. For these loads
> (I have a user with such a fairly hefty 80 core rt load), dynamically
> turning the tick _on_ is currently a better choice than nohz_full.
> Point being, control of where unbound workqueues are allowed to run
> isn't only desirable for single task HPC loads, other loads exist.
>
> For my particular fairly critical 80 core load, workqueues aren't a real
> big hairy deal, because its jitter tolerance isn't _all_ that tight (30
> us max is easy enough to meet with room to spare). The load can slice
> through workers well enough to meet requirements, but it would certainly
> be a win to be able to keep them at bay. (gonna measure it, less jitter
> is better even if it's only a little bit better.. eventually somebody
> will demand what's currently impossible to deliver)

So if there is NO_HZ_FULL, you have no objection to binding workqueues to
the timekeeping CPUs, but that you would also like some form of automatic
binding in the !NO_HZ_FULL case. Of course, if a common mechanism could
serve both cases, that would be good. And yes, cpusets are frowned upon
for some workloads.

So maybe start with Kevin's patch, but augment with something else for
the !NO_HZ_FULL case?

Thanx, Paul

2014-02-17 04:50:29

by Mike Galbraith

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Sun, 2014-02-16 at 08:41 -0800, Paul E. McKenney wrote:

> So if there is NO_HZ_FULL, you have no objection to binding workqueues to
> the timekeeping CPUs, but that you would also like some form of automatic
> binding in the !NO_HZ_FULL case. Of course, if a common mechanism could
> serve both cases, that would be good. And yes, cpusets are frowned upon
> for some workloads.

I'm not _objecting_, I'm not driving, Frederic's doing that ;-)

That said, isolation seems to be turning into a property of nohz mode,
but as I see it, nohz_full is an extension to generic isolation.

> So maybe start with Kevin's patch, but augment with something else for
> the !NO_HZ_FULL case?

Sure (hm, does it work without workqueue.disable_numa ?).

It just seems to me that tying it to sched domain construction would be
a better fit. That way, it doesn't matter what your isolation-requiring
load is; whether you run a gaggle of realtime tasks or one HPC task is
your business, the generic requirement is isolation, not tick mode. For
one HPC task per core you want no tick; if you're running all
SCHED_FIFO, maybe you want that too, depending on the impact of
nohz_full mode. All sensitive loads want the isolation, but they may
not like the price.

I personally like the cpuset way. Being able to partition boxen on the
fly makes them very flexible. In a perfect world, you'd be able to
quiesce and configure offloading and nohz_full on the fly too, and not
end up with some hodgepodge like this needs boot option foo, that
happens invisibly because of config option bar, the other thing you have
to do manually.. and you get to eat 937 kthreads and tons of overhead on
all CPUs if you want the ability to _maybe_ run a critical task or two.

-Mike

2014-02-17 05:26:52

by Mike Galbraith

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Wed, 2014-02-12 at 19:23 +0100, Frederic Weisbecker wrote:
> On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > Acked-by: Lai Jiangshan <[email protected]>
> >
> > Thank you all, queued for 3.15.
> >
> > We should also have some facility for moving the SRCU workqueues to
> > housekeeping/timekeeping kthreads in the NO_HZ_FULL case. Or does
> > this patch already have that effect?
>
> Kevin Hilman and I plan to try to bring a new Kconfig option that could let
> us control the unbound workqueues affinity through sysfs.

Handing control to the user seemed like a fine thing, so I started
making a boot option to enable it. Forcing WQ_SYSFS on at sysfs
decision spot doesn't go well, init order matters :) Post init frobbing
required if you want to see/frob all unbound.

-Mike

2014-02-19 07:00:38

by Mike Galbraith

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Mon, 2014-02-17 at 05:50 +0100, Mike Galbraith wrote:
> On Sun, 2014-02-16 at 08:41 -0800, Paul E. McKenney wrote:

> > So maybe start with Kevin's patch, but augment with something else for
> > the !NO_HZ_FULL case?
>
> Sure (hm, does it work without workqueue.disable_numa ?).

I took the patch out for a spin on a 40 core box +SMT, with CPUs 4-79
isolated via an exclusive cpuset with load balancing off. Worker bees
ignored the patch either way.

-Mike

Perturbation measurement hog pinned to CPU4.

With patch:

# TASK-PID CPU# |||||| TIMESTAMP FUNCTION
# | | | |||||| | |
pert-9949 [004] ....113 405.120164: workqueue_queue_work: work struct=ffff880a5c4ecc08 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=4
pert-9949 [004] ....113 405.120166: workqueue_activate_work: work struct ffff880a5c4ecc08
pert-9949 [004] d.L.313 405.120169: sched_wakeup: comm=kworker/4:2 pid=2119 prio=120 success=1 target_cpu=004
pert-9949 [004] d.Lh213 405.120170: tick_stop: success=no msg=more than 1 task in runqueue
pert-9949 [004] d.L.213 405.120172: tick_stop: success=no msg=more than 1 task in runqueue
pert-9949 [004] d...3.. 405.120173: sched_switch: prev_comm=pert prev_pid=9949 prev_prio=120 prev_state=R+ ==> next_comm=kworker/4:2 next_pid=2119 next_prio=120
kworker/4:2-2119 [004] ....1.. 405.120174: workqueue_execute_start: work struct ffff880a5c4ecc08: function flush_to_ldisc
kworker/4:2-2119 [004] d...311 405.120176: sched_wakeup: comm=sshd pid=6620 prio=120 success=1 target_cpu=000
kworker/4:2-2119 [004] ....1.. 405.120176: workqueue_execute_end: work struct ffff880a5c4ecc08
kworker/4:2-2119 [004] d...3.. 405.120177: sched_switch: prev_comm=kworker/4:2 prev_pid=2119 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=9949 next_prio=120
pert-9949 [004] ....113 405.120178: workqueue_queue_work: work struct=ffff880a5c4ecc08 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=4
pert-9949 [004] ....113 405.120179: workqueue_activate_work: work struct ffff880a5c4ecc08
pert-9949 [004] d.L.313 405.120179: sched_wakeup: comm=kworker/4:2 pid=2119 prio=120 success=1 target_cpu=004
pert-9949 [004] d.L.213 405.120181: tick_stop: success=no msg=more than 1 task in runqueue
pert-9949 [004] d...3.. 405.120181: sched_switch: prev_comm=pert prev_pid=9949 prev_prio=120 prev_state=R+ ==> next_comm=kworker/4:2 next_pid=2119 next_prio=120
kworker/4:2-2119 [004] ....1.. 405.120182: workqueue_execute_start: work struct ffff880a5c4ecc08: function flush_to_ldisc
kworker/4:2-2119 [004] ....1.. 405.120183: workqueue_execute_end: work struct ffff880a5c4ecc08
kworker/4:2-2119 [004] d...3.. 405.120183: sched_switch: prev_comm=kworker/4:2 prev_pid=2119 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=9949 next_prio=120
pert-9949 [004] d...1.. 405.120736: tick_stop: success=yes msg=
pert-9949 [004] ....113 410.121082: workqueue_queue_work: work struct=ffff880a5c4ecc08 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=4
pert-9949 [004] ....113 410.121082: workqueue_activate_work: work struct ffff880a5c4ecc08
pert-9949 [004] d.L.313 410.121084: sched_wakeup: comm=kworker/4:2 pid=2119 prio=120 success=1 target_cpu=004
pert-9949 [004] d.Lh213 410.121085: tick_stop: success=no msg=more than 1 task in runqueue
pert-9949 [004] d.L.213 410.121087: tick_stop: success=no msg=more than 1 task in runqueue
...and so on until tick time


(extra cheezy) hack kinda sorta works iff workqueue.disable_numa:

---
kernel/workqueue.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1328,8 +1328,11 @@ static void __queue_work(int cpu, struct
 
 	rcu_read_lock();
 retry:
-	if (req_cpu == WORK_CPU_UNBOUND)
+	if (req_cpu == WORK_CPU_UNBOUND) {
 		cpu = raw_smp_processor_id();
+		if (runqueue_is_isolated(cpu))
+			cpu = 0;
+	}
 
 	/* pwq which will be used unless @work is executing elsewhere */
 	if (!(wq->flags & WQ_UNBOUND))


# TASK-PID CPU# |||||| TIMESTAMP FUNCTION
# | | | |||||| | |
<...>-33824 [004] ....113 5555.889694: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
<...>-33824 [004] ....113 5555.889695: workqueue_activate_work: work struct ffff880a596eb008
<...>-33824 [004] d...313 5555.889697: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
<...>-33824 [004] ....113 5560.890594: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
<...>-33824 [004] ....113 5560.890595: workqueue_activate_work: work struct ffff880a596eb008
<...>-33824 [004] d...313 5560.890596: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
<...>-33824 [004] ....113 5565.891493: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
<...>-33824 [004] ....113 5565.891493: workqueue_activate_work: work struct ffff880a596eb008
<...>-33824 [004] d...313 5565.891494: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
<...>-33824 [004] ....113 5570.892401: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
<...>-33824 [004] ....113 5570.892401: workqueue_activate_work: work struct ffff880a596eb008
<...>-33824 [004] d...313 5570.892403: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
<...>-33824 [004] ....113 5575.893300: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
<...>-33824 [004] ....113 5575.893301: workqueue_activate_work: work struct ffff880a596eb008
<...>-33824 [004] d...313 5575.893302: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
<...>-33824 [004] d..h1.. 5578.854979: softirq_raise: vec=1 [action=TIMER]
<...>-33824 [004] dN..3.. 5578.854981: sched_wakeup: comm=sirq-timer/4 pid=319 prio=69 success=1 target_cpu=004
<...>-33824 [004] dN..1.. 5578.854982: tick_stop: success=no msg=more than 1 task in runqueue
<...>-33824 [004] dN.h1.. 5578.854983: tick_stop: success=no msg=more than 1 task in runqueue
<...>-33824 [004] dN..1.. 5578.854985: tick_stop: success=no msg=more than 1 task in runqueue
<...>-33824 [004] d...3.. 5578.854986: sched_switch: prev_comm=pert prev_pid=33824 prev_prio=120 prev_state=R+ ==> next_comm=sirq-timer/4 next_pid=319 next_prio=69
sirq-timer/4-319 [004] d..h3.. 5578.854987: softirq_raise: vec=1 [action=TIMER]
sirq-timer/4-319 [004] d...3.. 5578.854989: tick_stop: success=no msg=more than 1 task in runqueue
sirq-timer/4-319 [004] ....111 5578.854990: softirq_entry: vec=1 [action=TIMER]
sirq-timer/4-319 [004] ....111 5578.855194: softirq_exit: vec=1 [action=TIMER] <== 204us tick, not good... to_stare_at++
sirq-timer/4-319 [004] d...3.. 5578.855196: sched_switch: prev_comm=sirq-timer/4 prev_pid=319 prev_prio=69 prev_state=S ==> next_comm=pert next_pid=33824 next_prio=120
<...>-33824 [004] d...1.. 5578.855987: tick_stop: success=yes msg=
...etc

2014-02-24 17:55:24

by Kevin Hilman

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

Mike Galbraith <[email protected]> writes:

> On Sun, 2014-02-16 at 08:41 -0800, Paul E. McKenney wrote:
>
>> So if there is NO_HZ_FULL, you have no objection to binding workqueues to
>> the timekeeping CPUs, but that you would also like some form of automatic
>> binding in the !NO_HZ_FULL case. Of course, if a common mechanism could
>> serve both cases, that would be good. And yes, cpusets are frowned upon
>> for some workloads.
>
> I'm not _objecting_, I'm not driving, Frederic's doing that ;-)
>
> That said, isolation seems to be turning into a property of nohz mode,
> but as I see it, nohz_full is an extension to generic isolation.
>
>> So maybe start with Kevin's patch, but augment with something else for
>> the !NO_HZ_FULL case?
>
> Sure (hm, does it work without workqueue.disable_numa ?).

[ /me returns from vacation ]

Yes, since it happens for every alloc_workqueue_attrs()

> It just seems to me that tying it to sched domain construction would be
> a better fit. That way, it doesn't matter what your isolation-requiring
> load is; whether you run a gaggle of realtime tasks or one HPC task is
> your business, the generic requirement is isolation, not tick mode. For
> one HPC task per core you want no tick; if you're running all
> SCHED_FIFO, maybe you want that too, depending on the impact of
> nohz_full mode. All sensitive loads want the isolation, but they may
> not like the price.
>
> I personally like the cpuset way. Being able to partition boxen on the
> fly makes them very flexible. In a perfect world, you'd be able to
> quiesce and configure offloading and nohz_full on the fly too, and not
> end up with some hodgepodge like this needs boot option foo, that
> happens invisibly because of config option bar, the other thing you have
> to do manually.. and you get to eat 937 kthreads and tons of overhead on
> all CPUs if you want the ability to _maybe_ run a critical task or two.

Yeah, my patch only addresses the nohz_full case, but since there
doesn't seem to be any general agreement about the generic case, it
seems that exposing all unbound workqueues via WQ_SYSFS is the way to
go.

Mike, looks like you may have started on that. Did it get any further?

Kevin

2014-02-24 18:34:32

by Mike Galbraith

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Mon, 2014-02-24 at 09:55 -0800, Kevin Hilman wrote:

> Yeah, my patch only addresses the nohz_full case, but since there
> doesn't seem to be any general agreement about the generic case, it
> seems that exposing all unbound workqueues via WQ_SYSFS is the way to
> go.
>
> Mike, looks like you may have started on that. Did it get any further?

No, work gets in the way of tinker time. Not sure which way I want to
explore anyway. On the one hand, auto-quiesce thingy tied to domain
construction/destruction sounds good, but on the other, just handing
control to the user is highly attractive.. likely lots simpler too.

-Mike

2014-02-27 14:43:41

by Frederic Weisbecker

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Mon, Feb 17, 2014 at 06:26:41AM +0100, Mike Galbraith wrote:
> On Wed, 2014-02-12 at 19:23 +0100, Frederic Weisbecker wrote:
> > On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > > Acked-by: Lai Jiangshan <[email protected]>
> > >
> > > Thank you all, queued for 3.15.
> > >
> > > We should also have some facility for moving the SRCU workqueues to
> > > housekeeping/timekeeping kthreads in the NO_HZ_FULL case. Or does
> > > this patch already have that effect?
> >
> > Kevin Hilman and I plan to try to bring a new Kconfig option that could let
> > us control the unbound workqueues affinity through sysfs.
>
> Handing control to the user seemed like a fine thing, so I started
> making a boot option to enable it. Forcing WQ_SYSFS on at sysfs
> decision spot doesn't go well, init order matters :) Post init frobbing
> required if you want to see/frob all unbound.

I'm curious about the details. Is that because some workqueues are registered
before sysfs is even initialized?

2014-02-27 15:08:52

by Frederic Weisbecker

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Fri, Feb 14, 2014 at 03:24:35PM -0800, Kevin Hilman wrote:
> Tejun Heo <[email protected]> writes:
>
> > Hello,
> >
> > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> >> +2. Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> >> + to force the WQ_SYSFS workqueues to run on the specified set
> >> + of CPUs. The set of WQ_SYSFS workqueues can be displayed using
> >> + "ls sys/devices/virtual/workqueue".
> >
> > One thing to be careful about is that once published, it becomes part
> > of userland visible interface. Maybe adding some words warning
> > against sprinkling WQ_SYSFS willy-nilly is a good idea?
>
> In the NO_HZ_FULL case, it seems to me we'd always want all unbound
> workqueues to have their affinity set to the housekeeping CPUs.
>
> Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
> the affinity can be controlled? I guess the main reason would be that
> all of these workqueue names would become permanent ABI.

Right. It's a legitimate worry, but couldn't we consider workqueue names
just like kernel thread names? Ie: something that can be renamed or
disappear anytime from one kernel version to another?

Or does sysfs have really strict rules about that and I'm just daydreaming?

I've been thinking we could also have a pseudo-workqueue directory in
/sys/devices/virtual/workqueue/unbounds with only cpumask as a file.

Writing to it would set all unbound workqueue affinity, at least those
that don't have WQ_SYSFS.

This would solve the ABI issue and keep a single consistent interface
for workqueue affinity.
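
A very rough sketch of that "unbounds" idea, assuming it would live in
kernel/workqueue.c next to the existing WQ_SYSFS code; the helper name is
made up, and locking and error handling are deliberately elided:

/* Hypothetical: apply one cpumask to every unbound workqueue that does
 * not already expose its own WQ_SYSFS cpumask file. */
static void unbounds_set_cpumask(const struct cpumask *cpumask)
{
	struct workqueue_struct *wq;
	struct workqueue_attrs *attrs;

	list_for_each_entry(wq, &workqueues, list) {
		if (!(wq->flags & WQ_UNBOUND) || (wq->flags & WQ_SYSFS))
			continue;

		attrs = alloc_workqueue_attrs(GFP_KERNEL);
		if (!attrs)
			break;
		copy_workqueue_attrs(attrs, wq->unbound_attrs);
		cpumask_copy(attrs->cpumask, cpumask);
		apply_workqueue_attrs(wq, attrs);
		free_workqueue_attrs(attrs);
	}
}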

>
> At least for NO_HZ_FULL, maybe this should be automatic. The cpumask of
> unbound workqueues should default to !tick_nohz_full_mask? Any WQ_SYSFS
> workqueues could still be overridden from userspace, but at least the
> default would be sane, and help keep full dyntics CPUs isolated.
>
> Example patch below, only boot tested on 4-CPU ARM system with
> CONFIG_NO_HZ_FULL_ALL=y and verified that 'cat
> /sys/devices/virtual/workqueue/writeback/cpumask' looked sane. If this
> looks OK, I can maybe clean it up a bit and make it runtime check
> instead of a compile time check.

It can work too yeah. Maybe I prefer the idea of keeping a sysfs interface
for all workqueues (whether we use a pseudo "unbounds" dir or not) because then
the workqueue core stays unaware of dynticks details and it doesn't end up
fiddling with timers core internals like the full dynticks cpumask. The result
is more interdependency and possible headaches between timers and workqueue init
ordering.

And moreover, people may forget to change WQ_SYSFS workqueues if all other
UNBOUND workqueues are known to be automatically handled. Or do we handle
WQ_SYSFS as well along the way? But still, WQ_SYSFS cpumasks may be modified
by other user programs, so it's still a round of setup that must be done
before doing any isolation work.

So I have mixed feelings between code complexity, simplicity for users, etc...

What do you guys think?

2014-02-27 15:24:07

by Mike Galbraith

Subject: Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue

On Thu, 2014-02-27 at 15:43 +0100, Frederic Weisbecker wrote:
> On Mon, Feb 17, 2014 at 06:26:41AM +0100, Mike Galbraith wrote:
> > On Wed, 2014-02-12 at 19:23 +0100, Frederic Weisbecker wrote:
> > > On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > > > Acked-by: Lai Jiangshan <[email protected]>
> > > >
> > > > Thank you all, queued for 3.15.
> > > >
> > > > We should also have some facility for moving the SRCU workqueues to
> > > > housekeeping/timekeeping kthreads in the NO_HZ_FULL case. Or does
> > > > this patch already have that effect?
> > >
> > > Kevin Hilman and I plan to try to bring a new Kconfig option that could let
> > > us control the unbound workqueues affinity through sysfs.
> >
> > Handing control to the user seemed like a fine thing, so I started
> > making a boot option to enable it. Forcing WQ_SYSFS on at sysfs
> > decision spot doesn't go well, init order matters :) Post init frobbing
> > required if you want to see/frob all unbound.
>
> I'm curious about the details. Is that because some workqueues are registered
> before sysfs is even initialized?

Yeah. I put in a test, and told it if not ready, go get ready and flag
yourself as having BTDT (one registration being plenty), but that only
made a different explosion. Post-init rescan is definitely a better
plan, it can't be worse :)

-Mike