The infrastructure in the RCU and housekeeping code is finally ready
for this. Now it's time to introduce CPU isolation feature files
to cpusets. Here is the first one.
RCU NOCB is an RCU feature that offloads the lifecycle handling and
execution of RCU callbacks from the enqueuing CPU's softirq to dedicated
kthreads instead (rcuo and rcuog). This pulls some kernel noise out of
CPUs that may run critical code. It is usually associated with
nohz_full.
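For context, a hedged sketch of what gets offloaded: any callback queued
with call_rcu() on a NOCB CPU is invoked by the rcuo kthreads instead of
that CPU's RCU softirq. The structure and function names below are made
up for illustration only:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;
};

/*
 * Invoked after a grace period. On an RCU NOCB CPU this callback is run
 * by an rcuo kthread; otherwise it runs from the enqueuing CPU's RCU
 * softirq context.
 */
static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_release(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_rcu);
}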
The list of CPUs to run in RCU NOCB mode is defined at boot time
through the "rcu_nocbs=" kernel parameter (e.g. rcu_nocbs=4-7) and can't
be changed afterward. This patchset aims at making this changeable at
runtime through cpuset.
I may have missed a few things in the last patch, such as partition type
changes to/from error mode; I'm not sure. Anyway, this is an RFC and it
doesn't yet provide documentation in this early posting.
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
cpuset/nocb
HEAD: ad895c0b6a5e3c41d46f184900d193e70bfc90d3
Thanks,
Frederic
---
Frederic Weisbecker (4):
rcu/nocb: Pass a cpumask instead of a single CPU to offload/deoffload
rcu/nocb: Prepare to change nocb cpumask from CPU-hotplug protected cpuset caller
sched/isolation: Infrastructure to support rcu nocb cpumask changes
cpuset: Support RCU-NOCB toggle on v2 root partitions
include/linux/rcupdate.h | 9 ++--
include/linux/sched/isolation.h | 13 +++++
kernel/cgroup/cpuset.c | 95 +++++++++++++++++++++++++++++++++--
kernel/rcu/rcutorture.c | 6 ++-
kernel/rcu/tree_nocb.h | 106 +++++++++++++++++++++++++++-------------
kernel/sched/isolation.c | 38 ++++++++++++++
6 files changed, 223 insertions(+), 44 deletions(-)
Provide a minimal infrastructure to change the housekeeping cpumasks.
For now only RCU NOCB cpumask is handled.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Nicolas Saenz Julienne <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/linux/sched/isolation.h | 13 +++++++++++
kernel/sched/isolation.c | 38 +++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 8c15abd67aed..c6d0e3f83a20 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -25,6 +25,8 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
+extern int housekeeping_cpumask_set(struct cpumask *cpumask, enum hk_type type);
+extern int housekeeping_cpumask_clear(struct cpumask *cpumask, enum hk_type type);
extern void __init housekeeping_init(void);
#else
@@ -46,6 +48,17 @@ static inline bool housekeeping_enabled(enum hk_type type)
static inline void housekeeping_affine(struct task_struct *t,
enum hk_type type) { }
+
+static inline int housekeeping_cpumask_set(struct cpumask *cpumask, enum hk_type type)
+{
+ return -EINVAL;
+}
+
+static inline int housekeeping_cpumask_clear(struct cpumask *cpumask, enum hk_type type)
+{
+ return -EINVAL;
+}
+
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c707bc..ab4aba795c01 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -79,6 +79,44 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
+static int housekeeping_cpumask_update(struct cpumask *cpumask,
+ enum hk_type type, bool on)
+{
+ int err;
+
+ switch (type) {
+ case HK_TYPE_RCU:
+ err = rcu_nocb_cpumask_update(cpumask, on);
+ break;
+ default:
+ err = -EINVAL;
+ }
+
+ if (err >= 0) {
+ if (on) {
+ cpumask_or(housekeeping.cpumasks[type],
+ housekeeping.cpumasks[type],
+ cpumask);
+ } else {
+ cpumask_andnot(housekeeping.cpumasks[type],
+ housekeeping.cpumasks[type],
+ cpumask);
+ }
+ }
+
+ return err;
+}
+
+int housekeeping_cpumask_set(struct cpumask *cpumask, enum hk_type type)
+{
+ return housekeeping_cpumask_update(cpumask, type, true);
+}
+
+int housekeeping_cpumask_clear(struct cpumask *cpumask, enum hk_type type)
+{
+ return housekeeping_cpumask_update(cpumask, type, false);
+}
+
void __init housekeeping_init(void)
{
enum hk_type type;
--
2.25.1
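As a usage illustration of the API above (a sketch only, not part of the
series): a CPU-hotplug protected caller such as cpuset could flip a set
of CPUs in or out of RCU NOCB mode roughly as below. The helper name is
made up for illustration; only housekeeping_cpumask_set()/clear() and
HK_TYPE_RCU come from the patch.

#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/*
 * Hypothetical helper: request RCU NOCB offloading for @cpus when @on is
 * true, or switch them back to softirq callback processing otherwise.
 * The caller is expected to be CPU-hotplug protected, as arranged for
 * the cpuset caller in patch 2 of this series.
 */
static int example_toggle_rcu_nocb(struct cpumask *cpus, bool on)
{
	if (on)
		return housekeeping_cpumask_set(cpus, HK_TYPE_RCU);

	return housekeeping_cpumask_clear(cpus, HK_TYPE_RCU);
}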
Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
directory which allows a set of CPUs to either enable ("1") or
disable ("0") RCU callback offloading (aka. RCU NOCB). This can
override the boot settings from the "rcu_nocbs=" kernel parameter.
The file is only writeable on "root" type partitions to exclude any
overlap. The deepest root type partition has the highest priority.
This means that given the following setting:
Top cpuset (CPUs: 0-7)
cpuset.isolation.rcu_nocb = 0
|
|
Subdirectory A (CPUs: 5-7)
cpuset.cpus.partition = root
cpuset.isolation.rcu_nocb = 0
|
|
Subdirectory B (CPUs: 7)
cpuset.cpus.partition = root
cpuset.isolation.rcu_nocb = 1
the result is that only CPU 7 is in rcu_nocb mode.
Note that the "rcu_nocbs" kernel parameter must be passed at boot, even
without a cpulist, so that NOCB support is enabled.
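To illustrate the priority rule only, here is a tiny userspace C model
(plain bitmasks; not cpuset code, and the main() is made up) of the
top-down resolution above, where the deepest root partition's setting is
applied last and therefore wins:

#include <stdio.h>

int main(void)
{
	/* Bit n represents CPU n. Top cpuset: CPUs 0-7, rcu_nocb = 0. */
	unsigned long nocb_cpus = 0x00UL;

	/* Subdirectory A: root partition, CPUs 5-7, rcu_nocb = 0. */
	nocb_cpus &= ~0xe0UL;

	/* Subdirectory B (deeper): root partition, CPU 7, rcu_nocb = 1. */
	nocb_cpus |= 0x80UL;

	/* Prints 0x80: only CPU 7 ends up in rcu_nocb mode. */
	printf("rcu_nocb CPUs: 0x%lx\n", nocb_cpus);
	return 0;
}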
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Nicolas Saenz Julienne <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/cgroup/cpuset.c | 95 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 92 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9390bfd9f1cd..2d9f019bb590 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -225,6 +225,7 @@ typedef enum {
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_RCU_NOCB,
} cpuset_flagbits_t;
/* convenient tests for these bits */
@@ -268,6 +269,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
+static inline int is_rcu_nocb(const struct cpuset *cs)
+{
+ return test_bit(CS_RCU_NOCB, &cs->flags);
+}
+
static inline int is_partition_root(const struct cpuset *cs)
{
return cs->partition_root_state > 0;
@@ -590,6 +596,62 @@ static inline void free_cpuset(struct cpuset *cs)
kfree(cs);
}
+#ifdef CONFIG_RCU_NOCB_CPU
+static int cpuset_rcu_nocb_apply(struct cpuset *root)
+{
+ int err;
+
+ if (is_rcu_nocb(root))
+ err = housekeeping_cpumask_set(root->effective_cpus, HK_TYPE_RCU);
+ else
+ err = housekeeping_cpumask_clear(root->effective_cpus, HK_TYPE_RCU);
+
+ return err;
+}
+
+static int cpuset_rcu_nocb_update(struct cpuset *cur, struct cpuset *trialcs)
+{
+ struct cgroup_subsys_state *des_css;
+ struct cpuset *des;
+ int err;
+
+ if (cur->partition_root_state != PRS_ENABLED)
+ return -EINVAL;
+
+ err = cpuset_rcu_nocb_apply(trialcs);
+ if (err < 0)
+ return err;
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(des, des_css, cur) {
+ if (des == cur)
+ continue;
+ if (des->partition_root_state == PRS_ENABLED)
+ break;
+ spin_lock_irq(&callback_lock);
+ if (is_rcu_nocb(trialcs))
+ set_bit(CS_RCU_NOCB, &des->flags);
+ else
+ clear_bit(CS_RCU_NOCB, &des->flags);
+ spin_unlock_irq(&callback_lock);
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+#else
+static inline int cpuset_rcu_nocb_apply(struct cpuset *root)
+{
+ return 0;
+}
+
+static inline int cpuset_rcu_nocb_update(struct cpuset *cur,
+ struct cpuset *trialcs)
+{
+ return 0;
+}
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+
/*
* validate_change_legacy() - Validate conditions specific to legacy (v1)
* behavior.
@@ -1655,6 +1717,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (cs->partition_root_state) {
struct cpuset *parent = parent_cs(cs);
+ WARN_ON_ONCE(cpuset_rcu_nocb_apply(parent) < 0);
+ WARN_ON_ONCE(cpuset_rcu_nocb_apply(cs) < 0);
+
/*
* For partition root, update the cpumasks of sibling
* cpusets if they use parent's effective_cpus.
@@ -2012,6 +2077,12 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|| (is_spread_page(cs) != is_spread_page(trialcs)));
+ if (is_rcu_nocb(cs) != is_rcu_nocb(trialcs)) {
+ err = cpuset_rcu_nocb_update(cs, trialcs);
+ if (err < 0)
+ goto out;
+ }
+
spin_lock_irq(&callback_lock);
cs->flags = trialcs->flags;
spin_unlock_irq(&callback_lock);
@@ -2365,6 +2436,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_RCU_NOCB,
} cpuset_filetype_t;
static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -2406,6 +2478,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
case FILE_SPREAD_SLAB:
retval = update_flag(CS_SPREAD_SLAB, cs, val);
break;
+ case FILE_RCU_NOCB:
+ retval = update_flag(CS_RCU_NOCB, cs, val);
+ break;
default:
retval = -EINVAL;
break;
@@ -2573,6 +2648,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
return is_spread_page(cs);
case FILE_SPREAD_SLAB:
return is_spread_slab(cs);
+ case FILE_RCU_NOCB:
+ return is_rcu_nocb(cs);
default:
BUG();
}
@@ -2803,7 +2880,14 @@ static struct cftype dfl_files[] = {
.private = FILE_SUBPARTS_CPULIST,
.flags = CFTYPE_DEBUG,
},
-
+#ifdef CONFIG_RCU_NOCB_CPU
+ {
+ .name = "isolation.rcu_nocb",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_RCU_NOCB,
+ },
+#endif
{ } /* terminate */
};
@@ -2861,6 +2945,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
+ if (is_rcu_nocb(parent))
+ set_bit(CS_RCU_NOCB, &cs->flags);
cpuset_inc();
@@ -3227,12 +3313,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
if (mems_updated)
check_insane_mems_config(&new_mems);
- if (is_in_v2_mode())
+ if (is_in_v2_mode()) {
hotplug_update_tasks(cs, &new_cpus, &new_mems,
cpus_updated, mems_updated);
- else
+ if (cpus_updated)
+ WARN_ON_ONCE(cpuset_rcu_nocb_apply(cs) < 0);
+ } else {
hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
cpus_updated, mems_updated);
+ }
percpu_up_write(&cpuset_rwsem);
}
--
2.25.1
On Thu, May 26, 2022 at 12:10:55AM +0200, Frederic Weisbecker wrote:
> Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
> directory which provides support for a set of CPUs to either enable ("1")
> or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
> overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.
>
> The file is only writeable on "root" type partitions to exclude any
> overlap. The deepest root type partition has the highest priority.
> This means that given the following setting:
>
> Top cpuset (CPUs: 0-7)
> cpuset.isolation.rcu_nocb = 0
> |
> |
> Subdirectory A (CPUs: 5-7)
> cpuset.cpus.partition = root
> cpuset.isolation.rcu_nocb = 0
> |
> |
> Subdirectory B (CPUs: 7)
> cpuset.cpus.partition = root
> cpuset.isolation.rcu_nocb = 1
>
> the result is that only CPU 7 is in rcu_nocb mode.
>
> Note that "rcu_nocbs" kernel parameter must be passed on boot, even
> without a cpulist, so that nocb support is enabled.
Does it even make sense to make this hierarchical? What's wrong with a
cpumask under sys/ or proc/?
Thanks.
--
tejun
On Fri, May 27, 2022 at 12:51:41AM +0200, Frederic Weisbecker wrote:
> > Does it even make sense to make this hierarchical? What's wrong with a
> > cpumask under sys/ or proc/?
>
> I'm usually told that cpusets is the current place where CPU attributes are
> supposed to go. I personally don't mind much /sys either even though cpusets
> looks like a more flexible way to partition CPUs with properties and tasks
> placement altogether...
Yeah, I mean, if it's hierarchical, it's the right place but I have a hard
time seeing anything hierarchical with this one. Somebody just has to know
which cpus are up for rcu processing and which aren't. Waiman, what do you
think?
Thanks.
--
tejun
Hello,
On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> FWIW, I was under the impression that this would nicely fit along the
> side of other feaures towards implenting dynamic isolation of CPUs (say
> https://lore.kernel.org/lkml/[email protected]/
> for example). Wouldn't be awkward to have to poke different places to
> achieve isolation at runtime?
So, if it were just part of the isolated domain thing, it would make
sense, but as a separate flag which isn't hierarchical, it's weird to put it
there.
> Also, I wonder if a proc/sys interface might be problematic for certain
> middleware that is substantially based on using cgroups. I'll try to ask
> around. :)
There is a downside to making a feature a part of cpuset in that it makes
cgroup usage mandatory. This is fine for something which benefits from
hierarchical organization but it is weird to require building a cgroup
hierarchy for straightforward system-wide features.
Thanks.
--
tejun
On 5/26/22 19:02, Tejun Heo wrote:
> On Fri, May 27, 2022 at 12:51:41AM +0200, Frederic Weisbecker wrote:
>>> Does it even make sense to make this hierarchical? What's wrong with a
>>> cpumask under sys/ or proc/?
>> I'm usually told that cpusets is the current place where CPU attributes are
>> supposed to go. I personally don't mind much /sys either even though cpusets
>> looks like a more flexible way to partition CPUs with properties and tasks
>> placement altogether...
> Yeah, I mean, if it's hierarchical, it's the right place but I have a hard
> time seeing anything hierarchical with this one. Somebody just has to know
> which cpus are up for rcu processing and which aren't. Waiman, what do you
> think?
I am thinking along the lines that it will not be hierarchical. However,
cpuset can be useful if we want to have multiple isolated partitions
underneath the top cpuset with different isolation attributes, but no
further sub-isolated partitions with sub-attributes underneath them. IOW, we
can only set them at the first level under top_cpuset. Will that be useful?
Cheers,
Longman
Hi,
On 26/05/22 14:37, Tejun Heo wrote:
> On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > I am thinking along the line that it will not be hierarchical. However,
> > cpuset can be useful if we want to have multiple isolated partitions
> > underneath the top cpuset with different isolation attributes, but no more
> > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > set them at the first level under top_cpuset. Will that be useful?
>
> At that point, I'd just prefer to have it under /proc or /sys.
FWIW, I was under the impression that this would nicely fit alongside
other features towards implementing dynamic isolation of CPUs (say
https://lore.kernel.org/lkml/[email protected]/
for example). Wouldn't it be awkward to have to poke different places to
achieve isolation at runtime?
Also, I wonder if a proc/sys interface might be problematic for certain
middleware that is substantially based on using cgroups. I'll try to ask
around. :)
Best,
Juri
On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> I am thinking along the line that it will not be hierarchical. However,
> cpuset can be useful if we want to have multiple isolated partitions
> underneath the top cpuset with different isolation attributes, but no more
> sub-isolated partition with sub-attributes underneath them. IOW, we can only
> set them at the first level under top_cpuset. Will that be useful?
At that point, I'd just prefer to have it under /proc or /sys.
Thanks.
--
tejun
On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> Hi,
>
> On 26/05/22 14:37, Tejun Heo wrote:
> > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > I am thinking along the line that it will not be hierarchical. However,
> > > cpuset can be useful if we want to have multiple isolated partitions
> > > underneath the top cpuset with different isolation attributes, but no more
> > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > set them at the first level under top_cpuset. Will that be useful?
> >
> > At that point, I'd just prefer to have it under /proc or /sys.
>
> FWIW, I was under the impression that this would nicely fit along the
> side of other feaures towards implenting dynamic isolation of CPUs (say
> https://lore.kernel.org/lkml/[email protected]/
> for example). Wouldn't be awkward to have to poke different places to
> achieve isolation at runtime?
This, that's what I was thinking.
My main objection to the whole thing is that it's an RCU_NOCB specific
interface. *That* I think is daft.
I was thinking a partition would be able to designate a house-keeping
sub-partition/mask, but who cares about all the various different
housekeeping parties.
Hi,
On Thu, May 26, 2022 at 10:45:00PM -1000 Tejun Heo wrote:
> Hello,
>
> On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > FWIW, I was under the impression that this would nicely fit along the
> > side of other feaures towards implenting dynamic isolation of CPUs (say
> > https://lore.kernel.org/lkml/[email protected]/
> > for example). Wouldn't be awkward to have to poke different places to
> > achieve isolation at runtime?
>
> So, it were just being part of the isolated domain thing, it would make
> sense, but as a separate flag which isn't hierarchical, it's weird to put it
> there.
The way I see it is more that the "isolated domain thing" is one part of
this whole dynamic isolation thing and is just one flag among many (most
still on the drawing board, but planned). It may be that Waiman's "isolated"
should be renamed "no_load_balance" or something.
Part of this is making cpu isolation more granular.
>
> > Also, I wonder if a proc/sys interface might be problematic for certain
> > middleware that is substantially based on using cgroups. I'll try to ask
> > around. :)
>
> There is a downside to making a feature a part of cpuset in that it makes
> cgroup usage mandatory. This is fine for something which benefits from
> hierarchical organization but it is weird to require building cgroup
> hierarchy for straight-forward system-wide features.
>
That ship may have sailed when SD_LOAD_BALANCE was removed, which is part
of what Waiman's feature addresses. That is, now in order to get control
over the system-wide feature of which CPUs get scheduler load balanced you
need to use cpusets.
My 3 cents anyway (inflation ;)
Cheers,
Phil
> Thanks.
>
> --
> tejun
>
--
On Thu, May 26, 2022 at 08:21:13AM -1000, Tejun Heo wrote:
> On Thu, May 26, 2022 at 12:10:55AM +0200, Frederic Weisbecker wrote:
> > Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
> > directory which provides support for a set of CPUs to either enable ("1")
> > or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
> > overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.
> >
> > The file is only writeable on "root" type partitions to exclude any
> > overlap. The deepest root type partition has the highest priority.
> > This means that given the following setting:
> >
> > Top cpuset (CPUs: 0-7)
> > cpuset.isolation.rcu_nocb = 0
> > |
> > |
> > Subdirectory A (CPUs: 5-7)
> > cpuset.cpus.partition = root
> > cpuset.isolation.rcu_nocb = 0
> > |
> > |
> > Subdirectory B (CPUs: 7)
> > cpuset.cpus.partition = root
> > cpuset.isolation.rcu_nocb = 1
> >
> > the result is that only CPU 7 is in rcu_nocb mode.
> >
> > Note that "rcu_nocbs" kernel parameter must be passed on boot, even
> > without a cpulist, so that nocb support is enabled.
>
> Does it even make sense to make this hierarchical? What's wrong with a
> cpumask under sys/ or proc/?
I'm usually told that cpusets is the current place where CPU attributes are
supposed to go. I personally don't mind much /sys either even though cpusets
looks like a more flexible way to partition CPUs with properties and tasks
placement altogether...
On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > Hi,
> >
> > On 26/05/22 14:37, Tejun Heo wrote:
> > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > I am thinking along the line that it will not be hierarchical. However,
> > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > underneath the top cpuset with different isolation attributes, but no more
> > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > set them at the first level under top_cpuset. Will that be useful?
> > >
> > > At that point, I'd just prefer to have it under /proc or /sys.
> >
> > FWIW, I was under the impression that this would nicely fit along the
> > side of other feaures towards implenting dynamic isolation of CPUs (say
> > https://lore.kernel.org/lkml/[email protected]/
> > for example). Wouldn't be awkward to have to poke different places to
> > achieve isolation at runtime?
>
> This, that's what I was thinking.
>
> My main objection to the whole thing is that it's an RCU_NOCB specific
> interface. *That* I think is daft.
>
> I was thinking a partition would be able to designate a house-keeping
> sub-partition/mask, but who cares about all the various different
> housekeeping parties.
It's time for the isolation users to step up here! I very rarely hear from them
and I just can't figure out by myself all the variants of uses for each of the
isolation features. Maybe some people are only interested in nocb for some
specific uses, or maybe it never makes sense without nohz_full and all the rest
of the isolation features. So for now I take the very cautious path of splitting
the interface.
Thanks.
On Mon, May 30, 2022 at 10:11:41AM +0200, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 02:40:49AM +0200, Frederic Weisbecker wrote:
> > On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > > Hi,
> > > >
> > > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > >
> > > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > >
> > > > FWIW, I was under the impression that this would nicely fit along the
> > > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > > https://lore.kernel.org/lkml/[email protected]/
> > > > for example). Wouldn't be awkward to have to poke different places to
> > > > achieve isolation at runtime?
> > >
> > > This, that's what I was thinking.
> > >
> > > My main objection to the whole thing is that it's an RCU_NOCB specific
> > > interface. *That* I think is daft.
> > >
> > > I was thinking a partition would be able to designate a house-keeping
> > > sub-partition/mask, but who cares about all the various different
> > > housekeeping parties.
> >
> > It's time for the isolation users to step up here! I very rarely hear from them
> > and I just can't figure out by myself all the variants of uses for each of the
> > isolation features. May be some people are only interested in nocb for some
> > specific uses, or may be it never makes sense without nohz full and all the rest
> > of the isolation features. So for now I take the very cautious path to split the
> > interface.
>
> This is ABI, you can't walk back on it. I would suggest starting with an
> 'all feature' isolation. Only if there's real demand for something more
> fine-grained add that on top. Simple first etc.
That's actually my worry. If we start with an all-in-one ABI, how do we later
mix that up with more fine-grained features? Like what will be the behaviour of:
cpuset.isolation.rcu_nocb = 0
cpuset.isolation.all = 1
On Mon, May 30, 2022 at 02:40:49AM +0200, Frederic Weisbecker wrote:
> On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > Hi,
> > >
> > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > set them at the first level under top_cpuset. Will that be useful?
> > > >
> > > > At that point, I'd just prefer to have it under /proc or /sys.
> > >
> > > FWIW, I was under the impression that this would nicely fit along the
> > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > https://lore.kernel.org/lkml/[email protected]/
> > > for example). Wouldn't be awkward to have to poke different places to
> > > achieve isolation at runtime?
> >
> > This, that's what I was thinking.
> >
> > My main objection to the whole thing is that it's an RCU_NOCB specific
> > interface. *That* I think is daft.
> >
> > I was thinking a partition would be able to designate a house-keeping
> > sub-partition/mask, but who cares about all the various different
> > housekeeping parties.
>
> It's time for the isolation users to step up here! I very rarely hear from them
> and I just can't figure out by myself all the variants of uses for each of the
> isolation features. May be some people are only interested in nocb for some
> specific uses, or may be it never makes sense without nohz full and all the rest
> of the isolation features. So for now I take the very cautious path to split the
> interface.
This is ABI, you can't walk back on it. I would suggest starting with an
'all feature' isolation. Only if there's real demand for something more
fine-grained add that on top. Simple first etc.
On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
> > This is ABI, you can't walk back on it. I would suggest starting with an
> > 'all feature' isolation. Only if there's real demand for something more
> > fine-grained add that on top. Simple first etc.
>
> That's actually my worry. If we start with an all in one ABI, how do we later
> mix that up with more finegrained features? Like what will be the behaviour of:
>
> cpuset.isolation.rcu_nocb = 0
> cpuset.isolation.all = 1
Well clearly that doesn't make sense. I was more thinking along the
lines of cgroup.subtree_control, where instead all features are enabled
by default.
But only if there's a real usecase, otherwise there's no point in
providing such knobs.
On 30/05/22 15:16, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
>
> > > This is ABI, you can't walk back on it. I would suggest starting with an
> > > 'all feature' isolation. Only if there's real demand for something more
> > > fine-grained add that on top. Simple first etc.
> >
> > That's actually my worry. If we start with an all in one ABI, how do we later
> > mix that up with more finegrained features? Like what will be the behaviour of:
> >
> > cpuset.isolation.rcu_nocb = 0
> > cpuset.isolation.all = 1
>
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.
An all-features on/off knob plus a house-keeping sub-partition/mask seems to be
what the isolation users I could reach so far (OCP mostly) would indeed like
to have in the future.
On Mon, May 30, 2022 at 03:16:27PM +0200, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
>
> > > This is ABI, you can't walk back on it. I would suggest starting with an
> > > 'all feature' isolation. Only if there's real demand for something more
> > > fine-grained add that on top. Simple first etc.
> >
> > That's actually my worry. If we start with an all in one ABI, how do we later
> > mix that up with more finegrained features? Like what will be the behaviour of:
> >
> > cpuset.isolation.rcu_nocb = 0
> > cpuset.isolation.all = 1
>
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.
That makes sense. So there would be a simple cpuset.isolation that can
be either 1 or 0 where 1 has all possible isolation stuff on. Then
if the need arises we can provide more tuning through a new specific
cgroup controller, right?
If so that sounds good to me.
On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > Hi,
> > >
> > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > set them at the first level under top_cpuset. Will that be useful?
> > > >
> > > > At that point, I'd just prefer to have it under /proc or /sys.
> > >
> > > FWIW, I was under the impression that this would nicely fit along the
> > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > https://lore.kernel.org/lkml/[email protected]/
> > > for example). Wouldn't be awkward to have to poke different places to
> > > achieve isolation at runtime?
> >
> > This, that's what I was thinking.
> >
> > My main objection to the whole thing is that it's an RCU_NOCB specific
> > interface. *That* I think is daft.
> >
> > I was thinking a partition would be able to designate a house-keeping
> > sub-partition/mask, but who cares about all the various different
> > housekeeping parties.
>
> It's time for the isolation users to step up here! I very rarely hear from them
> and I just can't figure out by myself all the variants of uses for each of the
> isolation features. May be some people are only interested in nocb for some
> specific uses, or may be it never makes sense without nohz full and all the rest
> of the isolation features. So for now I take the very cautious path to split the
> interface.
OK, my 2 cents. I personally deal with virtualisation setups that involve RT
and CPU isolation on both host and guests.
The main use-case ATM is running DPDK-like workloads. We want to achieve
latencies in the order of tens of microseconds, so it's essential to avoid
entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
Everything is deferred onto housekeeping CPUs or disabled.
Then we have setups that need to deal with HW on the host, exposed to the guest
through emulation or VirtIO. The same rules apply really, except for some IRQ
affinity tweaks and sched priority magic.
I find it hard to see how running RCU callbacks locally could be useful to any
latency sensitive workload.
Frederic, out of curiosity, do you have a use-case in mind that might benefit
from nohz_full but not rcu_nocb? Maybe HPC?
Regards,
Nicolas
On Mon, May 30, 2022 at 11:35:56PM +0200, Frederic Weisbecker wrote:
> That makes sense. So there would be a simple cpuset.isolation that can
> be either 1 or 0 where 1 has all possible isolation stuff on. Then
> if the need arises we can provide more tuning through a new specific
> cgroup controller, right?
Given that there isn't much that is hierarchical about them, I'm pretty
skeptical about introducing a new controller or fancy hierarchical interface
for it. If isolation is intertwined with cpuset partitioning and a simple
knob for it fits well with the rest of configuration, yeah, but let's please
try to avoid maximizing the interface. We want the interface to encode
users' intentions (e.g., here, I want these cpus isolated) not the
implementation details to make that happen. Of course, there are gradients
but it becomes really ugly when you try to expose low level details on
cgroups because of the implied flexibility (I can organize however I want
hierarchically and the controls must nest and be delegatable properly).
So, if you think the isolation feature will need lots of low-level knobs
exposed, cgroup isn't the right place. It should be something simpler and
lower level. This probably is a good time to spend some time thinking how
it'd look like, say, five years down the line. If it's gonna be the "I want
isolation" knob + maybe some obscure system wide knobs that most people
don't need to think about, it's gonna be fine. Otherwise, we shouldn't put
this in cgroup until we have better ideas on what the interface should look
like.
Thanks.
--
tejun
On 5/30/22 09:16, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
>
>>> This is ABI, you can't walk back on it. I would suggest starting with an
>>> 'all feature' isolation. Only if there's real demand for something more
>>> fine-grained add that on top. Simple first etc.
>> That's actually my worry. If we start with an all in one ABI, how do we later
>> mix that up with more finegrained features? Like what will be the behaviour of:
>>
>> cpuset.isolation.rcu_nocb = 0
>> cpuset.isolation.all = 1
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.
I am actually thinking about extending the cpuset partition interface
for isolation. Right now, I have an outstanding patch [1] to add an
"isolated" state to partitions which disables load balancing, somewhat
similar to the isolcpus command line option. In the future, we can add
attributes to the isolation state, like "isolated:full", similar to
nohz_full currently. If the need arises, we can even extend the
attribute to allow a list like "isolated:rcu_nocbs". I don't think it is
a good idea to keep on adding new cpuset control files extensively. I
would prefer extending the existing ones.
[1] https://lore.kernel.org/lkml/[email protected]/
Cheers,
Longman
> On Mon, May 30, 2022 at 04:29:56PM +0200, nicolas saenz julienne wrote:
> > On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> > > On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > > > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > > > Hi,
> > > > >
> > > > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > > >
> > > > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > > >
> > > > > FWIW, I was under the impression that this would nicely fit along the
> > > > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > > > https://lore.kernel.org/lkml/[email protected]/
> > > > > for example). Wouldn't be awkward to have to poke different places to
> > > > > achieve isolation at runtime?
> > > >
> > > > This, that's what I was thinking.
> > > >
> > > > My main objection to the whole thing is that it's an RCU_NOCB specific
> > > > interface. *That* I think is daft.
> > > >
> > > > I was thinking a partition would be able to designate a house-keeping
> > > > sub-partition/mask, but who cares about all the various different
> > > > housekeeping parties.
> > >
> > > It's time for the isolation users to step up here! I very rarely hear from them
> > > and I just can't figure out by myself all the variants of uses for each of the
> > > isolation features. May be some people are only interested in nocb for some
> > > specific uses, or may be it never makes sense without nohz full and all the rest
> > > of the isolation features. So for now I take the very cautious path to split the
> > > interface.
> >
> > OK, my 2 cents. I personally deal with virtualisation setups that involve RT
> > and CPU isolation on both host and guests.
> >
> > The main use-case ATM is running DPDK-like workloads. We want to achieve
> > latencies in the order of tens of microseconds, so it's essential to avoid
> > entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
> > callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
> > Everything is deferred onto housekeeping CPUs or disabled.
> >
> > Then we have setups that need to deal with HW on the host, exposed to the guest
> > through emulation or VirtIO. The same rules apply really, except for some IRQ
> > affinity tweaks and sched priority magic.
> >
> > I find it hard to see how running RCU callback locally could be useful to any
> > latency sensitive workload.
> >
> > Frederic, out of curiosity, do you have a use-case in mind that might benefit
> > from nohz_full but not rcu_nocb? Maybe HPC?
On Mon, May 30, 2022 at 8:42 AM Paul E. McKenney <[email protected]> wrote:
> Would users looking for millisecond-scale latencies want rcu_nocbs but
> not nohz_full, that is, the other way around?
On Intel processors running 5.15 with the timersd patches from Siewior
backported and Weisbecker's bug-fix, choosing CONFIG_HZ_PERIODIC
prevents cores from entering deeper C-states when hrtimers are
pending. With CONFIG_NO_HZ_COMMON, the cores do not service
non-blocking timer callbacks until another thread wakes them. Since
low latency is critical, this is a use case where NO_HZ_FULL will not
work but rcu_nocbs is needed.
-- Alison Chaiken, Aurora Innovation
On Mon, May 30, 2022 at 04:29:56PM +0200, nicolas saenz julienne wrote:
> On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> > On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > > Hi,
> > > >
> > > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > >
> > > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > >
> > > > FWIW, I was under the impression that this would nicely fit along the
> > > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > > https://lore.kernel.org/lkml/[email protected]/
> > > > for example). Wouldn't be awkward to have to poke different places to
> > > > achieve isolation at runtime?
> > >
> > > This, that's what I was thinking.
> > >
> > > My main objection to the whole thing is that it's an RCU_NOCB specific
> > > interface. *That* I think is daft.
> > >
> > > I was thinking a partition would be able to designate a house-keeping
> > > sub-partition/mask, but who cares about all the various different
> > > housekeeping parties.
> >
> > It's time for the isolation users to step up here! I very rarely hear from them
> > and I just can't figure out by myself all the variants of uses for each of the
> > isolation features. May be some people are only interested in nocb for some
> > specific uses, or may be it never makes sense without nohz full and all the rest
> > of the isolation features. So for now I take the very cautious path to split the
> > interface.
>
> OK, my 2 cents. I personally deal with virtualisation setups that involve RT
> and CPU isolation on both host and guests.
>
> The main use-case ATM is running DPDK-like workloads. We want to achieve
> latencies in the order of tens of microseconds, so it's essential to avoid
> entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
> callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
> Everything is deferred onto housekeeping CPUs or disabled.
>
> Then we have setups that need to deal with HW on the host, exposed to the guest
> through emulation or VirtIO. The same rules apply really, except for some IRQ
> affinity tweaks and sched priority magic.
>
> I find it hard to see how running RCU callback locally could be useful to any
> latency sensitive workload.
>
> Frederic, out of curiosity, do you have a use-case in mind that might benefit
> from nohz_full but not rcu_nocb? Maybe HPC?
Would users looking for millisecond-scale latencies want rcu_nocbs but
not nohz_full, that is, the other way around?
Thanx, Paul
On 2022-05-26 00:10, Frederic Weisbecker wrote:
> Provide a minimal infrastructure to change the housekeeping cpumasks.
> For now only RCU NOCB cpumask is handled.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Zefan Li <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Phil Auld <[email protected]>
> Cc: Nicolas Saenz Julienne <[email protected]>
> Cc: Marcelo Tosatti <[email protected]>
> Cc: Paul Gortmaker <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> include/linux/sched/isolation.h | 13 +++++++++++
> kernel/sched/isolation.c | 38 +++++++++++++++++++++++++++++++++
> 2 files changed, 51 insertions(+)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 8c15abd67aed..c6d0e3f83a20 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -25,6 +25,8 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
> extern bool housekeeping_enabled(enum hk_type type);
> extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
> extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
> +extern int housekeeping_cpumask_set(struct cpumask *cpumask, enum hk_type type);
> +extern int housekeeping_cpumask_clear(struct cpumask *cpumask, enum hk_type type);
> extern void __init housekeeping_init(void);
>
> #else
> @@ -46,6 +48,17 @@ static inline bool housekeeping_enabled(enum hk_type type)
>
> static inline void housekeeping_affine(struct task_struct *t,
> enum hk_type type) { }
> +
> +static inline int housekeeping_cpumask_set(struct cpumask *cpumask, enum hk_type type)
> +{
> + return -EINVAL;
> +}
> +
> +static inline int housekeeping_cpumask_clear(struct cpumask *cpumask, enum hk_type type)
> +{
> + return -EINVAL;
> +}
> +
> static inline void housekeeping_init(void) { }
> #endif /* CONFIG_CPU_ISOLATION */
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..ab4aba795c01 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -79,6 +79,44 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
> }
> EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
>
> +static int housekeeping_cpumask_update(struct cpumask *cpumask,
> + enum hk_type type, bool on)
> +{
> + int err;
> +
> + switch (type) {
> + case HK_TYPE_RCU:
> + err = rcu_nocb_cpumask_update(cpumask, on);
> + break;
> + default:
> + err = -EINVAL;
> + }
> +
> + if (err >= 0) {
> + if (on) {
> + cpumask_or(housekeeping.cpumasks[type],
> + housekeeping.cpumasks[type],
> + cpumask);
> + } else {
> + cpumask_andnot(housekeeping.cpumasks[type],
> + housekeeping.cpumasks[type],
> + cpumask);
> + }
> + }
> +
> + return err;
> +}
> +
> +int housekeeping_cpumask_set(struct cpumask *cpumask, enum hk_type type)
> +{
> + return housekeeping_cpumask_update(cpumask, type, true);
> +}
> +
> +int housekeeping_cpumask_clear(struct cpumask *cpumask, enum hk_type type)
> +{
> + return housekeeping_cpumask_update(cpumask, type, false);
> +}
> +
> void __init housekeeping_init(void)
> {
> enum hk_type type;
Just stumbled upon this patch.
I would be interested to have a way to fully isolate CPUs during runtime.
I tried some things similar to the patch above and the results looked
promising (removing certain CPUs from the housekeeping cpumasks during
runtime). Offlining them might be too expensive and also go a bit too far,
as I might want to be able to reactivate these CPUs quickly.
What kind of problems would you expect when making the housekeeping
masks editable? Not just rcu_nocb, but all of them.
--
Tobias